HBase Compaction and Data Locality With Hadoop

Compaction is the process by which HBase cleans itself, and data locality is about running each Mapper as close as possible to the data it reads.

By Jitendra Bafna · Mar. 02, 17 · Opinion

HBase is a distributed data store optimized for read performance. Reads are fastest when each column family is backed by a single file. Under heavy writes this is not always possible, because every memstore flush creates a new HFile. For this reason, HBase combines HFiles into fewer, larger HFiles to reduce the number of disk seeks a read needs. This process is known as compaction.

Compaction is a process by which HBase cleans itself. It comes in two flavors: minor compaction and major compaction.

Minor compaction is the process of combining a configurable number of smaller HFiles into one larger HFile. Minor compaction is very important because, without it, reading a particular row can require many disk reads, which reduces overall performance.
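The merge step of a minor compaction can be sketched in a few lines of Python. This is a toy model, not HBase code: each HFile is represented as a sorted list of (row key, timestamp, value) cells, and the selection rule used here (take the smallest files first) is an assumption made for illustration rather than HBase's actual file-selection policy. The names minor_compact and max_files are hypothetical.

```python
import heapq

# Each "HFile" is modeled as a list of (row_key, timestamp, value) cells,
# sorted by row key and then by newest timestamp first.
def sort_key(cell):
    row, ts, _ = cell
    return (row, -ts)

def minor_compact(hfiles, max_files=3):
    """Merge the `max_files` smallest HFiles into one and leave the rest alone.

    Picking the smallest files is a simplification for illustration only.
    """
    by_size = sorted(hfiles, key=len)
    picked, untouched = by_size[:max_files], by_size[max_files:]
    # heapq.merge performs a k-way merge of already-sorted inputs.
    merged = list(heapq.merge(*picked, key=sort_key))
    return untouched + [merged]

hfiles = [
    [("r1", 5, "a"), ("r3", 5, "c")],
    [("r2", 6, "b")],
    [("r1", 7, "a2")],
    [("r%d" % i, 1, "x") for i in range(4, 9)],  # a larger, older file
]
compacted = minor_compact(hfiles)
print(len(hfiles), "files before,", len(compacted), "files after")
```

Note that both versions of row r1 survive the merge. A minor compaction in this model only reduces the number of files a read has to consult.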

Major compaction is the process of combining all the StoreFiles of a store (one column family of a region) into a single StoreFile. It also removes deleted and expired versions. By default, major compaction runs on a timer (every 24 hours in older HBase releases; newer releases default to 7 days) and merges all StoreFiles into a single StoreFile. After compaction, if the new, larger StoreFile is greater than a certain size (defined by the hbase.hregion.max.filesize property), the region will split into new regions.
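The cleanup that major compaction performs can be illustrated with a simplified Python model. Everything here is an assumption made for the sketch: cells are (row key, timestamp, value) tuples, a single TOMBSTONE marker stands in for HBase's several kinds of delete markers, and ttl and max_versions are hypothetical parameters rather than real configuration names.

```python
import heapq

TOMBSTONE = object()  # hypothetical stand-in for an HBase delete marker

def cell_key(cell):
    row, ts, value = cell
    # Sort by row, newest first; a tombstone sorts before a put with the same timestamp.
    return (row, -ts, value is not TOMBSTONE)

def major_compact(store_files, now, ttl, max_versions=1):
    """Merge all StoreFiles into one, dropping deleted, expired, and excess versions."""
    out = []
    current_row, kept, deleted_at = None, 0, None
    for row, ts, value in heapq.merge(*store_files, key=cell_key):
        if row != current_row:
            current_row, kept, deleted_at = row, 0, None
        if value is TOMBSTONE:
            if deleted_at is None:
                deleted_at = ts  # newest tombstone for this row
            continue  # the marker itself is not written out
        if deleted_at is not None and ts <= deleted_at:
            continue  # masked by a delete
        if now - ts > ttl:
            continue  # expired
        if kept >= max_versions:
            continue  # older than the versions we retain
        out.append((row, ts, value))
        kept += 1
    return out

files = [
    [("r1", 90, "new"), ("r1", 50, "old"), ("r2", 10, "stale")],
    [("r3", 95, TOMBSTONE), ("r3", 80, "gone")],
    [("r3", 99, "reborn")],
]
result = major_compact(files, now=100, ttl=60)
print(result)
```

In this run the older version of r1 is dropped as an excess version, r2 is dropped as expired, and the r3 cell written before the delete is removed along with the tombstone, while the r3 cell written after the delete survives.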

Disable Automatic Major Compaction

Major compaction can be disabled by updating hbase-site.xml:

<property> 
  <name>hbase.hregion.majorcompaction</name> 
  <value>0</value> 
</property>

Decrease Maximum Region File Size

The maximum region file size, which is the threshold at which a region is split, can be decreased by updating hbase-site.xml:

<property> 
  <name>hbase.hregion.max.filesize</name> 
  <value>1073741824</value> 
</property>
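To make the value above concrete, 1073741824 bytes is exactly 1 GiB. The following sketch shows how a size threshold like this could drive a split decision. The function should_split is hypothetical, and the rule it applies (split when any store grows past the limit) is a simplified assumption; HBase's real split policies are configurable and can be more involved.

```python
MAX_FILESIZE = 1073741824  # value from the hbase-site.xml snippet, in bytes

def should_split(store_sizes_bytes, max_filesize=MAX_FILESIZE):
    """Return True when any store of the region has grown past the limit."""
    return any(size > max_filesize for size in store_sizes_bytes)

GIB = 1024 ** 3
MIB = 1024 ** 2

print(MAX_FILESIZE == GIB)
print(should_split([512 * MIB, 300 * MIB]))
print(should_split([512 * MIB, 2 * GIB]))
```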
  • Major compaction is a heavyweight operation, so run it when your cluster load is low.

  • Major compaction is not just about compacting files. When a record is deleted or a version expires, the stale data still has to be cleaned up, and major compaction is what performs that cleanup.

  • When time-based major compaction is enabled, set hbase.hregion.majorcompaction.jitter so that major compaction doesn't run on all the nodes at the same time.
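The effect of the jitter setting can be sketched as follows. The jitter is a fraction of the compaction interval, and each server ends up compacting somewhere within that fraction on either side of the interval. The function name, the use of a uniform random offset, and the example values (a 24-hour interval, a jitter of 0.5) are assumptions for illustration; HBase computes the actual offset internally.

```python
import random

def next_major_compaction_delay(interval_ms, jitter, rng=random):
    """Return a delay in [interval * (1 - jitter), interval * (1 + jitter)]."""
    spread = interval_ms * jitter
    return interval_ms + rng.uniform(-spread, spread)

interval = 24 * 60 * 60 * 1000  # example interval: 24 hours in milliseconds
rng = random.Random(42)         # seeded so the sketch is repeatable

# Five hypothetical region servers each get a different start time.
delays = [next_major_compaction_delay(interval, 0.5, rng) for _ in range(5)]
for d in delays:
    print("%.1f hours" % (d / 3_600_000))
```

With a jitter of 0 every server would compact at exactly the same interval, which is the situation the last bullet warns against.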

Data Locality

Data sets in Hadoop are stored in HDFS. Each data set is divided into blocks that are stored across the data nodes in a Hadoop cluster. When a MapReduce job is executed against the data set, the individual Mappers process the blocks (input splits). When the data is not available to a Mapper on its own node, it has to be copied over the network from a data node that has the data to the node that is executing the Mapper task. Data locality is the practice of avoiding this copy by running the Mapper on, or as close as possible to, the node that holds the data.

Data locality in Hadoop is divided into three categories.

1. Data Local Data Locality

When data is located on the same node as the Mapper working on it, this is referred to as data local data locality. In this case, the data is as close to the computation as it can be. This is the most preferred scenario.

2. Intra-Rack Data Locality

It is not always possible to execute the Mapper on the same node as the data due to resource constraints. In such cases, the Mapper is executed on another node within the same rack as the node that has the data. This is referred to as intra-rack data locality.

3. Inter-Rack Data Locality

It is not always possible to achieve either data local or intra-rack locality due to resource constraints. In such cases, the Mapper is executed on a node in a different rack, and the data is copied between racks from the node that has the data to the node executing the Mapper. This is referred to as inter-rack data locality. This is the least preferred scenario.
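The three categories can be expressed as a small classification function. This is an illustrative sketch, not Hadoop scheduler code: the node names, the rack_of mapping, and the functions locality_level and pick_node are all hypothetical.

```python
def locality_level(task_node, replica_nodes, rack_of):
    """Classify how close a Mapper on `task_node` is to a block's replicas."""
    if task_node in replica_nodes:
        return "data-local"
    task_rack = rack_of[task_node]
    if any(rack_of[node] == task_rack for node in replica_nodes):
        return "intra-rack"
    return "inter-rack"

# Most preferred first, matching the order of the three categories.
PREFERENCE = ["data-local", "intra-rack", "inter-rack"]

def pick_node(free_nodes, replica_nodes, rack_of):
    """Choose the free node with the best locality for the block."""
    return min(
        free_nodes,
        key=lambda n: PREFERENCE.index(locality_level(n, replica_nodes, rack_of)),
    )

rack_of = {"node1": "rackA", "node2": "rackA", "node3": "rackB", "node4": "rackB"}
replicas = {"node1"}  # the block lives on node1 only

for node in ["node1", "node2", "node3"]:
    print(node, locality_level(node, replicas, rack_of))

# node1 is busy, so the best remaining choice is a node in the same rack.
print(pick_node(["node3", "node2"], replicas, rack_of))
```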
