HBase Compaction and Data Locality With Hadoop
Compaction is a process by which HBase cleans itself, and data locality is a solution to data not being available locally to a Mapper.
HBase is a distributed data store optimized for read performance. Optimal read performance comes from having one file per column family. That is not always possible during heavy writes, which is why HBase tries to combine its HFiles into a single large HFile, reducing the maximum number of disk seeks needed for a read. This process is known as compaction.
Compaction is a process by which HBase cleans itself. It comes in two flavors: minor compaction and major compaction.
Minor compaction is the process of combining a configurable number of smaller HFiles into one larger HFile. Minor compaction is very important because without it, reading a particular row can require many disk reads, which reduces overall performance.
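The merge described above can be sketched in a few lines of Python. This is a simplified model, not HBase's actual file-selection algorithm: each "HFile" is assumed to be a list of `(row_key, timestamp, value)` tuples already sorted by row key with the newest timestamp first, and the function name `minor_compact` and the `max_files` parameter are hypothetical.

```python
import heapq


def minor_compact(hfiles, max_files=3):
    """Merge the `max_files` smallest files into one sorted file.

    Every cell is kept; nothing is deleted in this simplified model.
    """
    if len(hfiles) < 2:
        return hfiles  # nothing to merge
    by_size = sorted(hfiles, key=len)
    picked, untouched = by_size[:max_files], by_size[max_files:]
    # Order cells by row key, then newest timestamp first.
    sort_key = lambda cell: (cell[0], -cell[1])
    merged = list(heapq.merge(*picked, key=sort_key))
    return untouched + [merged]


files = [
    [("a", 5, "v1"), ("c", 1, "v2")],
    [("b", 2, "v3")],
    [("a", 9, "v4"), ("d", 3, "v5")],
    [("a", 1, "x"), ("b", 1, "y"), ("c", 9, "z"), ("e", 1, "w")],
]
compacted = minor_compact(files, max_files=3)
print(len(files), "files before,", len(compacted), "files after")
```

Because the inputs are already sorted, the merge is a single sequential pass, and a later read of row `a` has fewer files to consult.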
Major compaction is the process of combining all the StoreFiles of a region's column family into a single StoreFile. It also removes deleted and expired versions. By default, major compaction runs every 24 hours (the interval is configurable, and the default differs between HBase versions) and merges all StoreFiles into a single StoreFile. After compaction, if the new, larger StoreFile exceeds a certain size (defined by a property), the region splits into new regions.
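To illustrate how a major compaction differs from a plain merge, here is a minimal sketch under stated assumptions. Cells are `(row_key, timestamp, value)` tuples, a value of `None` stands in for a delete marker that hides every version at or before its timestamp, and `max_versions` and `min_timestamp` are hypothetical stand-ins for the version and expiry settings of a column family. It is not HBase's implementation.

```python
import heapq
from itertools import groupby

TOMBSTONE = None  # hypothetical delete marker used only in this sketch


def major_compact(store_files, max_versions=1, min_timestamp=0):
    """Merge all files into one, dropping deleted, expired and surplus versions."""
    sort_key = lambda cell: (cell[0], -cell[1])  # row key, newest first
    merged = heapq.merge(*store_files, key=sort_key)
    result = []
    for row, cells in groupby(merged, key=lambda cell: cell[0]):
        kept = 0
        for _, ts, value in cells:
            if value is TOMBSTONE:
                break  # every older version of this row is deleted
            if ts < min_timestamp or kept >= max_versions:
                continue  # expired, or more versions than we retain
            result.append((row, ts, value))
            kept += 1
    return result


store_files = [
    [("a", 1, "old"), ("b", 4, "b1")],
    [("a", 7, "new"), ("b", 6, TOMBSTONE), ("c", 2, "stale")],
    [("a", 3, "mid"), ("b", 9, "b2")],
]
single_file = major_compact(store_files, max_versions=2, min_timestamp=3)
print(single_file)
```

The delete marker itself is also left out of the output, which is why deleted data only disappears from disk once a major compaction has run.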
Disable Automatic Major Compaction
Major compaction can be disabled by updating
Decrease Region Server File Size
Region server file size can be decreased by updating
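As a hedged illustration of the two settings discussed above, the sketch below renders `hbase-site.xml` style properties. It assumes the properties meant are `hbase.hregion.majorcompaction` (a value of `0` turns off time-based major compaction) and `hbase.hregion.max.filesize` (the store file size at which a region splits). The helper `render_hbase_site` is hypothetical, and the 1 GB value is illustrative only, not a recommendation.

```python
import xml.etree.ElementTree as ET


def render_hbase_site(settings):
    """Render a dict of property names and values as hbase-site.xml text."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-print where available (Python 3.9+)
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


site_xml = render_hbase_site({
    # Assumption: 0 disables periodic (time-based) major compaction.
    "hbase.hregion.majorcompaction": 0,
    # Assumption: illustrative 1 GB split threshold, in bytes.
    "hbase.hregion.max.filesize": 1024 ** 3,
})
print(site_xml)
```

With automatic major compaction turned off, it still needs to be triggered manually or by a scheduled job, otherwise deleted data is never cleaned up.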
Major compaction is a heavyweight operation, so run it when your cluster load is low.
Major compaction is not just about compacting files. When a record is deleted or a version expires, the data is not removed from disk right away; that cleanup still has to happen, and major compaction is what performs it.
Whenever you run major compaction, make sure you set hbase.hregion.majorcompaction.jitter so that the major compaction doesn't run on all the nodes at the same time.
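The effect of the jitter setting can be sketched as follows. This is an illustrative model rather than HBase's exact formula: it assumes the jitter is a fraction of the base period and that each region server draws its own random offset within that range. The function name `next_major_compaction_delay` is hypothetical.

```python
import random


def next_major_compaction_delay(period_ms, jitter_fraction, rng=random):
    """Return the base period shifted by a random offset of up to +/- jitter."""
    jitter_ms = period_ms * jitter_fraction
    return period_ms + rng.uniform(-jitter_ms, jitter_ms)


DAY_MS = 24 * 60 * 60 * 1000
rng = random.Random(0)  # seeded only so the sketch is repeatable
delays = [next_major_compaction_delay(DAY_MS, 0.5, rng) for _ in range(5)]
for delay in delays:
    print(round(delay / 3_600_000, 1), "hours until next major compaction")
```

With a jitter of 0.5 and a 24-hour period, each server schedules its next major compaction somewhere between 12 and 36 hours out, so the heavy I/O is spread over time instead of hitting the whole cluster at once.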
Data sets in Hadoop are stored in HDFS. Each data set is divided into blocks that are stored across the data nodes of a Hadoop cluster. When a MapReduce job is executed against the data set, the individual Mappers process the blocks (input splits). When the data is not available to a Mapper on the same node, it has to be copied over the network from the data node that holds it to the node executing the Mapper task. Data locality is the practice of avoiding that copy by running the Mapper as close to its data as possible.
Data locality in Hadoop is divided into three categories.
1. Data Local Data Locality
When the data is located on the same node as the Mapper working on it, this is referred to as data local data locality. In this case, the data is as close to the computation as it can be. This is the most preferred scenario.
2. Intra-Rack Data Locality
It is not always possible to execute the Mapper on the same node as the data because of resource constraints. In such cases, the Mapper is executed on another node within the same rack as the node that holds the data. This is referred to as intra-rack data locality.
3. Inter-Rack Data Locality
It is not always possible to achieve either data local or intra-rack locality because of resource constraints. In such cases, the Mapper is executed on a node in a different rack, and the data is copied between racks from the node that holds it to the node executing the Mapper. This is referred to as inter-rack data locality. This is the least preferred scenario.
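The three categories can be expressed as a small classification routine. The sketch below is a simplified model, not Hadoop's scheduler: the node names, the `rack_of` mapping, and the helpers `classify` and `pick_node` are all hypothetical.

```python
from enum import IntEnum


class Locality(IntEnum):
    DATA_LOCAL = 0  # Mapper runs on a node that holds the block
    INTRA_RACK = 1  # Mapper runs in the same rack as a node holding the block
    INTER_RACK = 2  # Mapper runs in a different rack


def classify(task_node, replica_nodes, rack_of):
    """Classify where a Mapper would run relative to the block's replicas."""
    if task_node in replica_nodes:
        return Locality.DATA_LOCAL
    replica_racks = {rack_of[node] for node in replica_nodes}
    if rack_of[task_node] in replica_racks:
        return Locality.INTRA_RACK
    return Locality.INTER_RACK


def pick_node(free_nodes, replica_nodes, rack_of):
    """Prefer the free node with the best (lowest) locality level."""
    return min(free_nodes, key=lambda node: classify(node, replica_nodes, rack_of))


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r3"}
replicas = {"n1", "n3"}  # nodes holding a copy of the block
for node in sorted(rack_of):
    print(node, classify(node, replicas, rack_of).name)
```

When only `n2` and `n4` have free capacity, `pick_node(["n4", "n2"], replicas, rack_of)` chooses `n2`, because an intra-rack copy is cheaper than an inter-rack one.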