What Is HBase in Hadoop NoSQL?
In this article, take a look at HBase in Hadoop NoSQL and see the characteristics and architecture of HBase.
HBase is a column-oriented NoSQL data store that sits on top of the Hadoop Distributed File System (HDFS) and provides random data lookups and updates for big data workloads. HDFS follows a "Write Once, Read Many" architecture, which means that files written to the HDFS storage layer cannot be modified, only read any number of times. HBase, however, layers a schema on top of HDFS files so that records can be read and updated any number of times.
HBase provides strong consistency for both reads and writes: a read always returns the latest data, and a write is not acknowledged until all replicas have been updated.
HBase provides automatic sharding through regions, which are distributed over the cluster. Whenever a region grows too large to hold its data, it is automatically split, and the resulting regions are distributed across multiple machines.
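The splitting behavior can be sketched in plain Java. This is an illustrative model only, not HBase code: each region holds a sorted range of row keys, and when a region exceeds a size threshold (a stand-in for `hbase.hregion.max.filesize`) it is split around its middle key into two daughter regions.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of HBase-style auto-sharding (not real HBase code).
class RegionSplitSketch {
    static final int MAX_KEYS_PER_REGION = 4; // stand-in for the real size threshold

    // Each "region" is a sorted map of row key -> value; regions are kept in key order.
    static void put(List<TreeMap<String, String>> regions, String rowKey, String value) {
        // Find the last region whose first key is <= rowKey (it owns this key range).
        TreeMap<String, String> target = regions.get(0);
        for (TreeMap<String, String> r : regions) {
            if (!r.isEmpty() && r.firstKey().compareTo(rowKey) <= 0) target = r;
        }
        target.put(rowKey, value);
        if (target.size() > MAX_KEYS_PER_REGION) { // region too large: auto-split
            TreeMap<String, String> daughter = new TreeMap<>();
            int keep = target.size() / 2;
            Iterator<Map.Entry<String, String>> it = target.entrySet().iterator();
            for (int i = 0; i < keep; i++) it.next(); // first half stays in the parent
            while (it.hasNext()) {                    // second half moves to the daughter
                Map.Entry<String, String> e = it.next();
                daughter.put(e.getKey(), e.getValue());
                it.remove();
            }
            regions.add(regions.indexOf(target) + 1, daughter);
        }
    }
}
```

In real HBase the daughters are also reassigned across region servers, which is what spreads load over the cluster.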
HBase provides automatic region failover in case of failures.
HBase is built on top of HDFS and can be integrated with MapReduce programs, acting as both a source and a sink.
Java/REST/Thrift APIs
HBase provides a Java API as well as REST and Thrift APIs for non-Java clients.
HBase has an inbuilt block cache and bloom filter for query optimization.
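The bloom filter's role is worth unpacking: it answers "definitely absent" or "possibly present" for a row key, letting a read skip HFiles that cannot contain the key. A minimal sketch of the idea in plain Java follows; this is illustrative only, not HBase's implementation, which sizes the bit array and hash count from a target false-positive rate.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch (illustrative only, not HBase's implementation).
class BloomSketch {
    private final BitSet bits;
    private final int size;

    BloomSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple hash positions derived from the key's hashCode.
    private int h1(String key) { return Math.floorMod(key.hashCode(), size); }
    private int h2(String key) { return Math.floorMod(key.hashCode() * 31 + 17, size); }

    void add(String rowKey) {
        bits.set(h1(rowKey));
        bits.set(h2(rowKey));
    }

    // false => the key is definitely absent (the HFile can be skipped);
    // true  => the key may be present (the HFile must actually be read).
    boolean mightContain(String rowKey) {
        return bits.get(h1(rowKey)) && bits.get(h2(rowKey));
    }
}
```

False positives are possible (the file is read unnecessarily), but false negatives are not, which is what makes the filter safe as a read-path optimization.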
When Not to Use HBase?
- When your data is not big enough. HBase is suited for tables with billions of rows that cannot be accommodated in a traditional RDBMS.
- When your data arrives at a constant rate and is not expected to grow in the future.
- When you need transaction controls, triggers, secondary indexes, and the many other features supported by traditional databases.
HBase Architecture
HBase has a master-slave architecture: one HBase Master, also known as the HMaster, and multiple slaves called region servers or HRegionServers.
Regions: Tables in HBase are split into multiple regions, and these regions are distributed over multiple machines in the cluster.
HBase Master: The HMaster is responsible for assigning regions to region servers, providing admin operations (creating, updating, and deleting tables), and handling region server failures. The HMaster does not sit on the read/write path: a client locates the region server that holds its row (via ZooKeeper and the hbase:meta table) and communicates with that region server directly.
Region Servers: Region servers run on all worker nodes, and each serves a set of regions. A region server maintains a block cache, which holds frequently accessed data so read requests can be served more efficiently, and a memstore, a write cache holding new data that has not yet been written to disk. Data is eventually flushed from the memstore to HFiles on the region server's disks.
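The memstore-to-HFile write path described above can be sketched in plain Java. This is an illustrative model, not HBase code (it omits the write-ahead log and compactions): writes land in a sorted in-memory memstore, which is flushed as an immutable HFile once it exceeds a threshold, and reads consult the memstore first, then the HFiles from newest to oldest.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch of a region server's write path (not real HBase code).
class MemstoreSketch {
    static final int FLUSH_THRESHOLD = 3; // stand-in for hbase.hregion.memstore.flush.size
    final TreeMap<String, String> memstore = new TreeMap<>();          // sorted write cache
    final List<SortedMap<String, String>> hfiles = new ArrayList<>();  // flushed, immutable files

    void put(String rowKey, String value) {
        memstore.put(rowKey, value);
        if (memstore.size() >= FLUSH_THRESHOLD) flush();
    }

    void flush() {
        // An HFile is written once and never modified afterwards ("Write Once, Read Many").
        hfiles.add(Collections.unmodifiableSortedMap(new TreeMap<>(memstore)));
        memstore.clear();
    }

    // Reads check the memstore first, then the HFiles from newest to oldest.
    String get(String rowKey) {
        if (memstore.containsKey(rowKey)) return memstore.get(rowKey);
        for (int i = hfiles.size() - 1; i >= 0; i--) {
            String v = hfiles.get(i).get(rowKey);
            if (v != null) return v;
        }
        return null;
    }
}
```

This is how HBase reconciles updates with HDFS's write-once files: the files themselves are never modified; new versions simply land in newer memstores and HFiles.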
ZooKeeper: HBase uses ZooKeeper for coordination and failure recovery. ZooKeeper holds the location information for the HBase Master and the region servers, so a client must contact ZooKeeper first in order to connect to the HBase cluster. The ZooKeeper quorum (the ensemble of ZooKeeper servers) also monitors for failures so that failed nodes can be recovered. ZooKeeper is thus an integral part of the HBase architecture, maintaining coordination and synchronization across the cluster.
HBase Data Model
HBase Table: A collection of rows; tables are spread over distributed regions.
HBase Row: It represents a single entity in an HBase table.
Row Key: Like a primary key, it uniquely identifies each row in an HBase table.
Columns: Columns represent the attributes of an entity. For example, in a customer HBase table, the columns could be customer name, age, phone number, etc.
Column Family: Columns that share similar characteristics can be grouped together in the same column family; each column family is stored on HDFS as its own HFiles.
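Putting these pieces together, a cell in HBase is addressed by (row key, column family:qualifier). A conceptual sketch of that nesting in plain Java (not HBase code; real HBase also versions each cell by timestamp):

```java
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch of the HBase data model (not HBase code):
// a table maps a row key to column families, and each family maps
// column qualifiers to values.
class DataModelSketch {
    // rowKey -> (columnFamily -> (qualifier -> value))
    static final Map<String, Map<String, Map<String, String>>> table = new TreeMap<>();

    static void put(String rowKey, String family, String qualifier, String value) {
        table.computeIfAbsent(rowKey, k -> new TreeMap<>())
             .computeIfAbsent(family, k -> new TreeMap<>())
             .put(qualifier, value);
    }

    static String get(String rowKey, String family, String qualifier) {
        return table.getOrDefault(rowKey, Map.of())
                    .getOrDefault(family, Map.of())
                    .get(qualifier); // null if the cell does not exist
    }
}
```

Note that rows are sparse: a row stores only the qualifiers actually written to it, which is why two rows in the same table can have entirely different column sets.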
Getting Started With HBase
We will create the table below, named employee, first using HBase shell commands and then using the Java API. The employee table has two column families: the personal column family, which holds personal information such as name and age, and the professional column family, which holds professional information such as salary and designation.
HBase Shell Commands
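A minimal shell session that creates and populates the employee table could look as follows (the sample row key and values are illustrative):

```
create 'employee', 'personal', 'professional'
put 'employee', 'emp1', 'personal:name', 'John'
put 'employee', 'emp1', 'personal:age', '30'
put 'employee', 'emp1', 'professional:designation', 'Engineer'
put 'employee', 'emp1', 'professional:salary', '50000'
get 'employee', 'emp1'
scan 'employee'
```

Java API
The same table can be created programmatically. The sketch below assumes an HBase 2.x client dependency on the classpath and a running cluster whose ZooKeeper quorum is reachable on localhost:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class EmployeeTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost"); // client bootstraps via ZooKeeper
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            TableName name = TableName.valueOf("employee");
            // Create the table with its two column families.
            admin.createTable(TableDescriptorBuilder.newBuilder(name)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("professional"))
                    .build());
            try (Table table = connection.getTable(name)) {
                // Insert one illustrative row, then read it back.
                Put put = new Put(Bytes.toBytes("emp1"));
                put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("John"));
                put.addColumn(Bytes.toBytes("professional"), Bytes.toBytes("salary"), Bytes.toBytes("50000"));
                table.put(put);
                Result result = table.get(new Get(Bytes.toBytes("emp1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"))));
            }
        }
    }
}
```

Running this requires a live HBase cluster; without one, ConnectionFactory.createConnection will fail to reach ZooKeeper.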
HBase is an ideal choice when your big data is already stored on Hadoop. It mitigates the limitations of HDFS by providing random reads, writes, and updates, and it is a distributed, horizontally scalable, fault-tolerant data store that works well with a Hadoop cluster.