An Introduction to HBase
In this article, we introduce HBase and show how to create a three-node HBase cluster.
In our last two articles, we talked about the HDFS cluster and the ZooKeeper cluster, both of which are needed for deploying OpenTSDB in clustered mode. Continuing the series, we now turn to HBase, which OpenTSDB uses in the cluster to store data.
HBase is a column-oriented NoSQL database management system that runs on top of the Hadoop Distributed File System (HDFS).
It is part of the Hadoop ecosystem and provides random, real-time read/write access to data stored in HDFS.
Data can be stored in HDFS either directly or through HBase. Consumers that need random access read and write that data through HBase, which sits on top of HDFS.
HBase is well suited to sparse data sets, which are common in many big data use cases. Like most other Apache projects, it is written mainly in Java. It can store very large amounts of data, from terabytes to petabytes. HBase is not a relational database: it does not support a structured query language such as SQL. It is built for low-latency operations, and its feature set differs from that of traditional relational models.
Storage mechanism in HBase:
HBase is a column-oriented database. It stores data in tables, sorted by row ID. The table schema defines only the column families. A table can have multiple column families, and each column family can have any number of columns. Each stored value is a key-value pair. Although HBase stores data on disk in a column-oriented format, it is distinctly different from traditional columnar databases.
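The storage model described above can be pictured as a sorted, sparse, nested map. The following is a minimal Python sketch of that idea, not HBase's actual implementation or client API; the class name `SketchTable`, its methods, and the sample family names are hypothetical and chosen only for illustration.

```python
class SketchTable:
    """Toy model of the layout: row key -> column family -> column -> value."""

    def __init__(self, families):
        # Only the column families are fixed up front, as in the table schema.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Columns inside a family are created on demand.
        row = self.rows.setdefault(row_key, {})
        row.setdefault(family, {})[qualifier] = value

    def get(self, row_key, family, qualifier):
        # A missing cell is simply absent; nothing is stored for it (sparse data).
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)

    def scan(self):
        # Rows are returned in row-key order.
        for row_key in sorted(self.rows):
            yield row_key, self.rows[row_key]


table = SketchTable(families=["info", "metrics"])
table.put(b"row-2", "info", b"host", b"web02")
table.put(b"row-1", "info", b"host", b"web01")
table.put(b"row-1", "metrics", b"cpu", b"0.75")
print([key for key, _ in table.scan()])
```

Note that `row-2` has no `metrics` cell at all, which is how sparse rows avoid taking up space for empty columns.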
In HBase, tables are divided into regions, which are served by region servers.
The main components of HBase are described below.
HBase master server:
- uses Apache ZooKeeper and assigns regions to the region servers
- is responsible for load balancing: it takes regions away from busy servers and assigns them to less occupied servers
- is responsible for schema changes (HBase table creation, creation of column families, etc.)
- provides the interface for creating, deleting, and updating tables
- monitors all the region servers in the cluster
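The load-balancing responsibility in the list above can be sketched as follows. This is a deliberately simplified illustration that balances purely by region count; the actual HBase balancer takes more factors into account. The function name `rebalance` and the server and region names are hypothetical.

```python
def rebalance(assignments):
    """Move regions from the busiest server to the least busy one
    until the region counts differ by at most one."""
    # Work on a copy so the caller's mapping is left untouched.
    result = {server: list(regions) for server, regions in assignments.items()}
    if not result:
        return result
    while True:
        busiest = max(result, key=lambda s: len(result[s]))
        idlest = min(result, key=lambda s: len(result[s]))
        if len(result[busiest]) - len(result[idlest]) <= 1:
            return result
        # Reassign one region from the busy server to the less occupied one.
        result[idlest].append(result[busiest].pop())


before = {"rs1": ["r1", "r2", "r3", "r4", "r5"], "rs2": ["r6"], "rs3": []}
after = rebalance(before)
print({server: len(regions) for server, regions in after.items()})
```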
HBase tables are split horizontally into regions, and each region is managed by a region server.
HBase region server:
Each region is assigned to a node in the cluster called a region server, which manages the regions it hosts. When the data grows beyond a configured limit, HBase automatically splits the table and distributes the load to another region server, reducing the load on any single server. A single region server can serve around 1,000 regions.
The process of splitting tables into regions is called sharding, and it is done automatically.
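To make the splitting idea concrete, here is a minimal sketch in Python. It assumes a region covers a contiguous range of row keys and splits at its middle key once it holds too many rows; HBase itself decides based on data size rather than row count. The names `SketchRegion`, `SketchShardedTable`, and `max_rows_per_region` are hypothetical.

```python
import bisect


class SketchRegion:
    def __init__(self, start_key, rows=None):
        self.start_key = start_key      # inclusive lower bound of the key range
        self.rows = dict(rows or {})


class SketchShardedTable:
    def __init__(self, max_rows_per_region):
        self.max_rows = max_rows_per_region
        # A new table starts as one region covering every possible key.
        self.regions = [SketchRegion(b"")]

    def _region_index(self, row_key):
        # Regions are kept ordered by start key, so a binary search finds the owner.
        starts = [region.start_key for region in self.regions]
        return bisect.bisect_right(starts, row_key) - 1

    def put(self, row_key, value):
        index = self._region_index(row_key)
        region = self.regions[index]
        region.rows[row_key] = value
        if len(region.rows) > self.max_rows:
            self._split(index)

    def _split(self, index):
        region = self.regions[index]
        keys = sorted(region.rows)
        middle = keys[len(keys) // 2]
        # The upper half of the key range moves into a new region.
        upper_rows = {k: region.rows.pop(k) for k in keys if k >= middle}
        self.regions.insert(index + 1, SketchRegion(middle, upper_rows))


sharded = SketchShardedTable(max_rows_per_region=4)
for i in range(10):
    sharded.put(f"k{i:02d}".encode(), b"value")
print([(r.start_key, len(r.rows)) for r in sharded.regions])
```

In a real cluster the new regions could then be assigned to different region servers, which is what spreads the load.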
Role of the region server:
- communicates with the client and handles data-related operations
- decides the size of the region
- splits regions automatically
- handles read and write requests for all the regions under it
HFile is the file-based data structure that HBase uses to store data. It is a file of sorted key/value pairs in which both keys and values are byte arrays. This structure supports random read and write operations on the table; values are looked up and updated by key.
MemStore is a write buffer. Data is buffered in the MemStore before it is written permanently. When the MemStore is full, its contents are flushed to an HFile. The flush does not write into an existing HFile; it creates a new one.
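The interplay between MemStore and HFile can be sketched in a few lines of Python. This is an in-memory illustration under simplifying assumptions: the flush is triggered by entry count rather than by size in bytes, and nothing is written to disk. The class names `SketchHFile` and `SketchStore` and the `flush_threshold` parameter are hypothetical.

```python
import bisect


class SketchHFile:
    """Immutable collection of sorted key/value pairs (keys and values are bytes)."""

    def __init__(self, items):
        pairs = sorted(items)
        self.keys = [key for key, _ in pairs]
        self.values = [value for _, value in pairs]

    def get(self, key):
        # Sorted keys allow a binary search instead of a full scan.
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


class SketchStore:
    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.memstore = {}   # write buffer
        self.hfiles = []     # oldest first

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # A flush always creates a new file; existing files are never modified.
        if self.memstore:
            self.hfiles.append(SketchHFile(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        # The newest data wins: check the buffer first, then files newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            value = hfile.get(key)
            if value is not None:
                return value
        return None


store = SketchStore(flush_threshold=2)
store.put(b"a", b"1")
store.put(b"b", b"2")   # buffer is full: first file is created
store.put(b"a", b"3")
store.put(b"c", b"4")   # buffer is full again: a second file is created
print(len(store.hfiles), store.get(b"a"))
```

Because the second flush creates a new file, the older value for `a` is still present in the first file, but reads return the newer one.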
HBase uses HDFS to store data. For more information, please refer to our article An Introduction to HDFS.
HBase uses ZooKeeper as a centralized monitoring server to maintain configuration information. ZooKeeper also provides distributed synchronization. For more information, please refer to our last article, An Introduction to ZooKeeper.
To deploy HBase, we will use the harisekhon/hbase:1.2 Docker image.
Create an hbase-site.xml file in /root/hadoop/ on all three VMs.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode:8020/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zoo1,zoo2,zoo3</value>
  </property>
  <property>
    <name>hbase.zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
  <property>
    <name>hbase.status.published</name>
    <value>false</value>
  </property>
  <property>
    <name>hbase.region.replica.replication.enabled</name>
    <value>true</value>
  </property>
</configuration>
Replace zoo1,zoo2,zoo3 with the IP addresses of the respective ZooKeeper nodes.
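If you prefer to script that substitution rather than edit the file by hand, one possible approach is sketched below using Python's standard `xml.etree.ElementTree` module. The helper name `set_property` is hypothetical, the embedded XML is a shortened stand-in for the real file, and the IP addresses are placeholders from a documentation-only range.

```python
import xml.etree.ElementTree as ET


def set_property(config_xml, name, value):
    """Return config_xml with the named <property> set to value (appended if absent)."""
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            prop.find("value").text = value
            break
    else:
        # The property was not present, so add a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


example = """<configuration>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zoo1,zoo2,zoo3</value>
  </property>
</configuration>"""

# Placeholder addresses; substitute the IPs of your own ZooKeeper nodes.
zookeeper_ips = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]
updated = set_property(example, "hbase.zookeeper.quorum", ",".join(zookeeper_ips))
print(updated)
```

Be aware that serializing this way drops the XML declaration and the stylesheet processing instruction from the top of the file, so you may want to add them back before saving.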
HBase on VM 1:
docker run -dit --name hbase1 \
  -p 8080:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 \
  -p 16000:16000 -p 16010:16010 -p 16201:16201 -p 16301:16301 \
  -v /root/hadoop/hbase-site.xml:/hbase-1.2.6/conf/hbase-site.xml \
  --env-file hbase_env \
  --network generic-class-net \
  -h hbase1.generic-class-net \
  harisekhon/hbase:1.2
HBase on VM 2:
docker run -dit --name hbase2 \
  -p 8080:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 \
  -p 16000:16000 -p 16010:16010 -p 16201:16201 -p 16301:16301 \
  -v /root/hadoop/hbase-site.xml:/hbase-1.2.6/conf/hbase-site.xml \
  --env-file hbase_env \
  --network generic-class-net \
  -h hbase2.generic-class-net \
  harisekhon/hbase:1.2
HBase on VM 3:
docker run -dit --name hbase3 \
  -p 8080:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 \
  -p 16000:16000 -p 16010:16010 -p 16201:16201 -p 16301:16301 \
  -v /root/hadoop/hbase-site.xml:/hbase-1.2.6/conf/hbase-site.xml \
  --env-file hbase_env \
  --network generic-class-net \
  -h hbase3.generic-class-net \
  harisekhon/hbase:1.2
Once all the services are deployed, you can see the HBase status at http://<vm1 | vm2 | vm3 ip>:16010/master-status.
Published at DZone with permission of Nitin Ranjan, DZone MVB. See the original article here.