
HBase, Phoenix, and Java — Part 1


Read this article in order to learn more about HBase, Phoenix, and Java and how to access data stored in an HBase table using the HBase API.



Introduction

HBase is one of the NoSQL data stores available in the Apache Hadoop ecosystem. It is an open-source, multidimensional, distributed, and scalable NoSQL data store written in Java. HBase runs on top of HDFS (the Hadoop Distributed File System) and provides a fault-tolerant way of storing large collections of sparse data sets. HBase achieves high throughput and low latency by providing fast read/write access to huge data sets. Hence, it is the choice for applications that require fast, random access to large amounts of data.


Figure 1: Hadoop 1 Ecosystem

HBase is a column-oriented data store: data is stored in cells grouped by column rather than by row. Columns are logically grouped into column families, which can be created either during schema definition or at runtime. Column-oriented databases store all the cells of a column as contiguous disk entries, making access and search much faster. Column-oriented databases are suitable for storing customer behavior in e-commerce, stock market data, Google Maps data, etc.

For comparison, let us consider some data and see how it is stored in an RDBMS and in HBase. Table 1 shows how data is stored in an RDBMS.

EMP ID   First Name   City       Country   Continent       Primary Skill   Secondary Skill
1        Ravi         Mumbai     India     Asia            C++             Oracle
2        Jack         London     UK        Europe          Java            Oracle
3        Harper       San Jose   USA       North America   Linux           Red Hat

Table 1: Data stored in an RDBMS

The same data can be stored in a column-oriented data store like HBase as shown in Table 2, where "Personal Details" and "Skill Details" are column families. "First Name" and "City" are specific columns within the "Personal Details" column family.

         Personal Details                                  Skill Details
EMP ID   First Name   City       Country   Continent       Primary   Secondary
1        Ravi         Mumbai     India     Asia            C++       Oracle
2        Jack         London     UK        Europe          Java      Oracle
3        Harper       San Jose   USA       North America   Linux     Red Hat

Table 2: Data stored in a column-store like HBase
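As a sketch of how such a schema could be defined programmatically, the snippet below creates a table with the two column families from Table 2 using the same 0.98-era client API as the example later in this article. The table name "EMP" and the family names are illustrative, and a running HBase instance with the client libraries on the classpath is assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateEmpTable {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(config);

        // One table, two column families. Individual columns inside a
        // family (e.g. "City" under "PersonalDetails") need no schema
        // definition -- they are created when first written.
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("EMP"));
        desc.addFamily(new HColumnDescriptor("PersonalDetails"));
        desc.addFamily(new HColumnDescriptor("SkillDetails"));

        if (!admin.tableExists("EMP")) {
            admin.createTable(desc);
        }
        admin.close();
    }
}
```

Only the column families are fixed at creation time, which is what gives HBase its schema flexibility at the column level.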

The advantage of this column-oriented structure is that fetching the names of all cities in the database is much faster and easier than performing the same operation in an RDBMS. In other words, if we wish to fetch specific columns, a column-oriented store performs far better than an RDBMS, where all the rows must be fetched and the specific columns extracted from each fetched row.
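To make this concrete, here is a minimal, self-contained Java sketch (plain Java, not HBase code; the class and layout are illustrative) contrasting the two layouts when extracting the "City" column from Table 1:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnStoreDemo {
    // Row-oriented layout: each record stores every column together.
    static final List<String[]> ROWS = Arrays.asList(
            new String[] {"1", "Ravi", "Mumbai"},
            new String[] {"2", "Jack", "London"},
            new String[] {"3", "Harper", "San Jose"});

    // Column-oriented layout: all cells of a column are stored together.
    static final Map<String, List<String>> COLUMNS = new LinkedHashMap<>();
    static {
        COLUMNS.put("EMP ID", Arrays.asList("1", "2", "3"));
        COLUMNS.put("First Name", Arrays.asList("Ravi", "Jack", "Harper"));
        COLUMNS.put("City", Arrays.asList("Mumbai", "London", "San Jose"));
    }

    // Row store: every row must be visited to extract one column.
    static List<String> citiesFromRows() {
        List<String> cities = new ArrayList<>();
        for (String[] row : ROWS) {
            cities.add(row[2]); // index 2 = City
        }
        return cities;
    }

    // Column store: the "City" column is read as one contiguous chunk.
    static List<String> citiesFromColumns() {
        return COLUMNS.get("City");
    }

    public static void main(String[] args) {
        System.out.println(citiesFromRows());    // [Mumbai, London, San Jose]
        System.out.println(citiesFromColumns()); // [Mumbai, London, San Jose]
    }
}
```

Both calls return the same data, but the column-store version never touches the other columns, which is the effect described above.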

Pros and Cons of Column-Oriented Databases

Column-oriented databases have the following advantages:

  • Built-in support for efficient data compression
  • Fast data retrieval
  • Simplified administration and configuration
  • Easy to expand, since they scale out horizontally
  • Well-suited for high-performance aggregation queries (such as COUNT, SUM, AVG, MIN, and MAX)
  • Efficient partitioning, with automatic sharding that splits larger regions into smaller ones

Column-oriented databases have the following limitations:

  • Queries that JOIN data from many tables are not optimized
  • Recording many updates and deletes forces frequent compactions and region splits, which reduces storage efficiency
  • Partitioning and indexing schemes can be difficult to design, since relational concepts do not apply implicitly

HBase Architecture

HBase has several components, namely the HBase HMaster, ZooKeeper, the NameNode, and RegionServers. In HBase, tables are split into smaller chunks that are distributed across multiple servers. These chunks are called regions, and the servers that host regions are called RegionServers. The master process handles the distribution of regions among the RegionServers, where each RegionServer typically hosts multiple regions.


Figure 2: HBase Architecture

The HRegionServer is responsible for serving the regions (HRegion) assigned to it. Each HRegion has a MemStore and one or more StoreFiles (a StoreFile is a wrapper around an HFile). The MemStore accumulates data edits in memory as they happen and writes them to disk when it fills up.

Features of HBase

Some of the features of HBase are:

  • Atomic reads and writes: HBase provides atomic reads and writes at the row level. While one process is reading or writing a row, all other processes are prevented from reading or writing that row.
  • Consistent reads and writes: As a consequence of the above, HBase provides strongly consistent reads and writes.
  • Linear and modular scalability: Since data sets are distributed over HDFS, HBase is linearly scalable across nodes, as well as modularly scalable, because it is divided across nodes.
  • Automatic and configurable sharding of tables: HBase tables are distributed across the cluster as regions, which split and are redistributed automatically as the data grows.
  • Easy-to-use Java API for client access: HBase provides an easy-to-use Java API for programmatic access.
  • Thrift gateway and RESTful web services: HBase also supports Thrift and REST APIs for non-Java front-ends.
  • Block Cache and Bloom filters: HBase supports a Block Cache and Bloom filters for high-volume query optimization.
  • Automatic failure support: HBase with HDFS provides a WAL (Write-Ahead Log) across the cluster, which enables automatic recovery from failures.
  • Sorted row keys: Since searching is done over ranges of rows, HBase stores row keys in lexicographical order. Using these sorted row keys and timestamps, we can build optimized requests.

HBase supports ordered partitioning, in which the rows of a column family are stored in row-key order. It does not support read load balancing: a single RegionServer serves each read request, and the replicas are used only in case of failure.
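The practical payoff of lexicographically sorted row keys is that a range scan (a Scan with a start row and stop row) is a contiguous walk through the key space. The following self-contained sketch (plain Java, no HBase dependency; the row keys are illustrative) mimics that ordering with a sorted map:

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class SortedKeyDemo {
    // Mimics a Scan with startRow (inclusive) and stopRow (exclusive)
    // over row keys kept in lexicographic order, as HBase stores them.
    static SortedMap<String, String> range(TreeMap<String, String> rows,
                                           String startRow, String stopRow) {
        return rows.subMap(startRow, stopRow);
    }

    public static void main(String[] args) {
        TreeMap<String, String> rows = new TreeMap<>();
        rows.put("r1", "Ravi");
        rows.put("r10", "Asha"); // lexicographically, "r10" sorts before "r2"
        rows.put("r2", "Jack");
        rows.put("r3", "Harper");

        // Scan [r1, r2) returns r1 and r10, not r1 and r2.
        System.out.println(range(rows, "r1", "r2").keySet()); // [r1, r10]
    }
}
```

Note the "r10 before r2" surprise: because ordering is lexicographic on bytes, row-key design (fixed-width or zero-padded keys) matters when you rely on range scans.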

Using HBase

We can use HBase in the following scenarios, among others:

  • We have large data sets (millions or billions of rows and columns) and require fast, random, real-time read and write access to the data.
  • The data sets are distributed across various clusters and we need high scalability to handle them.
  • The data is gathered from various sources and is semi-structured, unstructured, or a combination of both; such data can be handled easily with HBase.
  • We have column-oriented data.
  • We have many versions of the data sets and we need to store all of them.
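On the last point: HBase keeps each cell's versions keyed by timestamp, newest first, and a read returns the latest version unless older ones are explicitly requested. A minimal plain-Java sketch of that idea (illustrative, not the HBase implementation; the timestamps and values are made up):

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

public class VersionedCellDemo {
    // One cell's versions, keyed by timestamp in descending order,
    // mirroring how HBase orders cell versions.
    static final NavigableMap<Long, String> cell =
            new TreeMap<>(Collections.reverseOrder());

    // A plain read returns the newest version.
    static String latest() {
        return cell.firstEntry().getValue();
    }

    public static void main(String[] args) {
        cell.put(1000L, "Mumbai");   // oldest version
        cell.put(2000L, "London");
        cell.put(3000L, "San Jose"); // newest version

        System.out.println(latest());      // San Jose
        System.out.println(cell.keySet()); // [3000, 2000, 1000]
    }
}
```

In real HBase code the same effect is reached with `Get.setMaxVersions()` to ask for more than the newest version of a cell.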

HBase vs. HDFS

Though HBase is built on top of HDFS, it differs from HDFS in the following ways:

HDFS                                                      HBase
Provides high-latency operations                          Provides low-latency access to small amounts of data
                                                          within a large data set
Supports WORM (Write Once, Read Many times) access        Supports random reads and writes
Primarily accessed through MapReduce or Tez jobs          Accessed through shell commands, the Java API, and the
                                                          REST, Avro, or Thrift APIs
Stores large data sets in a distributed environment       Provides fast, random, real-time access to the data
and leverages batch processing on that data               it stores on HDFS

Table 3: HDFS vs. HBase

Example

Let us see how we can access data stored in an HBase table using the HBase API. For this purpose, we will use a Java application.

The following sample shows how a Java-based application can access data in an HBase table using its API. The table is named "BMARKS" and has the columns "NAME", "ENG", "MATH", "SCI", "HIST", and "GEO". In other words, the HBase table stores students' marks in specific subjects.
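The listing below only reads from this table. For completeness, here is a hedged sketch of how rows could be inserted first, using the same 0.98-era client API as the listing. The "marks" column family is an assumption (the article does not name the table's family), and a running HBase instance is required.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTableFactory;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BMarksLoader {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HTableFactory factory = new HTableFactory();
        HTableInterface table =
                factory.createHTableInterface(config, Bytes.toBytes("BMARKS"));
        try {
            // Row key "r1"; each subject is a qualifier in the assumed
            // "marks" column family.
            Put put = new Put(Bytes.toBytes("r1"));
            put.add(Bytes.toBytes("marks"), Bytes.toBytes("NAME"), Bytes.toBytes("Ravi"));
            put.add(Bytes.toBytes("marks"), Bytes.toBytes("ENG"), Bytes.toBytes("78"));
            put.add(Bytes.toBytes("marks"), Bytes.toBytes("MATH"), Bytes.toBytes("91"));
            table.put(put);
        } finally {
            factory.releaseHTableInterface(table); // disconnect
        }
    }
}
```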

package hbase_bmarks;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTableFactory;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FilterList.Operator;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.filter.QualifierFilter;
import org.apache.hadoop.hbase.util.Bytes;

import misc.Misc;

public class BMarks {
    private final static byte[] tableName = Bytes.toBytes("BMARKS");
    private final static byte[] cellData = Bytes.toBytes("cell_data");

    private final static byte[] colENG = Bytes.toBytes("ENG");
    private final static byte[] colMATH = Bytes.toBytes("MATH");
    private final static byte[] colSCI = Bytes.toBytes("SCI");
    private final static byte[] colHIST = Bytes.toBytes("HIST");
    private final static byte[] colGEO = Bytes.toBytes("GEO");

    private final static String HBaseAddress = "127.0.0.1";
    private final static String HBasePort = "2181";
    private final static String HBaseURL = "/hbase-unsecure";
    private final static String tableStr = "BMARKS";

    /** Drop tables if this value is set true. */
    static boolean INITIALIZE_AT_FIRST = false;

    /* Use filters to select a specific column (here 'ENG') in a specific row ('r1'). */
    private void filters(HBaseAdmin admin, HTableInterface table) throws IOException {
        System.out.println ("*** bmarks - filters ***");

        long start = System.currentTimeMillis();

        Filter filter1 = new PrefixFilter(Bytes.toBytes("r1"));
        Filter filter2 = new QualifierFilter(CompareOp.EQUAL, new BinaryComparator(colENG));

        List<Filter> filters = Arrays.asList(filter1, filter2);
        Filter filter3 = new FilterList(Operator.MUST_PASS_ALL, filters);

        Scan scan = new Scan();
        scan.setFilter(filter3);

        ResultScanner scanner = table.getScanner(scan);
        try {
            int matches = 0;
            for (Result result : scanner) {
                System.out.println("Filter " + scan.getFilter() + " matched row: " + result);
                matches++;
            }
            System.out.println("Rows matched: " + matches);

            long end = System.currentTimeMillis();
            System.out.println("Time in sec: " + Misc.getTimeInSeconds(start, end));
        } finally {
            scanner.close();
        }
        System.out.println("Done.");
    }

    /* Scan all the data. Equivalent to a 'select *'. */
    private void scan(HBaseAdmin admin, HTableInterface table) throws IOException {
        System.out.println("*** bmarks -- scan ***");

        long start = System.currentTimeMillis();

        Scan scan = new Scan();

        long count = 0;
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                System.out.println("Found row: " + result);
                count = count + Result.getTotalSizeOfCells(result);

                System.out.println("Row key: " + Misc.byteArrayToString(result.getRow()));
                System.out.println("Row.toString: " + result.toString());
                System.out.println("Row.value: " + Misc.byteArrayToString(result.value()));
                List<Cell> cells = result.listCells();
                for (Cell c : cells) {
                    // Use the offset and length: getFamilyArray() returns the
                    // whole backing array, not just the family bytes.
                    String family = Bytes.toString(c.getFamilyArray(),
                            c.getFamilyOffset(), c.getFamilyLength());
                    System.out.println("\t family: " + family);
                }
            }

            long end = System.currentTimeMillis();
            System.out.println("Time in sec: " + Misc.getTimeInSeconds(start, end));

            System.out.println("Number of bytes: " + count);
        } finally {
            scanner.close();
        }
        System.out.println("Done.");
    }

    /* Same as 'scan', but displays the information in several different ways. */
    private void scan2(HBaseAdmin admin, HTableInterface table) throws IOException {
        System.out.println("*** bmarks -- scan2 ***");

        long start = System.currentTimeMillis();

        Scan scan = new Scan();

        long count = 0;
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                System.out.println("Found row: " + result);
                System.out.println("result.toString(): " + result.toString());
                System.out.println("result.getRow(): " + Bytes.toString(result.getRow()));
                count = count + Result.getTotalSizeOfCells(result);

                //
                // Method 1 -- the deprecated KeyValue API
                //
                for (KeyValue kv : result.raw()) {
                    System.out.print("\trow: '" + new String(kv.getRow()) + "'");
                    System.out.print(", family: '" + new String(kv.getFamily()) + "'");
                    System.out.print(", qualifier: '" + new String(kv.getQualifier()) + "'");
                    System.out.print(", timestamp: '" + kv.getTimestamp() + "'");
                    System.out.println(", value: '" + new String(kv.getValue()) + "'");
                }

                //
                // Method 2 -- walk the family/qualifier/version map by key set
                //
                NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> map = result.getMap();
                Set<byte[]> families = map.keySet();
                for (byte[] family : families) {
                    System.out.println("\tcolumn family: " + new String(family));
                    NavigableMap<byte[], NavigableMap<Long, byte[]>> columns = map.get(family);
                    for (byte[] qualifier : columns.keySet()) {
                        System.out.println("\t\tcolumn family: " + new String(family)
                                + ", column: " + new String(qualifier)
                                + ", column value: " + new String(result.getValue(family, qualifier)));
                    }
                }

                //
                // Method 3 -- traverse the entire returned row via map entries
                //
                NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> map2 = result.getMap();
                for (Map.Entry<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> navigableMapEntry : map2.entrySet()) {
                    String family = Bytes.toString(navigableMapEntry.getKey());
                    System.out.println("\t" + family);
                    NavigableMap<byte[], NavigableMap<Long, byte[]>> familyContents = navigableMapEntry.getValue();
                    for (Map.Entry<byte[], NavigableMap<Long, byte[]>> mapEntry : familyContents.entrySet()) {
                        String qualifier = Bytes.toString(mapEntry.getKey());
                        System.out.println("\t\t" + qualifier);
                        NavigableMap<Long, byte[]> qualifierContents = mapEntry.getValue();
                        for (Map.Entry<Long, byte[]> entry : qualifierContents.entrySet()) {
                            Long timestamp = entry.getKey();
                            String value = Bytes.toString(entry.getValue());
                            System.out.printf("\t\t\t%s, %d\n", value, timestamp);
                        }
                    }
                }
            }

            long end = System.currentTimeMillis();
            System.out.println("Time in sec: " + Misc.getTimeInSeconds(start, end));

            System.out.println("Number of bytes: " + count);
        } finally {
            scanner.close();
        }
        System.out.println("Done.");
    }

    /* Scan only a specific column family. */
    private void scanColumn(HBaseAdmin admin, HTableInterface table, byte[] columnFamily) throws IOException {
        System.out.println("*** bmarks -- scanColumn ***");

        long start = System.currentTimeMillis();

        Scan scan = new Scan();
        scan.addFamily(columnFamily);

        long count = 0;
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                System.out.println("Found row: " + result);
                count = count + Result.getTotalSizeOfCells(result);

                System.out.println("Row key: " + Misc.byteArrayToString(result.getRow()));
                System.out.println("Found row: " + result.toString());
                List<Cell> cells = result.listCells();
                for (Cell c : cells) {
                    String family = Bytes.toString(c.getFamilyArray(),
                            c.getFamilyOffset(), c.getFamilyLength());
                    System.out.println("\t family: " + family);
                }
            }

            long end = System.currentTimeMillis();
            System.out.println("Time in sec: " + Misc.getTimeInSeconds(start, end));

            System.out.println("Number of bytes: " + count);
        } finally {
            scanner.close();
        }

        System.out.println("Done.");
    }

    public void run(Configuration config) throws IOException {
        HBaseAdmin admin = new HBaseAdmin(config);
        HTableFactory factory = new HTableFactory();

        HTableInterface table = factory.createHTableInterface(config, tableName);
        try {
            scan(admin, table);
            scan2(admin, table);
            scanColumn(admin, table, colENG);
            filters(admin, table);
        } finally {
            factory.releaseHTableInterface(table); // Disconnect
            admin.close();
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        config.addResource(new Path(args[0]));

        // Fail fast if the HMaster is not reachable.
        HBaseAdmin.checkHBaseAvailable(config);

        BMarks bm = new BMarks();
        bm.run(config);
    }
}

