Hadoop 101: HBase and Client Access

Apache HBase is a NoSQL store that's known for scaling to massive size and supporting fast reads. With SQL on top of it, you get everything you need for big data.

The current stable version of Hadoop, 2.7.3, has a lot of projects, features, and parts. I will walk you through the most important and widely used tools and share tips and techniques for getting the most out of your Hadoop data lake.

The first piece we'll look at is HBase. Apache HBase is a pretty awesome NoSQL store that uses ZooKeeper for coordination and stores its data in HDFS. It's based on the legendary Google Bigtable and is known for scaling to massive size and supporting fast reads.

The easiest way to access data in HBase is the HBase shell. It is very easy to use and has built-in documentation (type help). To start it, you merely type hbase shell. Then, you will be able to run commands against HBase. Users of Redis and similar NoSQL tools will be familiar with this type of interface.

hbase(main):002:0> scan 'OSQUERY', {LIMIT => 10}
ROW COLUMN+CELL
01bf7fd8-b428-44f3-8347-0ed8fca8425a column=0:COMPUTER_NAME, timestamp=1485381624471, value=tspanndev13.field.hortonworks.com
01bf7fd8-b428-44f3-8347-0ed8fca8425a column=0:CPU_BRAND, timestamp=1485381624471, value=Intel Xeon E312xx (Sandy Bridge)
01bf7fd8-b428-44f3-8347-0ed8fca8425a column=0:CPU_PHYSICAL_CORES, timestamp=1485381624471, value=8
01bf7fd8-b428-44f3-8347-0ed8fca8425a column=0:FILENAME, timestamp=1485381624471, value=2356051780638113
01bf7fd8-b428-44f3-8347-0ed8fca8425a column=0:PHYSICAL_MEMORY, timestamp=1485381624471, value=15601471488
01bf7fd8-b428-44f3-8347-0ed8fca8425a column=0:_0, timestamp=1485381624471, value=x
02510fb9-0445-4355-9232-c5be9ace07dc column=0:COMPUTER_NAME, timestamp=1485384450120, value=tspanndev13.field.hortonworks.com
02510fb9-0445-4355-9232-c5be9ace07dc column=0:CPU_BRAND, timestamp=1485384450120, value=Intel Xeon E312xx (Sandy Bridge)
02510fb9-0445-4355-9232-c5be9ace07dc column=0:CPU_PHYSICAL_CORES, timestamp=1485384450120, value=8

The scan command queries the OSQUERY table, as shown above. Using LIMIT, we cap the number of rows returned. Filters can also be applied to the scan to narrow what is returned.
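To make the idea of a limited, filtered scan concrete, here is a toy in-memory sketch. It is not the HBase API: the class ToyScan, its scan method, and the row keys are hypothetical, and a TreeMap stands in for HBase's row-key ordering. It only illustrates the behavior described above, assuming rows are visited in key order, the filter decides which rows are kept, and the limit counts rows rather than cells.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

/** Toy in-memory model of a limited, filtered scan; not the HBase API. */
class ToyScan {

    /**
     * Walks rows in row-key order, keeps the rows accepted by the filter,
     * and stops once `limit` rows have been collected.
     */
    static List<String> scan(TreeMap<String, Map<String, String>> table,
                             Predicate<Map<String, String>> rowFilter,
                             int limit) {
        List<String> rowKeys = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> row : table.entrySet()) {
            if (rowKeys.size() >= limit) {
                break; // the limit counts rows, not individual cells
            }
            if (rowFilter.test(row.getValue())) {
                rowKeys.add(row.getKey());
            }
        }
        return rowKeys;
    }

    public static void main(String[] args) {
        // Hypothetical rows; each row maps "family:qualifier" to a value.
        TreeMap<String, Map<String, String>> table = new TreeMap<>();
        table.put("row-b", Map.of("0:CPU_PHYSICAL_CORES", "8"));
        table.put("row-a", Map.of("0:CPU_PHYSICAL_CORES", "8"));
        table.put("row-c", Map.of("0:CPU_PHYSICAL_CORES", "4"));

        // Limit only: the first two rows in key order.
        System.out.println(scan(table, cells -> true, 2)); // prints [row-a, row-b]

        // Filter plus limit: only rows whose core count is "4".
        System.out.println(
            scan(table, cells -> "4".equals(cells.get("0:CPU_PHYSICAL_CORES")), 10)); // prints [row-c]
    }
}
```

In a real cluster the filtering happens on the region servers, so treat this purely as a mental model of what LIMIT and filters do to the result set.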

The HBase Java API provides programmatic access to HBase in Java.

First, add the HBase Client to your pom.xml.

<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.2.4</version>
</dependency>

HBase Java Snippet:

import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// ...
    // Load the cluster settings; adjust the path to your hbase-site.xml.
    Configuration conf = new Configuration(true);
    conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"));
    // HBase works with byte arrays, so names are converted up front.
    byte[] family = Bytes.toBytes("COLFAM1");
    byte[] column = Bytes.toBytes("COL1");
    // try-with-resources closes the table and connection when done.
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("MyTable"))) {
        Scan scan = new Scan();
        FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        // Keep only rows where COLFAM1:COL1 equals "YES".
        SingleColumnValueFilter cvFilter1 = new SingleColumnValueFilter(
                family, column, CompareOp.EQUAL, Bytes.toBytes("YES"));
        // Skip rows that do not have the column at all.
        cvFilter1.setFilterIfMissing(true);
        list.addFilter(cvFilter1);
        scan.addColumn(family, column);
        scan.setFilter(list);
        try (ResultScanner rs = table.getScanner(scan)) {
            for (Result result : rs) {
                for (Cell cell : result.listCells()) {
                    String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
                    String value = Bytes.toString(CellUtil.cloneValue(cell));
                    System.out.printf("  Qualifier: [%s] : Value: [%s]%n", qualifier, value);
                }
            }
        }
    }

//...

That's a bit verbose and heavy to grab a few fields from a table.

Fortunately, there's another way to access HBase data: by enabling Phoenix to allow for SQL queries against your dataset. You can create tables directly in SQL DDL with Phoenix, or you can create views on top of your existing HBase tables.
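As a rough sketch of the view option, the snippet below builds a Phoenix CREATE VIEW statement for the MyTable / COLFAM1 / COL1 names used in the Java snippet above. The class name PhoenixViewSketch, the helper createViewDdl, and the pk column name are hypothetical, and the sketch assumes the stored values are plain strings so that VARCHAR is a suitable type. The names are double-quoted because Phoenix upper-cases unquoted identifiers, and the view name has to match the HBase table name.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

/** Sketch: map an existing HBase table into Phoenix with a view. */
class PhoenixViewSketch {

    /**
     * Builds a CREATE VIEW statement exposing the row key plus one column.
     * The identifiers are inserted as-is, so only pass trusted names.
     */
    static String createViewDdl(String table, String family, String qualifier) {
        return String.format(
            "CREATE VIEW \"%s\" (pk VARCHAR PRIMARY KEY, \"%s\".\"%s\" VARCHAR)",
            table, family, qualifier);
    }

    public static void main(String[] args) throws SQLException {
        String ddl = createViewDdl("MyTable", "COLFAM1", "COL1");
        System.out.println(ddl);

        // Without a JDBC URL argument, only print the statement.
        if (args.length == 0) {
            return;
        }
        // args[0] is assumed to be a Phoenix JDBC URL for your cluster.
        try (Connection con = DriverManager.getConnection(args[0]);
             Statement stmt = con.createStatement()) {
            stmt.execute(ddl);
        }
    }
}
```

Run without arguments, it only prints the DDL, which you could also paste into any Phoenix SQL client.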

Phoenix can be queried from the command line using the SQLLine command-line tool via /usr/hdp/current/phoenix-client/bin/sqlline.py server:2181:/hbase-unsecure.

0: jdbc:phoenix:server> select * from osquery limit 10;
+---------------------------------------+------------------------------------+--------------------+-------------------+---------------------+-----------------------------------+------------------+
|                 UUID                  |           COMPUTER_NAME            | CPU_LOGICAL_CORES  |     FILENAME      | CPU_PHYSICAL_CORES  |             CPU_BRAND             | PHYSICAL_MEMORY  |
+---------------------------------------+------------------------------------+--------------------+-------------------+---------------------+-----------------------------------+------------------+
| 01bf7fd8-b428-44f3-8347-0ed8fca8425a  | server                             |                    | 2356051780638113  | 8                   | Intel Xeon E312xx (Sandy Bridge)  | 15601471488      |
| 02510fb9-0445-4355-9232-c5be9ace07dc  | server                             |                    | 2359141786875592  | 8                   | Intel Xeon E312xx (Sandy Bridge)  | 15601471488      |
| 03956aca-9082-4fb5-9ab4-b4f962e537fe  | server                             |                    | 2357956780866673  | 8                   | Intel Xeon E312xx (Sandy Bridge)  | 15601471488      |
| 0566edd6-9010-493f-8bf1-57c56845378f  | server                             |                    | 2356996778377362  | 8                   | Intel Xeon E312xx (Sandy Bridge)  | 15601471488      |
| 06363d97-4785-42c6-a0c3-d3418bbb8dd7  | server                             |                    | 2358631778654353  | 8                   | Intel Xeon E312xx (Sandy Bridge)  | 15601471488      |
| 0798bcee-425c-41fb-a7c1-97cc7f72af63  | server                             |                    | 2357701779600106  | 8                   | Intel Xeon E312xx (Sandy Bridge)  | 15601471488      |
| 08118f67-c0b5-4e4b-bc58-ed66ba2fcef5  | server                             |                    | 2358826779384738  | 8                   | Intel Xeon E312xx (Sandy Bridge)  | 15601471488      |
| 08b3b700-8eba-4395-acc8-a853f613eaa4  | server                             |                    | 2358646778374063  | 8                   | Intel Xeon E312xx (Sandy Bridge)  | 15601471488      |
| 08bb9e32-b24c-4766-b542-0131f4b29d1f  | server                             |                    | 2357881777187953  | 8                   | Intel Xeon E312xx (Sandy Bridge)  | 15601471488      |
| 09151e7c-1cc7-4e4c-a26f-5d4adf651681  | server                             |                    | 2356711779234149  | 8                   | Intel Xeon E312xx (Sandy Bridge)  | 15601471488      |
+---------------------------------------+------------------------------------+--------------------+-------------------+---------------------+-----------------------------------+------------------+
10 rows selected (0.093 seconds)
0: jdbc:phoenix:server>

From Java, you call it via the JDBC driver and it works like a regular JDBC query.

package phoenixstuff;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.PreparedStatement;


/**
 * 
 * @author tspann
 *
 */
public class PhoenixQuery {


 public static void main(String[] args) throws SQLException {
  ResultSet rset = null;
  Connection con = DriverManager.getConnection("jdbc:phoenix:server:2181:/hbase-unsecure");

  PreparedStatement statement = con.prepareStatement("select * from osquery");
  rset = statement.executeQuery();
  while (rset.next()) {
   System.out.printf("COMPUTER_NAME [%s] FILENAME [%s]%n",
    rset.getString("COMPUTER_NAME"),
    rset.getString("FILENAME")
   );
  }
  statement.close();
  con.close();
 }
}
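A slightly more defensive variant of the same idea is sketched below. It assumes the OSQUERY table and column names from the query output above; the class PhoenixQuerySketch and the helpers phoenixUrl and printFilenames are hypothetical names introduced for illustration. It uses try-with-resources so the statement, result set, and connection are closed even when a query fails, and it binds the filter value as a parameter instead of concatenating it into the SQL.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Sketch: parameterized Phoenix query with automatic resource cleanup. */
class PhoenixQuerySketch {

    /** Builds a URL of the form jdbc:phoenix:<zk host>:<zk port>:<znode>. */
    static String phoenixUrl(String zkHost, int zkPort, String znode) {
        return "jdbc:phoenix:" + zkHost + ":" + zkPort + ":" + znode;
    }

    /** Prints FILENAME for every row whose COMPUTER_NAME matches. */
    static void printFilenames(Connection con, String computerName) throws SQLException {
        String sql = "SELECT FILENAME FROM OSQUERY WHERE COMPUTER_NAME = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, computerName); // bound parameter, no string concatenation
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("FILENAME [%s]%n", rs.getString("FILENAME"));
                }
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        String url = phoenixUrl("server", 2181, "/hbase-unsecure");
        System.out.println(url); // prints jdbc:phoenix:server:2181:/hbase-unsecure

        // Without a computer name argument, skip the cluster round trip.
        if (args.length == 0) {
            return;
        }
        try (Connection con = DriverManager.getConnection(url)) {
            printFilenames(con, args[0]);
        }
    }
}
```

Passing a computer name as the first argument runs the query; the Phoenix client jar still has to be on the classpath for the driver to be found.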

For Phoenix, your pom.xml needs these dependencies:

  <dependencies>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>1.2.4</version>
    </dependency>
    <dependency>
      <groupId>org.apache.phoenix</groupId>
      <artifactId>phoenix-core</artifactId>
      <version>4.4.0-HBase-1.0</version>
      <exclusions>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <exclusion>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
        </exclusion>
        <exclusion>
          <groupId>javax.servlet</groupId>
          <artifactId>servlet-api</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.mortbay.jetty</groupId>
          <artifactId>servlet-api-2.5</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.eclipse.jetty.aggregate</groupId>
          <artifactId>*</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
  </dependencies>

You can query Phoenix from most tools via its JDBC driver, including DbVisualizer, Apache NiFi/HDF (ExecuteSQL), and others.

For my Python people, you are not forgotten.

pip install phoenixdb

Here's a simple usage example. phoenixdb talks to the Phoenix Query Server, which listens on port 8765 by default:

import phoenixdb

database_url = 'http://server:8765/'
conn = phoenixdb.connect(database_url, autocommit=True)

cursor = conn.cursor()
cursor.execute("SELECT * FROM tweets")
print(cursor.fetchall())

Phoenix is also queryable via Apache Zeppelin, and it's plain old SQL.

That's a good selection for now. We'll do a deep dive into HBase and Phoenix in the next set of articles. As you can see, HBase is a nice database option, but with SQL on top, you get everything you need for big data with fast analytics, microservices access, enterprise querying, and tools.
