
Streaming ETL Lookups With Apache NiFi and Apache HBase

Learn how to stream microservices-style ETL lookups with Apache NiFi and Apache HBase — partially using a REST API called Bacon!


When we are ingesting tabular or record-oriented data, we often want to enrich it by replacing IDs with descriptions, or vice versa. Many transformations may need to happen before the data is in a usable state. When you are denormalizing your data in Hadoop and building very wide tables, you often want descriptions or other reference data included to enhance usability. A single call that returns everything you need is nice, especially when you have 100 trillion records.

We are building on a lot of existing work here. Make sure you read Abdelkrim's first three lookup articles; I added some fields to his generated data for testing.

I want to do my lookups against HBase, which is a great NoSQL store for lookup tables and generated datasets.

First, I created an HBase table to use for lookups.

Create HBase table for lookups:

create 'lookup_', 'family'
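To make the shape of that table concrete: HBase stores each row as a row key mapped to a set of `family:qualifier` columns. The sketch below is a minimal in-memory stand-in for the `lookup_` table (not real HBase client code), assuming descriptions live in a hypothetical `family:prod_desc` column.

```python
class FakeLookupTable:
    """Dict-backed mimic of HBase's row key -> {family:qualifier -> value} model.
    For illustrating the lookup logic only; a real flow would use the HBase client."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, columns):
        # columns maps 'family:qualifier' -> value, as in the HBase shell's put
        self._rows.setdefault(row_key, {}).update(columns)

    def row(self, row_key):
        # Missing rows come back empty, like an HBase get with no result
        return self._rows.get(row_key, {})

table = FakeLookupTable()
table.put("430672", {"family:prod_desc": "Pork chop leberkas brisket chuck."})
print(table.row("430672")["family:prod_desc"])
```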

Table with data:

Most people would have a pre-populated lookup table. I don't, and since we are using a generator to build the lookup IDs, I am building the lookup descriptions with a REST call at the same time. We could also have a flow that adds a lookup entry whenever one isn't found, or another flow that ingests the lookup values and adds or updates them as needed.

Here's a REST API to generate product descriptions.

I found this cool API that returns a sentence of meat words. I use this as our description because MEAT!

Call the Bacon API!
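For reference, here is a rough Python sketch of the call, assuming the public Bacon Ipsum endpoint at baconipsum.com/api (the parameter names are taken from that public API and may change):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def bacon_url(sentences: int = 1) -> str:
    """Build a request URL for the Bacon Ipsum API (endpoint and
    parameters assumed from the public docs at baconipsum.com)."""
    params = urlencode({"type": "all-meat", "sentences": sentences, "format": "text"})
    return f"https://baconipsum.com/api/?{params}"

print(bacon_url())
# Actually fetching the sentence needs network access, e.g.:
# text = urlopen(bacon_url()).read().decode("utf-8")
```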

Let's turn our plain text into a clean JSON document:
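That cleanup step can be sketched in Python as follows; the field names (`id_product`, `prod_desc`) match the records used later, but the function itself is illustrative, not part of the NiFi flow:

```python
import json

def wrap_description(raw_text: str, product_id: int) -> str:
    """Wrap a raw plain-text sentence from the API into a clean JSON
    lookup document, stripping stray whitespace/newlines."""
    doc = {
        "id_product": product_id,
        "prod_desc": raw_text.strip(),
    }
    return json.dumps(doc)

raw = "Pork chop leberkas brisket chuck, filet mignon turducken hamburger.\n"
print(wrap_description(raw, 430672))
```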

Then, I store it in HBase as my lookup table. You probably already have a lookup table; since this is a demo, I am filling mine with my generator. That is not a best practice or a good design pattern, just a lazy way to populate a table.

Example Apache NiFi flow (using Apache NiFi 1.5):

Generate some test data: 

Generate a JSON document (note the empty prod_desc):

{
"ts" : "${now():format('yyyyMMddHHmmss')}",
"updated_dt" : "${now()}",
"id_store" : ${random():mod(5):toNumber():plus(1)},
"event_type" : "generated",
"uuid" : "${UUID()}",
"hostname" : "${hostname()}",
"ip" : "${ip()}",
"counter" : "${nextInt()}",
"id_transaction" : "${random():toString()}",
"id_product" : ${random():mod(500000):toNumber()},
"value_product" : ${now():toNumber()},
"prod_desc": ""
}
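For testing outside NiFi, here is a rough Python equivalent of the GenerateFlowFile template above. The NiFi-specific functions have stand-ins: `${nextInt()}` is a per-node counter replaced by a constant, and `${ip()}` is replaced by a placeholder address.

```python
import json
import random
import socket
import uuid
from datetime import datetime, timezone

def generate_record() -> dict:
    """Build one test record with the same shape as the NiFi template."""
    now = datetime.now(timezone.utc)
    return {
        "ts": now.strftime("%Y%m%d%H%M%S"),
        "updated_dt": now.ctime(),
        "id_store": random.randrange(5) + 1,        # like ${random():mod(5):plus(1)}
        "event_type": "generated",
        "uuid": str(uuid.uuid4()),
        "hostname": socket.gethostname(),
        "ip": "127.0.0.1",                          # stand-in for NiFi's ${ip()}
        "counter": "0",                             # stand-in for ${nextInt()}
        "id_transaction": str(random.getrandbits(63)),
        "id_product": random.randrange(500000),     # like ${random():mod(500000)}
        "value_product": int(now.timestamp() * 1000),
        "prod_desc": "",                            # filled in by the lookup later
    }

print(json.dumps(generate_record()))
```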

Look up your record:

This is the magic. We take in our records; in this case, we are reading JSON records and writing JSON records, but we could choose CSV, Avro, or other formats. We connect to the HBase Record Lookup Service, replace the current prod_desc field in each record with whatever the lookup returns, and use the id_product field as the lookup key. Nothing else is needed to change records in the stream.
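In plain Python, the per-record behavior of LookupRecord looks roughly like this; the dict stands in for the HBase lookup service, and the sample description is made up for illustration:

```python
import json

# Hypothetical lookup data standing in for the HBase Record Lookup Service
lookup = {430672: "Pork chop leberkas brisket chuck, filet mignon turducken hamburger."}

def enrich(record: dict) -> dict:
    """Replace prod_desc with the description found for id_product.
    Records with no match pass through unchanged (in NiFi you could
    instead route unmatched records to a separate relationship)."""
    desc = lookup.get(record["id_product"])
    if desc is not None:
        record = {**record, "prod_desc": desc}
    return record

rec = {"id_product": 430672, "prod_desc": ""}
print(json.dumps(enrich(rec)))
```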

HBase record lookup service:

HBase client service used by HBase record lookup service:

We can use UpdateRecord to clean up, transform, or modify any field in the records in the stream.
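As a sketch of that kind of in-stream cleanup, here is a Python function applying two hypothetical UpdateRecord-style rules (upper-casing event_type and dropping empty fields); the rules themselves are examples, not from the article's flow:

```python
def update_record(record: dict) -> dict:
    """Apply example cleanup rules to one record:
    drop empty-string fields, then normalize event_type to upper case."""
    out = {k: v for k, v in record.items() if v != ""}
    if "event_type" in out:
        out["event_type"] = out["event_type"].upper()
    return out

print(update_record({"event_type": "generated", "prod_desc": ""}))
```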

Original file:

{
"ts" : "201856271804499",
"updated_dt" : "Fri Apr 27 18:56:15 UTC 2018",
"id_store" : 1,
"event_type" : "generated",
"uuid" : "0d16967d-102d-4864-b55a-3f1cb224a0a6",
"hostname" : "princeton1",
"ip" : "172.26.217.170",
"counter" : "7463",
"id_transaction" : "5307056748245491959",
"id_product" : 430672,
"value_product" : 1524855375500,
"prod_desc": ""
}

Final file:

[ {
  "ts" : "201856271804499",
  "prod_desc" : "Pork chop leberkas brisket chuck, filet mignon turducken hamburger.",
  "updated_dt" : "Fri Apr 27 18:56:15 UTC 2018",
  "id_store" : 1,
  "event_type" : "generated",
  "uuid" : "0d16967d-102d-4864-b55a-3f1cb224a0a6",
  "hostname" : "princeton1",
  "ip" : "172.26.217.170",
  "counter" : "7463",
  "id_transaction" : "5307056748245491959",
  "id_product" : 430672,
  "value_product" : 1524855375500
} ]


