Kafka in Action: 7 Steps to Real-Time Streaming From RDBMS to Hadoop

Here is an in-depth example of using Flume with Kafka to stream real-time RDBMS data into a Hive table on HDFS.

Rajesh Nadipalli · Aug. 25, 16 · Tutorial · 23.07K Views


For enterprises looking for ways to more quickly ingest data into their Hadoop data lakes, Kafka is a great option. What is Kafka? Kafka is a distributed, scalable, and reliable messaging system that integrates applications and data streams using a publish-subscribe model. It is a key component in the Hadoop technology stack for supporting real-time data analytics and the monetization of Internet of Things (IoT) data.

This article is for the technical folks. Read on and I’ll diagram how Kafka can stream data from a relational database management system (RDBMS) to Hive, which can enable a real-time analytics use case. For reference, the component versions used in this article are Hive 1.2.1, Flume 1.6 and Kafka 0.9.

If you're looking for an overview of what Kafka is and what it does, check out my earlier blog published on Datafloq.

Where Kafka Fits: The Overall Solution Architecture

The following diagram shows the overall solution architecture where transactions committed in RDBMS are passed to the target Hive tables using a combination of Kafka and Flume, as well as the Hive transactions feature.

[Diagram: transactions committed in the RDBMS flow through Kafka and Flume into transactional Hive tables]



7 Steps to Real-Time Streaming to Hadoop

Now let’s dig into the solution details and I’ll show you how you can start streaming data to Hadoop in just a few steps.

1. Extract Data From the Relational Database Management System (RDBMS)

All relational databases have a log file that records the latest transactions. The first step for our streaming solution is to obtain these transactions in a format that can be passed to Hadoop. Walking through the exact mechanisms of this extraction could take up a separate blog post — so please reach out to us if you’d like more information pertaining to that process. 
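As a rough illustration only, a simplistic alternative to reading the database log is to poll the source table incrementally on a timestamp column and emit the new rows as CSV records. The commit_time column and the bookmark value in this sketch are assumptions for illustration, not part of the sales schema used later:

-- Hypothetical incremental extract; commit_time and the bookmark value are assumed
SELECT id, name, email, street_address, company
FROM customers
WHERE commit_time > '2016-08-25 00:00:00'
ORDER BY commit_time;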

2. Set up the Kafka Producer

Processes that publish messages to a Kafka topic are called “producers.” “Topics” are feeds of messages in categories that Kafka maintains. The transactions from the RDBMS will be published as messages to a Kafka topic. For this example, let’s consider a sales team’s database whose transactions are published as a Kafka topic. The first step is to create that topic:

$ cd /usr/hdp/2.4.0.0-169/kafka
$ bin/kafka-topics.sh --create --zookeeper sandbox.hortonworks.com:2181 --replication-factor 1 --partitions 1 --topic SalesDBTransactions

Created topic "SalesDBTransactions".
$ bin/kafka-topics.sh --list --zookeeper sandbox.hortonworks.com:2181
SalesDBTransactions
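A full producer application is beyond the scope of this post, but as a minimal sketch, the extracted CSV records could simply be piped into the same console producer that appears in step 6 (the file path here is an assumption):

$ # Hypothetical: stream newly extracted CSV records into the topic
$ tail -F /home/bedrock/streamingdemo/extracted_transactions.csv | \
    bin/kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 --topic SalesDBTransactions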

3. Set up Hive

Next, we will create a Hive table that is ready to receive the sales team’s database transactions.  For this example, we will create a customer table:

[bedrock@sandbox ~]$ beeline -u jdbc:hive2:// -n hive -p hive
0: jdbc:hive2://> use raj;
create table customers (id string, name string, email string, street_address string, company string)
partitioned by (time string)
clustered by (id) into 5 buckets
stored as orc
location '/user/bedrock/salescust'
TBLPROPERTIES ('transactional'='true');

 To enable Hive to handle transactions, the following setting is required in Hive configuration:

 hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager  
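Depending on your distribution, a few companion settings are typically also needed for transactional (ACID) tables. As a sketch, they can be set for a beeline session like this (common values for this scenario, not an exhaustive list):

0: jdbc:hive2://> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
0: jdbc:hive2://> set hive.support.concurrency=true;
0: jdbc:hive2://> set hive.enforce.bucketing=true;
0: jdbc:hive2://> set hive.exec.dynamic.partition.mode=nonstrict;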

4. Set up a Flume Agent to Stream From Kafka to Hive

Now let’s look at how to create a Flume agent that will source from Kafka topics and send data to the Hive table.

Follow these steps to set up the environment before starting the Flume agent:

$ pwd
/home/bedrock/streamingdemo
$ mkdir flume/checkpoint
$ mkdir flume/data
$ chmod 777 -R flume


$ export HIVE_HOME=/usr/hdp/current/hive-server2
$ export HCAT_HOME=/usr/hdp/current/hive-webhcat
 
$ pwd
/home/bedrock/streamingdemo/flume
$ mkdir logs

Next, create a log4j properties file as follows:

[bedrock@sandbox conf]$ vi log4j.properties

flume.root.logger=INFO,LOGFILE
flume.log.dir=/home/bedrock/streamingdemo/flume/logs
flume.log.file=flume.log

Then use the following configuration file for the Flume agent:

$ vi flumetohive.conf
flumeagent1.sources = source_from_kafka
flumeagent1.channels = mem_channel
flumeagent1.sinks = hive_sink


# Define / Configure source
flumeagent1.sources.source_from_kafka.type = org.apache.flume.source.kafka.KafkaSource

flumeagent1.sources.source_from_kafka.zookeeperConnect = sandbox.hortonworks.com:2181

flumeagent1.sources.source_from_kafka.topic = SalesDBTransactions
flumeagent1.sources.source_from_kafka.groupId = flume
flumeagent1.sources.source_from_kafka.channels = mem_channel
flumeagent1.sources.source_from_kafka.interceptors = i1
flumeagent1.sources.source_from_kafka.interceptors.i1.type = timestamp
flumeagent1.sources.source_from_kafka.kafka.consumer.timeout.ms = 1000
 
# Hive Sink
flumeagent1.sinks.hive_sink.type = hive
flumeagent1.sinks.hive_sink.hive.metastore = thrift://sandbox.hortonworks.com:9083

flumeagent1.sinks.hive_sink.hive.database = raj
flumeagent1.sinks.hive_sink.hive.table = customers
flumeagent1.sinks.hive_sink.hive.txnsPerBatchAsk = 2
flumeagent1.sinks.hive_sink.hive.partition = %y-%m-%d-%H-%M
flumeagent1.sinks.hive_sink.batchSize = 10
flumeagent1.sinks.hive_sink.serializer = DELIMITED
flumeagent1.sinks.hive_sink.serializer.delimiter = ,
flumeagent1.sinks.hive_sink.serializer.fieldnames = id,name,email,street_address,company



# Use a channel which buffers events in memory
flumeagent1.channels.mem_channel.type = memory
flumeagent1.channels.mem_channel.capacity = 10000
flumeagent1.channels.mem_channel.transactionCapacity = 100


# Bind the source and sink to the channel
flumeagent1.sources.source_from_kafka.channels = mem_channel
flumeagent1.sinks.hive_sink.channel = mem_channel
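One note on the channel: the memory channel above is fast but loses buffered events if the agent restarts. The flume/checkpoint and flume/data directories created earlier are only used if you swap in a durable file channel; a sketch of that alternative:

# Alternative: durable file channel (uses the directories created in the setup above)
flumeagent1.channels.mem_channel.type = file
flumeagent1.channels.mem_channel.checkpointDir = /home/bedrock/streamingdemo/flume/checkpoint
flumeagent1.channels.mem_channel.dataDirs = /home/bedrock/streamingdemo/flume/data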

5. Start the Flume Agent

Use the following command to start the Flume agent:

$ /usr/hdp/apache-flume-1.6.0/bin/flume-ng agent -n flumeagent1 -f ~/streamingdemo/flume/conf/flumetohive.conf

6. Start the Kafka Stream

As an example, below is a simulation of the transaction messages, which in a real system would be generated by the source database. For instance, these records could come from Oracle Streams replaying the SQL transactions committed to the database, or from GoldenGate.

$ cd /usr/hdp/2.4.0.0-169/kafka
$ bin/kafka-console-producer.sh --broker-list sandbox.hortonworks.com:6667 --topic SalesDBTransactions
1,"Nero Morris","porttitor.interdum@Sedcongue.edu","P.O. Box 871, 5313 Quis Ave","Sodales Company"
2,"Cody Bond","ante.lectus.convallis@antebibendumullamcorper.ca","232-513 Molestie Road","Aenean Eget Magna Incorporated"
3,"Holmes Cannon","a@metusAliquam.edu","P.O. Box 726, 7682 Bibendum Rd.","Velit Cras LLP"
4,"Alexander Lewis","risus@urna.edu","Ap #375-9675 Lacus Av.","Ut Aliquam Iaculis Inc."
5,"Gavin Ortiz","sit.amet@aliquameu.net","Ap #453-1440 Urna. St.","Libero Nec Ltd"
6,"Ralph Fleming","sociis.natoque.penatibus@quismassaMauris.edu","363-6976 Lacus. St.","Quisque Fringilla PC"
7,"Merrill Norton","at.sem@elementum.net","P.O. Box 452, 6951 Egestas. St.","Nec Metus Institute"
8,"Nathaniel Carrillo","eget@massa.co.uk","Ap #438-604 Tellus St.","Blandit Viverra Corporation"
9,"Warren Valenzuela","tempus.scelerisque.lorem@ornare.co.uk","Ap #590-320 Nulla Av.","Ligula Aliquam Erat Incorporated"
10,"Donovan Hill","facilisi@augue.org","979-6729 Donec Road","Turpis In Condimentum Associates"
11,"Kamal Matthews","augue.ut@necleoMorbi.org","Ap #530-8214 Convallis, St.","Tristique Senectus Et Foundation"


7. Receive Hive Data

With all of the above accomplished, when you send data through Kafka, you will see the stream of data arriving in the Hive table within seconds.
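To verify, query the table from beeline; the exact partition values will depend on when the events arrive:

0: jdbc:hive2://> use raj;
0: jdbc:hive2://> select * from customers limit 10;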


Published at DZone with permission of Rajesh Nadipalli, DZone MVB. See the original article here.

