
Import and Ingest Data Into HDFS Using Kafka in StreamSets


Learn about reading data from different data sources such as Amazon Simple Storage Service (S3) and flat files, and writing the data into HDFS using Kafka in StreamSets.


StreamSets provides state-of-the-art data ingestion to easily and continuously ingest data from origins such as relational databases, flat files, and AWS, and to write it to systems such as HDFS, HBase, and Solr. Its configuration-driven UI helps you design pipelines for data ingestion in minutes. Data is routed, transformed, and enriched during ingestion and made ready for consumption and delivery to downstream systems.

Kafka, used as an intermediate data store, makes it easy to replay ingestion, consume datasets across multiple applications, and perform data analysis.

In this blog post, let's discuss reading data from different data sources such as Amazon Simple Storage Service (S3) and flat files, and writing it into HDFS using Kafka in StreamSets.

Prerequisites

  • Install Java 1.8
  • Install streamsets-datacollector-2.6.0.1

Use Case

Import and ingest data from different data sources into HDFS using Kafka in StreamSets.

Data Description

Network data from outdoor field sensors is used as the source file. Additional fields, dummy data, empty data, and duplicate data were added to the source file. The dataset has a total record count of 600K, including 3.5K duplicate records.

Sample data:

{"ambient_temperature":"16.70","datetime":"Wed Aug 30 18:42:45 IST 
2017","humidity":"76.4517","lat":36.17,"lng":-
119.7462,"photo_sensor":"1003.3","radiation_level":"201","sensor_id":"c6698873b4f14b995c9e66ad0d8f29e3","
sensor_name":"California","sensor_uuid":"probe-2a2515fc","timestamp":1504098765}

Synopsis

  • Read data from the local file system and produce data to Kafka.
  • Read data from Amazon S3 and produce data to Kafka.
  • Consume streaming data produced by Kafka.
  • Remove duplicate records.
  • Persist data into HDFS.
  • View data loading statistics.

Reading Data From Local File System and Producing Data to Kafka

To read data from the local file system, perform the following:

  • Create a new pipeline.
  • Configure the File Directory origin to read files from a directory.
  • Set Data Format as JSON and JSON content as Multiple JSON objects.
  • Use Kafka Producer processor to produce data into Kafka. (Note: If there are no Kafka processors, install the Apache Kafka package and restart SDC.)
  • Produce the data under topic sensor_data.
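Under the hood, this pipeline simply reads JSON objects line by line from the files in the directory and publishes them to the sensor_data topic. A rough standalone sketch of the same idea, assuming the kafka-python client, a broker at localhost:9092, and a placeholder input directory:

import json
import os

from kafka import KafkaProducer  # assumes the kafka-python package

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

input_dir = "/data/sensor_files"  # placeholder for the directory the origin reads

for name in os.listdir(input_dir):
    with open(os.path.join(input_dir, name)) as f:
        # "Multiple JSON objects" format: one JSON object per line
        for line in f:
            if line.strip():
                producer.send("sensor_data", json.loads(line))

producer.flush()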

[Screenshots: reading data from the local file system]

Reading Data From Amazon S3 and Producing Data to Kafka

To read data from Amazon S3 and produce data into Kafka, perform the following:

  • Create another pipeline.
  • Use Amazon S3 origin processor to read data from S3. (Note: If there are no Amazon S3 processors, install the Amazon Web Services 1.11.123 package available under Package Manager.)
  • Configure the processor by providing the Access Key ID, Secret Access Key, Region, and Bucket name.
  • Set the data format as JSON.
  • Produce data under the same Kafka topic, sensor_data.
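The equivalent logic outside StreamSets would list the objects in the bucket, read them, and publish the records to the same topic. A minimal sketch, assuming boto3 and kafka-python, with placeholder bucket, prefix, and broker values:

import json

import boto3  # assumed AWS SDK for Python
from kafka import KafkaProducer

s3 = boto3.client("s3", region_name="us-east-1")  # credentials come from the AWS config
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

bucket = "sensor-data-bucket"  # placeholder bucket name

for obj in s3.list_objects_v2(Bucket=bucket, Prefix="sensors/").get("Contents", []):
    body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read().decode("utf-8")
    for line in body.splitlines():  # one JSON object per line
        if line.strip():
            producer.send("sensor_data", json.loads(line))

producer.flush()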

[Screenshots: reading data from Amazon S3]

Consuming Streaming Data Produced by Kafka

To consume streaming data produced by Kafka, perform the following:

  • Create a new pipeline.
  • Use Kafka Consumer origin to consume Kafka produced data.
  • Configure the processor by providing the following details:
    • Broker URI
    • ZooKeeper URI
    • Topic: Set the topic name as sensor_data (same data produced in previous sections)
  • Set the data format as JSON.
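For comparison, a bare-bones consumer doing the same job with the kafka-python client (the broker address and consumer group are placeholders; this sketch connects to the broker directly rather than via ZooKeeper):

import json

from kafka import KafkaConsumer  # assumes the kafka-python package

consumer = KafkaConsumer(
    "sensor_data",
    bootstrap_servers="localhost:9092",  # assumed broker URI
    group_id="sensor-pipeline",          # placeholder consumer group
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    record = message.value  # a dict such as the sample record shown earlier
    print(record["sensor_id"], record["ambient_temperature"])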

[Screenshot: consuming streaming data produced by Kafka]

Removing Duplicate Records

To remove duplicate records using the Record Deduplicator processor, perform the following:

  • Under the Deduplication tab, provide the following fields to compare and find duplicates:
    • Max. Records to Compare
    • Time to Compare
    • Compare
    • Fields to Compare (for example, find duplicates based on sensor_id and sensor_uuid)
  • Move the duplicate records to Trash.
  • Store the unique records in HDFS.
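Conceptually, the deduplication rule amounts to keeping the first record seen for each (sensor_id, sensor_uuid) pair and discarding the rest, as this small Python sketch illustrates (the Record Deduplicator additionally bounds the comparison by record count and time, which is not shown here):

def deduplicate(records):
    """Yield only the first record seen for each (sensor_id, sensor_uuid) pair."""
    seen = set()
    for record in records:
        key = (record.get("sensor_id"), record.get("sensor_uuid"))
        if key in seen:
            continue  # duplicate -> would be routed to Trash
        seen.add(key)
        yield record  # unique -> would be routed to HDFS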

[Screenshot: removing duplicate records]

Persisting Data into HDFS

To load data into HDFS, perform the following:

  • Configure the Hadoop FS destination processor from stage library HDP 2.6.
  • Select the data format as JSON. (Note: The core-site.xml and hdfs-site.xml files are placed in the Hadoop configuration directory, /var/lib/sdc-resources/hadoop-conf. The sdc-resources directory is created while installing StreamSets.)
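As a rough illustration of what the destination does, the following sketch writes the unique records to HDFS over WebHDFS using the Python hdfs client; the NameNode URL, user, and output path are placeholders:

import json

from hdfs import InsecureClient  # assumes the Python hdfs (WebHDFS) client

client = InsecureClient("http://namenode:50070", user="hdfs")  # placeholder NameNode URL and user

def persist(records, path="/user/hdfs/sensor_data/part-0.json"):  # placeholder output path
    with client.write(path, encoding="utf-8", overwrite=True) as writer:
        for record in records:
            writer.write(json.dumps(record) + "\n")  # one JSON object per line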

[Screenshot: persisting data into HDFS]

Viewing Data Loading Statistics

Data loading statistics, after removing duplicates from the different sources, look as follows:

[Screenshots: data loading statistics]


