
Import and Ingest Data Into HDFS Using Kafka in StreamSets

Learn about reading data from different data sources such as Amazon Simple Storage Service (S3) and flat files, and writing the data into HDFS using Kafka in StreamSets.

By Rathnadevi Manivannan · Oct. 26, 2017 · Tutorial

StreamSets provides state-of-the-art data ingestion, making it easy to continuously pull data from a variety of origins such as relational databases, flat files, and AWS, and write it to systems such as HDFS, HBase, and Solr. Its configuration-driven UI helps you design data ingestion pipelines in minutes. Data is routed, transformed, and enriched during ingestion and made ready for consumption and delivery to downstream systems.

Kafka, acting as an intermediate data store, makes it easy to replay ingestion, consume the same datasets across multiple applications, and perform data analysis.

In this blog, let's discuss reading data from different data sources such as Amazon Simple Storage Service (S3) and flat files, and writing it into HDFS using Kafka in StreamSets.

Prerequisites

  • Install Java 1.8
  • Install streamsets-datacollector-2.6.0.1

Use Case

Import and ingest data from different data sources into HDFS using Kafka in StreamSets.

Data Description

Network data from outdoor field sensors is used as the source file. Additional fields, dummy data, empty data, and duplicate data were added to it. The dataset has a total record count of 600K, including 3.5K duplicate records.

Sample data:

{"ambient_temperature":"16.70","datetime":"Wed Aug 30 18:42:45 IST 
2017","humidity":"76.4517","lat":36.17,"lng":-
119.7462,"photo_sensor":"1003.3","radiation_level":"201","sensor_id":"c6698873b4f14b995c9e66ad0d8f29e3","
sensor_name":"California","sensor_uuid":"probe-2a2515fc","timestamp":1504098765}

Synopsis

  • Read data from the local file system and produce data to Kafka.
  • Read data from Amazon S3 and produce data to Kafka.
  • Consume streaming data produced by Kafka.
  • Remove duplicate records.
  • Persist data into HDFS.
  • View data loading statistics.

Reading Data From Local File System and Producing Data to Kafka

To read data from the local file system, perform the following (a standalone Java sketch of this step follows the screenshot):

  • Create a new pipeline.
  • Configure the File Directory origin to read files from a directory.
  • Set Data Format as JSON and JSON content as Multiple JSON objects.
  • Use the Kafka Producer processor to produce data into Kafka. (Note: If there are no Kafka processors, install the Apache Kafka package and restart SDC.)
  • Produce the data under topic sensor_data.

[Screenshots: reading data from the local file system]
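The pipeline above is configured entirely in the StreamSets UI, so no code is required. For readers who want a feel for what this step does, here is a minimal Java sketch of a rough equivalent using the standard Kafka producer client; the broker address and input file path are placeholders, not values from the article.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Properties;

public class LocalFileToKafka {
    public static void main(String[] args) throws Exception {
        // Placeholder broker address; adjust to your environment.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Each line of the source file is one JSON object ("Multiple JSON objects" format).
        List<String> lines = Files.readAllLines(Paths.get("/data/sensors/sensor_data.json"));

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (String json : lines) {
                producer.send(new ProducerRecord<>("sensor_data", json));
            }
            producer.flush();
        }
    }
}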

Reading Data From Amazon S3 and Producing Data to Kafka

To read data from Amazon S3 and produce data into Kafka, perform the following (see the sketch after the screenshot for a standalone equivalent):

  • Create another pipeline.
  • Use the Amazon S3 origin processor to read data from S3. (Note: If there are no Amazon S3 processors, install the Amazon Web Services 1.11.123 package available under Package Manager.)
  • Configure the processor by providing the Access Key ID, Secret Access Key, Region, and Bucket name.
  • Set the data format as JSON.
  • Produce data under the same Kafka topic, sensor_data.

[Screenshots: reading data from Amazon S3]
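Again, the StreamSets pipeline needs no code. As a rough standalone equivalent, the following sketch reads an object from S3 with the AWS SDK for Java (the same 1.11.x line as the package mentioned above) and produces each JSON line to the sensor_data topic. The bucket, object key, region, credentials, and broker address are placeholder values.

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class S3ToKafka {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials, region, bucket, and object key.
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion("us-east-1")
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials("ACCESS_KEY_ID", "SECRET_ACCESS_KEY")))
                .build();

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     s3.getObject("sensor-bucket", "sensor_data.json").getObjectContent(),
                     StandardCharsets.UTF_8))) {
            String json;
            // Produce each JSON object to the same sensor_data topic used by the first pipeline.
            while ((json = reader.readLine()) != null) {
                producer.send(new ProducerRecord<>("sensor_data", json));
            }
        }
    }
}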

Consuming Streaming Data Produced by Kafka

To consume streaming data produced by Kafka, perform the following (a standalone consumer sketch follows the screenshot):

  • Create a new pipeline.
  • Use the Kafka Consumer origin to consume the data produced to Kafka.
  • Configure the processor by providing the following details:
    • Broker URI
    • ZooKeeper URI
    • Topic: Set the topic name as sensor_data (same data produced in previous sections)
  • Set the data format as JSON.

[Screenshot: consuming streaming data produced by Kafka]
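For reference, a minimal standalone consumer for the same topic might look like the sketch below, using the standard Kafka consumer client; the broker address and consumer group name are placeholders, not values from the article.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SensorDataConsumer {
    public static void main(String[] args) {
        // Placeholder broker address and consumer group; adjust to your environment.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "sensor-data-readers");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor_data"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is one JSON sensor reading produced by the pipelines above.
                    System.out.println(record.value());
                }
            }
        }
    }
}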

Removing Duplicate Records

To remove duplicate records using the Record Deduplicator processor, perform the following (a sketch of the underlying idea follows the screenshot):

  • Under the Deduplication tab, provide the following fields to compare and find duplicates:
    • Max. Records to Compare
    • Time to Compare
    • Compare
    • Fields to Compare (for example, find duplicates based on sensor_id and sensor_uuid)
  • Move the duplicate records to Trash.
  • Store the unique records in HDFS.

[Screenshot: removing duplicate records]
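The Record Deduplicator is configured purely in the UI; the sketch below only illustrates the underlying idea of treating a record as a duplicate when the selected fields (here sensor_id and sensor_uuid) repeat. It assumes Jackson for JSON parsing and ignores the "Max. Records to Compare" and "Time to Compare" bounds that the real processor applies.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.HashSet;
import java.util.Set;

public class SensorDeduplicator {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Keys already seen; the real processor bounds this set by record count and time.
    private final Set<String> seenKeys = new HashSet<>();

    // Returns true the first time a (sensor_id, sensor_uuid) pair is seen, false for duplicates.
    public boolean isFirstOccurrence(String json) throws Exception {
        JsonNode record = MAPPER.readTree(json);
        String key = record.path("sensor_id").asText() + "|" + record.path("sensor_uuid").asText();
        return seenKeys.add(key);
    }
}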

Persisting Data into HDFS

To load data into HDFS, perform the following (a standalone sketch follows the screenshot):

  • Configure the Hadoop FS destination processor from the HDP 2.6 stage library.
  • Select the data format as JSON. (Note: The core-site.xml and hdfs-site.xml files are placed in the Hadoop configuration directory, /var/lib/sdc-resources/hadoop-conf. The sdc-resources directory is created when StreamSets is installed.)

[Screenshot: persisting data into HDFS]
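The Hadoop FS destination handles the HDFS write itself. As a rough illustration of what that step amounts to, the sketch below uses the plain Hadoop FileSystem API together with the same core-site.xml and hdfs-site.xml from the sdc-resources directory; the output path and sample record are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class HdfsJsonWriter {
    public static void main(String[] args) throws Exception {
        // Reuse the cluster configuration that StreamSets reads from the resources directory.
        Configuration conf = new Configuration();
        conf.addResource(new Path("/var/lib/sdc-resources/hadoop-conf/core-site.xml"));
        conf.addResource(new Path("/var/lib/sdc-resources/hadoop-conf/hdfs-site.xml"));

        // Placeholder output path and sample record.
        List<String> records = Arrays.asList(
                "{\"sensor_id\":\"c6698873b4f14b995c9e66ad0d8f29e3\",\"sensor_name\":\"California\"}");

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/user/sdc/sensor_data/part-0000.json"))) {
            for (String json : records) {
                out.write((json + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}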

Viewing Data Loading Statistics

Data loading statistics, after removing duplicates from the different sources, look as follows:

[Screenshots: viewing data loading statistics]

References

  • Data Quality Checks With StreamSets Using Drift Rules
  • Sample Dataset in GitHub
  • Amazon S3
  • Kafka Producer
  • Kafka Consumer
  • Hadoop FS

Published at DZone with permission of Rathnadevi Manivannan. See the original article here.

Opinions expressed by DZone contributors are their own.
