Data Ingestion with Flume Sending Events to Kafka

The author, Rafael Salerno, takes us step-by-step through this process.

By Rafael Salerno · Aug. 29, 16 · Tutorial

What Is Data Ingestion?

Data ingestion is the process of getting data from a source, such as application logs or a database, either for immediate use or for storage.


This data can be sent in real time or in batches.

With real-time ingestion, records are sent one by one directly from the data source as they are produced; in batch mode, data is collected and shipped in groups at pre-defined intervals.
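As a rough shell sketch of the difference (the paths and the interval here are only illustrative and are not part of the setup described later):

# streaming ("real-time") ingestion: forward each log line as soon as it is written
tail -F /var/log/app.log | while read -r line; do
  echo "$line"    # in practice, hand the record to the ingestion pipeline here
done

# batch ingestion: ship everything accumulated so far at a pre-defined interval
while true; do
  cp /var/log/app.log "/tmp/batch_$(date +%s).log"
  sleep 300       # e.g., every five minutes
done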

Talk of data ingestion usually comes up in the context of Big Data, where a large volume of data brings a few familiar concerns to mind: volume, variety, velocity, and veracity.

For this data ingestion work, a consolidated and widely used tool is Apache Flume.

Apache Flume 

Apache Flume is an open-source, reliable, highly available tool for collecting, aggregating, and moving large amounts of data from many different sources into centralized data storage. It is based on streaming data flows and is robust and fault-tolerant, with built-in reliability mechanisms.

A Flume event is defined as a unit of data flow (a byte payload plus optional headers). A Flume agent is a JVM process that hosts the components through which events flow from an external source to the next destination.

The idea in this post is to implement the data flow model below, focusing on the settings needed to make this flow happen and abstracting away some Flume concepts that can be looked up in the documentation if necessary.

(Data flow: application log file → exec source → memory channel → Kafka sink → Kafka topic sample_topic)


Tools Required to Test the Sample:

  • Apache Flume
  • Apache Kafka

Configuration File:

After downloading Apache Flume, create a configuration file named "flume-sample.conf" in the conf/ folder.

For better understanding, this file can basically be divided into six parts:

1. Agent name:

a1.sources = r1
a1.sinks = sample 
a1.channels = sample-channel

2. Source configuration:

a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /log-sample/my_log_file.log 
a1.sources.r1.logStdErr = true
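
One practical note: the exec source simply runs the tail -f command above, so the tailed file must already exist when the agent starts. Assuming the sample path used in this configuration (adjust it to your environment):

# create the directory and log file that the exec source will tail
sudo mkdir -p /log-sample
sudo touch /log-sample/my_log_file.log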

3. Sink type (a logger sink, useful for testing; part 6 below overrides it with the Kafka sink):

a1.sinks.sample.type = logger

4. Channel that buffers events in memory:

a1.channels.sample-channel.type = memory
# maximum number of events held in the channel
a1.channels.sample-channel.capacity = 1000
# maximum number of events taken from a source or given to a sink per transaction
a1.channels.sample-channel.transactionCapacity = 100

5. Bind the source to the channel (the sink is bound to its channel in part 6):

# replicating is the default channel selector
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = sample-channel

6. Kafka-related settings: the Kafka sink type, topic, broker list, and the channel the sink reads from:

a1.sinks.sample.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.sample.topic = sample_topic
a1.sinks.sample.brokerList = 127.0.0.1:9092
# 1 = only the partition leader must acknowledge each write
a1.sinks.sample.requiredAcks = 1
# number of events processed in one batch before being sent to Kafka
a1.sinks.sample.batchSize = 20
a1.sinks.sample.channel = sample-channel

The final result of the file is shown below:

a1.sources = r1
a1.sinks = sample
a1.channels = sample-channel

a1.sources.r1.type = exec
a1.sources.r1.command = tail -f /log-sample/my_log_file.log
a1.sources.r1.logStdErr = true

a1.sinks.sample.type = logger

a1.channels.sample-channel.type = memory
a1.channels.sample-channel.capacity = 1000
a1.channels.sample-channel.transactionCapacity = 100

a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = sample-channel

a1.sinks.sample.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.sample.topic = sample_topic
a1.sinks.sample.brokerList = 127.0.0.1:9092
a1.sinks.sample.requiredAcks = 1
a1.sinks.sample.batchSize = 20
a1.sinks.sample.channel = sample-channel

To run Apache Flume with this configuration, execute the following command from within the Flume folder:

sh bin/flume-ng agent --conf conf --conf-file conf/flume-sample.conf  -Dflume.root.logger=DEBUG,console --name a1 -Xmx512m -Xms256m

Where:

  • --conf-file indicates the location of the configuration file that was created (conf/flume-sample.conf).
  • --name is the agent name, which here is a1.
  • -Dflume.root.logger defines how logging appears on the console (DEBUG level, console appender).
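
As an alternative to passing -Xmx/-Xms on the command line, the JVM options can be set in conf/flume-env.sh (Flume ships a flume-env.sh.template you can copy), which the flume-ng script picks up from the --conf directory. For example:

# conf/flume-env.sh
export JAVA_OPTS="-Xms256m -Xmx512m"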

Before that, it is necessary to bring up Apache Kafka (Kafka concepts are not covered in this post; the focus here is only on data ingestion with Flume).

After downloading Kafka and keeping the default settings, you can see the flow working.

Execute the following commands: 

  1. START ZooKeeper -> sudo bin/zookeeper-server-start.sh config/zookeeper.properties &
  2. START Kafka server -> sudo bin/kafka-server-start.sh config/server.properties &
  3. CREATE TOPIC -> bin/kafka-topics.sh --zookeeper 127.0.0.1:2181 --create --replication-factor 1 --partitions 1 --topic sample_topic
  4. START CONSUMER -> bin/kafka-console-consumer.sh --zookeeper 127.0.0.1:2181 --topic sample_topic --from-beginning
    The consumer should show the data being extracted from the source (my_log_file.log) and sent to the sample_topic topic (see the quick check below).
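
To see the flow end to end, append lines to the tailed log file from another terminal and watch them appear in the console consumer started in step 4. A minimal check, assuming the sample path from the Flume configuration:

# each appended line should show up in the kafka-console-consumer
echo "first test event $(date)"  >> /log-sample/my_log_file.log
echo "second test event $(date)" >> /log-sample/my_log_file.log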

With that, the proposed data flow is complete and is providing data to other consumers in real time.

Opinions expressed by DZone contributors are their own.
