Migrating Apache Flume Flows to Apache NiFi: Kafka Source to Multiple Sinks

How to move off legacy Flume and onto modern Apache NiFi for data pipelines.

By Tim Spann · Oct. 15, 19 · Tutorial

The world of streaming is constantly moving... yes, I said it. Every few years, some projects get favored by the community and by developers. Apache NiFi has stepped ahead and become the go-to for quickly ingesting data from sources and storing it to sinks, with routing, aggregation, basic ETL/ELT, and security. I am recommending a migration from legacy Flume to Apache NiFi. The time is now.

Below, I walk you through a common use case. It's easy to integrate Kafka as a source or sink with Apache NiFi or MiNiFi agents, and we can add HDFS or Kudu sinks as well. All of this comes with full security, SSO, governance, cloud and Kubernetes support, schema support, full data lineage, and an easy-to-use UI. Don't get fluming mad; let's try another great Apache project.

As a first step, you can start by migrating Flume sources and sinks to NiFi.

Source Code: https://github.com/tspannhw/flume-to-nifi

Big Data Architecture

Consume/Publish Kafka And Store to Files, HDFS, Hive 3.1, Kudu

Consuming/Publishing data with Kafka

Consume Kafka Flow

Consuming data with Kafka

Merge Records And Store As AVRO or ORC

Merging and storing records

Consume Kafka, Update Records via Machine Learning Models In CDSW And Store to Kudu

Machine learning workflow

Source: Apache Kafka Topics

Apache Kafka Topics

You enter a few parameters and start ingesting data with or without schemas. Apache Flume had no schema support, and it did not support transactions.

Property and values
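To make "a few parameters" concrete, here is a minimal sketch of how the Kafka source settings in an existing Flume agent file could be carried over to a NiFi Kafka consumer processor. The agent name `agent1`, the source name `kafkaSrc`, and the sample values are hypothetical, and the NiFi property display names in the mapping are assumptions that should be checked against the processor version you run. It is a starting point for a migration script, not a complete converter.

```python
# Sketch: translate the Kafka source settings of a Flume agent config into
# the property names of a NiFi Kafka consumer processor.
# The agent/source names and all sample values below are hypothetical.

FLUME_CONF = """
agent1.sources = kafkaSrc
agent1.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
agent1.sources.kafkaSrc.kafka.bootstrap.servers = broker1:9092,broker2:9092
agent1.sources.kafkaSrc.kafka.topics = iotsensors
agent1.sources.kafkaSrc.kafka.consumer.group.id = flume-migration
"""

# Flume Kafka source key -> NiFi processor property.
# The NiFi display names are assumptions; verify them in your NiFi UI.
FLUME_TO_NIFI = {
    "kafka.bootstrap.servers": "Kafka Brokers",
    "kafka.topics": "Topic Name(s)",
    "kafka.consumer.group.id": "Group ID",
}


def parse_properties(text):
    """Parse simple key = value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def flume_source_to_nifi(props, agent, source):
    """Return (mapped NiFi properties, Flume settings with no known mapping)."""
    prefix = f"{agent}.sources.{source}."
    nifi, unmapped = {}, {}
    for key, value in props.items():
        if not key.startswith(prefix):
            continue
        short = key[len(prefix):]
        if short == "type":
            continue  # the source class itself has no NiFi property equivalent
        if short in FLUME_TO_NIFI:
            nifi[FLUME_TO_NIFI[short]] = value
        else:
            unmapped[short] = value
    return nifi, unmapped


if __name__ == "__main__":
    mapped, leftover = flume_source_to_nifi(
        parse_properties(FLUME_CONF), "agent1", "kafkaSrc"
    )
    for name, value in mapped.items():
        print(f"{name}: {value}")
    print("Needs manual review:", leftover)
```

Anything that lands in the second return value has no entry in the mapping table and needs a manual decision during the migration.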

Sink: Files

Files in Sink

Output

Storing to files in file systems, object stores, SFTP servers, or elsewhere could not be easier. Choose S3, a local file system, SFTP, HDFS, or wherever you need.

Sink: Apache Kudu/Apache Impala

Kudu/Apache Impala

Storing to Kudu/Impala (or Parquet, for that matter) could not be easier with Apache NiFi.

Storing to Kudu/Impala

Sink: HDFS for Apache ORC Files

When this is completed, the ConvertAvroToORC and PutHDFS processors build the Hive DDL for you! You can build the tables automagically with Apache NiFi if you wish.

CREATE EXTERNAL TABLE IF NOT EXISTS iotsensors
(sensor_id BIGINT, sensor_ts BIGINT, is_healthy STRING, response STRING, sensor_0 BIGINT, sensor_1 BIGINT,
sensor_2 BIGINT, sensor_3 BIGINT, sensor_4 BIGINT, sensor_5 BIGINT, sensor_6 BIGINT, sensor_7 BIGINT, sensor_8 BIGINT,
sensor_9 BIGINT, sensor_10 BIGINT, sensor_11 BIGINT)
STORED AS ORC
LOCATION '/tmp/iotsensors'
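To show why DDL like the statement above can be generated rather than hand-written, here is a rough sketch that derives it from an Avro-style record schema. This is not NiFi's implementation. The schema is a hypothetical reconstruction from a few of the column names above, and the type table covers only a handful of primitive Avro types.

```python
# Sketch: derive a Hive external-table DDL statement from an Avro-style schema.
# The schema below is hypothetical and shortened for illustration.

SCHEMA = {
    "type": "record",
    "name": "iotsensors",
    "fields": [
        {"name": "sensor_id", "type": "long"},
        {"name": "sensor_ts", "type": "long"},
        {"name": "is_healthy", "type": "string"},
        {"name": "response", "type": "string"},
        {"name": "sensor_0", "type": ["null", "long"]},  # nullable field
    ],
}

# Only primitive types are handled in this sketch.
AVRO_TO_HIVE = {
    "long": "BIGINT",
    "int": "INT",
    "string": "STRING",
    "double": "DOUBLE",
    "float": "FLOAT",
    "boolean": "BOOLEAN",
}


def hive_type(avro_type):
    """Map a primitive Avro type, or a nullable union of one, to a Hive type."""
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        if len(non_null) != 1:
            raise ValueError(f"unsupported union: {avro_type}")
        avro_type = non_null[0]
    if avro_type not in AVRO_TO_HIVE:
        raise ValueError(f"unsupported Avro type: {avro_type}")
    return AVRO_TO_HIVE[avro_type]


def build_ddl(schema, location):
    """Build a CREATE EXTERNAL TABLE statement for ORC files at `location`."""
    columns = ", ".join(
        f"{field['name']} {hive_type(field['type'])}" for field in schema["fields"]
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {schema['name']} "
        f"({columns}) STORED AS ORC LOCATION '{location}'"
    )


if __name__ == "__main__":
    print(build_ddl(SCHEMA, "/tmp/iotsensors"))
```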










Sink: Kafka

Publishing to Kafka is just as easy! Push records with schema references or raw data. AVRO or JSON, whatever makes sense for your enterprise.

Write data easily with no coding, and with no changes or redeploys when a schema or schema version changes.

Checking schema or schema version changes

Pick a Topic and Stream Data While Converting Types

Pick a Topic and Stream data while converting types
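Conceptually, converting types while streaming means reading each record, coercing its fields to the types a schema declares, and writing the result onward. The sketch below illustrates that idea in plain Python. It is not NiFi code, and the field names and target types are assumptions borrowed from the sensor table used earlier in this article.

```python
# Sketch: schema-driven type conversion over a stream of JSON records.
# Field names and target types are assumptions for illustration only.
import json

# Target type for each field we want to keep.
SCHEMA = {
    "sensor_id": int,
    "sensor_ts": int,
    "is_healthy": str,
    "sensor_0": int,
}


def convert_record(raw, schema):
    """Coerce fields to the schema's types; drop unknown fields, keep gaps as None."""
    converted = {}
    for name, caster in schema.items():
        value = raw.get(name)
        converted[name] = None if value is None else caster(value)
    return converted


def convert_stream(lines, schema):
    """Lazily convert an iterable of JSON lines, one record at a time."""
    for line in lines:
        yield convert_record(json.loads(line), schema)


if __name__ == "__main__":
    incoming = [
        '{"sensor_id": "101", "sensor_ts": "1571150000", "is_healthy": "true", "sensor_0": "42"}',
        '{"sensor_id": "102", "sensor_ts": "1571150001", "is_healthy": "false", "debug": "x"}',
    ]
    for record in convert_stream(incoming, SCHEMA):
        print(record)
```

Because the conversion is driven entirely by the schema table, a schema change means editing data rather than code, which is the same property the article attributes to schema-aware flows.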

Clean UI and REST API to Manage, Monitor, Configure and Notify on Kafka

Alerts overview

Data explorer

Topic list

Other Reasons to Use Apache NiFi Over Apache Flume

DevOps with REST API, CLI, Python API

https://community.cloudera.com/t5/Community-Articles/More-DevOps-for-HDF-Apache-NiFi-Registry-and-Friends/ta-p/248668.
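As a small illustration of scripting against the REST API, the sketch below polls a flow status endpoint and summarizes the queue. It assumes an unsecured NiFi instance at a hypothetical base URL such as `http://localhost:8080`. The endpoint path and JSON field names are given as I recall them from the NiFi REST API and should be verified against your NiFi version, and a secured cluster would also need authentication, which is omitted here. The sample payload is invented for illustration.

```python
# Sketch: read overall flow status from the NiFi REST API.
# Base URL, endpoint path, and field names are assumptions to verify.
import json
from urllib.request import urlopen


def flow_status_url(base_url):
    """Build the status endpoint URL from a NiFi base URL."""
    return base_url.rstrip("/") + "/nifi-api/flow/status"


def summarize_status(payload):
    """Pick a few queue and thread counters out of the status response."""
    status = payload["controllerStatus"]
    return {
        "active_threads": status["activeThreadCount"],
        "flowfiles_queued": status["flowFilesQueued"],
        "bytes_queued": status["bytesQueued"],
    }


def fetch_status(base_url, timeout=10):
    """Call the endpoint and return the summarized counters."""
    with urlopen(flow_status_url(base_url), timeout=timeout) as response:
        return summarize_status(json.load(response))


# Invented sample response, used so the parsing can be tried without a server.
SAMPLE = {
    "controllerStatus": {
        "activeThreadCount": 2,
        "flowFilesQueued": 15,
        "bytesQueued": 20480,
    }
}

if __name__ == "__main__":
    print(summarize_status(SAMPLE))
    # Against a live instance (hypothetical address):
    # print(fetch_status("http://localhost:8080"))
```

A check like this can run from a CI job or a monitoring script after a flow is deployed.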

Schemas! We work not only with semi-structured, structured, and unstructured data; we are also schema and schema version aware for CSV, JSON, AVRO, XML, Grokked text files, and more. https://community.cloudera.com/t5/Community-Articles/Big-Data-DevOps-Apache-NiFi-HWX-Schema-Registry-Schema/ta-p/247963.

Flume Replacement Use Cases Implemented in Apache NiFi


Sink/Source: JMS

https://community.cloudera.com/t5/Community-Articles/Publishing-and-Consuming-JMS-Messages-from-Tibco-Enterprise/ta-p/248157.

Source: Files/PDF/PowerPoint/Excel/Word Sink: Files

https://community.cloudera.com/t5/Community-Articles/Parsing-Any-Document-with-Apache-NiFi-1-5-with-Apache-Tika/ta-p/247672.

https://community.cloudera.com/t5/Community-Articles/Converting-PowerPoint-Presentations-into-French-from-English/ta-p/248974.

https://community.cloudera.com/t5/Community-Articles/Creating-HTML-from-PDF-Excel-and-Word-Documents-using-Apache/ta-p/247968.

Source: Files/CSV Sink: HDFS/Hive/Apache ORC

https://community.cloudera.com/t5/Community-Articles/Converting-CSV-Files-to-Apache-Hive-Tables-with-Apache-ORC/ta-p/248258.

Source: REST/Files/Simulator Sink: HBase, Files, HDFS. ETL with Lookups.

https://community.cloudera.com/t5/Community-Articles/ETL-With-Lookups-with-Apache-HBase-and-Apache-NiFi/ta-p/248243.

Flume Replacement - Lightweight Open Source Agents

If you need to replace local log-to-Kafka agents, or anything-to-Kafka or anything-to-anything flows with routing, transformation, and manipulation, you can use MiNiFi agents deployed by Edge Flow Manager, available in Java and C++ versions.

