
How to Create a Real-Time Scalable Streaming App Using Apache NiFi, Apache Pulsar, and Apache Flink SQL

In this article, we'll cover how and when to use Pulsar with NiFi and Flink as you build your streaming application.

By Tim Spann, DZone Core · Jan. 22, 2023 · Tutorial


Building streaming applications can be a difficult task; often, just figuring out how to get started is overwhelming. In this article, we'll cover how and when to use NiFi with Pulsar and Flink, and how to get started. Apache NiFi, Apache Pulsar, and Apache Flink are robust open-source platforms that can run applications at any scale. This lets us take a quickly developed prototype and deploy it as a resilient clustered application ready for internet-scale production workloads.

Using a drag-and-drop tool can reduce that difficulty and get the data flowing. By utilizing Apache NiFi, we can quickly go from ideation to streaming live events into our Pulsar topics. Once we have a stream, building applications on it via SQL becomes much more straightforward. The ability to rapidly prototype, iterate, test, and repeat is critical in modern cloud applications.
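To make the SQL piece concrete, here is a minimal sketch of querying a Pulsar topic with Flink SQL from PyFlink. The connector name and options follow the StreamNative pulsar-flink connector and vary by version; the topic, URLs, and schema below are illustrative assumptions, not a definitive setup.

```python
# Minimal PyFlink sketch: expose a Pulsar topic as a table and query it
# with Flink SQL. Assumes the pulsar-flink connector JAR is on the
# classpath; option names vary by connector version.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Hypothetical topic, URLs, and schema for illustration only.
t_env.execute_sql("""
    CREATE TABLE sensor_events (
        device_id   STRING,
        temperature DOUBLE
    ) WITH (
        'connector'   = 'pulsar',
        'service-url' = 'pulsar://localhost:6650',
        'admin-url'   = 'http://localhost:8080',
        'topics'      = 'persistent://public/default/sensor-events',
        'format'      = 'json'
    )
""")

# A continuous query over the live stream; print() blocks and emits
# updated aggregates as events arrive.
t_env.execute_sql(
    "SELECT device_id, AVG(temperature) AS avg_temp "
    "FROM sensor_events GROUP BY device_id"
).print()
```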

This puts us in familiar territory: the streaming application now looks and feels like the traditional database-driven or batch applications that most data engineers and programmers already know.

Before we cover how to get started with NiFi, Pulsar, and Flink, let’s discuss how and why these platforms work for real-time streaming.

Why Use Apache NiFi With Apache Pulsar and Apache Flink?

Architects and developers have many options for building real-time scalable streaming applications, so why should they utilize the combination of Apache NiFi, Apache Pulsar, and Apache Flink? The initial reason I started using this combination of open-source projects is the ease of getting started. For anyone exploring a new use case or problem, I always recommend starting with the simplest approach, and the simplest way to start data flowing from a source is usually Apache NiFi.

Apache NiFi is a drag-and-drop tool that works on live data, so I can quickly point to my source of data and start pulling or triggering data from it. Since Apache NiFi supports hundreds of sources, accessing the data I want is often a straightforward drag and drop. Once the data starts flowing, I can build an interactive streaming pipeline one step at a time, in real time, with live data flowing, examining the state of the data before building the next step. With the combination of inter-step queues and data provenance, I know the current state of the data and all of its previous states, along with their extensive metadata. In an hour or less, I can usually build the ingest, routing, transformation, and essential data enrichment. The final step of the Apache NiFi portion of our streaming application is to stream the data to Apache Pulsar utilizing the NiFi-Pulsar connector. The next step in our application development process is to provide routing and additional enhancements before data is consumed from Pulsar.
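Once the NiFi flow is publishing, a quick smoke test with the Pulsar Python client confirms events are landing in the topic. This is a minimal sketch; the topic name, subscription name, and service URL are assumptions for illustration.

```python
# Minimal sketch: drain and print whatever the NiFi flow has published.
# Topic, subscription, and URL are hypothetical.
import pulsar

client = pulsar.Client('pulsar://localhost:6650')
consumer = client.subscribe(
    'persistent://public/default/nifi-events',
    subscription_name='smoke-test')

try:
    while True:
        msg = consumer.receive(timeout_millis=10000)
        print(msg.data().decode('utf-8'))
        consumer.acknowledge(msg)  # ack so the backlog drains
except Exception:
    pass  # receive() raises on timeout: no more waiting messages
finally:
    client.close()
```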

Within Apache Pulsar, we can utilize Functions written in Java, Python, or Go to enrich, transform, add schemas to, and route our data in real time to other topics.
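As an illustration of that idea, here is a minimal Pulsar Function written with the Python Functions SDK that enriches each JSON event and routes it to one of two output topics. The topic names and the routing rule are invented for the example.

```python
# Minimal Pulsar Function sketch (Python SDK): enrich each JSON event and
# route it by a simple rule. Output topics and the threshold are hypothetical.
import json
from pulsar import Function


class RouteAndEnrich(Function):
    def process(self, input, context):
        event = json.loads(input)
        # Enrichment: stamp the event with the function that processed it.
        event['processed_by'] = context.get_function_name()
        # Routing: hot readings go to an alerts topic, the rest move onward.
        topic = ('persistent://public/default/alerts'
                 if event.get('temperature', 0) > 90
                 else 'persistent://public/default/enriched')
        context.publish(topic, json.dumps(event))
```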

When?

Quick Criteria

Your Data Source                 Recommended Platform(s)
JSON REST feed                   Looks like a good fit for NiFi+
Relational tables (CDC)          Looks like a good fit for NiFi+
Complex ETL and transformation   Look at Spark + Pulsar
Source requires joins            Look at Flink applications
Large batch data                 Look at native applications
Mainframe data sources           Look at existing applications that can send messages
WebSocket streams of data        Looks like a good fit for NiFi+
Clickstream data                 Looks like a good fit for NiFi+
Sensor data from devices         Stream directly to Pulsar via MQTT (see the sketch below)
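For the sensor row above, here is a sketch of what streaming directly to Pulsar via MQTT can look like, assuming the MQTT-on-Pulsar (MoP) protocol handler is enabled on the broker and listening on the standard MQTT port. The host, topic, and payload are illustrative.

```python
# Minimal sketch: publish sensor readings over MQTT to a Pulsar broker
# running the MoP protocol handler (an assumption; 1883 is the MQTT
# default port). Uses the paho-mqtt 1.x client API.
import json
import time
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect('localhost', 1883)

for i in range(3):
    payload = json.dumps({'device_id': 'sensor-1',
                          'temperature': 20.0 + i,
                          'ts': time.time()})
    # MoP surfaces this MQTT topic as a Pulsar topic.
    client.publish('sensors/room1', payload)

client.disconnect()
```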

You may ask when you should use the combination of Apache NiFi, Apache Pulsar, and Apache Flink to build your apps, and whether Pulsar alone might be enough. Often it is. Suppose you have an existing application that produces messages or events sent to Apache Kafka, MQTT, RabbitMQ, REST, WebSockets, JMS, RPC, or RocketMQ. In that case, you can just point that program at a Pulsar cluster or rewrite it to use the superior native Pulsar libraries.
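If you do rewrite, the native Python client takes only a few lines. A minimal sketch, with a hypothetical topic and service URL:

```python
# Minimal sketch: an existing app sending events with the native Pulsar
# Python client instead of a Kafka/RabbitMQ/etc. client.
import pulsar

client = pulsar.Client('pulsar://localhost:6650')
producer = client.create_producer('persistent://public/default/app-events')

producer.send(b'{"order_id": 123, "status": "shipped"}',
              properties={'source': 'legacy-app'})  # optional metadata

client.close()
```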

After an initial prototype with NiFi, if it is too slow, you can deploy your flow on a cluster and resize it with Kubernetes, scale vertically with more RAM and CPU cores, or work with streaming advisors to find other options. The great thing about Apache NiFi is that there are many pre-built solutions, demos, examples, and articles for most use cases, spanning from REST to CDC to logs to sensor processing.

If you hit a wall at any step, then perhaps Apache NiFi is not right for this data. If you are dealing with mainframe data, complex ingest rules that require joins, many enrichment steps, or complex ETL or ELT, I suggest looking at custom Java code, Apache Spark, Apache Flink, or another tool.

If you don’t have an existing application, but your data requires no immediate changes and comes from a known system, perhaps you can use a native Pulsar source. Check them out at https://hub.streamnative.io/. If you need to do some routing, enrichment, and enhancement, you may want to look at Pulsar Functions, which can process the raw data in that newly populated topic one event at a time.

If you have experience with an existing tool such as Spark and already have an environment for it, that may be a good way to bring this data into the Pulsar stream. This is especially true if there are a lot of ETL steps or you are combining it with Spark ML.

There are several items you should catalog about your data sources, data types, schemas, formats, requirements, and systems before you finalize infrastructure decisions.  

A number of questions should be answered first. Here are a few basic ones:

  1. Is this pipeline one that requires Exactly Once semantics?

  2. What effects would duplicate data have on your pipeline?

  3. What is the scale in events per second and gigabytes per second, and what are the total storage and completion requirements?

  4. How many infrastructure resources do you have?

  5. What is the trade-off between speed and cost?

  6. How much total storage per day?

  7. How long do you wish to store your data stream?

  8. Does this need to be repeatable?

  9. Where will this run? Will it need to run in different locations, countries, or availability zones, on-premises, in the cloud, on Kubernetes, at the edge...?

  10. What does your data look like? Prepare data types, schemas, and everything you can about the source and final data. Is your data binary, image, video, audio, documents, unstructured, semi-structured, structured, normalized relational data, etc.?

  11. What are the upstream and downstream systems?

  12. Do NiFi, Pulsar, Flink, Spark, and other systems have native connectors or drivers for your system?

  13. Is this data localized, and does it require translation for formatting or language?

  14. What type of enrichment is required?

  15. Do you require strict auditing, lineage, provenance, and data quality?

  16. Who is using this data and how?

  17. What team is involved? Data scientists? Data engineers? Data analysts? Programmers? Citizen streaming developers?

  18. Is this batch-oriented?

  19. How long will this pipeline live?   

  20. Is this time series data?

  21. Is machine learning or deep learning part of the flow or final usage?

References

  • https://dev.to/tspannhw/did-the-user-really-ask-for-exactly-once-fault-tolerance-3fek
  • https://www.datainmotion.dev/2021/01/migrating-from-apache-storm-to-apache.html 
  • https://thenewstack.io/pulsar-nifi-better-together-for-messaging-streaming/

