Streaming ETL With Apache Flink
Streaming data computation is becoming more and more common with the growing Big Data landscape. Many enterprises are also adopting or moving towards streaming for message passing instead of relying solely on REST APIs.
Apache Flink has emerged as a popular framework for streaming data computation in a very short amount of time. In my experience, it has several advantages over Apache Spark: it is lightweight and developer-friendly, offers rich APIs and high throughput, and has an active and vibrant community.
When I started working on a new project where I had to process streaming data (e.g. events, server logs), after initial research, I found Flink to be the most suitable framework for my particular use case.
This blog series is based on my work with Flink, moving from a simple example to a subset of real-world use cases. I also share a few exceptions I ran into and ways to solve them, which should help beginners.
Below are the articles in this series, each with a short summary of its content. (All code examples are available on GitHub.)
Part 1 - A getting-started guide. I share an example of computing the sum of integers generated as a stream, using a custom SourceFunction and a TumblingWindow (fixed size, fixed time, non-overlapping).
Part 2 - Building on Part 1, this article shares an example of keyed data stream computation. It uses Flink's sum methods to achieve the same result.
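To make the two ideas above concrete, here is a minimal plain-Java sketch of what a tumbling-window sum and a keyed sum compute. It deliberately does not use the Flink API, and it is not the code from the articles: the names WindowSketch, Event, windowStart, sumPerWindow, and keyedSumPerWindow are hypothetical, and the sketch assumes non-negative timestamps in milliseconds with windows aligned to zero. In Flink itself, the source, the window assignment, and the per-key grouping are handled by the framework.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: plain Java, no Flink dependency. All names are hypothetical.
final class WindowSketch {

    /** A stream element: a key, an integer value, and a timestamp in milliseconds. */
    static final class Event {
        final String key;
        final int value;
        final long timestampMs;

        Event(String key, int value, long timestampMs) {
            this.key = key;
            this.value = value;
            this.timestampMs = timestampMs;
        }
    }

    /** Start of the fixed-size, non-overlapping window that contains the timestamp. */
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    /** Non-keyed case (as in Part 1): one sum per window, keyed by window start. */
    static Map<Long, Integer> sumPerWindow(List<Event> events, long windowSizeMs) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (Event e : events) {
            sums.merge(windowStart(e.timestampMs, windowSizeMs), e.value, Integer::sum);
        }
        return sums;
    }

    /** Keyed case (as in Part 2): one sum per key within each window. */
    static Map<Long, Map<String, Integer>> keyedSumPerWindow(List<Event> events, long windowSizeMs) {
        Map<Long, Map<String, Integer>> sums = new TreeMap<>();
        for (Event e : events) {
            sums.computeIfAbsent(windowStart(e.timestampMs, windowSizeMs), w -> new TreeMap<>())
                .merge(e.key, e.value, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
            new Event("a", 1, 0L),
            new Event("b", 2, 400L),
            new Event("a", 3, 999L),
            new Event("a", 4, 1000L),
            new Event("b", 5, 2500L));

        // One-second tumbling windows over the sample events.
        System.out.println(sumPerWindow(events, 1000L));
        System.out.println(keyedSumPerWindow(events, 1000L));
    }
}
```

With the sample events, the first three elements fall into the window starting at 0 ms, so that window sums to 6 in the non-keyed case and splits into 4 for key "a" and 2 for key "b" in the keyed case. The element at 1000 ms opens a new window, which is what "non-overlapping" means here.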
Upcoming Articles on Flink
1. Moving towards more real-world deployment use cases, I will share how to set up and use Flink in cluster mode. The setup will also include a database and Grafana to complete the tutorial end to end.
2. Using an AWS Kinesis stream with the same example as above.