Streaming ETL With Apache Flink

In this article, we discuss use cases and best practices for using Apache Flink to process streaming data.

Streaming data computation is becoming more and more common with the growing Big Data landscape. Many enterprises are also adopting or moving towards streaming for message passing instead of relying solely on REST APIs. 

Apache Flink has emerged as a popular framework for streaming data computation in a very short amount of time. In my experience, it has several advantages over Apache Spark: it is lightweight and developer-friendly, offers rich APIs and high throughput, and has an active and vibrant community.

When I started working on a new project where I had to process streaming data (e.g. events, server logs), after initial research, I found Flink to be the most suitable framework for my particular use case.

This series is based on my work with Flink, moving from a simple example to a subset of real-world use cases. I also share a few exceptions I ran into and ways to solve them, which should help beginners.

The articles in this series are listed below, each with a short summary of its content. (All code examples are available on GitHub.)

Part 1 - A getting-started guide. I share an example of computing the sum of integers generated as a stream, using a custom SourceFunction and a TumblingWindow (fixed size, fixed time, non-overlapping).
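
To make the tumbling-window idea from Part 1 concrete, here is a minimal plain-Java sketch. It does not use Flink at all: the class TumblingWindowSketch, the Event type, and the 1000 ms window size are hypothetical names and values chosen only for illustration. It shows how timestamped integers fall into fixed-size, non-overlapping windows and are summed per window; the actual Part 1 example relies on Flink's own source and window operators instead.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of tumbling-window semantics. This is NOT Flink API.
final class TumblingWindowSketch {

    // Hypothetical event type: an integer value tagged with a timestamp in milliseconds.
    static final class Event {
        final long timestampMs;
        final int value;

        Event(long timestampMs, int value) {
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    // Start of the window that contains the timestamp.
    // Each window covers the half-open range [start, start + windowSizeMs).
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs, windowSizeMs);
    }

    // Sums event values per window. Every event belongs to exactly one window,
    // which is what makes the windows "tumbling" (non-overlapping).
    static Map<Long, Integer> sumPerWindow(List<Event> events, long windowSizeMs) {
        Map<Long, Integer> sums = new TreeMap<>();
        for (Event e : events) {
            sums.merge(windowStart(e.timestampMs, windowSizeMs), e.value, Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event(100, 1),
                new Event(900, 2),
                new Event(1000, 3),
                new Event(2500, 4));
        // Keys are window start times in ms; prints {0=3, 1000=3, 2000=4}
        System.out.println(sumPerWindow(events, 1000));
    }
}
```

Unlike this batch-style loop, a streaming engine emits each window's result once that window closes rather than after all input has been read; the sketch only captures how elements are assigned to windows.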

Part 2 - Building on Part 1, I share an example of keyed data stream computation. This one uses Flink's reduce and sum methods to achieve the same result.
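
The relationship between a keyed reduce and a keyed sum in Part 2 can be sketched in plain Java as well. Again, this is not Flink code: KeyedReduceSketch, reduceByKey, and sumByKey are hypothetical names, and the string keys are made up for illustration. The sketch only shows that records are grouped by key, that a reduce combines the values of each key pairwise, and that a sum is the special case where the combining function is addition.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Plain-Java illustration of per-key reduction. This is NOT Flink API.
final class KeyedReduceSketch {

    // Combines all values that share a key, pairwise, using the given reducer.
    static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> records,
                                            BinaryOperator<Integer> reducer) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> record : records) {
            // First value for a key is stored as-is; later values are folded in.
            result.merge(record.getKey(), record.getValue(), reducer);
        }
        return result;
    }

    // A sum is a reduce whose combining function is addition.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> records) {
        return reduceByKey(records, Integer::sum);
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("a", 1),
                Map.entry("b", 10),
                Map.entry("a", 2),
                Map.entry("b", 20),
                Map.entry("a", 3));
        // prints {a=6, b=30}
        System.out.println(sumByKey(records));
    }
}
```

Passing a different reducer, for example Integer::max, gives a per-key maximum with the same grouping logic, which is why reduce is the more general of the two operations.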

Part 3 - Changing gears, I take a subset of a real-world Flink use case (see this post from zalando.com). I share an example of how to process connectivity events to identify a simple pattern.

Part 4 - Building on the example from Part 3, I share how to achieve the same result using Flink's CEP (complex event processing) library.

Upcoming Articles on Flink

1. Moving towards real-world deployment use cases, I will share how to set up and use Flink in cluster mode. This will also include a database and Grafana to complete the tutorial end to end.

2. The same example as above, using an AWS Kinesis stream.
