An Introduction to Apache Flink


We've never had to process and stream as much big data as we do these days. Luckily, Apache Flink can help. This article offers a brief introduction.


Until a couple of years ago, scientists mostly dealt with deterministic problems that behave the same way under the same conditions, and they never needed to process vast amounts of data. Today's scientists, however, must solve problems that involve far too many factors. Newton needed only a few data points to characterize a motion, but a data scientist must know a customer's internet activity for at least a year in order to identify his or her preferences, and must do the same for millions of customers. Today's data scientists need a way to process streaming big data quickly.

Multiple frameworks have been created to process streaming big data. Among them, one of the most successful is Apache Flink. In this article, we will give a practical introduction to it.

You can use either Java or Scala to create a Flink application. In this article, we will use Scala as the programming language and Maven as the build tool.

Your pom file should import the following dependencies:


<!--Scala library-->
<!--Flink core & streaming-->

If you need a full pom.xml, you can find it here.
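A minimal sketch of the relevant dependency entries might look like the following; the version numbers and the Scala binary version (`_2.11`) are assumptions, so adjust them to match your environment:

```xml
<!-- Scala library -->
<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.12</version> <!-- illustrative version -->
</dependency>
<!-- Flink core & streaming (Scala API) -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>1.4.2</version> <!-- illustrative version -->
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.4.2</version> <!-- illustrative version -->
</dependency>
```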

With our project structure ready, we can start coding.

Beginning with the simplest case, we will listen to a socket. If an input line's length is greater than 10, it will be printed with a prompt of G10>. Otherwise, it will be printed with a prompt of S10>.
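Before running the application, something must be serving text on that socket. On most Unix-like systems you can open a simple text server with netcat (flags vary by nc implementation, so this is a sketch rather than a guaranteed invocation):

```shell
# Start a simple text server on port 9900 and keep it open for
# multiple connections. Type lines here once the Flink job is
# running; each line becomes one element of the stream.
nc -lk 9900
```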

import org.apache.flink.streaming.api.scala._

object PracticalFlink extends App{
  val sEnv = StreamExecutionEnvironment.getExecutionEnvironment
  val dStream: DataStream[String] = sEnv.socketTextStream(hostname = "localhost", port = 9900)

  dStream.filter(_.length>10).map("G10> "+_).print()
  dStream.filter(_.length<=10).map("S10> "+_).print()

  sEnv.execute(jobName = "Practical Flink")
}

At Line 4, we obtain the StreamExecutionEnvironment object; this line is boilerplate. At Line 10, the call to execute submits the job and the StreamExecutionEnvironment begins to run it. Again, this line is boilerplate.

At Line 5, we create a DataStream that listens on localhost, port 9900.

At Lines 7 and 8, we filter the data stream and create new DataStreams. Line 7 filters lines whose length is greater than 10, adds G10> as a prefix with the map method, and prints them. Line 8 filters lines whose length is less than or equal to 10, adds S10> as a prefix with the map method, and prints them.
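The filter and map calls behave just like their counterparts on ordinary Scala collections; a minimal sketch of the same logic on a plain List (no Flink or socket required, with made-up sample strings) looks like this:

```scala
// The same filter/map chain as in the Flink job, applied to a List.
val lines = List("hi", "a much longer line")

val g10 = lines.filter(_.length > 10).map("G10> " + _)   // lines longer than 10 chars
val s10 = lines.filter(_.length <= 10).map("S10> " + _)  // lines of 10 chars or fewer

g10.foreach(println)  // G10> a much longer line
s10.foreach(println)  // S10> hi
```

The key difference is that in Flink these transformations are applied lazily to an unbounded stream, element by element, rather than to a finite in-memory collection.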

If we execute this code and type lines into the socket, short inputs such as hello are printed as S10> hello, while longer inputs such as hello streaming world are printed as G10> hello streaming world.

In a later article, we will discuss some of DataStream's more advanced methods.
