Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

An Introduction to Apache Flink

DZone's Guide to

An Introduction to Apache Flink

We've never had to process and stream as much big data as we do these days. Luckily, Apache Flink can help. This article offers a brief introduction.

· Big Data Zone
Free Resource

Learn best practices according to DataOps. Download the free O'Reilly eBook on building a modern Big Data platform.

Until the past couple of years, scientists always dealt with deterministic problems that behave the same way under the same conditions. They never needed to process vast amounts of data. However, today's scientists are to solve problems that may have too many factors. Newton needed to know just a few points of an equation in order to identify a movement. But a data scientist must know a customer's internet activities for at least one year in order to identify his or her choices. Moreover, that data scientist must also identify millions of customers. Today’s data scientists should find a way to quickly process streaming big data.

There have been multiple frameworks created to process streaming big data. Among those, the most successful is Apache Flink. In this article, we will give a practical introduction.

You can either use Java or Scala to create a Flink application. In this article, we will use Scala as the programming language and Maven as the build tool.

Your pom file should import following dependencies:

<properties>
   <maven.compiler.source>1.8</maven.compiler.source>
   <maven.compiler.target>1.8</maven.compiler.target>
   <encoding>UTF-8</encoding>
   <scala.version>2.11.5</scala.version>
   <scala.compat.version>2.11</scala.compat.version>
   <flink.version>1.3.1</flink.version>
</properties>
…


<!--Scala library-->
<dependency>
   <groupId>org.scala-lang</groupId>
   <artifactId>scala-library</artifactId>
   <version>${scala.version}</version>
</dependency>
<!--Flink core & streaming-->
<dependency>
   <groupId>org.apache.flink</groupId>
   <artifactId>flink-core</artifactId>
   <version>${flink.version}</version>
</dependency>
<dependency>
   <groupId>org.apache.flink</groupId>
   <artifactId>flink-streaming-scala_${scala.compat.version}</artifactId>
   <version>${flink.version}</version>
</dependency>

If you need a full pom.xml, you can find it here.

With our project structure ready, we can start coding.

Beginning with the simplest case, we will listen to a socket. If the input line’s length is greater than 10, then it will be printed with prompt of G10> . Otherwise, it will be printed with a prompt of S10> .

import org.apache.flink.streaming.api.scala._

object PracticalFlink extends App{
  val sEnv = StreamExecutionEnvironment.getExecutionEnvironment
  val dStream: DataStream[String] = sEnv.socketTextStream(hostname = "localhost", port = 9900)

  dStream.filter(_.length>10).map("G10> "+_).print()
  dStream.filter(_.length<=10).map("S10> "+_).print()

  sEnv.execute("PracticalFlink")
}

At Line 4, we refer to the StreamExecutionEnvironment object; this line is boilerplate. At line 10, the StreamExecutionEnvironment begins to execute. Again, this line is boilerplate.

At Line 5, we refer to a datastream listening on port 9900.

At Lines 7 and 8, we are filtering the data stream and creating new dStreams. Line 7 filters lines whose length is greater than 10, and with the map method adds G10>  as a prefix, then prints it Line 8 filters lines whose length is smaller than 10 and with the map method adds S10>  as a prefix and prints it. 

If we execute this code, the output would be as follows:

Image title

Image title

In a later article, we will discuss some advanced methods of dataStream.

Find the perfect platform for a scalable self-service model to manage Big Data workloads in the Cloud. Download the free O'Reilly eBook to learn more.

Topics:
data science ,big data ,apache flink ,big data analytics ,tutorial

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}