What Is Structured Streaming?

Structured Streaming is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users build streaming applications.


With the advent of streaming frameworks like Spark Streaming, Flink, and Storm, developers stopped worrying about the plumbing of a streaming application, such as fault tolerance, zero data loss, and real-time processing of data, and started focusing only on solving business problems, because these frameworks provide built-in support for all of those concerns. For example:

In Spark Streaming, recovery from failures becomes as easy as adding a checkpoint directory path, as done in the code snippet below.

    val sparkConf = new SparkConf().setAppName("RecoverableNetworkWordCount")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    ssc.checkpoint("/path/to/checkpoint")
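
For complete recovery from a driver failure, this is typically paired with StreamingContext.getOrCreate, which rebuilds the context from the checkpoint data on restart. Here is a minimal sketch of that pattern; the checkpoint path is a placeholder:

    // Build a fresh StreamingContext (and set its checkpoint directory)
    // only when no checkpoint exists yet.
    def createContext(): StreamingContext = {
      val sparkConf = new SparkConf().setAppName("RecoverableNetworkWordCount")
      val ssc = new StreamingContext(sparkConf, Seconds(1))
      ssc.checkpoint("/path/to/checkpoint")
      // ... define the streaming computation on ssc here ...
      ssc
    }

    // On restart, getOrCreate recreates the context from the checkpoint
    // data instead of calling createContext again.
    val context = StreamingContext.getOrCreate("/path/to/checkpoint", createContext _)
    context.start()
    context.awaitTermination()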

And in Flink, we just have to enable checkpointing on the execution environment, as shown in the code snippet below.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // create a checkpoint every 5 seconds
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

Everything was working fine in the streaming data world, but then along came the structured data era, where data is kept in tabular form (i.e. stored in large data warehouses) and processed using simple SQL queries. For example, in Spark SQL or the Flink Table API, reading data became as simple as a SELECT *, as shown below:

spark.sql("SELECT * FROM employee").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Table table = tableEnv.sqlQuery("SELECT * FROM employee");

// Convert the Table back into a DataSet of generic Rows and print it.
DataSet<Row> result = tableEnv.toDataSet(table, Row.class);

result.print();

This helped a wider range of people, i.e. those who do not know how to code, like data scientists and business analysts, but who do know SQL. Both Spark SQL and the Flink Table API became an instant hit in the big data industry.

However, this success was limited to batch data, i.e. files, tables, etc. The streaming world was totally untouched by it. Everyone wanted the ability to run SQL queries on streaming data as well, so that they could draw insights from their data in real time.

This compelled big data industry experts to develop APIs that can process streaming data in structured or semi-structured form. As a result, a number of frameworks were developed that can process streaming data using SQL queries. For example:

  • Spark Structured Streaming
  • KSQL (Kafka SQL)
  • Flink Table API and SQL
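
To give a feel for what "SQL on a stream" looks like, here is a minimal Spark Structured Streaming sketch: a streaming DataFrame is registered as a temporary view and queried with plain SQL. The socket source, host, and port are placeholder choices for the example.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("StreamingSQL").getOrCreate()

// A streaming DataFrame reading lines from a socket (placeholder host/port).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Register the stream as a temporary view and query it with ordinary SQL;
// the result is itself a streaming DataFrame.
lines.createOrReplaceTempView("lines")
val counts = spark.sql("SELECT value, COUNT(*) AS cnt FROM lines GROUP BY value")

// Start the query and print each updated result to the console.
counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()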

They all have their own pros and cons, but in this post, we are only talking about Spark Structured Streaming. According to Spark's official documentation:

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

This means that we can express our streaming computation the same way we would express a batch computation on static data. And since Structured Streaming is built on the Spark SQL engine, it comes with a lot of advantages, like:

  1. Incremental and continuous updates of the final result (table) are taken care of by the API itself.
  2. The Dataset/DataFrame API can be used/reused in Scala, Java, Python, or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. (see the windowing sketch after this list).
  3. Computations are optimized, as the same Spark SQL engine is used.
  4. And, the application guarantees end-to-end exactly-once fault-tolerance through checkpointing and WALs (write-ahead logs).
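
As an illustration of point 2, here is a minimal sketch of an event-time windowed count over a streaming DataFrame; the built-in rate source stands in for a real event stream, and the window and watermark durations are arbitrary examples.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("EventTimeWindows").getOrCreate()
import spark.implicits._

// The built-in "rate" source generates (timestamp, value) rows.
val events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

// Count events per 1-minute event-time window, tolerating data that is
// up to 10 minutes late, exactly as one would on a static DataFrame.
val windowedCounts = events
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window($"timestamp", "1 minute"))
  .count()

windowedCounts.writeStream
  .outputMode("update")
  .format("console")
  .start()
  .awaitTermination()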

So, long story short, Structured Streaming is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users build streaming applications without having to reason about the underlying streaming machinery.
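
And, mirroring the Spark Streaming and Flink snippets above, the fault-tolerance part again comes down to pointing the query at a checkpoint location. A minimal sketch, with placeholder paths and the rate source standing in for a real input:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ExactlyOnceSketch").getOrCreate()

// Any streaming DataFrame; the rate source stands in for a real input here.
val events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

// End-to-end exactly-once output relies on a replayable source, the
// checkpoint location (offsets + state), and a file sink (placeholder paths).
events.writeStream
  .format("parquet")
  .option("path", "/path/to/output")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()
  .awaitTermination()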

We will explore more about Structured Streaming in our future articles. Until then, stay tuned! Please feel free to leave your suggestions or comments.


