What Is Structured Streaming?
Structured Streaming is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users build streaming applications.
With the advent of streaming frameworks like Spark Streaming, Flink, and Storm, developers stopped worrying about the hard problems of streaming applications, such as fault tolerance, zero data loss, and real-time processing, and started focusing only on solving business challenges. The reason is that these frameworks provide built-in support for all of them. For example:
In Spark Streaming, simply setting the checkpoint directory path, as in the snippet below, makes recovery from failures easy.
```scala
val sparkConf = new SparkConf().setAppName("RecoverableNetworkWordCount")
// Create the context with a 1 second batch size
val ssc = new StreamingContext(sparkConf, Seconds(1))
ssc.checkpoint("/path/to/checkpoint")
```
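Recovery works because Spark Streaming can rebuild the driver state from the checkpointed metadata on restart. A minimal sketch of the usual pattern (the app name and checkpoint path here are placeholders) wraps context creation in a factory function and hands it to `StreamingContext.getOrCreate`:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/path/to/checkpoint" // placeholder path

// Factory that builds a fresh context the first time the app runs
def createContext(): StreamingContext = {
  val sparkConf = new SparkConf().setAppName("RecoverableNetworkWordCount")
  val ssc = new StreamingContext(sparkConf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... define the streaming computation here ...
  ssc
}

// On a restart after failure, the context (and its pending state) is
// rebuilt from the checkpoint instead of being created from scratch
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```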
And in Flink, we just have to enable checkpointing in the execution environment, as in the snippet below.
```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(5000); // create a checkpoint every 5 seconds
env.getConfig().setGlobalJobParameters(parameterTool); // make parameters available in the web interface
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
```
Everything was working fine in the streaming data world, but then along came the structured data era, where data was in tabular form (i.e. stored in large data warehouses) and was processed using simple SQL queries. For example, in Spark SQL or the Flink Table API, reading data became as simple as a `SELECT *`, as shown below:
```scala
spark.sql("SELECT * FROM employee").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
```
```java
Table table = tableEnv.sqlQuery("SELECT * FROM employee");
DataSet<WC> result = tableEnv.toDataSet(table, WC.class);
result.print();
```
This opened big data to a wider range of people: those who knew SQL but not how to code, such as data scientists and business analysts. Both Spark SQL and Flink tables became an instant hit in the big data industry.
However, this success was limited to batch data, i.e. files, tables, etc. The streaming world was totally untouched by it. Everyone wanted the capability of running SQL queries on streaming data as well, so that they could draw insights from their data in real time.
This compelled big data industry experts to develop APIs that could process streaming data in structured or semi-structured form. As a result, several frameworks emerged that can process streaming data using SQL queries. For example:
- Spark Structured Streaming
- KSQL (Kafka-SQL)
- Flink tables
They all have their own pros and cons, but in this post, we are only talking about Spark Structured Streaming. According to Spark's official documentation:
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
This means that we can express our streaming computation the same way we would express a batch computation on static data. Since Structured Streaming is built on the Spark SQL engine, it comes with a lot of advantages, like:
- Incremental and continuous updates of the final result (table) are taken care of by the API itself.
- Dataset/DataFrame API can be used/re-used in Scala, Java, Python, or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc.
- Computations are optimized, as the same Spark SQL engine is used.
- And, the application guarantees end-to-end exactly-once fault-tolerance through checkpointing and WALs (write-ahead logs).
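To make the points above concrete, here is a minimal sketch of a streaming word count with this API, assuming a socket source (the host, port, and checkpoint path are placeholders, not values from this article). Note how the transformation uses the same Dataset/DataFrame operations as a batch job:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()
import spark.implicits._

// Read lines streamed from a socket (placeholder host/port)
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Same DataFrame API as batch: split lines into words and count them
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()

// Write the running counts to the console; the checkpoint location
// is what enables end-to-end exactly-once recovery after a failure
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()

query.awaitTermination()
```

The engine treats the incoming stream as an unbounded table and incrementally maintains the result of this query as new rows arrive.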
So, long story short, Structured Streaming is a fast, scalable, fault-tolerant, end-to-end, exactly-once stream processing API that helps users build streaming applications without having to reason about the underlying streaming machinery.
We will explore more about Structured Streaming in our future articles. Until then, stay tuned! Please feel free to leave your suggestions or comments.
Published at DZone with permission of Himanshu Gupta, DZone MVB.