Spark Streaming Under the Hood
Spark Streaming Under the Hood
In this article, we discuss basic concepts behind Spark Streaming and the internal functionality and use cases of Spark's various APIs
Join the DZone community and get the full member experience.Join For Free
Apache Spark Streaming is a scalable, fault-tolerant streaming processing system that natively supports both batch and streaming workloads.
Spark Streaming is different from other systems that either have a processing engine designed only for streaming or have similar batch and streaming APIs but compile internally to different engines. Spark’s single execution engine and unified programming model for batch and streaming lead to some unique benefits over other traditional streaming systems. In particular, four major benefits include:
- Fast recovery from failures and stragglers.
- Better load balancing and resource usage.
- Combining of streaming data with a static dataset and interactive queries.
- Native integration with advanced processing libraries (SQL, machine learning, graph processing).
You may also like: The Complete Apache Spark Collection [Tutorials and Articles].
Internally, it works as follows: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches
Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs
Discretized Streams (DStreams)
Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from the source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset.
Any operation applied on a DStream translates to operations on the underlying RDDs. Spark Streaming provides two categories of built-in streaming sources.
- Basic sources: Sources directly available in the StreamingContext API. Examples: file systems and socket connections.
- Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes
import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ val conf = new SparkConf().setMaster("local").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val lines = ssc.socketTextStream("localhost", 9999) val words = lines.flatMap(_.split(" ")) val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _) wordCounts.print() ssc.start() // Start the computation ssc.awaitTermination() // Wait for the computation to terminate
Submit this Spark job and run
nc -lk 9999 on another terminal and input data in it. This will display total word count in that timeframe
Spark Streaming Operations
i. Transformation Operations in Spark
Similar to Spark RDDs, Spark transformations allow modification of the data from the input DStream. DStreams support many transformations that are available on normal Spark RDDs. Some of the common ones are:
map(), flatMap(), filter(), repartition(numPartitions), union(otherStream), count(), reduce(), countByValue(), reduceByKey(func, [numTasks]), join(otherStream, [numTasks]), cogroup(otherStream, [numTasks]), transform(), updateStateByKey(), Window().
ii. Output Operations in Apache Spark
DStream’s data pushed out to external systems like a database or file systems using Output Operations. Since external systems consume the transformed data as allowed by the output operations, they trigger the actual execution of all the DStream transformations. Currently, the following output operations define as:
print(), saveAsTextFiles(prefix, [suffix])”prefix-TIME_IN_MS[.suffix]”, saveAsObjectFiles(prefix, [suffix]), saveAsHadoopFiles(prefix, [suffix]), foreachRDD(func)
Hence, DStreams like RDDs execute lazily by the output operations.
- Spark Streaming: Unit Testing DStreams.
- How to Perform Distributed Spark Streaming With PySpark.
- Kafka and Spark Streams: Living Happily Ever After.
Opinions expressed by DZone contributors are their own.