{{announcement.body}}
{{announcement.title}}

Spark Streaming Under the Hood

DZone 's Guide to

Spark Streaming Under the Hood

In this article, we discuss basic concepts behind Spark Streaming and the internal functionality and use cases of Spark's various APIs

· Big Data Zone ·
Free Resource

sparkler-in-water

Apache Spark Streaming is a scalable, fault-tolerant streaming processing system that natively supports both batch and streaming workloads.

Spark Streaming is different from other systems that either have a processing engine designed only for streaming or have similar batch and streaming APIs but compile internally to different engines. Spark’s single execution engine and unified programming model for batch and streaming lead to some unique benefits over other traditional streaming systems. In particular, four major benefits include:

  • Fast recovery from failures and stragglers.
  • Better load balancing and resource usage.
  • Combining of streaming data with a static dataset and interactive queries.
  • Native integration with advanced processing libraries (SQL, machine learning, graph processing).

Spark Streaming workflow

You may also like: The Complete Apache Spark Collection [Tutorials and Articles].

Internally, it works as follows: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches

Spark workflow

Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs

Discretized Streams (DStreams)

Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from the source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset.

Any operation applied on a DStream translates to operations on the underlying RDDs. Spark Streaming provides two categories of built-in streaming sources.

  • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems and socket connections.
  • Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes

Example:

import org.apache.spark._

import org.apache.spark.streaming._

import org.apache.spark.streaming.StreamingContext._ 

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")

val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)

val words = lines.flatMap(_.split(" "))

val pairs = words.map(word => (word, 1))

val wordCounts = pairs.reduceByKey(_ + _)

wordCounts.print()

ssc.start()             // Start the computation

ssc.awaitTermination()  // Wait for the computation to terminate


Submit this Spark job and run nc -lk 9999 on another terminal and input data in it. This will display total word count in that timeframe

Spark Streaming Operations

i. Transformation Operations in Spark

Similar to Spark RDDs, Spark transformations allow modification of the data from the input DStream. DStreams support many transformations that are available on normal Spark RDDs. Some of the common ones are:

map(), flatMap(), filter(), repartition(numPartitions), union(otherStream), count(), reduce(), countByValue(), reduceByKey(func, [numTasks]), join(otherStream, [numTasks]), cogroup(otherStream, [numTasks]), transform(), updateStateByKey(), Window().

ii. Output Operations in Apache Spark

DStream’s data pushed out to external systems like a database or file systems using Output Operations. Since external systems consume the transformed data as allowed by the output operations, they trigger the actual execution of all the DStream transformations. Currently, the following output operations define as:

print(), saveAsTextFiles(prefix, [suffix])”prefix-TIME_IN_MS[.suffix]”, saveAsObjectFiles(prefix, [suffix]), saveAsHadoopFiles(prefix, [suffix]), foreachRDD(func)

Hence, DStreams like RDDs execute lazily by the output operations.


Further Reading

Topics:
streaming ,spark ,big data ,kafka ,sql ,streams api ,producer ,consumer

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}