Spark Streaming Under the Hood

By Shubham Dangare · Dec. 09, 19

Apache Spark Streaming is a scalable, fault-tolerant streaming processing system that natively supports both batch and streaming workloads.

Spark Streaming is different from other systems that either have a processing engine designed only for streaming or have similar batch and streaming APIs but compile internally to different engines. Spark’s single execution engine and unified programming model for batch and streaming lead to some unique benefits over other traditional streaming systems. In particular, four major benefits include:

  • Fast recovery from failures and stragglers.
  • Better load balancing and resource usage.
  • Combining of streaming data with a static dataset and interactive queries.
  • Native integration with advanced processing libraries (SQL, machine learning, graph processing).

Spark Streaming workflow

You may also like: The Complete Apache Spark Collection [Tutorials and Articles].

Internally, it works as follows: Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.

Discretized Streams (DStreams)

Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from the source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, Spark's abstraction of an immutable, distributed dataset.

Any operation applied on a DStream translates to operations on the underlying RDDs. Spark Streaming provides two categories of built-in streaming sources.

  • Basic sources: Sources directly available in the StreamingContext API. Examples: file systems and socket connections.
  • Advanced sources: Sources like Kafka, Flume, Kinesis, etc., which are available through extra utility classes.

Example:

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3

// Create a local StreamingContext with two working threads and a batch interval
// of 1 second. At least two threads are needed locally: one to receive data
// and one to process it.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

// Create a DStream that connects to a TCP source at hostname:port
val lines = ssc.socketTextStream("localhost", 9999)

// Split each line into words and count each word in every batch
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)

// Print the first ten elements of each RDD generated in this DStream
wordCounts.print()

ssc.start()             // Start the computation
ssc.awaitTermination()  // Wait for the computation to terminate


Submit this Spark job, then run nc -lk 9999 in another terminal and type some input. The job will print the word counts computed for each one-second batch.
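The socket stream above is a basic source, available directly on the StreamingContext. For an advanced source such as Kafka, the DStream is created through a utility class from a separate artifact. Here is a minimal sketch, assuming the spark-streaming-kafka-0-10 connector is on the classpath; the broker address, group id, and topic name are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

// Kafka consumer configuration; broker address, group id, and topic are placeholders
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-streaming-example",
  "auto.offset.reset"  -> "latest"
)

// Create a direct Kafka DStream; PreferConsistent spreads partitions evenly
// across the available executors
val messages = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("words"), kafkaParams)
)

// Each record is a Kafka ConsumerRecord; extracting the value yields a DStream
// of lines that can feed the same word-count pipeline as above
val kafkaLines = messages.map(_.value)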

Spark Streaming Operations

i. Transformation Operations in Spark

As with Spark RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs. Some of the common ones are:

map(), flatMap(), filter(), repartition(numPartitions), union(otherStream), count(), reduce(), countByValue(), reduceByKey(func, [numTasks]), join(otherStream, [numTasks]), cogroup(otherStream, [numTasks]), transform(), updateStateByKey(), and window().
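Most of these mirror their RDD counterparts; updateStateByKey() is specific to streaming and maintains arbitrary state across batches. Below is a minimal sketch of a running word count, reusing the pairs DStream from the example above; the checkpoint directory path is a placeholder:

// Stateful transformations require checkpointing; the directory is a placeholder
ssc.checkpoint("/tmp/spark-checkpoint")

// Merge this batch's new counts into the running count for each word
def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(runningCount.getOrElse(0) + newValues.sum)

// A DStream of (word, cumulative count) pairs, updated on every batch
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
runningCounts.print()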

ii. Output Operations in Apache Spark

Output operations push a DStream's data out to external systems such as databases or file systems. Because they let external systems consume the transformed data, output operations trigger the actual execution of all the DStream transformations (similar to actions on RDDs). The following output operations are currently defined:

print(), saveAsTextFiles(prefix, [suffix]) (file names are generated as "prefix-TIME_IN_MS[.suffix]"), saveAsObjectFiles(prefix, [suffix]), saveAsHadoopFiles(prefix, [suffix]), and foreachRDD(func).

Hence, like RDDs, DStreams are executed lazily, with the output operations triggering the actual execution.
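The most general of these is foreachRDD(func), which applies an arbitrary function to each RDD the stream generates. A common pattern is to open one connection to the external system per partition rather than per record. Here is a sketch in which ConnectionPool is a hypothetical helper standing in for real connection management:

// Write each batch to an external system; ConnectionPool is hypothetical
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // One connection per partition avoids per-record connection overhead
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record.toString))
    ConnectionPool.returnConnection(connection) // return to the pool for reuse
  }
}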


Further Reading

  • Spark Streaming: Unit Testing DStreams.
  • How to Perform Distributed Spark Streaming With PySpark.
  • Kafka and Spark Streams: Living Happily Ever After.
