Stateful Streaming in Spark

This article shows how to do stateful streaming in Spark with the Spark DStream API, using updateStateByKey and mapWithState.

by Muskan Gupta · Sep. 24, 20 · Tutorial

Apache Spark is a fast, general-purpose cluster computing system that supports both batch processing and stream processing. Its stream processing is near real-time, which means that it processes the data in micro-batches. I discussed Spark Streaming in more detail in my previous blog. In this blog, I'll discuss stateful streaming in Spark. So let's start!

What Is Stateful Streaming?

Stateful stream processing means that a "state" is shared between events and therefore past events can influence the way current events are processed. 

In simple words, we can say that in Stateful Streaming the processing of the current data/batch is dependent on the data/batch that has been processed already. 

To understand this better, let's discuss a scenario. You want to find the total number of occurrences of each word that the processor has received so far. Here is the Spark Streaming application for this. Suppose it first receives the following data:

and then we get the following output: 

While the application was still running, I sent the same data again. You might expect the counts of "I", "am", and "great" to be 2 each. But I still got the following output:

Wondering why? The stream processing is stateless right now, i.e., the processing of the current batch is independent of the batches that have already been processed.
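
The stateless behavior described above can be modeled in a few lines of plain Python. This is only a sketch of the semantics, not Spark code: `count_words_stateless` is a hypothetical helper, and the sample line "I am great" is an assumption chosen to match the words named above.

```python
from collections import Counter


def count_words_stateless(batch_lines):
    """Count words in one micro-batch only; nothing is remembered afterwards."""
    counts = Counter()
    for line in batch_lines:
        counts.update(line.split())
    return dict(counts)


# Assumed input: the same line arrives in two consecutive micro-batches.
first_batch = ["I am great"]
second_batch = ["I am great"]

first_output = count_words_stateless(first_batch)
second_output = count_words_stateless(second_batch)

print(first_output)   # {'I': 1, 'am': 1, 'great': 1}
print(second_output)  # {'I': 1, 'am': 1, 'great': 1}
```

Because each call starts from an empty counter, the second batch cannot see the first one, so every count stays at 1.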

In the above scenario, you need to check the previous state of the RDD in order to update its new state. This is what is known as stateful streaming in Spark.

Now to get our expected result we will move to stateful streaming.

How to do Stateful Processing in Spark?

To do stateful streaming in Spark we can use updateStateByKey or mapWithState. I'm going to discuss both of them here.

updateStateByKey

The updateStateByKey operation allows you to maintain an arbitrary state while continuously updating it with new information.

To use it, you need to do two things:

  • Define the state: The state can be an arbitrary data type.
  • Define the state update function: In this function, you have to specify how to update the state with the previous state and the new value received from the stream.

In every batch, Spark will apply the state update function for all existing keys, regardless of whether they have new data in a batch or not. If the update function returns None then the key-value pair will be eliminated.

Look at the following example to see how to use updateStateByKey:

Now, if I send the following data:

the output will be the following:

And on sending the same data again the word count will be the following:

Now, it works as per our expectations.
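
As a rough model of the semantics described in this section, the following plain-Python sketch (not Spark code) applies a state update function once per micro-batch. `update_state_by_key` and `update_count` are hypothetical names, and the sample line "I am great" is an assumption. The update function takes the new values for a key and the previous state, in the same spirit as the function you pass to Spark's updateStateByKey.

```python
from collections import defaultdict


def update_count(new_values, previous_count):
    """State update function: add this batch's values to the previous count."""
    return (previous_count or 0) + sum(new_values)


def update_state_by_key(state, batch_pairs, update_func):
    """Hypothetical helper modeling one micro-batch of updateStateByKey."""
    grouped = defaultdict(list)
    for key, value in batch_pairs:
        grouped[key].append(value)

    new_state = {}
    # The update function runs for every known key, even without new data.
    for key in set(state) | set(grouped):
        updated = update_func(grouped.get(key, []), state.get(key))
        if updated is not None:  # returning None drops the key
            new_state[key] = updated
    return new_state


# Assumed input: the same (word, 1) pairs are sent in two micro-batches.
batch = [(word, 1) for word in "I am great".split()]

state = update_state_by_key({}, batch, update_count)
print(sorted(state.items()))  # [('I', 1), ('am', 1), ('great', 1)]

state = update_state_by_key(state, batch, update_count)
print(sorted(state.items()))  # [('I', 2), ('am', 2), ('great', 2)]
```

Note that the loop visits every key already in the state, which mirrors the point above that the update function is applied to all existing keys in every batch.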

mapWithState

The mapWithState operation takes an instance of StateSpec, which you create with the factory method StateSpec.function() and then use to set all the specifications of mapWithState. StateSpec.function takes a mapping function as a parameter, so you will have to define a mapping function in your code.

Look at the following example to understand how to use mapWithState:
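
To illustrate what a mapping function does, here is a plain-Python model of the semantics, not Spark API code. The `State` class is a minimal stand-in modeled on the exists/get/update/remove methods of Spark's State object, while `map_with_state` and `running_count` are hypothetical names. The sketch assumes the mapping function is called once per incoming record and ignores timeouts.

```python
_MISSING = object()  # sentinel meaning "no state stored for this key"


class State:
    """Minimal stand-in for a per-key state handle."""

    def __init__(self, value=_MISSING):
        self._value = value

    def exists(self):
        return self._value is not _MISSING

    def get(self):
        if not self.exists():
            raise KeyError("state does not exist")
        return self._value

    def update(self, value):
        self._value = value

    def remove(self):
        self._value = _MISSING


def running_count(key, value, state):
    """Mapping function: add the new value to the stored count and emit it."""
    total = (state.get() if state.exists() else 0) + value
    state.update(total)
    return (key, total)


def map_with_state(store, batch_pairs, mapping_func):
    """Hypothetical helper modeling one micro-batch of mapWithState."""
    emitted = []
    for key, value in batch_pairs:
        state = State(store[key]) if key in store else State()
        emitted.append(mapping_func(key, value, state))
        # Persist whatever the mapping function left in the state handle.
        if state.exists():
            store[key] = state.get()
        else:
            store.pop(key, None)
    return emitted


store = {}
batch = [(word, 1) for word in "I am great".split()]
print(map_with_state(store, batch, running_count))  # [('I', 1), ('am', 1), ('great', 1)]
print(map_with_state(store, batch, running_count))  # [('I', 2), ('am', 2), ('great', 2)]
print(map_with_state(store, [("great", 1)], running_count))  # [('great', 3)]
```

In this model only the keys present in the current batch are touched, which is why the last call emits a single record.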

Checkpointing

We use checkpointing to provide fault tolerance. With checkpointing, the intermediate values are written to storage, preferably fault-tolerant storage such as HDFS. In stateful streaming, checkpointing must be enabled on the streaming context so that the state can be restored when needed.
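
The idea of saving state and restoring it later can be sketched in plain Python. This is not how Spark implements checkpointing (in Spark you enable it by calling checkpoint(directory) on the streaming context); `save_checkpoint` and `load_checkpoint` are hypothetical helpers, and a temporary local directory stands in for fault-tolerant storage.

```python
import json
import os
import tempfile


def save_checkpoint(directory, state):
    """Write to a temp file, then rename, so a partial file is never read."""
    os.makedirs(directory, exist_ok=True)
    final_path = os.path.join(directory, "state.json")
    temp_path = final_path + ".tmp"
    with open(temp_path, "w") as handle:
        json.dump(state, handle)
    os.replace(temp_path, final_path)


def load_checkpoint(directory):
    """Return the last saved state, or an empty state if none exists."""
    path = os.path.join(directory, "state.json")
    if not os.path.exists(path):
        return {}
    with open(path) as handle:
        return json.load(handle)


# A local temp directory stands in for a location such as HDFS.
checkpoint_dir = tempfile.mkdtemp()
save_checkpoint(checkpoint_dir, {"I": 2, "am": 2, "great": 2})

# Simulate a restart: in-memory state is gone, so rebuild it from storage.
restored = load_checkpoint(checkpoint_dir)
print(restored)  # {'I': 2, 'am': 2, 'great': 2}
```

Without the saved file, a restarted job would begin again from an empty state and the running counts would be lost.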

That's all for this blog about stateful streaming in Spark. You can look for the examples here. I hope it was helpful.

Published at DZone with permission of Muskan Gupta. See the original article here.

Opinions expressed by DZone contributors are their own.
