Kafka and Spark Streams: Living Happily Ever After

Learn how to use Spark Streaming to stream, transform, and transport data between Kafka topics.

By Anmol Sarna · Apr. 02, 18 · Analysis

Today, we are going to look at how to use Spark Streaming to transform and transport data between Kafka topics.

The demand for stream processing is increasing every day, because processing big volumes of data in batches is often not enough. We need to process data in real time, especially when the volume of incoming data keeps growing and it has to be handled and maintained as it arrives.


Recently, in one of our projects, we faced exactly this requirement. Being a newbie to Apache Spark, I had only a little idea of what to do, so I decided the best option was to refer to the Apache Spark documentation. It did help me understand the basic concepts of Spark, streaming, and transporting data using streams.

To give you a heads up, Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

It provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data.

DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs (Resilient Distributed Datasets).
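
As a quick illustration of the abstraction, here is a minimal sketch adapted from the classic word-count example in the Spark documentation (independent of the Kafka pipeline built below): every batch interval produces one RDD of input lines, and the transformations are applied to each RDD in the stream.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A DStream built from a TCP socket source; each 1-second batch becomes one RDD of lines.
val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print() // prints each batch's counts to the console

ssc.start()            // start the computation
ssc.awaitTermination() // wait for it to terminate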

Although Structured Streaming was introduced in Spark 2.0, a lot of big companies still use DStreams, the original Spark Streaming API.

We were a bit confused, but after a lot of discussion, we decided to go with DStreams, simply because they seemed to be the more logical choice for our use case.
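
For comparison, this is roughly what the Kafka read we build below would look like in Structured Streaming. This is only a sketch under assumptions: it presumes Spark 2.x with the spark-sql-kafka-0-10 connector on the classpath, and the topic name is illustrative.

import org.apache.spark.sql.SparkSession

// Structured Streaming reads Kafka through the DataFrame API instead of DStreams.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("StructuredStreamingExample")
  .getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "input-topic") // illustrative topic name
  .load()

// Kafka exposes key and value as binary columns; cast them to strings before transforming.
val messages = df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")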

So, our next step was integrating Kafka with Spark Streaming.

This is where our challenge began. With the help of the documentation, I started the coding journey.

  • The very first thing we require is a SparkConf object that loads defaults from system properties and the classpath.
    val conf = new SparkConf().setMaster("local[*]")
    .setAppName("SimpleDStreamExample")
    Note: local[*] gives it access to all the available cores. It can be changed as per requirement.
  • The next important thing is the KafkaParams, which is nothing but a Map[String, Object].
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer], 
    // used as i am using a string serializer for the input kafka topic
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "spark-demo",
      "kafka.consumer.id" -> "kafka-consumer-01"
    )
    These are required by the input stream to know the whereabouts of the input data and how to fetch it.
  • Next, we require a StreamingContext that will take SparkConf as one of the arguments:
    val ssc = new StreamingContext(conf, Seconds(1))
  • Once we are through with StreamingContext, we now require an input stream that will actually stream the data (requires KafkaParams):
    val inputStream = KafkaUtils.createDirectStream(ssc,
      PreferConsistent, Subscribe[String, String](Array(inputTopic), kafkaParams))
    This stream now contains the consumer records, i.e., the records from the input topic. I am not dealing with state management or checkpointing here, because that is a separate topic altogether and can be integrated on its own.
  • Once we have inputStream, we can perform the required operations on that stream (operations like flatMap, map, filter, etc.) to get a processed stream, i.e. a stream that now contains the modified/transformed data.
    val processedStream = inputStream.map(record => record.value) // any operation can be performed here

    // checking the batches for data
    processedStream.print()
  • Once you are done with the operations, you will have a processedStream, which we now need to store in an output Kafka topic. This is where I was not sure what to do. I went through a lot of blogs but couldn't find any decent solution; finally, after a lot of digging, I found a simple one.
    processedStream.foreachRDD(rdd => rdd.foreach {
      case data: String =>
        val message = new ProducerRecord[String, String](outputTopic, data)
        producer.send(message).get()
    })

    // producer here is the Kafka producer and outputTopic is the Kafka topic where we need
    // to store the processed data; the producer has to be usable on the executors that run
    // the foreach (see the consolidated sketch after this list)
  • Finally, once you start the streaming context with ssc.start() and block on ssc.awaitTermination(), the messages in the processed stream will be written to the output Kafka topic by the producer, as shown in the consolidated sketch below.
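
Putting the steps together, here is a consolidated, runnable sketch of the whole pipeline. Treat it as a sketch under assumptions rather than the exact code from our project: the topic names, the producer configuration, and the creation of the producer inside foreachPartition (done because KafkaProducer is not serializable and cannot be shipped from the driver) are illustrative choices.

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object SimpleDStreamExample extends App {
  val inputTopic = "input-topic"   // illustrative topic names
  val outputTopic = "output-topic"

  val conf = new SparkConf().setMaster("local[*]").setAppName("SimpleDStreamExample")
  val ssc = new StreamingContext(conf, Seconds(1))

  val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "spark-demo"
  )

  // Direct stream over the input topic; each batch is a set of ConsumerRecords.
  val inputStream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Array(inputTopic), kafkaParams))

  // Transform the stream; here we simply extract the message value.
  val processedStream = inputStream.map(record => record.value)
  processedStream.print()

  // Write every batch back to Kafka. The producer is created per partition on the executors.
  processedStream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      val producer = new KafkaProducer[String, String](props)
      partition.foreach(data => producer.send(new ProducerRecord[String, String](outputTopic, data)))
      producer.close()
    }
  }

  ssc.start()
  ssc.awaitTermination()
}

Creating the producer inside foreachPartition keeps it out of the task closure and amortizes the connection cost over all the records in a partition.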

A piece of cake. Right?


We have also added our code for reference, which you can find here.

To know more about Spark Streaming and its integration with Kafka, you can refer to the documentation here.

In this article, we tried to understand how we can use Spark Streaming to stream data between Kafka topics. I hope you enjoyed reading this article. Stay tuned for more.


Published at DZone with permission of Anmol Sarna, DZone MVB. See the original article here.
