
How to Use the Kafka Streams API

The Kafka Streams API allows you to create real-time applications that power your core business. Here's everything you need to know about it!

By Anuj Saxena · Jul. 23, 17 · Tutorial

Whenever we hear the word "Kafka," we usually think of it as a messaging system with a publisher-subscriber model that our streaming applications use as a source and a sink.

We could say that Kafka on its own is just a dumb storage system: it stores the data provided by a producer for a long (configurable) time and serves it to consumers (from a topic, of course).

Between receiving data from a producer and delivering it to a consumer, we can't do anything with this data inside Kafka itself. So we reach for other tools, like Spark or Storm, to process the data between producers and consumers. That means building two separate clusters for our app: a Kafka cluster that stores our data and another cluster that does stream processing on it.

To save us from this hassle, the Kafka Streams API comes to our rescue. With it, we get a unified setup: stream processing runs in our own application directly against the Kafka cluster, with no separate processing cluster to operate. And thanks to this tight integration, we get all the support Kafka provides (for example, a topic partition becomes a stream partition for parallel processing).

What Is the Kafka Streams API?

The Kafka Streams API allows you to create real-time applications that power your core business. It is the easiest-to-use yet most powerful technology for processing data stored in Kafka, and it is built on top of Kafka's standard producer and consumer clients.

A unique feature of the Kafka Streams API is that the applications you build with it are normal applications. These applications can be packaged, deployed, and monitored like any other application, with no need to install separate processing clusters or similar special-purpose and expensive infrastructure!

[Figure: a Kafka Streams application is a normal application running alongside the Kafka cluster. Source: Apache Kafka documentation.]

Features Brief

The features provided by Kafka Streams:

  • Highly scalable, elastic, distributed, and fault-tolerant applications.
  • Stateful and stateless processing.
  • Event-time processing with windowing, joins, and aggregations (a windowed-count sketch follows this list).
  • The most common transformation operations are already defined in the Kafka Streams DSL, while the lower-level Processor API lets us define and connect custom processors.
  • Low barrier to entry: it does not take much configuration or setup to run a small-scale trial of stream processing; the rest depends on your use case.
  • No separate cluster required for processing (it is integrated with Kafka).
  • Employs one-record-at-a-time processing to achieve millisecond processing latency, and supports event-time-based windowing operations with late-arriving records.
  • Supports Kafka Connect to connect to different applications and databases.
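
To make the event-time windowing and aggregation point concrete, here is a minimal sketch of a per-key, per-minute count using the 0.11.0.0 DSL. The topic name, store name, and string key/value types are assumptions for illustration, not taken from the article.

import org.apache.kafka.streams.kstream.{KStreamBuilder, TimeWindows}

val builder = new KStreamBuilder
// Count how many records arrive per key in one-minute windows
// ("Clicks" and the store name are hypothetical).
val clicks = builder.stream[String, String]("Clicks")
val clicksPerMinute = clicks
  .groupByKey()
  .count(TimeWindows.of(60 * 1000L), "clicks-per-minute-store")

The result is a changelog (KTable) keyed by windowed keys, which can be written to a topic or queried from the local state store.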

Streams

A stream is the most important abstraction provided by Kafka Streams. It represents an unbounded, continuously updating data set. A stream is an ordered, replayable, and fault-tolerant sequence of immutable data records, where a data record is defined as a key-value pair. It can be considered as either a record stream (defined as KStream) or a changelog stream (defined as KTable or GlobalKTable).

Stream Processor

A stream processor is a node in the processor topology. It represents a processing step in the topology (transforming the data). A node is basically the processing logic that we want to apply to streaming data.

[Figure: a processor topology with source processors, stream processors, and sink processors. Source: Apache Kafka documentation.]

As the figure shows, a source processor is a processor with no upstream processors, and a sink processor is a processor with no downstream processors.

Processing in Kafka Streams

The aim of this processing is to enable processing of data that is consumed from Kafka and written back into Kafka. There are two options available for processing stream data:

  1. The high-level Kafka Streams DSL.
  2. The lower-level Processor API, which provides APIs for data processing, composable processing, and local state storage.

1. High-Level DSL

The high-level DSL contains already-implemented methods that are ready to use. It is composed of two main abstractions: KStream and KTable (or GlobalKTable).

KStream

A KStream is an abstraction of a record stream, where each data record is a simple key-value pair in an unbounded data set. It provides many functional ways to manipulate stream data, such as:

  • map
  • mapValues
  • flatMap
  • flatMapValues
  • filter

It also provides joining methods for joining multiple streams and aggregation methods on stream data.
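
As a rough illustration of chaining a few of these DSL methods (the topic names and string key/value types are assumptions, and this sketch is not code from the article):

import org.apache.kafka.streams.kstream.{KStream, KStreamBuilder}

val builder = new KStreamBuilder
val lines: KStream[String, String] = builder.stream[String, String]("TextLines")

// Drop empty records, upper-case the values, then split each value into words.
val words: KStream[String, String] = lines
  .filter((_: String, value: String) => value != null && value.nonEmpty)
  .mapValues[String]((value: String) => value.toUpperCase)
  .flatMapValues[String]((value: String) => java.util.Arrays.asList(value.split("\\s+"): _*))

words.to("Words")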

KTable or GlobalKTable

A KTable is an abstraction of a changelog stream. In this changelog, every data record is considered an insert or an update (an upsert) depending on whether the key already exists: any existing row with the same key is overwritten.
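
For example (an illustrative sketch with hypothetical topic and store names), if the records ("alice", "1"), ("bob", "5"), and then ("alice", "3") arrive on a topic, a KStream over that topic keeps all three as independent records, while a KTable built from the same topic retains only ("alice", "3") and ("bob", "5"):

import org.apache.kafka.streams.kstream.{KStreamBuilder, KTable}

val builder = new KStreamBuilder
// Read the topic as a changelog: the later "alice" record upserts the earlier one.
val userScores: KTable[String, String] = builder.table("UserScores", "user-scores-store")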

2. Processor API

The low-level Processor API provides a client to access stream data, perform our business logic on incoming records, and send the results downstream. This is done by extending the abstract class AbstractProcessor and overriding its process method, which contains our logic. The process method is called once for every key-value pair.

Where the high-level DSL provides ready-to-use methods in a functional style, the low-level Processor API gives you the flexibility to implement processing logic according to your needs. The trade-off is simply the extra lines of code you need to write for your specific scenario.
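
A minimal sketch of this approach against Kafka 0.11.0.0 (the processor class, node names, and topic names below are illustrative, not from the article):

import org.apache.kafka.streams.processor.{AbstractProcessor, ProcessorSupplier, TopologyBuilder}

// A custom processor: upper-cases each value and forwards it downstream.
class UpperCaseProcessor extends AbstractProcessor[String, String] {
  override def process(key: String, value: String): Unit =
    context().forward(key, value.toUpperCase)
}

val topology = new TopologyBuilder
topology
  .addSource("Source", "InTopic")                  // read from the input topic
  .addProcessor("Process", new ProcessorSupplier[String, String] {
    override def get() = new UpperCaseProcessor
  }, "Source")                                     // attach our processor to the source node
  .addSink("Sink", "OutTopic", "Process")          // write results to the output topic

// The topology is then passed to new KafkaStreams(topology, streamsConfiguration) and started.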

Code in Action: Quickstart

To start working with Kafka Streams, the following dependency must be added to the SBT project:

libraryDependencies += "org.apache.kafka" % "kafka-streams" % "0.11.0.0"

The following imports are required for the application:

import java.util.Properties

import org.apache.kafka.common.serialization.{Serde, Serdes}
import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.StreamsConfig._
import org.apache.kafka.streams.kstream.{KStream, KStreamBuilder}

Next, we have to set up some configuration properties for Kafka Streams:

val streamsConfiguration = new Properties()
// A unique name for this stream processing application within the Kafka cluster
streamsConfiguration.put(APPLICATION_ID_CONFIG, "Streaming-QuickStart")
// Kafka broker(s) to connect to
streamsConfiguration.put(BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
// Default serializers/deserializers for record keys and values
streamsConfiguration.put(DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)
streamsConfiguration.put(DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String.getClass.getName)

Now we have to create an instance of KStreamBuilder that provides us with a KStream object:

val builder = new KStreamBuilder

The builder object has a stream method that takes a topic name and returns a KStream instance subscribed to that specific topic:

val kStream: KStream[String, String] = builder.stream("InTopic")

On this kStream object, we can use many of the methods provided by the high-level DSL of Kafka Streams, such as map, process, transform, and join, each of which returns another KStream with that method applied. The next step is to send this processed data to another topic:

val upperCaseKStream = kStream.mapValues(_.toUpperCase)
//characters of values are now converted to upper case
upperCaseKStream.to("OutTopic")
//sending data to out topic

The final step is to start the stream processing. For this, we use the builder and the streams configuration we created:

val stream = new KafkaStreams(builder, streamsConfiguration)
stream.start()
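
When the application shuts down, the streams instance should be closed as well. A minimal way to do this in a standalone app (an assumption on my part, not shown in the original) is a shutdown hook:

// Close the streams instance cleanly when the JVM shuts down
sys.addShutdownHook {
  stream.close()
}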

This is a simple example of the high-level DSL. For a fuller picture, here are two examples: one demonstrates using Kafka Streams to combine data from two streams (different topics) and send it to a single stream (topic) with the high-level DSL, and the other filters data with stateful operations using the low-level Processor API. Here is the link to the code repository.
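
A rough sketch of the first idea, combining records from two topics into one output topic (the topic names are placeholders, and this is not the repository code):

import org.apache.kafka.streams.kstream.{KStream, KStreamBuilder}

val builder = new KStreamBuilder
// Subscribing to several topics at once yields a single KStream containing
// records from all of them.
val combined: KStream[String, String] = builder.stream[String, String]("TopicA", "TopicB")
combined.to("CombinedTopic")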

Conclusion

With Kafka Streams, we can process stream data within Kafka itself; no separate cluster is required just for processing. The high-level DSL makes it much easier to use, but it restricts how the user can process data. For those situations, we use the lower-level Processor API.

I hope this article was of some help!

References

  1. Apache Kafka documentation on Streams
  2. Confluent documentation on Streams

Published at DZone with permission of Anuj Saxena, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
