An Introduction to Kafka Streams


Kafka Streams is a library for building streaming apps that transform input Kafka topics into output Kafka topics. In this article, learn how to implement Kafka Streams.


If you work with large amounts of data, you may have heard of Kafka. At a very high level, Kafka is a fault-tolerant, distributed publish-subscribe messaging system designed for fast data processing, able to handle hundreds of thousands of messages per second.

What Is Stream Processing?

Stream processing is the real-time processing of data continuously, concurrently, and in a record-by-record fashion.

Real-Time Processing

Kafka has many applications, one of which is real-time processing.

Let's first understand what real-time processing actually involves. In simple terms, it means working on a continuous stream of data, performing some form of analysis on it, and producing useful results. In terms of Kafka, real-time processing typically involves reading data from a topic (source), doing some analysis or transformation work, and then writing the results back to another topic (sink). Currently, our options for this type of work are:

  • Write your own custom code with a KafkaConsumer to read the data and write it back out via a KafkaProducer (a minimal sketch follows this list).

  • Use a full-fledged stream processing framework like Spark Streaming, Flink, Storm, etc.
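For context, the first option might look roughly like the sketch below: a hand-rolled consume-transform-produce loop built directly on the consumer and producer clients. The topic names, group ID, and uppercase transformation are illustrative placeholders, not taken from the article.

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PlainConsumeTransformProduce {

    public static void main(String[] args) {
        // Shared client configuration; each client ignores the keys it doesn't use.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "plain-processing-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {

            consumer.subscribe(Collections.singletonList("source-topic"));

            // The consume-transform-produce loop that Kafka Streams abstracts away.
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100);
                for (ConsumerRecord<String, String> record : records) {
                    // An arbitrary transformation, standing in for real analysis.
                    String transformed = record.value().toUpperCase();
                    producer.send(new ProducerRecord<>("sink-topic", record.key(), transformed));
                }
            }
        }
    }
}

Everything beyond this loop — partitioning work across instances, handling failures, managing state — is left to you, which is exactly what Kafka Streams packages behind a smaller API.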

Now, we'll learn about an alternative to the above options: Kafka Streams.

What Is Kafka Streams?

Kafka Streams is a library for building streaming applications, specifically applications that transform input Kafka topics into output Kafka topics (or call external services, update databases, etc.). Kafka Streams allows you to do this with concise code in a way that is distributed and fault-tolerant.

Implementing Kafka Streams

A stream processing application built with Kafka Streams looks like this.

Providing Stream configurations:

// Configure the application ID, the Kafka broker address, and the default
// serializers/deserializers (serdes) for record keys and values.
Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "Streaming-QuickStart");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
streamsConfiguration.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

Getting topic and serdes:

// configReader is the author's helper for reading topic names from configuration.
String topic = configReader.getKStreamTopic();
String producerTopic = configReader.getKafkaTopic();

// Serdes for the key and value types that will flow through the topology.
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

Building the Stream and fetching data:

// Build the topology and create a KStream that consumes from the source topic.
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> inputStreamData = builder.stream(stringSerde, stringSerde, producerTopic);

Processing the Stream:

// Map each record value to its length. String.length() returns an int,
// so cast to long to match the declared KStream<String, Long>.
KStream<String, Long> processedStream = inputStreamData.mapValues(record -> (long) record.length());

Besides join and aggregate operations, there is a list of other transformation operations provided for KStream. Each of these operations may generate one or more KStream objects and is translated into one or more connected processors in the underlying processor topology. All of these transformation methods can be chained together to compose a complex processor topology.

Among these transformations, filter, map, mapValues, etc., are stateless transformation operations to which users can pass a customized function as a parameter, such as a Predicate for filter or a KeyValueMapper for map.
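As an illustration, a Predicate can be passed to filter as a lambda and chained with mapValues. The sketch below builds on inputStreamData from the earlier snippet; the ten-character threshold is an arbitrary example value, not from the article.

// Chain stateless operations: keep only values longer than 10 characters
// (arbitrary example threshold), then map each surviving value to its length.
KStream<String, Long> longValues = inputStreamData
        .filter((key, value) -> value.length() > 10)   // Predicate<String, String>
        .mapValues(value -> (long) value.length());    // ValueMapper<String, Long>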

Writing Streams back to Kafka:

// Write the processed stream to the output topic, using the matching serdes.
processedStream.to(stringSerde, longSerde, topic);

At this point, the internal structures have been initialized, but the processing has not started yet. You have to explicitly start the Kafka Streams threads by calling the start() method:

// Create the KafkaStreams instance from the topology and configuration,
// then start the stream processing threads.
KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration);
streams.start();

Finally, the last step is closing the Stream once the application is done with it, as sketched below.
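In code, this means calling close() on the KafkaStreams instance. Registering it in a JVM shutdown hook, as in this minimal sketch, is a common pattern; the hook is an assumption here, not part of the original snippets.

// Ensure the stream is closed cleanly when the JVM shuts down (e.g. on Ctrl+C).
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));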

I hope that this article has helped you get a quick start with Kafka Streams. Happy coding!


Published at DZone with permission of Sangeeta Gulia, DZone MVB.