If you are working on a huge amount of data, you may have heard about Kafka. At a very high level, Kafka is a fault tolerant, distributed publish-subscribe messaging system that is designed for fast processing of data and that has the ability to handle hundreds of thousands of messages.
What Is Stream Processing?
Stream processing is the real-time processing of data continuously, concurrently, and in a record-by-record fashion.
Kafka has many applications, one of which is real-time processing.
Let's first understand what we actually do in real-time processing. In simple words, we all know that it includes a continuous stream of data. Some form of analysis is done and we get some useful data out of it. In terms of Kafka, real-time processing typically involves reading data from a topic (source), doing some analysis or transformation work, and then writing the results back to another topic (sink). Currently, to do this type of work, our options are:
Write your own custom code with a
KafkaConsumerto read the data and write that data via a
Use a full-fledged stream processing framework like Spark Streaming, Flink, Storm, etc.
Now, we'll learn about an alternative to the above options: Kafka Streams.
What Is Kafka Streams?
Kafka Streams is a library for building streaming applications, specifically applications that transform input Kafka topics into output Kafka topics (or call external services, update databases, etc.). Kafka Streams allows you do this with concise code in a way that is distributed and fault-tolerant.
Implementing Kafka Streams
A stream processing application built with Kafka Streams looks like this.
Providing Stream configurations:
Properties streamsConfiguration = new Properties(); streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "Streaming-QuickStart"); streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); streamsConfiguration.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName()); streamsConfiguration.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
String topic = configReader.getKStreamTopic(); String producerTopic = configReader.getKafkaTopic(); final Serde stringSerde = Serdes.String(); final Serde longSerde = Serdes.Long();
Building the Stream and fetching data.
KStreamBuilder builder = new KStreamBuilder(); KStream<String, String> inputStreamData = builder.stream(stringSerde, stringSerde, producerTopic);
Processing of Stream:
KStream<String, Long> processedStream = inputStreamData.mapValues(record -> record.length() )
aggregate operations, there is a list of other transformation operations provided for KStream. Each of these operations may generate either one or more KStream objects and can be translated into one or more connected processors into the underlying processor topology. All of these transformation methods can be chained together to compose a complex processor topology.
Among these transformations,
mapValues, etc., are stateless transformation operations with which users can pass a customized function as a parameter, such as
map, etc. as per their usage in a language.
Writing Streams back to Kafka:
processedStream.to(stringSerde, longSerde, topic);
At this point, internal structures have been initialized, but the processing is not started yet. You have to explicitly start the Kafka Streams thread by calling the
KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration); streams.start();
Finally, the last step is closing the Stream.
I hope that this article has helped you get a quick start with Kafka Streams. Happy coding!