JVM Advent Calendar: Frameworks for Big Data Processing in Java
Check out this post to learn more about the top libraries and frameworks for big data processing in Java.
The concept of big data is understood differently across the many domains where companies face increasing volumes of data. In most of these scenarios, the system under consideration must be designed so that it can process that data without sacrificing throughput as the data grows in size.
This essentially leads to the need for highly scalable systems, where more resources can be allocated based on the volume of data that needs to be processed at a given point in time.
Building such a system is a time-consuming and complex activity, and for that reason, third-party frameworks and libraries can be used to provide the scalability requirements out of the box. There are already a number of good choices for Java applications; in this article, we will briefly discuss some of the most popular ones.
The Frameworks in Action
We are going to demonstrate each of the frameworks by implementing a simple pipeline for processing data from devices that measure the air quality index for a given area. For simplicity, we will assume that numeric data from the devices is received either in batches or in a streaming fashion. Throughout the examples, we are going to use the THRESHOLD constant to denote the value above which we consider an area polluted.
In Spark, we first need to convert the data into a proper format. We are going to use Datasets, but we could also choose DataFrames or RDDs (Resilient Distributed Datasets) as the data representation. We can then apply a number of Spark transformations and actions in order to process the data in a distributed fashion:
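A minimal sketch of what this could look like (the class name, the sample readings, the `local[*]` master, and a THRESHOLD of 50 are illustrative assumptions, not part of the original listing):

```java
import java.util.Arrays;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class SparkAqi {
    // Illustrative threshold above which an area counts as polluted
    private static final int THRESHOLD = 50;

    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
                .appName("AirQualityIndex")
                .master("local[*]") // local Spark instance for demonstration
                .getOrCreate();

        // Sample air quality index readings received from the devices
        Dataset<Integer> aqi = session.createDataset(
                Arrays.asList(10, 40, 60, 80, 30), Encoders.INT());

        // Filter the readings that indicate a polluted area and count them
        long polluted = aqi
                .filter((FilterFunction<Integer>) value -> value > THRESHOLD)
                .count();
        System.out.println("Polluted areas: " + polluted);

        session.stop();
    }
}
```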
If we want to change the above application to read data from an external source, write to an external data source, and run it on a Spark cluster rather than a local Spark instance, we would have the following execution flow:
The Spark driver might be either a separate instance or part of the Spark cluster.
Similarly to Spark, we need to represent the data in a Flink DataSet and then apply the necessary transformations and actions to it:
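A sketch under the same assumptions as before (illustrative class name, sample readings, and THRESHOLD; the Flink DataSet API is assumed on the classpath):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class FlinkAqi {
    // Illustrative threshold above which an area counts as polluted
    private static final int THRESHOLD = 50;

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Sample air quality index readings received from the devices
        DataSet<Integer> aqi = env.fromElements(10, 40, 60, 80, 30);

        // Filter the readings that indicate a polluted area;
        // count() is an action that triggers the distributed execution
        long polluted = aqi.filter(value -> value > THRESHOLD).count();
        System.out.println("Polluted areas: " + polluted);
    }
}
```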
If we want to change the above application to read data from an external source, write to an external data source, and run it on a Flink cluster, we would have the following execution flow:
The Flink client where the application is submitted to the Flink cluster is either the Flink CLI utility or JobManager's UI.
In Storm, the data pipeline is created as a topology of Spouts (the sources of data) and Bolts (the data processing units). Since Storm typically processes unbounded streams of data, we will emulate the processing of an array of air quality index numbers as a bounded stream:
We have one spout that provides a data source for the array of air quality index numbers and one bolt that filters only the ones that indicate polluted areas:
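A sketch of such a topology (Storm 2.x is assumed, where LocalCluster is AutoCloseable; the class names, sample readings, and THRESHOLD are illustrative):

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class StormAqiTopology {
    // Illustrative threshold above which an area counts as polluted
    private static final int THRESHOLD = 50;

    // Spout that emits a fixed array of readings as a bounded stream
    public static class AqiSpout extends BaseRichSpout {
        private final int[] readings = {10, 40, 60, 80, 30};
        private int index;
        private SpoutOutputCollector collector;

        public void open(Map<String, Object> conf, TopologyContext ctx,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            if (index < readings.length) {
                collector.emit(new Values(readings[index++]));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("aqi"));
        }
    }

    // Bolt that keeps only the readings above the pollution threshold
    public static class FilterBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            int value = tuple.getIntegerByField("aqi");
            if (value > THRESHOLD) {
                System.out.println("Polluted area reading: " + value);
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("aqi-spout", new AqiSpout());
        builder.setBolt("filter-bolt", new FilterBolt()).shuffleGrouping("aqi-spout");

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("aqi-topology", new Config(), builder.createTopology());
            Thread.sleep(5000); // let the bounded stream drain before shutting down
        }
    }
}
```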
We are using a LocalCluster instance to submit the topology to a local Storm cluster, which is convenient for development purposes. If we instead want to submit the Storm topology to a production cluster, we would have the following execution flow:
In Ignite, we first need to put the data in the distributed cache before running the data processing pipeline, which takes the form of an SQL query executed in a distributed fashion over the Ignite cluster:
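A sketch of this approach (the cache name, sample readings, and THRESHOLD are illustrative; SQL querying over the cached values requires the indexed types to be configured):

```java
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.configuration.CacheConfiguration;

public class IgniteAqi {
    // Illustrative threshold above which an area counts as polluted
    private static final int THRESHOLD = 50;

    public static void main(String[] args) {
        try (Ignite ignite = Ignition.start()) {
            // Enable SQL queries over the Integer values in the cache
            CacheConfiguration<Integer, Integer> cfg = new CacheConfiguration<>("aqi");
            cfg.setIndexedTypes(Integer.class, Integer.class);
            IgniteCache<Integer, Integer> cache = ignite.getOrCreateCache(cfg);

            // Put the sample readings in the distributed cache first
            int[] readings = {10, 40, 60, 80, 30};
            for (int i = 0; i < readings.length; i++) {
                cache.put(i, readings[i]);
            }

            // The SQL query is executed in a distributed fashion over the cluster
            List<List<?>> result = cache.query(new SqlFieldsQuery(
                    "SELECT COUNT(*) FROM Integer WHERE _val > ?")
                    .setArgs(THRESHOLD)).getAll();
            System.out.println("Polluted areas: " + result.get(0).get(0));
        }
    }
}
```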
If we want to run the application in an Ignite cluster, it will have the following execution flow:
Hazelcast Jet works on top of Hazelcast IMDG, and similarly to Ignite, if we want to process data, we need first to put it in the Hazelcast IMDG cluster:
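A sketch of such a pipeline, assuming the Jet 3.x API (where sources and sinks are wired with drawFrom/drainTo); the list names, sample readings, and THRESHOLD are illustrative:

```java
import java.util.Arrays;
import com.hazelcast.core.IList;
import com.hazelcast.jet.Jet;
import com.hazelcast.jet.JetInstance;
import com.hazelcast.jet.pipeline.Pipeline;
import com.hazelcast.jet.pipeline.Sinks;
import com.hazelcast.jet.pipeline.Sources;

public class JetAqi {
    // Illustrative threshold above which an area counts as polluted
    private static final int THRESHOLD = 50;

    public static void main(String[] args) {
        JetInstance jet = Jet.newJetInstance();
        try {
            // Put the sample readings in the Hazelcast IMDG cluster first
            IList<Integer> readings = jet.getList("aqi");
            readings.addAll(Arrays.asList(10, 40, 60, 80, 30));

            // Read from the distributed list, filter, and write the result back
            Pipeline pipeline = Pipeline.create();
            pipeline.drawFrom(Sources.<Integer>list("aqi"))
                    .filter(value -> value > THRESHOLD)
                    .drainTo(Sinks.list("polluted"));

            jet.newJob(pipeline).join();
            System.out.println("Polluted areas: " + jet.getList("polluted").size());
        } finally {
            Jet.shutdownAll();
        }
    }
}
```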
Note, however, that Jet also provides integration with external data sources, so the data does not need to be stored in the IMDG cluster. You can also perform the aggregation without first storing the data in a list (review the full example on GitHub, which contains the improved version). Thanks to Jaromir and Can from the Hazelcast engineering team for the valuable input!
If we want to run the application in a Hazelcast Jet cluster, it will have the following execution flow:
Kafka Streams is a client library that uses Kafka topics as sources and sinks for the data processing pipeline. To make use of the Kafka Streams library for our scenario, we put the air quality index numbers in a numbers Kafka topic:
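A sketch of the streams application (the application id, the broker address, the "polluted" output topic, and THRESHOLD are illustrative assumptions; only the "numbers" input topic comes from the text):

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class KafkaStreamsAqi {
    // Illustrative threshold above which an area counts as polluted
    private static final int THRESHOLD = 50;

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "aqi-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.Integer().getClass());

        // Read from the "numbers" topic, keep the polluted-area readings,
        // and write them to an illustrative "polluted" topic
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Integer> aqi = builder.stream("numbers");
        aqi.filter((key, value) -> value > THRESHOLD).to("polluted");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```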
We will have the following execution flow for our Kafka Streams application instances:
Apache Pulsar Functions are lightweight compute processes that work in a serverless fashion along with an Apache Pulsar cluster. Assuming we are streaming our air quality index in a Pulsar cluster, we can write a function to count the number of indexes that exceed the given threshold and write the result back to Pulsar as follows:
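A sketch of such a function using the Pulsar Functions Java API (the class name, the "polluted-count" output topic, the "polluted" state counter, and THRESHOLD are illustrative assumptions):

```java
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

public class PollutedAreaCounter implements Function<Integer, Void> {
    // Illustrative threshold above which an area counts as polluted
    private static final int THRESHOLD = 50;

    @Override
    public Void process(Integer value, Context context) throws Exception {
        if (value > THRESHOLD) {
            // Keep a running count in the function's state store
            context.incrCounter("polluted", 1);

            // Publish the current count back to a Pulsar topic
            context.newOutputMessage("polluted-count", Schema.INT64)
                   .value(context.getCounter("polluted"))
                   .send();
        }
        return null;
    }
}
```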
The execution flow of the function along with a Pulsar cluster is the following:
The Pulsar function can run either in the Pulsar cluster or as a separate application.
In this article, we reviewed briefly some of the most popular frameworks that can be used to implement big data processing systems in Java. Each of the presented frameworks is fairly big and deserves a separate article on its own.
Although quite simple, our air quality index data pipeline demonstrates how these frameworks operate, and you can use it as a basis for expanding your knowledge of whichever of them is of further interest to you.
You can review the complete code samples here.
Want to write for the Java Advent blog? We are looking for contributors to fill all 24 slots and would love to have your contribution! Contact the Java Advent Admin at firstname.lastname@example.org!
Published at DZone with permission of Martin Toshev. See the original article here.
Opinions expressed by DZone contributors are their own.