Kafka Hadoop Integration
Learn what Hadoop is, explore its main components, and see how Kafka integrates with Hadoop.
What Is Hadoop?
Hadoop is a large-scale distributed batch processing framework that parallelizes data processing across many nodes and addresses the challenges of distributed computing, including big data.
Hadoop works on the principle of the MapReduce framework, which was introduced by Google and offers a simple interface for parallelizing and distributing large-scale computations. In addition, Hadoop has its own distributed filesystem, which we call HDFS (Hadoop Distributed File System). HDFS splits the data into small pieces (called blocks) and distributes them to the nodes of a typical Hadoop cluster. It also replicates these blocks and stores the copies on other nodes, so that the data remains available from another node if any node is down.
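The block splitting and replication described above can be sketched in a few lines of Python. This is only an illustration, not how HDFS is implemented: the function names, the tiny block size, the node names, and the round-robin placement are all assumptions made for the example (real HDFS uses far larger blocks and its own placement policy).

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption made for this sketch.
    """
    if not 1 <= replication <= len(nodes):
        raise ValueError("replication must be between 1 and the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_replicas(placement: Dict[int, List[str]], failed_node: str) -> Dict[int, List[str]]:
    """Return, per block, the nodes that still hold a copy after one node fails."""
    return {
        block_id: [node for node in holders if node != failed_node]
        for block_id, holders in placement.items()
    }


# Hypothetical three-node cluster with two copies of every block.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3"], replication=2)
after_failure = surviving_replicas(placement, failed_node="node2")
```

With two replicas per block, every block in `after_failure` still lists at least one node, which is the availability property the paragraph above describes.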
Figure: high-level view of a multi-node Hadoop cluster.
Main Components of Hadoop
The following are the main Hadoop components:
Namenode: The single point of interaction for HDFS. It keeps the information about the small pieces (blocks) of data that are distributed among the nodes.
Secondary Namenode: Stores the edit logs, which are used to restore the latest updated state of HDFS in case of a namenode failure.
Datanode: Keeps the actual data, which the namenode distributes in blocks, and also keeps replicated copies of data from other nodes.
Job Tracker: Splits MapReduce jobs into smaller tasks.
Task Tracker: Executes the tasks split by the job tracker.
Note that the task trackers and the data nodes share the same machines.
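To make the division of labor between the job tracker and the task trackers more concrete, here is a minimal single-process sketch of a word-count job. It does not use the Hadoop API; the function names `split_job`, `run_map_task`, and `run_job` are hypothetical and exist only for this illustration.

```python
from collections import Counter
from typing import Dict, List


def split_job(lines: List[str], num_tasks: int) -> List[List[str]]:
    """Job tracker role: divide the input into one slice per task."""
    return [lines[i::num_tasks] for i in range(num_tasks)]


def run_map_task(task_lines: List[str]) -> Counter:
    """Task tracker role: count the words in a single slice."""
    counts: Counter = Counter()
    for line in task_lines:
        counts.update(line.split())
    return counts


def run_job(lines: List[str], num_tasks: int = 2) -> Dict[str, int]:
    """Run every task, then merge the partial counts (the reduce step)."""
    partial_counts = [run_map_task(task) for task in split_job(lines, num_tasks)]
    total: Counter = Counter()
    for partial in partial_counts:
        total.update(partial)
    return dict(total)


word_counts = run_job(["a b a", "b c", "a"], num_tasks=2)
```

In a real cluster the tasks would run in parallel on the machines that hold the corresponding blocks, which is why task trackers and data nodes are co-located.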
Kafka Hadoop Integration
We use Kafka, especially for real-time publish-subscribe use cases, to build a pipeline that makes data available for real-time processing or monitoring and also loads it into Hadoop, NoSQL, or data warehousing systems for offline processing and reporting.
A Hadoop producer offers a bridge for publishing data from a Hadoop cluster to Kafka, as you can see in the image below:
For a Kafka producer, Kafka topics are treated as URIs, and each URI is specified so that it connects to a specific Kafka broker.
The Hadoop producer code suggests two possible approaches for getting the data from Hadoop:
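As a rough illustration of treating a topic as a URI, the sketch below splits such a URI into broker host, port, and topic. The `kafka://host:port/topic` shape, the host name `broker1`, and the topic `clicks` are assumptions made for this example and are not taken from the Hadoop producer code.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def parse_kafka_uri(uri: str) -> Tuple[str, Optional[int], str]:
    """Split an assumed kafka://host[:port]/topic URI into (host, port, topic)."""
    parts = urlsplit(uri)
    if parts.scheme != "kafka":
        raise ValueError(f"expected a kafka:// URI, got {uri!r}")
    topic = parts.path.lstrip("/")
    if not parts.hostname or not topic:
        raise ValueError(f"URI must name both a broker and a topic: {uri!r}")
    return parts.hostname, parts.port, topic


# Hypothetical broker and topic names.
host, port, topic = parse_kafka_uri("kafka://broker1:9092/clicks")
```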
Using the Pig script and writing messages in Avro format:
In this approach, Kafka producers use Pig scripts to write data in binary Avro format, where each row represents a single message. The AvroKafkaStorage class takes the Avro schema as its first argument and then connects to the Kafka URI in order to push the data into the Kafka cluster. By using the AvroKafkaStorage producer, we can also write to multiple topics and brokers in the same Pig script-based job.
Using the Kafka OutputFormat class for jobs:
In the second approach, the Kafka OutputFormat class (which extends Hadoop’s OutputFormat class) is used to publish data to the Kafka cluster. It publishes messages as bytes by using low-level publishing methods, which offers control over the output. To write a record (message) to the Kafka cluster, the Kafka OutputFormat class uses the KafkaRecordWriter class.
In addition, for Kafka producers, we can configure the Kafka producer parameters and the Kafka broker information in the job’s configuration.
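The record-writer idea from the second approach can be sketched as follows. This is not the KafkaRecordWriter class itself: `BytesRecordWriter`, the `send_batch` callable, and the `batch_size` setting are hypothetical stand-ins, and batching is an assumption added to show the kind of control a low-level writer can offer. No real Kafka client is used; messages go to an in-memory list.

```python
from typing import Callable, List


class BytesRecordWriter:
    """Hypothetical writer that publishes each record as bytes, in small batches."""

    def __init__(self, send_batch: Callable[[List[bytes]], None], batch_size: int = 2) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self._send_batch = send_batch  # stand-in for the call that reaches a broker
        self._batch_size = batch_size
        self._buffer: List[bytes] = []

    def write(self, value: str) -> None:
        """Encode one record as bytes and send the batch once it is full."""
        self._buffer.append(value.encode("utf-8"))
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered, if anything."""
        if self._buffer:
            self._send_batch(list(self._buffer))
            self._buffer.clear()

    def close(self) -> None:
        """Flush the remaining records at the end of the task."""
        self.flush()


# In-memory list standing in for the Kafka cluster.
sent_batches: List[List[bytes]] = []
writer = BytesRecordWriter(sent_batches.append, batch_size=2)
for record in ["a", "b", "c"]:
    writer.write(record)
writer.close()
```

In a real job, values such as the batch size and the broker address would come from the job’s configuration rather than from constructor arguments.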
A Hadoop consumer is a Hadoop job that pulls data from the Kafka broker and pushes it into HDFS. The image below shows the position of a Kafka consumer in the architecture pattern:
The Hadoop job loads data from Kafka into HDFS in parallel, and the number of mappers used to load the data depends on the number of files in the input directory. The output directory contains the data coming from Kafka along with the updated topic offsets. At the end of the map task, each individual mapper writes the offset of the last consumed message to HDFS. If a job fails or gets restarted, each mapper simply restarts from the offsets stored in HDFS.
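The offset checkpointing described above can be sketched like this. A local temporary directory stands in for HDFS and a Python list stands in for a Kafka topic; the `run_mapper` function and the `mapper-<id>.offset` file naming are assumptions made for the illustration.

```python
import tempfile
from pathlib import Path
from typing import List


def offset_file(offset_dir: Path, mapper_id: int) -> Path:
    """Location of the checkpoint file for one mapper (hypothetical naming)."""
    return offset_dir / f"mapper-{mapper_id}.offset"


def run_mapper(log: List[str], mapper_id: int, offset_dir: Path, max_messages: int) -> List[str]:
    """Consume up to max_messages, starting after the last checkpointed offset."""
    path = offset_file(offset_dir, mapper_id)
    # Resume one past the last consumed offset, or from the beginning.
    start = int(path.read_text()) + 1 if path.exists() else 0
    consumed = log[start:start + max_messages]
    if consumed:
        # Checkpoint the offset of the last consumed message at the end of the task.
        path.write_text(str(start + len(consumed) - 1))
    return consumed


with tempfile.TemporaryDirectory() as tmp:
    offsets = Path(tmp)
    topic_log = ["m0", "m1", "m2", "m3", "m4"]
    first_run = run_mapper(topic_log, 0, offsets, max_messages=2)
    # A restarted job picks up where the previous run stopped.
    second_run = run_mapper(topic_log, 0, offsets, max_messages=2)
```

Because the checkpoint is written only after the messages are consumed, a mapper that fails before checkpointing would re-read the same messages on restart rather than skip them.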
Conclusion: Kafka Hadoop Integration
We have now seen how Kafka integrates with Hadoop in detail. Hope it helps! If you run into any difficulties while learning Kafka Hadoop Integration, feel free to ask questions in the comments.
Published at DZone with permission of Rinu Gour. See the original article here.
Opinions expressed by DZone contributors are their own.