People frequently ask me if it’s necessary to have Java programming skills in order to enter the exciting world of Hadoop. When I begin to explain, I’m often met with disappointment and a sense of limitation upon learning that Java and Hadoop do, in fact, go hand-in-hand. Let me start by saying that the answer to the question “Do I need to know Java to learn Hadoop?” is not a simple one. But I digress; the future of Hadoop is bright, and going forward, no requirements should be seen as limitations or roadblocks, but rather as ways to increase your expertise and become more seasoned in your work. As you make your way through this, I hope I will be able to clarify your concerns and help get you on your way to excellence within Hadoop.
To get to the bottom of this question, it’s necessary to look into the history of Hadoop. Hadoop is Apache’s open-source platform, built to store and process huge amounts of data (on the order of petabytes). The platform happens to be built in Java. (Personally, I see the language choice as merely accidental.) Hadoop was originally created as a subproject of “Nutch” (an open-source search engine). It would later go on to become a top-level Apache project. At the time this was all happening, the Hadoop developer team was more comfortable with Java than any other language.
Let’s move on to understanding the platform...
Hadoop solves huge data processing challenges through the older concept of distributed parallel processing, but approaches it in a new way. Rather than solving every problem itself, Hadoop provides a framework for developing distributed applications. It takes away the challenges of storing and processing data in a distributed environment (such as machine failures and distributed process management) through two fundamental components: HDFS for storage and MapReduce for processing.
HDFS is a distributed file system that manages data storage. It stores any given data file by splitting it into fixed-size units called “blocks.” Each block is stored on one of the machines in the cluster. HDFS provides high availability and fault tolerance through replication (think of it as duplication) of these blocks on different machines in the cluster. Despite all this underlying complexity, it presents a simple file system abstraction, so the user need not bother with how it stores and operates (unless you are an administrator).
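To make the block-and-replication idea concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes the common HDFS defaults of a 128 MB block size and 3-way replication (the actual values are configurable per cluster via `dfs.blocksize` and `dfs.replication`); the function name is mine, not part of any Hadoop API.

```python
import math

def hdfs_storage_estimate(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how HDFS would split and replicate a file.

    block_size_mb and replication mirror common HDFS defaults;
    real clusters may configure different values.
    """
    # The file is split into fixed-size blocks; the last one may be partial.
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times across different machines.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, raw_storage_mb

# A 1 GB (1024 MB) file becomes 8 blocks of 128 MB each, and with
# 3-way replication it consumes 3 GB of raw cluster storage.
blocks, raw = hdfs_storage_estimate(1024)
print(blocks, raw)  # 8 3072
```

This is also why losing a single machine doesn’t lose data: the two other replicas of each affected block still exist elsewhere in the cluster.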
MapReduce is actually a computational paradigm for processing distributed data, published by Google in 2004. Hadoop implements these concepts in its MapReduce framework (however, with the newer YARN architecture, there are also ways of processing data on Hadoop other than MapReduce). Simply put: MapReduce works hand-in-hand with HDFS by processing (based on the user’s application) the data blocks stored on HDFS, on the machines where they are located. This is called data locality (there are exceptions to this, which are beyond the scope of this discussion).
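The paradigm itself is easy to see in miniature. Below is a toy, single-process word count in plain Python: the map phase emits (key, value) pairs, a “shuffle” step groups them by key (which the Hadoop framework does for you), and the reduce phase aggregates each group. On a real cluster, many map and reduce tasks run in parallel on the machines holding the HDFS blocks; this sketch only illustrates the data flow, not the distribution.

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key -- the framework performs this step
    # between the map and reduce phases on a real Hadoop cluster.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate the values for one key: here, sum the counts.
    return (key, sum(values))

lines = ["Hadoop stores data", "Hadoop processes data"]
pairs = [p for line in lines for p in map_phase(line)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Notice that the map and reduce functions are independent per line and per key, which is exactly what lets Hadoop run them in parallel across a cluster.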
These components let you focus on your data processing logic rather than bogging you down with the complexities of a distributed environment. The MapReduce framework provides a Java programming interface for working with these components. However, programs can also be written in other languages, such as Python or Ruby, via the Hadoop Streaming API, which communicates through standard Unix streams. So it is not absolutely necessary to know Java to develop MapReduce programs.
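As a sketch of how this works in practice, here is a word-count mapper and reducer written for Hadoop Streaming in Python. Streaming pipes each input split to the mapper on stdin and the sorted mapper output to the reducer on stdin; both emit tab-separated key/value lines on stdout. The file names and jar path in the comment are illustrative, not fixed by Hadoop.

```python
import sys

# In a real job these would be two scripts (say, mapper.py and reducer.py),
# submitted with something like:
#   hadoop jar hadoop-streaming.jar -input in -output out \
#       -mapper mapper.py -reducer reducer.py
# (paths and names are illustrative).

def mapper(lines):
    # Emit "word<TAB>1" for every word on every input line.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Streaming hands the reducer its input sorted by key, so all
    # lines for one word arrive consecutively; sum them as they pass.
    current, count = None, 0
    for line in sorted_lines:
        word, value = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(value)
    if current is not None:
        yield f"{current}\t{count}"

if __name__ == "__main__":
    # Pick the role from the first argument so one file can play both parts.
    role = sys.argv[1] if len(sys.argv) > 1 else "mapper"
    step = mapper if role == "mapper" else reducer
    sys.stdout.writelines(line + "\n" for line in step(sys.stdin))
```

The key point is that neither function touches any Hadoop API: they only read and write plain text streams, which is what makes any language with stdin/stdout usable here.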
From a data processing perspective, MapReduce is too low-level to program for basic data processes like filtering, counting, grouping, joining, etc. Considering advancements in data processing, MapReduce seems elementary. This also leads to communication challenges between data analysts and programmers. To combat this, the community has developed higher-level tools such as Hive and Pig (Figure 2). Using these tools, data processing can be expressed in a more descriptive language. What these tools do behind the scenes is convert the processing instructions into low-level MapReduce programs and execute them on the Hadoop cluster. Hive uses a SQL-like language (HiveQL) to express the data processing, whereas Pig has its own language: Pig Latin. To put it simply, all of your data processing, basic to complex, can be easily expressed and achieved through Hive and Pig.
In conclusion, it’s not necessary to know Java to process your data on Hadoop (unless you want to become a committer). In fact, the programs in most major Hadoop deployments are predominantly developed in Hive or Pig rather than MapReduce.
You may occasionally need to write programs in MapReduce for performance or complexity reasons, but this situation is quite rare. If you are a data analyst, you can easily migrate to Hadoop with a little bit of learning. If you are a programmer, you should either be competent in Java or use one of the streaming-friendly languages available on Linux to write MapReduce programs. As Java is native to Hadoop, it will give you better control. If you are an administrator, you will be able to borrow and use a lot of the concepts from your previous experience.
With all that being said, it is absolutely necessary to understand how the HDFS and MapReduce engines function in order to work with Hadoop. That understanding of the internals will help you write better programs. Additionally, in my opinion, it will help to understand the MapReduce API, as well as some other concepts that are cross-referenced from other tools; these concepts are understood better through the API.
I wish all the best to those who wish to explore more about Hadoop. So happy Hadooping! I will be glad to hear your feedback and suggestions, and to answer any questions about the post.
Leave a comment below, and keep an eye out for updates on Java, Hadoop, and more!