In-Memory MapReduce and Your Hadoop Ecosystem: Part I
In-Memory MapReduce and Your Hadoop Ecosystem: Part I
Learn about in-memory MapReduce, the Ignite in-memory file system, and the Hadoop file system cache. Also learn how to install and configure Hadoop and Ignite.
Join the DZone community and get the full member experience.Join For Free
The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.
Portions of this article were taken from the book High-Performance In-Memory Computing With Apache Ignite. If it got you interested, check out the rest of the book for more helpful information.
Hadoop has quickly become the standard for business intelligence on huge data sets. However, its batch scheduling overhead and disk-based data storage have made it unsuitable for use in analyzing live, real-time data in the production environment. One of the main factors that limits performance scaling of Hadoop and MapReduce is the fact that Hadoop relies on a file system that generates a lot of input/output (I/O) files. I/O adds latency that delays the MapReduce computation. An alternative is to store the needed distributed data within the memory. Placing MapReduce in-memory with the data it needs eliminates file I/O latency.
The generic phases of a Hadoop job are shown in the above figure. Phase sort, merge, and shuffle are highly I/O intensive. These overheads are prohibitive when running real-time analytics that returns the result in milliseconds or seconds.
The Ignite in-memory MapReduce engine executes MapReduce programs in seconds (or less) by incorporating several techniques. By avoiding Hadoop’s batch scheduling, it can start up jobs in milliseconds instead of tens of seconds. In-memory data storage dramatically reduces access times by eliminating data motion from the disk or across the network. This is the Ignite approach to accelerate Hadoop application performance without changing the code. The main advantages are that all the operations are highly transparent, all of this is accomplished without changing a line of MapReduce code.
Ignite Hadoop in-memory plug and play accelerator can be grouped by in three different categories.
1. In-Memory MapReduce
It’s an alternative implementation of Hadoop Job tracker and task tracker, which can accelerate job execution performance. It eliminates the overhead associated with job tracker and task trackers in a standard Hadoop architecture while providing low- latency, HPC-style distributed processing.
2. Ignite In-Memory File System (IGFS)
It’s also an alternate implementation of Hadoop file system named IgniteHadoopFileSystem, which can store data sets in-memory. This in-memory file system minimizes disk I/O and improves performances.
3. Hadoop File System Cache
This implementation works as a caching layer above HDFS. Every read and write operation should go through this layer and can improve MapReduce performance.
Conceptual architecture of the Ignite Hadoop accelerator is shown in Figure 2:
The Apache Ignite Hadoop accelerator tool is especially very useful when you already have up and running existing Hadoop cluster and want to get high-performance with minimum efforts. Note that the idea that Hadoop runs on commodity hardware is a myth. Most of the Hadoop process is I/O-intensive and requires homogenous and mid-end servers to performance well.
In this blog post, we are going to explore the details of the Apache Ignite in-memory MapReduce.
The Ignite in-memory MapReduce engine is 100% compatible with Hadoop HDFS and Yarn. It reduces the startup and the execution time of the Hadoop job tracker and the task tracker. Ignite in-memory MapReduce provides dramatic performance boosts for CPU-intensive tasks while requiring an only minimal change to existing applications. This module also provides an implementation of weight based MapReduce planner, which assigns mappers and reducers based on their weights. Weight describes how much resources are required to execute the particular map and reduce task. This planning algorithm assigns mappers so that, total resulting weight on all nodes as minimal as possible.
High-level architecture of the Ignite in-memory MapReduce is shown below:
Ignite in-memory grid has a pre-stage Java-based execution environment on all grid nodes and reuses it for multiple data processing. This execution environment consists of a set of Java virtual machines one on each server within the cluster. This JVM’s forms the Ignite MapReduce engine as shown in the above figure. Also, the Ignite in-memory data grid can automatically deploy all necessary executable programs or libraries for the execution of the MapReduce across the grid, this greatly reducing the startup time down to milliseconds.
Now that we have got the basics, let’s try to configure the sandbox and execute a few MapReduce jobs in Ignite MapReduce engine.
For simplicity, we are going to install a Hadoop pseudo-distributed cluster in a single virtual machine and run the famous Hadoop word count example as a MapReduce job. The Hadoop pseudo-distributed cluster means that Hadoop data nodes, name nodes, task trackers and job trackers — everything will be on one virtual (host) machine.
Let’s have a look at our sandbox configuration as shown below.
- OS: RedHat enterprise Linux.
- CPU: 2.
- RAM: 2Gb.
- JVM: 1.7_60.
- Ignite version: 1.6 or above, single node cluster.
First, we are going to install and configure Hadoop and will proceed to Apache Ignite. Assume that Java has been installed and that
JAVA_HOME is in the environment variables.
1. Unpack the Hadoop Distribution Archive
Unpack the Hadoop distribution archive and set the
JAVA_HOME path in
2. Add Configurations
Add the following configuration in the
<configuration> <property> <name>fs.defaultFS</name> <value>hdfs://localhost:9000</value> </property> </configuration>
Also, append the following data replication strategy into the
<configuration> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration>
3. Set Up Password-Less SSH
Set up password-less or passphrase-less SSH for your operating system:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys $ chmod 0600 ~/.ssh/authorized_keys
Try the following command on your console:
$ ssh localhost
It shouldn’t ask you for input password.
4. Set Up the Hadoop HDFS File System
Format the Hadoop HDFS file system:
$ bin/hdfs namenode -format
Next, start the namenode/datanode daemon by the following command:
Also, I would like to suggest that you add the
HADOOP_HOME environmental variable to the operating system.
5. Set Up Directories
Make a few directories in the HDFS file system to run MapReduce jobs.
bin/hdfs dfs -mkdir /user bin/hdfs dfs -mkdir /input
The above command will create two folder user and input in HDFS file system. Insert some text files in directory input:
bin/hdfs dfs -put $HADOOP_HOME/etc/hadoop /input .
6. Configure the Hadoop Pseudo Cluster
Run the Hadoop native MapReduce application to count the words of the file.
$ bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar w\ ordcount /input/hadoop output
You can view the result of the word count with the following command:
bin/hdfs dfs -cat output/*
In my case, the file is huge. Let’s see the fragment of the file:
want 1 warnings. 1 when 9 where 4 which 7 while 1 who 6 will 23 window 1 window, 1 with 62 within 4 without 1 work 12 writing, 27
In this stage, our Hadoop pseudo cluster is configured and ready to use. Now, let’s start configuring Apache Ignite.
7. Unpack Apache Ignite Distribution
Unpack the distribution of Apache Ignite somewhere in your sandbox and add the
IGNITE_- HOME path to the root directory of the installation. For getting statistics about tasks and executions, you have to add the following properties in your
<bean id="ignite.cfg" class="org.apache.ignite.configuration.IgniteConfiguration"> .... <property name="includeEventTypes"> <list> <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_FAILED"/> <util:constant static-field="org.apache.ignite.events.EventType.EVT_TASK_FINISHED"/> <util:constant static-field="org.apache.ignite.events.EventType.EVT_JOB_MAPPED"/> </list> </property> </bean>
The above configuration will enable the event task for statistics.
Note that, by default, all events are disabled. Whenever these above events are enabled, you can use the command “tasks” in Ignite Visor to get statistics about tasks executions.
8. Add Libraries
Add the following libraries in the
asm-all-4.2.jar ignite-hadoop-1.6.0.jar hadoop-mapreduce-client-core-2.7.2.jar hadoop-common-2.7.2.jar hadoop-auth-2.7.2.jar
Note that the
asm-all-4.2.jar library version is dependent on your Hadoop version.
9. Start the Ignite Node
We are going to use the Apache Ignite default configuration
config/default-config.xml file to start the Ignite node. Start the Ignite node with the following command:
10. Finish Setting Up Ignite Job Tracker
Add a few more things to use the Ignite job tracker instead of Hadoop. Add
HADOOP_CLASSPATH to the environmental variables as follows.
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$IGNITE_HOME/libs/ignite-core-1.6.0.jar:$IGNITE_\ HOME/libs/ignite-hadoop-1.6.0.jar:$IGNITE_HOME/libs/ignite-shmem-1.0.0.jar
In this stage, we are going to override the Hadoop
mapred-site.xml file. For a quick start, add the following fragment of XML to the
<property> <name>mapreduce.framework.name</name> <value>ignite</value> </property> <property> <name>mapreduce.jobtracker.address</name> <value>127.0.0.1:11211</value> </property>
12. Run It
Run the above example of the word count MapReduce example again.
$bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar wo\ rdcount /input/hadoop output2
The output should be very similar, as shown in Figure 4.
Now, the execution time is faster than the last time that we used the Hadoop task tracker. Let’s examine the Ignite task execution statistics through Ignite Visor.
From the above figure, we should notice the total executions and the durations times of the in-memory task tracker. In our case, the total executions task (
HadoopProtocolJobStatusTask(@t1)) is 24 and the execution rate are 12 second.
In the next part, we will execute a few benchmarks for demonstrating the performance advantage of the Ignite MapReduce engine.
Published at DZone with permission of Shamim Bhuiyan . See the original article here.
Opinions expressed by DZone contributors are their own.