Big Data Processing on an ARM Cluster
An ARM chip is not designed for processing Big Data. However, it has gradually become powerful enough to do so. Many people have been trying to at least run Apache Hadoop on top of an ARM cluster. The Cubieboard team posted last August that they were able to run Hadoop on an 8-node cluster. Also, Jamie Whitehorn seems to be the first person to have successfully run Hadoop on a Raspberry Pi cluster, in October 2013. Both show that an ARM cluster is capable of getting Hadoop up and running.
But is it feasible to seriously do Big Data on a low-cost ARM cluster? This question really bugs me.
We know that Hadoop's MapReduce does its operations on disk. With the slow I/O and networking of these ARM SoCs, Hadoop's MapReduce will not really be able to handle a real Big Data computation, where files average 15 GB each.
Yes, the true nature of Big Data is not that big. We often do batch processing over a list of files, each of which is no larger than that. So if a cluster is capable of processing a 15 GB file, it is good enough for Big Data.
To answer this question, we have prototyped an ARM-based cluster designed to process Big Data. It's a 22-node cluster of Cubieboard A10 boards connected by 100 Mbps Ethernet. Here's what it looks like:
The cluster running Spark and Hadoop
Since Hadoop's MapReduce is not a good choice for processing on this kind of cluster, we decided to use only HDFS from Hadoop, looked for an alternative processing engine, and stumbled upon Apache Spark.
Luckily for us, Spark is an in-memory framework that can spill intermediate results to disk when a computing node runs out of memory, so it runs fine on our cluster. Although the cluster has 20 GB of RAM in total, only 10 GB is available for data processing. If we try to allocate more memory than that, some nodes die during the computation.
So what is the result? Our cluster is good enough to crunch a single 34 GB file of Wikipedia articles from the year 2012. That is more than twice the average Big Data file size mentioned above, and more than three times the memory allocated for processing data.
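As a quick sanity check on these ratios, the arithmetic can be sketched in Python. The sizes are the ones quoted in this article. The even split of the 10 GB across the 20 Spark workers (described in the design notes further down) is an assumption made for illustration, not a measured setting.

```python
# Sizes quoted in the article, in GB.
FILE_GB = 34
AVERAGE_BIG_DATA_FILE_GB = 15
PROCESSING_RAM_GB = 10

# Assumption for illustration: the processing memory is split evenly
# across the 20 Spark workers. The article does not state the split.
WORKERS = 20

ratio_to_average = FILE_GB / AVERAGE_BIG_DATA_FILE_GB
ratio_to_memory = FILE_GB / PROCESSING_RAM_GB
per_worker_mb = PROCESSING_RAM_GB * 1024 / WORKERS

print(f"file vs. average Big Data file: {ratio_to_average:.2f}x")
print(f"file vs. processing memory:     {ratio_to_memory:.1f}x")
print(f"memory per worker (even split): {per_worker_mb:.0f} MB")
```

Because the file is several times larger than the memory available, the job can only finish if intermediate data is allowed to spill to disk.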
We simply ran a tweaked word count program in Spark's shell and waited 1 hour 50 minutes. Finally, the cluster answered that the Wikipedia file contains 126,938,368 occurrences of the word "the". The cluster spent around 30.5 hours in total across all nodes.
The result printed out from Spark's shell
(Just don't mind the date. We didn't set the cluster date/time properly.)
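The exact program is not shown here. To give a rough idea of the shape of such a job, here is a minimal plain-Python sketch of a map-and-reduce word count. It is not the Spark code that ran on the cluster: the function names, the regular expression used to split words, and the tiny in-memory partitions are all assumptions made for illustration.

```python
import re
from collections import Counter
from functools import reduce

# Assumption: a "word" is a run of ASCII letters or apostrophes.
WORD_RE = re.compile(r"[a-z']+")


def count_words(partition):
    """Map step: count lowercased words in one partition (an iterable of lines)."""
    counts = Counter()
    for line in partition:
        counts.update(WORD_RE.findall(line.lower()))
    return counts


def merge_counts(left, right):
    """Reduce step: fold the counts of one partition into the running total."""
    left.update(right)
    return left


def word_count(partitions):
    """Count words per partition, then merge the partial results."""
    return reduce(merge_counts, map(count_words, partitions), Counter())


# Hypothetical sample data standing in for blocks of a large file.
partitions = [
    ["The cluster runs Spark.", "The file lives in HDFS."],
    ["Spark spills to the disk when memory runs out."],
]

totals = word_count(partitions)
print(totals["the"])  # prints 3
```

On a real cluster each partition would be counted on a different worker, and only the small per-partition counts would need to travel over the network to be merged.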
Design and Observation
We have 20 Spark worker nodes, and 2 of them also run Hadoop Data Nodes. This enables us to understand the data locality of our Spark/Hadoop cluster. We run Hadoop's Name Node and Spark's master node on the same machine. Another machine is the driver. The file is stored on 2 SSDs connected via SATA to the Data Nodes. We set the HDFS replication factor to 2, as we have only 2 SSDs.
We have observed that the current bottleneck may be the 100 Mbps Ethernet, but we have no way to confirm this until we build a new cluster with 1 Gbps Ethernet. We have 2 power supplies attached, and the cluster does not seem to consume much power. We'll measure this in detail later. We are located in one of the warmest cities in Thailand, but we've found that the cluster runs fine here (with some additional fans). Our room is air-conditioned to 25 degrees Celsius (77 degrees Fahrenheit). It's great to have a cluster running without a costly data center, right?
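A back-of-the-envelope calculation puts the run time next to the network capacity. The sketch below uses only figures from this article and rests on stated assumptions: 1 GB is taken as 1000 MB, the 30.5 hours is read as cumulative task time across nodes, the links are assumed to reach their full theoretical rate, and the worst case is assumed in which every byte leaves the 2 Data Nodes over Ethernet.

```python
# Figures quoted in the article.
FILE_GB = 34
WALL_CLOCK_S = 1 * 3600 + 50 * 60   # 1 hour 50 minutes
TOTAL_NODE_TIME_S = 30.5 * 3600     # time summed across all nodes
LINK_MBPS = 100                     # Ethernet speed per node
DATA_NODES = 2

# Assumption: decimal units, 1 GB = 1000 MB.
file_mb = FILE_GB * 1000

# How many nodes' worth of work ran concurrently on average.
effective_parallelism = TOTAL_NODE_TIME_S / WALL_CLOCK_S

# Overall rate at which the file was processed.
aggregate_mb_per_s = file_mb / WALL_CLOCK_S

# Theoretical ceiling of one 100 Mbps link, ignoring protocol overhead.
link_mb_per_s = LINK_MBPS / 8

# Worst-case assumption: the whole file crosses the Data Nodes' links.
min_transfer_min = file_mb / (DATA_NODES * link_mb_per_s) / 60

print(f"effective parallelism:  {effective_parallelism:.1f}")
print(f"aggregate throughput:   {aggregate_mb_per_s:.2f} MB/s")
print(f"one link, theoretical:  {link_mb_per_s:.1f} MB/s")
print(f"minimum transfer time:  {min_transfer_min:.1f} min")
```

Under these assumptions the job averaged roughly 5 MB/s overall, against a theoretical 12.5 MB/s per link. The arithmetic alone neither confirms nor rules out the network as the bottleneck, which is why a comparison run on 1 Gbps Ethernet is still needed.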
An ARM system-on-chip board has demonstrated enough power to form a cluster and process a non-trivial amount of data. The missing puzzle piece we found is not to rely on Hadoop alone. We need to choose the right software package. Then, with some engineering effort, we can tune the software to fit the cluster. We have successfully used it to process a single large 34 GB file in an acceptable amount of time.
We are looking forward to developing a new one, with more CPU cores but still physically small. And yes, we're seriously thinking of putting the new one into production.
Ah, I forgot to tell you the cluster's name. It's SUT Aiyara Cluster: Mk-I.