Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Big Data Processing on an ARM Cluster

DZone's Guide to

Big Data Processing on an ARM Cluster

· Big Data Zone
Free Resource

Need to build an application around your data? Learn more about dataflow programming for rapid development and greater creativity. 

An ARM chip is not for processing Big Data by design. However, it's been gradually becoming powerful enough to do so. Many people have been trying to, at least, run Apache Hadoop on top of an ARM cluster. Cubieboard's guys have posted in the last August that they were able to run Hadoop on an 8-node machine. Also, Jamie Whitehorn seems to be the first guy who successfully ran Hadoop on a Raspberry Pi cluster, in October 2013. Both show that an ARM cluster is OK to up and run Hadoop.

But is it feasible to seriously do Big Data on a low-cost ARM cluster? This question really bugs me.

We know that Hadoop's MapReduce is doing operations on disk. With the slow I/O and networking of these ARM SoCs, Hadoop's MapReduce will really not be able to process a real Big Data computation, which average 15GB per file.

Yes, the true nature of Big Data is not that big. And we often do batch processing over a list of files, each of them is not larger than that size. So if a cluster is capable of processing a file of size 15GB, it is good enough for Big Data.

To answer this question, we have prototyped an ARM-based cluster, designed to process Big Data. It's a 22-node Cubieboard A10 with 100 Mbps Ethernet. Here's what it looks like:


The cluster running Spark and Hadoop

As we learned that Hadoop's MapReduce is not a good choice to process on this kind of cluster, we decided to use only HDFS then looked for an alternative, and stumbled upon Apache Spark.

It's kinda lucky for us that Spark is an in-memory framework which optionally spills intermediate results out to a disk when a computing node is running out-of-memory. So it's running fine on our cluster. Although the cluster has total 20GB of RAM, there's only 10GB available for data processing. If we try to allocate a larger amount of memory, some nodes will die during the computation.

So what is the result? Our cluster is good enough to crush a single, 34GB, Wikipedia article file from the year 2012. Its size is 2-times larger than the average Big Data file size, mentioned above. Also, it's 3-times larger than the memory allocated to process data.

We simply ran a tweaked word count program in Spark's shell and waited for 1 hour 50 mins and finally the cluster answered that the Wikipedia file contains 126,938,368 words of "the". The cluster spent around 30.5 hours in total across all nodes.


The result printed out from Spark's shell
(Just don't mind the date. We didn't set the cluster date/time properly.)

Design and Observation

We have 20 Spark worker nodes, and 2 of them also run Hadoop Data Nodes. This enables us to understand the data locality of our Spark/Hadoop cluster. We run the Hadoop's Name Node and the Spark's master node on the same machine. Another machine is the driver. The file is stored in 2 SSDs with SATA connected to Data Nodes. We set duplication to 2 as we have only 2 SSDs.

We have observed that the current bottleneck may be from 100 Mbps Ethernet, but we still have no chance to confirm this until we create a new cluster with 1Gbps Ethernet. We have 2 power supplies attached and the cluster seems to consume not much power. We'll measure this in detail later. We are located in one of the warmest cities of Thailand, but we've found that the cluster is able to run fine here (with some additional fans). Our room temperature is air-conditioned to 25 degrees Celsius (or 77 degrees Fahrenheit). It's great to have a cluster running without a costly data center, right?

Conclusions

An ARM system-on-chip board has demonstrated an enough power to form a cluster and process non-trivial size of data. The missing puzzle-piece we have found is to not rely only on Hadoop. We need to choose the right software package. Then with some engineering efforts, we can tune the software to fit the cluster. We have successfully used it to process a single large, 34 GB, file with acceptable time spent.

We are looking forwards to develop a new one, bigger by CPU cores but we'll keep its size small. Yes, and we're seriously thinking of putting the new one in the production.

Ah, I forgot to tell you the cluster's name. It's SUT Aiyara Cluster: Mk-I.

Check out the Exaptive data application Studio. Technology agnostic. No glue code. Use what you know and rely on the community for what you don't. Try the community version.

Topics:

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}