Some time ago we faced a typical programming challenge: how to perform an enormous amount of computations as fast as possible? The answer is simple: divide the problem into smaller ones, compute them in parallel and gather results. The overall conception isn't complicated, so it can be achieved on a variety of ways.
One option is to do it by yourself. Simply set up an HTTP server to farm out requests RESTfully, and accept responses the same way. You can also use AMQP/JMS, RMI or even the old-school Corba - it depends only on your skills and imagination.
However, you do not have to reinvent the wheel. There are several open-source frameworks, which can be used almost out of the box. Simply download the library, adjust your sources and observe how fast your problems are solved. But which framework should you use? The answer isn't simple and it highly depends on your individual needs. We cannot state, that for example Hadoop will be always the best choice, but we can point out the advantages and disadvantages of several frameworks, so you will be able to choose the best one for you by yourself.
We have decided to give a try the following frameworks:
We believe, that the serious study should be as transparent as possible. This is why we describe all the methods and results and give you access to all the sources. You should be able to repeat all the tests and receive the similar results. If not, please contact us, so we can revise our report.
This is part I of our comparison, where we concentrate on the task distribution time. There were no node failures nor transactions rollbacks.
Our test environment consisted with 5 machines (named intel1 - intel5), each one with 8 cpu on board, which gave us 40 processing units. You can see the architecture of the test environment on the following figure:
We based our benchmark on the mathematical problem known as 'counting monotone boolean functions' (also known as Dedekind's problem). Why such unrealistic problem? Because:
- it is highly cpu-consuming (one cpu will need more than 800 hours to count all monotone boolean functions for N=8)
- you do not need to solve the whole problem in a benchmark (we chose a fragment that a single cpu will compute in more than 3 hours)
- it can be easily divided into an arbitrary number of tasks
- generated tasks have different computational needs (it is a perfect use case for load balancers)
With a such flexible problem in hands, we had decided to prepare three test scenarios:
- compute problem divided into 33700 tasks (CMBF with arguments: n = 4, level = 1000)
- compute problem divided into 2705 tasks (CMBF with arguments: n = 4, level = 10000)
- compute problem divided into 341 tasks (CMBF with arguments: n = 4, level = 100000)
All tests were repeated ten times in order to avoid measuring error.
Results - overview
You can find the average results on the following figure:
- X-axis: number of tasks the problem was divided to
- Y-axis: average time of the algorithm (in milliseconds)
|Library name||Tasks: 341||Tasks: 2705||Tasks: 33700|
|DAC 0.7.1||305 834.70||299 507.70||304 971.93|
|GridGain 2.1.1||372 279.70||338 310.40||350 744.00|
|Hazelcast 1.8||348 716.70||321 922.70||335 363.30|
|DAC 0.9.1||306 076.20||299 815.70||305 303.30|
|Hadoop 0.20.1||467 042.70||384 331.60||365 660.40|
As you can see, all frameworks obtained quite similar results. However, these are the best cases only (we performed two versions of the test for GridGain and Hadoop frameworks). Moreover, the best results were obtained in the middle test case (2705 tasks) with the exception of Hadoop, which gained the best time when there were 33700 tasks.
You will find the detailed methodology (sources, test environment description) and results (all performed test cases with std deviation and average values) on our website.
The above part I concentrates on the task distribution time. Because of the test environment limitations (only 40 cpu units), all frameworks obtained quite similar results. However, Hadoop was distributing tasks 20%-30% slower than other frameworks, but Hadoop was designed to manipulate large data sets, so the above results are totally understandable.
In the upcoming part II we will concentrate on the fail-over capabilities of the selected frameworks.