Join the DZone community and get the full member experience.Join For Free
Atomist automates your software deliver experience. It's how modern teams deliver modern software.
There’s a better place to specifically look at performance comparisons across languages than this post – The computer languages benchmarks game. But this post attempts look at performance comparisons a little differently. Based on coding idioms as well. And for a much narrower range of problems (namely one).
There are languages which are tightly opinionated on a particular way of doing things. And there are languages which allow you to implement a given logic in multiple ways. Yet, depending upon the language (and as we shall see, the runtime), the performance could vary quite substantially based on the nature of the code we write. This post attempts to take a small piece of logic, and implements in upto 3 different styles in 8 languages (10 if you count the runtime variations as well).
Quoting from The Josephus Problem,
Flavius Josephus was a roman historian of Jewish origin. During the Jewish-Roman wars of the first century AD, he was in a cave with fellow soldiers, 40 men in all, surrounded by enemy Roman troops. They decided to commit suicide by standing in a ring and counting off each third man. Each man so designated was to commit suicide…Josephus, not wanting to die, managed to place himself in the position of the last survivor.
In the general version of the problem, there are n soldiers numbered from 1 to n and each k-th soldier will be eliminated. The count starts from the first soldier. What is the number of the last survivor. In the code I benchmarked, n = 40 and k = 3.
I have considered three idioms :
- Object Oriented : This code has classes reflecting a person (or a soldier) and the chain. The objects of person maintain reference to their prior and next people in the cirlce (a doubly linked list, and as the counting progresses, whenever they need to eliminate themselves, they do so by updating the next / prev references in the prev / next objects. This style results perhaps in the least operations involving mutation or memory allocation / deallocation. One would’ve imagined it to be the fastest, but as you will see that is not necessarily true.
- List reduction :This code starts with a list of integers, each element representing a soldier. It performs an operation which effectively creates a subset of the list by removing every third soldier. The result of one such pass is a smaller list. Rinse and repeat if the smaller list is more than 1 element long. It emphasises looping over lists (using comprehension or other constructs) and focuses on reducing the list by conducting an operation on the entire list, every pass.
- Element recursion :This is a more fine grained logic which emphasises recursion (and often accumulation) for every element in the list. This is particularly apt scenario to use pattern matching (both the erlang and scala code use pattern matching). One would imagine this to be always slower than list reduction since it is much more fine grained and involves many more function calls.
I’ve attempted to implement code in all languages using the styles above as long as reasonably feasible and appropriate. Since (barring C/C++), Java continues to be the language to beat from a performance perspective, I’ve attempted to implement roughly equivalent logic in all styles using Java as well. All programs typically run the code once to print the results (to verify correctness), and then 100000 or a million iterations to warmup, and then again repeat the iterations and measuring the elapsed time. There is a slight inconsistency between the various code snippets. The counter either varies between 0 to 39 or between 1 to 40.
I can’t write the fastest possible code across all these languages. This is the best I could do. However if you can find a better way to implement the code, do let me know in the comments (or send me a pull request on github). I shall certainly include better solutions here if and as they are identified. At the point in time of publishing this, at least two authors had contributed to the code. I imagine (based on my experience with the prior post), more might be interested in suggesting tweaks to further improve performance. These are all listed here.
- Paddy3118 had suggested some python code in the comments in last blog post, which I have substantially reused for the python list-reduction logic
- Rahul Göma Phuloré (missingfaktor) contributed substantial improvments to the scala code
- Viktor Klang contributed a improved version for the scala element recursion code
- David Nolen (swannodette) contributed a substantially improved version for clojure element recursion
- Fred Hebert suggested native compilation by adding “compile(native).” and a couple of other minor improvements over github
- Isaac Guoy offered an improved version for Java List Reduction
Hardware / Software
The specs of the machine used for measurement are as follows
OS : Ubuntu Maveric Meerkat 10.10
Kernel : 2.6.35-28-server 64 bit
CPU : Intel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
RAM : 4GB
Here are the results. All timings are in microseconds per iteration, clearly lower metrics are better than higher metrics :
|Object Oriented||List Reduction||Element Recursion|
|Erlang R14B03 HiPE||3.489||3.192|
Observations : (Updated)
Some interesting observations (in no particular order).
- These are observations based only on a particular problem. For broader coverage of cross-language benchmarks, I encourage you to refer to The computer languages benchmarks game
- Java and Scala performance is pretty close.
- One would’ve imagined OO code to be generally the fastest due to minimal mutations or memory allocations. Yet that remains true only for the statically typed languages (and one notable exception PyPy). For all others, the list reduction approach is faster.
- The real surprise in the pack comes from PyPy. It performs outstandingly well on the OO and List Reduction approaches. It still is a bit slow in the element recursion approach. I would hope given the very young age of the VM, that the same would get further optimised in times and versions to come. Interestingly, the pause it undergoes (perhaps when jit’ting the code) is quite noticeable.
- jRuby consistently shows superior performance to Ruby 1.9.1. Coupled with further expected improvements arising out of invokedynamic in JDK 7, this is one runtime to watch. Though as the metrics show, it still has some ways to go to catch up.
- The groovy performance using the list reduction is quite nice.
- Its surprising to see clojure roughly similar to jRuby in performance.
Full Source code is available on github at https://github.com/dnene/josephus
Finally, thanks to a number of folks I had a chance to preview the post with and especially to Saager Mhatre to suggest moving the code from a attached zip file to github.
- Updated metrics for groovy 1.8.1 (instead of earlier groovy 1.7)
- Updated code to reflect suggestions by Eric Rozendaal and another almost similar one by Viktor Klang – Viktor’s code was very marginally faster. Leads to a reduction in Scala Element Recursive benchmark from 5.213 to 2.334 microseconds
- Updated clojure element recursion code as per suggestion by David Nolen. Now time down from 135.36 to 29.170 microseconds
- Thanks to the persistent questioning by Isaac, upgraded the metrics to jRuby 1.6.3. That turned out to be a very good step. There is a substantial improvements in the performance metrics which are now updated in the numbers above.
- Fred Hebert submitted a pull request to turn on native compilation which required native compilation – which in turn required HiPE which Isaac had suggested earlier. After verifying that Erlang-HiPE is a valid synaptic target (thus a different readily available VM), I built the same and updated the readings
- Isaac Gouy offered some helpful suggestions in terms of converting the main block also into a function. Also he demonstrated some potential issues in terms of whether the resulting performance was stable. I have made across the board changes now to run all the benchmarks ten times each for a million iterations and used the last 5 readings after visually ensuring that the readings did not vary much
Opinions expressed by DZone contributors are their own.