The Notorious SunSpider Examples

A blog post on traditional JavaScript benchmarks wouldn’t be complete without pointing out the obvious SunSpider problems. So, let’s start with the prime example of a performance test that has hardly any applicability in the real world: the bitops-bitwise-and.js test.

bitops-bitwise-and.js
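
For reference, the test boils down to roughly the following loop; this is a reconstruction from the description below, not a verbatim copy of the benchmark source:

bitwiseAndValue = 4294967296;
for (var i = 0; i < 600000; i++)
  bitwiseAndValue = bitwiseAndValue & i;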


There are a couple of algorithms that need fast bitwise and, especially in the area of code transpiled from C/C++ to JavaScript, so it does indeed make some sense to be able to perform this operation quickly. However, real-world web pages will probably not care whether an engine can execute a bitwise and in a loop 2x faster than another engine. But if you stare at this code for another couple of seconds, you’ll probably notice that bitwiseAndValue will be 0 after the first loop iteration and will remain 0 for the next 599999 iterations. So once you get this to good performance, i.e. anything below 5ms on decent hardware, you can start gaming this benchmark by recognizing that only the first iteration of the loop is necessary, while the remaining iterations are a waste of time (i.e. dead code after loop peeling). This requires some machinery in the JavaScript engine to perform the transformation: you need to check that bitwiseAndValue is either a regular property of the global object or not present before you execute the script, that there is no interceptor on the global object or its prototypes, and so on. But if you really want to win this benchmark, and you are willing to go all in, then you can execute this test in less than 1ms. However, this optimization would be limited to this special case, and slight modifications of the test would probably no longer trigger it.
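
To make the dead-code argument concrete, this is what the loop effectively computes; an illustration of the reasoning above, not actual compiler output:

// On the first iteration i is 0, so:
bitwiseAndValue = 4294967296 & 0;   // 0
// Every later `bitwiseAndValue & i` is `0 & i`, which stays 0, so iterations
// 1 through 599999 have no observable effect and can be removed as dead code
// once the engine has peeled the first iteration and proven this.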

Ok, so that bitops-bitwise-and.js test was definitely the worst example of a micro-benchmark. Let’s move on to something more real-world-ish in SunSpider, the string-tagcloud.js test, which essentially runs a very early version of the json.js polyfill. The test arguably looks a lot more reasonable than the bitwise and test, but profiling the benchmark quickly reveals that a lot of time is spent on a single eval expression (up to 20% of the overall execution time for parsing and compiling, plus up to 10% for actually executing the compiled code):

string-tagcloud.js


Looking closer reveals that this eval is executed exactly once, and is passed a JSONish string that contains an array of 2501 objects with tag and popularity fields:

([
  {
    "tag": "titillation",
    "popularity": 4294967296
  },
  {
    "tag": "foamless",
    "popularity": 1257718401
  },
  {
    "tag": "snarler",
    "popularity": 613166183
  },
  {
    "tag": "multangularness",
    "popularity": 368304452
  },
  {
    "tag": "Fesapo unventurous",
    "popularity": 248026512
  },
  {
    "tag": "esthesioblast",
    "popularity": 179556755
  },
  {
    "tag": "echeneidoid",
    "popularity": 136641578
  },
  {
    "tag": "embryoctony",
    "popularity": 107852576
  },
  ...
])


Obviously parsing these object literals, generating native code for them and then executing that code comes at a high cost. It would be a lot cheaper to just parse the input string as JSON and generate an appropriate object graph. So, one trick to speed up this benchmark is to mess with eval and try to always interpret the data as JSON first, only falling back to actual parsing, compiling and executing if the attempt to read it as JSON fails (some additional magic is required to skip the parentheses, though). Back in 2007, this wouldn’t even have been a bad hack, since there was no JSON.parse, but in 2017 this is just technical debt in the JavaScript engine that potentially slows down legitimate uses of eval. In fact, updating the benchmark to modern JavaScript:

--- string-tagcloud.js.ORIG  2016-12-14 09:00:52.869887104 +0100
+++ string-tagcloud.js       2016-12-14 09:01:01.033944051 +0100
@@ -198,7 +198,7 @@
                     replace(/"[^"\\\n\r]*"|true|false|null|-?\d+(?:\.\d*)?(:?[eE][+\-]?\d+)?/g, ']').
                     replace(/(?:^|:|,)(?:\s*\[)+/g, ''))) {

-                j = eval('(' + this + ')');
+                j = JSON.parse(this);

                 return typeof filter === 'function' ? walk('', j) : j;
             }


...yields an immediate performance boost, dropping runtime from 36ms to 26ms for V8 LKGR as of today, a 30% improvement!

$ node string-tagcloud.js.ORIG
Time (string-tagcloud): 36 ms.
$ node string-tagcloud.js
Time (string-tagcloud): 26 ms.
$ node -v
v8.0.0-pre
$


This is a common problem with static benchmarks and performance test suites. Today, no one would seriously use eval to parse JSON data (for obvious security reasons, not only because of the performance issues), but would rather stick to JSON.parse, as essentially all code written in the last five years does. In fact, using eval to parse JSON would probably be considered a bug in production code today! So the effort engine writers put into the performance of newly written code is not reflected in this ancient benchmark; instead, it would be beneficial to make eval needlessly smart, i.e. complex, just to win on string-tagcloud.js.
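
To spell out the difference for this benchmark’s kind of data, here are the legacy pattern and the modern idiom side by side (the variable names are just for illustration):

var text = '{"tag": "titillation", "popularity": 4294967296}';

// Legacy json.js-era pattern: parse, compile and execute the data as code.
var legacy = eval('(' + text + ')');

// Modern idiom: parse the string directly as JSON, no code generation involved.
var modern = JSON.parse(text);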

Ok, so let’s look at yet another example: the 3d-cube.js benchmark. It does a lot of matrix operations that even the smartest compiler can’t do much about; it just has to execute them. Essentially, the benchmark spends a lot of time executing the Loop function and the functions called by it.

3d-cube.js


One interesting observation here is that the RotateX, RotateY, and RotateZ functions are always called with the same constant parameter Phi. 

3d-cube.js


This means that we basically always compute the same values for Math.sin and Math.cos, 204 times each. There are only three different inputs:

  • 0.017453292519943295,
  • 0.05235987755982989, and
  • 0.08726646259971647

...obviously. So, one thing you could do here to avoid recomputing the same sine and cosine values all the time is to cache the previously computed values, and in fact that’s what V8 used to do in the past, and what other engines like SpiderMonkey still do. We removed the so-called transcendental cache from V8 because its overhead was noticeable in actual workloads, where you don’t always compute the same values in a row, which is unsurprisingly very common in the wild. We took serious hits on the SunSpider benchmark when we removed these benchmark-specific optimizations back in 2013 and 2014, but we totally believe that it doesn’t make sense to optimize for a benchmark while at the same time penalizing real-world use cases in such a way.
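
For illustration, here is a user-land sketch of what such a cache amounts to; the actual transcendental cache lived inside the engine and differed in detail, so the names and the exact caching scheme here are made up:

var lastSinInput = NaN;
var lastSinResult = NaN;

function cachedSin(x) {
  if (x !== lastSinInput) {        // the cache check itself costs time on every call
    lastSinInput = x;
    lastSinResult = Math.sin(x);   // only recomputed when the input changes
  }
  return lastSinResult;
}

// Great for 3d-cube.js, where the same three inputs repeat 204 times each, but a
// pure overhead on real workloads, where consecutive inputs rarely repeat.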


3d-cube benchmark. Source: arewefastyet.com


Obviously, a better way to deal with the constant sine/cosine inputs is a sane inlining heuristic that balances the costs and benefits of inlining and takes into account factors such as preferring inlining at call sites where constant folding can be beneficial, as in the case of the RotateX, RotateY, and RotateZ call sites. But this was not really possible with the Crankshaft compiler for various reasons. With Ignition and TurboFan, this becomes a sensible option, and we are already working on better inlining heuristics.
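
A minimal sketch of why inlining enables constant folding here; this RotateX is a simplified stand-in, not the benchmark’s actual code:

// Simplified stand-in for the benchmark's RotateX.
function RotateX(points, Phi) {
  var SinPhi = Math.sin(Phi);   // constant-foldable once the call is inlined and
  var CosPhi = Math.cos(Phi);   // Phi is known to be the constant below
  for (var i = 0; i < points.length; i++) {
    var y = points[i][1], z = points[i][2];
    points[i][1] = y * CosPhi - z * SinPhi;
    points[i][2] = y * SinPhi + z * CosPhi;
  }
}

// In 3d-cube.js the call sites always pass the same constant inputs, e.g.:
var cube = [[1, 0, 0], [0, 1, 0], [0, 0, 1]];
RotateX(cube, 0.017453292519943295);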

Garbage Collection Considered Harmful

Besides these very test-specific issues, there’s another fundamental problem with the SunSpider benchmark: the overall execution time. V8 on decent Intel hardware currently runs the whole benchmark in roughly 200ms (with the default configuration). A minor GC can take anything between 1ms and 25ms (depending on live objects in new space and old-space fragmentation), while a major GC pause can easily take 30ms (not even taking into account the overhead from incremental marking); that’s more than 10% of the overall execution time of the whole SunSpider suite! So any engine that doesn’t want to risk a 10-20% slowdown due to a GC cycle has to somehow ensure it doesn’t trigger GC while running SunSpider.

driver-TEMPLATE.html


There are different tricks to accomplish this, none of which has any positive impact in the real world as far as I can tell. V8 uses a rather simple trick: since every SunSpider test is run in a new <iframe>, which corresponds to a new native context in V8 speak, we just detect rapid <iframe> creation and disposal (all SunSpider tests take less than 50ms each), and in that case perform a garbage collection between the disposal and the creation, to ensure that we never trigger a GC while actually running a test. This trick works pretty well and in 99.9% of cases doesn’t clash with real uses; except that, every now and then, it can hit you hard: if for whatever reason you do something that makes you look like the SunSpider test driver to V8, you can get forced GCs, and that can have a negative effect on your application. So rule of thumb: don’t let your application look like SunSpider!
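
In pseudo-code, the trick described above amounts to something like the following; this is purely an illustration of the idea, not V8’s actual implementation, and all names, thresholds, and the fake heap object are made up:

// Made-up threshold for illustration only.
var RAPID_DISPOSAL_WINDOW_MS = 50;   // "all SunSpider tests take less than 50ms each"

function onNativeContextDisposed(heap) {
  // If <iframe>s (native contexts) are being created and thrown away in rapid
  // succession, do a full GC now, between tests, so none fires during a test.
  if (Date.now() - heap.lastContextCreationTime < RAPID_DISPOSAL_WINDOW_MS) {
    heap.collectAllGarbage();
  }
}

// A tiny fake "heap" just so the sketch runs standalone.
var heap = {
  lastContextCreationTime: Date.now(),
  collectAllGarbage: function () { console.log('forced GC between tests'); }
};
onNativeContextDisposed(heap);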

I could go on with more SunSpider examples here, but I don’t think that’d be very useful. By now it should be clear that optimizing further for SunSpider above the threshold of good performance will not reflect any benefits in the real world. In fact, the world would probably benefit a lot from not having SunSpider anymore, as engines could drop weird hacks that are only useful for SunSpider and can even hurt real-world use cases. Unfortunately, SunSpider is still being used heavily by the (tech) press to compare what they think is browser performance, or even worse, to compare phones! So there’s a certain natural interest from phone makers and also from Android, in general, to have Chrome look somewhat decent on SunSpider (and other nowadays meaningless benchmarks FWIW). The phone makers generate money by selling phones, so getting good reviews is crucial for the success of the phone division or even the whole company, and some of them even went as far as shipping old versions of V8 in their phones that had a higher score on SunSpider, exposing their users to all kinds of unpatched security holes that had long been fixed and shielding their users from any real-world performance benefits that come with more recent V8 versions!

Galaxy S7 and S7 Edge review: Samsung's finest get more polished. Source: www.engadget.com


If we as the JavaScript community really want to be serious about real world performance in JavaScript land, we need to make the tech press stop using traditional JavaScript benchmarks to compare browsers or phones. I see that there’s a benefit in being able to just run a benchmark in each browser and compare the number that comes out of it, but then please, please use a benchmark that has something in common with what is relevant today, i.e. real-world web pages. And, if you feel the need to compare two phones via a browser benchmark, please at least consider using Speedometer.

Cuteness Break!


I always loved this part of Myles Borins’ talks, so I had to shamelessly steal his idea. Anyway, I hope that refreshed your palate. And now that we've recovered from the SunSpider rant, we can all get back to whatever it is we were doing. Tomorrow, we'll continue looking into the other classic benchmarks…

Stay tuned for part three. Here's part one in case you missed it.