
The Truth About Traditional JavaScript Benchmarks (Part 3 - Kraken)

The problem with benchmarks is that they give you hints about where to look and what to do, but they won't tell you how far you have to go, and they won't protect a general optimization from regressing.


Continuing with the series, I plan to highlight a few concrete examples that illustrate why I think it's not only useful but crucial for the health of the JavaScript community to stop paying attention to static peak-performance benchmarks above a certain threshold.

In the last post, we talked about the notorious SunSpider examples; in this post, we'll go into detail about the Kraken benchmark.

The Not-So-Obvious Kraken Case

The Kraken benchmark was released by Mozilla in September 2010. It was said to contain snippets and kernels of real-world applications, making it less of a micro-benchmark than SunSpider. I don’t want to spend too much time on Kraken because I think it wasn’t as influential on JavaScript performance as SunSpider and Octane, so I’ll highlight one particular example from the audio-oscillator.js test.

audio-oscillator.js
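The snippet itself isn't reproduced here, but roughly, the driver looks like this (a sketch reconstructed from the description below; the myosc name and the bare 500-iteration loop standing in for Kraken's timing harness are approximations, not the exact Kraken source):

// Sketch of the audio-oscillator.js driver (reconstructed). The globals
// sine, frequency, bufferSize, and sampleRate are set up earlier in the file.
var calcOsc = function() {
  sine.generate();              // generate on the global sine Oscillator

  var myosc = new Oscillator(Oscillator.Sine, frequency, 1, bufferSize, sampleRate);
  myosc.generate();             // generate on the freshly created Oscillator

  sine.add(myosc);              // add it to the global sine oscillator
};

// The test invokes calcOsc 500 times.
for (var i = 0; i < 500; ++i) {
  calcOsc();
}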

So, the test invokes the calcOsc function 500 times. calcOsc first calls generate on the global sine Oscillator, then creates a new Oscillator, calls generate on that, and adds it to the global sine oscillator. Without going into detail as to why the test is doing this, let’s have a look at the generate method on the Oscillator prototype.

audio-oscillator-data.js
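Again as a sketch (field and local names such as step, frameCount, signal, and waveTable are reconstructions; what matters is the shape of the loop):

// Sketch of Oscillator.prototype.generate (reconstructed). Each iteration
// does a couple of multiplications, a Math.round, an array read and write,
// and the modulus in question.
Oscillator.prototype.generate = function() {
  var frameOffset = this.frameCount * this.bufferSize;
  var step = this.waveTableLength * this.frequency / this.sampleRate;
  var offset;

  for (var i = 0; i < this.bufferSize; i++) {
    offset = Math.round((frameOffset + i) * step);
    this.signal[i] = this.waveTable[offset % this.waveTableLength] * this.amplitude;
  }

  this.frameCount++;

  return this.signal;
};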

Looking at the code, you’d expect this to be dominated by the array accesses, the multiplications, or the Math.round calls in the loop, but surprisingly, what completely dominates the runtime of Oscillator.prototype.generate is the offset % this.waveTableLength expression. Running this benchmark in a profiler on any Intel machine reveals that more than 20% of the ticks are attributed to the idiv instruction that we generate for the modulus. One interesting observation, however, is that the waveTableLength field of the Oscillator instances always contains the same value, 2048, as it’s only assigned once in the Oscillator constructor.

audio-oscillator-data.js
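The relevant part of the constructor is just a plain field assignment (again a sketch; the real constructor also builds the wave table and sets up a few more fields):

// Sketch of the Oscillator constructor (reconstructed). The only thing that
// matters for this discussion is that waveTableLength is assigned exactly
// once, and always with the value 2048.
var Oscillator = function(type, frequency, amplitude, bufferSize, sampleRate) {
  this.frequency = frequency;
  this.amplitude = amplitude;
  this.bufferSize = bufferSize;
  this.sampleRate = sampleRate;
  this.frameCount = 0;

  this.waveTableLength = 2048;

  // ...wave table setup elided...
};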

If we know that the right-hand side of an integer modulus operation is a power of two, we can obviously generate much better code and completely avoid the idiv instruction on Intel. So, what we needed was a way to get the information that this.waveTableLength is always 2048 from the Oscillator constructor to the modulus operation in Oscillator.prototype.generate. One obvious way would be to rely on inlining everything into the calcOsc function and letting load/store elimination do the constant propagation for us, but this would not work for the sine oscillator, which is allocated outside the calcOsc function.

So, what we did instead was add support for tracking certain constant values as right-hand-side feedback for the modulus operator. This makes some sense in V8, since we track type feedback for binary operations like +, *, and % on uses, meaning the operator tracks the types of inputs it has seen and the types of outputs that were produced (see the slides from the recent roundtable talk on Fast arithmetic for dynamic languages for some details). Hooking this up with fullcodegen and Crankshaft was fairly easy back then: the BinaryOpIC for MOD can also track known power-of-two right-hand sides. In fact, running the default configuration of V8 (with Crankshaft and fullcodegen):

$ ~/Projects/v8/out/Release/d8 --trace-ic audio-oscillator.js
[...SNIP...]
[BinaryOpIC(MOD:None*None->None) => (MOD:Smi*2048->Smi) @ ~Oscillator.generate+598 at audio-oscillator.js:697]
[...SNIP...]
$ 

...shows that the BinaryOpIC is picking up the proper constant feedback for the right-hand side of the modulus and properly tracks that the left-hand side was always a small integer (a Smi in V8 speak), and we also always produced a small integer result. Looking at the generated code using --print-opt-code --code-comments quickly reveals that Crankshaft utilizes the feedback to generate an efficient code sequence for the integer modulus in Oscillator.prototype.generate:

[...SNIP...]
                  ;;; <@80,#84> load-named-field
0x133a0bdacc4a   330  8b4343         movl rax,[rbx+0x43]
                  ;;; <@83,#86> compare-numeric-and-branch
0x133a0bdacc4d   333  3d00080000     cmp rax,0x800
0x133a0bdacc52   338  0f85ff000000   jnz 599  (0x133a0bdacd57)
[...SNIP...]
                  ;;; <@90,#94> mod-by-power-of-2-i
0x133a0bdacc5b   347  4585db         testl r11,r11
0x133a0bdacc5e   350  790f           jns 367  (0x133a0bdacc6f)
0x133a0bdacc60   352  41f7db         negl r11
0x133a0bdacc63   355  4181e3ff070000 andl r11,0x7ff
0x133a0bdacc6a   362  41f7db         negl r11
0x133a0bdacc6d   365  eb07           jmp 374  (0x133a0bdacc76)
0x133a0bdacc6f   367  4181e3ff070000 andl r11,0x7ff
[...SNIP...]
                  ;;; <@127,#88> deoptimize
0x133a0bdacd57   599  e81273cdff     call 0x133a0ba8406e
[...SNIP...]

So, you see we load the value of this.waveTableLength (rbx holds the this reference), check that it’s still 2048 (hexadecimal 0x800), and if so, just perform a bitwise AND with the proper bitmask 0x7ff (r11 contains the value of the loop induction variable i) instead of using the idiv instruction (paying proper attention to preserve the sign of the left-hand side).
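Expressed in JavaScript rather than machine code, that fast path effectively computes the following (an illustration of the strength reduction only; the function name is made up and this is not V8 output):

// Illustration: what i % 2048 reduces to once the engine knows the
// right-hand side is the power of two 2048 (mask 0x7ff), while preserving
// the sign of the left-hand side the way JavaScript's % does.
function modWaveTableLength(i) {
  return i < 0 ? -(-i & 0x7ff) : i & 0x7ff;
}

console.log(5000 % 2048, modWaveTableLength(5000));    // 904 904
console.log(-5000 % 2048, modWaveTableLength(-5000));  // -904 -904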

The Over-Specialization Issue

So, this trick is pretty damn cool, but as with many benchmark-focused tricks, it has one major drawback: it’s over-specialized! As soon as the right-hand side changes, all optimized code has to be deoptimized (since the assumption that the right-hand side is always a certain power of two no longer holds), and any further optimization attempts will have to use idiv again, as the BinaryOpIC will most likely report feedback of the form Smi*Smi->Smi then. For example, let’s assume we instantiate another Oscillator, set a different waveTableLength on it, and call generate for that oscillator. We’d lose 20% performance even though the actually interesting Oscillators are not affected (i.e., the engine does non-local penalization here).

--- audio-oscillator.js.ORIG    2016-12-15 22:01:43.897033156 +0100
+++ audio-oscillator.js 2016-12-15 22:02:26.397326067 +0100
@@ -1931,6 +1931,10 @@
 var frequency = 344.53;
 var sine = new Oscillator(Oscillator.Sine, frequency, 1, bufferSize, sampleRate);

+var unused = new Oscillator(Oscillator.Sine, frequency, 1, bufferSize, sampleRate);
+unused.waveTableLength = 1024;
+unused.generate();
+
 var calcOsc = function() {
   sine.generate();

Comparing the execution times of the original audio-oscillator.js and the version that contains an additional unused Oscillator instance with a modified waveTableLength shows the expected results:

$ ~/Projects/v8/out/Release/d8 audio-oscillator.js.ORIG
Time (audio-oscillator-once): 64 ms.
$ ~/Projects/v8/out/Release/d8 audio-oscillator.js
Time (audio-oscillator-once): 81 ms.
$ 

This is an example of a pretty terrible performance cliff. Let’s say a developer writes code for a library, carefully tweaking and optimizing it using certain sample input values, and the performance is decent. Now a user starts using that library after reading through the performance notes, but somehow falls off the performance cliff because she or he is using the library in a slightly different way (i.e., somehow polluting the type feedback for a certain BinaryOpIC) and is hit by a 20% slowdown (compared to the library author’s measurements) that neither the library author nor the user can explain and that seems rather arbitrary.

Now, this is not uncommon in JavaScript land, and unfortunately, quite a few of these cliffs are just unavoidable, because they are due to the fact that JavaScript performance is based on optimistic assumptions and speculation. We have been spending a lot of time and energy trying to come up with ways to avoid these performance cliffs while still providing (nearly) the same performance.

As it turns out, it makes a lot of sense to avoid idiv whenever possible, even if you don’t necessarily know (via dynamic feedback) that the right-hand side is always a power of two. TurboFan differs from Crankshaft here in that it always checks at runtime whether the input is a power of two, so the general case for signed integer modulus, with optimization for an unknown power-of-two right-hand side, looks like this (in pseudocode):

if 0 < rhs then
  msk = rhs - 1
  if rhs & msk != 0 then
    lhs % rhs
  else
    if lhs < 0 then
      -(-lhs & msk)
    else
      lhs & msk
else
  if rhs < -1 then
    lhs % rhs
  else
    zero
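
For readability, here is the same lowering written out as plain JavaScript (a transcription of the pseudocode above, not the actual TurboFan implementation, which emits machine code):

// Transcription of the pseudocode for int32 inputs: check at runtime whether
// rhs is a power of two and use masking in that case, otherwise fall back to
// the generic modulus. For rhs of 0 or -1 the int32 result is 0 (plain
// JavaScript % would produce NaN or -0 there).
function signedMod(lhs, rhs) {
  if (0 < rhs) {
    var msk = rhs - 1;
    if ((rhs & msk) !== 0) {
      return lhs % rhs;                             // rhs is not a power of two
    }
    return lhs < 0 ? -(-lhs & msk) : lhs & msk;     // power of two: mask, keep sign
  }
  if (rhs < -1) {
    return lhs % rhs;                               // generic negative divisor
  }
  return 0;                                         // rhs is 0 or -1
}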

That leads to a lot more consistent and predictable performance (with TurboFan):

$ ~/Projects/v8/out/Release/d8 --turbo audio-oscillator.js.ORIG
Time (audio-oscillator-once): 69 ms.
$ ~/Projects/v8/out/Release/d8 --turbo audio-oscillator.js
Time (audio-oscillator-once): 69 ms.
$ 

The problem with benchmarks and over-specialization is that the benchmark can give you hints about where to look and what to do, but it doesn’t tell you how far you have to go and it doesn’t protect the optimization properly. For example, all JavaScript engines use benchmarks to guard against performance regressions, but running Kraken wouldn’t protect the general approach that we have in TurboFan; i.e., we could degrade the modulus optimization in TurboFan to the over-specialized version of Crankshaft and the benchmark wouldn’t tell us that we regressed, because from the point of view of the benchmark it’s fine!

Now, you could extend the benchmark, maybe in the same way that I did above, and try to cover everything with benchmarks, which is what engine implementors do to a certain extent. However, that approach doesn’t scale arbitrarily. Even though benchmarks are convenient and easy to use for communication and competition, you also need to leave space for common sense; otherwise, over-specialization will dominate everything and you’ll end up with a really, really fine line of acceptable performance and big performance cliffs.

There are various other issues with the Kraken tests. Next time, we'll look at what is probably the most influential JavaScript benchmark of the last five years: the Octane benchmark.

Stay tuned for part four. Here are part one and part two, in case you missed them.



Published at DZone with permission of Benedikt Meurer, DZone MVB. See the original article here.

