The Truth About Traditional JavaScript Benchmarks (Part 3 - Kraken)
The problem with benchmarks is that they give you hints about where to look and what to do but won't tell you how far you have to go or protect the optimization properly.
Join the DZone community and get the full member experience.
Join For Freecontinuing on with the series , i plan to highlight a few concrete examples that illustrate why i think it’s not only useful but crucial for the health of the javascript community to stop paying attention to static peak performance benchmarks above a certain threshold.
last post , we talked about the notorious sunspider examples, this post we'll go into detail about the kraken benchmark.
the not-so-obvious kraken case
the kraken benchmark was
released by mozilla in september 2010
. it was said to contain snippets and kernels of real-world applications and be less of a micro-benchmark compared to sunspider. i don’t want to spend too much time on kraken because i think it wasn’t as influential on javascript performance as sunspider and octane, so i’ll highlight one particular example from the
audio-oscillator.js
test.
so, the test invokes the
calcosc
function 500 times.
calcosc
first calls
generate
on the global
sineoscillator
, then creates a new
oscillator
, calls
generate
on that, and adds it to the global
sine
oscillator. without going into detail as to why the test is doing this, let’s have a look at the
generate
method on the
oscillator
prototype.
looking at the code, you’d expect this to be dominated by the array accesses or the multiplications or the
math.round
calls in the loop, but surprisingly, what’s completely dominating the runtime of
oscillator.prototype.generate
is the
offset % this.wavetablelength
expression. running this benchmark in a profiler on any intel machine reveals that more than 20% of the ticks are attributed to the
idiv
instruction that we generate for the modulus. one interesting observation, however, is that the
wavetablelength
field of the
oscillator
instances always contains the same value 2048, as it’s only assigned once in the
oscillator
constructor.
if we know that the right hand side of an integer modulus operation is a power of two, we can generate
way better code
obviously and completely avoid the
idiv
instruction on intel. so, what we needed was a way to get the information that
this.wavetablelength
is always 2048 from the
oscillator
constructor to the modulus operation in
oscillator.prototype.generate
. one obvious way would be to try to rely on inlining of everything into the
calcosc
function and let load/store elimination do the constant propagation for us, but this would not work for the
sine
oscillator, which is allocated outside the
calcosc
function.
so, what we did instead was add support for tracking certain constant values as right-hand side feedback for the modulus operator. this does make some sense in v8 since we track type feedback for binary operations like
+
,
*
and
%
on uses, which means the operator tracks the types of inputs it has seen and the types of outputs that were produced (see the slides from the roundtable talk on
fast arithmetic for dynamic languages
recently for some details). hooking this up with fullcodegen and crankshaft was even fairly easy back then. the
binaryopic
for
mod
can also track known power of two right hand sides. in fact, running the default configuration of v8 (with crankshaft and fullcodegen):
$ ~/projects/v8/out/release/d8 --trace-ic audio-oscillator.js
[...snip...]
[binaryopic(mod:none*none->none) => (mod:smi*2048->smi) @ ~oscillator.generate+598 at audio-oscillator.js:697]
[...snip...]
$
...shows that the
binaryopic
is picking up the proper constant feedback for the right-hand side of the modulus and properly tracks that the left-hand side was always a small integer (a
smi
in v8 speak), and we also always produced a small integer result. looking at the generated code using
--print-opt-code --code-comments
quickly reveals that crankshaft utilizes the feedback to generate an efficient code sequence for the integer modulus in
oscillator.prototype.generate
:
[...snip...]
;;; <@80,#84> load-named-field
0x133a0bdacc4a 330 8b4343 movl rax,[rbx+0x43]
;;; <@83,#86> compare-numeric-and-branch
0x133a0bdacc4d 333 3d00080000 cmp rax,0x800
0x133a0bdacc52 338 0f85ff000000 jnz 599 (0x133a0bdacd57)
[...snip...]
;;; <@90,#94> mod-by-power-of-2-i
0x133a0bdacc5b 347 4585db testl r11,r11
0x133a0bdacc5e 350 790f jns 367 (0x133a0bdacc6f)
0x133a0bdacc60 352 41f7db negl r11
0x133a0bdacc63 355 4181e3ff070000 andl r11,0x7ff
0x133a0bdacc6a 362 41f7db negl r11
0x133a0bdacc6d 365 eb07 jmp 374 (0x133a0bdacc76)
0x133a0bdacc6f 367 4181e3ff070000 andl r11,0x7ff
[...snip...]
;;; <@127,#88> deoptimize
0x133a0bdacd57 599 e81273cdff call 0x133a0ba8406e
[...snip...]
so, you see we load the value of
this.wavetablelength
(
rbx
holds the
this
reference), check that it’s still 2048 (hexadecimal 0x800), and if so, just perform a bitwise and with the proper bitmask 0x7ff (
r11
contains the value of the loop induction variable
i
) instead of using the
idiv
instruction (paying proper attention to preserve the sign of the left hand side).
the over-specialization issue
so, this trick is pretty damn cool, but as with many benchmark-focused tricks, it has one major drawback: it’s overspecialized! as soon as the right hand side changes, all optimized code will have to be deoptimized (as the assumption that the right hand is always a certain power of two no longer holds) and any further optimization attempts will have to use
idiv
again, as the
binaryopic
will most likely report feedback in the form
smi*smi->smi
then. for example, let’s assume we instantiate another
oscillator
, set a different
wavetablelength
on it, and call
generate
for the oscillator. we’d lose 20% performance even though the actually interesting
oscillator
s are not affected (i.e., the engine does non-local penalization here).
--- audio-oscillator.js.orig 2016-12-15 22:01:43.897033156 +0100
+++ audio-oscillator.js 2016-12-15 22:02:26.397326067 +0100
@@ -1931,6 +1931,10 @@
var frequency = 344.53;
var sine = new oscillator(oscillator.sine, frequency, 1, buffersize, samplerate);
+var unused = new oscillator(oscillator.sine, frequency, 1, buffersize, samplerate);
+unused.wavetablelength = 1024;
+unused.generate();
+
var calcosc = function() {
sine.generate();
comparing the execution times of the original
audio-oscillator.js
and the version that contains an additional unused
oscillator
instance with a modified
wavetablelength
shows the expected results:
$ ~/projects/v8/out/release/d8 audio-oscillator.js.orig
time (audio-oscillator-once): 64 ms.
$ ~/projects/v8/out/release/d8 audio-oscillator.js
time (audio-oscillator-once): 81 ms.
$
this is an example for a pretty terrible performance cliff. let’s say a developer writes code for a library and does careful tweaking and optimizations using certain sample input values, and the performance is decent. now a user starts using that library reading through the performance notes, but somehow falls off the performance cliff, because she/he is using the library in a slightly different way (i.e., somehow polluting type feedback for a certain
binaryopic
,) and is hit by a 20% slowdown (compared to the measurements of the library author) that neither the library author nor the user can explain, and that seems rather arbitrary.
now, this is not uncommon in javascript land, and unfortunately, quite a couple of these cliffs are just unavoidable, because they are due to the fact that javascript performance is based on optimistic assumptions and speculation. we have been spending a lot of time and energy trying to come up with ways to avoid these performance cliffs, and still provide (nearly) the same performance.
as it turns out, it makes a lot of sense to avoid
idiv
whenever possible even if you don’t necessarily know that the right-hand side is always a power of two (via dynamic feedback). what turbofan does is different from crankshaft in that it always checks at runtime whether the input is a power of two, so a general case for signed integer modulus with optimization for an unknown power of two right-hand side looks like this (in pseudo-code):
if 0 < rhs then
msk = rhs - 1
if rhs & msk != 0 then
lhs % rhs
else
if lhs < 0 then
-(-lhs & msk)
else
lhs & msk
else
if rhs < -1 then
lhs % rhs
else
zero
that leads to a lot more consistent and predictable performance (with turbofan):
$ ~/projects/v8/out/release/d8 --turbo audio-oscillator.js.orig
time (audio-oscillator-once): 69 ms.
$ ~/projects/v8/out/release/d8 --turbo audio-oscillator.js
time (audio-oscillator-once): 69 ms.
$
the problem with benchmarks and overspecialization is that the benchmark can give you hints where to look and what to do, but it doesn’t tell you how far you have to go and doesn’t protect the optimization properly. for example, all javascript engines use benchmarks as a way to guard against performance regressions, but running kraken, for example, wouldn’t protect the general approach that we have in turbofan, i.e., we could degrade the modulus optimization in turbofan to the over-specialized version of crankshaft and the benchmark wouldn’t tell us that we regressed, because from the point of view of the benchmark it’s fine!
now, you could extend the benchmark, maybe in the same way that i did above and try to cover everything with benchmarks, which is what engine implementors do to a certain extent. however, that approach doesn’t scale arbitrarily. even though benchmarks are convenient and easy to use for communication and competition, you’ll also need to leave space for common sense, otherwise over-specialization will dominate everything and you’ll have a really, really fine line of acceptable performance and big performance cliffs.
there are various other issues with the kraken tests. next time, we'll look at what is probably the most influential javascript benchmark of the last five years: the octane benchmark.
stay tuned for part four . here are part one and part two , in case you missed them.
Published at DZone with permission of Benedikt Meurer, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments