The Truth About Traditional JavaScript Benchmarks (Part 4 - Octane)

All the optimizations that went into JavaScript engines driven by Octane in the past were added in good faith that Octane is a good proxy for real-world performance.

Wrapping up our series, I'll tell you a bit about Octane. The plan for this series was to highlight a few concrete examples that illustrate why I think it’s not only useful but crucial for the health of the JavaScript community to stop paying attention to static peak performance benchmarks above a certain threshold. I hope that I've convinced you not to take benchmarks at face value, but to really dig into how they relate to real-world circumstances. And keep in mind that benchmarks that made sense in their time may not (and probably do not) apply today.

Last post, we talked about the Kraken benchmark. Now let's wrap things up with Octane.

A Closer Look at Octane

The Octane benchmark is the successor of the V8 benchmark. It was initially announced by Google in mid-2012, and the current version, Octane 2.0, was announced in late 2013. This version contains 15 individual tests, and for two of them (Splay and Mandreel) we measure both the throughput and the latency. These tests range from Microsoft's TypeScript compiler compiling itself, to raw asm.js performance being measured by the zlib test, to a performance test for the RegExp engine, to a ray tracer, to a full 2D physics engine, etc. See the description for a detailed overview of the individual benchmark line items. All these line items were carefully chosen to reflect a certain aspect of JavaScript performance that we considered important in 2012 or expected to become important in the near future.

To a large extent, Octane was super successful in achieving its goal of taking JavaScript performance to the next level. It resulted in a healthy competition in 2012 and 2013, where great performance achievements were driven by Octane. However, it’s almost 2017 now, and the world looks really different than it did in 2012. Besides the usual and often-cited criticism that most items in Octane are essentially outdated (e.g., ancient versions of TypeScript, zlib being compiled via an ancient version of Emscripten, Mandreel not even being available anymore), something way more important affects Octane's usefulness.

We saw big web frameworks winning the race on the web, especially heavy frameworks like Ember and AngularJS, which use patterns of JavaScript execution that are not reflected at all by Octane and are often hurt by (our) Octane-specific optimizations. We also saw JavaScript winning on the server and tooling front, which means there are large-scale JavaScript applications that now often run for weeks if not years, which is also not captured by Octane. As stated in the beginning, we have hard data that suggests that the execution and memory profile of Octane are completely different from what we see on the web daily.

So, let’s look into some concrete examples of benchmark gaming that is happening today with Octane, where optimizations are no longer reflected in the real world. Note that even though this might sound a bit negative in retrospect, it’s definitely not meant that way! As I said a couple of times already, Octane is an important chapter in the JavaScript performance story, and it played a very important role. All the optimizations that went into JavaScript engines driven by Octane in the past were added in good faith that Octane is a good proxy for real-world performance! Every age has its benchmark, and for every benchmark, there comes a time when you have to let go.

That being said, let’s get this show on the road and start by looking at the Box2D test, which is based on Box2DWeb, a popular 2D physics engine originally written by Erin Catto and ported to JavaScript. Overall, it does a lot of floating-point math and drove a lot of good optimizations in JavaScript engines. However, as it turns out, it contains a bug that can be exploited to game the benchmark a bit (blame it on me, I spotted the bug and added the exploit in this case). There’s a function D.prototype.UpdatePairs in the benchmark that looks like this (deminified):

D.prototype.UpdatePairs = function(b) {
    var e = this;
    var f = e.m_pairCount = 0,
        m;
    for (f = 0; f < e.m_moveBuffer.length; ++f) {
        m = e.m_moveBuffer[f];
        var r = e.m_tree.GetFatAABB(m);
        e.m_tree.Query(function(t) {
                if (t == m) return true;
                if (e.m_pairCount == e.m_pairBuffer.length) e.m_pairBuffer[e.m_pairCount] = new O;
                var x = e.m_pairBuffer[e.m_pairCount];
                x.proxyA = t < m ? t : m;
                x.proxyB = t >= m ? t : m;
                ++e.m_pairCount;
                return true
            },
            r)
    }
    for (f = e.m_moveBuffer.length = 0; f < e.m_pairCount;) {
        r = e.m_pairBuffer[f];
        var s = e.m_tree.GetUserData(r.proxyA),
            v = e.m_tree.GetUserData(r.proxyB);
        b(s, v);
        for (++f; f < e.m_pairCount;) {
            s = e.m_pairBuffer[f];
            if (s.proxyA != r.proxyA || s.proxyB != r.proxyB) break;
            ++f
        }
    }
};

Some profiling shows that a lot of time is spent in the innocent-looking inner function passed to e.m_tree.Query in the first loop:

function(t) {
    if (t == m) return true;
    if (e.m_pairCount == e.m_pairBuffer.length) e.m_pairBuffer[e.m_pairCount] = new O;
    var x = e.m_pairBuffer[e.m_pairCount];
    x.proxyA = t < m ? t : m;
    x.proxyB = t >= m ? t : m;
    ++e.m_pairCount;
    return true
}

More precisely, the time is not spent in this function itself, but rather in operations and built-in library functions triggered by it. As it turned out, we spent 4-7% of the overall execution time of the benchmark calling into the Compare runtime function, which implements the general case for the abstract relational comparison.

Box2D compare profile

Almost all the calls to the runtime function came from the CompareICStub, which is used for the two relational comparisons in the inner function:

x.proxyA = t < m ? t : m;
x.proxyB = t >= m ? t : m;

So, these two innocent-looking lines of code are responsible for 99% of the time spent in this function! How come? Well, as with so many things in JavaScript, the abstract relational comparison is not necessarily intuitive to use properly. In this function, both t and m are always instances of L, which is a central class in this application but doesn’t override any of the Symbol.toPrimitive, "toString", "valueOf", or Symbol.toStringTag properties that are relevant for the abstract relational comparison. So, what happens if you write t < m is this:

  1. Calls ToPrimitive (t, hint Number).
  2. Runs OrdinaryToPrimitive (t, "number") since there’s no Symbol.toPrimitive.
  3. Executes t.valueOf(), which yields t itself since it calls the default Object.prototype.valueOf.
  4. Continues with t.toString(), which yields "[object Object]" since the default Object.prototype.toString is being used and no Symbol.toStringTag was found for L.
  5. Calls ToPrimitive (m, hint Number).
  6. Runs OrdinaryToPrimitive (m, "number") since there’s no Symbol.toPrimitive.
  7. Executes m.valueOf(), which yields m itself since it calls the default Object.prototype.valueOf.
  8. Continues with m.toString(), which yields "[object Object]", since the default Object.prototype.toString is being used and no Symbol.toStringTag was found for L.
  9. Does the comparison "[object Object]" < "[object Object]", which yields false.
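The steps above can be observed in any JavaScript engine. The following is a minimal sketch, not code from the benchmark: the class L here is an empty stand-in for the benchmark's class, and Traced is a hypothetical helper whose valueOf and toString only log the call and then delegate to the defaults from Object.prototype.

```javascript
// Stand-in for the benchmark's class L (assumption: like the original, it
// defines none of Symbol.toPrimitive, valueOf, toString, Symbol.toStringTag).
class L {}
const t = new L();
const m = new L();

// Both operands convert to "[object Object]", so the results are constant.
const lessThan = t < m;         // false
const greaterOrEqual = t >= m;  // true

// Hypothetical tracing class: logs each conversion call, then delegates to
// the default implementations, so the comparison result is unchanged.
const log = [];
class Traced {
  constructor(name) {
    this.name = name;
  }
  valueOf() {
    log.push(`${this.name}.valueOf`);
    return Object.prototype.valueOf.call(this); // returns the object itself
  }
  toString() {
    log.push(`${this.name}.toString`);
    return Object.prototype.toString.call(this); // "[object Object]"
  }
}

const tracedLess = new Traced("t") < new Traced("m");

console.log(lessThan, greaterOrEqual, tracedLess);
console.log(log.join(" -> "));
```

The log lists valueOf and then toString for the left operand, followed by the same pair for the right operand, which matches the order of the steps listed above.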

It's the same for t >= m, which always produces true. So, the bug here is that using abstract relational comparison this way just doesn’t make sense. The way to exploit it is to have the compiler constant-fold it, i.e., similar to applying this patch to the benchmark:

--- octane-box2d.js.ORIG        2016-12-16 07:28:58.442977631 +0100
+++ octane-box2d.js     2016-12-16 07:29:05.615028272 +0100
@@ -2021,8 +2021,8 @@
                     if (t == m) return true;
                     if (e.m_pairCount == e.m_pairBuffer.length) e.m_pairBuffer[e.m_pairCount] = new O;
                     var x = e.m_pairBuffer[e.m_pairCount];
-                    x.proxyA = t < m ? t : m;
-                    x.proxyB = t >= m ? t : m;
+                    x.proxyA = m;
+                    x.proxyB = t;
                     ++e.m_pairCount;
                     return true
                 },

Doing so results in a serious speed-up of 13%, because the engine no longer has to do the comparison and all the property lookups and builtin function calls triggered by it.

$ ~/Projects/v8/out/Release/d8 octane-box2d.js.ORIG
Score (Box2D): 48063
$ ~/Projects/v8/out/Release/d8 octane-box2d.js
Score (Box2D): 55359
$

So, how did we do that? As it turned out, we already had a mechanism for tracking the shape of objects that are being compared in the CompareIC, the so-called known receiver map tracking (where map is V8 speak for object shape+prototype), but that was limited to abstract and strict equality comparisons. However, I could easily extend the tracking to also collect the feedback for abstract relational comparison:

$ ~/Projects/v8/out/Release/d8 --trace-ic octane-box2d.js
[...SNIP...]
[CompareIC in ~+557 at octane-box2d.js:2024 ((UNINITIALIZED+UNINITIALIZED=UNINITIALIZED)->(RECEIVER+RECEIVER=KNOWN_RECEIVER))#LT @ 0x1d5a860493a1]
[CompareIC in ~+649 at octane-box2d.js:2025 ((UNINITIALIZED+UNINITIALIZED=UNINITIALIZED)->(RECEIVER+RECEIVER=KNOWN_RECEIVER))#GTE @ 0x1d5a860496e1]
[...SNIP...]
$

Here the CompareIC used in the baseline code tells us that for the LT (less than) and the GTE (greater than or equal) comparisons in the function we’re looking at, it had only seen RECEIVERs so far (which is V8 speak for JavaScript objects), and all these receivers had the same map 0x1d5a860493a1, which corresponds to the map of L instances.

So, in optimized code, we can constant-fold these operations to false and true respectively, as long as we know that both sides of the comparison are instances with the map 0x1d5a860493a1 and no one messed with L's prototype chain, i.e., the Symbol.toPrimitive, "valueOf", and "toString" methods are the default ones, and no one installed a Symbol.toStringTag accessor property. The rest of the story is black voodoo magic in Crankshaft, with a lot of cursing and initially forgetting to check Symbol.toStringTag properly:
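To see why the folded result is only valid under those conditions, here is a small sketch in plain JavaScript. The classes WithValueOf and WithTag and their id field are hypothetical and exist only for illustration; the point is that either change makes the same comparison produce a different answer, so the optimized code has to guard against both.

```javascript
// Hypothetical class that starts out with only the default conversions.
class WithValueOf {
  constructor(id) {
    this.id = id;
  }
}
const x = new WithValueOf(1);
const y = new WithValueOf(2);

// Defaults only: "[object Object]" < "[object Object]".
const beforePatch = x < y; // false

// Someone "messes with the prototype chain" after the fact.
WithValueOf.prototype.valueOf = function () {
  return this.id;
};

// Now ToPrimitive yields the numbers 1 and 2.
const afterPatch = x < y; // true

// Hypothetical class with a Symbol.toStringTag accessor: the default
// toString now produces a different string per instance.
class WithTag {
  constructor(id) {
    this.id = id;
  }
  get [Symbol.toStringTag]() {
    return `Tag${this.id}`;
  }
}

// "[object Tag1]" < "[object Tag2]" is a string comparison.
const tagged = new WithTag(1) < new WithTag(2); // true

console.log(beforePatch, afterPatch, tagged);
```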

Hydrogen voodoo magic

In the end, there was a rather huge performance boost on this particular benchmark:

Box2D boost

In my defense, back then I was not convinced that this particular behavior would always point to a bug in the original code, so I was even expecting that code in the wild might hit this case fairly often, also because I was assuming that JavaScript developers wouldn’t always care about these kinds of potential bugs. However, I was so wrong, and here I stand corrected! I have to admit that this particular optimization is purely a benchmark thing, and will not help any real code (unless the code is written to benefit from this optimization, but then you could as well write true or false directly in your code instead of using an always-constant relational comparison).

You might wonder why we slightly regressed soon after my patch. That was the period where we threw the whole team at implementing ES2015, which was really a dance with the devil to get all the new stuff in (ES2015 is a monster!) without seriously regressing the traditional benchmarks.

Enough said about Box2D. Let’s have a look at the Mandreel benchmark. Mandreel was a compiler for compiling C/C++ code to JavaScript. It didn’t use the asm.js subset of JavaScript that the more recent Emscripten compiler uses, and it has been deprecated (and has more or less disappeared from the internet) for roughly three years now. Nevertheless, Octane still has a version of the Bullet physics engine compiled via Mandreel.

An interesting test here is the MandreelLatency test, which instruments the Mandreel benchmark with frequent time measurement checkpoints. The idea here was that since Mandreel stresses the VM’s compiler, this test provides an indication of the latency introduced by the compiler, and long pauses between measurement checkpoints lower the final score. In theory, that sounds very reasonable, and it does indeed make some sense. However, as usual, vendors figured out ways to cheat on this benchmark.
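The checkpoint idea can be sketched in a few lines. This is an illustration only and not Octane's actual scoring code: the function names, the 10 ms bucket width, and the root-mean-square penalty are all assumptions chosen to show how a single long pause between checkpoints can dominate the result.

```javascript
// Turn a list of checkpoint timestamps (in ms) into the pauses between them.
function pausesBetween(checkpoints) {
  const pauses = [];
  for (let i = 1; i < checkpoints.length; i++) {
    pauses.push(checkpoints[i] - checkpoints[i - 1]);
  }
  return pauses;
}

// Count pauses per bucket; bucket k covers [k * bucketMs, (k + 1) * bucketMs).
function bucketPauses(pauses, bucketMs) {
  const buckets = new Map();
  for (const pause of pauses) {
    const k = Math.floor(pause / bucketMs);
    buckets.set(k, (buckets.get(k) || 0) + 1);
  }
  return buckets;
}

// Hypothetical penalty: root mean square, so long pauses weigh more than
// they would in a plain average.
function latencyPenalty(pauses) {
  const sumOfSquares = pauses.reduce((sum, p) => sum + p * p, 0);
  return Math.sqrt(sumOfSquares / pauses.length);
}

// Fixed timestamps with one 30 ms pause (e.g., a long compile or GC pause).
const checkpoints = [0, 2, 4, 6, 36, 38, 40];
const pauses = pausesBetween(checkpoints);
const buckets = bucketPauses(pauses, 10);
const mean = pauses.reduce((sum, p) => sum + p, 0) / pauses.length;

console.log(pauses, [...buckets.entries()], mean, latencyPenalty(pauses));
```

In a real harness the timestamps would come from a clock such as performance.now() sampled inside the workload; fixed numbers are used here so the result is reproducible.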

Mozilla bug 1162272

Mandreel contains a huge initialization function, global_init, and engines spend an incredible amount of time just parsing this function and generating baseline code for it. Engines usually parse functions in scripts multiple times: first a so-called pre-parse step to discover the functions inside the script, and then, when a function is invoked for the first time, a full parse step to actually generate baseline code (or bytecode) for it.

This is called lazy parsing in V8 speak. V8 has some heuristics in place to detect functions that are invoked immediately where pre-parsing is actually a waste of time, but that’s not clear for the global_init function in the Mandreel benchmark, thus we’d have an incredibly long pause for pre-parsing + parsing + compiling the big function. So, we added an additional heuristic that would also avoid the pre-parsing for this global_init function.
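As a rough illustration of the source patterns involved, the snippet below contrasts a function that is declared and called later with a parenthesized function expression that is invoked immediately. It is an assumption here, based on commonly documented V8 behavior, that the leading parenthesis is one of the hints such heuristics look for; the snippet itself only shows the two shapes, and the function names are made up.

```javascript
// Declared now, called later: a lazy-parsing engine can pre-parse this body
// to find its end and defer the full parse until the first call.
function calledLater() {
  return "lazy candidate";
}

// Parenthesized function expression invoked immediately (an IIFE): here
// pre-parsing would be wasted work, since the full parse is needed right away.
const eagerResult = (function initNow() {
  return "eager candidate";
})();

// The first call triggers the full parse and compile of calledLater.
const lazyResult = calledLater();

console.log(eagerResult, lazyResult);
```

A large function like global_init that is declared in the first style but called almost immediately is the case such syntactic hints do not catch.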

MandreelLatency benchmark

Source: arewefastyet.com.

So, we saw an almost 200% improvement just by detecting global_init and avoiding the expensive pre-parse step. We are somewhat certain that this should not negatively impact real-world use cases, but there’s no guarantee that this won’t bite you on large functions where pre-parsing would be beneficial (because they aren’t immediately executed).

So, let’s look into another slightly less controversial benchmark: the splay.js test, which is meant to be a data manipulation benchmark that deals with splay trees and exercises the automatic memory management subsystem (AKA the garbage collector). It comes bundled with a latency test that instruments the Splay code with frequent measurement checkpoints, where a long pause between checkpoints is an indication of high latency in the garbage collector. This test measures the frequency of latency pauses, classifies them into buckets, and penalizes frequent long pauses with a low score. Sounds great! No GC pauses, no jank. So much for the theory. Let’s have a look at the benchmark. Here’s what’s at the core of the whole splay tree business:

splay.js

This is the core of the splay tree construction, and despite what you might think looking at the full benchmark, this is more or less all that matters for the SplayLatency score. How come? What the benchmark actually does is construct huge splay trees, so that the majority of nodes survive and thus make it to old space. With a generational garbage collector like the one in V8, this is super expensive: a program that violates the generational hypothesis causes extreme pause times, because essentially everything has to be evacuated from new space to old space. Running V8 in the old configuration clearly shows this problem:

$ out/Release/d8 --trace-gc --noallocation_site_pretenuring octane-splay.js
[20872:0x7f26f24c70d0]       10 ms: Scavenge 2.7 (6.0) -> 2.7 (7.0) MB, 1.1 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]       12 ms: Scavenge 2.7 (7.0) -> 2.7 (8.0) MB, 1.7 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]       14 ms: Scavenge 3.7 (8.0) -> 3.6 (10.0) MB, 0.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]       18 ms: Scavenge 4.8 (10.5) -> 4.7 (11.0) MB, 2.5 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]       22 ms: Scavenge 5.7 (11.0) -> 5.6 (16.0) MB, 2.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]       28 ms: Scavenge 8.7 (16.0) -> 8.6 (17.0) MB, 4.3 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]       35 ms: Scavenge 9.6 (17.0) -> 9.6 (28.0) MB, 6.9 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]       49 ms: Scavenge 16.6 (28.5) -> 16.4 (29.0) MB, 8.2 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]       65 ms: Scavenge 17.5 (29.0) -> 17.5 (52.0) MB, 15.3 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]       93 ms: Scavenge 32.3 (52.5) -> 32.0 (53.5) MB, 17.6 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      126 ms: Scavenge 33.4 (53.5) -> 33.3 (68.0) MB, 31.5 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      151 ms: Scavenge 47.9 (68.0) -> 47.6 (69.5) MB, 15.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      183 ms: Scavenge 49.2 (69.5) -> 49.2 (84.0) MB, 30.9 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      210 ms: Scavenge 63.5 (84.0) -> 62.4 (85.0) MB, 14.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      241 ms: Scavenge 64.7 (85.0) -> 64.6 (99.0) MB, 28.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      268 ms: Scavenge 78.2 (99.0) -> 77.6 (101.0) MB, 16.1 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      298 ms: Scavenge 80.4 (101.0) -> 80.3 (114.5) MB, 28.2 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      324 ms: Scavenge 93.5 (114.5) -> 92.9 (117.0) MB, 16.4 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      354 ms: Scavenge 96.2 (117.0) -> 96.0 (130.0) MB, 27.6 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      383 ms: Scavenge 108.8 (130.0) -> 108.2 (133.0) MB, 16.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      413 ms: Scavenge 111.9 (133.0) -> 111.7 (145.5) MB, 27.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      440 ms: Scavenge 124.1 (145.5) -> 123.5 (149.0) MB, 17.4 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      473 ms: Scavenge 127.6 (149.0) -> 127.4 (161.0) MB, 29.5 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      502 ms: Scavenge 139.4 (161.0) -> 138.8 (165.0) MB, 18.7 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      534 ms: Scavenge 143.3 (165.0) -> 143.1 (176.5) MB, 28.5 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      561 ms: Scavenge 154.7 (176.5) -> 154.2 (181.0) MB, 19.0 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      594 ms: Scavenge 158.9 (181.0) -> 158.7 (192.0) MB, 29.2 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      622 ms: Scavenge 170.0 (192.5) -> 169.5 (197.0) MB, 19.5 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      655 ms: Scavenge 174.6 (197.0) -> 174.3 (208.0) MB, 28.7 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      683 ms: Scavenge 185.4 (208.0) -> 184.9 (212.5) MB, 19.4 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      715 ms: Scavenge 190.2 (213.0) -> 190.0 (223.5) MB, 27.7 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      743 ms: Scavenge 200.7 (223.5) -> 200.3 (228.5) MB, 19.7 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      774 ms: Scavenge 205.8 (228.5) -> 205.6 (239.0) MB, 27.1 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      802 ms: Scavenge 216.1 (239.0) -> 215.7 (244.5) MB, 19.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      833 ms: Scavenge 221.4 (244.5) -> 221.2 (254.5) MB, 26.2 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      861 ms: Scavenge 231.5 (255.0) -> 231.1 (260.5) MB, 19.9 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      892 ms: Scavenge 237.0 (260.5) -> 236.7 (270.5) MB, 26.3 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      920 ms: Scavenge 246.9 (270.5) -> 246.5 (276.0) MB, 20.1 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      951 ms: Scavenge 252.6 (276.0) -> 252.3 (286.0) MB, 25.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]      979 ms: Scavenge 262.3 (286.0) -> 261.9 (292.0) MB, 20.3 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1014 ms: Scavenge 268.2 (292.0) -> 267.9 (301.5) MB, 29.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1046 ms: Scavenge 277.7 (302.0) -> 277.3 (308.0) MB, 22.4 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1077 ms: Scavenge 283.8 (308.0) -> 283.5 (317.5) MB, 25.1 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1105 ms: Scavenge 293.1 (317.5) -> 292.7 (323.5) MB, 20.7 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1135 ms: Scavenge 299.3 (323.5) -> 299.0 (333.0) MB, 24.9 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1164 ms: Scavenge 308.6 (333.0) -> 308.1 (339.5) MB, 20.9 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1194 ms: Scavenge 314.9 (339.5) -> 314.6 (349.0) MB, 25.0 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1222 ms: Scavenge 324.0 (349.0) -> 323.6 (355.5) MB, 21.1 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1253 ms: Scavenge 330.4 (355.5) -> 330.1 (364.5) MB, 25.1 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1282 ms: Scavenge 339.4 (364.5) -> 339.0 (371.0) MB, 22.2 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1315 ms: Scavenge 346.0 (371.0) -> 345.6 (380.0) MB, 25.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1413 ms: Mark-sweep 349.9 (380.0) -> 54.2 (305.0) MB, 5.8 / 0.0 ms  (+ 87.5 ms in 73 steps since start of marking, biggest step 8.2 ms, walltime since start of marking 131 ms) finalize incremental marking via stack guard GC in old space requested
[20872:0x7f26f24c70d0]     1457 ms: Scavenge 65.8 (305.0) -> 65.1 (305.0) MB, 31.0 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1489 ms: Scavenge 69.9 (305.0) -> 69.7 (305.0) MB, 27.1 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1523 ms: Scavenge 80.9 (305.0) -> 80.4 (305.0) MB, 22.9 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1553 ms: Scavenge 85.5 (305.0) -> 85.3 (305.0) MB, 24.2 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1581 ms: Scavenge 96.3 (305.0) -> 95.7 (305.0) MB, 18.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1616 ms: Scavenge 101.1 (305.0) -> 100.9 (305.0) MB, 29.2 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1648 ms: Scavenge 111.6 (305.0) -> 111.1 (305.0) MB, 22.5 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1678 ms: Scavenge 116.7 (305.0) -> 116.5 (305.0) MB, 25.0 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1709 ms: Scavenge 127.0 (305.0) -> 126.5 (305.0) MB, 20.7 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1738 ms: Scavenge 132.3 (305.0) -> 132.1 (305.0) MB, 23.9 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1767 ms: Scavenge 142.4 (305.0) -> 141.9 (305.0) MB, 19.6 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1796 ms: Scavenge 147.9 (305.0) -> 147.7 (305.0) MB, 23.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1825 ms: Scavenge 157.8 (305.0) -> 157.3 (305.0) MB, 19.9 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1853 ms: Scavenge 163.5 (305.0) -> 163.2 (305.0) MB, 22.2 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1881 ms: Scavenge 173.2 (305.0) -> 172.7 (305.0) MB, 19.1 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1910 ms: Scavenge 179.1 (305.0) -> 178.8 (305.0) MB, 23.0 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1944 ms: Scavenge 188.6 (305.0) -> 188.1 (305.0) MB, 25.1 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     1979 ms: Scavenge 194.7 (305.0) -> 194.4 (305.0) MB, 28.4 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     2011 ms: Scavenge 204.0 (305.0) -> 203.6 (305.0) MB, 23.4 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     2041 ms: Scavenge 210.2 (305.0) -> 209.9 (305.0) MB, 23.8 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     2074 ms: Scavenge 219.4 (305.0) -> 219.0 (305.0) MB, 24.5 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     2105 ms: Scavenge 225.8 (305.0) -> 225.4 (305.0) MB, 24.7 / 0.0 ms  allocation failure
[20872:0x7f26f24c70d0]     2138 ms: Scavenge 234.8 (305.0) -> 234.4 (305.0) MB, 23.1 / 0.0 ms  allocation failure
[...SNIP...]
$ 

So, the key observation here is that allocating the splay tree nodes in old space directly would avoid essentially all the overhead of copying objects around and reduce the number of minor GC cycles to the bare minimum (thereby reducing the pauses caused by the GC). We came up with a mechanism called Allocation Site Pretenuring that would try to dynamically gather feedback at allocation sites when it's run in baseline code to decide whether a certain percent of the objects allocated here survives, and if so, instrument the optimized code to allocate objects in old space directly, i.e., pre-tenure the objects.

$ out/Release/d8 --trace-gc octane-splay.js
[20885:0x7ff4d7c220a0]        8 ms: Scavenge 2.7 (6.0) -> 2.6 (7.0) MB, 1.2 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]       10 ms: Scavenge 2.7 (7.0) -> 2.7 (8.0) MB, 1.6 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]       11 ms: Scavenge 3.6 (8.0) -> 3.6 (10.0) MB, 0.9 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]       17 ms: Scavenge 4.8 (10.5) -> 4.7 (11.0) MB, 2.9 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]       20 ms: Scavenge 5.6 (11.0) -> 5.6 (16.0) MB, 2.8 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]       26 ms: Scavenge 8.7 (16.0) -> 8.6 (17.0) MB, 4.5 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]       34 ms: Scavenge 9.6 (17.0) -> 9.5 (28.0) MB, 6.8 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]       48 ms: Scavenge 16.6 (28.5) -> 16.4 (29.0) MB, 8.6 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]       64 ms: Scavenge 17.5 (29.0) -> 17.5 (52.0) MB, 15.2 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]       96 ms: Scavenge 32.3 (52.5) -> 32.0 (53.5) MB, 19.6 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]      153 ms: Scavenge 61.3 (81.5) -> 57.4 (93.5) MB, 27.9 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]      432 ms: Scavenge 339.3 (364.5) -> 326.6 (364.5) MB, 12.7 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]      666 ms: Scavenge 563.7 (592.5) -> 553.3 (595.5) MB, 20.5 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]      825 ms: Mark-sweep 603.9 (644.0) -> 96.0 (528.0) MB, 4.0 / 0.0 ms  (+ 92.5 ms in 51 steps since start of marking, biggest step 4.6 ms, walltime since start of marking 160 ms) finalize incremental marking via stack guard GC in old space requested
[20885:0x7ff4d7c220a0]     1068 ms: Scavenge 374.8 (528.0) -> 362.6 (528.0) MB, 19.1 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]     1304 ms: Mark-sweep 460.1 (528.0) -> 102.5 (444.5) MB, 10.3 / 0.0 ms  (+ 117.1 ms in 59 steps since start of marking, biggest step 7.3 ms, walltime since start of marking 200 ms) finalize incremental marking via stack guard GC in old space requested
[20885:0x7ff4d7c220a0]     1587 ms: Scavenge 374.2 (444.5) -> 361.6 (444.5) MB, 13.6 / 0.0 ms  allocation failure
[20885:0x7ff4d7c220a0]     1828 ms: Mark-sweep 485.2 (520.0) -> 101.5 (519.5) MB, 3.4 / 0.0 ms  (+ 102.8 ms in 58 steps since start of marking, biggest step 4.5 ms, walltime since start of marking 183 ms) finalize incremental marking via stack guard GC in old space requested
[20885:0x7ff4d7c220a0]     2028 ms: Scavenge 371.4 (519.5) -> 358.5 (519.5) MB, 12.1 / 0.0 ms  allocation failure
[...SNIP...]
$ 

Indeed, that essentially fixed the problem for the SplayLatency benchmark completely and boosted our score by over 250%!

SplayLatency benchmark

Source: arewefastyet.com.

As mentioned in the SIGPLAN paper, we had good reasons to believe that allocation site pre-tenuring might be a win for real-world applications, and were really looking forward to seeing improvements and extending the mechanism to cover more than just object and array literals. It didn’t take long to realize that allocation site pre-tenuring can have a pretty serious negative impact on real-world application performance. We actually got a lot of negative press, including a sh*t storm from Ember.js developers and users, not only because of allocation site pre-tenuring (but that was a big part of the story).

The fundamental problem with allocation site pre-tenuring, as we learned, is factories, which are very common in applications today (mostly because of frameworks but also for other reasons). Assume that your object factory is initially used to create the long-living objects that form your object model and the views. This transitions the allocation site in your factory method(s) to tenured state, and everything allocated from the factory immediately goes to old space. After the initial setup is done, your application starts doing stuff, and as part of that, allocates temporary objects from the factory that now start polluting old space, eventually leading to expensive major garbage collection cycles and other negative side effects like triggering incremental marking way too early.
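The factory problem can be modeled with a toy allocation site. This is a sketch of the idea only and not V8's implementation: the AllocationSite class, its method names, and the 0.85 survival threshold are assumptions made for illustration.

```javascript
// Toy model of one allocation site that collects survival feedback.
class AllocationSite {
  constructor(threshold = 0.85) { // assumed threshold, for illustration
    this.allocated = 0;
    this.survived = 0;
    this.tenured = false;
    this.threshold = threshold;
  }
  // Returns the space a new object from this site would be placed in.
  allocate() {
    this.allocated++;
    return this.tenured ? "old" : "new";
  }
  // Called by the (imaginary) GC with the number of objects that survived.
  recordSurvivors(count) {
    this.survived += count;
  }
  // Switch the site to tenured once enough of its objects have survived.
  updateDecision() {
    if (this.allocated > 0 && this.survived / this.allocated >= this.threshold) {
      this.tenured = true;
    }
  }
}

const factorySite = new AllocationSite();

// Setup phase: the factory creates 100 long-lived model objects.
const setupSpaces = [];
for (let i = 0; i < 100; i++) setupSpaces.push(factorySite.allocate());
factorySite.recordSurvivors(100); // all of them survive
factorySite.updateDecision();     // the site transitions to tenured

// Steady state: the same factory now creates short-lived temporaries.
const temporarySpaces = [];
for (let i = 0; i < 1000; i++) temporarySpaces.push(factorySite.allocate());
const pollutingOldSpace = temporarySpaces.filter((s) => s === "old").length;

console.log(factorySite.tenured, pollutingOldSpace);
```

In this model every temporary object lands in old space once the setup phase has flipped the site, which is the pollution described above.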

We started to reconsider the benchmark-driven effort and started looking for real-world-driven solutions instead, which resulted in an effort called Orinoco with the goal of incrementally improving the garbage collector. Part of that effort is a project called unified heap, which will try to avoid copying objects if almost everything in a page survives. For example, on a high level, if new space is full of live objects, just mark all new space pages as belonging to old space now and create a fresh new space from empty pages. This might not yield the same score on the SplayLatency benchmark, but it’s a lot better for real-world use cases, and it automatically adapts to the concrete use case. We are also considering concurrent marking to offload the marking work to a separate thread and thus further reduce the negative impact of incremental marking on both latency and throughput.
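The page-level idea can be sketched as a simple decision per page. This is a toy model and not how V8's heap is actually structured: the page objects, the collectNewSpace function, and the 0.9 live-ratio threshold are hypothetical.

```javascript
// Decide, per new-space page, whether to hand the whole page over to old
// space (no copying) or to copy its few survivors out individually.
function collectNewSpace(pages, promoteThreshold = 0.9) { // assumed threshold
  const result = { promotedInPlace: [], copied: [] };
  for (const page of pages) {
    const liveRatio = page.liveBytes / page.sizeBytes;
    if (liveRatio >= promoteThreshold) {
      // Almost everything survived: relabel the page instead of copying.
      result.promotedInPlace.push(page.id);
    } else {
      // Mostly garbage: copying the survivors is cheap.
      result.copied.push(page.id);
    }
  }
  return result;
}

const outcome = collectNewSpace([
  { id: 1, liveBytes: 95, sizeBytes: 100 }, // splay-like: nearly all live
  { id: 2, liveBytes: 10, sizeBytes: 100 }, // typical: mostly dead
]);

console.log(outcome);
```

Unlike a per-allocation-site decision, this choice is made from what actually survived at collection time, which is why it adapts to the concrete workload.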

Cuteness Break!

Breathe.

OK, I think that should be sufficient to underline the point. I could go on pointing to even more examples where Octane-driven improvements turned out to be a bad idea later, and maybe I’ll do that another day. But let’s stop right here for now…

Conclusion

I hope it is clear by now why benchmarks are generally a good idea, but are only useful up to a certain level: once you cross the line of useful competition, you’ll start wasting the time of your engineers or even start hurting your real-world performance! If we are serious about performance for the web, we need to start judging browsers by their real-world performance and not by their ability to game four-year-old benchmarks. We need to start educating the (tech) press, or failing that, at least ignore them.

Browser benchmark battle October 2016: Chrome vs. Firefox vs. Edge

Source: Browser benchmark battle October 2016: Chrome vs. Firefox vs. Edge, venturebeat.com.


No one is afraid of competition, but gaming potentially broken benchmarks is not really a useful investment of engineering time. We can do a lot more and take JavaScript to the next level. Let’s work on meaningful performance tests that can drive competition on areas of interest for the end user and the developer. Additionally, let’s also drive meaningful improvements for server and tooling side code running in Node.js (either on V8 or ChakraCore)!

One closing comment: Don’t use traditional JavaScript benchmarks to compare phones. It’s really the most useless thing you can do, as JavaScript performance often depends a lot on the software and not necessarily on the hardware, and Chrome ships a new version every six weeks, so whatever you measure in March may already be irrelevant in April. And if there’s no way to avoid running something in a browser that assigns a number to a phone, then at least use a recent full browser benchmark that has at least something to do with what people will do with their browsers, e.g., the Speedometer benchmark.

Thank you!

Here are parts one, two, and three in case you missed them.

Published at DZone with permission of Benedikt Meurer, DZone MVB. See the original article here.
