
Garbage Collectors Affect Microbenchmarks


We all know garbage collection affects benchmarking, but it can also do so in a manner you might not be aware of.


When comparing garbage collectors, there are two key metrics: how much time is spent collecting garbage and the maximum pause time. There is another dimension to the choice of garbage collector, though: how it instruments JIT-compiled code and the consequences of that instrumentation. The cost of this instrumentation is usually a tiny price to pay for improved pause times, which only matter to some applications, but it makes writing benchmarks for code that assigns and reads references potentially error-prone: sometimes, the effect of changing the garbage collector is larger than the difference between two competing implementations. To illustrate this, I will compare a microbenchmark of a simple document cursor under three garbage collectors: ParallelOld (the default in OpenJDK 8), G1 (the default from OpenJDK 9 onwards), and the experimental ZGC, available from JDK 11 onwards.

The code being benchmarked is simple. Imagine a stream of JSON-like documents that need to be translated into another format. The documents contain a special field called the cursor, whose last-encountered value must always be known, for some reason. There is a callback that will be invoked whenever a value of a certain type is encountered (e.g. writeLong(long value)) and a callback that will be invoked whenever the name of an attribute is encountered (i.e. writeName(String name)). The interface being implemented cannot be changed to include a method writeLong(String name, long value) because it is owned by a third party, so state must be carried over between the two invocations. One option, shown in CursoredScanner2 below, is to save the name in the cursor object on each invocation of the writeName callback.
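For reference, here is a minimal sketch of the interface implied by that description; the actual third-party interface is not reproduced in this article and may declare further callbacks.


public interface CursoredScanner {
  void writeName(String name); // invoked whenever an attribute name is encountered
  void writeLong(long value);  // invoked whenever a long value is encountered
  long getCursor();            // the last long written under the trigger name
}
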


public class CursoredScanner2 implements CursoredScanner {

  private final String trigger;
  private String current;
  private long cursor;

  public CursoredScanner2(String trigger) {
    this.trigger = trigger;
  }

  @Override
  public void writeName(String name) {
    this.current = name;
  }

  @Override
  public void writeLong(long value) {
    if (trigger.equals(current)) {
      this.cursor = value;
    }
  }

  @Override
  public long getCursor() {
    return cursor;
  }

}


Alternatively, we could do the same number of comparisons by storing whether the last name was the name of the cursor or not:


public class CursoredScanner1 implements CursoredScanner {

  private final String trigger;

  private boolean atCursor;
  private long cursor;

  public CursoredScanner1(String trigger) {
    this.trigger = trigger;
  }

  @Override
  public void writeName(String name) {
    this.atCursor = trigger.equals(name);
  }

  @Override
  public void writeLong(long value) {
    if (atCursor) {
      this.cursor = value;
    }
  }

  @Override
  public long getCursor() {
    return cursor;
  }
}


Each implementation performs the same number of string comparisons. Supposing performance matters, how can we choose between the two alternatives? I wrote a benchmark that captures the cursor value from documents of varying sizes (a sketch of its shape follows) and ran it with different garbage collector settings on JDK 11. With ParallelOld, CursoredScanner2 was slightly slower.
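The original benchmark code is not reproduced here, but a JMH benchmark of roughly this shape would produce numbers like those below. The setup and document generation in this sketch are illustrative assumptions; only the parameter names, which mirror the columns of the result tables, are taken from the article.


import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// A sketch only: the setup and synthetic documents are illustrative, not the
// author's actual benchmark. The default mode is throughput, matching "thrpt".
@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
public class CursoredScannerBenchmark {

  @Param({"SCANNER1", "SCANNER2"})
  String scannerType;

  @Param({"10", "100"})
  int size;

  @Param({"trigger1"})
  String triggerName;

  private CursoredScanner scanner;
  private String[] names;
  private long[] values;

  @Setup(Level.Trial)
  public void setup() {
    scanner = "SCANNER1".equals(scannerType)
        ? new CursoredScanner1(triggerName)
        : new CursoredScanner2(triggerName);
    // One synthetic "document" of the requested size, containing the trigger once.
    names = new String[size];
    values = new long[size];
    for (int i = 0; i < size; i++) {
      names[i] = (i == size / 2) ? triggerName : ("attribute" + i);
      values[i] = i;
    }
  }

  @Benchmark
  public long scan() {
    for (int i = 0; i < size; i++) {
      scanner.writeName(names[i]);
      scanner.writeLong(values[i]);
    }
    return scanner.getCursor();
  }
}
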

-XX:+UseCondCardMark -XX:+UseParallelOldGC -mx1G -XX:+AlwaysPreTouch


Benchmark  Mode   Threads  Samples  Score      Score Error (99.9%)  Unit    Param: scannerType  Param: size  Param: triggerName
scan       thrpt  1        15       58.081438  1.008727             ops/us  SCANNER1            10           trigger1
scan       thrpt  1        15       6.586134   0.173920             ops/us  SCANNER1            100          trigger1
scan       thrpt  1        15       49.402537  0.943554             ops/us  SCANNER2            10           trigger1
scan       thrpt  1        15       5.248657   0.135281             ops/us  SCANNER2            100          trigger1


The cost here can be attributed to card marking, which keeps the approximation of inter-generational references up to date whenever a reference is assigned. By avoiding the reference assignment in CursoredScanner1, the garbage collector doesn't need to instrument anything at all, because the object graph isn't being mutated.
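Conceptually, the barrier a generational collector emits after each reference store looks something like the sketch below. The names and constants are illustrative, not HotSpot's actual code, but they show why every store like this.current = name costs a little extra, what -XX:+UseCondCardMark changes, and why minor collections can then scan only dirty cards instead of the whole old generation.


// Illustrative only, not real HotSpot code: roughly what a card-marking write
// barrier does after a reference store.
final class CardMarkSketch {
  static final int CARD_SHIFT = 9;            // one card covers 512 bytes of heap (typical)
  static final byte DIRTY = 0;

  final byte[] cardTable = new byte[1 << 20]; // sized for illustration only

  void postWriteBarrier(long addressOfMutatedObject) {
    int card = (int) (addressOfMutatedObject >>> CARD_SHIFT);
    // With -XX:+UseCondCardMark, the card is only written if it isn't already
    // dirty, which avoids needless writes to shared cache lines:
    if (cardTable[card] != DIRTY) {
      cardTable[card] = DIRTY;
    }
    // Without the flag, the store to the card table is unconditional.
  }
}
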

G1 offers significant reductions in maximum pause times by structuring the heap and tracking references differently, and it, too, instruments reference assignments to keep its book-keeping data structures up to date. The effect is pronounced in this benchmark: the barrier shows up clearly in the profile, with some skid implicating the innocent adjacent instruction.
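G1's post-write barrier is heavier than a plain card mark. The sketch below is only a rough approximation of its shape, with illustrative names and constants rather than HotSpot's code: stores within one region are filtered out, and the surviving cross-region stores require a StoreLoad fence, the lock addl mentioned at the end of this article, before a card is dirtied and queued for concurrent refinement.


// Illustrative only, not real HotSpot code: the approximate shape of G1's
// post-write barrier after storing newValue into a reference field.
final class G1PostBarrierSketch {
  static final int LOG_REGION_SIZE = 20;      // 1 MB regions, for illustration
  static final int CARD_SHIFT = 9;
  static final byte DIRTY = 0;
  static final byte YOUNG = 2;

  final byte[] cardTable = new byte[1 << 20];
  final java.util.ArrayDeque<Integer> dirtyCardQueue = new java.util.ArrayDeque<>();

  void postWriteBarrier(long fieldAddress, long newValueAddress) {
    // Only stores that create a reference from one region into another matter.
    if (((fieldAddress ^ newValueAddress) >>> LOG_REGION_SIZE) == 0) return;
    if (newValueAddress == 0) return;         // storing null needs no tracking
    int card = (int) (fieldAddress >>> CARD_SHIFT);
    if (cardTable[card] == YOUNG) return;     // stores into young regions are skipped
    storeLoadFence();                         // shows up as lock addl in profiles
    if (cardTable[card] != DIRTY) {
      cardTable[card] = DIRTY;
      dirtyCardQueue.add(card);               // handed to refinement threads later
    }
  }

  private void storeLoadFence() {
    java.lang.invoke.VarHandle.fullFence();   // stands in for the real fence
  }
}
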

-XX:+UseG1GC -mx1G -XX:+AlwaysPreTouch


Benchmark  Mode   Threads  Samples  Score      Score Error (99.9%)  Unit    Param: scannerType  Param: size  Param: triggerName
scan       thrpt  1        15       62.633572  0.995514             ops/us  SCANNER1            10           trigger1
scan       thrpt  1        15       7.660122   0.231402             ops/us  SCANNER1            100          trigger1
scan       thrpt  1        15       23.833586  0.379903             ops/us  SCANNER2            10           trigger1
scan       thrpt  1        15       2.757419   0.148344             ops/us  SCANNER2            100          trigger1


What about ZGC, one of the two upcoming ultra-low-pause garbage collectors? I can't pretend to understand in detail how ZGC works (beyond what I can glean from profilers), but suffice it to say that it works differently and instruments application code differently: rather than intercepting reference assignments, it seems to intercept reference reads. It's not clear why both implementations perform slightly worse under ZGC than CursoredScanner1 did with G1 or ParallelOld, but there's not much to choose between the two when using ZGC.

-XX:+UnlockExperimentalVMOptions -XX:+UseZGC -mx1G -XX:+AlwaysPreTouch


Benchmark  Mode   Threads  Samples  Score      Score Error (99.9%)  Unit    Param: scannerType  Param: size  Param: triggerName
scan       thrpt  1        15       43.761915  1.160516             ops/us  SCANNER1            10           trigger1
scan       thrpt  1        15       6.190803   0.101114             ops/us  SCANNER1            100          trigger1
scan       thrpt  1        15       39.080922  0.826591             ops/us  SCANNER2            10           trigger1
scan       thrpt  1        15       4.763075   0.126938             ops/us  SCANNER2            100          trigger1


Am I making a case for using ParallelOld and avoiding assigning references because the throughput is slightly better? Not really. While that may be appropriate in some applications, the point is that unless benchmarks focus exclusively on primitive types, the garbage collector has to be considered, and results need to be qualified by this choice. It would be very hard to choose between these implementations without knowing, in advance, which garbage collector would be in use.

As an aside, this is the first time I have run ZGC, so I’m keen to track down the read barrier I have heard about. It looks like the sequence of instructions mov, test, jne occurs on each read:


0x00007fe47c765295: mov    0x30(%r10),%r9
0x00007fe47c765299: test   %r9,0x20(%r15)
0x00007fe47c76529d: jne    0x00007fe47c765b68 
0x00007fe47c7652a3: mov    0x10(%r9),%r14    
0x00007fe47c7652a7: test   %r14,0x20(%r15)
0x00007fe47c7652ab: jne    0x00007fe47c765b76  


The assembly above can be seen whenever a reference is read, and it sets up a data dependency between reads: the mov instruction must happen before the test instruction, which may trigger the jne, so the second mov must depend on the jump and can't be reordered. I was wondering what the purpose of this was, whether the data dependency was the means or the end, and what's in r15, and I found a decent article that talks about it in greater detail. Aleksey Shipilëv, who writes garbage collectors for a living and is better placed to interpret this output, gave some feedback: in HotSpot, r15 is the base address for thread-local storage. Here, r15 + 0x20 is ZGC's so-called bad mask, and failing the test means that the object needs to be marked or relocated. Neither marking nor relocation actually shows up in this profile because there isn't enough garbage generated to trigger it, so the code at 0x00007fe47c765b68 can't be seen. If the test passes, nothing needs to happen and the next reference is read (and intercepted itself).

What jumps out at me here is the data dependency, but there's also no obvious bottleneck in the profile, and this approach avoids synchronization, whereas G1 can be seen to use lock addl on stores, which is often on or close to a bottleneck.
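In other words, the three instructions implement a load barrier along the lines of the sketch below. This is illustrative rather than HotSpot's actual code; the field names and, in particular, the slow-path body are only stand-ins.


// Illustrative only, not real HotSpot code: the approximate shape of ZGC's
// load barrier around each reference read.
final class ZLoadBarrierSketch {
  private long badMask;                       // per thread, at r15 + 0x20; flipped by the GC at phase changes

  long loadReference(long[] fields, int index) {
    long ref = fields[index];                 // the first mov above
    if ((ref & badMask) != 0) {               // the test against 0x20(%r15)
      ref = slowPath(fields, index, ref);     // the jne target, never taken in this profile
    }
    return ref;                               // later loads depend on this value
  }

  private long slowPath(long[] fields, int index, long ref) {
    // Mark or relocate the object as required, then "heal" the reference in
    // place so subsequent loads take the fast path. Details entirely omitted.
    long healed = ref & ~badMask;
    fields[index] = healed;
    return healed;
  }
}
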

