Over a million developers have joined DZone.

Mistakes in Micro Benchmarks

DZone's Guide to

Mistakes in Micro Benchmarks

Even benchmarks make mistakes. Check out these tips on potential changes, and the results of casting.

Free Resource

xMatters delivers integration-driven collaboration that relays data between systems, while engaging the right people to proactively resolve issues. Read the Monitoring in a Connected Enterprise whitepaper and learn about 3 tools for resolving incidents quickly.

So on my last post I showed a bunch of small micro benchmark, and aside from the actual results, I wasn’t really sure what was going on there. Luckily, I know a few perf experts, so I was able to lean on them.

In particular, the changes that were recommended were:

  • Don’t make just a single tiny operation, it is easy to get too much jitter in the setup for the call if the op is too cheap.
  • Pay attention to potential data issues, the compiler/jit can decide to put something on a register, in which case you are benching the CPU directly, which won’t be the case in the real world.

I also learned how to get the actual assembly being run, which is great. All in all, we get the following benchmark code:

[BenchmarkTask(platform: BenchmarkPlatform.X86,
            jitVersion: BenchmarkJitVersion.RyuJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X86,
            jitVersion: BenchmarkJitVersion.LegacyJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X64,
                jitVersion: BenchmarkJitVersion.LegacyJit)]
[BenchmarkTask(platform: BenchmarkPlatform.X64,
                jitVersion: BenchmarkJitVersion.RyuJit)]
public unsafe class ToCastOrNotToCast
    byte* p1, p2, p3, p4;
    FooHeader* h1, h2,h3,h4;
    public ToCastOrNotToCast()
        p1 = (byte*)Marshal.AllocHGlobal(1024);
        p2 = (byte*)Marshal.AllocHGlobal(1024);
        p3 = (byte*)Marshal.AllocHGlobal(1024);
        p4 = (byte*)Marshal.AllocHGlobal(1024);
        h1 = (FooHeader*)p1;
        h2 = (FooHeader*)p2;
        h3 = (FooHeader*)p3;
        h4 = (FooHeader*)p4;

    public void NoCast()

    public void Cast()

And the following results:

          Method | Platform |       Jit |   AvrTime |    StdDev |             op/s |
---------------- |--------- |---------- |---------- |---------- |----------------- |
            Cast |      X64 | LegacyJit | 0.2135 ns | 0.0113 ns | 4,683,511,436.74 |
          NoCast |      X64 | LegacyJit | 0.2116 ns | 0.0017 ns | 4,725,696,633.67 |
            Cast |      X64 |    RyuJit | 0.2177 ns | 0.0038 ns | 4,593,221,104.97 |
          NoCast |      X64 |    RyuJit | 0.2097 ns | 0.0006 ns | 4,769,090,600.54 |
---------------- |--------- |---------- |---------- |---------- |----------------- |
            Cast |      X86 | LegacyJit | 0.7465 ns | 0.1743 ns | 1,339,630,922.79 |
          NoCast |      X86 | LegacyJit | 0.7474 ns | 0.1320 ns | 1,337,986,425.19 |
            Cast |      X86 |    RyuJit | 0.7481 ns | 0.3014 ns | 1,336,808,932.91 |
          NoCast |      X86 |    RyuJit | 0.7426 ns | 0.0039 ns | 1,346,537,728.81 |

Interestingly enough, the NoCast approach is faster in pretty much all setups.

Here is the assembly code for LegacyJit in x64:


For RyuJit, the code is identical for the cast code, and the only difference in the no casting code is that the mov edx, ecx is mov rdx,rcx in RyuJit.

As an aside, X64 assembly code is much easier to read than x86 assembly code.

In short, casting or not casting has a very minor performance difference, but not casting allows us to save a pointer reference in the object, which means it will be somewhat smaller, and if we are going to have a lot of them, then that can be a pretty nice space saving.

3 Steps to Monitoring in a Connected Enterprise. Check out xMatters.

performance ,benchmark ,casting ,cpu

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}