
Abstraction slows you down

· Performance Zone

The Performance Zone is brought to you in partnership with New Relic.

Update

As there have been several comments about my microbenchmark, I feel a common reply is in order.

Thank you for pointing out flaws in my test implementation. This time I used a snippet from Maciej Gorączka to illustrate the pitfalls of optimization and, at the same time, to show the real impact of interface calls. Of course it had to be modified, as it also suffered from the optimization pitfalls.

There are several things in the area of method call optimizations one has to keep in mind when measuring performance. Some of them are described in the performance techniques chapter of the HotSpot internals documentation.
With my original benchmark I did not take the following into account:

Virtual (and interface) invocations with a lopsided type profile are compiled with an optimistic check in favor of the historically common type (or two types).

So there's a lot of truth in saying microbenchmarks can be really tricky.
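One classic trap worth spelling out is dead-code elimination: if the result of the measured loop is never consumed, the JIT is free to remove the loop entirely and you end up timing nothing. A minimal sketch of the defensive pattern (the class and field names here are mine, purely for illustration):

```java
// A sketch of guarding a benchmark loop against dead-code elimination.
// SinkDemo and its field are illustrative names, not from the benchmark above.
public class SinkDemo {
    private long sink; // the field write keeps the loop result observable

    long run(long iterations) {
        long acc = 0;
        for (long i = 0; i < iterations; i++) {
            acc += i & 0xF; // stand-in for the measured work
        }
        sink = acc; // consume the result so the JIT cannot drop the loop
        return acc;
    }

    public static void main(String[] args) {
        System.out.println(new SinkDemo().run(1000));
    }
}
```

This is exactly why the doSomething implementations below write their argument to a field: it gives the loop body a side effect the optimizer has to preserve.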

As an example of how forgetting about the optimizer may mess up your tests, consider the code below:

import java.math.BigDecimal;
import java.util.Random;

public class OptimizationTest
{
    private static final long ITERATIONS = 1000 * 1000 * 1000;
    private static final int RUNS = 1;
    private static final SomeInterface[] IFACES = new SomeInterface[] {new Impl1(), new Impl2()};
    private static final Impl1[] IMPLS = new Impl1[] {new Impl1(), new Impl1()};
    private static final Random RANDOM = new Random();
    private static final int CLASS_COUNT = 2;

    public static void main(String[] args)
    {
        performTestIface(2); //warm up

        final long implTime = performTestImpl(RUNS);
        final BigDecimal runs = new BigDecimal(RUNS);
        final BigDecimal implAvg = new BigDecimal(implTime).divide(runs);
        System.out.println("Invokevirtual: " + implAvg + " ms");

        final long ifaceTime = performTestIface(RUNS);
        final BigDecimal ifaceAvg = new BigDecimal(ifaceTime).divide(runs);
        System.out.println("Invokeinterface: " + ifaceAvg + " ms");
    }

    private static long performTestIface(final long runs)
    {
        final SomeInterface iface = IFACES[RANDOM.nextInt(CLASS_COUNT)];
        long ifaceTime = 0;

        for (int run = 0; run < runs; run++)
        {
            System.gc();
            long start = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++)
            {
                iface.doSomething(i);
            }
            ifaceTime += System.currentTimeMillis() - start;
        }
        return ifaceTime;
    }

    private static long performTestImpl(final long runs)
    {
        final Impl1 impl = IMPLS[RANDOM.nextInt(CLASS_COUNT)];
        long implTime = 0;

        for (int run = 0; run < runs; run++)
        {
            System.gc();
            long start = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++)
            {
                impl.doSomething(i);
            }
            implTime += System.currentTimeMillis() - start;
        }
        return implTime;
    }


    static interface SomeInterface
    {
        void doSomething(int i);
    }

    static class Impl1 implements SomeInterface
    {
        private int i;

        public void doSomething(final int i)
        {
            this.i = i;
        }
    }

    static class Impl2 implements SomeInterface
    {
        private int i;

        public void doSomething(final int i)
        {
            this.i = i;
        }
    }


}

Please make sure to run it 10+ times. I hope you'll have fun with the results.

Now, the explanation.
When run with -XX:+PrintCompilation flag, the test either outputs:

    142   9       test.InvokeInterfaceTest2$Impl1::doSomething (6 bytes)
    142   1%      test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
   1701   2%      test.InvokeInterfaceTest2::performTestImpl @ 36 (75 bytes)
Invokevirtual: 745 ms
   2447  10       test.InvokeInterfaceTest2::performTestIface (77 bytes)
Invokeinterface: 722 ms

 or:

      65   3       test.InvokeInterfaceTest2$Impl2::doSomething (6 bytes)
     66   1%      test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
   1523   4       test.InvokeInterfaceTest2$Impl1::doSomething (6 bytes)
   1523   2%      test.InvokeInterfaceTest2::performTestImpl @ 36 (75 bytes)
Invokevirtual: 732 ms
   2255   5       test.InvokeInterfaceTest2::performTestIface (77 bytes)
   2262   1%     made not entrant  test.InvokeInterfaceTest2::performTestIface @ -2 (77 bytes)
   2268   3%      test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
Invokeinterface: 1816 ms

Note that HotSpot reports the method as made not entrant, which means the previously compiled method is no longer valid. This happens because we are using two different implementations of the same interface, just like in the faster example, but this time the class implementing the interface changed between runs, so the target method had to be deoptimized.

It becomes even clearer when we look at how the optimization has been implemented by the compiler. In the fast run we get this native code for performTestIface:

126   B16: #    N1 <- B2  Freq: 1.01326e-06
126       movl    RSI, #-28    # int
12b       movq    RBP, [rsp + #0]    # spill
12f       movl    [rsp + #4], RAX    # spill
133       call,static  wrapper for: uncommon_trap(reason='range_check' action='make_not_entrant')
        # OptimizationTest::performTestIface @ bci:10  L[0]=RBP L[1]=_ L[2]=_ L[3]=_ L[4]=_ L[5]=_ L[6]=_ L[7]=_ L[8]=_ STK[0]=rsp + #8 STK[1]=rsp + #4
        # OopMap{[8]=NarrowOop off=312}
138       int3    # ShouldNotReachHere

The code gets recompiled once the number of uncommon-trap hits reaches PerMethodTrapLimit or PerBytecodeRecompilationCutoff. So for the slower run the code gets recompiled to:

14f   B16: #    N250 <- B9  Freq: 0.25578
14f       movl    RSI, #-58    # int
154       movq    [rsp + #16], RBX    # spill
159       movq    [rsp + #32], R14    # spill
15e       movq    [rsp + #40], R13    # spill
163       call,static  wrapper for: uncommon_trap(reason='bimorphic' action='maybe_recompile')
        # OptimizationTest::performTestIface @ bci:49  L[0]=rsp + #0 L[1]=_ L[2]=rsp + #40 L[3]=rsp + #16 L[4]=_ L[5]=rsp + #24 L[6]=rsp + #32 L[7]=_ L[8]=RBP STK[0]=rsp + #40 STK[1]=RBP
        # OopMap{[40]=Oop off=360}
168       int3    # ShouldNotReachHere
16d   B17: #    B3 B18 <- B2  Freq: 0.16983
16d       decode_heap_oop_not_null RSI,R10
171       movq    rdi, [RSI + (sizeof(oopDesc) + Klass::secondary_supers_offset_in_bytes())]
    movl    rcx, [rdi + arrayOopDesc::length_offset_in_bytes()]    # length to scan
    addq    rdi, arrayOopDesc::base_offset_in_bytes(T_OBJECT)    # Skip to start of data; set NZ in case count is zero
    repne   scasq    # Scan *rdi++ for a match with rax while cx-- != 0
    jne,s   miss        # Missed: flags nz
    movq    [RSI + (sizeof(oopDesc) + Klass::secondary_super_cache_offset_in_bytes())], RAX    # Hit: update cache
    miss:   
2a7       je     B3  P=0.999999 C=-1.000000

B18: #    N250 <- B17  Freq: 1.6983e-07
2ad       movl    RSI, #-83    # int
2b2       movq    [rsp + #8], R13    # spill
2b7       movq    [rsp + #16], RBX    # spill
2bc       movq    [rsp + #32], R14    # spill
2c1       nop     # 2 bytes pad for loops and calls
2c3       call,static  wrapper for: uncommon_trap(reason='unreached' action='reinterpret')
        # OptimizationTest::performTestIface @ bci:36  L[0]=rsp + #0 L[1]=_ L[2]=rsp + #8 L[3]=rsp + #16 L[4]=_ L[5]=rsp + #24 L[6]=rsp + #32 L[7]=_ L[8]=RBP
        # OopMap{[8]=Oop off=712}
2c8       int3    # ShouldNotReachHere

In high-performance applications, interfaces backing patterns like observer or listener are often called in a loop with the implementations changing frequently. That is why I believe this problem has practical implications.

Thank you, gents, for the remarks. I obviously failed to supply a flawless test proving my arguments the first time.
I hope you'll find this one solid and entertaining.

End of update

 

I think we all like to write nice code; we often employ ideas such as composition, separation of concerns, and so on. Above all, we quite correctly tend to introduce various levels of abstraction by using interfaces and abstract classes. This approach is perfectly valid for the vast majority of scenarios, but it may lead to performance deficiencies when we're dealing with low-latency requirements.

Every time we introduce another level of abstraction, we also add complexity to the procedure by which the JVM resolves the target method to be called.
Methods in Java are not invoked directly, due to the polymorphic nature of the language. For instance:

final Set set = getImpl(); // returns different implementations  
set.add(new Object());

will have different implications than:

final HashSet set = getImpl(); // returns different implementations 
set.add(new Object());

The former would get translated into:

    ...
    NEW java/lang/Object
    DUP
    INVOKESPECIAL java/lang/Object.<init> ()V
    INVOKEINTERFACE java/util/Set.add (Ljava/lang/Object;)Z
    ...

While the latter would look more like:

    ...
    NEW java/lang/Object
    DUP
    INVOKESPECIAL java/lang/Object.<init> ()V
    INVOKEVIRTUAL java/util/HashSet.add (Ljava/lang/Object;)Z
    ...

The difference in the JVM's behaviour for these two cases may not be obvious even after a thorough reading of the instruction specifications. The procedure for looking up a particular method in the virtual method table is almost identical, but the open-ended semantics of INVOKEINTERFACE render some optimizations impossible.

Let's consider the examples below. In the first case:

class Parent {
    public void method1() { }
    public void method2() { }
}

class Child extends Parent {
    public void method2() { } // overridden method
    public void method3() { }
}

This setup will result in virtual method tables that go something like this:

Parent
  1  Parent.method1
  2  Parent.method2

Child
  1  Parent.method1
  2  Child.method2
  3  Child.method3

Here we can observe that the virtual method table of the Child class preserves the ordering of its parent's method table. It only replaces the references to the overridden methods, thus enabling monomorphic calls and optimizations that hard-link to the target methods and completely eliminate the per-call method table lookup. This in turn makes method inlining possible, which can lead to a nice performance boost.
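To restate that layout in running code (a toy demonstration of my own, mirroring the Parent/Child example above): a Child viewed through a Parent reference still lands in the overriding method, because both classes keep method2 in the same vtable slot:

```java
// Toy demonstration: both classes keep method2 in the same vtable slot,
// so a call through a Parent reference is a fixed-offset lookup
// regardless of the concrete type.
public class VtableDemo {
    static class Parent {
        String method1() { return "Parent.method1"; }
        String method2() { return "Parent.method2"; }
    }

    static class Child extends Parent {
        @Override
        String method2() { return "Child.method2"; } // replaces the slot 2 entry
        String method3() { return "Child.method3"; }
    }

    public static void main(String[] args) {
        Parent p = new Child();
        System.out.println(p.method1()); // resolved via slot 1 -> Parent.method1
        System.out.println(p.method2()); // resolved via slot 2 -> Child.method2
    }
}
```

Because the slot index is identical for every subclass, the JIT can turn a monomorphic site like this into a direct, inlinable call.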

Now, let's look at the case for interfaces:

interface SomeInterface {
    void someMethod();
}

class Parent {
    public void method1() { }
    public void method2() { }
}

class Child extends Parent implements SomeInterface {
    public void method2() { } // overridden method from Parent
    public void someMethod() { }
}

class OtherClass implements SomeInterface {
    public void method3() { }
    public void someMethod() { }
}

Virtual method table for the above would resemble this:

Parent
  1  Parent.method1
  2  Parent.method2

Child
  1  Parent.method1
  2  Child.method2
  3  SomeInterface.someMethod

OtherClass
  1  OtherClass.method3
  2  SomeInterface.someMethod

As we can see, the index of the interface method someMethod() is not preserved across classes: for the Child class it is method #3, but for OtherClass the entry for this method is found under #2.
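To make the slot mismatch concrete, here is a toy model of my own (not HotSpot's actual data structure): each class's method table is just an ordered list, and the position of someMethod differs between unrelated implementors:

```java
import java.util.Arrays;
import java.util.List;

// Toy model of per-class method tables (illustrative only): the slot of
// the interface method someMethod differs between implementors, so an
// invokeinterface call site cannot be compiled to one fixed table offset.
public class SlotDemo {
    public static void main(String[] args) {
        List<String> childTable = Arrays.asList("method1", "method2", "someMethod");
        List<String> otherTable = Arrays.asList("method3", "someMethod");

        // Child has someMethod in slot 3, OtherClass in slot 2 (1-based).
        System.out.println("Child slot:      " + (childTable.indexOf("someMethod") + 1));
        System.out.println("OtherClass slot: " + (otherTable.indexOf("someMethod") + 1));
    }
}
```

An invokevirtual site, by contrast, can rely on the slot being the same for the whole class hierarchy.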

The consequence of introducing these megamorphic calls is the additional cost of a virtual method table lookup with each method call. To illustrate the impact of megamorphic calls we can run a simple microbenchmark:

Edit: the below code is obviously flawed (see the comments). Use the one from the top of the article.

import java.math.BigDecimal;

public class InvokeInterfaceTest
{
    private static final long ITERATIONS = 1000 * 1000 * 1000;
    private static final int RUNS = 10;
    private static long implTime;
    private static long ifaceTime;

    public static void main(String[] args)
    {
        performTest(2); //warm up
        ifaceTime = implTime = 0;
        performTest(RUNS);
        final BigDecimal implAvg = new BigDecimal(implTime).divide(new BigDecimal(RUNS));
        final BigDecimal ifaceAvg = new BigDecimal(ifaceTime).divide(new BigDecimal(RUNS));
        System.out.println("Invokevirtual: " + implAvg + " ms");
        System.out.println("Invokeinterface: " + ifaceAvg + " ms");
    }

    private static void performTest(final long runs)
    {
        final FooImpl impl = new FooImpl();
        final Foo iface = new FooImpl();

        for (int run = 0; run < runs; run++)
        {
            System.gc();
            long start = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++)
            {
                impl.doSomething(i);
            }
            implTime += System.currentTimeMillis() - start;

            System.gc();
            start = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++)
            {
                iface.doSomething(i);
            }
            ifaceTime += System.currentTimeMillis() - start;
        }
    }

    private static interface Foo
    {
        void doSomething(int i);
    }

    private static class FooImpl implements Foo
    {
        private int i;

        public void doSomething(final int i)
        {
            this.i = i; //avoid optimization
        }
    }
}

Which on my laptop, with an Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz using Java 1.6.0_26-b03 for 64-bit Linux, gives the results below:

Invokevirtual: 735.4 ms
Invokeinterface: 1412.4 ms

In order to minimize the influence of other processes running on my system, I pinned the test thread to a single core, rendering that core unavailable to other processes and achieving more stable results (as described in my previous post about Java threads).

So, if you have to deal with a huge number of invocations and latency is a crucial factor, you will probably have to consider eliminating megamorphic calls at the cost of a less abstract, somewhat dodgy design.
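If you do go down that road, one option is to "peel" the dominant implementation with an instanceof check by hand, mimicking the optimistic check HotSpot itself emits for lopsided type profiles. A minimal sketch, with hypothetical class names of my own:

```java
// Manual "type peeling": check for the historically common implementation
// first so the hot path becomes a monomorphic, inlinable call.
// FastPath and SlowPath are hypothetical names for illustration.
public class PeelDemo {
    interface SomeInterface {
        int doSomething(int i);
    }

    static final class FastPath implements SomeInterface {
        public int doSomething(int i) { return i + 1; }
    }

    static final class SlowPath implements SomeInterface {
        public int doSomething(int i) { return i - 1; }
    }

    static int call(SomeInterface target, int i) {
        // Static type check plus a direct call the JIT can inline.
        if (target instanceof FastPath) {
            return ((FastPath) target).doSomething(i);
        }
        // Fallback: a plain invokeinterface for everything else.
        return target.doSomething(i);
    }

    public static void main(String[] args) {
        System.out.println(call(new FastPath(), 41));
        System.out.println(call(new SlowPath(), 41));
    }
}
```

Doing this by hand only pays off when the profile is stable and the call site is genuinely hot; otherwise it is just the dodgy design without the benefit.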

