Abstraction slows you down
Update
Since there have been several comments about my microbenchmark, I feel obliged to give a common reply.
Thank you for pointing out the flaws in my test implementation. This time I used a snippet from Maciej Gorączka to illustrate the pitfalls of optimization and, at the same time, to show the real impact of interface calls. Of course it had to be modified, as it also suffered from the optimization pitfalls.
There are several things in the area of method call optimization that one has to keep in mind when measuring performance. Some of them are described in the performance techniques chapter of the HotSpot internals documentation.
With my original benchmark I did not take the following into account:
Virtual (and interface) invocations with a lopsided type profile are compiled with an optimistic check in favor of the historically common type (or two types).
So there is a lot of truth in saying that microbenchmarks can be really tricky.
As an example of how forgetting about the optimizer may mess up your tests, consider the code below:
import java.math.BigDecimal;
import java.util.Random;

public class OptimizationTest {

    private static final long ITERATIONS = 1000 * 1000 * 1000;
    private static final int RUNS = 1;
    private static final SomeInterface[] IFACES = new SomeInterface[] {new Impl1(), new Impl2()};
    private static final Impl1[] IMPLS = new Impl1[] {new Impl1(), new Impl1()};
    private static final Random RANDOM = new Random();
    private static final int CLASS_COUNT = 2;

    public static void main(String[] args) {
        performTestIface(2); //warm up

        final long implTime = performTestImpl(RUNS);
        final BigDecimal runs = new BigDecimal(RUNS);
        final BigDecimal implAvg = new BigDecimal(implTime).divide(runs);
        System.out.println("Invokevirtual: " + implAvg + " ms");

        final long ifaceTime = performTestIface(RUNS);
        final BigDecimal ifaceAvg = new BigDecimal(ifaceTime).divide(runs);
        System.out.println("Invokeinterface: " + ifaceAvg + " ms");
    }

    private static long performTestIface(final long runs) {
        final SomeInterface iface = IFACES[RANDOM.nextInt(CLASS_COUNT)];
        long ifaceTime = 0;
        for (int run = 0; run < runs; run++) {
            System.gc();
            long start = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++) {
                iface.doSomething(i);
            }
            ifaceTime += System.currentTimeMillis() - start;
        }
        return ifaceTime;
    }

    private static long performTestImpl(final long runs) {
        final Impl1 impl = IMPLS[RANDOM.nextInt(CLASS_COUNT)];
        long implTime = 0;
        for (int run = 0; run < runs; run++) {
            System.gc();
            long start = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++) {
                impl.doSomething(i);
            }
            implTime += System.currentTimeMillis() - start;
        }
        return implTime;
    }

    static interface SomeInterface {
        void doSomething(int i);
    }

    static class Impl1 implements SomeInterface {
        private int i;
        public void doSomething(final int i) {
            this.i = i;
        }
    }

    static class Impl2 implements SomeInterface {
        private int i;
        public void doSomething(final int i) {
            this.i = i;
        }
    }
}
Please make sure to run it 10+ times. I hope you'll have fun with the results.
Now, the explanation.
When run with the -XX:+PrintCompilation flag, the test outputs either:
142 9 test.InvokeInterfaceTest2$Impl1::doSomething (6 bytes)
142 1% test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
1701 2% test.InvokeInterfaceTest2::performTestImpl @ 36 (75 bytes)
Invokevirtual: 745 ms
2447 10 test.InvokeInterfaceTest2::performTestIface (77 bytes)
Invokeinterface: 722 ms
or:
65 3 test.InvokeInterfaceTest2$Impl2::doSomething (6 bytes)
66 1% test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
1523 4 test.InvokeInterfaceTest2$Impl1::doSomething (6 bytes)
1523 2% test.InvokeInterfaceTest2::performTestImpl @ 36 (75 bytes)
Invokevirtual: 732 ms
2255 5 test.InvokeInterfaceTest2::performTestIface (77 bytes)
2262 1% made not entrant test.InvokeInterfaceTest2::performTestIface @ -2 (77 bytes)
2268 3% test.InvokeInterfaceTest2::performTestIface @ 36 (77 bytes)
Invokeinterface: 1816 ms
Note that HotSpot complains about the method being made not entrant, which means that the previously compiled version of the method is no longer valid. This happens because we are using two different implementations of the same interface, just like in the faster example, but this time the randomly selected implementing class changed between the warm-up and the measured run, so the target method had to be deoptimized.
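If you want to reproduce the deoptimization on its own, here is a minimal sketch (illustration only; the class and method names below are made up and not part of the benchmark, and whether the message shows up depends on your JVM version and flags): train a call site with a single implementation until it gets compiled, then switch to a second one, which invalidates the optimistic type check.

public class DeoptDemo {

    interface Target {
        void call(int i);
    }

    static class ImplA implements Target {
        private int value;
        public void call(final int i) { this.value = i; }
    }

    static class ImplB implements Target {
        private int value;
        public void call(final int i) { this.value = i; }
    }

    // The call site in this loop first sees only ImplA, so it may be
    // compiled with an optimistic check in favour of that single type.
    private static void drive(final Target target, final int iterations) {
        for (int i = 0; i < iterations; i++) {
            target.call(i);
        }
    }

    public static void main(String[] args) {
        drive(new ImplA(), 100 * 1000 * 1000); // warm up with one receiver type
        drive(new ImplB(), 100 * 1000 * 1000); // a new receiver type should trigger a deoptimization
    }
}

Run it with -XX:+PrintCompilation and watch for the made not entrant entries around the second call.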
It becomes even clearer when we look at how the optimization has been implemented by the compiler. In the fast run we get the following native code for performTestIface:
126 B16: # N1 <- B2 Freq: 1.01326e-06
126 movl RSI, #-28 # int
12b movq RBP, [rsp + #0] # spill
12f movl [rsp + #4], RAX # spill
133 call,static wrapper for: uncommon_trap(reason='range_check' action='make_not_entrant')
# OptimizationTest::performTestIface @ bci:10 L[0]=RBP L[1]=_ L[2]=_ L[3]=_ L[4]=_ L[5]=_ L[6]=_ L[7]=_ L[8]=_ STK[0]=rsp + #8 STK[1]=rsp + #4
# OopMap{[8]=NarrowOop off=312}
138 int3 # ShouldNotReachHere
The code gets recompiled once the number of recorded uncommon traps reaches PerMethodTrapLimit or PerBytecodeRecompilationCutoff. So for the slower run the code gets recompiled to:
14f B16: # N250 <- B9 Freq: 0.25578
14f movl RSI, #-58 # int
154 movq [rsp + #16], RBX # spill
159 movq [rsp + #32], R14 # spill
15e movq [rsp + #40], R13 # spill
163 call,static wrapper for: uncommon_trap(reason='bimorphic' action='maybe_recompile')
# OptimizationTest::performTestIface @ bci:49 L[0]=rsp + #0 L[1]=_ L[2]=rsp + #40 L[3]=rsp + #16 L[4]=_ L[5]=rsp + #24 L[6]=rsp + #32 L[7]=_ L[8]=RBP STK[0]=rsp + #40 STK[1]=RBP
# OopMap{[40]=Oop off=360}
168 int3 # ShouldNotReachHere
168
16d B17: # B3 B18 <- B2 Freq: 0.16983
16d decode_heap_oop_not_null RSI,R10
171 movq rdi, [RSI + (sizeof(oopDesc) + Klass::secondary_supers_offset_in_bytes())]
movl rcx, [rdi + arrayOopDesc::length_offset_in_bytes()] # length to scan
addq rdi, arrayOopDesc::base_offset_in_bytes(T_OBJECT) # Skip to start of data; set NZ in case count is zero
repne scasq # Scan *rdi++ for a match with rax while cx-- != 0
jne,s miss # Missed: flags nz
movq [RSI + (sizeof(oopDesc) + Klass::secondary_super_cache_offset_in_bytes())], RAX # Hit: update cache
miss:
2a7 je B3 P=0.999999 C=-1.000000
B18: # N250 <- B17 Freq: 1.6983e-07
2ad movl RSI, #-83 # int
2b2 movq [rsp + #8], R13 # spill
2b7 movq [rsp + #16], RBX # spill
2bc movq [rsp + #32], R14 # spill
2c1 nop # 2 bytes pad for loops and calls
2c3 call,static wrapper for: uncommon_trap(reason='unreached' action='reinterpret')
# OptimizationTest::performTestIface @ bci:36 L[0]=rsp + #0 L[1]=_ L[2]=rsp + #8 L[3]=rsp + #16 L[4]=_ L[5]=rsp + #24 L[6]=rsp + #32 L[7]=_ L[8]=RBP
# OopMap{[8]=Oop off=712}
2c8 int3 # ShouldNotReachHere
In high-performance applications, interfaces reflecting patterns like observer or listener are usually called in a loop, with implementations changing frequently. That is why I believe this problem may have practical implications.
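To make that concrete, here is a sketch of the kind of dispatch loop I mean (the listener types below are purely hypothetical): a hot loop delivering events through an interface whose concrete implementations differ from element to element, which prevents the call site from staying monomorphic.

interface PriceListener {
    void onPrice(long price);
}

class LoggingListener implements PriceListener {
    public void onPrice(final long price) { /* write to a log */ }
}

class AlertingListener implements PriceListener {
    public void onPrice(final long price) { /* check alert thresholds */ }
}

class PriceFeed {

    private final PriceListener[] listeners;

    PriceFeed(final PriceListener... listeners) {
        this.listeners = listeners;
    }

    // Hot path: the receiver type changes from iteration to iteration,
    // so the onPrice() call site sees several types and cannot be
    // reduced to a simple monomorphic call.
    void publish(final long price) {
        for (final PriceListener listener : listeners) {
            listener.onPrice(price);
        }
    }

    public static void main(String[] args) {
        final PriceFeed feed = new PriceFeed(new LoggingListener(), new AlertingListener());
        for (int i = 0; i < 1000 * 1000; i++) {
            feed.publish(i);
        }
    }
}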
Thank you, gents, for the remarks. I obviously failed to supply a flawless test proving my arguments the first time.
Hope you'll find this one solid and entertaining.
End of update
I think we all like to write nice code; we very often employ ideas such as composition, separation of concerns and so on. Above all, we quite correctly tend to introduce various levels of abstraction by using interfaces and abstract classes. This approach is perfectly valid for the vast majority of scenarios, but it may lead to performance penalties when we are dealing with low-latency requirements.
Every time we introduce another level of abstraction, we also add more complexity to the procedure by which the JVM resolves the target method to be called.
Methods in Java are not invoked directly due to the polymorphic nature of the language. For instance:
final Set set = getImpl(); // returns different implementations
set.add(new Object());
will have different implications than:
final HashSet set = getImpl(); // returns different implementations
set.add(new Object());
The former would get translated into:
...
NEW java/lang/Object
DUP
INVOKESPECIAL java/lang/Object.<init> ()V
INVOKEINTERFACE java/util/Set.add (Ljava/lang/Object;)Z
...
While the latter would look more like:
...
NEW java/lang/Object
DUP
INVOKESPECIAL java/lang/Object.<init> ()V
INVOKEVIRTUAL java/util/HashSet.add (Ljava/lang/Object;)Z
...
The difference in the JVM's behaviour for these two cases may not be so obvious, even after a thorough reading of the instruction specifications. The procedure for looking up a particular method in the virtual method table is almost identical, but the open character of the INVOKEINTERFACE semantics renders some optimizations impossible.
Let's consider the examples below. In the first case:
class Parent {
    public void method1() { }
    public void method2() { }
}

class Child extends Parent {
    public void method2() { } // overridden method
    public void method3() { }
}
This setup will result in a virtual method table that looks something like this:
Parent:
1 | Parent.method1
2 | Parent.method2

Child:
1 | Parent.method1
2 | Child.method2
3 | Child.method3
Here we can observe how the virtual method table of the Child class preserves the method order of its parent's table. It only overrides the references to the overridden methods, thus enabling monomorphic calls and optimizations that hard-link to the target methods and completely eliminate the need for a method table lookup on each call. This in turn makes method inlining possible, which can lead to a nice performance boost.
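As a rough sketch of why this matters (the class names are made up for illustration, and whether the JIT actually compiles and inlines the loop depends on the usual thresholds): when a call site only ever observes one receiver type, HotSpot can devirtualize it, guard the assumption with a cheap type check and inline the method body.

class Counter {                        // plays the role of Parent above
    protected long total;
    public void add(final long n) { total += n; }
}

class CheckedCounter extends Counter { // plays the role of Child above
    public void add(final long n) {
        if (n >= 0) {
            total += n;
        }
    }
}

class MonomorphicDemo {
    public static void main(String[] args) {
        final Counter counter = new CheckedCounter();
        // Only CheckedCounter.add() is ever observed at this call site, so the
        // JIT can replace the virtual dispatch with a guarded direct call and inline it.
        for (int i = 0; i < 100 * 1000 * 1000; i++) {
            counter.add(i);
        }
        System.out.println(counter.total);
    }
}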
Now, let's look at the case for interfaces:
interface SomeInterface {
    void someMethod();
}

class Parent {
    public void method1() { }
    public void method2() { }
}

class Child extends Parent implements SomeInterface {
    public void method2() { } // overridden method from Parent
    public void someMethod() { }
}

class OtherClass implements SomeInterface {
    public void method3() { }
    public void someMethod() { }
}
The virtual method tables for the above would resemble this:
Parent:
1 | Parent.method1
2 | Parent.method2

Child:
1 | Parent.method1
2 | Child.method2
3 | SomeInterface.someMethod

OtherClass:
1 | OtherClass.method3
2 | SomeInterface.someMethod
As we can see, the index of the interface method someMethod() is not preserved across classes: for the Child class it is entry #3, but for OtherClass the entry for this method is found under #2.
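A short usage fragment for the classes above illustrates the point (it assumes the definitions from the previous listing): both call sites compile to the same INVOKEINTERFACE instruction, yet at run time the method sits at a different position in each class's table, so the receiver's class has to be examined on every call.

final SomeInterface first = new Child();       // someMethod is entry #3 in Child's table
final SomeInterface second = new OtherClass(); // someMethod is entry #2 in OtherClass's table
first.someMethod();
second.someMethod();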
The consequence of introducing these megamorphic calls is the additional cost of a virtual method table lookup on each method call. To illustrate the impact of megamorphic calls, we can run a simple microbenchmark:
Edit: the code below is obviously flawed (see the comments). Use the one from the top of the article.
import java.math.BigDecimal;

public class InvokeInterfaceTest {

    private static final long ITERATIONS = 1000 * 1000 * 1000;
    private static final int RUNS = 10;

    private static long implTime;
    private static long ifaceTime;

    public static void main(String[] args) {
        performTest(2); //warm up
        ifaceTime = implTime = 0;
        performTest(RUNS);
        final BigDecimal implAvg = new BigDecimal(implTime).divide(new BigDecimal(RUNS));
        final BigDecimal ifaceAvg = new BigDecimal(ifaceTime).divide(new BigDecimal(RUNS));
        System.out.println("Invokevirtual: " + implAvg + " ms");
        System.out.println("Invokeinterface: " + ifaceAvg + " ms");
    }

    private static void performTest(final long runs) {
        final FooImpl impl = new FooImpl();
        final Foo iface = new FooImpl();
        for (int run = 0; run < runs; run++) {
            System.gc();
            long start = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++) {
                impl.doSomething(i);
            }
            implTime += System.currentTimeMillis() - start;

            System.gc();
            start = System.currentTimeMillis();
            for (int i = 0; i < ITERATIONS; i++) {
                iface.doSomething(i);
            }
            ifaceTime += System.currentTimeMillis() - start;
        }
    }

    private static interface Foo {
        void doSomething(int i);
    }

    private static class FooImpl implements Foo {
        private int i;
        public void doSomething(final int i) {
            this.i = i; //avoid optimization
        }
    }
}
Which, on my laptop with an Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz, using Java 1.6.0_26-b03 for 64-bit Linux, gives the results below:
Invokevirtual: 735.4 ms
Invokeinterface: 1412.4 ms
In order to minimize the influence of other processes running on my system, I pinned the test thread to a single core, rendering that core unavailable to other processes and achieving more stable results (as described in my previous post about Java threads).
So, if you have to deal with a huge number of invocations and latency is a crucial factor, you will probably have to consider eliminating megamorphic calls at the cost of a less abstract, arguably dodgier design.
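As a sketch of what that trade-off can look like (the handler names are invented for illustration, and whether it pays off has to be measured case by case): instead of dispatching through a shared interface in a single loop, keep homogeneous, concretely typed collections so that each hot call site only ever sees one receiver type.

class LogHandler {
    void handle(final long event) { /* write to a log */ }
}

class MetricsHandler {
    void handle(final long event) { /* update counters */ }
}

class EventBus {

    private final LogHandler[] logHandlers;
    private final MetricsHandler[] metricsHandlers;

    EventBus(final LogHandler[] logHandlers, final MetricsHandler[] metricsHandlers) {
        this.logHandlers = logHandlers;
        this.metricsHandlers = metricsHandlers;
    }

    // Each loop below dispatches on exactly one concrete type, so both
    // call sites can stay monomorphic and remain candidates for inlining.
    void publish(final long event) {
        for (final LogHandler handler : logHandlers) {
            handler.handle(event);
        }
        for (final MetricsHandler handler : metricsHandlers) {
            handler.handle(event);
        }
    }

    public static void main(String[] args) {
        final EventBus bus = new EventBus(
                new LogHandler[] {new LogHandler()},
                new MetricsHandler[] {new MetricsHandler()});
        for (int i = 0; i < 1000 * 1000; i++) {
            bus.publish(i);
        }
    }
}

The obvious price is duplication and the loss of a common abstraction, which is exactly the dodgy-design trade-off mentioned above.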