Performance Improvements via JIT Optimization
This article discusses how we used JIT optimization techniques to improve the performance of the BVM (Ballerina Virtual Machine).
Java is a platform-independent programming language. The Java compiler translates Java source code into an intermediate representation of the original program (an idealized assembly) called bytecode. When a Java program runs, the Java Virtual Machine converts the bytecode into machine-specific code. In this sense, Java is an interpreted language, and as we know, interpreted languages generally perform much slower than compiled languages.
The JIT compiler is the reason why Java has been able to perform much faster than other interpreted languages. The JIT compiler monitors the Java program and compiles the bytecode of hot methods (i.e., methods that get called frequently) into machine-specific code (i.e., platform binary or native code). This compilation of bytecode into native code occurs as the program executes — in other words, "just in time," hence the term JIT. The JIT compiler does not compile every single method into machine-specific code: the program starts running in the interpreter, and once a method becomes hot enough, it qualifies for compilation into native code.
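A minimal sketch of this behaviour (the class and method names are illustrative, not from the BVM source): a small method called in a tight loop crosses HotSpot's invocation threshold and becomes a JIT compilation candidate. Running with -XX:+PrintCompilation shows the compilation log entries as this happens.

```java
// Illustrative only: sum() starts out interpreted, then gets compiled to
// native code once HotSpot's invocation counters mark it as hot.
// Run with: java -XX:+PrintCompilation HotMethodDemo
public class HotMethodDemo {
    // Small, frequently called method: a prime candidate for JIT compilation.
    static long sum(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        long result = 0;
        // Call the method many times so it crosses the hotness threshold.
        for (int i = 0; i < 100_000; i++) {
            result = sum(1_000);
        }
        System.out.println(result);
    }
}
```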
Ballerina is a concurrent, strongly typed programming language optimized for integration. It provides a high-level abstraction while hiding the complexity of the code underneath. The Ballerina compiler translates Ballerina source code into a platform-independent intermediate representation called Ballerina bytecode, and the Ballerina Virtual Machine (BVM) executes Ballerina programs. The first version of the BVM is implemented in Java.
While developing Ballerina, we performed various optimizations to improve its performance (i.e., throughput, latency, start-up time, and memory footprint). In this blog, we discuss some Java JIT optimizations we applied to improve the performance of the BVM. These optimizations were done on an older version of Ballerina (0.93). The current version of Ballerina incorporates these optimizations along with numerous others made during various stages of the language's development (we'll discuss those in separate blogs).
The scenario which led us to perform the JIT optimization is described below:
Scenario: one million int arrays are created and populated using while loops. The initial observation was that the execution time was higher than when the same logic was implemented in other programming languages (such as Python and Jython). To find the possible reasons and to further optimize performance, we performed a JIT compilation analysis and optimization.
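A rough Java analogue of this workload (an assumption on our part — the original benchmark was written in Ballerina, and the array size here is arbitrary) looks like this:

```java
// Hypothetical analogue of the benchmark scenario: one million int
// arrays, each created and populated with a while loop. A checksum is
// accumulated so the JIT cannot eliminate the work as dead code.
public class ArrayBenchmark {
    static long run(int iterations, int arraySize) {
        long checksum = 0;
        int i = 0;
        while (i < iterations) {
            int[] arr = new int[arraySize];
            int j = 0;
            while (j < arr.length) {
                arr[j] = j;
                j++;
            }
            checksum += arr[arraySize - 1];
            i++;
        }
        return checksum;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long checksum = run(1_000_000, 10);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("checksum=" + checksum + " elapsedMs=" + elapsedMs);
    }
}
```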
To understand the behaviour, we obtained the JIT compilation log by adding the -XX:+LogCompilation flag (which requires -XX:+UnlockDiagnosticVMOptions) before running Ballerina:
export JAVA_OPTS="-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation"
The JIT compilation log was analyzed using JITWatch. We immediately noticed that org.ballerinalang.bre.bvm.BLangVM.exec(), which should become hot in this use case, was not getting JIT compiled (compiled into native code) due to its large method size. This method consists of a switch statement with many possible execution paths. Although small methods are generally the preferred implementation style (for various reasons), this method was deliberately implemented this way.
The following figure shows the JITWatch output.
From the above figure, we note that the bytecode size of the org.ballerinalang.bre.bvm.BLangVM.exec() method is 8,562 bytes. Because of this large bytecode size, the JIT compiler did not compile the method into native code. In fact, there is a JVM flag, DontCompileHugeMethods, which is set to true by default; when it is true, the JIT compiler does not compile methods whose bytecode size exceeds HugeMethodLimit (8,000 bytes by default).
Rather than modifying DontCompileHugeMethods and other related JIT compilation flags to force the method to compile, we broke the method into smaller methods (based on certain criteria). The initial version and the first modified version are illustrated below:
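The refactoring can be sketched as follows (the opcode constants, method names, and grouping are illustrative assumptions, not the actual BVM source): the single huge switch in exec() is split so that groups of related opcodes are handled by smaller helper methods, each small enough to fit under HugeMethodLimit.

```java
// Hypothetical sketch: exec() keeps only the dispatch, and per-group
// helper methods handle the opcode bodies that previously lived in one
// giant switch statement.
public class InterpreterLoop {
    static final int ICONST = 1, ILOAD = 2, ALOAD = 3, ISTORE = 4;

    static int exec(int[] code) {
        int acc = 0;
        for (int ip = 0; ip < code.length; ip++) {
            int opcode = code[ip];
            switch (opcode) {
                case ICONST:
                case ILOAD:
                case ALOAD:
                    acc = execLoadOpcodes(opcode, acc);   // load-style opcodes
                    break;
                case ISTORE:
                    acc = execStoreOpcodes(opcode, acc);  // store-style opcodes
                    break;
                default:
                    throw new IllegalStateException("unknown opcode " + opcode);
            }
        }
        return acc;
    }

    static int execLoadOpcodes(int opcode, int acc) {
        // ...all load opcode handlers would live here; simplified body...
        return acc + 1;
    }

    static int execStoreOpcodes(int opcode, int acc) {
        // ...all store opcode handlers would live here; simplified body...
        return acc + 2;
    }
}
```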
The above modification resulted in a 16x improvement in execution time due to the JIT compilation of org.ballerinalang.bre.bvm.BLangVM.exec() and the other associated methods.
Let’s take a look at the new JIT compilation log using JITWatch.
We note that the org.ballerinalang.bre.bvm.BLangVM.exec() method is now compiled and that its bytecode size has been reduced to 3,285 bytes.
Method inlining is one of the most important performance optimizations performed by the HotSpot VM. Unfortunately, the JVM does not provide clear visibility into how it inlines code. The basic criteria for a method to qualify for inlining are to satisfy one of the following conditions:
- The bytecode size of the method is below MaxInlineSize (default value: 35 bytes).
- Or the bytecode size is below FreqInlineSize (default value: 325 bytes) and the method is invoked frequently (i.e., it is hot).
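These criteria can be observed on a small example (the flags mentioned in the comment are real HotSpot flags; the methods themselves are illustrative):

```java
// Illustrative only: tiny() compiles to just a few bytecodes, well under
// MaxInlineSize (35 bytes), so it can be inlined at any call site; a hot
// method under FreqInlineSize (325 bytes) also qualifies. Run with
// -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining to see the decisions.
public class InlineDemo {
    static int tiny(int x) {           // a handful of bytecodes: inlinable
        return x * 2 + 1;
    }

    static long caller(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += tiny(i);          // hot call site: tiny() gets inlined
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(caller(1_000_000));
    }
}
```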
Let’s now take another look at the new method compilation chain:
The methods in red have been compiled but not inlined, and the methods in green have been inlined. Looking at the compilation log, we note that the bytecode sizes of the methods that do not get inlined are greater than 325 bytes. This is illustrated below:
This means that we may be able to get further performance improvements by breaking these methods into even smaller methods.
Let's now break the execLoadOpcodes(…) method into execLoadOpcodes1(…) and execLoadOpcodes2(…) and see what happens.
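A hypothetical sketch of this second split (again, the opcode constants and handler bodies are illustrative): each half handles a subset of the load opcodes, and each is now small enough to fall under FreqInlineSize.

```java
// Hypothetical continuation of the refactoring: the original
// execLoadOpcodes(...) exceeded FreqInlineSize (325 bytes), so its
// switch is split across two smaller, inlinable methods.
public class SplitLoadOpcodes {
    static final int ILOAD = 1, FLOAD = 2, SLOAD = 3, BLOAD = 4;

    static int execLoadOpcodes(int opcode, int acc) {
        if (opcode <= FLOAD) {
            return execLoadOpcodes1(opcode, acc);  // first half of the switch
        }
        return execLoadOpcodes2(opcode, acc);      // second half of the switch
    }

    static int execLoadOpcodes1(int opcode, int acc) {
        switch (opcode) {
            case ILOAD: return acc + 1;   // ...real handlers elided...
            case FLOAD: return acc + 2;
            default: throw new IllegalStateException("unknown opcode " + opcode);
        }
    }

    static int execLoadOpcodes2(int opcode, int acc) {
        switch (opcode) {
            case SLOAD: return acc + 3;   // ...real handlers elided...
            case BLOAD: return acc + 4;
            default: throw new IllegalStateException("unknown opcode " + opcode);
        }
    }
}
```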
The new compilation chain is illustrated below:
We note that several methods (e.g., execLoadOpcodes1(…)) have now been inlined, resulting in up to a 1.5x further improvement in performance.
According to the compilation chain shown above, the store and load methods are now inlined (shown in green). However, the execAStoreOpcodes1() method is not inlined.
The bytecode size of execAStoreOpcodes1() is 316 bytes, which is less than FreqInlineSize (325 bytes), so it qualifies for inlining according to the criteria mentioned earlier in this article. However, it does not get inlined, for a different reason: JITWatch reports it as "already compiled into a big method," meaning the native code size of the method exceeds the InlineSmallCode threshold, which defaults to 2,000 bytes. From the above figure, we see that its native size is 2,152 bytes (> 2,000 bytes). The compilation chain for execAStoreOpcodes1() is shown below:
Further breaking the methods into smaller methods did not result in significant performance improvements.
A Different Implementation
Let's now consider a different implementation option, based on an interface called Processor; each instruction's concrete implementation (class) implements this interface.
Processor objects can then be stored in a HashMap:
HashMap<Integer, Processor> processorHashmap = new HashMap<>();
The switch statement can then be replaced with the following:
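A minimal sketch of this map-based dispatch (the opcodes are illustrative, and lambdas stand in for the concrete classes — the real design would register one class instance per instruction):

```java
import java.util.HashMap;

// Sketch of the alternative design: each instruction implements the
// Processor interface, instances are registered in a HashMap keyed by
// opcode, and the interpreter loop replaces the switch statement with a
// map lookup followed by an interface (virtual) call.
public class ProcessorDispatch {
    interface Processor {
        int process(int acc);
    }

    static final int ICONST = 1, IADD = 2;

    static final HashMap<Integer, Processor> processorHashmap = new HashMap<>();
    static {
        // Lambdas used here for brevity; the article's design uses ~200
        // concrete classes, one per instruction.
        processorHashmap.put(ICONST, acc -> acc + 1);
        processorHashmap.put(IADD, acc -> acc * 2);
    }

    static int exec(int[] code) {
        int acc = 0;
        for (int opcode : code) {
            // Map lookup plus virtual call replaces the switch statement.
            acc = processorHashmap.get(opcode).process(acc);
        }
        return acc;
    }
}
```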
The compilation chain for the new implementation is shown below:
Note that there are virtual calls instead of inlined code segments (purple represents virtual calls). When dispatching a method call, the JVM checks the number of different implementations seen at the call site; there are three possibilities: (1) monomorphic — one implementation, (2) bimorphic — two implementations, and (3) megamorphic — more than two implementations. Monomorphic and bimorphic calls can be inlined, whereas megamorphic ones cannot. In this case, there are 200 classes implementing the Processor interface, making the call site megamorphic, so it does not get inlined. In addition, the virtual method calls themselves hurt performance; as a result, this implementation performs poorly compared to the previous solution, where the methods were inlined.
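The three cases can be illustrated with a simplified example (the classes and numbers are hypothetical; HotSpot's actual profiling is more involved):

```java
// Illustration of call-site morphism: if a call site only ever observes
// one or two receiver classes, HotSpot can inline the target; once three
// or more implementations are seen at the same site, it becomes
// megamorphic and must be dispatched virtually.
public class MorphismDemo {
    interface Op { int apply(int x); }
    static class Inc implements Op { public int apply(int x) { return x + 1; } }
    static class Dbl implements Op { public int apply(int x) { return x * 2; } }
    static class Neg implements Op { public int apply(int x) { return -x; } }

    static int run(Op[] ops, int x) {
        for (Op op : ops) {
            // With only Inc here, this site is monomorphic; Inc + Dbl
            // makes it bimorphic (still inlinable); adding Neg makes it
            // megamorphic, forcing a virtual dispatch.
            x = op.apply(x);
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(run(new Op[]{new Inc(), new Dbl(), new Neg()}, 3));
    }
}
```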
In this article, we discussed how we used JIT optimization techniques to optimize the BVM (Ballerina Virtual Machine). These optimizations were done in multiple stages, and in each phase we analyzed the JIT compilation log (using JITWatch) to see whether further optimizations were possible. These optimizations resulted in significant performance improvements.