Byte Code Engineering
What's really happening inside those .classes? For most developers, this is still a mystery. But it doesn't have to be.
This blog entry is the first of a multi-part series of articles discussing the merits of byte code engineering and its application. Byte code engineering encompasses the creation of new byte code in the form of classes and the modification of existing byte code. It has many applications: it is used in tools for compilers, class reloading, memory leak detection, and performance monitoring, and most application servers use byte code libraries to generate classes at run-time. Byte code engineering is used more often than you think. As a matter of fact, you can find popular byte code engineering libraries, including BCEL and ASM, bundled in the JRE. Despite its widespread usage, there appear to be very few university or college courses that teach byte code engineering. It is an aspect of programming that developers must learn on their own, and for those who don't, it remains a mysterious black art. The truth is, byte code engineering libraries make learning this field easy and are a gateway to a deeper understanding of JVM internals. The intent of these articles is to provide a starting point and then document some advanced concepts, which will hopefully inspire readers to develop their own skills.
Documentation
There are a few resources that anyone learning byte code engineering should have handy at all times. The first is the Java Virtual Machine Specification (FYI this page has links to both the language and JVM specifications). Chapter 4, The Class File Format, is indispensable. A second resource, which is useful for quick reference, is the Wikipedia page entitled Java bytecode instruction listings. In terms of byte code instructions, it is more concise and informative than the JVM specification itself. Another resource for the beginner to have handy is a table of the internal descriptor format for field types. This table is taken directly from the JVM specification.
BaseType Character | Type | Interpretation |
---|---|---|
B | byte | signed byte |
C | char | Unicode character code point in the Basic Multilingual Plane, encoded with UTF-16 |
D | double | double-precision floating-point value |
F | float | single-precision floating-point value |
I | int | integer |
J | long | long integer |
L<ClassName>; | reference | an instance of class <ClassName> |
S | short | signed short |
Z | boolean | true or false |
[ | reference | one array dimension |
Most primitive field types simply use the field type's first initial to represent the type internally (e.g. I for int, F for float), but a long is J and a boolean is Z. Object types are less intuitive. An object type begins with the letter L and ends with a semicolon. Between these characters is the fully qualified class name, with each part of the name separated by a forward slash. For instance, the internal descriptor for the field type java.lang.Integer is Ljava/lang/Integer;. Lastly, array dimensions are indicated by the '[' character. For each dimension, insert a '[' character. For instance, a two-dimensional int array would be [[I, whereas a two-dimensional java.lang.Integer array would be [[Ljava/lang/Integer;
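The field descriptor rules can be sketched in a few lines of plain Java. The class and method names below (FieldDescriptors, of) are made up for illustration and are not part of the JDK or any byte code library; the helper simply applies the table: one letter per primitive, L...; for object types, and one '[' per array dimension.

```java
// Illustrative helper (hypothetical names): builds the internal
// descriptor of a field type from its java.lang.Class object.
class FieldDescriptors {

    static String of(Class<?> type) {
        if (type == void.class) {
            // void is only valid as a method return type, never as a field type.
            throw new IllegalArgumentException("void is not a field type");
        }
        if (type.isArray()) {
            // One '[' per dimension, followed by the component type.
            return "[" + of(type.getComponentType());
        }
        if (type == byte.class)    return "B";
        if (type == char.class)    return "C";
        if (type == double.class)  return "D";
        if (type == float.class)   return "F";
        if (type == int.class)     return "I";
        if (type == long.class)    return "J";
        if (type == short.class)   return "S";
        if (type == boolean.class) return "Z";
        // Object types: 'L', the class name with '/' separators, then ';'.
        return "L" + type.getName().replace('.', '/') + ";";
    }

    public static void main(String[] args) {
        System.out.println(of(Integer.class));     // Ljava/lang/Integer;
        System.out.println(of(int[][].class));     // [[I
        System.out.println(of(Integer[][].class)); // [[Ljava/lang/Integer;
    }
}
```

Note that byte maps to B and boolean maps to Z, which is the pair most easily mixed up.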
Methods also have an internal descriptor format. The format is (<parameter types>)<return type>. All types use the field type descriptor format above. A void return type is represented by the letter V. There is no separator between parameter types. Here are some examples:
- A program entry point method of public static final void main(String args[]) would be ([Ljava/lang/String;)V
- A constructor of the form public Info(int index, java.lang.Object types[], byte bytes[]) would be (I[Ljava/lang/Object;[B)V
- A method with signature int getCount() would be ()I
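If you want to check a method descriptor without writing it by hand, the JDK's java.lang.invoke.MethodType can print one. The following is a minimal sketch; the wrapper class MethodDescriptors and its method of are names invented here for illustration, while MethodType.methodType and toMethodDescriptorString are standard JDK calls (Java 7 and later).

```java
import java.lang.invoke.MethodType;

// Illustrative wrapper (hypothetical names) around the JDK's MethodType.
class MethodDescriptors {

    // Returns the internal descriptor for a method with the given
    // return type and parameter types.
    static String of(Class<?> returnType, Class<?>... parameterTypes) {
        return MethodType.methodType(returnType, parameterTypes)
                         .toMethodDescriptorString();
    }

    public static void main(String[] args) {
        // void main(String[] args)
        System.out.println(of(void.class, String[].class));
        // prints ([Ljava/lang/String;)V

        // Constructor Info(int, Object[], byte[]): constructors return void
        System.out.println(of(void.class, int.class, Object[].class, byte[].class));
        // prints (I[Ljava/lang/Object;[B)V

        // int getCount()
        System.out.println(of(int.class));
        // prints ()I
    }
}
```

Modifiers such as public, static, and final do not appear in the descriptor; they are stored separately as access flags in the class file.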
Speaking of constructors, I should also mention that all constructors have an internal method name of <init>. Also, all static initializers in source code are placed into a single static initializer method with internal method name <clinit>.
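These internal names are visible from ordinary Java code, because stack traces report them. Here is a small sketch that captures the name of the calling method from inside a static initializer and from inside a constructor. The class name InitNames and the helper callerName are invented for this example.

```java
// Illustrative class (hypothetical names) showing the internal method
// names <clinit> and <init> as reported by stack traces.
class InitNames {

    // This initializer is compiled into the static initializer method,
    // so the calling frame is <clinit>.
    static final String STATIC_INIT_NAME = callerName();

    final String constructorName;

    InitNames() {
        // Called from the constructor, so the calling frame is <init>.
        constructorName = callerName();
    }

    // Returns the internal name of the method that called this one.
    static String callerName() {
        // Element 0 is callerName itself; element 1 is its caller.
        return new Throwable().getStackTrace()[1].getMethodName();
    }

    public static void main(String[] args) {
        System.out.println(STATIC_INIT_NAME);                // <clinit>
        System.out.println(new InitNames().constructorName); // <init>
    }
}
```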
Software
Before we discuss byte code engineering libraries, there is an essential learning tool bundled in the JDK bin directory called javap. Javap is a program which will disassemble byte code and provide a textual representation. Let's examine what it can do with the compiled version of the following code:
package ca.discotek.helloworld;

public class HelloWorld {

    static String message =
        "Hello World!";

    public static void main(String[] args) {
        try {
            System.out.println(message);
        }
        catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Here is the output from the javap -help command:
Usage: javap ...
where options include:
-c Disassemble the code
-classpath <pathlist> Specify where to find user class files
-extdirs <dirs> Override location of installed extensions
-help Print this usage message
-J<flag> Pass directly to the runtime system
-l Print line number and local variable tables
-public Show only public classes and members
-protected Show protected/public classes and members
-package Show package/protected/public classes
and members (default)
-private Show all classes and members
-s Print internal type signatures
-bootclasspath <pathlist> Override location of class files loaded
by the bootstrap class loader
-verbose Print stack size, number of locals and args for methods
If verifying, print reasons for failure
Here is the output when we use javap to disassemble the HelloWorld program:
javap.exe -classpath "C:\projects\sandbox2\bin" -c -private -s -verbose ca.discotek.helloworld.HelloWorld
Compiled from "HelloWorld.java"
public class ca.discotek.helloworld.HelloWorld extends java.lang.Object
SourceFile: "HelloWorld.java"
minor version: 0
major version: 50
Constant pool:
const #1 = class #2; // ca/discotek/helloworld/HelloWorld
const #2 = Asciz ca/discotek/helloworld/HelloWorld;
const #3 = class #4; // java/lang/Object
const #4 = Asciz java/lang/Object;
const #5 = Asciz message;
const #6 = Asciz Ljava/lang/String;;
const #7 = Asciz <clinit>;
const #8 = Asciz ()V;
const #9 = Asciz Code;
const #10 = String #11; // Hello World!
const #11 = Asciz Hello World!;
const #12 = Field #1.#13; // ca/discotek/helloworld/HelloWorld.message:Ljava/lang/String;
const #13 = NameAndType #5:#6;// message:Ljava/lang/String;
const #14 = Asciz LineNumberTable;
const #15 = Asciz LocalVariableTable;
const #16 = Asciz <init>;
const #17 = Method #3.#18; // java/lang/Object."<init>":()V
const #18 = NameAndType #16:#8;// "<init>":()V
const #19 = Asciz this;
const #20 = Asciz Lca/discotek/helloworld/HelloWorld;;
const #21 = Asciz main;
const #22 = Asciz ([Ljava/lang/String;)V;
const #23 = Field #24.#26; // java/lang/System.out:Ljava/io/PrintStream;
const #24 = class #25; // java/lang/System
const #25 = Asciz java/lang/System;
const #26 = NameAndType #27:#28;// out:Ljava/io/PrintStream;
const #27 = Asciz out;
const #28 = Asciz Ljava/io/PrintStream;;
const #29 = Method #30.#32; // java/io/PrintStream.println:(Ljava/lang/String;)V
const #30 = class #31; // java/io/PrintStream
const #31 = Asciz java/io/PrintStream;
const #32 = NameAndType #33:#34;// println:(Ljava/lang/String;)V
const #33 = Asciz println;
const #34 = Asciz (Ljava/lang/String;)V;
const #35 = Method #36.#38; // java/lang/Exception.printStackTrace:()V
const #36 = class #37; // java/lang/Exception
const #37 = Asciz java/lang/Exception;
const #38 = NameAndType #39:#8;// printStackTrace:()V
const #39 = Asciz printStackTrace;
const #40 = Asciz args;
const #41 = Asciz [Ljava/lang/String;;
const #42 = Asciz e;
const #43 = Asciz Ljava/lang/Exception;;
const #44 = Asciz StackMapTable;
const #45 = Asciz SourceFile;
const #46 = Asciz HelloWorld.java;
{
static java.lang.String message;
Signature: Ljava/lang/String;
static {};
Signature: ()V
Code:
Stack=1, Locals=0, Args_size=0
0: ldc #10; //String Hello World!
2: putstatic #12; //Field message:Ljava/lang/String;
5: return
LineNumberTable:
line 6: 0
line 5: 2
line 6: 5
public ca.discotek.helloworld.HelloWorld();
Signature: ()V
Code:
Stack=1, Locals=1, Args_size=1
0: aload_0
1: invokespecial #17; //Method java/lang/Object."<init>":()V
4: return
LineNumberTable:
line 3: 0
LocalVariableTable:
Start Length Slot Name Signature
0 5 0 this Lca/discotek/helloworld/HelloWorld;
public static void main(java.lang.String[]);
Signature: ([Ljava/lang/String;)V
Code:
Stack=2, Locals=2, Args_size=1
0: getstatic #23; //Field java/lang/System.out:Ljava/io/PrintStream;
3: getstatic #12; //Field message:Ljava/lang/String;
6: invokevirtual #29; //Method java/io/PrintStream.println:(Ljava/lang/String;)V
9: goto 17
12: astore_1
13: aload_1
14: invokevirtual #35; //Method java/lang/Exception.printStackTrace:()V
17: return
Exception table:
from to target type
0 9 12 Class java/lang/Exception
LineNumberTable:
line 10: 0
line 11: 9
line 12: 12
line 13: 13
line 15: 17
LocalVariableTable:
Start Length Slot Name Signature
0 18 0 args [Ljava/lang/String;
13 4 1 e Ljava/lang/Exception;
StackMapTable: number_of_entries = 2
frame_type = 76 /* same_locals_1_stack_item */
stack = [ class java/lang/Exception ]
frame_type = 4 /* same */
}
You should note that the -l flag to output line number information was purposely omitted. The -verbose flag outputs other relevant information including line numbers. If both are used the line number information will be printed twice.
Here is an overview of the output:
Line Numbers | Description |
---|---|
2 | Command line to invoke javap. See javap -help output above for explanation of parameters. |
3 | Source code file provided by debug information included in byte code. |
4 | Class signature |
5 | Source code file provided by debug information included in byte code. |
6-7 | Major and Minor versions. 50.0 indicates the class was compiled with Java 6. |
8-54 | The class constant pool. |
57-58 | Declaration of the message field. |
60 | Declaration of the static initializer method. |
61 | Internal method descriptor for method. |
63 | Stack=1 indicates 1 slot is required on the operand stack. Locals=0 indicates no local variables are required. Args_size=0 is the number of arguments to the method. |
64-66 | The byte code instructions to assign the String value Hello World! to the message field. |
67-77 | If compiled with debug information, each method will have a LineNumberTable. The format of each entry is <line number of source code>: <starting instruction offset in byte code>. You'll notice that the LineNumberTable has duplicate entries that are seemingly out of order (i.e. 6, 5, 6). It may not seem intuitive, but the compiler assembles the byte code instructions to target the stack-based JVM, which means it will often have to re-arrange instructions. |
72 | Default constructor signature |
73 | Default constructor internal method descriptor |
75 | Stack=1 indicates 1 slot is required on the operand stack. Locals=1 indicates there is one local variable. Method parameters are treated as local variables, and so is the implicit this reference of a non-static method. In this case, the single local is this. Args_size=1 is the number of arguments to the method, which also counts this. |
76-78 | Default constructor code. Simply invokes the default constructor of the super class, java.lang.Object. |
79-80 | Although the default constructor is not explicitly defined, the LineNumberTable indicates that the default constructor is associated with line 3, where the class signature resides. |
82-84 | You might be surprised to see an entry in a LocalVariableTable because the default constructor defines no local variables and has no parameters. However, all non-static methods will define the "this" local variable, which is what is seen here. The start and length values indicate the scope of the local variable within the method. The start value indicates the index in the method's byte code array where the scope begins and the length value indicates how far the scope extends (i.e. start + length = end). In the constructor, "this" starts at index 0. This corresponds to the aload_0 instruction at line 78. The length is 5, which covers the entire method as the last instruction is at index 4. The slot value is the index of the variable in the method's local variable array. The name attribute is the variable name as defined in the source code. The Signature attribute represents the type of variable. You should note that local variable table information is added for debugging purposes. Assigning identifiers to chunks of memory is entirely to help humans understand programs better. This information can be excluded from byte code. |
86 | Main method declaration |
87 | Main method internal descriptor. |
89 | Stack=2 indicates 2 slots are required on the operand stack. Locals=2 indicates two local variables are required (The args and exception e from the catch block). Args_size=1 is the number of arguments to the method (args). |
90-97 | Byte code associated with printing the message and catching any exceptions. |
98-100 | Byte code does not have try/catch constructs, but it does have exception handling, which is implemented in the Exception table. Each row in the table is an exception handling instruction. The from and to values indicate the range of instructions to which the exception handling applies. If the given type of exception occurs between the from instruction (inclusive) and the to instruction (exclusive), execution will skip to the target instruction index. The value 12 represents the start of the catch block. You'll also notice the goto instruction after the invokevirtual instruction, which causes execution to skip to the end of the method if no exception occurs. |
102-107 | Main method's line number table which matches source code with byte code instructions. |
109-112 | Main method's LocalVariableTable, which defines the scope of the args parameter and the e exception variable. |
114-117 | The JVM uses StackMapTable entries to verify type safety for each code block defined within a method. This information can be ignored for now. It is most likely that your compiler or byte code engineering library will generate this byte code for you. |
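The minor and major version numbers that javap reports come from the first eight bytes of the class file: a four-byte magic number (0xCAFEBABE) followed by the two-byte minor and two-byte major versions. As a rough sketch, they can be read with nothing but the JDK. The class name ClassFileVersion and its methods are invented for this example, and the release mapping (major version minus 44) is only meant for major versions 49 and later.

```java
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative reader (hypothetical names) for the class file header.
class ClassFileVersion {

    // Returns "<major>.<minor>" for the given class file bytes.
    static String describe(byte[] classBytes) {
        // ByteBuffer is big-endian by default, matching the class file format.
        ByteBuffer in = ByteBuffer.wrap(classBytes);
        if (classBytes.length < 8 || in.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        int minor = in.getShort() & 0xFFFF; // unsigned 16-bit value
        int major = in.getShort() & 0xFFFF; // unsigned 16-bit value
        return major + "." + minor;
    }

    // Maps a major version to a Java release, e.g. 50 -> 6, 52 -> 8.
    static int javaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ClassFileVersion path/to/HelloWorld.class
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("class file version " + describe(bytes));
    }
}
```

For the HelloWorld class above, describe would return 50.0, which matches the javap output and corresponds to Java 6.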
Byte Code Engineering Libraries
The most popular byte code engineering libraries are BCEL, SERP, Javassist, and ASM. All of these libraries have their own merits, but overall, ASM is far superior for its speed and versatility. There are plenty of articles and blog entries discussing these libraries in addition to the documentation on their web sites. Instead of duplicating these efforts, the following will provide links and hopefully other useful information.
BCEL
The most obvious detractor for BCEL (Byte Code Engineering Library) has been its inconsistent support. If you look at the BCEL News and Status page, there have been releases in 2001, 2003, 2006, and 2011. Four releases spread over 10 years is not confidence inspiring. However, it should be noted that there appears to be a version 6 release candidate, which can be downloaded from GitHub, but not Apache. Additionally, the enhancements and bug fixes discussed in the download's RELEASE-NOTES.txt file are substantial, including support for the language features of Java 6, 7, and 8.
BCEL is a natural starting place for the uninitiated byte code developer because it has the prestige of the Apache Software Foundation. Often, it may serve the developer's purpose. One of BCEL's benefits is that it has an API for both the SAX and DOM approaches to parsing byte code. However, when byte code manipulation is more complex, BCEL will likely end in frustration due to its limited API documentation and community support. It should be noted that BCEL is bundled with a BCELifier utility which parses byte code and outputs the BCEL API Java code needed to produce the parsed byte code. If you choose BCEL as your byte code engineering library, this utility will be invaluable (but note that ASM has an equivalent ASMifier).
SERP
SERP is a lesser known library. My experience with it is limited, but I did find it useful for building a Javadoc-style tool for byte code. SERP was the only API that could give me program counter information so I could hyperlink branching instructions to their targets. Although the SERP release documentation indicates there is support for Java 8's invokedynamic instruction, it is not clear to me that it receives continuous support from the author and there is very little community support. The author also discusses its limitations which include issues with speed, memory consumption, and thread safety.
Javassist
Javassist is the only library that provides some functionality not supported by ASM... and it's pretty awesome. Javassist allows you to insert Java source code into existing byte code. You can insert Java code before a method body or append it after the method body. You can also wrap a method body in a try-block and add your own catch-block (of Java code). You can also substitute an entire method body or other smaller constructs with your own Java source code. Lastly, you can add methods to a class which contain your own Java source code. This feature is extremely powerful as it allows a Java developer to manipulate byte code without requiring an in-depth understanding of the underlying byte code. However, this feature does have its limitations. For instance, if you introduce variables in an insertBefore() block of code, they cannot be referenced later in an insertAfter() block of code. Additionally, ASM is generally faster than Javassist, but the benefits of Javassist's simplicity may outweigh the gains from ASM's performance. Javassist is continually supported by the authors at JBoss and receives much community support.
ASM
ASM has it all. It is well supported, it is fast, and it can do just about anything. ASM has both SAX and DOM style APIs for parsing byte code. ASM also has an ASMifier which can parse byte code and generate the corresponding Java source code, which when run will produce the parsed byte code. This is an invaluable tool. It is expected that the developer has some knowledge of byte code, but ASM can update frame information for you if you add local variables etc. It also has many utility classes for common tasks in its commons package. Further, common byte code transformations are documented in exceptional detail. You can also get help from the ASM mailing list. Lastly, forums like StackOverflow provide additional support. Almost certainly any problem you have has already been discussed in the ASM documentation or in a StackOverflow thread.
Useful Links
- Understanding Byte Code
- BCEL
- SERP
- Javassist
- ASM
Summary
Admittedly, this blog entry has not been particularly instructional. The intention is to give the beginner a place to start. In my experience, the best way to learn is to have a project in mind to which you'll apply what you are learning. Documenting a few basic byte code engineering tasks will only duplicate others' efforts. I developed my byte code skills from an interest in reverse engineering. I would prefer not to document those skills as it would be counter-productive to my other efforts (I built a commercial byte code obfuscator called Modifly, which can perform obfuscation transformations at run-time). However, I am willing to share what I have learned by demonstrating how to apply byte code engineering to class reloading and memory leak detection (and perhaps other areas if there is interest).
Next Blog in the Series Teaser
Even if you don't use JRebel, you probably haven't escaped their ads. JRebel's home page claims "Reload Code Changes Instantly. Skip the build and redeploy process. JRebel reloads changes to Java classes, resources, and over 90 frameworks.". Have you ever wondered how they do it? I'll show you exactly how they do it with working code in my next blog in this series.
If you enjoyed this blog, you may wish to follow discotek.ca on twitter.
Published at DZone with permission of Rob Kenworthy, DZone MVB. See the original article here.