UPDATE: This blog post, along with many improvements, is part of my article for RebelLabs' report on Java Bytecode, which can be found here: http://zeroturnaround.com/rebellabs/rebel-labs-report-mastering-java-bytecode-at-the-core-of-the-jvm/
Developers who use Java for application development usually do not need to be aware of the bytecode that is executed in the VM. However, those who implement state-of-the-art frameworks, compilers, or Java tooling may need to understand bytecode and may even work with it directly. While special libraries (like ASM, cglib, Javassist) help with bytecode manipulation, it is still important to understand the fundamentals in order to make effective use of those tools.
Let's start off with a simple example - a POJO, with one field, a getter and a setter.
public class Foo {
    private String bar;

    public String getBar(){
        return bar;
    }
    public void setBar(String bar) {
        this.bar = bar;
    }
}
First of all, once you compile the class using javac Foo.java, you'll have Foo.class, which contains the bytecode. Opened in a hex editor, the file is just a sequence of bytes. Each pair of hex digits (one byte) of a method body translates to an opcode (mnemonic) or to part of an operand, but it would be too brutal to start reading the class in binary form. Let's proceed to the mnemonic representation.
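To give a feel for what the first bytes of a class file hold, here is a minimal sketch that decodes the fixed-size header: a 4-byte magic number followed by the 2-byte minor and major versions, all big-endian. The class name ClassFileHeader and the hand-written byte array are assumptions made for illustration; the bytes are chosen to match the "minor version: 0, major version: 50" that javap reports for Foo, not copied from a real hex dump.

```java
import java.nio.ByteBuffer;

class ClassFileHeader {
    final int magic;
    final int minorVersion;
    final int majorVersion;

    ClassFileHeader(int magic, int minorVersion, int majorVersion) {
        this.magic = magic;
        this.minorVersion = minorVersion;
        this.majorVersion = majorVersion;
    }

    /** Decodes the first 8 bytes of a class file: u4 magic, u2 minor, u2 major. */
    static ClassFileHeader parse(byte[] classBytes) {
        ByteBuffer in = ByteBuffer.wrap(classBytes); // ByteBuffer is big-endian by default
        int magic = in.getInt();
        int minor = in.getShort() & 0xFFFF; // mask to read the short as unsigned
        int major = in.getShort() & 0xFFFF;
        return new ClassFileHeader(magic, minor, major);
    }

    public static void main(String[] args) {
        // Hand-written stand-in for the first 8 bytes of Foo.class (illustrative only).
        byte[] header = {
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, // magic number
            0x00, 0x00,                                         // minor version 0
            0x00, 0x32                                          // major version 50
        };
        ClassFileHeader h = parse(header);
        System.out.printf("magic=%08X minor=%d major=%d%n",
                h.magic, h.minorVersion, h.majorVersion);
        // prints: magic=CAFEBABE minor=0 major=50
    }
}
```

To try it on a real class, the array could be replaced with the bytes read from Foo.class on disk.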
Executing javap -c Foo will print out the bytecode:

public class Foo extends java.lang.Object {
public Foo();
  Code:
   0: aload_0
   1: invokespecial #1; //Method java/lang/Object."<init>":()V
   4: return

public java.lang.String getBar();
  Code:
   0: aload_0
   1: getfield #2; //Field bar:Ljava/lang/String;
   4: areturn

public void setBar(java.lang.String);
  Code:
   0: aload_0
   1: aload_1
   2: putfield #2; //Field bar:Ljava/lang/String;
   5: return
}

The class is very simple, and it is easy to see the relation between the source code and the generated bytecode. First of all, we notice that in the bytecode version of the class the compiler has generated the default constructor (as promised by the Java Language Specification).
Secondly, if we study the Java bytecode instructions (in our example aload_0 and aload_1), we can see that some of the instructions have prefixes, as in aload_0 or istore_2. The prefix is related to the type of data that the instruction operates on: 'a' means that the opcode manipulates an object reference, and 'i' means that it manipulates an integer. One interesting thing we can spot here is that some of the instructions take a weird operand like #1 or #2, which actually refers to the constant pool of the class. This is a good point to get more information from the class file. Execute the following command: javap -c -s -verbose Foo
(-s to print the signatures, -verbose to print all the details)

Compiled from "Foo.java"
public class Foo extends java.lang.Object
  SourceFile: "Foo.java"
  minor version: 0
  major version: 50
  Constant pool:
const #1 = Method #4.#17; // java/lang/Object."<init>":()V
const #2 = Field #3.#18; // Foo.bar:Ljava/lang/String;
const #3 = class #19; // Foo
const #4 = class #20; // java/lang/Object
const #5 = Asciz bar;
const #6 = Asciz Ljava/lang/String;;
const #7 = Asciz <init>
const #8 = Asciz ()V;
const #9 = Asciz Code;
const #10 = Asciz LineNumberTable;
const #11 = Asciz getBar;
const #12 = Asciz ()Ljava/lang/String;;
const #13 = Asciz setBar;
const #14 = Asciz (Ljava/lang/String;)V;
const #15 = Asciz SourceFile;
const #16 = Asciz Foo.java;
const #17 = NameAndType #7:#8;// "<init>":()V
const #18 = NameAndType #5:#6;// bar:Ljava/lang/String;
const #19 = Asciz Foo;
const #20 = Asciz java/lang/Object;

{
public Foo();
  Signature: ()V
  Code:
   Stack=1, Locals=1, Args_size=1
   0: aload_0
   1: invokespecial #1; //Method java/lang/Object."<init>":()V
   4: return
  LineNumberTable:
   line 1: 0

public java.lang.String getBar();
  Signature: ()Ljava/lang/String;
  Code:
   Stack=1, Locals=1, Args_size=1
   0: aload_0
   1: getfield #2; //Field bar:Ljava/lang/String;
   4: areturn
  LineNumberTable:
   line 5: 0

public void setBar(java.lang.String);
  Signature: (Ljava/lang/String;)V
  Code:
   Stack=2, Locals=2, Args_size=2
   0: aload_0
   1: aload_1
   2: putfield #2; //Field bar:Ljava/lang/String;
   5: return
  LineNumberTable:
   line 8: 0
   line 9: 5
}
Now we can see what those weird operands are. For instance, #2:

const #2 = Field #3.#18; // Foo.bar:Ljava/lang/String;

which refers to:

const #3 = class #19; // Foo
const #18 = NameAndType #5:#6;// bar:Ljava/lang/String;

and so on.
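The chain of references above can be followed mechanically. The sketch below is a simplified model, not how javap or the JVM is implemented: it stores only the six entries that #2 depends on (copied from the listing) and resolves an index by recursively resolving whatever it points to. The class and method names (ConstantPoolSketch, asciz, classRef, nameAndType, field, resolve) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class ConstantPoolSketch {
    /** One pool entry: either literal text (Asciz) or references to other entries. */
    private static final class Entry {
        final String text;      // non-null only for Asciz entries
        final String separator; // placed between the resolved references
        final int[] refs;       // indices of the entries this one points to

        Entry(String text, String separator, int... refs) {
            this.text = text;
            this.separator = separator;
            this.refs = refs;
        }
    }

    private final Map<Integer, Entry> entries = new HashMap<>();

    void asciz(int index, String text) {
        entries.put(index, new Entry(text, null));
    }

    void classRef(int index, int nameIndex) {
        entries.put(index, new Entry(null, "", nameIndex));
    }

    void nameAndType(int index, int nameIndex, int descriptorIndex) {
        entries.put(index, new Entry(null, ":", nameIndex, descriptorIndex));
    }

    void field(int index, int classIndex, int nameAndTypeIndex) {
        entries.put(index, new Entry(null, ".", classIndex, nameAndTypeIndex));
    }

    /** Follows index references until only text is left. */
    String resolve(int index) {
        Entry entry = entries.get(index);
        if (entry == null) {
            throw new IllegalArgumentException("No entry #" + index);
        }
        if (entry.text != null) {
            return entry.text;
        }
        StringBuilder resolved = new StringBuilder();
        for (int i = 0; i < entry.refs.length; i++) {
            if (i > 0) {
                resolved.append(entry.separator);
            }
            resolved.append(resolve(entry.refs[i]));
        }
        return resolved.toString();
    }

    /** The subset of Foo's constant pool that entry #2 depends on. */
    static ConstantPoolSketch fooPool() {
        ConstantPoolSketch pool = new ConstantPoolSketch();
        pool.field(2, 3, 18);
        pool.classRef(3, 19);
        pool.asciz(5, "bar");
        pool.asciz(6, "Ljava/lang/String;");
        pool.nameAndType(18, 5, 6);
        pool.asciz(19, "Foo");
        return pool;
    }

    public static void main(String[] args) {
        System.out.println(fooPool().resolve(2)); // prints: Foo.bar:Ljava/lang/String;
    }
}
```

The resolved string is the same text that javap prints as a comment next to const #2.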
One more thing to notice is that every opcode is marked with a number (0: aload_0). This is the position of the instruction within the method's bytecode array, which is explained later.

To understand how the bytecode works, it is worth having a look at the execution model. The JVM uses a stack-based model of computation. Each thread has a JVM stack which stores frames. For instance, when running the application in a debugger, you will see those frames:
IntelliJ IDEA debugging session
Each time a method is invoked a new stack frame is created. The frame consists of an operand stack, an array of local variables, and a reference to the runtime constant pool of the class of the current method.
Conceptual stack frame structure (adapted from developerWorks)
The size of the array of local variables is determined at compile time and is dependent on the number and size of local variables and formal method parameters. The operand stack is a LIFO stack used to push and pop values. Its size is also determined at compile time. Certain opcode instructions push values onto the operand stack; others take operands from the stack, manipulate them, and push the result. The operand stack is also used to receive return values from methods.
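As a rough mental model of the paragraph above, a frame can be pictured as two fixed-size arrays whose sizes are known before the method runs. The sketch below is an assumption-laden simplification: it holds only int values (a real frame also holds references, and long and double values take two slots), it leaves out the reference to the runtime constant pool, and the name FrameSketch and its methods are made up for illustration.

```java
class FrameSketch {
    private final int[] locals;       // local variable array, size fixed up front
    private final int[] operandStack; // operand stack, maximum depth fixed up front
    private int top = 0;              // number of values currently on the stack

    FrameSketch(int maxStack, int maxLocals) {
        this.operandStack = new int[maxStack];
        this.locals = new int[maxLocals];
    }

    /** Pushes a value onto the operand stack. */
    void push(int value) {
        if (top == operandStack.length) {
            throw new IllegalStateException("operand stack overflow");
        }
        operandStack[top++] = value;
    }

    /** Pops the most recently pushed value (LIFO). */
    int pop() {
        if (top == 0) {
            throw new IllegalStateException("operand stack underflow");
        }
        return operandStack[--top];
    }

    /** Like istore_N: pops the top of the stack into a local variable slot. */
    void store(int slot) {
        locals[slot] = pop();
    }

    /** Like iload_N: pushes the value of a local variable slot. */
    void load(int slot) {
        push(locals[slot]);
    }

    int depth() {
        return top;
    }

    public static void main(String[] args) {
        FrameSketch frame = new FrameSketch(2, 3); // Stack=2, Locals=3
        frame.push(7);
        frame.store(1);  // locals[1] = 7, stack is empty again
        frame.load(1);
        frame.load(1);
        System.out.println(frame.depth()); // prints: 2
        System.out.println(frame.pop());   // prints: 7
    }
}
```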
public String getBar(){
    return bar;
}

public java.lang.String getBar();
  Code:
   0: aload_0
   1: getfield #2; //Field bar:Ljava/lang/String;
   4: areturn
The bytecode for the method above consists of three opcode instructions. The first opcode, aload_0, pushes the value from index 0 of the local variable table onto the operand stack. The this reference is always stored at location 0 of the local variable table for constructors and instance methods. The next opcode instruction, getfield, is used to fetch a field from an object. The last instruction, areturn, returns a reference from a method.

Each method has a corresponding bytecode array. Looking at a .class file with a hex editor, you can see the values of that array.
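As a rough illustration of how such a byte array maps back to instructions, the sketch below walks the five bytes that make up getBar (2A B4 00 02 B0) and prints one line per instruction. It is deliberately tiny: it knows only the three opcodes that occur in this method, and the class name TinyDisassembler is an assumption for the example, not a real tool.

```java
import java.util.ArrayList;
import java.util.List;

class TinyDisassembler {
    /** Decodes a method body that uses only aload_0, getfield and areturn. */
    static List<String> disassemble(byte[] code) {
        List<String> lines = new ArrayList<>();
        int pc = 0; // position of the current instruction in the array
        while (pc < code.length) {
            int opcode = code[pc] & 0xFF; // read the byte as unsigned
            switch (opcode) {
                case 0x2A:
                    lines.add(pc + ": aload_0");
                    pc += 1; // no operand bytes
                    break;
                case 0xB4: {
                    // The two bytes after the opcode form one 16-bit constant pool index.
                    int index = ((code[pc + 1] & 0xFF) << 8) | (code[pc + 2] & 0xFF);
                    lines.add(pc + ": getfield #" + index);
                    pc += 3; // opcode plus two operand bytes
                    break;
                }
                case 0xB0:
                    lines.add(pc + ": areturn");
                    pc += 1;
                    break;
                default:
                    throw new IllegalArgumentException(
                            "Opcode not covered by this sketch: 0x" + Integer.toHexString(opcode));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] getBar = {0x2A, (byte) 0xB4, 0x00, 0x02, (byte) 0xB0};
        for (String line : disassemble(getBar)) {
            System.out.println(line);
        }
        // prints:
        // 0: aload_0
        // 1: getfield #2
        // 4: areturn
    }
}
```

The printed positions 0, 1 and 4 are the same numbers javap shows in front of each instruction.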
The bytecode for the getBar method is 2A B4 00 02 B0. The code 2A corresponds to the aload_0 instruction, and B0 corresponds to areturn. It might seem strange that the bytecode of the method has 3 instructions, but the byte array holds 5 elements. This is because getfield (B4) takes a two-byte operand (00 02), which is the constant pool index #2. Those two bytes occupy positions 2 and 3 in the array, hence the array size is 5 and the areturn instruction is shifted to position 4.

The Local Variables Table
To demonstrate how the local variables are handled let's have a look at another example:
public class Example {
    public int plus(int a){
        int b = 1;
        return a + b;
    }
}
We can spot 2 local variables in this method: one is the method's parameter, and the other is the local variable int b. Here's what the bytecode looks like (the LocalVariableTable is included only when the class is compiled with debug information, i.e. javac -g):

public int plus(int);
  Code:
   Stack=2, Locals=3, Args_size=2
   0: iconst_1
   1: istore_2
   2: iload_1
   3: iload_2
   4: iadd
   5: ireturn
  LineNumberTable:
   line 5: 0
   line 6: 2

  LocalVariableTable:
   Start  Length  Slot  Name  Signature
   0      6       0     this  LExample;
   0      6       1     a     I
   2      4       2     b     I
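The listing above can be traced by hand, or with a toy interpreter such as the sketch below. It is not how the JVM executes bytecode; it only mimics the six instructions of plus, working on mnemonic strings rather than bytes. The class name PlusInterpreter is hypothetical, and slot 0, which would hold the this reference, is simply left unused.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class PlusInterpreter {
    /** Executes the instructions in order against an operand stack and a local variable array. */
    static int run(String[] code, int[] locals) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (String instruction : code) {
            switch (instruction) {
                case "iconst_1":
                    stack.push(1);            // push the constant 1
                    break;
                case "istore_2":
                    locals[2] = stack.pop();  // pop into slot 2 (b)
                    break;
                case "iload_1":
                    stack.push(locals[1]);    // push slot 1 (a)
                    break;
                case "iload_2":
                    stack.push(locals[2]);    // push slot 2 (b)
                    break;
                case "iadd": {
                    int right = stack.pop();
                    int left = stack.pop();
                    stack.push(left + right); // the result goes back onto the stack
                    break;
                }
                case "ireturn":
                    return stack.pop();       // the top of the stack is the return value
                default:
                    throw new IllegalArgumentException("Not covered by this sketch: " + instruction);
            }
        }
        throw new IllegalStateException("method ended without ireturn");
    }

    /** Mirrors Example.plus(int a) from the listing. */
    static int plus(int a) {
        String[] code = {"iconst_1", "istore_2", "iload_1", "iload_2", "iadd", "ireturn"};
        int[] locals = new int[3]; // Locals=3: slot 0 = this (unused here), 1 = a, 2 = b
        locals[1] = a;
        return run(code, locals);
    }

    public static void main(String[] args) {
        System.out.println(plus(41)); // prints: 42
    }
}
```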
First, the method loads the constant 1 with iconst_1 and stores it in local variable number 2 with istore_2. We can see in the local variables table that slot number 2 is occupied by the variable named b, as expected. Next, iload_1 pushes the value of a onto the stack, and iload_2 pushes the value of b. iadd pops the 2 operands from the stack, adds them, and pushes the result back onto the stack, from where ireturn returns it from the method.

Exception Handling
Another interesting example is the bytecode generated for exception handling, i.e. for try-catch-finally constructs.

public class ExceptionExample {
    public void foo(){
        try {
            tryMethod();
        } catch (Exception e) {
            catchMethod();
        }finally{
            finallyMethod();
        }
    }

    private void tryMethod() throws Exception{}

    private void catchMethod() {}

    private void finallyMethod(){}
}
Here's what is generated for the foo() method:

public void foo();
  Code:
   0: aload_0
   1: invokespecial #2; //Method tryMethod:()V
   4: aload_0
   5: invokespecial #3; //Method finallyMethod:()V
   8: goto 30
   11: astore_1
   12: aload_0
   13: invokespecial #5; //Method catchMethod:()V
   16: aload_0
   17: invokespecial #3; //Method finallyMethod:()V
   20: goto 30
   23: astore_2
   24: aload_0
   25: invokespecial #3; //Method finallyMethod:()V
   28: aload_2
   29: athrow
   30: return
  Exception table:
   from  to  target  type
   0     4   11      Class java/lang/Exception
   0     4   23      any
   11    16  23      any
   23    24  23      any
So in fact, the compiler generated the code for all the scenarios possible within the try-catch-finally execution: the call to finallyMethod() was duplicated 3 times (!). The try block is compiled just as it would be if the try were not present, and is merged with the finally block:

0: aload_0
1: invokespecial #2; //Method tryMethod:()V
4: aload_0
5: invokespecial #3; //Method finallyMethod:()V

If the block executes successfully, the goto instruction will lead the execution to position 30, which is the return opcode.

If tryMethod throws an instance of Exception, the first (innermost) applicable exception handler in the exception table is chosen to handle the exception. From the exception table we can see that the position to proceed with the exception handling is 11:

0 4 11 Class java/lang/Exception
This will lead to the execution of catchMethod() and finallyMethod():

11: astore_1
12: aload_0
13: invokespecial #5; //Method catchMethod:()V
16: aload_0
17: invokespecial #3; //Method finallyMethod:()V
If any other exception is thrown during the execution, the exception table shows that the position to proceed from, for any type of exception, is 23:

0 4 23 any
11 16 23 any
23 24 23 any
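The way a handler is picked from this table can be sketched in a few lines: scan the entries in order and take the first one whose range covers the position of the failing instruction and whose type matches the thrown exception. The range includes "from" and excludes "to". The code below is a simplified model of that lookup, with made-up names (ExceptionTableSketch, Entry, findHandler) and with null standing in for the "any" type.

```java
import java.util.Arrays;
import java.util.List;

class ExceptionTableSketch {
    /** One row of the exception table. */
    static final class Entry {
        final int from;   // first covered position (inclusive)
        final int to;     // end of the covered range (exclusive)
        final int target; // position of the handler code
        final Class<? extends Throwable> type; // null stands for "any"

        Entry(int from, int to, int target, Class<? extends Throwable> type) {
            this.from = from;
            this.to = to;
            this.target = target;
            this.type = type;
        }
    }

    /** The exception table of foo(), copied from the javap output. */
    static final List<Entry> FOO_TABLE = Arrays.asList(
            new Entry(0, 4, 11, Exception.class),
            new Entry(0, 4, 23, null),
            new Entry(11, 16, 23, null),
            new Entry(23, 24, 23, null));

    /** Returns the target of the first matching entry, or -1 if the exception leaves the method. */
    static int findHandler(List<Entry> table, int pc, Throwable thrown) {
        for (Entry entry : table) {
            boolean inRange = entry.from <= pc && pc < entry.to;
            boolean typeMatches = entry.type == null || entry.type.isInstance(thrown);
            if (inRange && typeMatches) {
                return entry.target;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Position 1 is the call to tryMethod(), 13 the call to catchMethod(),
        // 17 the call to finallyMethod() on the catch path.
        System.out.println(findHandler(FOO_TABLE, 1, new Exception()));         // prints: 11
        System.out.println(findHandler(FOO_TABLE, 1, new Error()));             // prints: 23
        System.out.println(findHandler(FOO_TABLE, 13, new RuntimeException())); // prints: 23
        System.out.println(findHandler(FOO_TABLE, 17, new RuntimeException())); // prints: -1
    }
}
```

The last call returns -1 because none of the table rows covers position 17: an exception thrown by finallyMethod() itself is not caught inside foo().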
Reading the instructions starting at position 23:

23: astore_2
24: aload_0
25: invokespecial #3; //Method finallyMethod:()V
28: aload_2
29: athrow
30: return
So finallyMethod() is executed in any case, followed by aload_2 and athrow to rethrow the unhandled exception.

Bottom line
I have covered just a few aspects of JVM bytecode. Most of the inspiration came from the developerWorks article by Peter Haggar, Java bytecode: Understanding bytecode makes you a better programmer. It must be the best article on the subject that I've managed to find. The article looks a bit outdated already, though it is still relevant. Surprisingly, the BCEL user's manual has a decent description of bytecode fundamentals, so I'd suggest taking a look if you're interested. The Java Virtual Machine Specification might also be a useful source of information, but it is a little hard to read and it doesn't visualize the things it describes, which is often useful.
Overall, I think that understanding how bytecode works is essential for developers who want to deepen their proficiency in Java programming, especially those who look into framework internals, tooling, or compilers for JVM languages.
References
Wikipedia: Java bytecode
Wikipedia: Java bytecode instruction listings
Java bytecode: Understanding bytecode makes you a better programmer
JVM Spec: Chapter 7. Compiling for the Java Virtual Machine
BCEL User Manual
23 comments:
It's a very great article...
10 points for article !!!!
I can't remember exact time but somehow 8 year ago I read a book - Internet Java Applications written by Art Gittleman. In this book, he explained "JVM" very well.
Here is link - http://www.cecs.csulb.edu/~artg/internet/toc.txt
Sections 3 and 7 of the ASM user guide are also a great resource.
http://download.forge.objectweb.org/asm/asm-guide.pdf
Rémi
Thank you for such a clear explanation.
@Rémi, yeah! the manuals for the bytecode manipulation libraries are almost the only comprehensive source for this subject. But other than that, quite a few information can be found.
Probably some scientific papers in ACM or IEEE?
Interestingly I don't seem to be able to compile Foo.java verbatim:
> javac Foo.java
Foo.java:10: missing return statement
}
^
1 error
I'm using javac 1.6.0_22. Adding a return null statement fixed the issue, but should this be necessary?
@samaaron that's because the setter method should have void return type. my mistake, sorry for that. thx for spotting this out, corrected now
@anton
Haha, I should have spotted that myself. Clearly, my java-fu is extremely weak :-)
Something else I'm unable to fathom. I compiled Foo.java and ran javap -c -s -verbose Foo. However, the mnemonic version I see differs from yours. Your constructor has the following signature:
Signature: (Ljava/lang/String;)V
Mine seems blank. Here's the whole of it:
public Foo();
Signature: ()V
Code:
Stack=1, Locals=1, Args_size=1
0: aload_0
1: invokespecial #1; //Method java/lang/Object."<init>":()V
4: return
LineNumberTable:
line 1: 0
Do you have any idea why this might be the case?
@samaaron again, this is what happens when you try to post something late night and experiment too much :)
thx for re-checking this - I have copy-pasted another example indeed. (with other constructor). Corrected it now.
Thanks a lot for compiling this info.
I didn't know of few tools/commands you used here.
Yes and people keep on asking where is the CAFEBABE. For those interested please see the hexcode in the post above. The first 2 bytes are reserved for this magic number. This helps the interpreter to distinguish class files from other binary file.
@Sandeep it is 4 bytes, isn't it? :)
Anton, would it be OK to publish a translation of your post on http://habrahabr.ru ?
@genocyber of course you can! You even should! :)
How about "Inside the JAVA 2 Virtual Machine" by Bill Venners ? It's like the Java VM Spec in cleartext.
@Pertti absolutely!
I have two books about the JVM that have a lot of information about the bytecodes.
"The Java Virtual Machine Specification" by Lindholm and Yellin, published by Sun; the second edition came out in 1999
"Java Virtual Machine" by Meyer and Downing, published by O'Reilly
The second one is older, 1997, but comes with a diskette (!!?) that has an "assembler" named Jasmin that lets you write directly in bytecodes
Thanks for the lucid thoughts. Very helpful.
-bill
Man! I just like it! Good article!
Very cool article. Greatly appreciated! :)
Thanks!
It's really a nice article. But initially, I wondered how you got that local variable table for the Example class displayed
in the bytecode. Later I found out that we should include the option -g while compiling the program; otherwise this information will not be included in the class file.