
Sunday, January 2, 2011

Java Bytecode Fundamentals

Java bytecode is the bread and butter of JRebel, the productivity tool for Java developers, which is why I decided to write a blog post on the subject. This post summarizes various sources that I found while googling the topic; hopefully someone will find it relevant and useful. What is a little weird is that there isn't much information on the subject, in either books or articles. BTW, if you have anything to add or comment on, don't hesitate to post in the comments :)

Developers who use Java for application development usually do not need to be aware of the bytecode that is executed in the VM. However, developers who implement state-of-the-art frameworks, compilers, or even Java tooling may need to understand bytecode and may even work with it directly. While special libraries (like ASM, cglib, Javassist) do help with bytecode manipulation, it is still important to understand the fundamentals in order to use those tools effectively.

Let's start off with a simple example - a POJO, with one field, a getter and a setter.

public class Foo {
    private String bar;

    public String getBar() {
        return bar;
    }

    public void setBar(String bar) {
        this.bar = bar;
    }
}

First of all, once you compile the class using javac Foo.java, you'll have Foo.class, which now contains the bytecode. Here's how it looks in a hex editor:


Each pair of hex digits (a byte) can be translated to an opcode (mnemonic) or to one of its operands, but obviously it would be too brutal to start reading it in the binary format. Let's proceed to the mnemonic representation.
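Before leaving the raw bytes behind, it may help to see that the start of the file is easy to read programmatically. The sketch below is not from any of the sources above; ClassHeader is a hypothetical helper name. It assumes the standard class file layout, in which a 4-byte magic number (0xCAFEBABE) is followed by the 2-byte minor and major versions, all big-endian.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration: reads only the fixed 8-byte header of a class file.
final class ClassHeader {
    final int minor;
    final int major;

    private ClassHeader(int minor, int major) {
        this.minor = minor;
        this.major = major;
    }

    static ClassHeader parse(byte[] classBytes) {
        // ByteBuffer is big-endian by default, which matches the class file format.
        ByteBuffer in = ByteBuffer.wrap(classBytes);
        int magic = in.getInt();               // u4 magic
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        int minor = in.getShort() & 0xFFFF;    // u2 minor_version
        int major = in.getShort() & 0xFFFF;    // u2 major_version
        return new ClassHeader(minor, major);
    }
}
```

Feeding it the contents of Foo.class (for example via java.nio.file.Files.readAllBytes) should give the same minor and major versions that javap -verbose reports.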

Executing javap -c Foo will print out the bytecode:
public class Foo extends java.lang.Object {
public Foo();
  Code:
   0:   aload_0
   1:   invokespecial   #1; //Method java/lang/Object."<init>":()V
   4:   return

public java.lang.String getBar();
  Code:
   0:   aload_0
   1:   getfield        #2; //Field bar:Ljava/lang/String;
   4:   areturn

public void setBar(java.lang.String);
  Code:
   0:   aload_0
   1:   aload_1
   2:   putfield        #2; //Field bar:Ljava/lang/String;
   5:   return

}
The class is very simple, and it is easy to see the relation between the source code and the generated bytecode. First of all, we notice that in the bytecode version of the class the compiler has inserted a default constructor (as promised by the Java Language Specification).



Secondly, if we study the Java bytecode instructions, we can see that some of them carry a prefix, as in aload_0 or istore_2 (our example uses aload_0 and aload_1). The prefix indicates the type of data the instruction operates on: 'a' means that the opcode manipulates an object reference, and 'i' means that it manipulates an int.
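To make the prefixes more tangible, here is a small sketch with one method per type. PrefixDemo and its methods are made-up names; the comments list the instructions javac typically emits for each body, and javap -c can be used to check what a particular compiler produces. Besides 'a' and 'i', the sketch also shows 'l' (long) and 'd' (double).

```java
// Hypothetical class for illustration. The methods are static, so slot 0 holds
// the first parameter rather than 'this'.
final class PrefixDemo {
    // iload_0, iload_1, iadd, ireturn
    static int addInt(int a, int b) {
        return a + b;
    }

    // lload_0, lload_2, ladd, lreturn (a long occupies two local variable slots)
    static long addLong(long a, long b) {
        return a + b;
    }

    // dload_0, dload_2, dadd, dreturn (a double occupies two slots as well)
    static double addDouble(double a, double b) {
        return a + b;
    }

    // aload_0, areturn
    static String same(String s) {
        return s;
    }
}
```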

One interesting thing we could spot here is that some of the instructions take a weird operand like #1 or #2, which actually refers to the constant pool of the class. This is a good point to get more information from the class file. Execute the following command: javap -c -s -verbose Foo (-s prints the signatures, -verbose prints all the details)

Compiled from "Foo.java"
public class Foo extends java.lang.Object
  SourceFile: "Foo.java"
  minor version: 0
  major version: 50
  Constant pool:
const #1 = Method       #4.#17; //  java/lang/Object."<init>":()V
const #2 = Field        #3.#18; //  Foo.bar:Ljava/lang/String;
const #3 = class        #19;    //  Foo
const #4 = class        #20;    //  java/lang/Object
const #5 = Asciz        bar;
const #6 = Asciz        Ljava/lang/String;;
const #7 = Asciz        <init>
const #8 = Asciz        ()V;
const #9 = Asciz        Code;
const #10 = Asciz       LineNumberTable;
const #11 = Asciz       getBar;
const #12 = Asciz       ()Ljava/lang/String;;
const #13 = Asciz       setBar;
const #14 = Asciz       (Ljava/lang/String;)V;
const #15 = Asciz       SourceFile;
const #16 = Asciz       Foo.java;
const #17 = NameAndType #7:#8;//  "<init>":()V
const #18 = NameAndType #5:#6;//  bar:Ljava/lang/String;
const #19 = Asciz       Foo;
const #20 = Asciz       java/lang/Object;

{
public Foo();
  Signature: ()V
  Code:
   Stack=1, Locals=1, Args_size=1
   0:   aload_0
   1:   invokespecial   #1; //Method java/lang/Object."<init>":()V
   4:   return
  LineNumberTable:
   line 1: 0

public java.lang.String getBar();
  Signature: ()Ljava/lang/String;
  Code:
   Stack=1, Locals=1, Args_size=1
   0:   aload_0
   1:   getfield        #2; //Field bar:Ljava/lang/String;
   4:   areturn
  LineNumberTable:
   line 5: 0

public void setBar(java.lang.String);
  Signature: (Ljava/lang/String;)V
  Code:
   Stack=2, Locals=2, Args_size=2
   0:   aload_0
   1:   aload_1
   2:   putfield        #2; //Field bar:Ljava/lang/String;
   5:   return
  LineNumberTable:
   line 8: 0
   line 9: 5
}

Now we can see what those weird operands are. For instance, #2:

const #2 = Field #3.#18; // Foo.bar:Ljava/lang/String;

Which refers to:

const #3 = class #19; // Foo
const #18 = NameAndType #5:#6;// bar:Ljava/lang/String;

etc.
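The chain of references can be followed mechanically. The sketch below is only a toy model: ConstantPoolSketch and its methods are hypothetical names, and it hard-codes the handful of entries from the listing that #2 depends on, then resolves an index recursively until only strings are left.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a constant pool; not a class file parser.
final class ConstantPoolSketch {
    private static final class Entry {
        final String tag;   // "Utf8", "Class", "NameAndType" or "Field"
        final String text;  // only used by Utf8 entries
        final int first;    // first referenced index (0 if unused)
        final int second;   // second referenced index (0 if unused)

        Entry(String tag, String text, int first, int second) {
            this.tag = tag;
            this.text = text;
            this.first = first;
            this.second = second;
        }
    }

    private final Map<Integer, Entry> pool = new HashMap<>();

    void utf8(int index, String text) {
        pool.put(index, new Entry("Utf8", text, 0, 0));
    }

    void classRef(int index, int nameIndex) {
        pool.put(index, new Entry("Class", null, nameIndex, 0));
    }

    void nameAndType(int index, int nameIndex, int descriptorIndex) {
        pool.put(index, new Entry("NameAndType", null, nameIndex, descriptorIndex));
    }

    void fieldRef(int index, int classIndex, int nameAndTypeIndex) {
        pool.put(index, new Entry("Field", null, classIndex, nameAndTypeIndex));
    }

    // Follows the index references until only Utf8 strings remain.
    String resolve(int index) {
        Entry e = pool.get(index);
        if (e == null) {
            throw new IllegalArgumentException("no entry #" + index);
        }
        switch (e.tag) {
            case "Utf8":        return e.text;
            case "Class":       return resolve(e.first);
            case "NameAndType": return resolve(e.first) + ":" + resolve(e.second);
            case "Field":       return resolve(e.first) + "." + resolve(e.second);
            default:            throw new IllegalStateException("unknown tag " + e.tag);
        }
    }

    // The entries that #2 depends on, copied from the javap listing for Foo.
    static ConstantPoolSketch fooPool() {
        ConstantPoolSketch cp = new ConstantPoolSketch();
        cp.fieldRef(2, 3, 18);
        cp.classRef(3, 19);
        cp.utf8(5, "bar");
        cp.utf8(6, "Ljava/lang/String;");
        cp.nameAndType(18, 5, 6);
        cp.utf8(19, "Foo");
        return cp;
    }
}
```

Calling ConstantPoolSketch.fooPool().resolve(2) yields Foo.bar:Ljava/lang/String;, the same text javap prints in the comment next to #2.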

One more thing to notice is that every instruction is marked with a number (0: aload_0). This is the offset of the instruction within the method's bytecode array - explained later.

To understand how the bytecode works, it is worth having a look at the execution model. The JVM uses a stack-based model of computation. Each thread has a JVM stack, which stores frames. For instance, when running an application in a debugger, you will see those frames:

IntelliJ IDEA debugging session

Each time a method is invoked a new stack frame is created. The frame consists of an operand stack, an array of local variables, and a reference to the runtime constant pool of the class of the current method.

Conceptual stack frame structure (adapted from developerWorks).
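The "one frame per invocation" rule can also be observed from plain Java, without a debugger. The following sketch uses made-up names (FrameDepth, depthAfter) and treats a stack trace as a view of the frames on the current thread: every extra nested call adds exactly one element to it.

```java
// Hypothetical helper for illustration: counts frames by looking at a stack trace.
final class FrameDepth {
    // Number of frames on the current thread's stack at the point of the call.
    static int currentDepth() {
        return new Throwable().getStackTrace().length;
    }

    // Calls itself n times, then reports the depth seen by the innermost call.
    static int depthAfter(int n) {
        return n == 0 ? currentDepth() : depthAfter(n - 1);
    }
}
```

The absolute numbers depend on where the call is made from, but the difference between depthAfter(5) and depthAfter(0) made from the same place is 5, one frame for each additional invocation.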

The size of the array of local variables is determined at compile time and is dependent on the number and size of local variables and formal method parameters. The operand stack is a LIFO stack used to push and pop values. Its size is also determined at compile time. Certain opcode instructions push values onto the operand stack; others take operands from the stack, manipulate them, and push the result. The operand stack is also used to receive return values from methods.

  public String getBar() {
    return bar;
  }

  public java.lang.String getBar();
    Code:
      0:   aload_0
      1:   getfield        #2; //Field bar:Ljava/lang/String;
      4:   areturn

The bytecode for the method above consists of three opcode instructions. The first opcode, aload_0, pushes the value from index 0 of the local variable table onto the operand stack. The this reference is always stored at location 0 of the local variable table for constructors and instance methods. The next opcode instruction, getfield, fetches a field from an object: it pops the object reference from the operand stack and pushes the value of the field. The last instruction, areturn, returns a reference from a method.

Each method has a corresponding bytecode array. Looking at a .class file with a hex editor, you would see the following values in the bytecode array:


That said, the bytecode for the getBar method is 2A B4 00 02 B0. The code 2A corresponds to the aload_0 instruction, and B0 corresponds to areturn. It might seem strange that the method has 3 instructions but the byte array holds 5 elements. This is because getfield (B4) takes a 2-byte operand (00 02), which is the index of the field reference in the constant pool (#2). Those two bytes occupy positions 2 and 3 in the array, hence the array size is 5 and the areturn instruction is shifted to position 4.
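The decoding described above can be sketched as a tiny disassembler. TinyDisassembler is a hypothetical name, and it knows only the six opcodes that appear in the Foo listing; the opcode values are taken from the JVM instruction set. Note how the offset jumps from 1 to 4 because getfield is followed by its 2-byte operand.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: decodes only the opcodes used by Foo's methods.
final class TinyDisassembler {
    static List<String> disassemble(byte[] code) {
        List<String> out = new ArrayList<>();
        int pc = 0; // offset of the current instruction in the bytecode array
        while (pc < code.length) {
            int opcode = code[pc] & 0xFF;
            switch (opcode) {
                case 0x2A: out.add(pc + ": aload_0"); pc += 1; break;
                case 0x2B: out.add(pc + ": aload_1"); pc += 1; break;
                case 0xB0: out.add(pc + ": areturn"); pc += 1; break;
                case 0xB1: out.add(pc + ": return");  pc += 1; break;
                case 0xB4: out.add(pc + ": getfield #" + index(code, pc)); pc += 3; break;
                case 0xB5: out.add(pc + ": putfield #" + index(code, pc)); pc += 3; break;
                default:
                    throw new IllegalArgumentException(
                            "unsupported opcode 0x" + Integer.toHexString(opcode));
            }
        }
        return out;
    }

    // Reads the big-endian 2-byte constant pool index that follows the opcode at pc.
    private static int index(byte[] code, int pc) {
        return ((code[pc + 1] & 0xFF) << 8) | (code[pc + 2] & 0xFF);
    }
}
```

For the bytes 2A B4 00 02 B0 this produces the lines 0: aload_0, 1: getfield #2 and 4: areturn, matching the offsets in the javap output.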


The Local Variables Table

To demonstrate how the local variables are handled let's have a look at another example:

public class Example {
    public int plus(int a) {
        int b = 1;
        return a + b;
    }
}

We can spot 2 local variables in this method: the method's parameter a and the local variable int b. Here's what the bytecode looks like:

public int plus(int);
  Code:
   Stack=2, Locals=3, Args_size=2
   0:   iconst_1
   1:   istore_2
   2:   iload_1
   3:   iload_2
   4:   iadd
   5:   ireturn
  LineNumberTable:
   line 5: 0
   line 6: 2

  LocalVariableTable:
   Start  Length  Slot  Name     Signature
   0      6       0     this     LExample;
   0      6       1     a        I
   2      4       2     b        I

First, the method loads the constant 1 with iconst_1 and stores it in local variable number 2 with istore_2. We can see in the local variable table that slot number 2 is occupied by the variable b, as expected. Next, iload_1 loads the value of a onto the stack, and iload_2 loads the value of b. iadd pops the 2 operands from the stack, adds them, and pushes the result back onto the stack; ireturn then returns that value from the method.
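The walkthrough above can be replayed with a toy interpreter. This is only a sketch of the execution model, not how a real JVM is implemented: PlusInterpreter is a made-up name, instructions are represented by their mnemonics, and the this reference in slot 0 is not modelled.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy interpreter for the six instructions of Example.plus(int).
final class PlusInterpreter {
    static int plus(int a) {
        int[] locals = new int[3];                  // Locals=3: this, a, b
        Deque<Integer> stack = new ArrayDeque<>();  // operand stack; depth never exceeds 2
        locals[1] = a;                              // slot 1 holds the parameter a

        String[] code = {"iconst_1", "istore_2", "iload_1", "iload_2", "iadd", "ireturn"};
        for (String op : code) {
            switch (op) {
                case "iconst_1": stack.push(1); break;               // push the constant 1
                case "istore_2": locals[2] = stack.pop(); break;     // b = 1
                case "iload_1":  stack.push(locals[1]); break;       // push a
                case "iload_2":  stack.push(locals[2]); break;       // push b
                case "iadd": {
                    int right = stack.pop();
                    int left = stack.pop();
                    stack.push(left + right);                        // push a + b
                    break;
                }
                case "ireturn":  return stack.pop();                 // return the top of the stack
                default: throw new IllegalStateException("unknown instruction " + op);
            }
        }
        throw new IllegalStateException("fell off the end of the method");
    }
}
```

PlusInterpreter.plus(41) returns 42, the same result as calling plus(41) on an Example instance.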


Exception Handling

Another interesting example is the bytecode generated for exception handling, i.e. for try-catch-finally constructs.

public class ExceptionExample {

  public void foo(){
    try {
      tryMethod();
    }
    catch (Exception e) {
      catchMethod();
    }finally{
      finallyMethod();
    }
  }

  private void tryMethod() throws Exception{}

  private void catchMethod() {}

  private void finallyMethod(){}

}

Here's what is generated for foo() method:

public void foo();
  Code:
   0:   aload_0
   1:   invokespecial   #2; //Method tryMethod:()V
   4:   aload_0
   5:   invokespecial   #3; //Method finallyMethod:()V
   8:   goto    30
   11:  astore_1
   12:  aload_0
   13:  invokespecial   #5; //Method catchMethod:()V
   16:  aload_0
   17:  invokespecial   #3; //Method finallyMethod:()V
   20:  goto    30
   23:  astore_2
   24:  aload_0
   25:  invokespecial   #3; //Method finallyMethod:()V
   28:  aload_2
   29:  athrow
   30:  return
  Exception table:
   from   to  target type
     0     4    11   Class java/lang/Exception
     0     4    23   any
    11    16    23   any
    23    24    23   any

So in fact, the compiler generated code for all the scenarios possible within the try-catch-finally execution: the call to finallyMethod() was inlined 3 times (!). The try block is compiled just as it would be if the try were not present, and merged with finally:

   0:   aload_0
   1:   invokespecial   #2; //Method tryMethod:()V
   4:   aload_0
   5:   invokespecial   #3; //Method finallyMethod:()V

If the block executes successfully, the goto instruction leads the execution to position 30, which is the return opcode.

If tryMethod throws an instance of Exception, the first (innermost) applicable exception handler in the exception table is chosen to handle the exception. From the exception table we can see that the position to proceed with the exception handling is 11:

0 4 11 Class java/lang/Exception

This will lead to the execution of catchMethod() and finallyMethod():

 
   11:  astore_1
   12:  aload_0
   13:  invokespecial   #5; //Method catchMethod:()V
   16:  aload_0
   17:  invokespecial   #3; //Method finallyMethod:()V

If any other exception is thrown during the execution, we can see from the exception table that the position to proceed to for any type of exception is 23:
 
    0     4    23   any
    11    16    23   any
    23    24    23   any

Reading the instructions starting with position 23:
 
   23:  astore_2
   24:  aload_0
   25:  invokespecial   #3; //Method finallyMethod:()V
   28:  aload_2
   29:  athrow
   30:  return

So finallyMethod() is executed in any case, with aload_2 and athrow rethrowing the unhandled exception.
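The handler lookup itself can be sketched in a few lines. The class and method names below (ExceptionTableSketch, findHandler) are hypothetical, and the table is copied from the listing for foo(). The sketch assumes the rules from the JVM specification: entries are searched in order, the from offset is inclusive, the to offset is exclusive, and "any" matches every throwable.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of how a handler is picked from the exception table of foo().
final class ExceptionTableSketch {
    static final class Entry {
        final int from;                         // inclusive start offset
        final int to;                           // exclusive end offset
        final int target;                       // offset of the handler code
        final Class<? extends Throwable> type;  // null stands for "any"

        Entry(int from, int to, int target, Class<? extends Throwable> type) {
            this.from = from;
            this.to = to;
            this.target = target;
            this.type = type;
        }
    }

    static final List<Entry> FOO_TABLE = Arrays.asList(
            new Entry(0, 4, 11, Exception.class),
            new Entry(0, 4, 23, null),
            new Entry(11, 16, 23, null),
            new Entry(23, 24, 23, null));

    // Returns the handler offset for an exception thrown at pc, or -1 if no
    // entry applies and the exception propagates to the caller.
    static int findHandler(int pc, Throwable thrown) {
        for (Entry e : FOO_TABLE) {
            boolean inRange = pc >= e.from && pc < e.to;
            boolean typeMatches = e.type == null || e.type.isInstance(thrown);
            if (inRange && typeMatches) {
                return e.target;
            }
        }
        return -1;
    }
}
```

With this table, an Exception thrown at offset 1 (the call to tryMethod) is routed to 11, an Error thrown at the same offset is routed to 23, and anything thrown from catchMethod at offset 13 also ends up at 23. An exception thrown by finallyMethod itself at offset 5 is not covered by any entry and simply propagates.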

Bottom line

I have covered just a few aspects of JVM bytecode. Most of the inspiration came from the developerWorks article by Peter Haggar - Java bytecode: Understanding bytecode makes you a better programmer. It must be the best article on the subject that I've managed to find. The article looks a bit outdated already, though it is still relevant. Surprisingly, the BCEL user's manual page has a decent description of bytecode fundamentals, so I'd suggest taking a look if you're interested. The Java Virtual Machine Specification might also be a useful source of information, but it is a little hard to read and it doesn't visualize the things described there, which would often be useful.
Overall, I think that understanding how bytecode works is essential for developers who are looking to deepen their proficiency in Java programming, especially if one looks into framework internals, tooling, or compilers for JVM languages.


References

Wikipedia: Java bytecode
Wikipedia: Java bytecode instruction listings
Java bytecode: Understanding bytecode makes you a better programmer
JVM Spec: Chapter 7. Compiling for the Java Virtual Machine
BCEL User Manual

23 comments:

Rama krishnan said...

It's a very great article...

danforce said...

10 points for article !!!!

kanu@adatapost said...

I can't remember the exact time, but some 8 years ago I read a book - Internet Java Applications written by Art Gittleman. In this book, he explained the "JVM" very well.

Here is link - http://www.cecs.csulb.edu/~artg/internet/toc.txt

Rémi Forax said...

Sections 3 and 7 of the ASM user guide are also a great resource.
http://download.forge.objectweb.org/asm/asm-guide.pdf

Rémi

ant said...

Thank you for such a clear explanation.

arhipov said...

@Rémi, yeah! the manuals for the bytecode manipulation libraries are almost the only comprehensive source for this subject. But other than that, very little information can be found.
Probably some scientific papers in ACM or IEEE?

samaaron said...

Interestingly I don't seem to be able to compile Foo.java verbatim:

> javac Foo.java
Foo.java:10: missing return statement
}
^
1 error

I'm using javac 1.6.0_22. Adding a return null statement fixed the issue, but should this be necessary?

Anton Arhipov said...

@samaaron that's because the setter method should have void return type. my mistake, sorry for that. thx for spotting this out, corrected now

samaaron said...

@anton

Haha, I should have spotted that myself. Clearly, my java-fu is extremely weak :-)

samaaron said...

Something else I'm unable to fathom. I compiled Foo.java and ran javap -c -s -verbose Foo. However, the mnemonic version I see differs from yours. Your constructor has the following signature:

Signature: (Ljava/lang/String;)V

Mine seems blank. Here's the whole of it:

public Foo();
Signature: ()V
Code:
Stack=1, Locals=1, Args_size=1
0: aload_0
1: invokespecial #1; //Method java/lang/Object."<init>":()V
4: return
LineNumberTable:
line 1: 0

Do you have any idea why this might be the case?

Anton Arhipov said...

@samaaron again, this is what happens when you try to post something late night and experiment too much :)
thx for re-checking this - I have copy-pasted another example indeed. (with other constructor). Corrected it now.

fromdev.com said...

Thanks a lot for compiling this info.
I didn't know of few tools/commands you used here.

Sandeep said...

Yes and people keep on asking where is the CAFEBABE. For those interested please see the hexcode in the post above. The first 2 bytes are reserved for this magic number. This helps the interpreter to distinguish class files from other binary file.

Anton Arhipov said...

@Sandeep it is 4 bytes, isn't it? :)

genocyber said...

Anton, would it be OK to publish a translation of your post on http://habrahabr.ru ?

Anton Arhipov said...

@genocyber of course you can! You even should! :)

Pertti said...

How about "Inside the JAVA 2 Virtual Machine" by Bill Venners ? It's like the Java VM Spec in cleartext.

Anton Arhipov said...

@Pertti absolutely!

JED said...

I have two books about the JVM that have a lot of information about the bytecodes.

"The Java Virtual Machine Specification" by Lindholm and Yellin, published by Sun; the second edition came out in 1999

"Java Virtual Machine" by Meyer and Downing, published by O'Reilly

The second one is older, 1997, but comes with a diskette (!!?) that has an "assembler" named Jasmin that lets you write directly in bytecodes

bill shelton said...

Thanks for the lucid thoughts. Very helpful.

-bill

Konstantin said...

Man! I just like it! Good article!

Oblivious! said...

Very cool article. Greatly appreciated! :)
Thanks!

GaneshBhuddhan said...

It's really a nice article. But initially, I wondered how you got the local variable table for the Example class displayed in the bytecode. Later I found out that we should include the -g option while compiling the program; otherwise this information will not be included in the class file.
