Java Memory Model From a Programmer's Point-of-View

A memory model describes, given a program and an execution trace of that program, whether the execution trace is a legal execution of the program.

1 INTRODUCTION

Programmers often find it difficult to learn the Java Memory Model (JMM for short) from either the Java Language Specification, Third Edition (JLSv3 for short)[11] or the Java Virtual Machine Specification, Second Edition (JVMSv2 for short)[12], for several reasons.

First, they may lack a basic understanding of out-of-order execution in modern CPUs and compilers.
Second, JLSv3 and JVMSv2 use many academic terms; informal explanations with detailed examples are often needed before going further.

Lastly, a programmer does not really need to know all of the JMM in order to write thread-safe programs.
This article tackles the JMM from these three angles.

Here is the definition of a memory model quoted from JLSv3: “A memory model describes, given a program and an execution trace of that program, whether the execution trace is a legal execution of the program. The Java programming language memory model works by examining each read in an execution trace and checking that the write observed by that read is valid according to certain rules.”

Don’t worry if you don’t get it for now. Hopefully you will understand it by the end of this article.

2  OUT-OF-ORDER IN CPUS AND COMPILERS

2.1  OUT-OF-ORDER EXECUTION IN CPUS

For demonstration purposes, and without loss of generality, we break a CPU instruction cycle down as follows:

  1. Fetch the instruction
  2. Decode the instruction which loads any required data from the main memory, among other things.
  3. Execute the instruction
  4. Write-Back the result which stores any generated result to the main memory, among other things.

(Different CPUs can have different cycles depending on their instruction sets. For example, the classic RISC cycle also has a Memory Access stage right after Execute, dedicated to main-memory loads and stores, and its Write-Back stage writes results into registers only[1]. Our simplified four stages have similar functions at a high level and are easier to reason about.)

The CPU can either perform all stages serially for one instruction at a time, or overlap the stages of multiple independent instructions, which improves overall instruction throughput.
In the latter case the CPU as a whole operates like an assembly line, with instructions coming in at one end and results coming out at the other. Hence we call it an instruction pipeline, which is best illustrated by Figure 1, taken from Wikipedia[1].

The horizontal axis from left to right in Figure 1 represents time, while the vertical axis from top to bottom represents the “chained” instruction stages. Figure 1 shows an ideal case in which an independent instruction enters the pipeline and goes through all the stages as the clock ticks. Overall there are always four instructions being processed concurrently.

Because a pipelined CPU issues instructions at the rate of its slowest stage, rather than at the rate of all stages together as in the non-pipelined case, the CPU frequency can be increased as far as the slow stages can be optimized.
Stages 2 and 4 involve main-memory loads and stores, respectively, and they are the slowest because modern CPUs are much faster than modern memory systems (the gap can be several orders of magnitude).

In order to reduce the latency in load and store operations, CPUs use registers, caches[4] (their latency is equal or very close to the clock cycle and map to the “working memory” mentioned in section 8.1 of JVMSv2) and buffers / queues (they facilitate asynchronous execution and it is why section 8.1 of JVMSv2 says “the transfer of data between the main memory and a thread's working memory is loosely coupled”).
Figure 1: A four-stage instruction pipeline (from Wikipedia[1])


Specifically, the decode module can have a dispatch queue where fetched instructions wait until their requested data has been loaded from main memory into cache, or until the instructions they depend on have finished (see the discussion of data dependency below). While some instructions are waiting (stalled), ready instructions are simultaneously decoded and pushed down the pipeline.

The Write-Back module puts the store request into a store buffer if the old data is not in cache yet (the cache controller stores and loads data one cache line at a time, and a cache line is usually larger than an individual memory access; this mechanism is similar to block access in disk I/O) and begins to process the next independent instruction. Once the old data has been brought into cache, or if it is already there, the instruction overwrites the cached copy with the new result. The new data is eventually flushed to main memory asynchronously, depending on the write policy[4] (for example when the data has to be evicted to make room for a new cache line, or together with other data in a batch).

When a programmer writes the code in List 1, he/she may assume that line 7 will be executed (completed) before line 8. This assumption is right in the ideal case of Figure 1. However, it can be wrong once you take CPU registers, caches and buffers into consideration.
For example, if field B is already in cache and A is not, B may be stored into main memory earlier than A. Even if both A and B are in cache, B can still be stored into main memory earlier than A if B’s cache line is evicted first. You can argue similarly for loads (for example, A is loaded from main memory before B) or for a combination of stores and loads (for example, A is loaded before B is stored).

Simply put, the order in which statements appear in the original code, as in method writer(), is called program order; the order in which individual memory references (loads or stores) complete is called execution order.
Because CPU caches, buffers and speculative execution (see below) add so much asynchrony to the time at which an instruction completes, the execution order is not necessarily the same as the program order, which is how out-of-order execution (reordering) happens in CPUs.

class ReadWriteExample {
  int A = 0;
  boolean B = false;

  //CPU1 (thread1) runs this method
  void writer() {
    A = 10;   //stores 10 to memory location A
    B = true; //stores true to memory location B
  }

  //CPU2 (thread2) runs this method
  void reader() {
    while (!B) continue; //loads data from memory location B
    // I do care about A and B store order in method writer()
    assert A == 10; //loads data from memory location A
  }
}
List 1

If your program is single-threaded, or fields A and B in method writer() are only accessed by one thread, you really don’t care about reordering, because the two stores in method writer() are independent and the program semantics are still maintained even if the two stores are reordered.
However, if your program is multi-threaded, you sometimes do care about execution order.
For example, CPU1 executes method writer() while CPU2 executes method reader().
(Because threads communicate through shared main memory, and cache memory is transparent to accesses thanks to the CPU cache coherence protocol, when we say “loads data from memory” we mean either “from main memory” if the data has never been loaded by any CPU, or “from another CPU’s cache” if that CPU owns the data, or “from its own cache” if it owns the data. Please read reference [6] for details.)

The assert in line 15 will fail if CPU1 executes method writer() out of order. Even if CPU1 executes method writer() in program order, line 15 can still fail because CPU2 can execute line 15 before line 13. You may object that line 15 logically shouldn’t be executed until line 13 is done (this is called a control dependency). However, CPU2 is free to execute line 15 speculatively first[3].
Basically, when a CPU sees a branch such as an if or while statement, it doesn’t know where to fetch the next instruction from until the branch instruction finishes. However, Figure 1 shows that CPU performance decreases if it waits for the branch instruction without being able to find enough independent instructions. So CPU2 may speculatively execute line 15 based on its prediction, and commit the execution later, when it can prove that the predicted path was right. In the reader() case, this means CPU2 found B == true in line 13 after it had already executed line 15, which is possible.

Because CPUs don’t know that you care about the order of A and B (see below for cases where they do honor ordering semantics), you must tell them the order using a so-called memory barrier, or fence, in order to enforce your ordering semantics (Java has no such construct; you must use a high-level synchronizer, which invokes it implicitly). Here is the revised version:

class ReadWriteExample {
  int A = 0;
  boolean B = false;

  //CPU1 (thread1) runs this method
  void writer () {
    A = 10; //stores 10 to memory location A
    membar; //pseudo memory barrier to make sure
            //line 7 is executed before the next one
    B = true; //stores true to memory location B
  }

  //CPU2 (thread2) runs this method
  void reader () {
    while (!B) continue; //loads from memory location B
    // I do care about the A and B store order in method writer()
    membar; //pseudo memory barrier to make sure
            //line 15 is executed before the next one
    assert A == 10; //loads from memory location A
  }
}
List 2

The membar in writer() only enforces the store order on CPU1; you still need to enforce the load order in reader(), which generalizes to the rule “membars must be used in pairs”.

Readers are strongly encouraged to read reference [6] for details on memory barriers. Reference [6] shows that both the popular x86 (32- and 64-bit) and SPARC TSO actually preserve the order of all memory references except that a STORE followed by a LOAD may be reordered (usually the LOAD does not semantically depend on the previous STORE). So the membars in List 2 are actually no-ops on those processors.
It also shows that all CPUs except Alpha honor data dependency. In List 3, the assignment to field B in line 2 will wait in the dispatch queue until line 1 has executed (completed):

A = 10;
B = A * 2; //B’s value depends on the previous value of A.
List 3

Lastly, no CPU will reorder a given operation with a store if both reference the same memory location (so they have a data dependency); otherwise your program semantics would be violated.
(A curious reader may ask what membars and data dependencies mean for the pipeline in Figure 1. They mean that the next dependent instruction has to wait for the result of the previous instruction, which puts stalls, or bubbles, into the pipeline. So the number of concurrent instructions in the pipeline will be less than the ideal four. Instruction scheduling by CPUs and compilers (see Section 2.2) can eliminate or reduce stalls.)

Based on the above analysis, we conclude that single-threaded programs run under the illusion of as-if-serial semantics. The effects of reordering are only observable in multi-threaded programs (in other words, reordering in one thread is only observable by, and only matters to, other threads). When the CPU doesn’t intrinsically know your intended ordering semantics, you must use a pair of membars (invoked implicitly by high-level Java synchronizers).

2.2  OUT-OF-ORDER EXECUTION IN COMPILERS

The compiler here is not the bytecode compiler; we mean the JIT compiler. A compiler is free to reorder your code, either physically or logically, as part of its optimizations as long as it doesn’t violate your program semantics. Modern compilers have many powerful code transformations. Here are a few examples.

2.2.1  Instruction Scheduling

Simply put it avoids or reduces pipeline stalls by clustering independent instructions together.

A = 10;
B = A + 10; //it has to wait for the previous statement to finish
C = 20;     //it has no dependency on its previous 2 statements
List 4

Suppose the compiler finds through complex analysis that A is not in cache yet and C is. Line 1 will then trigger a multi-cycle data load, while line 3 can be done in a single cycle. The compiler can physically move line 3 in between lines 1 and 2 to reduce the stalls by one. If the compiler can find more independent instructions, it can apply the same reordering to remove more stalls[8].

2.2.2  Avoid redundancy / Forward substitution

Reuse results that are already computed or loaded and store them for use later instead of re-computing or re-loading them.

//suppose A and B are some objects; r variables are all local
//thread1 runs this method
void foo() {
  r1 = A;
  r2 = r1.x;
  r3 = B;
  r4 = r3.x;
  r5 = r1.x;
  r6 = r1.y;
}

//thread2 runs this method
void bar() {
  r7   = A;
  r7.x = 3;
  r7.y = 5;
}
List 5

The compiler is allowed to reuse the value read into r2 in line 5 for r5 in line 8 (probably through a register), because both are reads of r1.x with no intervening write in foo(). So line 8 can be optimized to r5 = r2;
Suppose that initially A.x == 0 and A.y == 0, and that lines 15 and 16 in bar() execute between lines 5 and 8 in method foo(). Then r5 has the value 0 and r6 has the value 5, so from the point of view of foo() the two assignments in lines 15 and 16 appear to have been reordered.

Both of the above transformations maintain your code semantics. However, if you do care about the ordering in either case, you must use Java synchronizers (details in Section 3) to tell the compiler to stop reordering, which is similar to the use of membars for CPUs.

3  JAVA SYNCHRONIZER CONSTRUCTS

Section 2 shows that both CPUs and compilers add a great deal of uncertainty to code execution. Without a membar or Java synchronizer in List 1, you can imagine that there are many possible execution traces (ways in which the statements in writer() and reader() interleave). It usually doesn’t make sense to reason about the assert in reader() in this case, even though each such execution trace is still legal under the JMM.
Although multi-threading is so non-deterministic, you can still use a Java synchronizer (which implicitly invokes a membar) to establish your intended ordering, and predict that the assert in reader() will succeed.

3.1  SYNCHRONIZED STATEMENT (LOCK AND UNLOCK ACTIONS)

There is a lock action on the synchronized object’s monitor upon entry to a synchronized method or block, and correspondingly an unlock action upon exit from the synchronized method or block. From here on we use lock and unlock, rather than the synchronized statement, in our explanations.
The unlock action comprises a membar action (a no-op on processors where CAS implicitly acts as a membar) followed by a CAS (an atomic Compare And Swap on a flag, to indicate that the monitor lock has just been released). It also has an unpark function to unblock a thread waiting for the monitor.
The lock action comprises a CAS (an atomic Compare And Swap on a flag, to indicate that the monitor lock has been acquired) followed by a membar action (a no-op on processors where CAS implicitly acts as a membar). It also has a park function to block the calling thread if the monitor is not available.

In List 2 the programmer first stores a new value 10 to A and then sets the ready flag B to true in writer() so that reader() can read the new value. Lines 8 and 10 approximate the unlock action, while lines 15 and 17 approximate the lock action (line 15 actually approximates the spin lock that Sun’s HotSpot VM uses for contended synchronization[13]; otherwise the lock and unlock actions also have to call OS routines to park and unpark threads). Here is the new list using pseudo Java synchronization.

class ReadWriteExample {
  int A = 0;

  //thread1 runs this method
  void writer () {
    lock monitor1;   //a new value will be stored
    A = 10;          //stores 10 to memory location A
    unlock monitor1; //a new value is ready for reader to read
  }

  //thread2 runs this method
  void reader () {
    lock monitor1;  //a new value will be read
    assert A == 10; //loads from memory location A
    unlock monitor1;//a new value was just read
  }
}
List 6

Suppose writer() gets the monitor first. Because line 8 synchronizes with line 13, we say that line 8 happens-before (or executes-before, or completes-before) line 13, and an edge is established from line 8 to line 13.
We also know that line 7 happens-before line 8 (it is actually the membar in line 8 that ensures line 7 happens-before the CAS in line 8), and that line 13 happens-before line 14 (it is actually the membar in line 13 that ensures the CAS in line 13 happens-before line 14). By transitivity of the ordering, we conclude that line 7 happens-before line 14. This execution trace is legal under the JMM.
What if reader() gets the monitor first? In order to enforce your program semantics, you need to call wait() between lines 13 and 14 and notify() between lines 7 and 8. This is left to the reader as an exercise (hint: wait() is associated with both lock and unlock actions).
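For readers who want to check their answer, here is a minimal runnable sketch of List 6 written with real Java monitors. It is one possible arrangement, not the only one. The class name MonitorHandoff, the extra ready flag (used as the wait condition) and the demo() helper are our own assumptions for illustration and do not appear in List 6.

```java
class MonitorHandoff {
    private final Object monitor1 = new Object();
    private int a = 0;             // guarded by monitor1
    private boolean ready = false; // hypothetical wait condition, guarded by monitor1

    // thread1 runs this method
    void writer() {
        synchronized (monitor1) {  // lock
            a = 10;
            ready = true;
            monitor1.notifyAll();  // wake a reader that is already waiting
        }                          // unlock
    }

    // thread2 runs this method
    int reader() throws InterruptedException {
        synchronized (monitor1) {  // lock
            while (!ready) {       // loop guards against spurious wakeups
                monitor1.wait();   // unlocks while waiting, locks again before returning
            }
            return a;              // ordered after the writer's unlock
        }                          // unlock
    }

    // Runs both methods on separate threads and returns what the reader saw.
    static int demo() {
        MonitorHandoff h = new MonitorHandoff();
        int[] seen = new int[1];
        Thread r = new Thread(() -> {
            try {
                seen[0] = h.reader();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                seen[0] = -1;
            }
        });
        Thread w = new Thread(h::writer);
        r.start();
        w.start();
        try {
            r.join();
            w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Whichever thread gets the monitor first, the reader only returns after it has observed ready == true under the lock, so the value it returns is 10.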

So far we have argued using the happens-before relationship. Generally, if one action happens-before another, then the first is visible to and ordered before the second.
We can also argue using flush and invalidate, which turns out to be easier to understand. In particular, the unlock action in List 6 “conceptually” flushes all variables from its CPU cache to main memory, and the following lock action “conceptually” invalidates all variables in its CPU cache so that subsequent variable uses load from main memory. As you know from reference [6], this does not necessarily happen physically.

In List 6 you can produce other field values besides A before line 8 so that reader() can consume them after line 13. These field assignments are not ordered among themselves, but they are ordered with respect to line 8. Similarly, the uses of these fields are not ordered among themselves either, but they are ordered with respect to line 13. This is usually what you want. But if you want to enforce ordering among field variables, you have to use volatile variables.

3.2  VOLATILE FIELDS

Accesses (reads or writes) to volatile variables cannot be reordered with each other, nor can they be reordered with the normal field accesses around them. Here is a new version of List 2 using a volatile variable:

class ReadWriteExample {
  int A = 0;  //normal field
  volatile boolean B = false; //a data ready flag

  //thread1 runs this method
  void writer () {
    A = 10;   //stores 10 to memory location A
    B = true; //stores true to memory location B
  }

  //thread2 runs this method
  void reader () {
    //we use a while statement to spin CPU for demo purpose;
    // using wait and notify may be more efficient.
    while (!B) continue; //loads from memory location B
    assert A == 10; //loads from memory location A
  }
}
List 7

Line 8 synchronizes with line 15 (actually the subsequent read of B in line 15). Based on the above rules, line 8 happens-before line 15 and an edge was established from line 8 to 15. We also know that line 7 happens-before line 8 and line 15 happens-before line 16. From the ordering transitivity, we conclude that line 7 happens-before line 16. This execution trace is also legal to JMM.
You’ve probably already guessed that the volatile write in line 8 and the volatile read in line 15 both implicitly invoke a membar, directly or indirectly. You are probably right (some JVMs may implement volatile using an atomic CPU instruction that embeds a membar).
Volatile write has the same memory effect as monitor unlock and volatile read has the same memory effect as monitor lock. Please read reference [9] for more details.
Again you can argue using flush/invalidate mechanism.
Finally volatile variables obviously can’t be allocated in CPU registers, which should sound familiar to C/C++ developers.
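List 7 can be turned into a small runnable program as follows. This is only a sketch: the class name VolatileHandoff, the lower-case field names, the Thread.yield() call inside the spin loop and the demo() helper are our own assumptions for illustration.

```java
class VolatileHandoff {
    private int a = 0;                  // normal field
    private volatile boolean b = false; // data-ready flag

    // thread1 runs this method
    void writer() {
        a = 10;    // ordinary store
        b = true;  // volatile store: cannot be reordered before the store to a
    }

    // thread2 runs this method
    int reader() {
        while (!b) {         // volatile load: B is re-read on every iteration
            Thread.yield();  // be polite to other threads while spinning
        }
        return a;            // happens-after the store to a in writer()
    }

    // Runs both methods on separate threads and returns what the reader saw.
    static int demo() {
        VolatileHandoff h = new VolatileHandoff();
        int[] seen = new int[1];
        Thread r = new Thread(() -> seen[0] = h.reader());
        Thread w = new Thread(h::writer);
        r.start();
        w.start();
        try {
            r.join();
            w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }
}
```

Because the reader leaves the loop only after reading true from the volatile flag, the value it returns is 10 regardless of how the two threads are scheduled.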

3.3  FINAL FIELDS

An important guarantee on final field is quoted from Section 17.5 of JLSv3 “An object is considered to be completely initialized when its constructor finishes. A thread that can only see a reference to an object after that object has been completely initialized is guaranteed to see the correctly initialized values for that object's final fields.”
Here is the example from the same section 17.5 with additional comments:

class FinalFieldExample {
  final int x;
  int y;
  static FinalFieldExample f;

  public FinalFieldExample() {
    x = 3;
    y = 4;
    // f = this; //don’t do this
  }

  //one thread executes this method
  static void writer() {
    f = new FinalFieldExample();
  }

  //another thread executes this method
  static void reader() {
    if (f != null) {
      int i = f.x;   //guaranteed to see 3
      int j = f.y;   //could see 0
    }
  }
}
List 8


Because the compiler inserts a membar between lines 7 and 14, line 7 happens-before line 14, and reader() is guaranteed to see the properly initialized value of f.x if line 14 also happens-before line 19 (line 19 happens-before line 20 due to data dependency). This execution trace is also legal under the JMM.
If you uncomment line 9, the above guarantee no longer holds, because there is no guaranteed ordering between lines 7 and 9.

An important application thanks to the above guarantee is quoted from Section 17.5 of JLSv3 “Final fields allow programmers to implement thread-safe immutable objects without synchronization. A thread-safe immutable object is seen as immutable by all threads, even if a data race is used to pass references to the immutable object between threads. This can provide safety guarantees against misuse of an immutable class by incorrect or malicious code. Final fields must be used correctly to provide a guarantee of immutability.”

If we remove field variable y from List 8, object FinalFieldExample becomes immutable. However without the guarantee on final fields, line 20 can see the default value 0 before seeing value 3, which means object FinalFieldExample is mutable to the reader() thread due to data races.
On the internet, you can find many examples regarding immutable object’s safety guarantees for multi-threaded programs.
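As one such example, here is a minimal sketch of an immutable class whose reference is deliberately passed between threads through a data race. The names ImmutablePoint and RacyPublication and the demo() helper are hypothetical, chosen only for illustration.

```java
final class ImmutablePoint {
    private final int x;
    private final int y;

    ImmutablePoint(int x, int y) {
        this.x = x;
        this.y = y;
        // 'this' must not escape from the constructor,
        // or the final-field guarantee is lost
    }

    int x() { return x; }
    int y() { return y; }
}

class RacyPublication {
    // Deliberately neither volatile nor synchronized: publication is a data race.
    static ImmutablePoint shared;

    // Returns false only if the reader saw a partially initialized point.
    static boolean demo() {
        boolean[] ok = { true };
        Thread writer = new Thread(() -> shared = new ImmutablePoint(3, 4));
        Thread reader = new Thread(() -> {
            ImmutablePoint p = shared; // may be null: no ordering with the writer
            if (p != null) {
                // final fields: a non-null reference implies x == 3 and y == 4
                ok[0] = (p.x() == 3 && p.y() == 4);
            }
        });
        writer.start();
        reader.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return ok[0];
    }
}
```

The reader may or may not see the reference, since nothing orders it with the writer, but if it does see it, the final fields hold their constructed values. If x and y were not final, the JMM would allow the reader to see the default value 0.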

3.4  OTHERS

Here are other synchronizer constructs that are easier to understand and accordingly we will not elaborate:

  • Each action in a thread happens-before every action in that thread that comes later in the program's order.
  • A call to start on a thread happens-before any action in the started thread.
  • The write of the default value (zero, false or null) to each variable happens-before the first action in every thread. Conceptually every object is created at the start of the program with its default initialized values.
  • All actions in a thread T1 happen-before any action in another thread T2 that detects that T1 has terminated. T2 may accomplish this by calling T1.isAlive() or T1.join().
  • If thread T1 interrupts thread T2, the interrupt by T1 happens-before any point where any other thread (including T2) determines that T2 has been interrupted (by having an InterruptedException thrown or by invoking Thread.interrupted or Thread.isInterrupted).
  • The completion of an object’s constructor happens-before the execution of its finalizer method.
  • More synchronizers and happens-before relationships are introduced in java.util.concurrent package. Please read the package’s javadoc description for more details.
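The start and termination rules in the list above can be sketched with plain, non-volatile data. The class name StartJoinDemo and the values used are our own assumptions for illustration.

```java
class StartJoinDemo {
    static int run() {
        int[] data = new int[2]; // plain storage: no volatile, no locks
        data[0] = 1;             // written before start()

        // start() happens-before every action in the started thread,
        // so the new thread is guaranteed to read data[0] == 1.
        Thread t = new Thread(() -> data[1] = data[0] + 41);
        t.start();

        try {
            // All actions in t happen-before join() returns,
            // so the write to data[1] is visible afterwards.
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return data[1];
    }
}
```

No volatile field or lock is needed here: the two happens-before edges provided by start() and join() are enough to make this small program correctly synchronized.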


Finally, we say happens-before is a partial order over all actions in a program’s execution trace, because CPUs and compilers can freely execute action A before action B or vice versa, or execute both concurrently (they have no ordering relationship), if they are not constrained by any combination of the above happens-before rules. This freedom is the key to good performance, as our analysis in Section 2 showed.
We can think of it another way: a thread is free to run in isolation, without caring about any other thread, as long as it does not encounter a Java synchronizer; otherwise it is obliged to make its effects visible to other threads through Java synchronizers.

4  SAFE MULTI-THREADING

Although JMM is quite complex for implementation and the underlying reordering is difficult to understand, a programmer only needs to ensure his/her program is correctly synchronized for safe multi-threading. He/she doesn’t need to worry that reordering will affect the code.
This is a strong guarantee for programmers from Java’s sound memory model.

4.1  DATA RACE AND CORRECTLY SYNCHRONIZED

Before we explain how to synchronize our programs correctly, we first borrow two concepts from JLSv3: conflicting and data race.
Two accesses to (reads of or writes to) the same variable are said to be conflicting if at least one of the accesses is a write.
When a program contains two conflicting accesses that are not ordered by a happens-before relationship, it is said to contain a data race.

A program is correctly synchronized if and only if all its sequential execution traces (program orders) are free of data races (the authors disagree with the term “sequentially consistent executions” used by JLSv3, because that term already rules out data races according to JLSv3’s own definition of sequential consistency).
Incorrectly synchronized programs often exhibit counterintuitive behaviors, as shown in many examples in the JLS and JVMS. It usually doesn’t make sense to reason about an incorrectly synchronized program.

In List 1, there are both a reader and a writer for fields A and B, so the reader and writer conflict on both A and B. Because there is no Java synchronizer to establish any happens-before relationship, every execution trace of the program has a data race on both A and B. Accordingly, the program is incorrectly synchronized.
Its new version in List 6 has two sequential execution traces: one is writer() followed by reader(); the other is reader() followed by writer(). Because of the synchronized statement (lock and unlock actions), neither execution trace has a data race on field A. Accordingly, the program is correctly synchronized.
List 7 is also correctly synchronized. Its analysis is left to the reader.
Another very good example was discussed in section 17.4.5 of JLSv3. Readers are strongly recommended to read this example.

4.2  SYNCHRONIZE ON BOTH READ AND WRITE ACCESSES

Some developers naively think they only need to synchronize on write because the synchronization will flush the new value to main memory so that the later read will find it.
Going back to List 6, a developer may want to remove the synchronization in lines 13 and 15 of method reader(). But then line 7 is no longer guaranteed to happen-before line 14, based on our analysis in Section 3.1.
You can even use the flush and invalidate for reasoning: Although the synchronization on the writer() thread flushes the new value to main memory, the reader() thread may still read its old cached value without synchronization.

Based on the definitions in Section 4.1, if you only synchronize on write accesses there is a data race and the program is not correctly synchronized.

4.3  ATOMIC OPERATIONS

The JMM guarantees that reads and writes of references (not the referenced objects) and of all Java primitive types except long and double are atomic. A single write to a non-volatile long or double value is not guaranteed to be atomic (a JVM may treat it as two separate 32-bit writes), but writes and reads of volatile long and double values are always atomic.
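The guarantee for volatile long values can be illustrated with a small sketch. The class and method names below are assumptions made for this example. Two writer threads store either all-zero or all-one bits, and a reader checks that it never sees a mixture of the two halves.

```java
// Hypothetical example class; all names are illustrative.
class LongTearingDemo {
    // volatile makes every read and write of this 64-bit value atomic.
    private static volatile long value = 0L;

    // Returns true if the reader ever observed a value that is neither
    // 0L (all bits clear) nor -1L (all bits set), i.e. a torn value.
    static boolean sawTornValue(int iterations) {
        Thread zeros = new Thread(() -> {
            for (int i = 0; i < iterations; i++) value = 0L;
        });
        Thread ones = new Thread(() -> {
            for (int i = 0; i < iterations; i++) value = -1L;
        });
        zeros.start();
        ones.start();

        boolean torn = false;
        for (int i = 0; i < iterations; i++) {
            long v = value; // atomic 64-bit volatile read
            if (v != 0L && v != -1L) torn = true;
        }
        try {
            zeros.join();
            ones.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return torn;
    }

    public static void main(String[] args) {
        System.out.println("torn value seen: " + sawTornValue(1_000_000));
    }
}
```

Without the volatile modifier the JMM would permit a torn value here; whether one actually shows up depends on the JVM and the hardware.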

Atomic operations on primitive types sometimes lead programmers to wrong assumptions, because they think the only purpose of a Java synchronizer is to make operations atomic so that shared data structures are not corrupted. From our analyses, you know they are missing its other important purpose: the ordering guarantee.
Going back to List 7, some programmers want to remove the volatile keyword because both fields A and B are of primitive types and all accesses to them are atomic. However, without volatile you already know that line 7 can’t be guaranteed to happen-before line 16. More specifically, the compiler is free to read B just once into a CPU register and reuse the cached value repeatedly in line 15. This would mean that the loop in line 15 never terminates, even if another thread sets B to true in method writer().
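The following sketch shows the pattern just described. It is not a copy of List 7: the class name, the lower-case field names and the demo() helper are assumptions made for this example. The volatile flag forces the loop to re-read it on every iteration, and the volatile write/read pair orders the plain write to a before the plain read of a.

```java
// Hypothetical example class; all names are illustrative.
class VolatileFlagDemo {
    private int a = 0;                  // plain field, published through the flag
    private volatile boolean b = false; // volatile flag

    void writer() {
        a = 1;    // ordinary write
        b = true; // volatile write: happens-before any read that sees true
    }

    int reader() {
        while (!b) {       // volatile read on every iteration, cannot be hoisted
            Thread.yield();
        }
        return a;          // ordered after the write a = 1
    }

    // Runs writer() on another thread and returns what reader() observes.
    static int demo() {
        VolatileFlagDemo d = new VolatileFlagDemo();
        Thread t = new Thread(d::writer);
        t.start();
        return d.reader();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the volatile modifier removed from b, the loop in reader() might spin forever, and even if it terminated the returned value of a would not be guaranteed to be 1.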

4.4  DOUBLE-CHECKED LOCKING

Many programmers have seen the following double-checked locking idiom:

class SingletonDemo {
    private int f; //instance field
    private static SingletonDemo instance = null;

    private SingletonDemo () {
        this.f = 10;
    }

    public static SingletonDemo getInstance() {
        if (instance == null) { //1st check for lazy initialization
            synchronized (SingletonDemo.class) {
                if (instance == null) { //2nd check for lazy initialization
                    instance = new SingletonDemo ();
                }
            }
        }
        return instance;
    }
}

List 9

Line 13 triggers two writes: one initializes the SingletonDemo object (the private constructor at line 5 is called); the other writes the reference to the class variable declared in line 3. Because the CPU or the compiler can reorder the two writes, a thread that sees a non-null reference at the first check may get a partially constructed SingletonDemo object.
There are at least three solutions. The first is to use the volatile keyword in line 3, which ensures that the initialization of the SingletonDemo object happens-before the reference write to the class variable in line 13.
The second is to use the final keyword in line 2. This is left to the reader for further analysis.
The third is to rely on the JVM’s synchronized initialization of static fields, as follows. Readers are encouraged to read section 12.4.2 of JLSv3 for the JVM’s detailed thread-safe initialization procedure.

class SingletonDemo {
    private int f; //instance field
    //JVM guarantees line 4 is thread safe and happens-before line 10
    private static SingletonDemo instance = new SingletonDemo();
    private SingletonDemo () {
        this.f = 10;
    }

    public static SingletonDemo getInstance() {
        return instance;
    }
}

List 10
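For comparison, the first solution (the volatile keyword) can be sketched as below. The class is renamed VolatileSingleton to keep it apart from the listings. The local variable, the getF() accessor and the allThreadsAgree() helper are additions made for this example and are not part of the original idiom.

```java
// Hypothetical variant of the double-checked locking listing, using volatile.
class VolatileSingleton {
    private int f; // instance field
    // volatile: the constructor's writes happen-before the reference write
    private static volatile VolatileSingleton instance = null;

    private VolatileSingleton() {
        this.f = 10;
    }

    static VolatileSingleton getInstance() {
        VolatileSingleton local = instance;   // one volatile read on the fast path
        if (local == null) {                  // 1st check
            synchronized (VolatileSingleton.class) {
                local = instance;
                if (local == null) {          // 2nd check
                    local = new VolatileSingleton();
                    instance = local;         // volatile write publishes the object
                }
            }
        }
        return local;
    }

    int getF() {
        return f;
    }

    // Calls getInstance() from several threads and checks that all of them
    // received the same, fully constructed object.
    static boolean allThreadsAgree(int threadCount) {
        VolatileSingleton[] seen = new VolatileSingleton[threadCount];
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            final int slot = i;
            threads[i] = new Thread(() -> seen[slot] = getInstance());
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        for (VolatileSingleton s : seen) {
            if (s != seen[0] || s.getF() != 10) return false;
        }
        return true;
    }
}
```

Compared with the eager static initialization shown above, this keeps the lazy construction at the cost of one volatile read per call.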

4.5  CAUSALITY

This is a hard one. The example below is the one from section 17.4.8 of JLSv3, with more comments added:

//initially both field x and y are zeros
//Thread 1
r1 = x;
if (r1 != 0) y = 1;

//Thread 2
r2 = y;
if (r2 != 0) x = 1;

List 11

The code in List 11 is correctly synchronized because neither of the writes in lines 4 and 8 occurs (so there are no data races) if the code is executed in a sequential program order.
Since the code is correctly synchronized, the only behaviors we can allow are sequentially consistent behaviors. However, under the happens-before or release consistency memory model[5], both the compiler and the CPU are allowed to speculatively execute the write in line 4 or 8 before line 3 or 7, respectively, as long as they can later confirm that the condition of the if statement is true.
Suppose thread 2 speculatively executes x = 1 first. Here is the execution trace:

x = 1;  //speculatively executed by thread 2
r1 = x; //thread 1 sees the write of x = 1 by thread 2
y = 1;  //executed by thread 1; caused by line 2
r2 = y; //thread 2 sees the write of y = 1 by thread 1, which justifies line 1

List 12

Because line 3 can cause line 4 to execute (if x’s value is 1) and line 7 can cause line 8 to execute (if y’s value is 1) in List 11, the above execution trace forms a circular causality. The JMM forbids it.

We refer to the issue of when reads (line 2 in List 12) can see future writes (line 1 in List 12) as causality. It sometimes leads to unacceptable out-of-thin-air behavior like the one above.
Here is the good news. Because the code is correctly synchronized, the JVM ensures its sequentially consistent behavior without the programmer having to do anything. Please read section 17.4.8 of JLSv3 if you really want to know how the JVM does so.
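The two threads of List 11 can be turned into a small runnable sketch. The class name and the runOnce() helper are assumptions made for this example. Because the program is correctly synchronized, the only result the JMM allows is r1 == 0 and r2 == 0, however often it is run.

```java
// Hypothetical harness around the two threads of List 11.
class CausalityDemo {
    static int x, y; // shared, deliberately neither volatile nor locked

    // Runs both threads once and returns {r1, r2}.
    static int[] runOnce() {
        x = 0;
        y = 0;
        int[] r = new int[2];
        Thread t1 = new Thread(() -> {
            int r1 = x;
            if (r1 != 0) y = 1;
            r[0] = r1;
        });
        Thread t2 = new Thread(() -> {
            int r2 = y;
            if (r2 != 0) x = 1;
            r[1] = r2;
        });
        t1.start(); // start() orders the resets above before the threads' reads
        t2.start();
        try {
            t1.join(); // join() makes the threads' writes to r visible here
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return r;
    }

    public static void main(String[] args) {
        int[] r = runOnce();
        System.out.println("r1 = " + r[0] + ", r2 = " + r[1]);
    }
}
```

A result such as r1 == 1 and r2 == 1 would be exactly the out-of-thin-air outcome of List 12.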

5  LIGHT-WEIGHT TASK FRAMEWORKS BEYOND JAVA HEAVY THREAD

A reader may have noticed two contradictory directions so far. On one hand, we allow CPUs and compilers to reorder code as much as possible in order to achieve higher instruction-level parallelism; on the other hand, a programmer must use Java synchronizers to prevent the CPU or compiler from reordering conflicting data accesses, which lowers instruction-level parallelism. The more Java synchronizers a program uses, the more its performance degrades.
Also, as more and more threads are created to handle requested tasks, the cost of thread lifecycle management and context switches (because Java threads map one-to-one to native OS threads, these actions involve system calls) can be greater than the computation time of the tasks themselves.

So a light-weight task framework built on top of heavyweight Java threads is often preferred. The Java executor framework is the first step forward. It allows users to submit fine-grained tasks to a pool of reusable threads, which largely removes the thread lifecycle overhead.
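A minimal sketch of submitting fine-grained tasks to an executor is shown below. The class name, the pool size of 4 and the sum-of-squares workload are arbitrary choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example class; all names are illustrative.
class ExecutorDemo {
    // Computes 1^2 + 2^2 + ... + n^2, submitting one small task per term.
    static long sumOfSquares(int n) {
        // Four worker threads are created once and reused for all n tasks.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 1; i <= n; i++) {
                final long k = i;
                Callable<Long> task = () -> k * k;
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get(); // blocks until that task has finished
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown(); // lets the worker threads exit
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(10));
    }
}
```

Future.get() also gives the submitting thread a happens-before edge from the task's actions, so no extra synchronization is needed to read the results.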
The upcoming release of Java SE 7 will include a parallel programming model called Fork-Join[15]. It is built on the executor framework. Its lightweight task scheduling mechanism, based on the work-stealing technique from Cilk, greatly reduces contention on Java synchronizers and makes even finer-grained task submission practical through recursive divide-and-conquer.
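The recursive divide-and-conquer style can be sketched as follows, assuming the ForkJoinPool and RecursiveTask classes as they appear in java.util.concurrent. The RangeSum class, its threshold of 1,000 and the range-sum workload are made up for this example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical task that sums the integers in [from, to).
class RangeSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // illustrative cutoff, not tuned
    private final long from; // inclusive
    private final long to;   // exclusive

    RangeSum(long from, long to) {
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            // Small enough: sum sequentially.
            long sum = 0;
            for (long i = from; i < to; i++) sum += i;
            return sum;
        }
        // Otherwise split in two halves.
        long mid = from + (to - from) / 2;
        RangeSum left = new RangeSum(from, mid);
        RangeSum right = new RangeSum(mid, to);
        left.fork();                     // queue the left half; idle workers may steal it
        long rightSum = right.compute(); // work on the right half in this thread
        return left.join() + rightSum;
    }

    // Convenience entry point that runs the task in its own pool.
    static long sum(long from, long to) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new RangeSum(from, to));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(sum(0, 1_000_001));
    }
}
```

Each worker thread keeps its own task queue and only steals from others when it runs dry, so most fork() and join() calls avoid contended locks.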

All multi-threading frameworks we’ve talked about so far are based on shared-memory architectures such as SMP, which are generally prone to data races and even deadlocks. Writing data-race-free (correctly synchronized) code is often difficult for programmers at large. On the hardware side, shared-memory architectures scale linearly only up to a small number of processors (usually 32) because of the overhead of their cache coherence protocols.
The Actor model can overcome the above shortcomings because it is a natural fit for distributed computing. Please refer to [16] for an introduction.

6  RESOURCES

The following five references give you an introduction to out-of-order execution on modern CPUs:
[1] Instruction Pipeline by Wikipedia
[2] Out-of-Order Execution by Wikipedia
[3] Speculative Execution by Wikipedia
[4] CPU Cache by Wikipedia
[5] Release Consistency by Wikipedia
This reference tells you how mainstream processors handle out-of-order memory references:
[6] Memory Barriers: a Hardware View for Software Hackers by Paul E. McKenney
This reference tells you about current memory model development in Java and other mainstream languages:
[7] Memory Models: A Case for Rethinking Parallel Languages and Hardware by Sarita V. Adve et al.
This reference tells you about an important compiler reordering mechanism:
[8] Instruction Scheduling by Wikipedia
The following four references tell you everything about the Java memory model (JMM), JSR-133:
[9] The JSR-133 Cookbook by Doug Lea
[10] The Java Memory Model by William Pugh
[11] Chapter 17 “Threads and Locks” in the Java Language Specification, Third Edition
[12] Chapter 8 “Threads and Locks” in the Java Virtual Machine Specification, Second Edition
This reference introduces some of Sun’s current JIT compiler optimizations:
[13] The Java HotSpot Performance Engine Architecture
The following two references introduce the Java thread executor and fork-join frameworks:
[14] Concurrency JSR-166 Interest Site
[15] A Java Fork/Join Framework by Doug Lea
This reference introduces distributed computing with the actor model:
[16] Actor Model by Wikipedia

ABOUT THE AUTHORS

Yongjun Jiao has been a professional software developer for the past 10 years. His expertise covers Java SE, Java EE, Oracle, application tuning and high performance and low latency computing. He is a technical manager with SunGard Consulting Services.
Lili Zhou has been conducting academic research using C/C++ and Matlab in cluster environments for over 5 years. Her expertise is medical image processing. She holds a PhD degree from the Department of Electrical and Computer Engineering, Stony Brook University.
