{{announcement.body}}
{{announcement.title}}

DZone 's Guide to

### This deep dive into latches, CyclicBarriers, and phasers uses a stencil algorithm to demonstrate how all of them work and how to reduce synchronization contention.

· Java Zone ·
Free Resource

Comment (1)

Save
{{ articles.views | formatCount}} Views

I recently came across Java phasers while reviewing different constructs for inter-thread communication. Phasers are associated to the concepts of fuzzy barriers and point-to-point synchronization.

As for this post, I will borrow the example of the one-dimensional stencil to explain very intuitively the motivation behind the concept of phasers.

The code corresponding to all the examples in this post can be found on this repository.

## One-Dimensional Stencil

In order to justify the need for phasers, we will start by demonstrating the use of other synchronization constructs to solve the computational problem of the one-dimensional stencil.

Given a one-dimensional array A with N elements, the value of each element is calculated according to this function: In other words, the value of the first and last elements is fixed (these values do not need to be the same though), whereas the value of each of the remaining N-2 elements is the average of the value of its neighbors. This is a recursive process: As the value of its neighbors change, each element needs to re-compute its own value. Ideally, after a few iterations, the value of all the elements should converge and the recursive iterations stop.

Just a brief digression for those curious about physics: The aforementioned Wikipedia link says that “stencil codes are most commonly found in the codes of computer simulations.” No wonder! You can picture the one-dimensional stencil as a metal bar connected to a heat source by its sides. The temperature of the sides remains constant whereas the temperature along the bar changes over time until reaching thermal equilibrium.

To get some intuition about the stencil problem, let’s work through an example. To make it simple, consider an array of length 4 whose first and last elements have a fixed value 1. The remaining elements have an initial value of 2. Therefore, the initial array is [1,2,2,1]. After the first iteration, the array becomes [1,1.5,1.5,1]. The following diagram shows how the values change on each iteration and how, eventually, the values converge to 1. Stencil algorithm

Implementation detail: In order to avoid creating a new array on each iteration, the implementation of the algorithm can use just two arrays: one containing the values at the beginning of the iteration and another containing the resulting values at the end. Then, for the next iteration, the arrays are flipped. This diagram illustrates the process: Stencil algorithm simplified

And the snippet below is the sequential implementation of the algorithm (the code snippets presented in this post are written in Scala, although any Java developer can easily read them):

/**
* Sequential version
* @param iter Number of iterations
* @param arr Array with fixed boundary values. The other values are calculated as the average value of their neighbours
* @return Array with the final values after completing all iterations
*/
def seq(iter: Int, arr: Array[Double]): Array[Double] = {
//defensive copy to keep arr from being mutated outside this method
var myVal = arr.clone()
var myNew = arr.clone()

for(j <- (1 to iter)) {
for (i <- (1 until arr.length-1)) {
myNew(i) = (myVal(i - 1) + myVal(i + 1)) / 2.0
}
val temp = myNew
myNew = myVal
myVal = temp
}
myVal
}

## Parallel Implementations

Working on top of the sequential solution, this section describes other approaches to solve the problem using different parallel-programming constructs: Latch, Cyclic Barrier, and Phasers.

### Latch

According to CountDownLatch, a latch is “A synchronization aid that allows one or more threads to wait until a set of operations being performed in other threads completes.”

The solution to the stencil problem with latches involves the following steps:

• Create a new latch for each iteration with a count equal to N-2.
• Create N-2 tasks to be executed in parallel by threads of a dedicated execution context while the main thread waits until all tasks are completed; each task calculates the value of one element of the array.
• Once a task is finished, the last action of the executing thread is to let the main thread know that the task is completed.
• When all tasks are terminated, the main thread resumes its activity by flipping the old and new arrays and starting a new iteration.

It is worth noting that:

• The communication through the latch happens between each of the threads of the dedicated execution context and the main thread.
• The latch represents a barrier to the execution of the main thread only.

The following diagram represents 2 iterations of the process just described: Latch

### Cyclic Barrier

From the definition of CyclicBarrier, a cyclic barrier is “A synchronization aid that allows a set of threads to all wait for each other to reach a common barrier point.”

Here are the main characteristics of this solution:

• Each of the N-2 tasks calculates the value of one element of the array from beginning to end — that is, through all the iterations.
• These tasks are not independent, as the values of the neighbors that each task needs to calculate its own value are calculated by other tasks.
• Therefore, before moving onto a new iteration, each task needs to wait for all others to complete.

This problem seems a good match for a cyclic barrier and comprises the following steps:

• Create a cyclic barrier with N-2 parties.
• Create N-2 tasks to be executed in parallel by threads of a dedicated execution context. Each task calculates the value of one element of the array through all iterations.
• At the end of each iteration, each thread waits at the barrier until all other threads reach it.
• Finally, the arrays are flipped and a new iteration begins.

The main features of this algorithm are:

• The communication through the barrier happens among all threads of the dedicated execution context.
• The cyclic barrier represents a common barrier for all threads.

The following diagram represents two iterations of the process: Cyclic barrier

### Phasers

Phasers allow you to reduce the synchronization contention cost when there is a large number of parties. In order to understand this, let’s revisit the cyclic barrier example: We said that “before moving on to a new iteration, each task needs to wait for all others to complete,” and as a consequence, there is a common barrier for all threads.

Well, the above statement is not completely true. Strictly speaking, each task needs to wait only for the adjacent tasks that compute the value of its neighbors. Therefore, the situation is better described by the following diagram, where each thread shares a barrier with the adjacent ones. This fine-grained synchronization is achieved through the use of phasers. The implementation details can be found in the source code. Phasers

### A Final Twist: Fuzzy Barriers

When dealing with cycling barriers and phasers, it was mentioned that the number of threads needed is N-2. For large arrays (think of millions of elements), this is not feasible. You will get the dreaded

An exception or error caused a run to abort: unable to create new native thread
java.lang.OutOfMemoryError: unable to create new native thread
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1357)

And even if it were possible to create so many threads, the operating system would struggle to handle them all with a limited number of CPUs.

In order to get around this issue, the elements of the array can be grouped and then create a task for each group. Each task will calculate the values of all the elements of that group in a sequential fashion, and the creation and synchronization of barriers will happen at the group level.

As discussed in the previous section, the different approaches could be:

• CyclicBarrier: In order to progress to the next phase, all the elements of each group must be calculated.
• Phasers: To move to the next phase, a thread only needs to wait for the elements of the adjacent groups to be calculated

Let’s take a closer look at the last point. In reality, a thread only needs to wait for the first/last element of the adjacent groups to be calculated. Therefore, each thread can:

• First, calculate the first and last elements of its group.
• Signal to the other threads that the first/last elements are calculated (this is interpreted as the thread reaching the first point of the synchronization barrier) and, therefore, those other threads can progress to the next phase (if they are ready to do so).
• Calculate the internal elements of its group (which are not depended upon by the other threads).
• Wait at the second point of the barrier for the threads of the adjacent groups to reach the first point of the barrier

This is called Phasers with split-phase barriers or fuzzy barriers. This capability to perform local work within the barrier is another main characteristic of phasers, together with the capability to make fine-grained synchronization possible.

The following diagram illustrates the idea of fuzzy barriers. The diagram represents a 12-element array divided in 3 groups of 4 elements. To keep things simple, we will focus on the barrier shared by the first and second groups. Fuzzy barrier

Each thread Tx calculates the value of all elements in that group, starting with the values on the sides.

For instance, T1 calculates A3, reaches PA, and signals it out to the other threads. Then, it proceeds to calculate the rest of the elements in the group and, when it is done, waits on PB until T2 reaches PA.

When T2 has calculated A4 and A7, it reaches PA and emits the corresponding signal. At this point, T1 can progress to phase 2 while T2 completes phase 1 by calculating the rest of the elements of its group, and so on.

## Conclusion

The journey from the latch solution to phasers has led us to reduce the synchronization contention costs by means of:

1. Changing from CountDownLatch to CyclicBarrier to swap the iterative actions of the algorithm: for each iteration { for each array element { ... } } -> for each array element { for each iteration { ... } }
2. Changing from CyclicBarrier to phasers to break up the large common synchronization barrier into small individual ones
3. Grouping elements into groups that are processed in parallel (while the content of each group is processed sequentially): for each group {for each iteration { for each array element { ... } }}
4. Introducing fuzzy barriers to perform local work inside the barrier, therefore further reducing the contention

The code for all the examples discussed in this post can be found on this repository.

Topics:
fuzzy barriers ,java ,java phasers ,stencil algorithm ,synchronization ,tutorial

Comment (1)

Save
{{ articles.views | formatCount}} Views

Published at DZone with permission of Francisco Alvarez , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.