Over a million developers have joined DZone.

You + Java = Hardware-Aware Programming But Not Hardware-Dependent

·

What’s lying behind a running program? I would like to say the answer is you + OS + hardware. That’s why this is titled as it is. The core of cores of concept that I am going to deliver here is this, no matter what programming language you are using, ultimately it all turns into the target CPU recognizable instructions, addresses and data, in this sense you better know how the CPU recognizable stuff of your codes will go through the CPU pipeline if you want to pursue extreme performance. You don’t necessarily have to be eligible for JVM professionalism and hardware professionalism, but being aware of how they work is definitely beneficial for aligning your codes with the way hardware works to fit into and adapt to proactively. Imagine how could a person who is a car driver with no knowledge about how his car’s working be an excellent driver? As a software programmer, in analogy, it’s same.

I am a Java guy and working on an instant messaging project that is intended to fulfill high concurrency, high throughput and high scalability, besides streaming. In researching related technologies, I was drawn to OS and hardware closer and closer. This drive me to long for having a full picture of how operating system and hardware work on earth as I keep questioning myself if there is more way to make it better and better. After all, it’s undeniable that when a program is running there, the OS and related hardware must be doing something, so just cannot be ignored. My searching track on this project is focusing on performance all the time, in the process my learning is hopping among Java, JVM, OS and hardware. Eventually I wrap up all things in one figure as below.

Image title

This figure roughly illustrates the concept of distance between data and CPU where T# denotes the time CPU needs to take to fetch data from the location where T# is located. Apparently the components sitting in the same row don’t require same amount of time for CPU to fetch the data they are holding. Presumably a lot of programmers, like me before, never seriously think about actually what’s going on behind codes at run time in day-to-day coding. Part of the reason, I think, is normally they don’t care in what environment their codes are going to run. Now I would like to advise to be aware of OS and hardware, because they do matter tremendously with program’s performance at many levels. Hopefully, this figure like a map can guide you to where you may need to look to in your case.

Java is known as highly abstract programming language and basically hardware-ignorant, that’s true if you say it doesn’t give an explicit way to get control over memory or shoot some magic CPU instructions directly when you want to apart from JNI and JNA. But, that doesn’t mean that Java guys have no headroom to make their program perform better. When you hook up Java, JVM, OS and hardware all together in your mind in coding, you definitely got a chance to make your codes perform better even a few orders of magnitude. You can check your code element by element shown in the figure to see if you have already squeezed every little bit of them to take full advantage of them.

Modern CPUs are far faster than main memory. Both software people and hardware people are fighting this. Interestingly, it’s technically feasible but economically unfeasible to bridge the gap in manufacturing. This article is going to only focus on T1 with some real codes to demonstrate or spark some questions. All codes are running on this:

  • Windows 8.1 64bit
  • 4 x AMD A10-7300 Radeon R6, 10 Compute Cores + 6G
  • 16G DDR3
  • L1 Cache 192KB
  • L2 Cache 4096KB
  • JDK 8

I really love to borrow some basic numbers below before diving into tests to give you a sense of overall.

  • Register 0 cycles
  • L1 Cache 3 cycles
  • L2 Cache 9 cycles
  • L3 cache 21 cycles
  • Main Memory 150-400 cycles


1. Test on One Dimensional Array

Iterate an int array to see how the write speed is different between Forward, Backward, Random, Stride four cases as adjusting the size of array and the size of stride with respect to 64 bytes of cache line.

import java.util.concurrent.ThreadLocalRandom;

public class CacheTestA {
	private static void test(int arraySize, int strideLength){
		ThreadLocalRandom tlr=ThreadLocalRandom.current();
		int[] d=new int [arraySize];
		long total=0;
		for(int i=0;i<arraySize;i++){
			long s=System.nanoTime();
			d[i]=i;
			long e=System.nanoTime();
			total+=e-s;
		}
		int forward=(int)(total/arraySize);
		
		total=0;
		for(int i=arraySize-1;i>=0;i--){
			long s=System.nanoTime();
			d[i]=i;
			long e=System.nanoTime();
			total+=e-s;
		}
		int backward=(int)(total/arraySize);
		
		total=0;
		for(int i=0;i<arraySize;i++){
			int ind=tlr.nextInt(arraySize);
			long s=System.nanoTime();
			d[ind]=i;
			long e=System.nanoTime();
			total+=e-s;
		}
		int random=(int)(total/arraySize);
		
		total=0;
		int cnt=0;
		for(int i=0;i<arraySize;i+=strideLength){
			cnt++;
			long s=System.nanoTime();
			d[i]=i;
			long e=System.nanoTime();
			total+=e-s;
		}
		int stride=(int)(total/cnt);
		
		System.out.println("Forward: "+forward+" Backwrod: "+backward+" Random: "+random+" Stride: "+stride);
	}
	public static void main(String[] args) {
		int L1CacheKB=192;
		int L2CacheKB=4096;
		int bytesOfInt=4;
		int strideLength=16;//assuming cache line size is 64 bytes
//		int arraySize=L1CacheKB*1024/bytesOfInt;
		int arraySize=L2CacheKB*1024/bytesOfInt;
//		int arraySize=15*1024*1024/bytesOfInt;
		for(int i=0;i<20;i++)
			test(arraySize, strideLength);
	}
} 
Result:
Forward: 44 Backwrod: 43 Random: 59 Stride: 67
Forward: 44 Backwrod: 43 Random: 58 Stride: 34
Forward: 45 Backwrod: 42 Random: 60 Stride: 42
Forward: 45 Backwrod: 40 Random: 58 Stride: 50
Forward: 41 Backwrod: 46 Random: 60 Stride: 45

Expected:

  • If array size is less than L1 cache size, fastest.
  • If array size is between L1 cache size and L2 size, slower.
  • If array size is greater than L1 + L2, slowest.
  • If stride size doesn’t equal to cache line size, slower relative to equal.
  • Random, slower in any cases of varying over both array size and stride size.

Outcome:

Only E is true.

Explanation:

In high probability, the next array element doesn’t in the same cache line at all, it’s only 1/16 hit, this explains E.

In nature, data in main memory is managed in a linear way, whereas cache line is the smallest unit of memory transferred between main memory and cache. The assumption that any size of block of memory more than cache line may be transferred is not true. This explains A – C.

As long as your memory access behavior conform to a regular pattern rather than jump around, CPU pre-fetcher can predict what’s next very well. This explains D.

2. Data Shared by VS Private to Thread

Take one dimensional int array as observing target to see the difference of element reading performance between sharing a public array and creating one array for each thread.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CacheTestPrivateToThread {
	static class TaskMeasurePerElement implements Callable<Integer>{
		private int[] ar;
		private boolean useSharedArray;
		
		public TaskMeasurePerElement(int[] ar, boolean useSharedArray){
			this.useSharedArray=useSharedArray;
			if(!useSharedArray){
				this.ar=new int[ar.length];
				this.ar[0]=1;
			}else{
				this.ar=ar;
			}
		}
		
		public Integer call() {
			int oddCounter=0;
			long loopStart=System.currentTimeMillis();
			for(int x=0;x<ar.length;x++){
				int a=ar[x];
				if((a & 1)==1){
					oddCounter++;
				}
			}
			long loopEnd=System.currentTimeMillis();
			System.out.println("useSharedArray="+useSharedArray+"\tspent="+(loopEnd-loopStart)+"millis per loop");
			return oddCounter;
		}
	}
	static class TaskMeasurePerLoop implements Callable<Integer>{
		private int[] ar;
		private boolean useSharedArray;
		
		public TaskMeasurePerLoop(int[] ar, boolean useSharedArray){
			this.useSharedArray=useSharedArray;
			if(!useSharedArray){
				this.ar=new int[ar.length];
				this.ar[0]=1;
			}else{
				this.ar=ar;
			}
		}
		
		public Integer call() {
			int oddCounter=0;
			long totalElementSpent=0;
			for(int x=0;x<ar.length;x++){
				long start=System.nanoTime();
				int a=ar[x];
				long end=System.nanoTime();
				totalElementSpent+=end-start;
				if((a & 1)==1){
					oddCounter++;
				}
			}
			System.out.println("useSharedArray="+useSharedArray+"\tspent="+totalElementSpent/ar.length+"millis per element");
			return oddCounter;
		}
	}
	
	public static void main(String[] args)throws Exception {
		int nThreads=4;
		int L1CacheKB=192;
		int L2CacheKB=4096;
		int bytesOfInt=4;
//		int arraySize=L1CacheKB*1024/bytesOfInt;
//		int arraySize=L2CacheKB*1024/bytesOfInt;
		int arraySize=300*1024*1024/bytesOfInt;
		int[] ar=new int[arraySize];
		ExecutorService es=Executors.newFixedThreadPool(nThreads);
		int loopSize=50;
		test(loopSize, nThreads, ar, es);
		
		es.shutdownNow();
	}

	private static void test(int loopSize, int nThreads, int[] ar, ExecutorService es) throws InterruptedException, ExecutionException {
		for(int c=0;c<loopSize;c++){
			boolean measuerPerLoop=true;
			boolean useSharedArray=true;
			test_(useSharedArray, measuerPerLoop, nThreads, ar, es);
			
			useSharedArray=false;
			test_(useSharedArray, measuerPerLoop, nThreads, ar, es);
			
			measuerPerLoop=false;
			useSharedArray=true;
			test_(useSharedArray, measuerPerLoop, nThreads, ar, es);
			
			useSharedArray=false;
			test_(useSharedArray, measuerPerLoop, nThreads, ar, es);
		}
	}

	private static void test_(boolean useSharedArray, boolean measuerPerLoop, int nThreads, int[] ar, ExecutorService es) throws InterruptedException, ExecutionException {
		Future<Integer>[] fs=new Future[nThreads];
		for(int t=0;t<nThreads;t++){
			if(measuerPerLoop)
				fs[t]=es.submit(new TaskMeasurePerLoop(ar, useSharedArray));
			else
				fs[t]=es.submit(new TaskMeasurePerElement(ar, useSharedArray));
		}
		for(int t=0;t<nThreads;t++){
			fs[t].get();
		}
	}
}
Result:
useSharedArray=true	spent=162millis per loop
useSharedArray=true	spent=164millis per loop
useSharedArray=true	spent=169millis per loop
useSharedArray=true	spent=171millis per loop
useSharedArray=false	spent=122millis per loop
useSharedArray=false	spent=121millis per loop
useSharedArray=false	spent=123millis per loop
useSharedArray=false	spent=112millis per loop
useSharedArray=true	spent=68millis per element
useSharedArray=true	spent=69millis per element
useSharedArray=true	spent=69millis per element
useSharedArray=true	spent=68millis per element
useSharedArray=false	spent=69millis per element
useSharedArray=false	spent=69millis per element
useSharedArray=false	spent=68millis per element
useSharedArray=false	spent=68millis per element

Expected:

Shared is faster than private to thread.

Outcome:

Per loop measuring, yes, but slightly;

Per element measuring, no;

Explanation:

What I can reason is just “int [x];” is too fast, and “” is not enough, so accumulatively the result of per loop measuring appears different, but per element not. You might guess that it may make difference to instantiate array inside “call()” method instead of using a field variable of array instantiated in constructor, actually not.

As some demonstrated, you can count odd numbers within a two dimensional int array by striping into multiple threads then aggregating. Some may use this case (see below codes) to justify sharing is much slower than not shared, but really not much, just as shown above if you purify the test case.


import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class CacheTestB {
	private static int[][] init(int size){
		ThreadLocalRandom tlr=ThreadLocalRandom.current();
		int[][] rtn=new int[size][size];
		for(int i=0;i<size;i++)
			for(int j=0;j<size;j++)
				rtn[i][j]=tlr.nextInt(size);
		return rtn;
	}
	private static void count(int[][] ar, int nThreads, boolean useSharedCounter)throws Exception{
		int size=ar.length;
		int[] sharedCounter=new int[nThreads];
		ExecutorService es=Executors.newFixedThreadPool(nThreads);
		long s=System.currentTimeMillis();
		int avg=(int)size/nThreads;
		for(int i=0;i<nThreads;i++){
			int rowStart=i*avg;
			int rowEnd=0;
			if(i==nThreads-1){
				rowEnd=size;
			}else{
				rowEnd=(i+1)*avg;
			}
			StripeTaskPrivateCounter stpc=new StripeTaskPrivateCounter(useSharedCounter, ar, rowStart, rowEnd, i, sharedCounter);
			es.submit(stpc);
		}
		int total=0;
		es.shutdown();
		es.awaitTermination(1, TimeUnit.DAYS);
		long e=System.currentTimeMillis();
		for(int i=0;i<nThreads;i++){
			total+=sharedCounter[i];
		}
		es.shutdownNow();
		System.out.println("Counter shared: "+useSharedCounter+" elapsed: "+(e-s) +" total: "+total);
	}
	static class StripeTaskPrivateCounter implements Callable<Integer>{
		private int[][] ar;
		private int rowStart, rowEnd;
		private int threadIndex;
		private int[] sharedCounter;
		private boolean useSharedCounter;
		
		public StripeTaskPrivateCounter(boolean useSharedCounter, int[][] ar, int rowStart, int rowEnd, int threadIndex, int[] sharedCounter){
			this.ar=ar;
			this.rowStart=rowStart;
			this.rowEnd=rowEnd;
			this.threadIndex=threadIndex;
			this.sharedCounter=sharedCounter;
			this.useSharedCounter=useSharedCounter;
		}
		
		public Integer call() throws Exception{
			int counter=0;
			for(int i=rowStart;i<rowEnd;i++){
				for(int j=0;j<ar.length;j++){
					if((ar[i][j] & 1)==1){
						if(useSharedCounter){
							sharedCounter[threadIndex]++;
						}else{
							counter++;
						}
					}
				}
			}
			if(!useSharedCounter){
				sharedCounter[threadIndex]=counter;
			}
			return counter;
		}
	}
	public static void main(String[] args)throws Exception {
		int nThreads=4;
		int[][] ar=init(1024*4);
		
		for(int i=0;i<100;i++){
			System.out.println();
			count(ar,nThreads, true);
			count(ar,nThreads, false);
		}
	}
}

Result:
Counter shared: true elapsed: 252 total: 8385316
Counter shared: false elapsed: 56 total: 8385316

Counter shared: true elapsed: 226 total: 8385316
Counter shared: false elapsed: 50 total: 8385316

Counter shared: true elapsed: 237 total: 8385316
Counter shared: false elapsed: 56 total: 8385316

If you don’t thing about it seriously, you might be misled to conclusion that sharing is much slower than not shared, the fact is NOT. Why? Because,

“sharedCounter[threadIndex]++;” goes like this (using javap),

        0: aload_1
        1: aload_0
        2: getfield      #13                 // Field ind:I
        5: dup2
        6: iaload
        7: iconst_1
        8: iadd
        9: iastore

And “counter++;” goes like this,

2: iinc          1, 1

I believe that the number of corresponding underlying assembly instructions is much different also. This explains why it appears differently at order of magnitude level rather than sharing or not.

3. Test on Two Dimensional Array

For traversal of the same two dimensional array, there are two ways, one is row by row, the other one is column by column. Without thinking, you might think that the traversing time length basically has no difference, simply reasoning with same amount of elements and same memory allocation, why not.

public class TestCacheLine {
	private static void memLocality(int n){
		int[][] a=new int[n][n];
		for(int i=0;i<n;i++){
			for(int j=0;j<n;j++){
				a[i][j]=i*j;
			}
		}
		int bytes=n*n*4/1024;
		
		/**
		 * access by row
		 */
		long s=System.currentTimeMillis();
		for(int i=0;i<n;i++){
			for(int j=0;j<n;j++){
				a[i][j]=i*j;
			}
		}
		long e=System.currentTimeMillis();
		System.out.println("n="+n+"\tbytes of array="+bytes+" kb\tspent="+(e-s)+" millis, Row-Major");
		
		/**
		 * access by column
		 */
		s=System.currentTimeMillis();
		for(int i=0;i<n;i++){
			for(int j=0;j<n;j++){
				int x=a[j][i];
			}
		}
		e=System.currentTimeMillis();
		System.out.println("n="+n+"\tbytes of array="+bytes+" kb\tspent="+(e-s)+" millis, Column-Major");
		
	}
	public static void main(String[] args) {
		for(int n=128;n<1024*1024*1024;n+=128)
		for(int i=0;i<5;i++)
		memLocality(n);
	}
}
Result:
n=7040	bytes of array=193600 kb	spent=71 millis, Row-Major
n=7040	bytes of array=193600 kb	spent=503 millis, Column-Major
n=7040	bytes of array=193600 kb	spent=71 millis, Row-Major
n=7040	bytes of array=193600 kb	spent=469 millis, Column-Major
n=7040	bytes of array=193600 kb	spent=74 millis, Row-Major
n=7040	bytes of array=193600 kb	spent=445 millis, Column-Major

Understanding below key concepts gives full answer.

  • Row/Column-major, memory allocation goes row-by-row/column-by-column, this is specific to programming language. Java is row-major.
  • Inter-Loop, compiler may smartly interchange inner loop and outer loop.
  • Cache Line, the smallest unit of memory moving between cores and between caches and main memory, usually 64 bytes.

In nature, physically memory allocation is conducted linearly, therefore, for Java, column-by-column access definitely lead to cache penalty.

4. Talks between Threads, ExecutorService (ES) VS Super-Fast

In the past, I once was faced a challenge of processing nearly a million of messages per second that were concurrently produced. Firstly, I used ES, just throwing processing tasks at it, but really worse when messages come in exposively. Afterwards, I made a small producer-consumer tool, just a little bit improvement far away from what is wonderfully invented by Martin Thompson, called Disruptor, absolutely awesome. His innovative way can attain millions of operations per second in consuming a kind of queue (actually a ring buffer) but not the kind of BlockingQueue by ES. This new way uses combination of technologies, including Unsafe, volatile, final, atomic*, memory fence, out-of-order execution, false sharing, JMM etc. To understand it, really requires deep insight into the way CPU works.

I wrote a test to compare ES and Disruptor with the same workload of producing and consuming, the result is amazing. I won’t post codes here due to too many.

What I did is to use four threads as producer without sleeping just keeping throwing. In ES case, CPU occupation bikes up to 90+% and memory usage keeps growing, both very soon, then the output looks jittering; In Disruptor case, just goes steadily, both CPU (~60%) and memory (70MB) go flat smoothly and lovely. The intrinsic limit of ES lies right in the contention on the queue.

5. NUMA – Non Unified Memory Access

What if a Java program runs on multiple NUMA nodes? Basically I didn’t see any practical examples about it within first a few pages of Google search. JVM does give options like UseNUMA, and claimed 40% performance (64 bit OS) improvement. In Java HotSpot JVM there is a NUMA-aware allocator that partitions memory over NUMA nodes. Oracle states, the allocator relies on a hypothesis that a thread that allocates the object will be the most likely to use the object. To ensure the fastest access to the new object, the allocator places it in the region local to the allocating thread. The regions can be dynamically resized to reflect the allocation rate of the application threads running on different nodes. That makes it possible to increase performance even of single-threaded applications. In addition, "from" and "to" survivor spaces of the young generation, the old generation, and the permanent generation have page interleaving turned on for them. This ensures that all threads have equal access latencies to these spaces on average. I have no multiple NUMA nodes machine in hand, no verified codes to post here. It’s very reasonable that aligning your Java codes with Oracle’s hypothesis would be a good practice. Hopefully some can share some explorations to present useful practice and general rules. I’m wondering when the worse cases like remote access across NUMA nodes even multiple hopping through forwarding occur and how to avoid.

Topics:
Java ,JVM ,CPU cache ,performance ,hardware ,jmm ,memory access pattern ,memory cache ,memory allocation ,cache hit

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}