# Everything You Want to Know About the ThreadLocalRandom Implementation

### A very detailed look at ThreadLocalRandom in Java, various encryption algorithms that need random numbers, and parallel random number generators.


In this blog post, I’d like to present some things I’ve found while trying to understand the implementation of the JDK 8 ThreadLocalRandom class. What was my motivation for doing something like that? Over the weekend, I was reading the second volume of Knuth’s TAOCP, “Seminumerical Algorithms,” where he devotes the whole of Chapter 3 to describing techniques for generating random numbers with a computer, and then to explaining how to test those random number generators (of course, we are talking about pseudo-random number generators, or PRNGs). Near the end of the chapter, Knuth proposes the following exercise:

Look at the subroutine library of each computer installation in your organization, and replace the random number generators with good ones. Try to avoid being too shocked at what you find.

That seemed like a good exercise, and while I didn’t want to replace the PRNGs of my organization, I decided to take a look at the implementation and see for myself. I grabbed the code from OpenJDK and went straight to the implementation of ThreadLocalRandom. The class ThreadLocalRandom (TLR from now on) is an implementation of the interface offered by the Random class. Random is an implementation of the linear congruential method presented by Knuth in TAOCP. A quick foray into TLR revealed something totally different.

In trying to understand TLR, I downloaded the source code for the class and got it into IntelliJ. I tried to just ignore the comments and read the code as is. What an interesting experience! The code is full of hardcoded constants like the following:

```
private static int mix32(long z) {
    z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
    return (int)(((z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L) >>> 32);
}
```

What is `0xff51afd7ed558ccdL`, or `0xc4ceb9fe1a85ec53L`? How did they come up with those constants? And there are even a few more:

`0x9e3779b97f4a7c15L`, `0x9e3779b9`, `0xbb67ae8584caa73bL`

A little googling revealed that `0x9e3779b97f4a7c15L` and `0x9e3779b9` come from the RC5 encryption algorithm by Rivest (the same person who gave the R to RSA). In the paper, we can see that Rivest took the golden ratio number and used it in the following formula to produce either `0x9e3779b97f4a7c15L` or `0x9e3779b9`:

`Q = Odd((φ - 1)2^w)`

where `Odd` is a function that returns the odd integer nearest to its input, and `w` is the word size we want to use, like 32 or 64. As you might have guessed, `0x9e3779b97f4a7c15L` is the 64-bit version of the constant, and `0x9e3779b9` is the 32-bit version. Note that `0x9e3779b9` is also used in the Tiny Encryption Algorithm.
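
The formula can be checked with a few lines of Java. The following is a minimal sketch, not code from the JDK: the class and method names are made up for illustration, and it assumes Java 9 or later for `BigInteger.sqrt()`. It uses the identity φ - 1 = (√5 - 1) / 2 so that everything stays in integer arithmetic.

```java
import java.math.BigInteger;

class GoldenRatioConstants {
    // Odd((phi - 1) * 2^w): the odd integer nearest to (phi - 1) * 2^w.
    // Since phi - 1 = (sqrt(5) - 1) / 2, we have
    // floor((phi - 1) * 2^w) = (floor(sqrt(5 * 2^(2w))) - 2^w) / 2.
    static BigInteger oddGolden(int w) {
        BigInteger sqrt5Scaled = BigInteger.valueOf(5).shiftLeft(2 * w).sqrt();
        BigInteger floor = sqrt5Scaled.subtract(BigInteger.ONE.shiftLeft(w)).shiftRight(1);
        // Setting the lowest bit picks the nearest odd integer:
        // an odd floor is kept, an even floor is bumped up by one.
        return floor.or(BigInteger.ONE);
    }

    public static void main(String[] args) {
        System.out.println(oddGolden(32).toString(16)); // 9e3779b9
        System.out.println(oddGolden(64).toString(16)); // 9e3779b97f4a7c15
    }
}
```

Both printed values match the constants found in the TLR source.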

The constant `0x9e3779b97f4a7c15L` is called `GAMMA` in the TLR source code and `0x9e3779b9` is called `PROBE_INCREMENT`.

Then we have `0xbb67ae8584caa73bL`, which in TLR is called `SEEDER_INCREMENT`. That one is easier, since it’s just the fractional part of the square root of 3, scaled to 64 bits like this:

`frac(sqrt(3)) * 2^64`

It happens that this constant is also used in the SHA-512 algorithm, which applies the same procedure to the first eight prime numbers to extract the eight constants it uses as initial hash values.
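
As a quick sanity check, the same computation can be sketched in Java with integer arithmetic only. The class and method names below are hypothetical, and `BigInteger.sqrt()` requires Java 9 or later.

```java
import java.math.BigInteger;

class SeederIncrementConstant {
    // floor(frac(sqrt(p)) * 2^64) for a small prime p.
    static long fracSqrt64(int p) {
        // floor(sqrt(p * 2^128)) equals floor(sqrt(p) * 2^64).
        BigInteger scaled = BigInteger.valueOf(p).shiftLeft(128).sqrt();
        // longValue() keeps only the low 64 bits, which drops the integer part of sqrt(p).
        return scaled.longValue();
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(fracSqrt64(3))); // bb67ae8584caa73b
        System.out.println(Long.toHexString(fracSqrt64(2))); // 6a09e667f3bcc908
    }
}
```

The second line is the value SHA-512 derives from the first prime, 2, which shows that the procedure is the same one.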

Then we have `0xff51afd7ed558ccdL` and `0xc4ceb9fe1a85ec53L`, which are used in the `mix32` method shown above. Google reveals that these constants come from the finalizer step of the MurmurHash3 algorithm.

So far so good, but what are all these numbers used for? To understand this, we need to understand what the purpose of ThreadLocalRandom is. As the name suggests, the idea is to have a source of random numbers per thread, so we can obtain random numbers concurrently. This means that every time a thread initialises an instance of TLR, the code needs to initialise the random seed for that particular thread. The seed is initialised to a *mixed* version of a shared seeder value, which is then advanced by `SEEDER_INCREMENT`; at the same time, a `probe` value is initialised for that particular thread by adding `PROBE_INCREMENT` to the current `probeGenerator` value. What’s that *probe* value used for? It is used by classes like `ConcurrentHashMap` as a per-thread hash, for example to spread contending threads across internal counter cells.
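
The per-thread initialisation can be sketched as follows. This is a simplified illustration and not the JDK code: the real class writes the values into fields of `Thread` through `Unsafe`, whereas this sketch uses a `ThreadLocal` and a hypothetical `State` holder, and the starting seeder value of 42 is an arbitrary assumption.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

class SeedInitSketch {
    static final long SEEDER_INCREMENT = 0xbb67ae8584caa73bL;
    static final int PROBE_INCREMENT = 0x9e3779b9;

    // Shared generators; the starting value 42 is an assumption for the sketch.
    static final AtomicLong seeder = new AtomicLong(42L);
    static final AtomicInteger probeGenerator = new AtomicInteger();

    // Hypothetical holder for the two per-thread values.
    static final class State {
        final long seed;
        final int probe;
        State(long seed, int probe) { this.seed = seed; this.probe = probe; }
    }

    static final ThreadLocal<State> STATE = ThreadLocal.withInitial(SeedInitSketch::localInit);

    // MurmurHash3-style 64-bit finalizer.
    static long mix64(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }

    static State localInit() {
        int p = probeGenerator.addAndGet(PROBE_INCREMENT);
        int probe = (p == 0) ? 1 : p; // zero is kept to mean "not initialised"
        long seed = mix64(seeder.getAndAdd(SEEDER_INCREMENT));
        return new State(seed, probe);
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable show = () -> {
            State s = STATE.get();
            System.out.println(Thread.currentThread().getName()
                    + " seed=" + Long.toHexString(s.seed)
                    + " probe=" + Integer.toHexString(s.probe));
        };
        show.run();
        Thread other = new Thread(show);
        other.start();
        other.join();
    }
}
```

Each thread pays for one atomic update when it first asks for its state; after that, nothing is shared.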

So we have constants taken from RC5 and TEA, SHA-512 and MurmurHash3. This is starting to make no sense at all, but let’s check how it all works when put together to see if we can make sense of it.

## Obtaining Random Numbers

To get a random number from ThreadLocalRandom, we can call the method `nextInt()`, which is part of the interface exposed by `Random` as well. Let’s check the implementation of that method:

```
public int nextInt() {
    return mix32(nextSeed());
}
```

So we compute the next seed, which is then *mixed* and returned to the user. Let’s see what `nextSeed()` is doing:

```
final long nextSeed() {
    Thread t; long r; // read and update per-thread seed
    UNSAFE.putLong(t = Thread.currentThread(), SEED,
                   r = UNSAFE.getLong(t, SEED) + GAMMA);
    return r;
}
```

That method obtains the current value of the seed and just adds the value of the `GAMMA` constant presented above to it. This doesn’t look random at all, for some definition of the word random anyway. If we go back to Knuth, the linear congruential method presented in his book involves a calculation like this:

`nextSeed = (oldSeed * multiplier + addend) & mask;`

So we have a multiplier and an addend, and then we apply a mask to the result, which is the same as reducing it modulo a power of two.
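
To make the formula concrete, here is a small sketch of a linear congruential generator that plugs in the multiplier, addend, and 48-bit mask documented for `java.util.Random`. The class name `LcgSketch` is made up for illustration; only the constants and the seed scrambling are taken from the `Random` documentation.

```java
import java.util.Random;

class LcgSketch {
    static final long MULTIPLIER = 0x5DEECE66DL;
    static final long ADDEND = 0xBL;
    static final long MASK = (1L << 48) - 1;

    private long seed;

    LcgSketch(long seed) {
        // Same initial scrambling that java.util.Random applies to its seed.
        this.seed = (seed ^ MULTIPLIER) & MASK;
    }

    int nextInt() {
        seed = (seed * MULTIPLIER + ADDEND) & MASK;
        return (int) (seed >>> 16); // keep the 32 high-order bits of the 48-bit state
    }

    public static void main(String[] args) {
        LcgSketch mine = new LcgSketch(42L);
        Random reference = new Random(42L);
        for (int i = 0; i < 3; i++) {
            // Each line prints the same value twice if the sketch matches Random.
            System.out.println(mine.nextInt() + " " + reference.nextInt());
        }
    }
}
```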

The method used by TLR lacks the multiplier part, but in TAOCP Knuth is very clear that the lack of a multiplier has the effect of producing a sequence that’s “everything but random.” The TLR case is like setting the multiplier to 1 in the linear congruential method.
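
A short experiment shows how regular the raw seed sequence is. The sketch below repeats the seed update with plain `long` arithmetic instead of `Unsafe`; the class name and the starting value 7 are arbitrary choices for illustration.

```java
class WeylSequenceDemo {
    static final long GAMMA = 0x9e3779b97f4a7c15L;

    static int mix32(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        return (int) (((z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L) >>> 32);
    }

    // Raw seed updates: state + GAMMA, wrapping modulo 2^64.
    static long[] rawSeeds(long start, int n) {
        long[] out = new long[n];
        long s = start;
        for (int i = 0; i < n; i++) {
            s += GAMMA;
            out[i] = s;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] seeds = rawSeeds(7L, 5);
        for (int i = 1; i < seeds.length; i++) {
            // The difference between consecutive raw seeds is always GAMMA.
            System.out.println(Long.toHexString(seeds[i] - seeds[i - 1])
                    + " -> mixed: " + Integer.toHexString(mix32(seeds[i])));
        }
    }
}
```

The raw seeds form an arithmetic progression, so any randomness in the output has to come from the mixing step.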

So there’s either something very wrong with ThreadLocalRandom, or I am missing something. Considering that ThreadLocalRandom is part of the JDK used by millions of developers, I’m pretty sure that I am at fault, and I’m surely missing something. Time to read those comments.

Right at the top of the class we have this comment:

Even though this class subclasses java.util.Random, it uses the same basic algorithm as java.util.SplittableRandom.

So let’s check SplittableRandom and see what we find over there, but this time let’s read the comments as well.

## The Theory Behind ThreadLocalRandom

When we open SplittableRandom’s source code, we find this illustrative comment:

```
/*
 * This algorithm was inspired by the "DotMix" algorithm by
 * Leiserson, Schardl, and Sukha "Deterministic Parallel
 * Random-Number Generation for Dynamic-Multithreading Platforms",
 * PPoPP 2012, as well as those in "Parallel random numbers: as
 * easy as 1, 2, 3" by Salmon, Morae, Dror, and Shaw, SC 2011.  It
 * differs mainly in simplifying and cheapening operations.
 */
```

The two papers are *Deterministic Parallel Random-Number Generation for Dynamic-Multithreading Platforms* and *Parallel Random Numbers: As Easy as 1, 2, 3*.

The goal of the first paper, by Leiserson, Schardl, and Sukha (LSS from now on), is to create efficient *deterministic parallel random-number generators* (DPRNGs) for dthreading (dynamic multithreading) platforms, as opposed to POSIX’s pthreading. Dthreading is an implementation of the fork-join parallel computation model. The problem is that traditional DPRNGs don’t scale to hundreds of thousands of *strands* (read: green threads), since they were made with the pthread model in mind. In that paper they present an algorithm for DPRNGs called DotMix, which takes a **dot** product of a *pedigree* and then **mix**es the result using the RC6 algorithm. They claim DotMix has a statistical quality that rivals that of the Mersenne Twister algorithm and should work for hundreds of thousands of *strands*.

What do they mean when they say those other algorithms don’t scale? The problem with parallel streams of random numbers is that there has to be some way to prevent two streams from producing the same sequence of random numbers. We could keep state around and synchronise among threads using locks, but that would be slow. They are trying to find a way for each thread to have a seed for its random numbers that doesn’t depend on shared state.
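
From the caller’s side, the lock-free design looks like this in Java. The sketch below uses only the public `ThreadLocalRandom.current()` and `nextInt(int bound)` calls; the class name, the number of threads, and the bounds are arbitrary choices for illustration.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

class PerThreadRandomDemo {
    // Each caller draws from the generator that belongs to its own thread:
    // no lock is taken and no seed is shared between threads.
    static int[] draw(int count, int bound) {
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = ThreadLocalRandom.current().nextInt(bound);
        }
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> System.out.println(
                    Thread.currentThread().getName() + " " + Arrays.toString(draw(3, 100))));
            workers[i].start();
        }
        for (Thread w : workers) {
            w.join();
        }
    }
}
```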

### What’s a Pedigree?

They explain a pedigree with the following definition, which won’t make much sense unless we read the original paper:

A pedigree scheme uniquely identifies each strand of a dthreaded program in a scheduler-independent manner.

Let’s try to understand that definition in the context of a fork-join parallel computing model. Each thread can fork multiple threads by calling fork (*spawn* in the LSS paper), or it can generate a value. The spawned threads can in turn do the same: either spawn a new child thread or generate more values. Now let’s represent that model as a tree:

We have that the root task `a` generated the value `6a`, forked the thread `b` and then `c`, and generated `81`. Then the task `b` generated three values: `12`, `74` and `c7`. Then `c` forked `d` and generated `b9`; and so on.

An oversimplification of what LSS claim is that the pedigree of each value, taken as the vector of labels in the tree from the generated value leaf to the root, is unique, independently of the scheduling of the tasks. For example, the value `74` has the following pedigree: <2, 2>.

We have a unique vector per thread, but how do we generate random numbers from there? The authors of DotMix talk about the idea of *compressing* the values of the vector into a single machine word. Here’s where the *dot* part of DotMix comes into play. They calculate the dot product of the pedigree vector with another vector of integers “chosen uniformly at random” (see their paper for the exact details).
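
The compression idea can be illustrated with a toy version in Java. This is not the DotMix implementation: the small prime modulus, the table values, and all names are assumptions chosen so the arithmetic fits in a `long`; the paper should be consulted for the real parameters.

```java
class PedigreeDotProduct {
    // Toy prime modulus (2^31 - 1), assumed here so that products fit in a long.
    static final long P = 2147483647L;

    // Dot product of a pedigree with a table of non-negative integers, modulo P.
    // The table stands in for the integers "chosen uniformly at random".
    static long compress(int[] pedigree, long[] table) {
        long acc = 0;
        for (int i = 0; i < pedigree.length; i++) {
            acc = (acc + (pedigree[i] % P) * (table[i] % P)) % P;
        }
        return acc;
    }

    public static void main(String[] args) {
        long[] table = {1234567L, 7654321L, 1357911L}; // fixed here for repeatability
        System.out.println(compress(new int[] {2, 2}, table));
        System.out.println(compress(new int[] {2, 3}, table));
    }
}
```

Two nearby pedigrees produce different but clearly related words, which is the weakness the mixing step is meant to address.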

The integer hash produced from the dot product has a small probability of colliding with the ones produced by other threads. The problem is that two similar pedigrees can now produce similar hashes, but the authors of DotMix wanted the produced value to be *statistically dissimilar* from other hashes. To solve that problem they introduced the mixing part of DotMix. So, what’s mixing?

### What’s Mixing?

The mixing step of the algorithm applies a function to the hash value obtained from the pedigree in order to reduce the statistical correlation between two hash values, so that it becomes hard to predict their sequence. In the case of DotMix, the mix function swaps the high- and low-order bits of the hash value; in a 64-bit hash, for example, it swaps the first 32-bit half with the second one. For DotMix they use a mixing function based on the RC6 encryption algorithm.
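
As a rough illustration of what such a round could look like, the sketch below combines the RC6-style map z → 2z² + z (modulo 2^64) with a swap of the two 32-bit halves. The exact round function and the number of rounds used by DotMix are assumptions here and should be checked against the paper; the class and method names are made up.

```java
class Rc6StyleMix {
    // One round: compute 2*z^2 + z modulo 2^64, then swap the 32-bit halves.
    static long round(long z) {
        long y = 2 * z * z + z;        // long arithmetic wraps modulo 2^64
        return Long.rotateLeft(y, 32); // rotating by 32 swaps high and low halves
    }

    // Applies the round function a caller-chosen number of times.
    static long mix(long z, int rounds) {
        for (int i = 0; i < rounds; i++) {
            z = round(z);
        }
        return z;
    }

    public static void main(String[] args) {
        // Neighbouring inputs drift apart as more rounds are applied.
        for (int rounds = 0; rounds <= 4; rounds++) {
            System.out.println(rounds + ": " + Long.toHexString(mix(1L, rounds))
                    + " " + Long.toHexString(mix(2L, rounds)));
        }
    }
}
```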

## Counter-Based PRNGs

From the description of DotMix we can see that ThreadLocalRandom is using a mixing step that’s applied to the seed obtained from `nextSeed`, but as we saw, `nextSeed` just increases the current seed by the value of the `GAMMA` constant, which means this has nothing to do with pedigrees. Here’s where the other paper mentioned in the comments comes into play.

The paper “Parallel Random Numbers: As Easy as 1, 2, 3” by Salmon, Moraes, Dror, and Shaw presents the idea of counter-based PRNGs. In their paper, they are trying to solve the problem of having “massively parallel high-performance computations,” for which they say traditional PRNGs don’t scale well.

The problem with traditional PRNGs like the method from TAOCP is that they are serial: to calculate random number n+1, we need to have calculated the nth random number first. If we want to produce several streams of random numbers, that approach won’t scale if our goal is to be sure the streams are different, since initialising each stream with its own seed becomes complicated.

To counter that problem they propose a simple function to produce a sequence of numbers:

`f(s) = (s + 1) mod 2^p`

That function is just a simple counter that increases the input value by one and then reduces it modulo some power of two. Since it’s just a counter, these generators are called counter-based PRNGs.

At this point, we might start hurting our own eye muscles from so much eye-rolling, but stay with me. The counter function could use just 1 as the increment, as in that example, or use a slightly more complicated number like the `GAMMA` value used in ThreadLocalRandom. Still, we don’t have random numbers yet.

What the authors of that paper propose is that we apply a cryptographically secure function to the values produced by the counter function. In particular, they propose applying parts of AES or Threefish to the value produced by the counter. In their paper, instead of incrementing the counter by 1 they propose a couple of constants that are also used in ThreadLocalRandom: `0xbb67ae8584caa73b` and `0x9e3779b97f4a7c15`, which are our `SEEDER_INCREMENT` and `GAMMA` values mentioned before. They say that by using these constants and some variants of AES or Threefish they managed to pass TestU01’s BigCrush tests for PRNGs, and that their PRNGs have periods of `2^128` numbers.
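
The practical benefit of a counter-based design is that the nth output can be computed directly, without producing the previous ones. The sketch below illustrates this with a cheap mixer instead of AES or Threefish: the `mix64` body is the one `SplittableRandom` uses, while the class name and the `nth` helper are hypothetical. It assumes that `new SplittableRandom(seed)` advances its state by the golden-ratio constant on each call, as the JDK 8 source does.

```java
import java.util.SplittableRandom;

class CounterBasedSketch {
    static final long GAMMA = 0x9e3779b97f4a7c15L;

    // Mixing function with the constants used by SplittableRandom.
    static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    // n-th output (1-based), computed straight from the counter value.
    static long nth(long seed, long n) {
        return mix64(seed + n * GAMMA);
    }

    public static void main(String[] args) {
        SplittableRandom sequential = new SplittableRandom(2024L);
        for (long n = 1; n <= 5; n++) {
            // Compare the sequential generator with direct access to position n.
            System.out.println(sequential.nextLong() == nth(2024L, n));
        }
    }
}
```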

## Putting the Puzzle Together

So now we are managing to put together the puzzle of ThreadLocalRandom. From DotMix, we have the mixing function, and from Salmon et al. we get the idea of counter-based PRNGs. There’s still a missing piece, though: why is ThreadLocalRandom using what seems to be a custom mixing function?

It happens that ThreadLocalRandom’s `mix32` and `mix64` are not custom at all. They are in fact based on the MurmurHash3 finalizer function. The idea behind that function is to produce an avalanche effect on the bits of the value passed to the mix function. The avalanche effect means that flipping a single input bit changes enough output bits (the avalanche part) that the resulting number is quite different from the original result. So if we pass two values to the function that are quite similar, the mixing function will make sure they end up quite different. If we look at `mix32` again we will see a couple of constants there, `0xff51afd7ed558ccdL` and `0xc4ceb9fe1a85ec53L`.

`private static int mix32(long z) { z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL; return (int)(((z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L) >>> 32); }`

According to the creator of MurmurHash3, these constants were chosen because they produce an avalanche effect with a probability near 0.5, meaning that each output bit flips about half of the time when a single input bit changes. But the story does not end here: `SplittableRandom` does not use the same constants! Here is the `SplittableRandom` implementation:

`private static long mix64(long z) { z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L; z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL; return z ^ (z >>> 31); }`

The constants used here were suggested by David Stafford on his blog, who found them after running some experiments. For some reason `SplittableRandom` has the *better* constants while ThreadLocalRandom does not.
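
The avalanche property can be measured with a crude experiment: flip one input bit at a time and count how many of the 64 output bits change. The sketch below does that for both sets of constants. The sampling scheme and all names are made up for illustration, and an average this coarse is far too blunt to rank the two mixers; it only shows that both land near the ideal of 32 flipped bits.

```java
import java.util.function.LongUnaryOperator;

class AvalancheCheck {
    // 64-bit finalizer with the MurmurHash3 constants.
    static long murmurMix64(long z) {
        z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
        z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
        return z ^ (z >>> 33);
    }

    // Variant with the constants suggested by Stafford.
    static long staffordMix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    // Average number of output bits that change when a single input bit is flipped.
    static double averageFlips(LongUnaryOperator mixer, int samples) {
        long total = 0;
        long z = 0;
        for (int i = 0; i < samples; i++) {
            z += 0x9e3779b97f4a7c15L; // arbitrary way to spread the sample inputs
            long base = mixer.applyAsLong(z);
            for (int bit = 0; bit < 64; bit++) {
                total += Long.bitCount(base ^ mixer.applyAsLong(z ^ (1L << bit)));
            }
        }
        return total / (64.0 * samples);
    }

    public static void main(String[] args) {
        System.out.println("murmur3:  " + averageFlips(AvalancheCheck::murmurMix64, 1000));
        System.out.println("stafford: " + averageFlips(AvalancheCheck::staffordMix64, 1000));
    }
}
```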

## ThreadLocalRandom: A Random Algorithm?

One thing that Knuth tries to make clear in his book is that we should not use a random algorithm to produce a PRNG. That is, the steps of the algorithm shouldn’t be chosen at random, as in: grab a piece from here, another from there, put them together, shake it a bit, and ta-da! We’ve got a PRNG. So far this seems to be the case with ThreadLocalRandom. What are we missing?

There’s yet another paper called *Fast Splittable Pseudorandom Number Generators*. Its authors are Guy Steele, Doug Lea, and Christine H. Flood, who, in case you don’t know, are all people involved with the development of the JDK. What is that paper about? It explains the algorithm behind `SplittableRandom`, which is the one used for `ThreadLocalRandom` as well (with some small differences, as explained above).

In that paper, they explain that they took DotMix and implemented it in Scala, because the language would permit a clean implementation that they could analyze for further improvements, which would then be translated to Java. That’s a pretty interesting use case for Scala.

Once they had DotMix implemented they tried to refine it, focusing on simplifying its steps and increasing the performance of the algorithm, while keeping it good enough to pass the TestU01 battery of tests. So a pedigree-based PRNG became a counter-based one, and the cryptographically secure (but arguably slow) mixing functions became the finaliser function from MurmurHash3. Of course, they put their new PRNG algorithm through the battery of tests offered by TestU01, which is the industry standard for testing PRNGs.

There’s a historical note about that paper that’s worth noticing. The paper is from October 2014. In December of that year, a paper from INRIA was submitted for publication which discussed the `MRG32k3a` algorithm as a way to replace the ThreadLocalRandom implementation. If we read the paper by Steele et al., we see that they reviewed `MRG32k3a` but found that it didn’t fit their selection criteria, because it wouldn’t allow splitting the stream beyond 192 sub-threads. I would assume that the authors of the INRIA paper didn’t know about Steele’s paper at the time of publication.

Another interesting note from that paper is their commentary on Haskell’s System.Random implementation:

The API is beautiful; the implementation turns out to have a severe flaw.

## Conclusion

On this interesting journey through ThreadLocalRandom, we found quite a few things along the way that seemed to have no logical explanation, but there was, in fact, a reason for them to be there. In this case, the authors of ThreadLocalRandom took two algorithms for producing PRNGs and refined their implementation until they arrived at what SplittableRandom is (and subsequently ThreadLocalRandom). Even if that method seems sound, a PRNG needs testing of its statistical properties, and Steele et al. tell us that their SplitMix passes the TestU01 battery of tests.

## Credits

The fork-join tree image and the explanation come from the paper by Guy Steele et al. mentioned above.

Published at DZone with permission of Alvaro Videla, DZone MVB. See the original article here.
