# Comparing Hashing Strategies

### Learn all about hashing: what is it, and various strategies.

Join the DZone community and get the full member experience.

Join For Free### Overview

Chronicle has a number of implementations for hashing, including City and Murmur. It also has its own Vanilla Hash, but how was this tested?

### What is Vanilla Hash?

Vanilla Hash is designed to be as simple as possible and be optimized for the Orthogonal Bits test (See below) This was compared with City 1.1 and Murmur 3 hashing strategies.

This is the 99th percentile latencies for filling the 64 byte/256 byte buffer with new data and generating a 64-bit hash. JMH was used to perform the measurements. See Main64bytes and Main256bytes

Hashing Strategy | 64 byte 99%tile | 256 byte 99%tile |
---|---|---|

Vanilla | 67 ns | 112 ns |

City 1.1 | 90 ns | 182 ns |

Murmur 3 | 104 ns | 211 ns |

The full test of results are here.

### What Tests Can You Do To Check a Hashing Strategy is Good?

There are a number of simple tests you can do. The tests cannot identify a good hash, but they can show a hash to be a poor one. Passing one test could mean it will fail another.

In each case multiple tests are run with different random starting points. A score is taken for the 99%th percentile, i.e. the worst 1%. This is because you don't need a hash which works some of the time, or on average. You need one which works most of the time. (In all cases you can invent a pathological case where any specific hash will break down)

For consistency, lower scores are better. The test should be constructed in such as that a score of 0 indicates the test is broken.

In each test, using an input of 8,192 bits, or 1024 KB, one bit at a time is toggled. From these inputs, 8,192 x 64-bit hashes are generated.

For the random tests however, a sequence of random 64-bit values were taken. These are useful to get an idea of what a good number is for the hashing strategies tested.

### Mask of Hash Score

In this test, each hash is modulus by 16,384 (double the number of hashes) and the number of collisions is reported. Most hashing strategies did well for this test.

### Avalanche Score

In this test, each hash is compared to the previous hash (with the previous bit toggled) to see how likely any given bit will be flipped. The ideal is 50% and the sum of difference to 50% is taken with the worst 1% reported.

### Speed in Latency

In this test, the time it takes to perform the hash is recorded and the worst 1% latency reported.

### Orthogonal Bits

The purpose of this tests is to ensure all the hashes have bits which are different as different as possible to every other hash produced. Think of the 8 Queens Problem, except for 64-bit numbers. The ideal is that every number has the same number of bits different to every other number and this is as high as possible.

In this test, every hash is compared to every other hash. A count of the number of bits which are different is taken. If the number of different bits is less than 18, this is given a penalty score of 2^(17-n). The less bits which are different the greater the penalty on an exponential scale. If any of the 8K hashes comapred to the other 8K hashes are different in less than 5 bits, this is a failure even if all the other pairs are fine.

I have called it an Orthogonal Bits test as you can model a 64-bit number as a 64 dimensional vector of bits. Ideally you want the angle between all the hashes produced as high as possible.

Of all the tests, this one shows the highest difference between the String.hashCode() with HashMap.hash(int) and the other hashing strategies.

### Testing String.hashCode()

String.hashCode() is a very poor hash, especially for the *lower* bits. It is standard and cannot be changed or break backward compatibility. However, this doesn't have to be a problem as HashMap uses an agitate function which brings down some of the higher bits to randomise the lower ones.

```
int hash(int h) {
// This function ensures that hashCodes that differ only by
// constant multiples at each bit position have a bounded
// number of collisions (approximately 8 at default load factor).
h ^= (h >>> 20) ^ (h >>> 12);
return h ^ (h >>> 7) ^ (h >>> 4);
}
```

## Results

The CheckMain class run a suit of tests on each hashing strategy.

VANILLA Orthogonal bits: 99%tile score: 6066 Speed: The 99%tile for latency was 0.223 us Avalanche: The 99%tile of the drift from 50% was 0.55% Mask of Hash: 99%tile collisions: 1815

CITY_1_1 Orthogonal bits: 99%tile score: 7395 Speed: The 99%tile for latency was 0.267 us Avalanche: The 99%tile of the drift from 50% was 0.55% Mask of Hash: 99%tile collisions: 1817

MURMUR_3 Orthogonal bits: 99%tile score: 7524 Speed: The 99%tile for latency was 0.378 us Avalanche: The 99%tile of the drift from 50% was 0.54% Mask of Hash: 99%tile collisions: 1815

STRING32 Orthogonal bits: 99%tile score: **295906433** Speed: The 99%tile for latency was **1.580** us Avalanche: The 99%tile of the drift from 50% was **1.02%** Mask of Hash: 99%tile collisions: 1814

STRING64 Orthogonal bits: 99%tile score: **1939167** Speed: The 99%tile for latency was **1.520** us Avalanche: The 99%tile of the drift from 50% was 0.61% Mask of Hash: 99%tile collisions: 1816

STRING32_WITHOUT_AGITATE Orthogonal bits: 99%tile score: **879390386** Speed: The 99%tile for latency was **1.573** us Avalanche: The 99%tile of the drift from 50% was **3.53%** Mask of Hash: 99%tile collisions: **6593**

RANDOM Orthogonal bits: 99%tile score: 7444 Speed: The 99%tile for latency was **0.058** us Avalanche: The 99%tile of the drift from 50% was 0.53% Mask of Hash: 99%tile collisions: 1817

SECURE_RANDOM Orthogonal bits: 99%tile score: 7449 Speed: The 99%tile for latency was **0.861** us Avalanche: The 99%tile of the drift from 50% was 0.54% Mask of Hash: 99%tile collisions: 1816

SEEDED_VANILLA

Orthogonal bits: 99%tile score: 6000

Speed: The 99%tile for latency was 0.219 us

Avalanche: The 99%tile of the drift from 50% was 0.55%

Mask of Hash: 99%tile collisions: 1814

SEEDED_CITY_1_1

Orthogonal bits: 99%tile score: 7313

Speed: The 99%tile for latency was 0.270 us

Avalanche: The 99%tile of the drift from 50% was 0.54%

Mask of Hash: 99%tile collisions: 1813

SEEDED_MURMUR_3

Orthogonal bits: 99%tile score: 7404

Speed: The 99%tile for latency was 0.359 us

Avalanche: The 99%tile of the drift from 50% was 0.53%

Mask of Hash: 99%tile collisions: 1810

Note: Seeded Vanilla Hash is part of Chronicle Enterprise

### Conclusions

The Vanilla, City and Murmur hashers were the fastest.

While String.hashCode() is simple, the multiplication operation on a per character basis is expensive. By comparison all the others process 8 bytes at a time using longs. See STRINGS32_WITHOUT_AGITATE compared with STRING32. HashMap uses the later.

The 32-bit String hashCode() even with the agitate performed poorly on the Avalanche test. In SMHasher where this test comes from a score over 1% was considered a failure.

The Mask of Hash tests, while simple, appears to be performed well in all cases. The exception being the String.hashCode() which as mentioned doesn't have very random low bits.

What I found interesting is how different the orthogonal test score were. The first three hash stratgies were again consistently low. Even the 64-bit version of String.hashCode() has a high change of producing hashes with less than 18 bits different, in fact a lot of the bits are the same.

### Disclaimer

Vanilla Hash was optimized for the Orthogonal Bits test. As such it is no surprise that it gets a slightly better result. This doesn't mean that Vanilla Hash is better than City or Murmur. It may just mean it is best for the Orthogonal Bits test.

Published at DZone with permission of Peter Lawrey, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Comments