Optimizing a Hashing Strategy

Using two strategies in combination you can develop new hashing strategies to improve performance without having to use more memory or much more CPU.

Peter Lawrey

Sep. 16, 15 · Tutorial

Likes (1)

Comment

Save

13.5K Views

Overview

The strategy that's used for hashing keys can have a direct impact on the performance of a hashed collection such as a HashMap or HashSet.

The built-in hashing functions are designed to be generic and work well in a wide range of use cases. Can we do better, especially if we have a good idea of the use case?

Testing a Hashing Strategy

In a previous article I looked at a number of ways to test hashing strategies and in particular looked at a hashing strategy which had been optimized for "Orthogonal Bits," which looked at making sure each hash result was as different as possible based on just one bit changing.

However, if you have a known set of elements/keys to hash you can optimize for that specific use case, rather than trying to find a generic solution.

Minimizing Collisions

One of the main things you want to avoid in a hashed collection is collisions. This is when two or more keys map to the same bucket. These collisions mean you have to do more work to check whether the key is the one you expected, as there are now multiple keys in the same bucket. Ideally there is at most 1 key in each bucket.

I Just Need Unique Hash Codes, Don't I?

A common misconception is that to avoid collisions all you need is to have a unique hash code. While having unique hash codes is highly desirable, that is not enough.

Say you have a set of keys and all of them have unique 32-bit hash codes. If you then have an array of 4 billion buckets, each key will have its own bucket, and there are no collisions. It is generally undesirable to have such large arrays for all hash collections. In fact, HashMap and HashSet are limited by the largest power of two sizes you can have for an array, which is 2^30 or just over one billion.

What happens when you have a more realistically sized hashed collection? The number of buckets needs to be smaller and the hash codes need to be modulo-ed to the number of buckets. If the number of buckets is a power of 2, you can use a mask of the lowest bits.

Let's look at this example: ftse350.csv. If we take the first column as a key or element, we get 352 strings. These strings have unique String.hashCode()s, but say we take the lower bits of these hash codes. Do we see collisions?

Mask	String.hashCode() masked	HashMap.hash( String.hashCode()) masked
32 bits	No collisions	No collisions
16 bits	1 collision	3 collisions
15 bits	2 collisions	4 collisions
14 bits	6 collisions	6 collisions
13 bits	11 collisions	9 collisions
12 bits	17 collisions	15 collisions
11 bits	29 collisions	25 collisions
10 bits	57 collisions	50 collisions
9 bits	103 collisions	92 collisions

The size of the HashMap for a load factor of 0.7 (the default) is 512, which uses a mask of the lower 9 bits. As you can see around 30% of keys have a collision even though we started with unique hash codes.

The code for HashTesterMain is here.

To reduce the impact of a poor hashing strategy, the HashMap uses an agitating function. In Java 8 it is fairly simple.

From the source for HashMap.hash. You can read the Javadoc for more details:

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

This mixes the high bits of the hash code with the low bits to improve the randomness of the lower bits. For the case above where there is a high collision rate, there is an improvement. See the third column.

A Look at the Hash Function for String

The code for String.hashCode():

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

Note: the implementation for String is defined in the Javadoc so there is little chance we can change it, but we could define a new hashing strategy.

Components of a Hashing Strategy

There are two parts I look at in a hashing strategy.

The magic numbers. You can try different numbers to find the best result.
The structure of the code. You want a structure where you get a good result for any sane choice of magic number.

While magic numbers matter, the reason you don't want them to be too important is that there is always a chance that your choice of magic number wasn't right for a given use case. This is why you also want a code structure which has a low worst case outcome even for a poorly chosen magic number.

Let's try some different multiplying factors instead of 31.

Multiplier	Collisions
1	230
2	167
3	113
4	99
5	105
6	102
7	93
8	90
9	100
10	91
11	91

You can see that the choice of a magic number matters, but also there are lots of numbers to try. We need to write a test to try out a good random selection. The source for HashSearchMain

Hash function	Best multiplier	Lowest collisions	Worst multiplier	Highest Collisions
hash()	130795	81 collisions	126975	250 collisions
xorShift16(hash())	2104137237	68 collisions	-1207975937	237 collisions
addShift16(hash())	805603055	68 collisions	-1040130049	243 collisions
xorShift16n9(hash())	841248317	69 collisions	467648511	177 collisions

The key code to look at is:

public static int hash(String s, int multiplier) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = multiplier * h + s.charAt(i);
    }
    return h;
}

private static int xorShift16(int hash) {
    return hash ^ (hash >> 16);
}

private static int addShift16(int hash) {
    return hash + (hash >> 16);
}

private static int xorShift16n9(int hash) {
    hash ^= (hash >>> 16);
    hash ^= (hash >>> 9);
    return hash;
}

As you can see, the repeated multiplication of each hash plus the next character is reasonable if you provide a good multiplier, or a multiplier which happens to work well with your key set. If you compare 130795 as a multiplier instead of 31, you get only 81 collisions instead of 103 collisions for the key set tested.

If you use the agitation function as well you can get around 68 collisions. This is getting close to the same collision rate as doubling the size of the array. I.e. an improved collision rate without using more memory.

But what happens when we add new keys to the hash collection, will our magic number still be good for us? This is where I look at the worst collision rates to determine which structure is likely to produce good results for a wider range of possible inputs. The worst case for hash() is 250 collisions. That is 70% of keys colliding, which is pretty bad. The agitation function improves this a little, however it's still not great. Note: if we add the shifted value instead of xor-ing it we get a worse result in this case.

However, if we do two shifts–to mix not just the top and bottom bits, but bits from four different portions of the hash code generated–we find that the worst case collision rate is much lower. This indicates to me that should the selection of keys change, we are less likely to get a bad result as the structure is better and the choice of magic number or choice of inputs matters less.

What if We Have `add` Instead of `xor` in the Hash Function?

In the agitation function using xor was perhaps better than using add. What happens if we change h = multiplier * h + s.charAt(i); to h = multiplier * h ^ s.charAt(i);?

Hash function	Best multiplier	Lowest collisions	Worst Score	Highest Collisions
hash()	1724087	78 collisions	247297	285 collisions
xorShift16(hash())	701377257	68 collisions	-369082367	271 collisions
addShift16(hash())	-1537823509	67 collisions	-1409310719	290 collisions
xorShift16n9(hash())	1638982843	68 collisions	1210040321	206 collisions

The best case numbers are slightly better, however the worst case collision rate is notably higher. This indicates to me that the choice of magic number matters more, but it also means that choice of keys will matter more. This would seem a risky choice as you have to consider that the keys may change over time.

Why Do We Chose Odd Multipliers?

When you multiply by an odd number the lower bit of the result has an equal chance of being 0 or 1. This is because 0 * 1 = 0 and 1 * 1 = 1. However, if you multiply by an even number the lower bit is always going to 0. i.e. it is no longer random. Say we repeat the earlier test but only using even numbers, how does this look?

Hash function	Best multiplier	Lowest collisions	Worst Score	Highest Collisions
hash()	82598	81 collisions	290816	325 collisions
xorShift16(hash())	1294373564	68 collisions	1912651776	301 collisions
addShift16(hash())	448521724	69 collisions	872472576	306 collisions
xorShift16n9(hash())	1159351160	66 collisions	721551872	212 collisions

If you are lucky and have the right input for your magic number the results are just as good as for odd numbers, however if you are unlucky, the results can be pretty bad. 325 collisions means that only 27 out of 512 buckets are being used.

How Do More Advanced Hashing Strategies Differ?

For the hashing strategies we use based on City, Murmur, XXHash and Vanilla Hash (our own):

The hashing strategy reads 64-bits at a time, which is faster than reading byte-by-byte.
The working value computed is two 64-bit values.
The working value is reduced to 64-bit long.
More multiplying constants are used as a result.
The agitation function is more complex.

We use long hash codes in our implementation as:

We optimize for 64-bit processors
The longest primitive data type is 64-bit in Java
If you have large hash collections (i.e. millions), 32-bit hashes are unlikely to be unique.

In Summary

By exploring how we generate the hash code, we have found ways to reduce the number of collisions for 352 keys down from 103 collisions to 68 collisions, but also have some confidence that should the key set change, we have reduced the impact this might have had.

This is without using more memory, or even much more processing power.

We still have the option of utilizing more memory.

For comparison, you can see that doubling the size of the array can improve the best case, but you still have the problem that a mismatch between the key set and the magic number can still have a high collision rate.

Hash function	Best multiplier	Lowest collisions	Worst Score	Highest Collisions
hash()	2924091	37 collisions	117759	250 collisions
xorShift16(hash())	543157075	25 collisions	- 469729279	237 collisions
addShift16(hash())	-1843751569	25 collisions	- 1501097607	205 collisions
xorShift16n9(hash())	-2109862879	27 collisions	-2082455553	172 collisions

Conclusion

In situations where you have a stable key set you can get a significant improvement in the rate of collisions by tuning the hashing strategy used.

You also need tests which indicate how bad things are likely to get if the key set changes without re-optimization.

Using these two in combination you can develop new hashing strategies to improve performance without having to use more memory or much more CPU.

Collision (computer science) Magic number (programming) 64-bit Data Types Data structure

Published at DZone with permission of Peter Lawrey, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending