DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Please enter at least three characters to search
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

The software you build is only as secure as the code that powers it. Learn how malicious code creeps into your software supply chain.

Apache Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors.

Generative AI has transformed nearly every industry. How can you leverage GenAI to improve your productivity and efficiency?

Modernize your data layer. Learn how to design cloud-native database architectures to meet the evolving demands of AI and GenAI workloads.

Related

  • An Introduction to Bloom Filters
  • Hybrid Search Using Postgres DB
  • Make @Observable Wrapper for Better State Control in Swift
  • LangChain in Action: Redefining Customer Experiences Through LLMs

Trending

  • MCP Servers: The Technical Debt That Is Coming
  • Monolith: The Good, The Bad and The Ugly
  • How Can Developers Drive Innovation by Combining IoT and AI?
  • AI-Driven Root Cause Analysis in SRE: Enhancing Incident Resolution
  1. DZone
  2. Data Engineering
  3. Data
  4. Optimizing a Hashing Strategy

Optimizing a Hashing Strategy

Using two strategies in combination you can develop new hashing strategies to improve performance without having to use more memory or much more CPU.

By 
Peter Lawrey user avatar
Peter Lawrey
·
Sep. 16, 15 · Tutorial
Likes (1)
Comment
Save
Tweet
Share
13.5K Views

Join the DZone community and get the full member experience.

Join For Free

Overview

The strategy that's used for hashing keys can have a direct impact on the performance of a hashed collection such as a HashMap or HashSet.

The built-in hashing functions are designed to be generic and work well in a wide range of use cases.  Can we do better, especially if we have a good idea of the use case?

Testing a Hashing Strategy

In a previous article I looked at a number of ways to test hashing strategies and in particular looked at a hashing strategy which had been optimized for "Orthogonal Bits," which looked at making sure each hash result was as different as possible based on just one bit changing.

However, if you have a known set of elements/keys to hash you can optimize for that specific use case, rather than trying to find a generic solution.

Minimizing Collisions

One of the main things you want to avoid in a hashed collection is collisions.  This is when two or more keys map to the same bucket.  These collisions mean you have to do more work to check whether the key is the one you expected, as there are now multiple keys in the same bucket.  Ideally there is at most 1 key in each bucket.

I Just Need Unique Hash Codes, Don't I?

A common misconception is that to avoid collisions all you need is to have a unique hash code.  While having unique hash codes is highly desirable, that is not enough.

Say you have a set of keys and all of them have unique 32-bit hash codes.  If you then have an array of 4 billion buckets, each key will have its own bucket, and there are no collisions.  It is generally undesirable to have such large arrays for all hash collections. In fact, HashMap and HashSet are limited by the largest power of two sizes you can have for an array, which is 2^30 or just over one billion.

What happens when you have a more realistically sized hashed collection? The number of buckets needs to be smaller and the hash codes need to be modulo-ed to the number of buckets. If the number of buckets is a power of 2, you can use a mask of the lowest bits.

Let's look at this example: ftse350.csv. If we take the first column as a key or element, we get 352 strings.  These strings have unique String.hashCode()s, but say we take the lower bits of these hash codes. Do we see collisions?

MaskString.hashCode() maskedHashMap.hash( 
String.hashCode()) masked
32 bitsNo collisionsNo collisions
16 bits1 collision3 collisions
15 bits2 collisions4 collisions
14 bits6 collisions6 collisions
13 bits11 collisions9 collisions
12 bits17 collisions15 collisions
11 bits29 collisions25 collisions
10 bits57 collisions50 collisions
9 bits103 collisions92 collisions

The size of the HashMap for a load factor of 0.7 (the default) is 512, which uses a mask of the lower 9 bits. As you can see around 30% of keys have a collision even though we started with unique hash codes.

The code for HashTesterMain is here.

To reduce the impact of a poor hashing strategy, the HashMap uses an agitating function. In Java 8 it is fairly simple.

From the source for HashMap.hash. You can read the Javadoc for more details:

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

This mixes the high bits of the hash code with the low bits to improve the randomness of the lower bits.  For the case above where there is a high collision rate, there is an improvement. See the third column.

A Look at the Hash Function for String

The code for String.hashCode():

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

Note: the implementation for String is defined in the Javadoc so there is little chance we can change it, but we could define a new hashing strategy.

Components of a Hashing Strategy

There are two parts I look at in a hashing strategy.

  • The magic numbers.  You can try different numbers to find the best result.
  • The structure of the code.  You want a structure where you get a good result for any sane choice of magic number.

While magic numbers matter, the reason you don't want them to be too important is that there is always a chance that your choice of magic number wasn't right for a given use case.  This is why you also want a code structure which has a low worst case outcome even for a poorly chosen magic number.

Let's try some different multiplying factors instead of 31.

MultiplierCollisions
1 230
2 167
3 113
4 99
5 105
6 102
7 93
8 90
9 100
10 91
11 91

You can see that the choice of a magic number matters, but also there are lots of numbers to try. We need to write a test to try out a good random selection. The source for HashSearchMain

Hash functionBest multiplierLowest collisionsWorst multiplierHighest Collisions
hash() 130795 81 collisions 126975 250 collisions
xorShift16(hash()) 2104137237 68 collisions -1207975937 237 collisions
addShift16(hash()) 805603055 68 collisions -1040130049 243 collisions
xorShift16n9(hash()) 841248317 69 collisions 467648511 177 collisions

The key code to look at is:

public static int hash(String s, int multiplier) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = multiplier * h + s.charAt(i);
    }
    return h;
}

private static int xorShift16(int hash) {
    return hash ^ (hash >> 16);
}

private static int addShift16(int hash) {
    return hash + (hash >> 16);
}

private static int xorShift16n9(int hash) {
    hash ^= (hash >>> 16);
    hash ^= (hash >>> 9);
    return hash;
}

As you can see, the repeated multiplication of each hash plus the next character is reasonable if you provide a good multiplier, or a multiplier which happens to work well with your key set. If you compare 130795 as a multiplier instead of 31, you get only 81 collisions instead of 103 collisions for the key set tested.

If you use the agitation function as well you can get around 68 collisions.  This is getting close to the same collision rate as doubling the size of the array. I.e. an improved collision rate without using more memory.

But what happens when we add new keys to the hash collection, will our magic number still be good for us?  This is where I look at the worst collision rates to determine which structure is likely to produce good results for a wider range of possible inputs.  The worst case for hash() is 250 collisions. That is 70% of keys colliding, which is pretty bad.  The agitation function improves this a little, however it's still not great.  Note: if we add the shifted value instead of xor-ing it we get a worse result in this case.

However, if we do two shifts–to mix not just the top and bottom bits, but bits from four different portions of the hash code generated–we find that the worst case collision rate is much lower.  This indicates to me that should the selection of keys change, we are less likely to get a bad result as the structure is better and the choice of magic number or choice of inputs matters less.

What if We Have add Instead of xor in the Hash Function?

In the agitation function using xor was perhaps better than using add. What happens if we change h = multiplier * h + s.charAt(i); to h = multiplier * h ^ s.charAt(i);?

Hash functionBest multiplierLowest collisionsWorst ScoreHighest Collisions
hash() 1724087 78 collisions 247297 285 collisions
xorShift16(hash()) 701377257 68 collisions -369082367 271 collisions
addShift16(hash()) -1537823509 67 collisions -1409310719 290 collisions
xorShift16n9(hash()) 1638982843 68 collisions 1210040321 206 collisions

The best case numbers are slightly better, however the worst case collision rate is notably higher. This indicates to me that the choice of magic number matters more, but it also means that choice of keys will matter more.  This would seem a risky choice as you have to consider that the keys may change over time.

Why Do We Chose Odd Multipliers?

When you multiply by an odd number the lower bit of the result has an equal chance of being 0 or 1. This is because 0 * 1 = 0 and 1 * 1 = 1.  However, if you multiply by an even number the lower bit is always going to 0. i.e. it is no longer random. Say we repeat the earlier test but only using even numbers, how does this look?

Hash functionBest multiplierLowest collisionsWorst ScoreHighest Collisions
hash() 82598 81 collisions 290816 325 collisions
xorShift16(hash()) 1294373564 68 collisions 1912651776 301 collisions
addShift16(hash()) 448521724 69 collisions 872472576 306 collisions
xorShift16n9(hash()) 1159351160 66 collisions 721551872 212 collisions

If you are lucky and have the right input for your magic number the results are just as good as for odd numbers, however if you are unlucky, the results can be pretty bad. 325 collisions means that only 27 out of 512 buckets are being used.

How Do More Advanced Hashing Strategies Differ?

For the hashing strategies we use based on City, Murmur, XXHash and Vanilla Hash (our own):

  • The hashing strategy reads 64-bits at a time, which is faster than reading byte-by-byte.
  • The working value computed is two 64-bit values.
  • The working value is reduced to 64-bit long.
  • More multiplying constants are used as a result.
  • The agitation function is more complex.

We use long hash codes in our implementation as:

  • We optimize for 64-bit processors
  • The longest primitive data type is 64-bit in Java 
  • If you have large hash collections (i.e. millions), 32-bit hashes are unlikely to be unique.

In Summary

By exploring how we generate the hash code, we have found ways to reduce the number of collisions for 352 keys down from 103 collisions to 68 collisions, but also have some confidence that should the key set change, we have reduced the impact this might have had.

This is without using more memory, or even much more processing power.

We still have the option of utilizing more memory.

For comparison, you can see that doubling the size of the array can improve the best case, but you still have the problem that a mismatch between the key set and the magic number can still have a high collision rate.

Hash functionBest multiplierLowest collisionsWorst ScoreHighest Collisions
hash() 2924091 37 collisions 117759 250 collisions
xorShift16(hash()) 543157075 25 collisions - 469729279 237 collisions
addShift16(hash()) -1843751569 25 collisions - 1501097607 205 collisions
xorShift16n9(hash()) -2109862879 27 collisions -2082455553 172 collisions

Conclusion

In situations where you have a stable key set you can get a significant improvement in the rate of collisions by tuning the hashing strategy used.  

You also need tests which indicate how bad things are likely to get if the key set changes without re-optimization. 

Using these two in combination you can develop new hashing strategies to improve performance without having to use more memory or much more CPU.

Collision (computer science) Magic number (programming) 64-bit Data Types Data structure

Published at DZone with permission of Peter Lawrey, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • An Introduction to Bloom Filters
  • Hybrid Search Using Postgres DB
  • Make @Observable Wrapper for Better State Control in Swift
  • LangChain in Action: Redefining Customer Experiences Through LLMs

Partner Resources

×

Comments
Oops! Something Went Wrong

The likes didn't load as expected. Please refresh the page and try again.

ABOUT US

  • About DZone
  • Support and feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends:

Likes
There are no likes...yet! 👀
Be the first to like this post!
It looks like you're not logged in.
Sign in to see who liked this post!