Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

What's Wrong With Hashcode in java.lang.String?

DZone's Guide to

What's Wrong With Hashcode in java.lang.String?

This overview of strings and hashcode covers how hashcode works, how collisions happen, and the danger of improperly using strings as keys.

· Java Zone ·
Free Resource

Verify, standardize, and correct the Big 4 + more– name, email, phone and global addresses – try our Data Quality APIs now at Melissa Developer Portal!

One of the most significant criteria of every hash function is the tendency for collisions. Hash functions inside the JDK are not the exception. The main idea of a collision attack is finding two different messages, m1 and m2, such that hash(m1) = hash(m2). In this article, I would like to show that this problem is reproducible in every program written in Java and how to get around it.

Firstly, let's consider the internals of the hashcode function in java.lang.String:

    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }


As you see hashcode in java.lang.String class is computed as:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

This is nothing more than a polynomial accumulation method. The multiplier of this function is 31 and it has its own rationale: Joshua Bloch in Effective Java explained this solution: "The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically." (from Chapter 3, Item 9: Always override hashcode when you override equals, page 48).

The collisions in hashcode are generated trivially:

If String a and String b have a common prefix and the same length — if n and the statement 31*(b[n-2] - a[n-2]) == (a[n-1] - b[n-1]) is true — it means that first and second strings have the same hashcode.

Let's consider this example:

 String a = "Aa";
 String b = "BB";
 System.out.println(a.hashCode());
 System.out.println(b.hashCode());

 System.out.println(31 * ('C' - 'D') == ('B' - 'a'));
 System.out.println(31 * ('B' - 'A') == ('a' - 'B'));

 System.out.println("common_prefixDB".hashCode());
 System.out.println("common_prefixCa".hashCode());


If you run this example, you will see the same hashcode for "Aa" and "BB" — 2112 — and for the two last strings with "common_prefix" — 1027514780. You can generate your own collisions in a simple loop statement.

With this fact, it follows that the usage of a string like a key in HashMap is the mistake — for instance, if the programmer used a String key in a HashMap for storing HTTP headers, the attacker can use it for DoS attacks.

As a conclusion, be careful with hashcode functions.

Developers! Quickly and easily gain access to the tools and information you need! Explore, test and combine our data quality APIs at Melissa Developer Portal – home to tools that save time and boost revenue. Our APIs verify, standardize, and correct the Big 4 + more – name, email, phone and global addresses – to ensure accurate delivery, prevent blacklisting and identify risks in real-time.

Topics:
java ,hashcode ,java security ,java.lang.string ,tutorial

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}