A hash function maps arbitrarily long input strings to fixed-length outputs. For example, sha256 maps its input to a string of 256 bits. A cryptographically secure hash function h is a one-way function, i.e., given a message m, it’s easy to compute h(m), but it’s not practical to go the other way in order to learn anything about m from h(m). Secure hash functions are useful for message authentication codes because it is practically impossible to modify m without changing h(m).
Ideally, a secure hash is “indistinguishable from a random mapping.” So, if a hash function has a range of size N, how many items can we send through the hash function before we can expect two items to have same hash value? By the pigeonhole principle, we know that if we hash N+1 items, two of them are certain to have the same hash value. However, it’s likely that a much smaller number of inputs will lead to a collision, two items with the same hash value.
The famous birthday problem illustrates this. You could think of birthdays as a random mapping of people into 366 possible values. (Leap year, of course, complicates things, since February 29 birthdays are less common than other birthdays. Another complication is that birthdays are not entirely evenly distributed for the other days of the year. However, these complications don’t ruin the party trick; in a room of 30 people, two people usually share a birthday.) In a room of fewer than 366 people, it’s possible that everyone has a different birthday. However, in a group of 23 people, there are even odds that two people have the same birthday.
Variations on the birthday problem come up frequently (for example, in seeding random number generators). Importantly for this post, the birthday problem comes up in attacking hash functions.
When N is large, it is likely that hashing √N values will lead to a collision. We prove this below.
The proof below is a little informal. It could be made more formal by replacing the approximate equalities with equalities and adding the necessary little-o terms.
Suppose we’re hashing n items to a range of size N = n2. The exact probability that all n items have unique hash values is given in here. Taking the log of both sides gives us the first line of the proof below.
The first approximation is based on the first three terms in the asymptotic expansion for log Γ given here, applied to both log gamma expressions. Note that the third terms from the two asymptotic series are the same, so they cancel out. We can afford to be a little sloppy about the differences between n and n + 1 at this point because we’re interested in the limit as n goes to infinity. More concretely, we’ll be interested in values of n in the billions, so we’re not interested in the difference between hashing a billion items and hashing a billion and one items.
The second approximation comes from using the first two terms in the power series for log(1 + x). One term isn’t enough since that would reduce to zero. The final approximation is simply taking the limit as n goes to infinity. Again concretely, we’re willing to say that a billion and one divided by a billion is essentially one.
The probability of no collisions is exp(-1/2) or about 60%, which means there’s a 40% chance of at least one collision. As a rule of thumb, a hash function with a range of size N can hash on the order of √N values before running into collisions.
This means that with a 64-bit hash function, there’s about a 40% chance of collisions when hashing 232 or about 4 billion items. If the output of the hash function is discernibly different from random, the probability of collisions may be higher. A 64-bit hash function cannot be secure since an attacker could easily hash 4 billion items. A 256-bit or 512-bit hash could in principle be secure since one could expect to hash far more items before collisions are likely. Whether a particular algorithm like sha512 is actually secure is a matter for cryptologists, but it is at least feasible that a hash with a 512-bit range could be secure, based on the size of its range, while a 64-bit hash cannot be.
We used an asymptotic argument above rather than numerically evaluating the probabilities because this way we get a more general result. However, even if we were only interested in a fix but large n, we’d run into numerical problems. This is one of those not uncommon cases where a pencil-and-paper approximation is actually more accurate than direct calculation with no (explicit) approximations.
There are numerous numerical problems with direct calculation of the collision probability. First, without taking logarithms we’d run into overflow and underflow. Second, for large enough n, n2 – n = n2 in floating point representation. IEEE 754 doubles have 53 bits of precision. This isn’t enough to distinguish values that differ, say, in the 128th bit. Finally, the two log gamma terms are large, nearly equal numbers. The cardinal rule of numerical analysis is to avoid subtracting nearly equal numbers. If two numbers agree to k bits, you could lose k bits of precision in carrying out their difference. See this article for more along these lines.
Cryptography Engineering by Ferguson, Schneier, and Kohno.