When processing large data sets, we often want to do some simple checks, such as number of unique items, most frequent items, and whether some items exist in the data set. The common approach is to use some kind of deterministic data structure like HashSet or Hashtable for such purposes. But when the data set we are dealing with becomes very large, such data structures are simply not feasible because the data is too big to fit in the memory. It becomes even more difficult for streaming applications which typically require data to be processed in one pass and perform incremental updates.
Probabilistic data structures are a group of data structures that are extremely useful for big data and streaming applications. Generally speaking, these data structures use hash functions to randomize and compactly represent a set of items. Collisions are ignored but errors can be well-controlled under certain threshold. Comparing with error-free approaches, these algorithms use much less memory and have constant query time. They usually support union and intersection operations and therefore can be easily parallelized.
This article will introduce three commonly used probabilistic data structures: Bloom filter, HyperLogLog, and Count-Min sketch.
Membership Query - Bloom filter
A Bloom filter is a bit array of m bits initialized to 0. To add an element, feed it to k hash functions to get k array position and set the bits at these positions to 1. To query an element, feed it to k hash functions to obtain k array positions. If any of the bits at these positions is 0, then the element is definitely not in the set. If the bits are all 1, then the element might be in the set. A Bloom filter with 1% false positive rate only requires 9.6 bits per element regardless of the size of the elements.
For example, if we have inserted x, y, z into the bloom filter, with k=3 hash functions like the picture above. Each of these three elements has three bits each set to 1 in the bit array. When we look up for w in the set, because one of the bits is not set to 1, the bloom filter will tell us that it is not in the set.
Bloom filter has the following properties:
- False positive is possible when the queried positions are already set to 1. But false negative is impossible.
- Query time is O(k).
- Union and intersection of bloom filters with same size and hash functions can be implemented with bitwise OR and AND operations.
- Cannot remove an element from the set.
Bloom filter requires the following inputs:
m: size of the bit array
n: estimated insertion
The optimum number of hash functions k can be determined using the formula:
Given false positive probability p and the estimated number of insertions n, the length of the bit array can be calculated as:
The hash functions used for bloom filter should generally be faster than cryptographic hash algorithms with good distribution and collision resistance. Commonly used hash functions for bloom filter include Murmur hash, fnv series of hashes and Jenkins hashes. Murmur hash is the fastest among them. MurmurHash3 is used by Google Guava library's bloom filter implementation.
Cardinality - HyperLogLog
HyperLogLog is a streaming algorithm used for estimating the number of distinct elements (the cardinality) of very large data sets. HyperLogLog counter can count one billion distinct items with an accuracy of 2% using only 1.5 KB of memory. It is based on the bit pattern observation that for a stream of randomly distributed numbers, if there is a number x with the maximum of leading 0 bits k, the cardinality of the stream is very likely equal to 2^k.
For each element si in the stream, hash function h(si) transforms si into string of random bits (0 or 1 with probability of 1/2):
The probability P of the bit patterns:
0xxxx... → P = 1/2
01xxx... → P = 1/4
001xx... → P = 1/8
The intuition is that when we are seeing prefix 0k 1..., it's likely there are n ≥ 2k+1 different strings. By keeping track of prefixes 0k 1... that have appeared in the data stream, we can estimate the cardinality to be 2p, where p is the length of the largest prefix.
Because the variance is very high when using single counter, in order to get a better estimation, data is split into m sub-streams using the first few bits of the hash. The counters are maintained by m registers each has memory space of multiple of 4 bytes. If the standard deviation for each sub-stream is σ, then the standard deviation for the averaged value is only σ/√m. This is called stochastic averaging.
For instance for m=4,
The elements are split into m stream using the first 2 bits (00, 01, 10, 11) which are then discarded. Each of the register stores the rest of the hash bits that contains the largest 0k 1 prefix. The values in the m registers are then averaged to obtain the cardinality estimate.
HyperLogLog algorithm uses harmonic mean to normalize result. The algorithm also makes adjustment for small and very large values. The resulting error is equal to 1.04/√m.
Each of the m registers uses at most log2log2 n + O(1) bits when cardinalities ≤ n need to be estimated.
Union of two HyperLogLog counters can be calculated by first taking the maximum value of the two counters for each of the m registers, and then calculate the estimated cardinality.
Frequency - Count-Min Sketch
Count-Min sketch is a probabilistic sub-linear space streaming algorithm. It is somewhat similar to bloom filter. The main difference is that bloom filter represents a set as a bitmap, while Count-Min sketch represents a multi-set which keeps a frequency distribution summary.
The basic data structure is a two dimensional d x w array of counters with d pairwise independent hash functions h1 ... hd of range w. Given parameters (ε,δ), set w = [e/ε], and d = [ln1/δ]. ε is the accuracy we want to have and δ is the certainty with which we reach the accuracy. The two dimensional array consists of wd counts. To increment the counts, calculate the hash positions with the d hash functions and update the counts at those positions.
The estimate of the counts for an item is the minimum value of the counts at the array positions determined by the d hash functions.
The space used by Count-Min sketch is the array of w*d counters. By choosing appropriate values for d and w, very small error and high probability can be achieved.
Example of Count-Min sketch sizes for different error and probability combination:
|ε||1 - δ||w||d||wd|
Count-Min sketch has the following properties:
- Union can be performed by cell-wise ADD operation
- O(k) query time
- Better accuracy for higher frequency items (heavy hitters)
- Can only cause over-counting but not under-counting
Count-Min sketch can be used for querying single item count or "heavy hitters" which can be obtained by keeping a heap structure of all the counts.
Probabilistic data structures have many applications in modern web and data applications where the data arrives in a streaming fashion and needs to be processed on the fly using limited memory. Bloom filter, HyperLogLog, and Count-Min sketch are the most commonly used probabilistic data structures. There are a lot of research on various streaming algorithms, synopsis data structures and optimization techniques that are worth investigating and studying.
If you haven't tried these data structures, you will be amazed how powerful they can be once you start using them. It may be a little bit intimidating to understand the concept initially, but the implementation is actually quite simple. Google Guava has Bloom filter implementation using murmur hash. Clearspring's Java library stream-lib and Twitter's Scala library Algebird have implementation for all three data structures and other useful data structures that you can play with. I have included the links below.