# HyperLogLogs in Redis

# HyperLogLogs in Redis

### It's not a log, it's a HyperLogLog. See how this data structure can give you a better, more accurate count of your information.

Join the DZone community and get the full member experience.

Join For FreeDiscover Tarantool's unique features which include powerful stored procedures, SQL support, smart cache, and the speed of 1 million ACID transactions on a single CPU core!

*A hyper-what-now?*

A HyperLogLog is a probabilistic data structure used to count unique values — or as it’s referred to in mathematics: calculating the cardinality of a set.

These values can be anything: for example, IP addresses for the visitors of a website, search terms, or email addresses.

Counting unique values with *exact precision* requires an amount of memory proportional to the number of unique values. The reason for this is that there is no way of determining if a value has already been seen other than by comparing it to the previously seen values.

Because memory is a limited resource, doing this becomes problematic when working with *large sets of values*.

A HyperLogLog solves this problem by allowing to trade memory consumption for precision making it possible to estimate cardinalities larger than 10^{9} with a standard error of 2% using only 1.5 kilobytes of memory.

## How to Use HyperLogLogs in Redis

HyperLogLogs have been available in Redis since version 2.8.9 and are accessed using the `PFADD`

, `PFCOUNT`

, and `PFMERGE`

commands.

A Redis HyperLogLog consumes, at most, 12 kilobytes of memory and produces approximations with a standard error of 0.81%. The 12 kilobytes do not include the bytes required to store the actual key.

Ignoring `PFMERGE`

, using a HyperLogLog is very similar to using a Set for the same purpose: Instead of adding members with `SADD`

, add them with `PFADD`

, and, instead of retrieving the cardinality with `SCARD`

, retrieve it with `PFCOUNT`

.

The example below shows the output of an interactive Redis session. You can follow along by running `redis-cli`

and entering the commands shown on the lines beginning with `>`

. Make sure an instance of `redis-server`

is also running.

```
> PFADD visitors alice bob carol
(integer) 1
> PFCOUNT visitors
(integer) 3
```

`PFADD`

returns `1`

if the approximated cardinality of the HyperLogLog was changed when adding the element. If not, it returns `0`

.

After calculating the cardinality, `PFCOUNT`

will store the calculated value in the last 8 bytes of the HyperLogLog to serve as a cache. This cache is invalidated on the next `PFADD`

.

The `PFMERGE`

command is used to produce a HyperLogLog that approximates the cardinality of the union of two or more existing HyperLogLogs:

```
> PFADD customers alice dan
(integer) 1
> PFMERGE everyone visitors customers
OK
> PFCOUNT everyone
(integer) 4
```

The same result can be achieved by supplying multiple keys to `PFCOUNT`

. In this case an on-the-fly merge is performed:

```
> PFCOUNT visitors customers
(integer) 4
```

Under the hood, HyperLogLogs are actually stored as strings and can be `GET`

and `SET`

as such:

```
> GET visitors
"HYLL\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00E<\x94X\x10\x84Qi\x8cQD"
> SET visitors "HYLL\x01\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00E<\x94X\x10\x84Qi\x8cQD"
OK
> PFCOUNT visitors
(integer) 3
```

This is useful if the HyperLogLog needs to be persisted elsewhere and later restored.

The `PFADD`

, `PFCOUNT`

, and `PFMERGE`

commands are prefixed with `PF`

in memory of Philippe Flajolet, the inventor of the HyperLogLog algorithm.

## The Theory Behind HyperLogLogs

To get started, let’s look at something more familiar than HyperLogLogs.

Imagine someone flipping a coin multiple times, while recording the maximum number of heads they get in a row. If they told you this number, would you be able to estimate how many times they’d flipped the coin?

Not with great accuracy, but since a longer run of heads is more probable when flipping the coin many times, the number of consecutive heads or tails in a series of coin flips *can* indicate how many times the coin was flipped: If the number of heads in a row is low, the number of coin flips is most probably *also* low. Conversely, if the number of heads in a row is high, the number of coin flips is most probably high.

If the same number of coin flips were done and recorded separately, a much more precise estimate could be made by combining these numbers. This can be observed using the following methods:

```
def coin_flips(n)
Array.new(n) { [:heads, :tails].sample }
end
def heads_in_a_row(flips)
run = max = 0
flips.each do |flip|
if flip == :heads
run += 1
else
max = [run, max].max
run = 0
end
end
max
end
class Array
def average
reduce(:+).to_f / size
end
end
```

The bigger the number of coin flips, the bigger the number of heads in a row.

```
heads_in_a_row coin_flips(10) #=> 3
heads_in_a_row coin_flips(100) #=> 5
heads_in_a_row coin_flips(1000) #=> 9
heads_in_a_row coin_flips(10000) #=> 15
```

The bigger the number of series, the better the stability of the output. With a more stable output, a better estimate can be made:

```
heads_in_a_row coin_flips(10) #=> 3
heads_in_a_row coin_flips(10) #=> 7
heads_in_a_row coin_flips(10) #=> 4
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.449
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.442
Array.new(1000) { heads_in_a_row coin_flips(10) }.average #=> 2.469
```

Similarly to how the number of coin flips can be estimated by observing the number of heads in a row, the calculations of a HyperLogLog are based on the observation that the cardinality of a set of *uniformly distributed random numbers* can be estimated by counting the maximum number of leading zeros in the binary representation of the numbers.

For there to *be* leading zeros, the numbers must be represented as integers with the same number of bits, for instance as 64-bit unsigned integers.

If the maximum number of leading zeros is *n*, the estimated number of unique values in the set is 2^{n}.

To make sure values are uniformly distributed and represented as same-size integers, the HyperLogLog algorithm applies a hash function to all values. This transforms the values into a set of uniformly distributed random numbers with the same cardinality as the original set.

The first few bits of the hashed values are used to divide them into different subsets, much like the separate series of coin flips in the example above. For each subset, the maximum number of leading zeros within its values are stored in a register.

To calculate the approximate cardinality of the whole set, the estimates for all the subsets are combined using a harmonic mean.

## An Example Use Case

Here’s an example of how to use a Redis HyperLogLog to count unique visitors of a site built with Sinatra using a custom Rack middleware:

```
require 'redis'
require 'sinatra'
$redis = Redis.new
class VisitorCounter
def initialize(app)
@app = app
end
def call(env)
$redis.pfadd 'visitors', Rack::Request.new(env).ip
@app.call(env)
end
end
use VisitorCounter
get '/' do
visitors = $redis.pfcount 'visitors'
"This website has been visited by #{visitors} unique visitors."
end
```

And there you have it. You've seen how you can use HyperLogLogs can be used in Redis to get a more accurate picture of your data.

Discover Tarantool's unique features such as powerful stored procedures, SQL support, smart cache, and the speed of 1 million ACID transactions on a single CPU.

Published at DZone with permission of Calle Erlandsson , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}