
Counting Distinct Values With HyperLogLog

Using Redis and HyperLogLog to count distinct values with fast, constant-memory estimates instead of 100% accuracy.

By Moritz Plassnig · Apr. 15, 16 · Tutorial


Counting distinct values is a trivial matter with small datasets, but it gets dramatically harder for streams with millions of distinct points. Absolute counting accuracy can be attained through a set, but then you have the undesirable tradeoff of linear memory growth.

What you really want when dealing with enormous datasets is something with predictable storage and performance characteristics. That is precisely what HyperLogLog (HLL) is — an algorithm optimized for counting the distinct values within a stream of data. Enormous streams of data. It is an algorithm that has become invaluable for real-time analytics. The canonical use case is tracking unique visitors to a website, but other use cases extend far beyond that.

Before HyperLogLog, there was the LogLog algorithm. Like HLL, it traded absolute accuracy for performance, speed, and a minor error rate. The author of LogLog, Philippe Flajolet, knew the error rate was too high and endeavored to create a successor that had fewer flaws. The result was HyperLogLog, which boasts a lower error rate (0.81%) and even better performance.

Fortunately, you don’t need to implement the data structure yourself in order to make use of the HLL goodness. Redis has a highly tuned implementation of HLL available as a native data structure.
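Still, the core idea is easy to see in miniature: hash each element, use a few bits of the hash to pick a register, track the longest run of leading zeros observed per register, and combine the registers with a harmonic mean. The toy Ruby sketch below illustrates that idea only; it is not Redis's implementation, which adds sparse encoding and bias corrections.

```ruby
require 'digest'

# Toy HyperLogLog for illustration. The first B bits of each hash choose a
# register; the remaining bits contribute their leading-zero run length.
class ToyHLL
  B = 10                            # 2**10 = 1024 registers
  M = 1 << B
  ALPHA = 0.7213 / (1 + 1.079 / M)  # standard bias-correction constant

  def initialize
    @registers = Array.new(M, 0)
  end

  def add(item)
    hash  = Digest::SHA1.hexdigest(item.to_s).to_i(16) & ((1 << 64) - 1)
    index = hash >> (64 - B)                 # first B bits pick the register
    rest  = hash & ((1 << (64 - B)) - 1)
    rank  = (64 - B) - rest.bit_length + 1   # position of the leftmost 1 bit
    @registers[index] = rank if rank > @registers[index]
  end

  def count
    # Harmonic mean of the registers yields the cardinality estimate.
    sum = @registers.sum { |r| 2.0**-r }
    (ALPHA * M * M / sum).round
  end
end

hll = ToyHLL.new
100_000.times { |i| hll.add("item-#{i}") }
estimate = hll.count
puts estimate  # lands within a few percent of the true count of 100,000
```

With 1,024 registers the expected standard error is about 1.04/√1024 ≈ 3.25%, which is why the production-quality version in Redis uses 16,384 registers to reach 0.81%.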

Using HLL Within Redis

You can use the HyperLogLog data structure to count unique elements in a set using just a small, constant amount of memory: 12 KB per structure (plus a few bytes for the key itself).

Redis provides a couple of simple but powerful commands for working with HyperLogLog structures. The commands are all prefixed with PF, in honor of Philippe Flajolet, and are simple to experiment with. The Redis documentation is excellent and thoroughly describes the use of each command in more detail than we need.

For now, we’ll skim the basic usage of the PFADD, PFCOUNT, and PFMERGE commands. Start a new redis-cli session to experiment with and follow along:

127.0.0.1:6379> PFADD words this is a sentence about stuff
(integer) 1
127.0.0.1:6379> PFADD words this is another sentence about stuff
(integer) 1
127.0.0.1:6379> PFCOUNT words
(integer) 7
127.0.0.1:6379> PFADD alt-words stuff is a great word
(integer) 1
127.0.0.1:6379> PFCOUNT words alt-words
(integer) 9


What we’ve done here is use PFADD to create two structures, one with the key words and another with the key alt-words. After creating the keys, we used PFCOUNT to check the cardinality at each key. After the creation of alt-words, we call PFCOUNT with multiple keys, which merges the structures in memory and gives us the combined cardinality.
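When you want to keep the merged structure around rather than recompute it, PFMERGE stores the union at a destination key (the key name all-words below is our own choice). Continuing the session above:

```
127.0.0.1:6379> PFMERGE all-words words alt-words
OK
127.0.0.1:6379> PFCOUNT all-words
(integer) 9
```

Because merging HLLs is deterministic, the stored union reports the same cardinality as the transient multi-key PFCOUNT.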

It’s that simple. With such a tiny data set, we can easily see that the counts are correct. Of course, it gets far more interesting when estimating larger uncountable datasets.

Setting Up an Experiment

To make things interesting, let’s set up an experiment that leverages HLL outside the realm where it’s normally applied.

Cardinality as a measurement only has a single dimension. It becomes more informative when given a second dimension, such as change over time. We can use HLL to measure the changes in a stream of data, for example, the vocabulary of a writer over their career. This approach could be used on a Twitter stream, a blog feed, or a series of books.

Given that Redis uses linear counting for structures with fewer than 40,000 distinct values, we’ll want a large sample set. Many of Charles Dickens’s works are freely available in plain-text format from Project Gutenberg. That should make for an expansive sample!
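For intuition about that small-cardinality path, here is a minimal sketch of a linear-counting estimator. It illustrates the general technique, not Redis's internal code: hash each item into a bitmap of M buckets, then estimate the cardinality from the fraction of buckets still empty.

```ruby
require 'digest'

# Linear counting: mark one of M buckets per item. With V buckets still empty,
# the estimate is M * ln(M / V), accurate while the bitmap stays sparse.
M = 16_384
bitmap = Array.new(M, false)

5_000.times do |i|
  bucket = Digest::SHA1.hexdigest("word-#{i}").to_i(16) % M
  bitmap[bucket] = true
end

empty = bitmap.count(false)
estimate = (M * Math.log(M.to_f / empty)).round
puts estimate  # close to the true count of 5,000
```

Once the structure fills up and few buckets remain empty, this estimator degrades, which is why an implementation would switch over to the HLL registers for large cardinalities.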

My hypothesis is that there will be a significant jump in vocabulary between each book — in addition to the small changes in subject and characters. Let’s see if Dickens was sporting a 40,000-word vocabulary…

Adding and Counting the Data

The interface for adding and counting values is extremely simple. Here we’ll define a simple Ruby script that will download a sample of books as raw text. Each book is then processed line by line, lightly normalized, and then pushed into Redis.

Unfortunately, the URLs are not named in a very friendly way. Just know that this sample includes A Tale of Two Cities, Great Expectations, David Copperfield, and a few other popular classics.

require 'open-uri'
require 'redis'

# Project Gutenberg plain-text editions of six Dickens works.
URLS = %w[
  http://www.gutenberg.org/ebooks/98.txt.utf-8
  http://www.gutenberg.org/ebooks/1400.txt.utf-8
  http://www.gutenberg.org/ebooks/730.txt.utf-8
  http://www.gutenberg.org/ebooks/766.txt.utf-8
  http://www.gutenberg.org/ebooks/19337.txt.utf-8
  http://www.gutenberg.org/ebooks/700.txt.utf-8
]

BOOKS = URLS.map(&File.method(:basename))
REDIS = Redis.new

URLS.each do |url|
  name = File.basename(url)
  text = URI.open(url)  # plain Kernel#open also works on Rubies before 3.0

  text.each_line do |line|
    # Normalize lightly (downcase, split on whitespace) and push the whole
    # line's words into the book's HyperLogLog in one PFADD call.
    REDIS.pfadd(name, line.split(/\s+/).map(&:downcase))
  end
end
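The split-and-downcase normalization above leaves punctuation attached to words, so "stuff" and "stuff." count as two distinct values and inflate the totals slightly. If you want tighter counts, a stricter tokenizer could be swapped in; this one is a sketch of ours, not part of the original script:

```ruby
# Stricter tokenizer: strip anything that isn't a letter or an apostrophe,
# so "stuff." and "stuff" count as one word.
def tokenize(line)
  line.downcase.split(/\s+/)
      .map { |w| w.gsub(/[^a-z']/, '') }
      .reject(&:empty?)
end

p tokenize("It was the best of times, it was the worst of times.")
```

The PFADD call would then become REDIS.pfadd(name, tokenize(line)).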


With all of the words stored, we can do some light exploration using PFCOUNT:

BOOKS.each do |name|
  puts "#{name}: #{REDIS.pfcount(name)}"
end

puts "All: #{REDIS.pfcount(*BOOKS)}"


That outputs the word count for each book, followed by the merged value for all of the works:

98.txt.utf-8:    18,583
1400.txt.utf-8:  21,384
730.txt.utf-8:   20,532
766.txt.utf-8:   31,601
19337.txt.utf-8:  7,405
700.txt.utf-8:   24,045
All:             65,454


Astoundingly, the cumulative distinct word count is over 65,000 words! Also astounding, but less apparent from the printed results, is that the values are reported instantaneously.

Powerful Analysis, Even Without Big Data

This is an intentionally simple example of how to use HLL for a trivial problem. However, once you grasp the fundamentals of the data structure, it opens up new avenues of analysis.

When given a problem dealing with unique values, most developers would instinctually reach for a set. Sets are perfectly suited to storing unique values, but that storage comes at a cost. In situations where the cardinality is important, such as unique visits, distinct search terms, or vocabulary, precise values are immaterial. Keep HyperLogLog in mind — it’s a data structure uniquely suited to the task of counting distinct values.
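To put that tradeoff in rough numbers (the per-entry figure below is an assumed ballpark for a Ruby Set of short strings, not a measured value):

```ruby
# Back-of-envelope comparison: a Set must store every distinct member, so its
# memory grows linearly with cardinality, while a Redis HLL stays at 12 KB.
PER_ENTRY_BYTES = 80          # assumed rough cost per short string in a Set
distinct_items  = 1_000_000

set_bytes = distinct_items * PER_ENTRY_BYTES
hll_bytes = 12 * 1024

puts "Set: ~#{set_bytes / (1024 * 1024)} MB"   # tens of megabytes
puts "HLL:  #{hll_bytes / 1024} KB"            # constant 12 KB
```

Even if the per-entry assumption is off by a factor of two in either direction, the gap between linear and constant storage only widens as the stream grows.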


Published at DZone with permission of Moritz Plassnig, DZone MVB. See the original article here.

