Probabilistic Data Structures for Software Security
Learn how bloom filters and count-min sketch make security systems fast and scalable by trading perfect accuracy for speed and memory efficiency.
Join the DZone community and get the full member experience.
Join For FreeWe are living in an era where software systems are growing in size with each passing day and often face a constant tension between the scale, performance, and security, where each of them is essential and non-negotiable. Security tools must process large volumes of data in real time (network logs, user activity, login attempts, password matches, etc.), but storing and analyzing data in traditional formats is slow, expensive, and often impractical.
Traditional data structures (through databases, logs, hash tables, etc.) aim at providing exact answers for queries like
- Is the user blocked?
- How many times has the user had a failed attempt?
- Has the user logged in before?
At scale, questions like these do not require exact answers; they simply need a quick way to rule out suspicious behavior. This is where modern systems use probabilistic data structures, often balancing perfect accuracy with a faster, memory-efficient alternative at scale. In this post, we will explore two such data structures and the trade-offs they introduce.
Bloom Filters in Software Security
A bloom filter is a probabilistic data structure that can help answer if an element is present. It can help quickly answer questions like "Have I seen the user before?" while it provides high confidence when it says a user has not been seen, but may also provide false positives (i.e., occasionally gives a user as seen despite not seeing it).
Importantly, bloom filters do not store the actual data but rather a hashed representation of it, making them memory-efficient and privacy-preserving. Let's review this with a simple example of how this works.
Example:
Let's consider a use case where we may want to identify if a user has already visited before. For the scope of the post, we will use a simple HashSet of size 10, and whenever we see a user, we will mark the index with a value of 1.
Here is the initial state of the HashSet, where every index is set to 0.
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
We’ll use a simple hash function that determines the index based on the length of the user’s name.
- Suppose a user named "Tom" visits the page. Since
hash("Tom") = 3, we check the third position. Because it is currently0, we can confidently say that the user has not visited before, and we mark it as10 0 1 0 0 0 0 0 0 0 - Next, a new user named "Sarah" visits the page. With
hash("Sarah") = 5, we check the fifth position. Since it is also0, this indicates that Rahul has not visited before, and we update the hash set: This illustrates how a Bloom filter can say with certainty that a user has not been seen before—there are no false negatives.0 0 1 0 1 0 0 0 0 0 -
Now consider another user named "Joe", whose hash value
hash("Joe") = 3conflicts with "Tom". Even though Joe has never visited the page, the third index is already set to1. As a result, the system incorrectly indicates that Joe has visited before, demonstrating a false positive.
Count-Min Sketch in Software Security
A count-min sketch is a probabilistic data structure that quickly helps identify the frequency of an event. It helps answer questions like "How often has this occurred?"
Consider a scenario where we want to block users after two attempts. Below is a simplified example using a single hash function. (In real-world systems, multiple hash functions are used to reduce collisions.) Each visit from a user increments a counter at the hashed position. For simplicity, we determine the position based on the length of the user’s name.
Here is the initial state of the HashSet, where every index is set to 0.
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
- Suppose a user named "Tom" visits the page. Since
hash("Tom") = 3, We check the third position. Since the value is0, we can infer that Tom has not visited before, and we increment the counter:0 0 1 0 0 0 0 0 0 0 -
Next, a user named "Sarah" visits the page. With
hash("Sarah") = 5, we check the fifth position. As it is also0, this indicates Sarah has not visited before, and we update the hash set:0 0 1 0 1 0 0 0 0 0 -
When Tom visits again, we increment the existing value at index
3,0 0 2 0 1 0 0 0 0 0 -
If Tom visits a third time, the counter at index
3exceeds our threshold. Based on our initial condition, we treat this as system abuse and block the user.This example demonstrates how a count-in sketch-style approach can be used to approximate visit counts and enforce rate limits, while accepting the possibility of overcounting due to hash collisions, which can be significantly reduced with multiple properly defined hash functions.
Final Thoughts
Bloom filters and count-min sketches show how probabilistic data structures can make software security fast, scalable, and memory-efficient. While they trade perfect accuracy for performance, their guarantees — no false negatives in bloom filters and bounded overestimation in count-min sketches — make them ideal for early detection, abuse prevention, and rate limiting. Used wisely, they turn massive data streams into actionable security insights.
Opinions expressed by DZone contributors are their own.
Comments