DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Advanced Middleware Architecture For Secure, Auditable, and Reliable Data Exchange Across Systems
  • Delta Sharing vs Traditional Data Exchange: Secure Collaboration at Scale
  • SelfService HR Dashboards with Workday Extend and APIs
  • 5 Security Considerations for Deploying AI on Edge Devices

Trending

  • Detecting Bugs and Vulnerabilities in Java With SonarQube
  • No More Cheap Claude: 4 First Principles of Token Economics in 2026
  • AWS Kiro: The Agentic IDE That Makes Specs the Unit of Work
  • Solving the Mystery: Why Java RSS Grows in Docker on M1 Macs
  1. DZone
  2. Software Design and Architecture
  3. Security
  4. Probabilistic Data Structures for Software Security

Probabilistic Data Structures for Software Security

Learn how bloom filters and count-min sketch make security systems fast and scalable by trading perfect accuracy for speed and memory efficiency.

By 
Shashank Gollapudi user avatar
Shashank Gollapudi
·
Mar. 03, 26 · Analysis
Likes (1)
Comment
Save
Tweet
Share
1.6K Views

Join the DZone community and get the full member experience.

Join For Free

We are living in an era where software systems are growing in size with each passing day and often face a constant tension between the scale, performance, and security, where each of them is essential and non-negotiable. Security tools must process large volumes of data in real time (network logs, user activity, login attempts, password matches, etc.), but storing and analyzing data in traditional formats is slow, expensive, and often impractical. 

Traditional data structures (through databases, logs, hash tables, etc.) aim at providing exact answers for queries like 

  • Is the user blocked?
  • How many times has the user had a failed attempt?
  • Has the user logged in before?

At scale, questions like these do not require exact answers; they simply need a quick way to rule out suspicious behavior. This is where modern systems use probabilistic data structures, often balancing perfect accuracy with a faster, memory-efficient alternative at scale. In this post, we will explore two such data structures and the trade-offs they introduce.

Bloom Filters in Software Security

A bloom filter is a probabilistic data structure that can help answer if an element is present. It can help quickly answer questions like "Have I seen the user before?" while it provides high confidence when it says a user has not been seen, but may also provide false positives (i.e., occasionally gives a user as seen despite not seeing it).

Importantly, bloom filters do not store the actual data but rather a hashed representation of it, making them memory-efficient and privacy-preserving. Let's review this with a simple example of how this works.

Example:

Let's consider a use case where we may want to identify if a user has already visited before. For the scope of the post, we will use a simple HashSet of size 10, and whenever we see a user, we will mark the index with a value of 1.

Here is the initial state of the HashSet, where every index is set to 0.

0 0 0 0 0 0 0 0 0 0

We’ll use a simple hash function that determines the index based on the length of the user’s name. 

  1. Suppose a user named "Tom" visits the page. Since hash("Tom") = 3, we check the third position. Because it is currently 0, we can confidently say that the user has not visited before, and we mark it as 1
    0 0 1 0 0 0 0 0 0 0
  2. Next, a new user named "Sarah" visits the page. With hash("Sarah") = 5, we check the fifth position. Since it is also 0, this indicates that Rahul has not visited before, and we update the hash set: This illustrates how a Bloom filter can say with certainty that a user has not been seen before—there are no false negatives. 
    0 0 1 0 1 0 0 0 0 0
  3. Now consider another user named "Joe", whose hash value hash("Joe") = 3 conflicts with "Tom". Even though Joe has never visited the page, the third index is already set to 1. As a result, the system incorrectly indicates that Joe has visited before, demonstrating a false positive.
The above example is meant for simple illustration. While false positives cannot be completely eliminated, they can be significantly reduced by accurately estimating the scale of the system, choosing an appropriate HashSet size, and using multiple hash functions to add an extra layer of accuracy.

Count-Min Sketch in Software Security

A count-min sketch is a probabilistic data structure that quickly helps identify the frequency of an event. It helps answer questions like "How often has this occurred?"

Consider a scenario where we want to block users after two attempts. Below is a simplified example using a single hash function. (In real-world systems, multiple hash functions are used to reduce collisions.) Each visit from a user increments a counter at the hashed position. For simplicity, we determine the position based on the length of the user’s name.

Here is the initial state of the HashSet, where every index is set to 0.

0 0 0 0 0 0 0 0 0 0
  1. Suppose a user named "Tom" visits the page. Since hash("Tom") = 3, We check the third position. Since the value is 0, we can infer that Tom has not visited before, and we increment the counter:
    0 0 1 0 0 0 0 0 0 0
  2. Next, a user named "Sarah" visits the page. With hash("Sarah") = 5, we check the fifth position. As it is also 0, this indicates Sarah has not visited before, and we update the hash set:
    0 0 1 0 1 0 0 0 0 0
  3. When Tom visits again, we increment the existing value at index 3, 
    0 0 2 0 1 0 0 0 0 0
  4. If Tom visits a third time, the counter at index 3exceeds our threshold. Based on our initial condition, we treat this as system abuse and block the user.
    This example demonstrates how a count-in sketch-style approach can be used to approximate visit counts and enforce rate limits, while accepting the possibility of overcounting due to hash collisions, which can be significantly reduced with multiple properly defined hash functions.

Final Thoughts

Bloom filters and count-min sketches show how probabilistic data structures can make software security fast, scalable, and memory-efficient. While they trade perfect accuracy for performance, their guarantees — no false negatives in bloom filters and bounded overestimation in count-min sketches — make them ideal for early detection, abuse prevention, and rate limiting. Used wisely, they turn massive data streams into actionable security insights.

Data (computing) security

Opinions expressed by DZone contributors are their own.

Related

  • Advanced Middleware Architecture For Secure, Auditable, and Reliable Data Exchange Across Systems
  • Delta Sharing vs Traditional Data Exchange: Secure Collaboration at Scale
  • SelfService HR Dashboards with Workday Extend and APIs
  • 5 Security Considerations for Deploying AI on Edge Devices

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook