DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. Data
  4. Understanding Bloom Filter

Understanding Bloom Filter

Giuseppe Vettigli user avatar by
Giuseppe Vettigli
·
Jan. 29, 13 · Interview
Like (1)
Save
Tweet
Share
6.39K Views

Join the DZone community and get the full member experience.

Join For Free

A Bloom Filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. It is based on a probabilistic mechanism where false positive retrieval results are possible, but false negatives are not. In this post we will see a pure python implementation of the Bloom Filter and the end we will see how to tune the parameters in order to minimize the number of false positive results. 

Let's begin with a little bit of theory. The idea behind the filter is to allocate a bit vector of length m, initially all set to 0, and then choose k independent hash functions, h1, h2, ..., hk, each with range [1 m]. When an element a is added to the set then the bits at positionsh(a)1, h(a)2, ..., h(a)k in the bit vector are set to 1. Given a query element q we can test whether it is in the set using the bits at positions h(a)1, h(a)2, ..., h(a)k in the vector. If any of these bits is 0 we report that q is not in the set otherwise we report that q is. The thing we have to care about is that in the first case there remains some probability that q is not in the set which could lead us to a false positive response. 
The following class is a naive implementation of a Bloom Filter (pay attention: this implementation is not supposed to be suitable for production. It is made just to show how a Bloom Filter works and to study its behavior):

classBloom:""" Bloom Filter """def __init__(self,m,k,hash_fun):"""
m, size of the vector
k, number of hash fnctions to compute
hash_fun, hash function to use
"""self.m = m
self.vector =[0]*m # initialize the vectorself.k = k
self.hash_fun = hash_fun
self.data ={}# data structure to store the dataself.false_positive =0def insert(self,key,value):""" insert the pair (key,value) in the database """self.data[key]= value
for i in range(self.k):self.vector[self.hash_fun(key+str(i))%self.m]=1def contains(self,key):""" check if key is cointained in the database
using the filter mechanism """for i in range(self.k):ifself.vector[self.hash_fun(key+str(i))%self.m]==0:returnFalse# the key doesn't existreturnTrue# the key can be in the data setdefget(self,key):""" return the value associated with key """ifself.contains(key):try:returnself.data[key]# actual lookupexceptKeyError:self.false_positive +=1
The usage of this filter is pretty easy, we have to initialize the data structure with a hash function, a value for k and the size of the bit vector then we can start adding items as in this example:

import hashlib




def hash_f(x):
h = hashlib.sha256(x)returnint(h.hexdigest(),base=16)




b =Bloom(100,10,hash_f)
b.insert('this is a key','this is a value')print b.get('this is a key')
Now, the problem is to choose the parameters of the filter in order to minimize the number of false positive results. We have that after inserting n elements into a table of size m, the probability that a particular bit is still 0 is exactly 


Hence, afer n insertions, the probability that a certain bit is 1 is 


So, for fixed parameters m and n, the optimal value k that minimizes this probability is 


With this in mind we can test our filter. The first thing we need is a function which tests the Bloom Filter for fixed values of m, n and k countinig the percentage of false positive:
import random




def rand_data(n, chars):""" generate random strings using the characters in chars """return''.join(random.choice(chars)for i in range(n))def bloomTest(m,n,k):""" return the percentage of false positive """
bloom =Bloom(m,k,hash_f)# generating a random data
rand_keys =[rand_data(10,'abcde')for i in range(n)]# pushing the items into the data structurefor rk in rand_keys:
bloom.insert(rk,'data')# adding other elements to the dataset
rand_keys = rand_keys +[rand_data(10,'fghil')for i in range(n)]# performing a query for each element of the datasetfor rk in rand_keys:
bloom.get(rk)returnfloat(bloom.false_positive)/n*100.0
If we fix m = 10000 and n = 1000, according to the equations above, we have that the value of k which minimizes the false positive number is around 6.9314. We can confirm that experimentally with the following test:
# testing the filter
m =10000
n =1000
k = range(1,64)
perc =[bloomTest(m,n,kk)for kk in k]# k is varying# plotting the result of the testfrom pylab import plot,show,xlabel,ylabel
plot(k,perc,'--ob',alpha=.7)
ylabel('false positive %')
xlabel('k')
show()
The result of the test should be as follows 


Looking at the graph we can confirm that for k around 7 we have the lowest false positive percentage.

Filter (software) Data structure

Published at DZone with permission of Giuseppe Vettigli, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • How Observability Is Redefining Developer Roles
  • Top Three Docker Alternatives To Consider
  • AWS Cloud Migration: Best Practices and Pitfalls to Avoid
  • Real-Time Stream Processing With Hazelcast and StreamNative

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: