Earth Mover Distance and T-Closeness

In this post, we take a look at the mathematics that allow databases to ensure the privacy and anonymity of thousands of users.

by John Cook · Nov. 04, 18 · Analysis

There's an old saying that if you want to hide a tree, put it in a forest. An analogous principle in privacy is that a record preserves privacy if it's like a lot of other records.

k-anonymity

The idea of k-anonymity is that every database record appears at least k times: each record shares its combination of field values with at least k − 1 other records. If you have a lot of records and few fields, your value of k could be high. But as you add fields, it becomes more likely that some combination of field values is unique. If k = 1, then k-anonymity offers no anonymity.
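
To make the effect of adding fields concrete, here is a minimal sketch that computes k as the size of the smallest group of records sharing the same field values. The function name `k_anonymity`, the field names, and the toy records are all hypothetical, chosen only for illustration.

```python
from collections import Counter

def k_anonymity(records, fields):
    """Return the largest k for which `records` is k-anonymous on `fields`.

    This is the size of the smallest group of records that share the same
    combination of values in `fields`.
    """
    groups = Counter(tuple(rec[f] for f in fields) for rec in records)
    return min(groups.values())

# Hypothetical toy data; the field names are made up for illustration.
records = [
    {"zip": "77001", "age": "30-39", "sex": "F"},
    {"zip": "77001", "age": "30-39", "sex": "F"},
    {"zip": "77001", "age": "30-39", "sex": "M"},
    {"zip": "77002", "age": "40-49", "sex": "M"},
    {"zip": "77002", "age": "40-49", "sex": "M"},
    {"zip": "77002", "age": "40-49", "sex": "M"},
]

# Two fields: two groups of three records each.
k_two_fields = k_anonymity(records, ["zip", "age"])
# Three fields: one record becomes unique, so k drops to 1.
k_three_fields = k_anonymity(records, ["zip", "age", "sex"])
```

With two fields the toy data is 3-anonymous. Adding a third field isolates a single record, and the anonymity is gone.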

Another problem with k-anonymity is that it doesn't offer group privacy. A database could be k-anonymous but reveal information about a group if that group is homogeneous with respect to some field. That is, the method is subject to a homogeneity attack.
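
A homogeneity attack can be illustrated with a small sketch that looks for groups whose records all carry the same sensitive value. The helper `homogeneous_groups`, the field names, and the data are hypothetical.

```python
from collections import defaultdict

def homogeneous_groups(records, fields, sensitive):
    """Return the groups (keyed by their values in `fields`) whose records
    all share a single value of the `sensitive` field."""
    seen = defaultdict(set)
    for rec in records:
        key = tuple(rec[f] for f in fields)
        seen[key].add(rec[sensitive])
    return {key: next(iter(vals)) for key, vals in seen.items() if len(vals) == 1}

# Hypothetical data: 3-anonymous on (zip, age), yet one group is homogeneous.
records = [
    {"zip": "77001", "age": "30-39", "diagnosis": "flu"},
    {"zip": "77001", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "77001", "age": "30-39", "diagnosis": "flu"},
    {"zip": "77002", "age": "40-49", "diagnosis": "diabetes"},
    {"zip": "77002", "age": "40-49", "diagnosis": "diabetes"},
    {"zip": "77002", "age": "40-49", "diagnosis": "diabetes"},
]

# Anyone known to be in the ("77002", "40-49") group is revealed to have
# the one diagnosis that group contains, even though k = 3.
leaks = homogeneous_groups(records, ["zip", "age"], "diagnosis")
```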

Or going the other way around, if you already know something that stands out about a group, this could help you identify the record belonging to an individual. That is, the method is subject to a background knowledge attack.

One way to address this shortcoming is l-diversity. This post won't go into l-diversity because it's an intermediate step to where we want to go, which is t-closeness.

t-closeness

The idea of t-closeness is that the distribution of sensitive data in every group is not too far from the distribution in the full population. The "t" comes from requiring that the distributions be no more than a distance t apart in a sense that we'll define below [1]. If the sensitive data in a group doesn't stand out, this thwarts the homogeneity attack and the background knowledge attack.
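
The requirement can be sketched as a check over groups. Because the distance is only defined in the next section, this sketch takes it as a parameter. The names `distribution`, `worst_group_distance`, and `is_t_close`, and the record layout (a list of dicts), are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def distribution(values, categories):
    """Relative frequency of each category, in the order given."""
    counts = Counter(values)
    return [counts[c] / len(values) for c in categories]

def worst_group_distance(records, fields, sensitive, categories, distance):
    """Largest distance from any group's sensitive-value distribution
    to the distribution over all records."""
    overall = distribution([rec[sensitive] for rec in records], categories)
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[f] for f in fields)].append(rec[sensitive])
    return max(distance(distribution(vals, categories), overall)
               for vals in groups.values())

def is_t_close(records, fields, sensitive, categories, distance, t):
    """True if every group is within distance t of the full population."""
    worst = worst_group_distance(records, fields, sensitive, categories, distance)
    return worst <= t
```

Here `distance` is any function of two probability vectors. For t-closeness as described in this post, it would be the earth mover distance.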

Earth Mover Distance

When we say that the distribution on sensitive data within a group is not far from the distribution in the full data, how do we quantify what "far" means? That is, how do we measure the distance between two distributions?

There are a lot of ways to measure the similarity of two probability distributions. A common choice is the Kullback-Leibler divergence, though that's not what we'll use here. Instead, t-closeness uses the so-called earth mover distance (EMD), also known as the Wasserstein metric.

The idea of EMD is to imagine both probability distributions as piles of dirt and calculate the minimum amount of work needed to reshape the first pile so that it has the same shape as the second. The key attribute of EMD is that it takes distance into account.
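
The dirt-moving picture can be written as a small optimization problem. The notation here is an assumption, not taken from the post: P = (p_1, ..., p_m) and Q = (q_1, ..., q_m) are two distributions on the same m values, d_ij is the distance between values i and j, and f_ij is the amount of mass moved from i to j.

```latex
\mathrm{EMD}(P, Q) = \min_{f} \sum_{i=1}^{m} \sum_{j=1}^{m} f_{ij}\, d_{ij}
\quad \text{subject to} \quad
f_{ij} \ge 0, \qquad
\sum_{j=1}^{m} f_{ij} = p_i, \qquad
\sum_{i=1}^{m} f_{ij} = q_j .
```

The constraints say that all of the mass at each i is moved somewhere and that each j ends up with exactly the mass Q assigns it. The factor d_ij is how distance enters the cost.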

Suppose your data is some ordered response, 1 through 5. Suppose distribution X has probability 0.8 at 1 and 0.05 for the rest of the responses. Distributions Y and Z are the same except they have 80% of their probability mass at 2 and at 5, respectively. By some measures, X is equally far from Y and Z, but the earth mover distance would say that X is closer to Y than to Z, which is more appropriate in our setting.
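
As a quick check that some measures really do treat Y and Z alike, the sketch below computes two that ignore the ordering of the responses: total variation distance and the Kullback-Leibler divergence. Choosing these two as the examples is an assumption, since the post does not say which measures it has in mind.

```python
import math

# Probabilities of responses 1 through 5.
X = [0.80, 0.05, 0.05, 0.05, 0.05]
Y = [0.05, 0.80, 0.05, 0.05, 0.05]
Z = [0.05, 0.05, 0.05, 0.05, 0.80]

def total_variation(p, q):
    """Half the sum of absolute differences between the probabilities."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """Kullback-Leibler divergence of p from q; assumes all entries are positive."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

tv_xy, tv_xz = total_variation(X, Y), total_variation(X, Z)
kl_xy, kl_xz = kl_divergence(X, Y), kl_divergence(X, Z)

# Both measures only compare probabilities position by position, so moving
# the heavy response from 2 to 5 leaves them unchanged up to rounding.
print(math.isclose(tv_xy, tv_xz), math.isclose(kl_xy, kl_xz))
```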

We can calculate the EMD for the example above. To transform X into Y, we need to move a probability mass of 0.75 from 1 to 2, and so the EMD is 0.75. To transform X into Z we need to move the same amount of mass, but we need to move it four times as far, and so the EMD is 3.
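
The arithmetic above can be reproduced with a short sketch. For distributions on ordered, evenly spaced values, the minimum work equals the sum of the absolute differences between the cumulative sums. The function name `emd_ordered` is hypothetical, and the unit spacing between adjacent responses is an assumption that matches the calculation above.

```python
from itertools import accumulate

def emd_ordered(p, q):
    """Earth mover distance between two distributions on the same ordered,
    evenly spaced values (adjacent values are one unit apart).

    The mass that has to cross the gap between positions i and i + 1 is the
    difference between the cumulative sums at i, so the total work is the
    sum of those absolute differences.
    """
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    # The final cumulative sums are both 1 (up to rounding), so the last
    # term contributes essentially nothing.
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))

# Probabilities of responses 1 through 5.
X = [0.80, 0.05, 0.05, 0.05, 0.05]
Y = [0.05, 0.80, 0.05, 0.05, 0.05]
Z = [0.05, 0.05, 0.05, 0.05, 0.80]

emd_xy = emd_ordered(X, Y)  # approximately 0.75
emd_xz = emd_ordered(X, Z)  # approximately 3.0
```

The results agree with the hand calculation up to floating point rounding: X is four times as far from Z as it is from Y.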

[1] It's common in math to use a variable name as an adjective, as with k-anonymity, l-diversity, and t-closeness. This is unfortunate because it isn't descriptive and locks in a variable naming convention. Someone used the variable k to count the number of redundant database records, and the name stuck. Similarly, the l in l-diversity counts something.

As a side note, the variable names here follow the old FORTRAN convention that variables i through n represent integers by default, and that all other variables represent real (floating point) numbers by default. The t in t-closeness is a continuous measure, so the t is a real value, while k and l are integers.

Published at DZone with permission of John Cook, DZone MVB. See the original article here.