Earth Mover Distance and T-Closeness
In this post, we take a look at the mathematics that allow databases to ensure the privacy and anonymity of thousands of users.
There's an old saying that if you want to hide a tree, put it in a forest. An analogous principle in privacy is that a record preserves privacy if it's like a lot of other records.
The idea of k-anonymity is that every database record appears at least k times. If you have a lot of records and few fields, your value of k could be high. But as you get more fields, it becomes more likely that a combination of fields is unique. If k = 1, then k-anonymity offers no anonymity.
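To make the definition concrete, here is a minimal sketch (not from the original post) of how one might measure the k in k-anonymity: group records by their combination of quasi-identifier fields and take the size of the smallest group. The field names and sample records are invented for illustration.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the k for which the records are k-anonymous:
    the size of the smallest group of records sharing the same
    combination of quasi-identifier values."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())

# Hypothetical records: zip and age are quasi-identifiers,
# diagnosis is the sensitive field.
records = [
    {"zip": "75001", "age": 34, "diagnosis": "flu"},
    {"zip": "75001", "age": 34, "diagnosis": "cold"},
    {"zip": "75002", "age": 41, "diagnosis": "flu"},
]
print(k_anonymity(records, ["zip", "age"]))  # 1: the third record is unique
```

With k = 1, as here, one record is uniquely identifiable by its quasi-identifiers alone.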
Another problem with k-anonymity is that it doesn't offer group privacy. A database could be k-anonymous but reveal information about a group if that group is homogeneous with respect to some field. That is, the method is subject to a homogeneity attack.
Or going the other way around, if you already know something that stands out about a group, this could help you identify the record belonging to an individual. That is, the method is subject to a background knowledge attack.
One way to address this shortcoming is l-diversity. This post won't go into l-diversity because it's an intermediate step to where we want to go, which is t-closeness.
The idea of t-closeness is that the distribution of sensitive data in every group is not too far from the distribution in the full population. The "t" comes from requiring that the distributions be no more than a distance t apart in a sense that we'll define below. If the sensitive data in a group doesn't stand out, this thwarts the homogeneity attack and the background knowledge attack.
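The definition can be sketched as a simple check (my own illustration, not from the post): compute the distribution of the sensitive attribute within each group, compare it to the overall distribution, and require every group to be within distance t. The distance function is left as a parameter; t-closeness uses the earth mover distance defined below, but the sketch below demonstrates the check with total variation distance for simplicity.

```python
from collections import Counter

def distribution(values, support):
    """Empirical distribution of values over a fixed, ordered support."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in support]

def is_t_close(groups, t, distance):
    """True if every group's sensitive-value distribution is within
    distance t of the distribution over the whole population."""
    all_values = [v for g in groups for v in g]
    support = sorted(set(all_values))
    overall = distribution(all_values, support)
    return all(distance(distribution(g, support), overall) <= t
               for g in groups)

# Hypothetical sensitive values, grouped by quasi-identifier class.
groups = [["flu", "flu", "cold"], ["flu", "cold", "cold"]]
# Total variation distance as a stand-in distance function.
tv = lambda p, q: 0.5 * sum(abs(a - b) for a, b in zip(p, q))
print(is_t_close(groups, 0.2, tv))  # True: each group is within 1/6 of overall
```

Here the overall distribution is 50/50, each group is 2/3 to 1/3, so each group sits at total variation distance 1/6 ≈ 0.17 from the population and the data is t-close for any t above that.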
Earth Mover Distance
When we say that the distribution on sensitive data within a group is not far from the distribution in the full data, how do we quantify what "far" means? That is, how do we measure the distance between two distributions?
There are a lot of ways to measure the similarity of two probability distributions. A common choice is the Kullback-Leibler divergence, though that's not what we'll use here. Instead, t-closeness uses the so-called earth mover distance (EMD), also known as the Wasserstein metric.
The idea of EMD is to imagine both probability distributions as piles of dirt and calculate the minimum amount of work needed to reshape the first pile so that it has the same shape as the second. The key attribute of EMD is that it takes distance into account.
Suppose your data is some ordered response, 1 through 5. Suppose distribution X has probability 0.8 at 1 and 0.05 for the rest of the responses. Distributions Y and Z are the same except they have 80% of their probability mass at 2 and at 5, respectively. By some measures, X is equally far from Y and Z, but the earth mover distance would say that X is closer to Y than to Z, which is more appropriate in our setting.
We can calculate the EMD for the example above. To transform X into Y, we need to move a probability mass of 0.75 from 1 to 2, and so the EMD is 0.75. To transform X into Z we need to move the same amount of mass, but we need to move it 4x further, and so the EMD is 3.
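For distributions on an ordered, equally spaced support like the 1-through-5 responses above, the EMD has a simple closed form: the sum of absolute differences between the two cumulative distributions. Here's a short sketch (my own, not from the post) that reproduces the calculation:

```python
def emd_1d(p, q):
    """Earth mover distance between two distributions on the same
    ordered, equally spaced support, computed as the sum of absolute
    differences of the running cumulative sums."""
    total, cum_p, cum_q = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cum_p += pi
        cum_q += qi
        total += abs(cum_p - cum_q)
    return total

# The distributions X, Y, Z from the example above.
x = [0.80, 0.05, 0.05, 0.05, 0.05]
y = [0.05, 0.80, 0.05, 0.05, 0.05]
z = [0.05, 0.05, 0.05, 0.05, 0.80]
print(emd_1d(x, y))  # ≈ 0.75
print(emd_1d(x, z))  # ≈ 3.0
```

Moving 0.75 of mass one step costs 0.75; moving it four steps costs 3, matching the hand calculation.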
It's common in math to use a variable name as an adjective, as with k-anonymity, l-diversity, and t-closeness. This is unfortunate because it isn't descriptive and locks in a variable naming convention. Someone used the variable k to count the number of redundant database records, and the name stuck. Similarly, the l in l-diversity counts something.
As a side note, the variable names here follow the old FORTRAN convention that variables i through n represent integers by default, and that all other variables represent real (floating point) numbers by default. The t in t-closeness is a continuous measure, so the t is a real value, while k and l are integers.
Published at DZone with permission of John Cook, DZone MVB. See the original article here.