Big Aggregate Queries Can Still Violate Privacy

Limiting queries to aggregates of large sets is not enough. You can still get the information you want from small sets based on the information in large sets.

Suppose you want to prevent your data science team from being able to find out information on individual customers, but you do want them to be able to get overall statistics. So you implement two policies:

  1. Data scientists can only query aggregate statistics, such as counts and averages.
  2. These aggregate statistics must be based on results that return at least 1,000 database rows.

This sounds good, but it's naive. It's not enough to protect customer privacy.

Someone wants to know how much money customer 123456789 makes. If he asks for this person's income, the query would be blocked by the first rule. If he asks for the average income of customers with ID 123456789, then he gets past the first rule but not the second.

He decides to test whether the customer in question makes a six-figure income. So first, he queries for the number of customers with income over $100,000. This is a count, so it gets past the first rule. The result turns out to be 14,254, so it gets past the second rule as well. Now, he asks how many customers with ID not equal to 123456789 have income over $100,000. This is a valid query, as well, and returns 14,253. So by executing only queries that return aggregate statistics on thousands of rows, he found out that customer 123456789 has at least a six-figure income.
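The differencing trick can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `customers` table is synthetic, the incomes are made up, and `guarded_query` is a hypothetical stand-in for whatever enforces the two rules, not a real database feature.

```python
MIN_ROWS = 1000  # rule 2: every aggregate must cover at least this many rows
TARGET_ID = 123456789

# Hypothetical customer table: 6,000 consecutive IDs with made-up incomes.
# The target sits at index 3000, so its made-up income is 180,000.
customers = [
    {"id": TARGET_ID - 3000 + i, "income": 30_000 + 50 * i}
    for i in range(6000)
]

def guarded_query(rows, predicate, aggregate):
    """Return a count or mean of income over the rows matching `predicate`.

    Rule 1: only "count" and "mean" are accepted.
    Rule 2: the matching set must contain at least MIN_ROWS rows.
    """
    if aggregate not in ("count", "mean"):
        raise PermissionError("only aggregate statistics are allowed")
    incomes = [r["income"] for r in rows if predicate(r)]
    if len(incomes) < MIN_ROWS:
        raise PermissionError(f"fewer than {MIN_ROWS} rows match")
    return len(incomes) if aggregate == "count" else sum(incomes) / len(incomes)

# Asking about the target directly is blocked by rule 2.
try:
    guarded_query(customers, lambda r: r["id"] == TARGET_ID, "mean")
    direct_query_blocked = False
except PermissionError:
    direct_query_blocked = True

# Two permitted counts whose difference isolates the target.
with_target = guarded_query(customers, lambda r: r["income"] > 100_000, "count")
without_target = guarded_query(
    customers,
    lambda r: r["income"] > 100_000 and r["id"] != TARGET_ID,
    "count",
)
target_is_six_figure = (with_target - without_target) == 1
```

Both counts pass the policy check on their own. Only their difference reveals anything about the individual.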

Now he goes back and asks for the average income of customers with income over $100,000. Then he asks for the average income of customers with income over $100,000 and with ID not equal to 123456789. With a little algebra, he's able to find customer 123456789's exact income.
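The "little algebra" is short. If n customers have income over $100,000 (14,254 in the example above), the first average is A and the second average is A', then the target's income is n·A − (n − 1)·A'. Here is a minimal sketch with made-up incomes; `recover_value` is a hypothetical helper name, and the data is synthetic.

```python
def recover_value(n, mean_with, mean_without):
    """Recover one member's value from two means.

    n            -- size of the group including the target
    mean_with    -- mean over all n members
    mean_without -- mean over the other n - 1 members
    """
    # n * mean_with is the total for everyone; (n - 1) * mean_without is the
    # total for everyone except the target. The difference is the target.
    return n * mean_with - (n - 1) * mean_without

# Made-up incomes over $100,000; the first entry plays the target.
incomes = [180_000] + [100_050 + 50 * i for i in range(4598)]
n = len(incomes)
mean_with = sum(incomes) / n
mean_without = sum(incomes[1:]) / (n - 1)

recovered = recover_value(n, mean_with, mean_without)
```

Up to floating-point rounding, `recovered` equals the target's income, even though neither average was computed over a small set.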

You might object that it's cheating to have a clause such as "ID not equal to 123456789" in a query. Of course it's cheating. It clearly violates the spirit of the law, but not the letter. You might try to patch the rules by saying you cannot ask questions about a small set, nor about the complement of a small set. (Readers familiar with measure theory might sense a σ-algebra lurking in the background...)

That doesn't work either. Someone could run queries on customers with ID less than or equal to 123456789 and on customers with ID greater than or equal to 123456789. Both of these sets and their complements may be large, but the two sets overlap only at customer 123456789, so combining their results with the result for all customers isolates that individual.
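To see why the patched rule fails, here is a sketch of the overlapping-range attack using counts. As before, the table is synthetic and `guarded_count` is a hypothetical policy check, assumed only for illustration.

```python
MIN_ROWS = 1000  # every count must cover at least this many rows
TARGET_ID = 123456789

# Hypothetical table: 6,000 consecutive IDs with made-up incomes,
# with the target in the middle of the ID range.
customers = [
    {"id": TARGET_ID - 3000 + i, "income": 30_000 + 50 * i}
    for i in range(6000)
]

def guarded_count(rows, predicate):
    """Count matching rows, refusing to answer for small sets."""
    matched = sum(1 for r in rows if predicate(r))
    if matched < MIN_ROWS:
        raise PermissionError(f"fewer than {MIN_ROWS} rows match")
    return matched

def six_figure(r):
    return r["income"] > 100_000

# Neither range is small, and neither is the complement of a small set.
at_or_below = guarded_count(
    customers, lambda r: six_figure(r) and r["id"] <= TARGET_ID
)
at_or_above = guarded_count(
    customers, lambda r: six_figure(r) and r["id"] >= TARGET_ID
)
everyone = guarded_count(customers, six_figure)

# The two ranges overlap only at the target, so inclusion-exclusion gives
# the number of six-figure customers with exactly that ID: 1 or 0.
overlap = at_or_below + at_or_above - everyone
```

An `overlap` of 1 means the target has a six-figure income, and 0 means not. The same trick applied to sums or averages would recover the exact figure.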

You may be asking why a data scientist should have access to customer IDs at all. Obviously, you wouldn't allow that if you wanted to protect customer privacy. The point of this example is that limiting queries to aggregates of large sets is not enough. You can find out information on small sets from information on large sets. This could still be a problem with obvious identifiers removed.

Topics:
privacy, big data, queries, data analytics, aggregation, data science

Published at DZone with permission of John Cook, DZone MVB. See the original article here.

