Big Data and Privacy

How does big data impact privacy? Which is a bigger risk to your privacy, being part of a little database or a big database?

Rows vs. Columns

People commonly speak of big data in terms of volume (the "four Vs" of big data being volume, variety, velocity, and veracity), but what we're concerned with here might better be called "area." We'll think of our data as being in one big table. If there are repeated measures on an individual, think of them as more columns in a denormalized database table.

In what sense is the data big: is it wide or long? That is, if we think of the data as a table with rows for individuals and columns for different fields of information on individuals, are there a lot of rows or a lot of columns?

All else being equal, your privacy goes down as columns go up. The more information someone has about you, the more likely some of it may be used in combination to identify you.
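To make the effect of width concrete, here is a minimal sketch that measures what fraction of rows become unique as more columns are taken into account. The records and column meanings (age band, sex, ZIP prefix, occupation) are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy records: (age_band, sex, zip3, occupation).
records = [
    ("40s", "M", "770", "teacher"),
    ("40s", "M", "770", "nurse"),
    ("40s", "M", "773", "teacher"),
    ("40s", "F", "770", "teacher"),
    ("30s", "M", "770", "teacher"),
    ("40s", "M", "770", "teacher"),
]


def unique_fraction(rows, num_columns):
    """Fraction of rows whose first `num_columns` fields match no other row."""
    keys = [row[:num_columns] for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Uniqueness when an observer knows 1, 2, 3, or all 4 columns.
fractions = [unique_fraction(records, width) for width in range(1, 5)]
```

Each added column can only split existing groups of identical rows, never merge them, so the fraction of unique rows cannot decrease as the table gets wider.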

How privacy varies with the number of rows is more complicated. Your privacy could go up or down with the number of rows.

The more individuals in a dataset, the more likely there are individuals like you in the dataset. From the standpoint of k-anonymity, this makes it harder to identify you, but easier to reveal information about you.

Group Privacy

For example, suppose there are 50 people who have all the same quasi-identifiers as you do. Say you're a Native American man in your 40s and there are 49 others like you. Then someone who knows your demographics can't be sure which record is yours; they'd have a 2% chance of guessing the right one. The presence of other people with similar demographics makes you harder to identify. On the other hand, their presence makes it more likely that someone could find out something you all have in common. If the data shows that middle-aged Native American men are highly susceptible to some disease, then the data implies that you are likely to be susceptible to that disease.
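The example can be sketched in a few lines of code. This is only an illustration: the field names, the 200 unrelated records, and the assumption that 45 of the 50 matching people have the condition are all invented here and do not come from any real dataset.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("ethnicity", "sex", "age_band")


def class_key(row, quasi_identifiers=QUASI_IDENTIFIERS):
    """Tuple of a row's quasi-identifier values."""
    return tuple(row[q] for q in quasi_identifiers)


def k_for(person, rows):
    """Number of rows indistinguishable from `person` on the quasi-identifiers."""
    sizes = Counter(class_key(row) for row in rows)
    return sizes[class_key(person)]


def group_rate(person, rows, attribute):
    """Share of the person's look-alikes for whom `attribute` is true."""
    matches = [row for row in rows if class_key(row) == class_key(person)]
    return sum(1 for row in matches if row[attribute]) / len(matches)


# Hypothetical data: 50 people with your demographics, 45 of whom have the
# condition (an invented figure), plus 200 unrelated records.
you = {"ethnicity": "Native American", "sex": "M", "age_band": "40s"}
rows = [dict(you, has_condition=(i < 45)) for i in range(50)]
rows += [
    {"ethnicity": "Other", "sex": "F", "age_band": "30s", "has_condition": False}
    for _ in range(200)
]

k = k_for(you, rows)                 # size of your equivalence class
guess_probability = 1 / k            # chance of picking your record at random
rate = group_rate(you, rows, "has_condition")
```

Here a larger `k` lowers `guess_probability`, but `rate` is learned about everyone in the class without identifying any single record.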

Because privacy measures aimed at protecting individuals don't necessarily protect groups, some minority groups have been reluctant to participate in scientific studies. The Genetic Information Nondiscrimination Act of 2008 (GINA) makes some forms of genetic discrimination illegal, but it's quite understandable that minority groups might still be reluctant to participate in studies.

Improving Privacy With Size

Your privacy can improve as a dataset gets bigger. If there are a lot of rows, the data curator can afford a substantial amount of randomization without compromising the value of the data. The noise in the data will not materially affect statistical conclusions drawn from the data, but it will protect individual privacy.

With differential privacy, the data is held by a trusted curator. Noise is added not to the data itself but to the results of queries on the data. Noise is added in proportion to the sensitivity of a query. The sensitivity of a query often goes down with the size of the database, and so a differentially private query of a big dataset may only need to add a negligible amount of noise to maintain privacy.
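As a rough illustration of why sensitivity can shrink with database size, consider a mean query answered with the Laplace mechanism. The sketch below makes several assumptions not stated in the text: values are clipped to a known range `[lower, upper]`, the number of rows is public, one record may change in value, and the function names are hypothetical. Under those assumptions, changing one record moves the mean by at most `(upper - lower) / n`.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def mean_sensitivity(n, lower, upper):
    """Largest change in the mean when one of n bounded values is replaced."""
    return (upper - lower) / n


def private_mean(values, lower, upper, epsilon, rng):
    """Mean of clipped values plus Laplace noise scaled to sensitivity / epsilon."""
    clipped = [min(max(v, lower), upper) for v in values]
    scale = mean_sensitivity(len(clipped), lower, upper) / epsilon
    return sum(clipped) / len(clipped) + laplace_noise(scale, rng)


# Noise scale for the same query and epsilon at two database sizes.
epsilon = 1.0
scales = {n: mean_sensitivity(n, 0.0, 100.0) / epsilon for n in (100, 1_000_000)}
```

With the same `epsilon`, the noise scale for a million rows is ten thousand times smaller than for a hundred rows. Queries whose sensitivity does not depend on the number of rows, such as a simple count, do not get this benefit in absolute terms, though the noise still shrinks relative to the answer.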

If the dataset is very large, it may be possible to randomize the data itself before it enters the database using randomized response or local differential privacy. With these approaches, there's no need for a trusted data curator. This wouldn't be possible with a small dataset because the noise would be too large relative to the size of the data.
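One classic form of randomized response for a yes/no question can be sketched as follows. The coin-flip probabilities used here (answer truthfully half the time, otherwise answer at random) are one common choice and an assumption of this sketch, not something the text specifies. Each reported answer is deniable, yet the true rate can be estimated from the aggregate.

```python
import math
import random


def randomized_response(truth, rng):
    """Report the truth with probability 1/2; otherwise report a fair coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports):
    """Invert P(report yes) = 0.5 * p + 0.25 to recover an estimate of p."""
    observed = sum(reports) / len(reports)
    return 2 * (observed - 0.25)


def worst_case_standard_error(n):
    """Upper bound on the estimator's standard error: 2 * sqrt(0.25 / n)."""
    return 1 / math.sqrt(n)


# Error bound at two dataset sizes.
errors = {n: worst_case_standard_error(n) for n in (100, 1_000_000)}
```

With 100 respondents the estimate can easily be off by ten percentage points, which would swamp most signals; with a million respondents the same bound is a tenth of a percentage point.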

Topics:
big data, big data sets, privacy, data security

Published at DZone with permission of
