
Bits of Information in Age, Birthday, and Birthdate

Look at how much information is contained in someone's age, zip code, and birthdate to demonstrate the theory that 87% of the US population can be identified based on zip code, sex, and birth date.

The previous post looked at how much information is contained in zip codes. This post will look at how much information is contained in someone's age, birthday, and birthdate. Combining zip code with birthdate will demonstrate the plausibility of Latanya Sweeney's famous result [*] that 87% of the US population can be identified based on zip code, sex, and birth date.

Birthday

Birthday is the easiest. There is a small variation in the distribution of birthdays, but this doesn't matter for our purposes. The amount of information in a birthday, to three significant figures, is 8.51 bits, whether you include or exclude leap days. You can assume all birthdays are equally common or use actual demographic data. It only makes a difference in the third decimal place.
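The 8.51 figure can be checked with a short calculation. The following is a minimal Python sketch, not the computation used for the article: it assumes either 365 equally likely birthdays or a simplified leap-year model in which February 29 occurs once in a four-year cycle of 1461 days. The helper name `entropy_bits` is chosen here for illustration, and no real demographic data is used.

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumption: 365 equally likely birthdays, leap days ignored.
no_leap = [1 / 365] * 365

# Assumption: a four-year cycle of 1461 days in which each ordinary
# date occurs four times and February 29 occurs once.
with_leap = [4 / 1461] * 365 + [1 / 1461]

h_no_leap = entropy_bits(no_leap)
h_with_leap = entropy_bits(with_leap)
print(f"without leap days: {h_no_leap:.4f} bits")
print(f"with leap days:    {h_with_leap:.4f} bits")
```

Both values round to 8.51 bits, and they differ only from the third decimal place onward.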

Age

I'll be using the following age distribution data found on Wikipedia.

|-----------+------------|
| Age range | Population |
|-----------+------------|
|  0- 4     |   20201362 |
|  5- 9     |   20348657 |
| 10-14     |   20677194 |
| 15-19     |   22040343 |
| 20-24     |   21585999 |
| 25-29     |   21101849 |
| 30-34     |   19962099 |
| 35-39     |   20179642 |
| 40-44     |   20890964 |
| 45-49     |   22708591 |
| 50-54     |   22298125 |
| 55-59     |   19664805 |
| 60-64     |   16817924 |
| 65-69     |   12435263 |
| 70-74     |    9278166 |
| 75-79     |    7317795 |
| 80-84     |    5743327 |
| 85+       |    5493433 |
|-----------+------------|

To get data for each particular age, I'll assume ages are evenly distributed in each group, and I'll assume the 85+ group consists of people from ages 85 to 92. (See "The Christmas Song," commonly known as "Chestnuts Roasting on an Open Fire.")

With these assumptions, there are 6.4 bits of information in age. This seems plausible: if all ages were uniformly distributed between 0 and 63, there would be exactly 6 bits of information since 2^6 = 64.
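The age calculation can be reproduced along the following lines. This is a sketch under the assumptions stated above (ages spread evenly within each five-year group, and the 85+ group covering ages 85 through 92); the variable names are illustrative.

```python
import math

# Population by five-year age group, copied from the table above.
# The final entry is the open-ended 85+ group.
group_counts = [
    20201362, 20348657, 20677194, 22040343, 21585999, 21101849,
    19962099, 20179642, 20890964, 22708591, 22298125, 19664805,
    16817924, 12435263, 9278166, 7317795, 5743327, 5493433,
]

# Assumption from the text: ages are spread evenly within each group,
# and the 85+ group covers the eight ages 85 through 92.
group_widths = [5] * 17 + [8]

# Expand the grouped counts into one (fractional) count per year of age.
per_age_counts = []
for count, width in zip(group_counts, group_widths):
    per_age_counts.extend([count / width] * width)

total = sum(per_age_counts)
age_entropy = -sum((c / total) * math.log2(c / total) for c in per_age_counts)
print(f"{age_entropy:.2f} bits of information in age")
```

Changing the assumed upper age for the 85+ group shifts the result only slightly, because that group holds less than 2% of the population.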

Birthdate

If we assume birthdays are uniformly distributed within each age, then age and birthday are independent. The information contained in a birthdate would then be the sum of the information contained in birthday and age, or 8.5 + 6.4 = 14.9 bits.
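The additivity used here holds for any pair of independent quantities: the entropy of the joint distribution equals the sum of the individual entropies. The sketch below illustrates this with two small made-up distributions (`p_a` and `p_b` are hypothetical toy values, not the article's age or birthday data).

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical toy distributions, chosen only for illustration.
p_a = [0.5, 0.25, 0.25]
p_b = [0.7, 0.2, 0.1]

# Under independence, each joint probability is the product of the marginals.
joint = [a * b for a in p_a for b in p_b]

h_a = entropy_bits(p_a)
h_b = entropy_bits(p_b)
h_joint = entropy_bits(joint)
print(f"H(A) + H(B) = {h_a + h_b:.6f} bits")
print(f"H(A, B)     = {h_joint:.6f} bits")
```

If the two quantities were dependent, the joint entropy would be smaller than the sum, so adding the bits gives an upper bound in that case.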

Zip Code, Sex, and Birthdate

The previous post showed there are 13.8 bits of information in a zip code. There are about an equal number of men and women, so sex adds 1 bit. So zip code, sex, and birthdate would give a total of 29.7 bits. Since the US population is between 2^28 and 2^29, fewer than 29 bits are needed to give every person a distinct label, so it's plausible that we'd have enough information to identify everyone.
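The comparison can be made concrete with a few lines of arithmetic. This sketch uses the bit counts quoted in the text and, as an assumption, takes the US population to be the sum of the age table earlier in this post (308,745,538).

```python
import math

# Figures quoted in the text, in bits.
zip_bits = 13.8
sex_bits = 1.0
birthdate_bits = 14.9
total_bits = zip_bits + sex_bits + birthdate_bits

# Assumption: population is the sum of the age table in this post.
population = 308_745_538
bits_needed = math.log2(population)

print(f"bits available: {total_bits:.1f}")
print(f"bits needed to label everyone uniquely: {bits_needed:.1f}")
print(f"margin: {total_bits - bits_needed:.1f} bits")
```

Having more bits than log2 of the population is necessary for identifying everyone but not sufficient, since people are not spread evenly over the possible combinations.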

We've made a number of simplifying assumptions. We were a little fast and loose with age data, and we've assumed independence several times. We know that sex and age are not independent: more babies are boys, but women live longer. Still, Latanya Sweeney found empirically that you can identify 87% of Americans using the combination of zip code, sex, and birth date.

[*] Latanya Sweeney, "Simple Demographics Often Identify People Uniquely," Carnegie Mellon University, Data Privacy Working Paper 3, Pittsburgh, 2000.

Published at DZone with permission of
