Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Using the Mortality Database

DZone's Guide to

Using the Mortality Database

Take a look at some fun analysis based on demographic data and using a demography library in R to get the proportion of people alive in 1945 who are still alive today.

· Big Data Zone
Free Resource

Access NoSQL and Big Data through SQL using standard drivers (ODBC, JDBC, ADO.NET). Free Download 

In demography, we like to use life tables to estimate the probability that someone born in 1945 (as an example) is still alive nowadays. But another interesting quantity might be the probability that someone alive in 1945 is still alive nowadays.

The main difference is that we do not know when that person, alive in 1945, was born. Someone who was old in 1945 is very unlikely to still be alive in 2017. To compute those probabilities, we can use datasets from here. More precisely, we need both death and birth data. I assume that datasets (text files) were downloaded (it is necessary to register, for free, to get the data).

D=read.table("FRDeaths_1x1.txt",skip=1,header=TRUE)
B=read.table("FRBirths.txt",skip=1,header=TRUE)

In the death dataset, there is a “110+” for people older than 110 years. For convenience, let us cap our observations at 110 years old,

D$Age=as.numeric(as.character(D$Age))
D$Age[is.na(D$Age)]=110

Consider now a first function that will return for people born in 1930 (as an example) two peices of information:

  • The number of people (here, let's consider women only) born in 1930 (from the birth database).
  • The number of death of people of age 0 in 1930, people of age 1 in 1931, people of age 2 in 1932, etc.

The code is simple:

nb=function(y=1930){
debut=1816
MatDFemale=matrix(D$Female,nrow=111)
colnames(MatDFemale)=debut+0:198
cly=y-debut+1:111
deces=diag(MatDFemale[,cly[cly%in%1:199]])
return(c(B$Female[B$Year==y],deces))}

We have a single number for the number of births, and then a vector for the number of deaths. Consider now another function. Consider the people born in 1930. We want to get two numbers: the number of people still alive in 1945 (say), and the number of people still alive nowadays. The ratio will be the proportion of people born in 1930 that were alive in 1945 that are still alive in 2015.

pop=function(ne=1930,an=1945){
comptage=nb(ne)
s=0
if(an>ne) s=sum(comptage[seq(2,1+an-ne)])
p1=max(comptage[1]-s,0)
p2=max(p1-sum(comptage[seq(2+an-ne,length(comptage))]),0)
c(p1,p2)
}

Then, for a given year (say 1945), to get the proportion of people alive in 1945 that are still alive today, we need to count how many people born in 1944 were still alive in 1945, and in 2015, but also born in 1943, 1942, etc. And we simply consider the ratio of the total number of people alive in 2015 over the total number of people alive in 1945:

ptn=function(y=1945){
V=Vectorize(function(x) pop(x,y))(1816:y)
sum(V[2,!is.na(V[2,])])/sum(V[1,!is.na(V[1,])])
}

Hence, 22% of those alive in 1945 are still alive in 2015:

> ptn(1945)
[1] 0.2209435

Actually, instead of looking only at 1945, it is possible to get a plot:

P=Vectorize(ptn)(1900:2010)
plot(1900:2010,P,type="l",ylim=0:1)

For instance:

> ptn(1975)
[1] 0.6377413

63.7% of those alive in 1975 are still alive 40 years after. That is a rather interesting function; I was surprised that I couldn’t find it is standard demographical R package.

The fastest databases need the fastest drivers - learn how you can leverage CData Drivers for high performance NoSQL & Big Data Access.

Topics:
database ,demographics ,r

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}