
Estimating Age from First Name, Part 1


Today I read a cute post from Flowing Data on the trendiest names in US history. What caught my attention was a link in the article to the source data: yearly lists of baby names registered with the US Social Security Administration since 1880. I thought that it might be good to compile and use these lists at work for two reasons:

(1) I don’t have experience handling file input programmatically in R (i.e., working with a whole directory of files instead of manually loading one or two), and
(2) it could be useful to have age estimates in the donor files that I work with (using the year when each first name was most popular).

I’ve included the R code in this post at the bottom, after the following explanatory text.

I managed to build a dataframe in which each row contains a name, a sex, the year, how many people were registered as having been born with that name in that year, the total number of registered births for that year, and the name’s relative proportion of that total.
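As a small illustration of that layout, here is a minimal sketch on made-up counts. The numbers are invented and are not SSA data, and the column names `Total` and `Rel_Pop` are just labels chosen for this example. It uses base R's `ave()` to attach each year's total to every row.

```r
# Toy counts, invented for illustration only (not real SSA figures).
toy = data.frame(
  Name = c("Mary", "John", "Mary", "John"),
  Sex  = c("F", "M", "F", "M"),
  Pop  = c(70, 30, 40, 60),
  Year = c(1900, 1900, 1950, 1950),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the sum of Pop within that row's Year.
toy$Total = ave(toy$Pop, toy$Year, FUN = sum)

# Each name's share of the registered births in its year.
toy$Rel_Pop = toy$Pop / toy$Total

print(toy)
```

Each row ends up carrying both its own count and its year's total, so the relative proportion is a plain column division.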

Once I had that dataframe, I built a function that queries it for the year when a given name was most popular and returns that year, an estimated age based on it, and the relative proportion of people born with that name in that year.

I don’t have any testing data to check the results against, but I did do an informal check around the office, and it seems okay!

However, I’d like to scale this up so that age estimates can be calculated for every element of a vector of first names. As the code below stands, the function filters the entire dataframe on every call, which is too slow to scale up effectively.

What is the best approach to take? Here are the ideas I have so far:

  • Create a smaller dataframe where each row contains a unique name, the year when it was most popular, and the relative popularity in that year. Make a function to query this new dataframe.
  • Possibly convert the above dataframe into a data table and then build a function to query the data table instead.
  • If neither of the above two ideas works well, load the popularity data into Python and make a function to query it there instead.
Does anyone have any better ideas for me?
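The first idea in the list could look roughly like the sketch below. It is only an illustration under some assumptions: the helper names `build_peak_table` and `estimate_ages` are hypothetical, the input dataframe is assumed to have `Name`, `Year` and `Rel_Pop` columns, and sex is ignored for simplicity. The toy data is invented.

```r
# Hypothetical helper: keep, for each name, only the row where Rel_Pop peaks.
# Expects a dataframe with Name, Year and Rel_Pop columns.
build_peak_table = function (name_data, min_year = 1921) {
  # as.character() first, in case Year was stored as text or a factor.
  name_data$Year = as.numeric(as.character(name_data$Year))
  recent = name_data[name_data$Year >= min_year, ]
  # Sort so each name's highest Rel_Pop comes first, then keep that first row.
  recent = recent[order(recent$Name, -recent$Rel_Pop), ]
  recent[!duplicated(recent$Name), c("Name", "Year", "Rel_Pop")]
}

# Hypothetical helper: estimate ages for a whole vector of first names at once.
estimate_ages = function (input_names, peak_table,
                          current_year = as.numeric(format(Sys.Date(), "%Y"))) {
  # match() gives the row of peak_table for each name, or NA if it is unknown.
  idx = match(input_names, peak_table$Name)
  data.frame(
    Name = input_names,
    year_of_birth = peak_table$Year[idx],
    age = current_year - peak_table$Year[idx],
    relative_pop = peak_table$Rel_Pop[idx],
    stringsAsFactors = FALSE
  )
}

# Toy data, invented for illustration only (not real SSA figures).
toy = data.frame(
  Name = c("Linda", "Linda", "Emma", "Emma"),
  Year = c(1947, 1990, 1947, 2003),
  Rel_Pop = c(0.05, 0.001, 0.002, 0.012),
  stringsAsFactors = FALSE
)

peak_table = build_peak_table(toy)
ages = estimate_ages(c("Emma", "Linda", "Zzz"), peak_table, current_year = 2013)
print(ages)
```

The expensive work (finding each name's peak year) happens once when the table is built. After that, each lookup is a single vectorized `match()` rather than a filter over the full dataframe, and names that are not in the table come back as `NA`.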

library(stringr)
library(plyr)
 
# We're assuming you've downloaded the SSA files into your R project directory,
# and that they are named like yob1880.txt (adjust the pattern below if not).
 
# Match the files by name rather than by their position in the directory listing.
file_listing = list.files(pattern = "^yob[0-9]{4}\\.txt$")
read_year = function (f) {
  year_data = read.csv(f, header=FALSE, col.names=c("Name", "Sex", "Pop"), stringsAsFactors=FALSE)
  # Store the year as a number so that comparisons like Year >= 1921 are numeric.
  year_data$Year = as.integer(str_extract(f, "[0-9]{4}"))
  year_data
}
# Read every file, then bind once instead of growing the dataframe inside a loop.
name_data = do.call(rbind, lapply(file_listing, read_year))
 
# Total registered births per year, then each name's share of that year's total.
year_pop_totals = ddply(name_data, .(Year), summarise, Total = sum(Pop))
name_data = merge(name_data, year_pop_totals, by="Year", all.x=TRUE)
name_data$Rel_Pop = name_data$Pop/name_data$Total
 
estimate_age = function (input_name, sex = NA, min_year = 1921) {
  # min_year = 1921 is a year I chose arbitrarily. Change how you like.
  # as.character() first, in case Year was stored as text or a factor.
  years = as.numeric(as.character(name_data$Year))
  keep = name_data$Name == input_name & years >= min_year
  if (!is.na(sex)) {
    keep = keep & name_data$Sex == sex
  }
  name_subset = name_data[keep, ]
  if (nrow(name_subset) == 0) {
    # Name not found: return NAs instead of calling max() on nothing.
    return(list(year_of_birth=NA, age=NA, relative_pop=NA))
  }
  # which.max() picks the first row if several years tie for the peak.
  peak = name_subset[which.max(name_subset$Rel_Pop), ]
  peak_year = as.numeric(as.character(peak$Year))
  current_year = as.numeric(format(Sys.Date(), "%Y"))
  return(list(year_of_birth=peak_year, age=current_year - peak_year, relative_pop=sprintf("%1.2f%%", peak$Rel_Pop*100)))
}

I’ll also accept any suggestions for cleaning up my code as is :)


