Over a million developers have joined DZone.

Estimating Age from First Name, Part 1

· Big Data Zone

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

Today I read a cute post from Flowing Data on the most trendy names in US history. What caught my attention was a link posted in the article to the source data, which happens to be yearly lists of baby names registered with the US social security agency since 1880
(see here). I thought that it might be good to compile and use these lists at work for two reasons:

(1) I don’t have experience handling file input programmatically in R (ie working with a bunch of files in a directory instead of manually loading one or two) and
(2) It could be useful to have age estimates in the donor files that I work with (using the year when each first name was most popular).

I’ve included the R code in this post at the bottom, after the following explanatory text.

I managed to build a dataframe that contains in each row a name, how many people were registered as having been born with that name in a given year, the year, the total population for that year, and the relative proportion of people with that name in that year.

Once I got that dataframe, I built a function to query that dataframe for the year when a given name was most popular, an estimated age using that year, and the relative proportion of people born with that name from that year.

I don’t have any testing data to check the results against, but I did do an informal check around the office, and it seems okay!

However, I’d like to scale this upwards so that age estimates can be calculated for each row in a vector of first names. As the code stands below, the function I made takes too long to be scaled up effectively.

I’m wondering what’s the best approach to take? Some ideas I have so far follow:

  • Create a smaller dataframe where each row contains a unique name, the year when it was most popular, and the relative popularity in that year. Make a function to query this new dataframe.
  • Possibly convert the above dataframe into a data table and then building a function to query the data table instead.
  • Failing the efficacy of the above two ideas, load the popularity data into Python, and make a function to query it there instead.
Does anyone have any better ideas for me?

# We're assuming you've downloaded the SSA files into your R project directory.
file_listing = list.files()[3:135]
for (f in file_listing) {
  year = str_extract(f, "[0-9]{4}")
  if (year == "1880") { # Initializing the very long dataframe
    name_data = read.csv(f, header=FALSE)
    names(name_data) = c("Name", "Sex", "Pop")
    name_data$Year = rep(year, dim(name_data)[1]) }
  else { # adding onto the very long dataframe
    name_data_new = read.csv(f, header=FALSE)
    names(name_data_new) = c("Name", "Sex", "Pop")
    name_data_new$Year = rep(year, dim(name_data_new)[1])
    name_data = rbind(name_data, name_data_new)
year_pop_totals = ddply(name_data, .(Year), function (x) sum(x$Pop))
name_data = merge(name_data, year_pop_totals, by.x="Year", by.y="Year", all.x=TRUE)
name_data$Rel_Pop = name_data$Pop/name_data$V1
estimate_age = function (input_name, sex = NA) {
if (is.na(sex)) {
  name_subset = subset(name_data, Name == input_name & Year >= 1921)} #1921 is a year I chose arbitrarily. Change how you like.
else {
  name_subset = subset(name_data, Name == input_name & Year >= 1921 & Sex == sex)
  year_and_rel_pop = name_subset[which(name_subset$Rel_Pop == max(name_subset$Rel_Pop)),c(1,6)]
  current_year = as.numeric(substr(Sys.time(),1,4))
  estimated_age = current_year - as.numeric(year_and_rel_pop[1])
  return(list(year_of_birth=as.numeric(year_and_rel_pop[1]), age=estimated_age, relative_pop=sprintf("%1.2f%%",year_and_rel_pop[2]*100)))

I’ll also accept any suggestions for cleaning up my code as is :)

Hortonworks Sandbox is a personal, portable Apache Hadoop® environment that comes with dozens of interactive Hadoop and it's ecosystem tutorials and the most exciting developments from the latest HDP distribution, brought to you in partnership with Hortonworks.


Published at DZone with permission of Matthew Dubins , DZone MVB .

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}