Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

R: Conditionally Updating Rows of a Data Frame

DZone's Guide to

R: Conditionally Updating Rows of a Data Frame

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

In a blog post I wrote a couple of days ago about cohort analysis I had to assign a monthNumber to each row in a data frame and started out with the following code:

library(zoo)
library(dplyr)
 
monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
 
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}
 
cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)

If we time this function using system.time we’ll see that it’s not very snappy:

system.time(cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber))
 
   user  system elapsed 
  1.968   0.019   2.016

The reason for the poor performance is that we process each row of the data table individually due to the call to group_by on the second line. One way we can refactor the code is to use the ifelse which can process multiple rows at a time:

system.time(
cohortAttendance %>% 
  mutate(monthNumber = ifelse(as.yearmon(cohort) > as.yearmon(date), 
                              paste((round(as.yearmon(date) - as.yearmon(cohort))*12), sep=""), 
                              NA)))
   user  system elapsed 
  0.026   0.000   0.026

Antonios suggested another approach which involves first setting every row to ‘NA’ and then selectively updating the appropriate rows. I ended up with the following code:

cohortAttendance$monthNumber = NA
 
cohortAttendance$monthNumber[as.yearmon(cohortAttendance$cohort) > as.yearmon(cohortAttendance$date)] = paste((round(as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort))*12), sep="")

Let’s measure that:

system.time(paste((round(as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort))*12), sep=""))
   user  system elapsed 
  0.013   0.000   0.013

Both approaches are much quicker than my original version although this one seems to be marginally quicker than the ifelse approach.

Note to future Mark: try to avoid grouping by row number – there’s usually a better and faster solution!

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

Topics:

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}