R: Conditionally Updating Rows of a Data Frame
Join the DZone community and get the full member experience.
Join For FreeIn a blog post I wrote a couple of days ago about cohort analysis I had to assign a monthNumber to each row in a data frame and started out with the following code:
library(zoo) library(dplyr) monthNumber = function(cohort, date) { cohortAsDate = as.yearmon(cohort) dateAsDate = as.yearmon(date) if(cohortAsDate > dateAsDate) { "NA" } else { paste(round((dateAsDate - cohortAsDate) * 12), sep="") } } cohortAttendance %>% group_by(row_number()) %>% mutate(monthNumber = monthNumber(cohort, date)) %>% filter(monthNumber != "NA") %>% filter(monthNumber != "0") %>% mutate(monthNumber = as.numeric(monthNumber)) %>% arrange(monthNumber)
If we time this function using system.time we’ll see that it’s not very snappy:
system.time(cohortAttendance %>% group_by(row_number()) %>% mutate(monthNumber = monthNumber(cohort, date)) %>% filter(monthNumber != "NA") %>% filter(monthNumber != "0") %>% mutate(monthNumber = as.numeric(monthNumber)) %>% arrange(monthNumber)) user system elapsed 1.968 0.019 2.016
The reason for the poor performance is that we process each row of the data table individually due to the call to group_by on the second line. One way we can refactor the code is to use the ifelse which can process multiple rows at a time:
system.time( cohortAttendance %>% mutate(monthNumber = ifelse(as.yearmon(cohort) > as.yearmon(date), paste((round(as.yearmon(date) - as.yearmon(cohort))*12), sep=""), NA))) user system elapsed 0.026 0.000 0.026
Antonios suggested another approach which involves first setting every row to ‘NA’ and then selectively updating the appropriate rows. I ended up with the following code:
cohortAttendance$monthNumber = NA cohortAttendance$monthNumber[as.yearmon(cohortAttendance$cohort) > as.yearmon(cohortAttendance$date)] = paste((round(as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort))*12), sep="")
Let’s measure that:
system.time(paste((round(as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort))*12), sep="")) user system elapsed 0.013 0.000 0.013
Both approaches are much quicker than my original version although this one seems to be marginally quicker than the ifelse approach.
Note to future Mark: try to avoid grouping by row number – there’s usually a better and faster solution!
Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Trending
-
Comparing Cloud Hosting vs. Self Hosting
-
Observability Architecture: Financial Payments Introduction
-
Microservices With Apache Camel and Quarkus
-
Auditing Tools for Kubernetes
Comments