R: Joining multiple data frames
Join the DZone community and get the full member experience.
Join For FreeI’ve been looking through the code from Martin Eastwood’s excellent talk ‘Predicting Football Using R‘ and was intrigued by the code which reshaped the data into that expected by glm.
The original looks like this:
df <- read.csv('http://www.football-data.co.uk/mmz4281/1314/E0.csv') # munge data into format compatible with glm function df <- apply(df, 1, function(row){ data.frame(team=c(row['HomeTeam'], row['AwayTeam']), opponent=c(row['AwayTeam'], row['HomeTeam']), goals=c(row['FTHG'], row['FTAG']), home=c(1, 0)) }) df <- do.call(rbind, df)
The initial data frame looks like this:
> library(dplyr) > df %>% select(HomeTeam, AwayTeam, FTHG, FTAG) %>% head(1) HomeTeam AwayTeam FTHG FTAG 1 Arsenal Aston Villa 1 3
And we want to get it to look like this:
> head(df, 2) team opponent goals home HomeTeam Arsenal Aston Villa 1 1 AwayTeam Aston Villa Arsenal 3 0
So for each row in the initial data frame we want to have two rows: one representing each team, how many goals they scored in the match and whether they were playing at home or away.
I really like dplyr’s pipelining function so I thought I’d try and translate Martin’s code to use that and other dplyr functions.
I ended up with the following two sets of function calls:
df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1) df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0)
I’m doing pretty much the same thing as Martin except I’ve used dplyr’s select and mutate functions to transform the data frame.
The next step was to join those two data frames together and with Nicole’s help I realised that there are many ways we can do this.
The functions that will do the job are:
We decided to benchmark them to see which was able to transform the data frame the fastest:
# load data into data.frame dfOrig <- read.csv('http://www.football-data.co.uk/mmz4281/1314/E0.csv') original = function(df) { df <- apply(df, 1, function(row){ data.frame(team=c(row['HomeTeam'], row['AwayTeam']), opponent=c(row['AwayTeam'], row['HomeTeam']), goals=c(row['FTHG'], row['FTAG']), home=c(1, 0)) }) do.call(rbind, df) } newRBind = function(df) { rbind(df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1), df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0)) } newUnion = function(df) { union(df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1), df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0)) } newJoin = function(df) { join(df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1), df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0), type = "full") } newMerge = function(df) { merge(df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1), df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0), all = TRUE) }
> library(microbenchmark) > microbenchmark(original(dfOrig)) Unit: milliseconds expr min lq mean median uq max neval original(dfOrig) 189.4 196.8 202.5 201 205.5 284 100 > microbenchmark(newRBind(dfOrig)) Unit: milliseconds expr min lq mean median uq max neval newRBind(dfOrig) 2.197 2.274 2.396 2.309 2.377 4.526 100 > microbenchmark(newUnion(dfOrig)) Unit: milliseconds expr min lq mean median uq max neval newUnion(dfOrig) 2.156 2.223 2.377 2.264 2.34 4.597 100 > microbenchmark(newJoin(dfOrig)) Unit: milliseconds expr min lq mean median uq max neval newJoin(dfOrig) 5.961 6.132 6.817 6.253 6.65 11.95 100 > microbenchmark(newMerge(dfOrig)) Unit: milliseconds expr min lq mean median uq max neval newMerge(dfOrig) 7.121 7.413 8.037 7.541 7.934 13.32 100
We actually get a 100 time speed up over the original function if we use rbind or union whereas with merge or join it’s around 35 times quicker.
In this case using merge or join is a bit misleading because we’re not actually connecting the data frames together based on any particular field – we are just appending one to the other.
The code’s available as a gist if you want to have a play.
Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments