DZone
Big Data Zone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
  • Refcardz
  • Trend Reports
  • Webinars
  • Zones
  • |
    • Agile
    • AI
    • Big Data
    • Cloud
    • Database
    • DevOps
    • Integration
    • IoT
    • Java
    • Microservices
    • Open Source
    • Performance
    • Security
    • Web Dev
DZone > Big Data Zone > R: Joining multiple data frames

R: Joining multiple data frames

Mark Needham user avatar by
Mark Needham
·
Nov. 21, 14 · Big Data Zone · Interview
Like (0)
Save
Tweet
9.71K Views

Join the DZone community and get the full member experience.

Join For Free

I’ve been looking through the code from Martin Eastwood’s excellent talk ‘Predicting Football Using R‘ and was intrigued by the code which reshaped the data into that expected by glm.

The original looks like this:

df <- read.csv('http://www.football-data.co.uk/mmz4281/1314/E0.csv')
 
# munge data into format compatible with glm function
df <- apply(df, 1, function(row){
  data.frame(team=c(row['HomeTeam'], row['AwayTeam']),
             opponent=c(row['AwayTeam'], row['HomeTeam']),
             goals=c(row['FTHG'], row['FTAG']),
             home=c(1, 0))
})
df <- do.call(rbind, df)

The initial data frame looks like this:

> library(dplyr)
> df %>% select(HomeTeam, AwayTeam, FTHG, FTAG) %>% head(1)
  HomeTeam    AwayTeam FTHG FTAG
1  Arsenal Aston Villa    1    3

And we want to get it to look like this:

> head(df, 2)
                team    opponent goals home
HomeTeam     Arsenal Aston Villa     1    1
AwayTeam Aston Villa     Arsenal     3    0

So for each row in the initial data frame we want to have two rows: one representing each team, how many goals they scored in the match and whether they were playing at home or away.

I really like dplyr’s pipelining function so I thought I’d try and translate Martin’s code to use that and other dplyr functions.

I ended up with the following two sets of function calls:

df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1)
df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0)

I’m doing pretty much the same thing as Martin except I’ve used dplyr’s select and mutate functions to transform the data frame.

The next step was to join those two data frames together and with Nicole’s help I realised that there are many ways we can do this.

The functions that will do the job are:

  • rbind
  • union
  • plyr’s join with ‘type = “full”‘
  • dplyr’s union
  • merge with ‘all = TRUE’

We decided to benchmark them to see which was able to transform the data frame the fastest:

# load data into data.frame
dfOrig <- read.csv('http://www.football-data.co.uk/mmz4281/1314/E0.csv')
 
original = function(df) {
  df <- apply(df, 1, function(row){
    data.frame(team=c(row['HomeTeam'], row['AwayTeam']),
               opponent=c(row['AwayTeam'], row['HomeTeam']),
               goals=c(row['FTHG'], row['FTAG']),
               home=c(1, 0))
  })
  do.call(rbind, df)
}
 
newRBind = function(df) {
  rbind(df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1),
        df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0))  
}
 
newUnion = function(df) {
  union(df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1),
        df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0))  
}
 
newJoin = function(df) {
  join(df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1),
       df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0),
      type = "full")  
}
 
newMerge = function(df) {
  merge(df %>% select(team = HomeTeam, opponent = AwayTeam, goals = FTHG) %>% mutate(home = 1),
       df %>% select(team = AwayTeam, opponent = HomeTeam, goals = FTAG) %>% mutate(home = 0),
       all = TRUE)  
}
> library(microbenchmark)
 
> microbenchmark(original(dfOrig))
Unit: milliseconds
             expr   min    lq  mean median    uq max neval
 original(dfOrig) 189.4 196.8 202.5    201 205.5 284   100
 
> microbenchmark(newRBind(dfOrig))
Unit: milliseconds
             expr   min    lq  mean median    uq   max neval
 newRBind(dfOrig) 2.197 2.274 2.396  2.309 2.377 4.526   100
 
> microbenchmark(newUnion(dfOrig))
Unit: milliseconds
             expr   min    lq  mean median   uq   max neval
 newUnion(dfOrig) 2.156 2.223 2.377  2.264 2.34 4.597   100
 
> microbenchmark(newJoin(dfOrig))
 
Unit: milliseconds
            expr   min    lq  mean median   uq   max neval
 newJoin(dfOrig) 5.961 6.132 6.817  6.253 6.65 11.95   100
 
> microbenchmark(newMerge(dfOrig))
Unit: milliseconds
             expr   min    lq  mean median    uq   max neval
 newMerge(dfOrig) 7.121 7.413 8.037  7.541 7.934 13.32   100

We actually get a 100 time speed up over the original function if we use rbind or union whereas with merge or join it’s around 35 times quicker.

In this case using merge or join is a bit misleading because we’re not actually connecting the data frames together based on any particular field – we are just appending one to the other.

The code’s available as a gist if you want to have a play.

Data (computing) Frame (networking) R (programming language)

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • Querying Kafka Topics Using Presto
  • Maven Tutorial: Nice and Easy [Video]
  • OpenTelemetry in Action: Identifying Database Dependencies
  • Java Hashtable, HashMap, ConcurrentHashMap: Performance Impact

Comments

Big Data Partner Resources

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • MVB Program
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends:

DZone.com is powered by 

AnswerHub logo