Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

R: Vectorising All the Things

DZone's Guide to

R: Vectorising All the Things

· Big Data Zone
Free Resource

Learn best practices according to DataOps. Download the free O'Reilly eBook on building a modern Big Data platform.

After my last post about finding the distance a date/time is from the weekend Hadley Wickham suggested I could improve the function by vectorising it…

@markhneedham vectorise with pmin(pmax(dateToLookup – before, 0), pmax(after – dateToLookup, 0)) / dhours(1)

— Hadley Wickham (@hadleywickham) December 14, 2014

…so I thought I’d try and vectorise some of the other functions I’ve written recently and show the two versions.

I found the following articles useful for explaining vectorisation and why you might want to do it:

Let’s get started.

Distance from the weekend

We want to find out how many hours away from the weekend i.e. nearest Saturday/Sunday a particular date/time is. We’ll be using the following libraries and set of date/times:

library(dplyr)
library(lubridate)
library(geosphere)
options("scipen"=100, "digits"=4)
 
times = ymd_hms("2002-01-01 17:00:00") + c(0:99) * hours(1)
data = data.frame(time = times)
> data %>% head()
                 time
1 2002-01-01 17:00:00
2 2002-01-01 18:00:00
3 2002-01-01 19:00:00
4 2002-01-01 20:00:00
5 2002-01-01 21:00:00
6 2002-01-01 22:00:00

Let’s have a look at the non vectorised version first:

distanceFromWeekend = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  timeToBefore = dateToLookup - before
  timeToAfter = after - dateToLookup
 
  if(timeToBefore < 0 || timeToAfter < 0) {
    0  
  } else {
    if(timeToBefore < timeToAfter) {
      timeToBefore / dhours(1)
    } else {
      timeToAfter / dhours(1)
    }
  }
}

Now let’s run it against our data frame:

> system.time(
    data %>% mutate(ind = row_number()) %>% group_by(ind) %>% mutate(dist = distanceFromWeekend(time))  
    )
   user  system elapsed 
  1.837   0.020   1.884

And now for Hadley’s vectorised version:

distanceFromWeekendVectorised = function(dateToLookup) {
  before = floor_date(dateToLookup, "week") + hours(23) + minutes(59) + seconds(59)
  after  = ceiling_date(dateToLookup, "week") - days(1)
  pmin(pmax(dateToLookup - before, 0), pmax(after - dateToLookup, 0)) / dhours(1)
}
 
> system.time(data %>% mutate(dist = distanceFromWeekendVectorised(time)))
   user  system elapsed 
  0.020   0.001   0.023

Extracting start date

My next example was from cleaning up Google Trends data and extracting the start date from a cell inside a CSV file.

We’ll use this data frame:

googleTrends = read.csv("/Users/markneedham/Downloads/report.csv", row.names=NULL)
names(googleTrends) = c("week", "score")
> googleTrends %>% head(10)
                        week score
1  Worldwide; 2004 - present      
2         Interest over time      
3                       Week neo4j
4    2004-01-04 - 2004-01-10     0
5    2004-01-11 - 2004-01-17     0
6    2004-01-18 - 2004-01-24     0
7    2004-01-25 - 2004-01-31     0
8    2004-02-01 - 2004-02-07     0
9    2004-02-08 - 2004-02-14     0
10   2004-02-15 - 2004-02-21     0

The non vectorised version looked like this:

> system.time(
    googleTrends %>% 
      mutate(ind = row_number()) %>% 
      group_by(ind) %>%
      mutate(dates = strsplit(week, " - "),
             start = dates[[1]][1] %>% strptime("%Y-%m-%d") %>% as.character())
    )
   user  system elapsed 
  0.215   0.000   0.214

In this case it’s actually not possible to vectorise the code using the strsplit so we need to use something else. Antonios showed me how to do so using substr:

> system.time(googleTrends %>% mutate(start = substr(week, 1, 10) %>% ymd()))
   user  system elapsed 
  0.018   0.000   0.017

Calculating haversine distance

I wanted to work out the great circular distance from a collection of venues to a centre point in London. I started out with this data frame:

centre = c(-0.129581, 51.516578)
venues = read.csv("/tmp/venues.csv")
 
> venues %>% head()
                       venue   lat      lon
1              Skills Matter 51.52 -0.09911
2                   Skinkers 51.50 -0.08387
3          Theodore Bullfrog 51.51 -0.12375
4 The Skills Matter eXchange 51.52 -0.09923
5               The Guardian 51.53 -0.12234
6            White Bear Yard 51.52 -0.10980

My non vectorised version looked like this:

> system.time(venues %>% 
    mutate(distanceFromCentre = by(venues, 1:nrow(venues), function(row) { distHaversine(c(row$lon, row$lat), centre)  }))
    )
   user  system elapsed 
  0.034   0.000   0.033

It’s pretty quick but we can do better – the distHaversine function allows us to calculate multiple distances if the first argument ot it is a matrix of lon/lat values rather than a vector:

> system.time(
    venues %>% mutate(distanceFromCentre = distHaversine(cbind(venues$lon, venues$lat), centre))
    )
   user  system elapsed 
  0.001   0.000   0.001

One I can’t figure out…

And finally I have a function which I can’t figure out how to vectorise but maybe someone with more R skillz than me can?

I have a data frame containing the cumulative member counts of various NoSQL London groups:

cumulativeMeetupMembers = read.csv("/tmp/cumulativeMeetupMembers.csv")
> cumulativeMeetupMembers %>% sample_n(10)
                               g.name dayMonthYear    n
4734            Hadoop Users Group UK   2013-10-26 1144
4668            Hadoop Users Group UK   2013-08-03  979
4936            Hadoop Users Group UK   2014-07-31 1644
5150                      Hive London   2012-10-15  109
8020        Neo4j - London User Group   2014-03-15  826
7666        Neo4j - London User Group   2012-08-06   78
1030                  Big Data London   2013-03-01 1416
6500        London MongoDB User Group   2013-09-21  952
8290 Oracle Big Data 4 the Enterprise   2012-06-04   61
2584              Data Science London   2012-03-20  285

And I want to find out the number of members for a group on a specific date. e.g. given the following data…

> cumulativeMeetupMembers %>% head(10)
                                          g.name dayMonthYear  n
1  Big Data / Data Science / Data Analytics Jobs   2013-01-29  1
2  Big Data / Data Science / Data Analytics Jobs   2013-02-06 15
3  Big Data / Data Science / Data Analytics Jobs   2013-02-07 28
4  Big Data / Data Science / Data Analytics Jobs   2013-02-10 31
5  Big Data / Data Science / Data Analytics Jobs   2013-02-18 33
6  Big Data / Data Science / Data Analytics Jobs   2013-03-27 38
7  Big Data / Data Science / Data Analytics Jobs   2013-04-16 41
8  Big Data / Data Science / Data Analytics Jobs   2013-07-17 53
9  Big Data / Data Science / Data Analytics Jobs   2013-08-28 58
10 Big Data / Data Science / Data Analytics Jobs   2013-11-11 63

…the number of members for the ‘Big Data / Data Science / Data Analytics Jobs’ group on the 10th November 2013 should be 58.

I created this data frame of groups and random dates:

dates = ymd("2014-09-01") + c(0:9) * weeks(1)
groups = cumulativeMeetupMembers %>% distinct(g.name) %>% select(g.name)
 
groupsOnDate = merge(dates, groups)
names(groupsOnDate) = c('date', 'name')
 
> groupsOnDate %>% sample_n(10)
          date                                            name
156 2014-10-06                                 GridGain London
153 2014-09-15                                 GridGain London
70  2014-11-03                                Couchbase London
185 2014-09-29                           Hadoop Users Group UK
105 2014-09-29                             Data Science London
137 2014-10-13            Equal Experts Technical Meetup Group
360 2014-11-03                        Scale Warriors of London
82  2014-09-08 Data Science & Business Analytics London Meetup
233 2014-09-15                 London ElasticSearch User Group
84  2014-09-22 Data Science & Business Analytics London Meetup

The non vectorised version looks like this:

memberCount = function(meetupMembers) {
  function(groupName, date) {
    (meetupMembers %>% 
       filter(g.name == groupName & dayMonthYear < date) %>% do(tail(., 1)))$n    
  }  
} 
 
findMemberCount = memberCount(cumulativeMeetupMembers)
 
> system.time(groupsOnDate %>% mutate(groupMembers = by(groupsOnDate, 1:nrow(groupsOnDate), function(row) { 
          findMemberCount(row$name, as.character(row$date))
        }) %>% 
        cbind() %>% 
        as.vector() ))
   user  system elapsed 
  2.259   0.005   2.269

The output looks like this:

          date                                     name groupMembers
116 2014-10-06                      DeNormalised London          157
322 2014-09-08                 OpenCredo Tech Workshops            7
71  2014-09-01                  Data Enthusiasts London             
233 2014-09-15          London ElasticSearch User Group          614
171 2014-09-01 HPC & GPU Supercomputing Group of London           80
109 2014-10-27                      Data Science London         3632
20  2014-11-03            Big Data Developers in London          708
42  2014-09-08              Big Data Week London Meetup           96
127 2014-10-13          Enterprise Search London Meetup          575
409 2014-10-27                            Women in Data          548

I’ve tried many different approaches but haven’t been able to come up with a version that lets me pass in all the rows to memberCount and calculate the count for each row in one go.

Any ideas/advice/hints welcome!

Find the perfect platform for a scalable self-service model to manage Big Data workloads in the Cloud. Download the free O'Reilly eBook to learn more.

Topics:

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}