Compare Adjacent Rows in an R data.table

Learn how to compare related, adjacent rows in an R data.table by building a lag column within each group.

As part of my exploration of the Land Registry price paid data set, I wanted to compare the prices of consecutive sales of the same property.

This means we need to group the sales by a property identifier and then get the previous sale price into a column on each row, unless it’s the first sale, in which case we’ll have ‘NA’. We can do this by creating a lag variable.
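
To make the idea concrete before bringing in data.table, here is a minimal base R sketch of a lag on a single vector. The names 'lag_one', 'prices' and 'previous' are illustrative and not part of the original analysis.

```r
# Sketch: lag a vector by one position, padding the front with NA.
# `lag_one`, `prices` and `previous` are hypothetical names for illustration.
lag_one <- function(x) {
  # Drop the last element and put NA in front so the length is unchanged
  # (assumes x has at least one element)
  c(NA, head(x, -1))
}

prices <- c(10000, 12500, 18000)
previous <- lag_one(prices)

# Each position now holds the price that came before it;
# the first position has no predecessor, so it is NA
print(previous)
```

The grouped version later in this post does the same thing, once per property.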

I’ll use a simpler data set which is very similar in structure to the Land Registry’s to demonstrate:

> library(data.table)
 
> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"), 
                      price = c(10000, 12500, 18000, 245000, 512000, 1000000))
 
> blogDT
         name   price
1: Property 1   10000
2: Property 1   12500
3: Property 1   18000
4: Property 2  245000
5: Property 2  512000
6: Property 2 1000000

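We want to group by the ‘name’ column and, within each group, carry each row’s price down to the next row: the price on row 1 shows on row 2, row 2 on row 3, row 4 on row 5 and row 5 on row 6. Rows 1 and 4 are first sales, so they get NA. To do that we’ll introduce a ‘lag.price’ column. Within each group ‘.N’ is the number of rows, so ‘price[-.N]’ is every price except the last, and putting NA in front of it shifts the prices down by one row: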

> blogDT[, lag.price := c(NA, price[-.N]), by = name]
 
> blogDT
         name   price lag.price
1: Property 1   10000        NA
2: Property 1   12500     10000
3: Property 1   18000     12500
4: Property 2  245000        NA
5: Property 2  512000    245000
6: Property 2 1000000    512000
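
As an alternative sketch, data.table ships a 'shift' function whose defaults (one step, lag direction, NA fill) should produce the same column as the manual version. The table name 'dt' and the column names 'lag.manual' and 'lag.shift' are made up for this comparison.

```r
library(data.table)

# Same toy data as above, under a hypothetical name `dt`
dt <- data.table(
  name  = rep(c("Property 1", "Property 2"), each = 3),
  price = c(10000, 12500, 18000, 245000, 512000, 1000000)
)

# Manual lag: NA followed by every price in the group except the last
dt[, lag.manual := c(NA, price[-.N]), by = name]

# shift() defaults to n = 1 and type = "lag", filling the gap with NA
dt[, lag.shift := shift(price), by = name]

# The two columns are expected to agree, including the NA positions
print(all.equal(dt$lag.manual, dt$lag.shift))
```

Either form works here; 'shift' may read more clearly when you need a lag of more than one row or a lead instead of a lag.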

Next let’s calculate the difference between the two prices:

> blogDT[, diff := price - lag.price]
> blogDT
         name   price lag.price   diff
1: Property 1   10000        NA     NA
2: Property 1   12500     10000   2500
3: Property 1   18000     12500   5500
4: Property 2  245000        NA     NA
5: Property 2  512000    245000 267000
6: Property 2 1000000    512000 488000

Finally let’s order the data table by the biggest price gains:

> blogDT[order(-diff)]
         name   price lag.price   diff
1: Property 2 1000000    512000 488000
2: Property 2  512000    245000 267000
3: Property 1   18000     12500   5500
4: Property 1   12500     10000   2500
5: Property 1   10000        NA     NA
6: Property 2  245000        NA     NA
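
The first sale of each property has no previous price, so those rows sort to the bottom with NA. If you only care about actual gains, one possible sketch is to filter them out before ordering. The names 'dt' and 'gains' are illustrative only.

```r
library(data.table)

# Rebuild the toy data and the derived columns under a hypothetical name `dt`
dt <- data.table(
  name  = rep(c("Property 1", "Property 2"), each = 3),
  price = c(10000, 12500, 18000, 245000, 512000, 1000000)
)
dt[, lag.price := c(NA, price[-.N]), by = name]
dt[, diff := price - lag.price]

# Keep only rows that have a previous sale, then sort by largest gain first
gains <- dt[!is.na(diff)][order(-diff)]
print(gains)
```

This leaves four rows, one for each repeat sale, with the largest increase on top.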

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.