Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Compare Adjacent Rows in an R data.table

DZone's Guide to

Compare Adjacent Rows in an R data.table

Learn how to compare related, adjacent rows in an R data table by using R's handy lag function for time-series data.

· Big Data Zone ·
Free Resource

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

As part of my exploration of the Land Registry price paid data set, I wanted to compare the difference between consecutive sales of properties.

This means we need to group the sales by a property identifier and then get the previous sale price into a column on each row, unless it’s the first sale, in which case we’ll have ‘NA’. We can do this by creating a lag variable.

I’ll use a simpler data set which is very similar in structure to the Land Registry’s to demonstrate:

> blogDT = data.table(name = c("Property 1","Property 1","Property 1","Property 2","Property 2","Property 2"), 
                      price = c(10000, 12500, 18000, 245000, 512000, 1000000))
 
> blogDT
         name   price
1: Property 1   10000
2: Property 1   12500
3: Property 1   18000
4: Property 2  245000
5: Property 2  512000
6: Property 2 1000000

We want to group by the ‘name’ column and then have the price on row 1 show on row 2, the price on row 2 on row 3, the price on row 4 on row 5 and the price on row 5 on row 6. To do that we’ll introduce a ‘lag.price’ column:

> blogDT[, lag.price := c(NA, price[-.N]), by = name]
 
> blogDT
         name   price lag.price
1: Property 1   10000        NA
2: Property 1   12500     10000
3: Property 1   18000     12500
4: Property 2  245000        NA
5: Property 2  512000    245000
6: Property 2 1000000    512000

Next let’s calculate the difference between the two prices:

> blogDT[, diff := price - lag.price]
> blogDT
         name   price lag.price   diff
1: Property 1   10000        NA     NA
2: Property 1   12500     10000   2500
3: Property 1   18000     12500   5500
4: Property 2  245000        NA     NA
5: Property 2  512000    245000 267000
6: Property 2 1000000    512000 488000

Finally let’s order the data table by the biggest price gains:

> blogDT[order(-diff)]
         name   price lag.price   diff
1: Property 2 1000000    512000 488000
2: Property 2  512000    245000 267000
3: Property 1   18000     12500   5500
4: Property 1   12500     10000   2500
5: Property 1   10000        NA     NA
6: Property 2  245000        NA     NA

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:
r language ,data ,table ,time series data

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}