Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

R/dplyr: Extracting Data Frame Column Value for Filtering With %in%

DZone's Guide to

R/dplyr: Extracting Data Frame Column Value for Filtering With %in%

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

I’ve been playing around with dplyr over the weekend and wanted to extract the values from a data frame column to use in a later filtering step.

I had a data frame:

library(dplyr)
df = data.frame(userId = c(1,2,3,4,5), score = c(2,3,4,5,5))

And wanted to extract the userIds of those people who have a score greater than 3. I started with:

highScoringPeople = df %>% filter(score > 3) %>% select(userId)
> highScoringPeople
  userId
1      3
2      4
3      5

And then filtered the data frame expecting to get back those 3 people:

> df %>% filter(userId %in% highScoringPeople)
[1] userId score 
<0 rows> (or 0-length row.names)

No rows! I created vector with the numbers 3-5 to make sure that worked:

> df %>% filter(userId %in% c(3,4,5))
  userId score
1      3     4
2      4     5
3      5     5

That works as expected so highScoringPeople obviously isn’t in the right format to facilitate an ‘in lookup’. Let’s explore:

> str(c(3,4,5))
 num [1:3] 3 4 5
 
> str(highScoringPeople)
'data.frame': 3 obs. of  1 variable:
 $ userId: num  3 4 5

Now it’s even more obvious why it doesn’t work – highScoringPeople is still a data frame when we need it to be a vector/list.

One way to fix this is to extract the userIds using the $ syntax instead of the select function:

highScoringPeople = (df %>% filter(score > 3))$userId
 
> str(highScoringPeople)
 num [1:3] 3 4 5
 
> df %>% filter(userId %in% highScoringPeople)
  userId score
1      3     4
2      4     5
3      5     5

Or if we want to do the column selection using dplyr we can extract the values for the column like this:

highScoringPeople = (df %>% filter(score > 3) %>% select(userId))[[1]]
 
> str(highScoringPeople)
 num [1:3] 3 4 5

Not so difficult after all.

Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.

Topics:

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}