Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

R/dplyr: Extracting Data Frame Column Value for Filtering With %in%

DZone's Guide to

R/dplyr: Extracting Data Frame Column Value for Filtering With %in%

· Big Data Zone
Free Resource

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

I’ve been playing around with dplyr over the weekend and wanted to extract the values from a data frame column to use in a later filtering step.

I had a data frame:

library(dplyr)
df = data.frame(userId = c(1,2,3,4,5), score = c(2,3,4,5,5))

And wanted to extract the userIds of those people who have a score greater than 3. I started with:

highScoringPeople = df %>% filter(score > 3) %>% select(userId)
> highScoringPeople
  userId
1      3
2      4
3      5

And then filtered the data frame expecting to get back those 3 people:

> df %>% filter(userId %in% highScoringPeople)
[1] userId score 
<0 rows> (or 0-length row.names)

No rows! I created vector with the numbers 3-5 to make sure that worked:

> df %>% filter(userId %in% c(3,4,5))
  userId score
1      3     4
2      4     5
3      5     5

That works as expected so highScoringPeople obviously isn’t in the right format to facilitate an ‘in lookup’. Let’s explore:

> str(c(3,4,5))
 num [1:3] 3 4 5
 
> str(highScoringPeople)
'data.frame': 3 obs. of  1 variable:
 $ userId: num  3 4 5

Now it’s even more obvious why it doesn’t work – highScoringPeople is still a data frame when we need it to be a vector/list.

One way to fix this is to extract the userIds using the $ syntax instead of the select function:

highScoringPeople = (df %>% filter(score > 3))$userId
 
> str(highScoringPeople)
 num [1:3] 3 4 5
 
> df %>% filter(userId %in% highScoringPeople)
  userId score
1      3     4
2      4     5
3      5     5

Or if we want to do the column selection using dplyr we can extract the values for the column like this:

highScoringPeople = (df %>% filter(score > 3) %>% select(userId))[[1]]
 
> str(highScoringPeople)
 num [1:3] 3 4 5

Not so difficult after all.

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

Topics:

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}