R: dplyr -- Segfault Cause 'memory not mapped'
In my continued playing around with web logs in R, I wanted to process the logs for a day and see what the most popular URIs were.
I first read in all the lines using the read_lines function in readr and put the vector it produced into a data frame so I could process it using dplyr.
library(readr)

dlines = data.frame(column = read_lines("~/projects/logs/2015-06-18-22-docs"))
In the previous post I showed some code to extract the URI from a log line. I extracted this code out into a function and adapted it so that I could pass in a list of values instead of a single value:
library(stringr)

extract_uri = function(log) {
  # pull every double-quoted field out of each log line
  parts = str_extract_all(log, "\"[^\"]*\"")
  # the first quoted field is the request line; capture the URI from it
  return(lapply(parts, function(p) str_match(p[1], "GET (.*) HTTP")[2] %>% as.character))
}

Next I ran the following code to count the number of times each URI appeared in the logs:

library(dplyr)

pages_viewed = dlines %>%
  mutate(uri = extract_uri(column)) %>%
  count(uri) %>%
  arrange(desc(n))
This crashed my R process with the following error message:
segfault cause 'memory not mapped'
I narrowed it down to a problem when doing a group by operation on the ‘uri’ field and came across this post, which suggested that it was handled more cleanly in more recent versions of dplyr.
I upgraded to 0.4.2 and tried again:
## Error in eval(expr, envir, enclos): cannot group column uri, of class 'list'
That makes more sense. We’re probably returning a list from extract_uri (lapply always returns one) rather than a vector, which would fit nicely back into the data frame. That’s fixed easily enough by unlisting the result:
extract_uri = function(log) {
  parts = str_extract_all(log, "\"[^\"]*\"")
  # unlist flattens the list of matches into a plain character vector
  return(unlist(lapply(parts, function(p) str_match(p[1], "GET (.*) HTTP")[2] %>% as.character)))
}
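To see the difference between the two versions without loading any packages, here is a minimal sketch in base R. The log lines are made up for illustration, and first_uri is a hypothetical stand-in for the str_match call above, not the function from this post. It also shows vapply as one possible alternative to unlist(lapply(...)), since it checks that every element is a single string:

```r
# Hypothetical log lines, made up for illustration
logs <- c(
  '1.2.3.4 - - [18/Jun/2015:22:00:01 +0000] "GET /docs/a HTTP/1.1" 200 512',
  '1.2.3.5 - - [18/Jun/2015:22:00:02 +0000] "GET /docs/b HTTP/1.1" 200 128'
)

# Base R stand-in for the str_match call: returns the first capture group,
# or NA when the line has no GET request in it
first_uri <- function(line) {
  m <- regmatches(line, regexec("GET (.*) HTTP", line))[[1]]
  if (length(m) >= 2) m[2] else NA_character_
}

as_list   <- lapply(logs, first_uri)   # shape of the first version's result
as_vector <- unlist(as_list)           # shape of the fixed version's result

# Assumed alternative: vapply fails loudly unless each result is one string
as_vapply <- vapply(logs, first_uri, character(1), USE.NAMES = FALSE)

class(as_list)
class(as_vector)
```

The first class call reports a list and the second a character vector, which is the shape a data frame column needs before it can be grouped.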
And now when we run the count function it’s happy again, good times!
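As a quick sanity check of the counting step on its own, here is a small sketch with made-up URIs in place of the real logs. The names dlines_demo and pages_viewed_demo are hypothetical, and the guard line is just one way to catch a list column before grouping, given that the dplyr version used in this post refuses to group by one:

```r
library(dplyr)

# Hypothetical URIs standing in for the output of extract_uri
dlines_demo <- data.frame(
  uri = c("/docs/a", "/docs/b", "/docs/a"),
  stringsAsFactors = FALSE
)

# Guard: stop early if uri is not a plain character vector
stopifnot(is.character(dlines_demo$uri))

# Same count-and-sort steps as the pipeline above
pages_viewed_demo <- dlines_demo %>%
  count(uri) %>%
  arrange(desc(n))

pages_viewed_demo
```

With this input the most frequent URI, /docs/a, comes out on the first row with a count of 2.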