R: Regex -- Capturing Multiple Matches of the Same Group

By Mark Needham · Jun. 24, 15 · Interview

I’ve been playing around with some web logs using R and I wanted to extract everything that existed in double quotes within a logged entry.

This is an example of a log entry that I want to parse:

log = '2015-06-18-22:277:548311224723746831\t2015-06-18T22:00:11\t2015-06-18T22:00:05Z\t93317114\tip-127-0-0-1\t127.0.0.5\tUser\tNotice\tneo4j.com.access.log\t127.0.0.3 - - [18/Jun/2015:22:00:11 +0000] "GET /docs/stable/query-updating.html HTTP/1.1" 304 0 "http://neo4j.com/docs/stable/cypher-introduction.html" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"'

And I want to extract these 3 things:

  • /docs/stable/query-updating.html
  • http://neo4j.com/docs/stable/cypher-introduction.html
  • Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36

i.e. the URI, the referrer and browser details.

I’ll be using the stringr library which seems to work quite well for this type of work.

To extract these values we need to find all the occurrences of double quotes and get the text inside those quotes. We might start by using the str_match function:

> library(stringr)
> str_match(log, "\"[^\"]*\"")
     [,1]                                               
[1,] "\"GET /docs/stable/query-updating.html HTTP/1.1\""

Unfortunately, that only picked up the first occurrence of the pattern, so we’ve got the request line containing the URI but not the referrer or browser details. I tried str_extract with similar results before I found str_extract_all, which returns every match:

> str_extract_all(log, "\"[^\"]*\"")
[[1]]
[1] "\"GET /docs/stable/query-updating.html HTTP/1.1\""                                                                            
[2] "\"http://neo4j.com/docs/stable/cypher-introduction.html\""                                                                    
[3] "\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36\""
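If stringr isn't available, roughly the same extraction can be sketched in base R. This is a minimal sketch rather than part of the original approach: `entry` is a hypothetical, shortened stand-in for the log line (prefix dropped, user agent abbreviated), and `quoted` is just an illustrative name.

```r
# Shortened stand-in for the log entry (assumption: user agent abbreviated).
entry <- '127.0.0.3 - - [18/Jun/2015:22:00:11 +0000] "GET /docs/stable/query-updating.html HTTP/1.1" 304 0 "http://neo4j.com/docs/stable/cypher-introduction.html" "Mozilla/5.0"'

# gregexpr() finds the position of every match of the pattern;
# regmatches() then pulls out the matching substrings.
# The result is a list with one element per input string, hence [[1]].
quoted <- regmatches(entry, gregexpr('"[^"]*"', entry))[[1]]

print(quoted)
```

As with str_extract_all, the surrounding quotes are still part of each match.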

We still need to do a bit of cleanup to get rid of the ‘GET’ and ‘HTTP/1.1’ in the URI and the quotes in all of them:

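# All quoted fields from the log entry, as a character vector
parts = str_extract_all(log, "\"[^\"]*\"")[[1]]
# str_match returns a matrix: column 1 is the full match, column 2 the first capture group
uri = str_match(parts[1], "GET (.*) HTTP")[, 2]
referer = str_match(parts[2], "\"(.*)\"")[, 2]
browser = str_match(parts[3], "\"(.*)\"")[, 2]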
 
> uri
[1] "/docs/stable/query-updating.html"
 
> referer
[1] "http://neo4j.com/docs/stable/cypher-introduction.html"
 
> browser
[1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"
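The extraction and the quote-stripping could probably be folded into one step with stringr's str_match_all, which returns the capture groups for every match rather than only the first. The following is a sketch under assumptions, not the approach used above: `entry` is a hypothetical, shortened stand-in for the log line (user agent abbreviated), and the stricter request-line pattern is my own choice.

```r
library(stringr)

# Shortened stand-in for the log entry (assumption: user agent abbreviated).
entry <- '127.0.0.3 - - [18/Jun/2015:22:00:11 +0000] "GET /docs/stable/query-updating.html HTTP/1.1" 304 0 "http://neo4j.com/docs/stable/cypher-introduction.html" "Mozilla/5.0"'

# str_match_all returns a list with one matrix per input string.
# Column 1 holds the full match, column 2 the text captured inside the quotes.
quoted <- str_match_all(entry, "\"([^\"]*)\"")[[1]][, 2]

# Strip the HTTP method and protocol version from the request line.
uri <- str_match(quoted[1], "^[A-Z]+ (.*) HTTP/[0-9.]+$")[, 2]
referer <- quoted[2]
browser <- quoted[3]

print(uri)
print(referer)
print(browser)
```

Because the capture group excludes the quotes, the referrer and browser values need no second pass.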

We could then go on to split out the browser string into its sub components but that’ll do for now!
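For anyone who does want to go further, here is one rough way the browser string might be split up. It assumes the user agent follows the common Name/Version token layout, which won't hold for every client; the names `tokens`, `products` and `platform` are purely illustrative.

```r
library(stringr)

browser <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36"

# Each product token looks like Name/Version, so capture the two halves separately.
# Column 1 is the full match, column 2 the name, column 3 the version.
tokens <- str_match_all(browser, "([A-Za-z]+)/([0-9.]+)")[[1]]
products <- data.frame(name = tokens[, 2],
                       version = tokens[, 3],
                       stringsAsFactors = FALSE)

# The first parenthesised group carries the platform details.
platform <- str_match(browser, "\\(([^)]*)\\)")[, 2]

print(products)
print(platform)
```

A dedicated user agent parser would be more robust than a regex for real traffic, since these strings vary a lot between clients.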
