Process Strings in R Using stringr

Explore the different functions available in the stringr package in R using the interesting U.S. Carriers flight dataset.

The first article in this series shed some light on the different methods of encoding character attributes for creating useful machine learning models. In this piece, we focus on manipulating messy strings and extracting useful text from them using R.

To reiterate the key point from our previous article: character or string data dominates enterprise datasets, which makes it hard to build accurate machine learning models. We have to clean messy strings, pull strings apart, and extract the useful text embedded in them before the data can be used in a machine learning pipeline.

The stringr package provides a consistent set of functions for this kind of work. Below are some advantages of using stringr:

  • Consistent function names and descriptive input parameters.

  • Built-in pattern matching and regex functions.

  • Deals with missing data by default: a missing (NA) input gives a missing output.

  • The data types of input and output strings are preserved.
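As a minimal sketch of the first and third points, the snippet below runs two stringr functions on a small made-up vector. The values in carriers are hypothetical sample data, not rows from the flights file, and the NA stands in for a missing carrier name.

```r
library(stringr)

# Hypothetical sample values; the NA stands in for a missing carrier name.
carriers <- c("Tradewind Aviation", NA, "GoJet Airlines")

# Both functions take the character vector as their first argument.
# str_detect() also takes the pattern to look for as its second argument.
detected     <- str_detect(carriers, "Air")  # FALSE NA TRUE
name_lengths <- str_length(carriers)         # 18 NA 14

print(detected)
print(name_lengths)
```

The missing value is carried through to the result instead of raising an error or being silently counted.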

Now, let's explore the different functions available in the stringr package. We will use U.S. Carriers flight data, which can be downloaded from the Bureau of Transportation Statistics website. Once the data is downloaded, load the stringr library and read the file into the R environment as shown below:

library(stringr)
flights <- read.csv("606231461_T_T100D_MARKET_US_CARRIER_ONLY")

The column UNIQUE_CARRIER_NAME has names of the carriers as strings. We will use this attribute to explore the stringr functionality.
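If you do not have the downloaded file at hand, the following sketch builds a tiny stand-in so you can follow along. It is an illustration only: the temporary CSV, the flights_demo name, and the PASSENGERS values are assumptions made up for this example, and the two carrier names are the ones quoted later in this article. Passing stringsAsFactors = FALSE keeps the column as plain character strings on older R versions, where read.csv converted strings to factors by default.

```r
# Write a tiny, made-up CSV to a temporary file (hypothetical data).
tmp <- tempfile(fileext = ".csv")
writeLines(c("UNIQUE_CARRIER_NAME,PASSENGERS",
             "Tradewind Aviation,12",
             "GoJet Airlines LLC d/b/a United Express,48"),
           tmp)

# Read it the same way as the real file, keeping strings as characters.
flights_demo <- read.csv(tmp, stringsAsFactors = FALSE)

# Confirm the carrier column is a character vector and list its values.
print(is.character(flights_demo$UNIQUE_CARRIER_NAME))
print(unique(flights_demo$UNIQUE_CARRIER_NAME))
```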

  • str_detect is used to find a pattern in a string. For instance, str_detect(flights$UNIQUE_CARRIER_NAME,"Tradewind") returns a logical vector with TRUE for every carrier name that contains Tradewind and FALSE for every name that does not.

  • str_extract extracts the part of the string that matches the pattern. For example, str_extract(flights$UNIQUE_CARRIER_NAME,"Tradewind") searches for Tradewind in every string, returns the first match when there is one, and returns NA when there is none.

  • str_length retrieves the length, in characters, of each string in the attribute. str_length(flights$UNIQUE_CARRIER_NAME) returns the length of every carrier name in the UNIQUE_CARRIER_NAME column.

  • str_locate returns the position of the first match of a pattern in each string. For example, for the flights dataset, str_locate(flights$UNIQUE_CARRIER_NAME,"Trade") returns a start of 1 and an end of 5 for Tradewind Aviation, which means that the pattern Trade occupies the first through fifth characters of that name. Names that do not contain the pattern get NA for both positions.

  • str_replace is used widely. There are times when we need to replace a text pattern with another string, and this function comes in handy: it replaces the first occurrence of a matched pattern in each string. For instance, str_replace(flights$UNIQUE_CARRIER_NAME,"Tradewind","Air") replaces Tradewind with Air. After this replacement, the carrier Tradewind Aviation is changed to Air Aviation. Cool makeover. Hope Tradewind Aviation likes this new branding.

  • str_split breaks up a string based on the pattern provided and returns the pieces as a list. For example, str_split(flights$UNIQUE_CARRIER_NAME,"Air") splits "GoJet Airlines LLC d/b/a United Express" into "GoJet " (with its trailing space) and "lines LLC d/b/a United Express".

  • str_sub is similar to the native substr function; it returns substrings from a character vector. For example, str_sub(flights$UNIQUE_CARRIER_NAME,1,3) returns "Tra" for Tradewind Aviation.

  • str_trim is a useful function that trims whitespace at the beginning and end of a string. The command str_trim(" Airlines ") returns just "Airlines". Similarly, str_trim(" GoJet Airlines ") trims the leading and trailing whitespace and returns "GoJet Airlines". Note that the space inside "GoJet Airlines" is not trimmed.
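The functions above can be tried together in one short session. This is a sketch under one assumption: instead of the full flights$UNIQUE_CARRIER_NAME column, it uses a hypothetical two-element vector, carriers, holding the two carrier names quoted in this article.

```r
library(stringr)

# Hypothetical stand-in for flights$UNIQUE_CARRIER_NAME.
carriers <- c("Tradewind Aviation",
              "GoJet Airlines LLC d/b/a United Express")

detected     <- str_detect(carriers, "Tradewind")   # TRUE FALSE
extracted    <- str_extract(carriers, "Tradewind")  # "Tradewind" NA
name_lengths <- str_length(carriers)                # 18 39

# Matrix with "start" and "end" columns; the second row is NA NA.
located <- str_locate(carriers, "Trade")

# Only the first match in each string is replaced.
replaced <- str_replace(carriers, "Tradewind", "Air")

# A list with one character vector of pieces per input string.
pieces <- str_split(carriers[2], "Air")

prefixes <- str_sub(carriers, 1, 3)            # "Tra" "GoJ"
trimmed  <- str_trim(" GoJet Airlines ")       # "GoJet Airlines"

print(located)
print(replaced)
print(pieces)
```

To replace or locate every occurrence rather than only the first, stringr also offers str_replace_all and str_locate_all.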

These are some of the handy functions in stringr that are often used. The package has more functions that are less commonly used but good to know; you can refer to the R documentation to explore them. stringr is one of the essential packages in a data science toolbox, and if you have read this far, you are ready to manipulate strings in R with ease.
