Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Hands-on Data Wrangling: What, How, and Why

DZone's Guide to

Hands-on Data Wrangling: What, How, and Why

Companies are investing heavily in infrastructure, tools, and personnel to ingest and curate raw data to be "analyzable." In this tutorial, get a feel for what data wranglers do with some hands-on exercises.

Free Resource

Learn how you can maximize big data in the cloud with Apache Hadoop. Download this eBook now. Brought to you in partnership with Hortonworks.

Objective:

Companies are finding that data can be a powerful differentiator and are investing heavily in infrastructure, tools, and personnel to ingest and curate raw data to be "analyzable".  This process of data curation is called "Data Wrangling."

This task can be very cumbersome and requires trained personnel. However with the advances in open source and commercial tooling, this process has gotten a lot easier and the technical expertise required to do this effectively has dropped several notches.

In this tutorial, we will get a feel for what data wranglers do and use the following tools with some hands-on exercises. Given the introductory nature of these examples, the size of data sets is small and does not necessitate HDFS and cluster computing. In the next few installments of this blog series, I intend to add such examples. 

Here are the tools you will need to install on your computer:

Tool Version Download & Install Instructions Type
R language 3.2.4 https://cran.rstudio.com/ Open Source
R Studio 0.99.887 https://www.rstudio.com/products/rstudio/download/ Open Source
OpenRefine

2.6

http://openrefine.org/download.html Open Source
Trifacta Wrangler 3.0.1-client1 https://www.trifacta.com/trifacta-wrangler Commercial but free offering with limited functionality


By no stretch of the imagination is this tutorial supposed to be the end-all and all-inclusive learning experience of data wrangling tools and strategies but it merely scratches the scratch of the plethora of tools we have at our disposal to wrangle data.

Download Resources

  1. Data Wrangling - Slide Deck
  2. Data Wrangling - Exercise Code
  3. Transform Legacy Enterprise to Data Driven Business - Slide Deck

Exercise 1: Fix Date Formats

Solution:

Using R & R Studio

Exercise 1 Preparation:
# Install the following packages
install.packages("stringr")
install.packages("dplyr")
install.packages("lubridate")
install.packages("tidyr")
Exercise 1a Solution:
# Load the previously installed packages
library(stringr)
library(dplyr)
library(lubridate)
library(tidyr)


# Replace this with file name where you want to download the raw file
rawfile <- "ReplaceWithFileNameIncludingPath"

# download the file containing dates 
download.file("https://docs.google.com/uc?authuser=0&id=0B6UoSlNMfxQOQ09oSjVMc3kwajQ&export=download", destfile = rawfile)

# read the file, since the file does not contain a header, it is false
dates <- read.csv(rawfile, header = FALSE, col.names = c("date"))

# there should be 22
length(dates$date)

# use mutate to transform content from dates, ignore the warnings. 
# the intelligence is embedded in useful functions in lubridate package
# create three separate columns based on date formats and then merge them all 
# into one column 
fixed <- mutate(dates, yyyymmdd = ymd(date), mmddyyyy = mdy(date), ddmmyyyy = dmy(date)) %>% unite("fixedup", yyyymmdd, ddmmyyyy, mmddyyyy, sep ="")

# replace NAs with blank strings
fixed$fixedup <- gsub("NA","", fixed$fixedup)

# View fixed
fixed

# identify those dates where month and day are both less than equal to 12
fixed_md_dm <- fixed %>% mutate(fixedup, month = month(fixedup), day = day(fixedup), to_be_reviewed = day(fixedup) <= 12 & month(fixedup) <= 12)

fixed_md_dm
Exercise 1b Solution:
# Continuing from previous example
# Alternative approach highlighting functions in R

# read the file, since the file does not contain a header, it is false
dates <- read.csv(rawfile, header = FALSE, col.names = c("date"))

# there should be 22
length(dates$date)

#Define function
fixdates <- function(date) {

    # empty vector
    fixed <- c()

    # iterate over the vector containing raw dates
    for(i in 1:length(date)) {

        # is it yyyymmdd?
        if(is.na(ymd(date[i]))) {

            # is it mmddyyyy
            if(is.na(mdy(date[i]))) {

                # is it ddmmyyyy
                if(is.na(dmy(date[i]))) {

                    # none, then undertermined
                    fixed[i] <- NA
                } else {
                    # assign ddmmyyyy
                    fixed[i] <- dmy(date[i])
                }
            } else {
                # assign mmddyyyy
                fixed[i] <- mdy(date[i])
            }
        } else {
            # assign yyyymmdd
            fixed[i] <- ymd(date[i])
        }
    }
    # convert to date
    class(fixed) <- "Date"

    # return vector containing converted dates
    fixed
}

# invoke function to convert raw dates
dates$FixedDate <- fixdates(dates$date)

# view contents
dates
Exercise 1c Solution:
# Continuing from previous example
# Extension exercise: add three different dates
# read the file, since the file does not contain a header, it is false

dates <- read.csv(rawfile, header = FALSE, stringsAsFactors = FALSE, col.names = c("date"))

# there should be 22
length(dates$date)

# adding three more dates
dates[nrow(dates)+1,] <- c("01/02/2013")
dates[nrow(dates)+1,] <- c("12021010")
dates[nrow(dates)+1,] <- c("20021010")

# three turn in valid responses for more than one of yyyymmdd, ddmmyyyy, mmddyyyy!
# should be reviewed by someone who has domain knowledge about this data and 
# hence we need to flag such data accordingly

# use mutate to transform content from dates, ignore the warnings. 
fixed <- mutate(dates, yyyymmdd = ymd(date), mmddyyyy = mdy(date), ddmmyyyy = dmy(date))

# add a new column to know if a certain date observation has to be reviewed
fixed %>% mutate(to_be_reviewed = ifelse((!is.na(fixed$yyyymmdd)) + (!is.na(fixed$mmddyyyy))  + (!is.na(fixed$ddmmyyyy)) == 1, FALSE, TRUE))

# extension to above, here we find those which seem to be valid and convert them 
fixed %>% mutate(to_be_reviewed = ifelse((!is.na(fixed$yyyymmdd)) + (!is.na(fixed$mmddyyyy))  + (!is.na(fixed$ddmmyyyy)) == 1, FALSE, TRUE)) %>% filter(to_be_reviewed==FALSE) %>%unite("fixedup", yyyymmdd, ddmmyyyy, mmddyyyy, sep ="") %>% mutate(fixedup = str_replace_all(fixedup, "NA","")) %>% select(date, fixedup)

Trifacta recipe using "Wrangle"

splitrows col: column1 on: '\n'
replace col: column1 on: `(th)|(st)` with: ''
derive value: dateformat(column1, 'yyyy-MM-dd') as: 'column2'
derive value: (day(column2) <= 12) && (month(column2) <= 12) as: 'column3'
rename col: column2 to: 'Fixed_Dates'
rename col: column3 to: 'To_Be_Reviewed'

Exercise 2: Fix Currency Values

Solution:

Using R and R Studio

# Replace this with file name where you want to download the raw file
rawfile <- "ReplaceWithFileNameIncludingPath"

# download sales file
download.file("https://docs.google.com/uc?authuser=0&id=0B6UoSlNMfxQONW01LXpUWTdmdXM&export=download", rawfile)

# read the file
sales <- read.csv(rawfile, stringsAsFactors = FALSE, header = T, sep = "\n")

# There is a lot going on here. 
# Create a separate column extracting the currency symbol
# Create another column containing numeric content by replacing comma, $ and £
# Convert to numeric values
# Normalize to USD by converting £ to USD using 1.44 conversion ratio

sales %>% mutate(currency = str_extract(sales,"\\$|£") ) %>% mutate(raw_amount = str_replace_all(sales, ",|\\$|£", "") ) %>% mutate(amount = as.numeric(raw_amount)) %>%mutate(convertedAmountInUSD = ifelse(currency == '$', amount, ifelse(currency == '£', round(amount*1.44,2), NA)))

Trifacta recipe using "Wrangle"

splitrows col: column1 on: '\n'
header
extract col: sales on: `$|£`
split col: sales on: `$|£`
merge col: sales2,sales3 as: 'column1'
rename col: column1 to: 'amounts'
settype col: amounts type: 'Float'
replace col: amounts on: `,` with: ''
drop col: sales2,sales3
derive value: (sales1 == '$') ? amounts : ((sales1 == '£') ? (amounts * 1.44) : 'test') as: 'column1'
settype col: column1 type: 'Float'
set col: column1 value: numformat(column1, '$##,###.##')

Exercise 3: Fix Java Log4j Log File With Exception Stack Traces

Solution:

Using R & R Studio

Exercise 3 Preparation:
# in addition to the previously loaded libraries
# install another package to deal with Excel using cpp instead of 
# error prone java dependency from R 
install.packages("openxlsx")
Exercise 3 Solution:
# if you have loaded plyr, there are conflict with dplyr. 
# hence detaching it
detach("package:plyr", unload=TRUE)

library(openxlsx)

# specify input and output file
rawfile <- "ReplaceWithFileNameIncludingPath"
outputFile <- "ReplaceWithFileNameIncludingPath"

# download log file from log file
download.file("https://docs.google.com/uc?authuser=0&id=0B6UoSlNMfxQOQWlGQzdvbzhNRG8&export=download", rawfile)

# read log file, each line is a row
logs <- readLines(rawfile)


# read 311 lines
length(logs)

# vector of characters
class(logs)

# convert into a data frame
logsdf <- as.data.frame(logs, stringsAsFactors = FALSE)

# find structure of the data frame
str(logsdf)

# add a new column with line number
logsdf$ID <- seq.int(nrow(logsdf))

# peek at first 6 lines 
head(logsdf)

# look at the class of columns of data frame
sapply(logsdf, class)

# function hacked to group exception lines together 
# feel silly hacking such a contrived function...bear with me here
groupLogs <- function(ID) {
  log_group <- c()
  i <- 1
  while (i <= length(ID)) {
    log_group[i] = ID[i]
    j <- i+1
    while(j <=length(ID)) {
      if(ID[j] - ID[j-1] == 1) {
        log_group[j] <- ID[i]
      } else {
        log_group[j] <- ID[j]
        i <- j
        break
      }
      j <- j+1
      if(j > length(ID))
        i <- j
    }
  }
  log_group
}

# find lines in log file that represent exceptions 
exception_logs <- logsdf[str_detect(logsdf$logs, "^[^\\d]"),]

# invoke function to group together exception lines part of the same stack trace
exception_logs$log_grp <- groupLogs(exception_logs$ID)

# group all the lines in the same exception into a single collapsed row separated # by carriage return
exception_logs <- exception_logs %>% select(logs, log_grp) %>% group_by(log_grp) %>% summarise(stacktrace = paste(logs, collapse = "\r"))


# little math trick to help merge the exception into a the log line previous to
# this exception
exception_logs$log_grp <- ifelse(exception_logs$log_grp == 1, 
exception_logs$log_grp, exception_logs$log_grp - 1)

# log lines that begin with timestamp
proper_logs <- logsdf[str_detect(logsdf$logs, "^[\\d]"),]


# split the log message as a separate column using " - " separator 
proper_logs <- proper_logs %>% separate(logs, c("logs", "message" ), sep = "\\s-\\s") 

# split the log text into time, thread, level and class
proper_logs <- proper_logs %>% separate(logs, c("time", "thread", "level", "class" ), sep = "\\s+")


# left outer join to include exceptions by joining the two data frames created
# earlier
logs <- merge(proper_logs, exception_logs, by.x = c("ID"), by.y = c("log_grp"), all.x = TRUE)

# write the excel file, example /Users/akuntamu/temp/text.xls
write.xlsx(logs, outputFile)

Trifacta recipe using "Wrangle"

  • Requires a two-step approach. 
  • Create an intermediate data set that distinguishes the regular log statements which begin with a timestamp from the exception stack traces.
splitrows col: column1 on: '\n'
set col: column1 value: merge([matches([column1], `{start}{digit}`) ? 'Q' : '',column1])
  • Split rows based on the dummy identifier Q added in the previous step. This technique can help span multi-row observations/records.

splitrows col: column1 on: 'Q'
split col: column1 on: `\n`
split col: column2 on: `{delim-ws}`
split col: column1 on: ' ' limit: 3
replace col: column7 on: `{start}{delim}` with: ''
rename col: column2 to: 'Time'
rename col: column5 to: 'Thread'
rename col: column6 to: 'Level'
rename col: column7 to: 'Class'
rename col: column4 to: 'Message'
rename col: column3 to: 'Exception'

Exercise 4: Web Scraping Top 50 Pop Songs in the Last Decade

  • Web Page
  • If the above page is unavailable, use this cached page

Solution:

Using R & R Studio

Exercise 4 Preparation:
# Install the following packages
install.packages("gtools")

install.packages("rvest")
Exercise 4a Solution:
library(rvest)
library(tidyr)
library(stringr)
library(gtools)

# Scrape the page listing top 50 pop hits in the last decade
scraping_top50_pop <- read_html("http://www.pastemagazine.com/blogs/lists/2009/11/the-best-albums-of-the-decade.html?a=1")

# If the above does not work, I have a cached copy of this page here.
# scraping_top50_pop <- read_html("")

# Scraping using rvest page (which was inspired by BeautifulSoup, python library
top_50_pop <- html_nodes(scraping_top50_pop, ".big") %>% html_text %>% trimws(which ="both")

# split the rank and rest of the text
matrix_top_50_pop <- str_split_fixed(top_50_pop, "\\.", 2)

# supply column names
colnames(matrix_top_50_pop) <- c("rank", "text")

# convert to a data frame
top_50_pop_df <- as.data.frame(matrix_top_50_pop, stringsAsFactors = FALSE)

# convert rank from character to numeric
top_50_pop_df$rank <- as.numeric(top_50_pop_df$rank)

# split the artist based on : delimiter
top_50_pop_df <- top_50_pop_df %>% separate(col = text, into = c("artist", "text"), sep = ":")

# split lines that contains all three fields: title, company, year
top_50_pop_df1 <- top_50_pop_df[which(top_50_pop_df$text %>% str_detect("\\[.*\\]")),] %>% separate(col = text, into = c("title", "company", "year"), sep = "\\[|\\]")

# split lines that contain only title and year
top_50_pop_df2 <- top_50_pop_df[which(!top_50_pop_df$text %>% str_detect("\\[.*\\]")),] %>% separate(col = text, into = c("title",  "year"), sep = "\\s\\(")


# smartly combine the above data frames since company is not common column in both
top_50_pop_combined <- smartbind(top_50_pop_df1,top_50_pop_df2)

# edit the year field to trim and replace parantheses
top_50_pop_combined$year <- apply(top_50_pop_combined[,"year", drop = FALSE], 2, function(year) {
str_replace_all(trimws(year, which = c("both")), "\\(|\\)", "")
}
)[,"year"]

# trim artist
top_50_pop_combined$artist <- trimws(top_50_pop_combined$artist, which = "both")

# trim title
top_50_pop_combined$title <- trimws(top_50_pop_combined$title, which = "both")

# sort the final data frame using rank in the descending order
top_50_pop_final <- arrange(top_50_pop_combined, -rank)

# print
top_50_pop_final

# Provide output CSV file name
# Example outputCSV <- "/Users/akuntamu/test/top-50-pop.csv"

outputCSV <- "ReplaceWithFileNameIncludingPath"

# export to a CSV 
write.table(top_50_pop_final, outputCSV, row.names = FALSE, fileEncoding = "UTF-8", sep = "|")

Trifacta recipe using "Wrangle":

Since Trifacta currently does not have data ingestion capability at the time of this writing, let's perform the first few steps of data scraping in R and create a raw data file using the following commands.

Exercise 4b Solution:

library(dplyr)
library(rvest)
library(tidyr)
library(stringr)

scraping_top50_pop <- read_html("http://www.pastemagazine.com/blogs/lists/2009/11/the-best-albums-of-the-decade.html?a=1")
top_50_pop <- html_nodes(scraping_top50_pop, ".big")  %>% html_text %>% trimws(which ="both")

# replace outputRawCSV 
# Example: outputRawCSV <- "/Users/akuntamu/test/top-50-pop-raw.csv"

outputRawCSV <- "ReplaceWithFileNameIncludingPath"

write.table(as.data.frame(top_50_pop), outputRawCSV, row.names = FALSE, fileEncoding = "UTF-8", sep = "|")

Now upload this file as an input data set into Trifacta Wrangler, create a project and use the following script:


splitrows col: column1 on: '\n' quote: '\"'
replace col: column1 on: /^"/ with: ''
replace col: column1 on: /"$/ with: ''
delete row: matches([column1], `top_50_pop`)
split col: column1 on: `{delim-ws}`
split col: column3 on: `{delim-ws}`
split col: column4 on: ` [`
split col: column5 on: `] `
extract col: column3 on: ` \({digit}+\)`
replace col: column3 on: ` \({digit}+\)` with: ''
merge col: column5,column6 as: 'column7'
drop col: column5,column6
set col: column7 value: trim(column7)
replace col: column7 on: `\(` with: ''
replace col: column7 on: `\)` with: ''
rename col: column2 to: 'rank'
rename col: column1 to: 'artist'
rename col: column3 to: 'title'
rename col: column4 to: 'company'
rename col: column7 to: 'year'
sort order: -rank

Exercise 5: Web Scrape Using OpenRefine, R

This is a simple concocted exercise to consolidate entire 3-day agenda on this page into a simple table showing time slot and session information.

Solution: 

Using Open Refine

  1. Launch openrefine application
  2. This starts a jetty web server on your machine which hosts the openrefine web app
  3. Go to http://localhost:3333
  4. Create Project
  5. Using Clipboard
  6. Paste the following link: 
    1. http://globalbigdataconference.com/64/dallas/big-data-bootcamp/schedule.html
    2. If not available, cached copy
  7. Follow the prompts and select UTF-8 character encoding
  8. Create project 
  9. The recipe for data scraping and transformation can be downloaded from here
  10. Once the project is created successfully, you will find Undo/Redo link 
  11. Click on that link to find Apply button
  12. Paste the content of downloaded file into textbox in the dialog box and click perform operations
  13. It will take a few seconds and run through the 14 steps. 
  14. We will review these 14 steps and walk through these individually

Using R and R Studio

Exercise 5 Preparation:
# Install the following packages
# you may be prompted to restart R session. accept it
install.packages("xml2")
Exercise 5 Solution:
# reloading several packages since R session restarted
library(rvest)
library(xml2)
library(dplyr)
library(gtools)
library(openxlsx)
library(xml2)

# read html content
webpage <- read_html("http://globalbigdataconference.com/64/dallas/big-data-bootcamp/schedule.html")

# extract all tables
tbls <- html_nodes(webpage, "table")

# inspect tables
head(tbls)

# fetch only 1, 3, 4 tables since 2nd and 5th are not useful
tbls_list <- tbls %>% .[c(1,3,4)] %>% html_table(fill=TRUE)

# structure of returned object
str(tbls_list)

# set up column names for each of the data frames representing tables
colnames(tbls_list[[1]]) <- c("time-slot", "session-info")
colnames(tbls_list[[2]]) <- c("time-slot", "session-info")
colnames(tbls_list[[3]]) <- c("time-slot", "session-info")

# smartly append three tables
agenda <- smartbind(tbls_list[[1]],tbls_list[[2]], tbls_list[[3]])

# inspect structure again
str(agenda)

# there is a row with NA, the following command shows how to filter complete rows
agenda <- agenda[complete.cases(agenda),]


# replace outputXLS 
# Example: outputXLS <- "/Users/akuntamu/test/agenda.xls"

outputXLS <- "ReplaceWithFileNameIncludingPath"

# write it out
write.xlsx(agenda, outputXLS)

This concludes this installment of data wrangling exercises using R, R Studio and Trifacta. Hope this wets your appetite for learning more about data wrangling. In future installments, we will look into dealing with missing values, identifying outliers etc using other technologies such as python pandas, open refine and/or any freeware offering by a commercial vendor to get more hands-on expertise.

Feel free to post questions, comments, suggestions for improvement on Twitter. I am reachable @akuntamukkala. 

Look forward to hearing from you!

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

Topics:
r ,big data ,data wrangling

Published at DZone with permission of Ashwini Kuntamukkala, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}