Analyzing the right data is crucial for an analytics project's success. Most of the times, data from transactional systems or other data sources such as surveys, social media, and sensors are not ready to be analyzed directly. Data has to mix and matched, massaged and preprocessed to transform it into a proper form which can be analyzed. Without this, the data being analyzed and reported on becomes meaningless. And these small discrepancies can make a significant difference in the outcomes that can affect an organization's bottom line performance.
Data Science is a blessing for those who does it correctly and is a curse for others
With R being one of the most preferred tools for Data Science and Machine Learning, we'll discuss some data management techniques using it. The concepts can also be applied using other tools. All we need to do is to search for the corresponding functions in that tool such as Python or Julia.
So, when we talk about data management in R, it is essentially what to perform on a data set that is just loaded in R, so that we can understand something and pre-process it for other exercises such as Modeling or Visualization.
Let's start thinking in a logical way the steps that one should perform once we have the data imported into R.
The first step would be to discover what's in the data file that was exported. To do this, we can:
Use head function to view few rows from the data set. By default, head shows first 5 rows. Ex:
str to view the structure of the imported data. Ex:
summary to view the data summary. Ex:
The second step, which we should always take care of, is NULL values. So, we need to treat those nulls. You can read some of the techniques in my previous article The art of missing value imputations.
Basic imputations for NULL values can be achieved using
na.omitfunction. You can also use the
complete.casesfunction, or simply do a dataframe subset by filtering the null cases, ex:
For regression based imputations in R, you can download the library
Hmiscand use its
For random forest-based imputations, you can use the
rfImputefunction of the library
The third step is the Outlier Analysis and Treatment. Outliers are values that lie far away from other variables and have some unusual behavior. The presence of outliers may distort the overall distribution of data and give erroneous model results. Based on the number of outliers, they can be either dropped from the analysis or treated accordingly, so their presence doesn’t affect the overall model results.
The aforementioned are the three mandatory steps that is applicable to every data set that one needs to execute once a data set is imported. Next, steps are basically the second level of data management where based on the type of data and task you want to perform, you need to use them.
Aggregating Data — Aggregation functions are very useful for understanding the data and present its summarized picture. One can use the aggregate function present in R to calculate the various data aggregates. Ex:
mean_df <- aggregate(mtcars$mpg, list(mtcars$cyl), mean)
Reshaping Data — Different functions or algorithms requires different format of data. So, sometimes we require data to be converted from wide format to long format and vice versa. The reshape and reshape 2 libraries provide various functions to easily interchange the data frame formats. Reshape 2 is the new version of reshape, and has an optimized method which is way faster than the reshape library. Also, functions like cast have been replaced by two functions —
dcast, which produces data frames and
acast, which produces arrays/matrices. You can use this functions to aggregate the values in conjuction with changing the data formats. The
meltfunction takes data in a wide format and stacks a set of columns into a single column of data. To make use of the function, we need to specify a data frame, the id variables, and the measured variables to be stacked. Ex:
df3 <- melt(df, id=("id","Name"))
Merging data — When we have multiple tables, data frames, or data files, we might need to join or merge them to analyze them together. R provides a merge function to accomplish this task. For example, let's say we have two data frames fr1 and fr2, we can use the merge function to join them together by their ID column.
fr1 <- data.frame(name=c("Mark", "Michael","Sibanjan"),ID= c(2,3,1)) fr2 <- data.frame(ID= c(1,2,3), country=c("US","IND","SA")) mrg <- merge(fr1, fr2, by="ID")
Binning — A way to group a set of observations into bins based on the value of a particular variable. Binning techniques come in handy to split continuous data into discrete pieces. For example, suppose we have customer's age, but we want them to categorize them into 4 bins/categories: 0-25, 25-50, 50-75, and 75-100. R has several libraries and functions for this task, such as the
binrlibrary. To know more about binning techniques and methods, you can refer to my book Data Science using Oracle Data Miner and Oracle R Enterprise.
Apart from these, you can also try out the dplyr package which is very useful to perform data management tasks in R. Some of its useful functionalities are described below:
Filter: To subset a data set based on certain conditions. Ex:
Summarise: Summarize multiple values into a single value. You can use various options in this function, such as removing NA values by setting na.rm=TRUE. Ex:
summarise(airquality, mean(airquality$Temp, na.rm=TRUE))
Group By: To group data by one or more variables. For example, we can group the data together based on the month and then use the summarise function to calculate per group summaries. Ex:
summarise(group_by(airquality, Month), mean(airquality$Temp, na.rm=TRUE)).
Sample: dplyr also provides various functions to sample from a data set. For example,
sample_n(mtcars,10) samples 10 rows from the mtcars data frame.
sample_frac(mtcars,0.5) samples 0.5 fracs of rows.
Create new variables: You can use
mutate() to create new variables by preserving the existing ones and use
transmute() to make new variables by dropping the existing ones. Ex:
Piping: Pipes take the output from one function and feed it to the first argument of the next function. It's similar to the pipe function in unix. It can work with most functions. Ex:
Sorting: Sometimes, we need the data to be sorted in an order for creating graphs or for some analysis. Here the order() function in R comes in handy. Ex:
This was just a snapshot of functions available in R for data management. There are various other functions with different flavors that does the same tasks and also much more are getting added every day. What we need to know are the concepts and functionalities so that we can look out for the libraries that accommodates our analytics need well.