
The First Steps to Analyzing Data in R

What steps should be taken when you want to analyze a data set in R? Enjoy these first steps to prepare your data for a variety of use cases.

By Sibanjan Das · Jan. 17, 17 · Opinion

Analyzing the right data is crucial for an analytics project's success. Most of the time, data from transactional systems or from other sources such as surveys, social media, and sensors is not ready to be analyzed directly. Data has to be mixed and matched, massaged, and preprocessed into a form that can be analyzed. Without this, the data being analyzed and reported on becomes meaningless, and small discrepancies can make a significant difference in outcomes that affect an organization's bottom-line performance.

Data science is a blessing for those who do it correctly and a curse for others.

With R being one of the most preferred tools for data science and machine learning, we'll discuss some data management techniques using it. The concepts can also be applied in other tools such as Python or Julia; all we need to do is search for the corresponding functions in that tool.

So, when we talk about data management in R, it is essentially about what to do with a data set that has just been loaded into R, so that we can understand it and preprocess it for other exercises such as modeling or visualization.

Let's think through, in a logical order, the steps one should perform once the data is imported into R.

  • The first step is to discover what's in the data file that was imported. To do this, we can:

    • Use the head function to view the first few rows of the data set. By default, head shows the first six rows. Ex: head(mtcars)

    • Use str to view the structure of the imported data. Ex: str(mtcars)

    • Use summary to view the data summary. Ex: summary(mtcars)

  • The second step, which we should always take care of, is missing (NA) values; we need to treat them. You can read about some of the techniques in my previous article The art of missing value imputations.

    • Basic treatment of missing values can be achieved using the na.omit function, which drops the rows containing NAs. You can also use the complete.cases function, or simply subset the data frame to keep only the complete rows, ex: df[complete.cases(df), ]

    • For regression-based imputations in R, you can download the Hmisc library and use its aregImpute function.

    • For random forest-based imputations, you can use the rfImpute function from the randomForest library.

  • The third step is outlier analysis and treatment. Outliers are values that lie far away from the other observations and show some unusual behavior. The presence of outliers may distort the overall distribution of the data and give erroneous model results. Depending on the number of outliers, they can either be dropped from the analysis or treated accordingly, so that their presence doesn't affect the overall model results. A short sketch of all three steps follows this list.
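
Here is a minimal sketch of these three steps on the built-in mtcars data set, using only base R; the 1.5 * IQR whisker rule used for outliers below is one common convention, not the only option:

# Step 1: explore the imported data
head(mtcars)      # first six rows
str(mtcars)       # structure: column types and sample values
summary(mtcars)   # per-column summaries

# Step 2: treat missing values (mtcars has none; shown for illustration)
df <- mtcars
df_complete <- na.omit(df)              # drop rows containing NAs
df_complete <- df[complete.cases(df), ] # equivalent subsetting approach

# Step 3: flag outliers in mpg using the boxplot (1.5 * IQR) rule
outliers <- boxplot.stats(df$mpg)$out
df_trimmed <- df[!(df$mpg %in% outliers), ]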

Those are the three mandatory steps that apply to every data set and need to be executed once a data set is imported. The next steps are the second level of data management: which of them you use depends on the type of data and the task you want to perform.

  • Aggregating Data — Aggregation functions are very useful for understanding the data and presenting a summarized picture of it. One can use the aggregate function in R to calculate various data aggregates. Ex: mean_df <- aggregate(mtcars$mpg, list(mtcars$cyl), mean)

  • Reshaping Data — Different functions and algorithms require data in different formats, so we sometimes need to convert data from wide format to long format and vice versa. The reshape and reshape2 libraries provide functions to easily interchange data frame formats. reshape2 is the newer version of reshape and has optimized methods that are much faster than the reshape library. In reshape2, functions like cast have been replaced by two functions: dcast, which produces data frames, and acast, which produces arrays/matrices. You can use these functions to aggregate values in conjunction with changing the data format. The melt function takes data in wide format and stacks a set of columns into a single column. To use it, we specify a data frame, the id variables, and the measured variables to be stacked (see the reshaping sketch after this list). Ex: df3 <- melt(df, id.vars = c("id", "Name"))

  • Merging data — When we have multiple tables, data frames, or data files, we might need to join or merge them to analyze them together. R provides a merge function to accomplish this task. For example, if we have two data frames fr1 and fr2, we can use the merge function to join them by their ID column.

fr1 <- data.frame(name = c("Mark", "Michael", "Sibanjan"), ID = c(2, 3, 1))

fr2 <- data.frame(ID = c(1, 2, 3), country = c("US", "IND", "SA"))

mrg <- merge(fr1, fr2, by = "ID")
  • Binning — A way to group a set of observations into bins based on the value of a particular variable. Binning techniques come in handy for splitting continuous data into discrete pieces. For example, suppose we have customers' ages, but we want to categorize them into four bins: 0-25, 25-50, 50-75, and 75-100. R has several libraries and functions for this task, such as the cut function and the binr library (see the binning sketch below). To learn more about binning techniques and methods, you can refer to my book Data Science using Oracle Data Miner and Oracle R Enterprise.
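
A minimal reshaping sketch using reshape2; the df data frame and its id, Name, and score columns are made up for illustration:

library(reshape2)

# A small wide-format data frame (hypothetical example data)
df <- data.frame(id = 1:3, Name = c("A", "B", "C"),
                 score1 = c(10, 20, 30), score2 = c(5, 15, 25))

# Wide to long: stack score1/score2 into variable/value columns
df_long <- melt(df, id.vars = c("id", "Name"))

# Long back to wide: one column per level of 'variable'
df_wide <- dcast(df_long, id + Name ~ variable)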
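
And a binning sketch with base R's cut function; the ages vector is invented for the example:

# Hypothetical customer ages
ages <- c(12, 23, 37, 45, 58, 61, 79, 84)

# Split continuous ages into four labeled bins
age_bins <- cut(ages, breaks = c(0, 25, 50, 75, 100),
                labels = c("0-25", "25-50", "50-75", "75-100"))
table(age_bins)   # counts per bin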

Apart from these, you can also try out the dplyr package, which is very useful for data management tasks in R. Some of its useful functions are described below:

Filter: To subset a data set based on certain conditions. Ex: filter(mtcars, mpg > 21)

Summarise: Summarize multiple values into a single value. You can use various options within this function, such as removing NA values by setting na.rm = TRUE. Ex: summarise(airquality, mean_temp = mean(Temp, na.rm = TRUE))

Group By: To group data by one or more variables. For example, we can group the data by month and then use the summarise function to calculate per-group summaries. Ex: summarise(group_by(airquality, Month), mean_temp = mean(Temp, na.rm = TRUE)). Note the bare column name Temp: writing airquality$Temp inside a grouped summarise would compute the overall mean rather than the per-group means.

Sample: dplyr also provides various functions to sample from a data set. For example, sample_n(mtcars, 10) samples 10 rows from the mtcars data frame, and sample_frac(mtcars, 0.5) samples 50% of the rows.

Create new variables: You can use mutate() to create new variables while preserving the existing ones, and transmute() to create new variables while dropping the existing ones. Ex: mutate(mtcars, new_var = disp/hp)

Piping: Pipes take the output of one function and feed it to the first argument of the next function, similar to the pipe operator in Unix. The %>% operator comes from the magrittr package, which is loaded with dplyr, and works with most functions. Ex: mtcars$mpg %>% mean()

Sorting: Sometimes we need the data sorted in an order for creating graphs or for some analysis. Base R's order() function and dplyr's arrange() come in handy here. Ex: arrange(mtcars, desc(mpg))
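
These verbs compose naturally with the pipe. Here is a minimal sketch that chains them over mtcars; the threshold and columns are chosen arbitrarily for illustration:

library(dplyr)

mtcars %>%
  filter(mpg > 15) %>%                 # keep reasonably efficient cars
  group_by(cyl) %>%                    # one group per cylinder count
  summarise(mean_mpg = mean(mpg),      # per-group summaries
            n = n()) %>%
  arrange(desc(mean_mpg))              # most efficient groups first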

This was just a snapshot of the functions available in R for data management. There are various other functions, in different flavors, that do the same tasks, and more are being added every day. What we need to know are the concepts and functionalities, so that we can look out for the libraries that accommodate our analytics needs well.

