
Exploratory Data Analysis in R (Introduction)

Exploratory data analysis (EDA) is the very first step in a data project. We will create a code template to achieve this with one function.

By Pablo Casas · Aug. 28, 18 · Tutorial


Introduction

EDA consists of univariate (one-variable) and bivariate (two-variable) analysis.
In this post, we will review some functions that cover the first case:

  • Step 1 - First approach to data.
  • Step 2 - Analyzing categorical variables.
  • Step 3 - Analyzing numerical variables.
  • Step 4 - Analyzing numerical and categorical at the same time.

Covering some key points in a basic EDA:

  • Data types.
  • Outliers.
  • Missing values.
  • Distributions (numerical and graphical) for both numerical and categorical variables.

Type of Analysis Results

There are two types of results: informative and operative.

  • Informative: For example, plots or any long variable summary. We cannot filter data with it, but it gives us a lot of information at once. Mostly used at the EDA stage.
  • Operative: Results that can be used to act directly on the data workflow (for example, selecting the variables whose percentage of missing values is below 20%). Mostly used in the data preparation stage.

Getting Set Up

Uncomment these lines if you don't have any of these libraries installed:

# install.packages("tidyverse")
# install.packages("funModeling")
# install.packages("Hmisc")

A newer version of funModeling was released on Aug 1; please update!

Now load the needed libraries...

library(funModeling) 
library(tidyverse) 
library(Hmisc)

tl;dr (Code)

Run all the functions covered in this post in one shot with the following function:

basic_eda <- function(data)
{
  glimpse(data)
  df_status(data)
  freq(data) 
  profiling_num(data)
  plot_num(data)
  describe(data)
}

Replace data with your data, and that's it!

basic_eda(my_amazing_data)

Creating the data for this example:

We will use the heart_disease data (from the funModeling package), taking only four variables for readability.

data=heart_disease %>% select(age, max_heart_rate, thal, has_heart_disease)

Step 1: The First Approach to the Data

First, check the number of observations (rows) and variables, and the head of the first cases:

glimpse(data)
## Observations: 303
## Variables: 4
## $ age               <int> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, ...
## $ max_heart_rate    <int> 150, 108, 129, 187, 172, 178, 160, 163, 147,...
## $ thal              <fct> 6, 3, 7, 3, 3, 3, 3, 3, 7, 7, 6, 3, 6, 7, 7,...
## $ has_heart_disease <fct> no, yes, yes, no, no, no, yes, no, yes, yes,...

Getting the metrics about data types, zeros, infinite numbers, and missing values:

df_status(data)
##            variable q_zeros p_zeros q_na p_na q_inf p_inf    type unique
## 1               age       0       0    0 0.00     0     0 integer     41
## 2    max_heart_rate       0       0    0 0.00     0     0 integer     91
## 3              thal       0       0    2 0.66     0     0  factor      3
## 4 has_heart_disease       0       0    0 0.00     0     0  factor      2

df_status returns a table, so it is easy to keep the variables that match certain conditions, such as:

  • Having at least 80% of non-NA values (p_na < 20).
  • Having less than 50 unique values (unique <= 50).
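Since df_status returns a plain data frame, the two conditions above can be applied with standard dplyr verbs. A minimal sketch, assuming (as in recent funModeling versions) that df_status accepts print_results = FALSE to suppress console output:

```r
library(funModeling)  # provides df_status and the heart_disease data
library(dplyr)

# df_status returns a data frame, so we can filter it like any other table
status <- df_status(heart_disease, print_results = FALSE)

# Keep variables with at least 80% non-NA values and at most 50 unique values
vars_to_keep <- status %>%
  filter(p_na < 20, unique <= 50) %>%
  pull(variable)

# Subset the original data to those variables
data_subset <- heart_disease %>% select(one_of(vars_to_keep))
```

This is exactly the "operative" use of results described earlier: the summary table drives a change in the data workflow.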

TIPS:

  • Are all the variables in the correct data type?
  • Variables with lots of zeros or NAs?
  • Any high cardinality variable?

[Read more here.]

Step 2: Analyzing Categorical Variables

The freq function runs automatically for all factor or character variables:

freq(data)


##   thal frequency percentage cumulative_perc
## 1    3       166      54.79              55
## 2    7       117      38.61              93
## 3    6        18       5.94              99
## 4 <NA>         2       0.66             100


##   has_heart_disease frequency percentage cumulative_perc
## 1                no       164         54              54
## 2               yes       139         46             100
## [1] "Variables processed: thal, has_heart_disease"


TIPS:

  • If freq receives one variable (freq(data$variable)), it returns a table. This is useful for treating high-cardinality variables (like zip code).
  • Export the plots to jpeg format in the current directory: freq(data, path_out = ".")
  • Do all the categories make sense?
  • Are there lots of missing values?
  • Always check the absolute and relative values.
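As an operative example, the single-variable table can drive a cleaning step for a high-cardinality variable. A minimal sketch, assuming (as in recent funModeling versions) that freq accepts plot = FALSE and that the first column of the returned table holds the category values:

```r
library(funModeling)  # provides freq and the heart_disease data
library(dplyr)

# freq on a single variable returns its frequency table
thal_freq <- freq(heart_disease$thal, plot = FALSE)

# Keep only categories covering at least 5% of rows (illustrative cutoff)
# and lump everything else, including NA, into an "other" bucket
frequent <- thal_freq %>% filter(percentage >= 5) %>% pull(1)
thal_clean <- ifelse(heart_disease$thal %in% frequent,
                     as.character(heart_disease$thal), "other")
```

With a real zip-code variable, the same pattern collapses the long tail of rare codes into a single level.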

[Read more here.]

Step 3: Analyzing Numerical Variables

We will use plot_num and profiling_num. Both run automatically for all numerical/integer variables:

Graphically

plot_num(data)


Export the plot to jpeg: plot_num(data, path_out = ".")

TIPS:

  • Try to identify highly unbalanced variables.
  • Visually check any variable with outliers.

[Read more here.]

Quantitatively

profiling_num runs for all numerical/integer variables automatically:

data_prof=profiling_num(data)
##         variable mean std_dev variation_coef p_01 p_05 p_25 p_50 p_75 p_95
## 1            age   54       9           0.17   35   40   48   56   61   68
## 2 max_heart_rate  150      23           0.15   95  108  134  153  166  182
##   p_99 skewness kurtosis iqr        range_98     range_80
## 1   71    -0.21      2.5  13        [35, 71]     [42, 66]
## 2  192    -0.53      2.9  32 [95.02, 191.96] [116, 176.6]

TIPS:

  • Try to describe each variable based on its distribution (also useful for reporting).
  • Pay attention to variables with high standard deviation.
  • Select the metrics you are most familiar with: data_prof %>% select(variable, variation_coef, range_98). A high variation_coef may indicate outliers; range_98 indicates where most of the values are.
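Since profiling_num also returns a data frame, the same operative pattern applies; for example, flagging variables whose spread looks suspicious (the 0.5 cutoff below is an arbitrary, illustrative choice):

```r
library(funModeling)  # provides profiling_num and the heart_disease data
library(dplyr)

prof <- profiling_num(heart_disease)

# Flag variables with a high coefficient of variation: candidates for a
# closer outlier check. The 0.5 threshold is only an illustrative choice.
suspects <- prof %>%
  filter(variation_coef > 0.5) %>%
  select(variable, variation_coef, range_98)
```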

[Read more here.]

Step 4: Analyzing Numerical and Categorical at the Same Time

We will use describe from the Hmisc package.

library(Hmisc)
describe(data)
## data 
## 
##  4  Variables      303  Observations
## ---------------------------------------------------------------------------
## age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       41    0.999    54.44     10.3       40       42 
##      .25      .50      .75      .90      .95 
##       48       56       61       66       68 
## 
## lowest : 29 34 35 37 38, highest: 70 71 74 76 77
## ---------------------------------------------------------------------------
## max_heart_rate 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      303        0       91        1    149.6    25.73    108.1    116.0 
##      .25      .50      .75      .90      .95 
##    133.5    153.0    166.0    176.6    181.9 
## 
## lowest :  71  88  90  95  96, highest: 190 192 194 195 202
## ---------------------------------------------------------------------------
## thal 
##        n  missing distinct 
##      301        2        3 
##                          
## Value         3    6    7
## Frequency   166   18  117
## Proportion 0.55 0.06 0.39
## ---------------------------------------------------------------------------
## has_heart_disease 
##        n  missing distinct 
##      303        0        2 
##                     
## Value        no  yes
## Frequency   164  139
## Proportion 0.54 0.46
## ---------------------------------------------------------------------------

describe is really useful for getting a quick picture of all the variables, but it is not as operative as freq and profiling_num when we want to use the results to change our data workflow.

TIPS:

  • Check min and max values (outliers).
  • Check distributions (same as before).

[Read more here.]


That's all for now!

----

Pablo Casas.

Twitter

LinkedIn

---

Other posts you might like:

  • Introduction to Machine Learning for non-developers
  • Playing with dimensions: from Clustering, PCA, t-SNE... to Carl Sagan!
  • "I hate math!" - Education and Artificial Intelligence to find a meaning in what we do

Published at DZone with permission of Pablo Casas, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
