# Exploratory Data Analysis in R (Introduction)

### Exploratory data analysis (EDA) is the very first step in a data project. We will create a code template to achieve this with one function.

## Introduction

EDA consists of univariate (1-variable) and bivariate (2-variables) analysis.

In this post, we will review some functions that cover the first case, univariate analysis.

- Step 1 - First approach to data.
- Step 2 - Analyzing categorical variables.
- Step 3 - Analyzing numerical variables.
- Step 4 - Analyzing numerical and categorical at the same time.

Covering some key points in a basic EDA:

- Data types.
- Outliers.
- Missing values.
- Distributions (numerically and graphically) for both numerical and categorical variables.

## Type of Analysis Results

There are two types of results: informative and operative.

**Informative**: For example, plots or any long variable summary. We cannot filter data from it, but it gives us a lot of information at once. Mostly used in the **EDA** stage.

**Operative**: The results can be used to take an action directly on the data workflow (for example, selecting any variables whose percentage of missing values is below 20%). Mostly used in the **Data Preparation** stage.

## Getting Set Up

Uncomment and run the lines below if you don't have any of these libraries installed:

```
# install.packages("tidyverse")
# install.packages("funModeling")
# install.packages("Hmisc")
```

A newer version of `funModeling` was released on Aug 1; please update!

Now load the needed libraries...

```
library(funModeling)
library(tidyverse)
library(Hmisc)
```

### tl;dr (Code)

Run all the functions in this post in one shot with the following function:

```
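basic_eda <- function(data)
{
  glimpse(data)        # dimensions, types, and first values
  df_status(data)      # zeros, NAs, infinite values, types, unique counts
  freq(data)           # frequency tables and plots for categorical variables
  profiling_num(data)  # summary metrics for numerical variables
  plot_num(data)       # histograms for numerical variables
  describe(data)       # Hmisc summary of every variable
}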
```

Replace `data` with *your* data, and that's it!

`basic_eda(my_amazing_data)`

**Creating the data for this example:**

Using the `heart_disease` data (from the `funModeling` package). We will take only 4 variables for legibility.

`data=heart_disease %>% select(age, max_heart_rate, thal, has_heart_disease)`

## Step 1: The First Approach to the Data

Number of observations (rows) and variables, and a `head` of the first cases.

`glimpse(data)`

```
## Observations: 303
## Variables: 4
## $ age <int> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, ...
## $ max_heart_rate <int> 150, 108, 129, 187, 172, 178, 160, 163, 147,...
## $ thal <fct> 6, 3, 7, 3, 3, 3, 3, 3, 7, 7, 6, 3, 6, 7, 7,...
## $ has_heart_disease <fct> no, yes, yes, no, no, no, yes, no, yes, yes,...
```

Getting the metrics about data types, zeros, infinite numbers, and missing values:

`df_status(data)`

```
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 age 0 0 0 0.00 0 0 integer 41
## 2 max_heart_rate 0 0 0 0.00 0 0 integer 91
## 3 thal 0 0 2 0.66 0 0 factor 3
## 4 has_heart_disease 0 0 0 0.00 0 0 factor 2
```

`df_status` returns a table, so it is easy to keep only the variables that match certain conditions, such as:

- Having more than 80% of non-NA values (`p_na < 20`).
- Having at most 50 unique values (`unique <= 50`).
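As a rough sketch of that kind of filtering, assuming the same four `heart_disease` columns used in this post and the column names shown in the `df_status` output above. The names `status`, `vars_to_keep`, and `data_subset` are made up for this example, and `print_results = FALSE` is only there to keep the console quiet:

```r
library(funModeling)
library(dplyr)

# Same example data as in the rest of the post
data <- heart_disease %>% select(age, max_heart_rate, thal, has_heart_disease)

# Keep the status table instead of printing it
status <- df_status(data, print_results = FALSE)

# Names of the variables that satisfy both conditions
vars_to_keep <- status %>%
  filter(p_na < 20, unique <= 50) %>%
  pull(variable) %>%
  as.character()

# Subset the original data to those variables
data_subset <- data %>% select(all_of(vars_to_keep))
```

With the numbers shown in the table above, `max_heart_rate` (91 unique values) would be the only variable dropped.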

TIPS:

- Are all the variables in the correct data type?
- Variables with lots of zeros or `NA`s?
- Any high cardinality variable?

## Step 2: Analyzing Categorical Variables

The `freq` function runs for all factor or character variables automatically:

`freq(data)`

```
## thal frequency percentage cumulative_perc
## 1 3 166 54.79 55
## 2 7 117 38.61 93
## 3 6 18 5.94 99
## 4 <NA> 2 0.66 100
```

```
## has_heart_disease frequency percentage cumulative_perc
## 1 no 164 54 54
## 2 yes 139 46 100
```

`## [1] "Variables processed: thal, has_heart_disease"`

TIPS:

- If `freq` receives one variable (`freq(data$variable)`), it returns a table. This is useful for treating high cardinality variables (like zip code).
- Export the plots to jpeg format in the current directory: `freq(data, path_out = ".")`
- Do all the categories make sense?
- Are there lots of missing values?
- Always check the absolute and relative values.
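To illustrate the single-variable case from the first tip, here is a minimal sketch that stores the table returned by `freq` and keeps only the categories above a share threshold. The 5% cutoff and the names `thal_freq` and `common_categories` are arbitrary choices for this example, and `plot = FALSE` is used only to skip the chart:

```r
library(funModeling)
library(dplyr)

# Same example data as in the rest of the post
data <- heart_disease %>% select(age, max_heart_rate, thal, has_heart_disease)

# One variable in, one frequency table out (no plot drawn)
thal_freq <- freq(data$thal, plot = FALSE)

# Keep the categories that represent at least 5% of the rows
common_categories <- thal_freq %>% filter(percentage >= 5)
```

The same pattern can help with a high cardinality variable, where the rare categories are candidates for grouping.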

## Step 3: Analyzing Numerical Variables

We will see `plot_num` and `profiling_num`. Both run automatically for all numerical/integer variables:

### Graphically

`plot_num(data)`

Export the plot to jpeg: `plot_num(data, path_out = ".")`

TIPS:

- Try to identify highly unbalanced variables.
- Visually check any variable with outliers.

### Quantitatively

`profiling_num` runs for all numerical/integer variables automatically:

`data_prof=profiling_num(data)`

```
## variable mean std_dev variation_coef p_01 p_05 p_25 p_50 p_75 p_95
## 1 age 54 9 0.17 35 40 48 56 61 68
## 2 max_heart_rate 150 23 0.15 95 108 134 153 166 182
## p_99 skewness kurtosis iqr range_98 range_80
## 1 71 -0.21 2.5 13 [35, 71] [42, 66]
## 2 192 -0.53 2.9 32 [95.02, 191.96] [116, 176.6]
```

TIPS:

- Try to describe each variable based on its distribution (also useful for reporting).
- Pay attention to variables with high standard deviation.
- Select the metrics that you are most familiar with: `data_prof %>% select(variable, variation_coef, range_98)`. A high value in `variation_coef` may indicate outliers. `range_98` indicates where most of the values are.
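To make those two metrics more concrete, here is a base R sketch of how they can be computed by hand. It assumes `variation_coef` is the standard deviation divided by the mean and that `range_98` spans the 1st to 99th percentiles, which is consistent with the table above. The vector `x` is made-up data used only for illustration:

```r
# Hypothetical numeric vector, used only for illustration
x <- c(29, 35, 40, 48, 54, 56, 61, 66, 68, 77)

# Coefficient of variation: spread relative to the mean
variation_coef <- sd(x) / mean(x)

# Interval containing the central 98% of the values
range_98 <- quantile(x, probs = c(0.01, 0.99))

# Interval containing the central 80% of the values
range_80 <- quantile(x, probs = c(0.10, 0.90))

print(round(variation_coef, 2))
print(range_98)
print(range_80)
```

Because `range_98` ignores the most extreme 1% on each side, comparing it with the plain min and max gives a quick hint about outliers.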

## Step 4: Analyzing Numerical and Categorical at the Same Time

Use `describe` from the `Hmisc` package.

```
library(Hmisc)
describe(data)
```

```
## data
##
## 4 Variables 303 Observations
## ---------------------------------------------------------------------------
## age
## n missing distinct Info Mean Gmd .05 .10
## 303 0 41 0.999 54.44 10.3 40 42
## .25 .50 .75 .90 .95
## 48 56 61 66 68
##
## lowest : 29 34 35 37 38, highest: 70 71 74 76 77
## ---------------------------------------------------------------------------
## max_heart_rate
## n missing distinct Info Mean Gmd .05 .10
## 303 0 91 1 149.6 25.73 108.1 116.0
## .25 .50 .75 .90 .95
## 133.5 153.0 166.0 176.6 181.9
##
## lowest : 71 88 90 95 96, highest: 190 192 194 195 202
## ---------------------------------------------------------------------------
## thal
## n missing distinct
## 301 2 3
##
## Value 3 6 7
## Frequency 166 18 117
## Proportion 0.55 0.06 0.39
## ---------------------------------------------------------------------------
## has_heart_disease
## n missing distinct
## 303 0 2
##
## Value no yes
## Frequency 164 139
## Proportion 0.54 0.46
## ---------------------------------------------------------------------------
```

Really useful to get a quick picture of all the variables. But it is not as operative as `freq` and `profiling_num` when we want to use its results to change our data workflow.

TIPS:

- Check min and max values (outliers).
- Check distributions (same as before).

That's all for now!

----

Pablo Casas.

---

Published at DZone with permission of Pablo Casas, DZone MVB. See the original article here.
