
# Using Python to Find Correlation Between Categorical and Continuous Variables


### In this post, we'll learn how to find correlations between categorical and continuous variables using Python and Pandas.


Before building a machine learning model on a tabular dataset, we normally check whether there is a relationship between the independent variables and the target variable. This can be done by measuring the correlation between two variables. In Python, Pandas provides a function, `dataframe.corr()`, but it finds the correlation between numeric variables only.
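As a small illustration of the numeric-only behaviour, here is a minimal sketch on a hypothetical toy dataframe (the column names `height`, `weight`, and `city` are made up for this example). It selects the numeric columns explicitly before calling `corr()`, which avoids depending on how a particular Pandas version treats text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two numeric columns and one text column.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=50)
toy = pd.DataFrame({
    "height": height,
    "weight": 0.5 * height + rng.normal(0, 5, size=50),
    "city": ["A", "B"] * 25,
})

# Keep only the numeric columns, then compute pairwise Pearson correlations.
numeric_corr = toy.select_dtypes(include="number").corr()
print(numeric_corr)
```

The text column `city` does not appear in the result, which is why a categorical variable needs a different treatment.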

In this article, we will see how to find the correlation between categorical and continuous variables.

## Case 1: When an Independent Variable Only Has Two Values

### Point Biserial Correlation

If a categorical variable has only two values (e.g. true/false), we can convert it into a numeric datatype (0 and 1). Once it is numeric, we can find the correlation using the `dataframe.corr()` function. The Pearson correlation between a 0/1 variable and a continuous variable is known as the point biserial correlation.

Let's create a dataframe that consists of two columns: Employee Type (EmpType) and Salary.

We will deliberately assign a higher salary to EmpType1, so that we get some correlation between EmpType and Salary.

Create a dataframe with the following properties:

• Mean (average) salary of `EmpType1` is 60 with a standard deviation of five.

• Mean (average) salary of `EmpType2` is 50 with a standard deviation of five.

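```python
import pandas as pd
import numpy as np

# EmpType1: mean salary 60, standard deviation 5
num1 = np.random.normal(loc=60, scale=5, size=100)
df1 = pd.DataFrame(num1, columns=['Salary'])
df1['Type'] = 'EmpType1'

# EmpType2: mean salary 50, standard deviation 5
num2 = np.random.normal(loc=50, scale=5, size=100)
df2 = pd.DataFrame(num2, columns=['Salary'])
df2['Type'] = 'EmpType2'

df = pd.concat([df1, df2], axis=0)

# Since the categorical variable 'Type' has only 2 values,
# we convert it into a numeric (0 and 1) datatype.
df['TypeInt'] = (df['Type'] == 'EmpType1').astype(int)

# Select the numeric columns explicitly: recent Pandas versions raise an
# error if corr() is called on a dataframe that still contains text columns.
df[['Salary', 'TypeInt']].corr()
```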

Output:

|  | Salary | TypeInt |
|---|---|---|
| Salary | 1 | 0.736262 |
| TypeInt | 0.736262 | 1 |

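The correlation between EmpType and Salary is about 0.74, which is a strong positive correlation: employees of `EmpType1` tend to earn more. Because the data is generated randomly without a fixed seed, the exact value will differ slightly on each run.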
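To cross-check the result, SciPy also offers `scipy.stats.pointbiserialr`, which computes the point biserial correlation directly and returns a p-value as well. The sketch below regenerates similar data with a fixed seed (the seed and the variable names are assumptions for this example, not part of the original code) and compares it with the Pearson correlation from Pandas.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Regenerate data like the example above, with a fixed seed for repeatability.
rng = np.random.default_rng(42)
salary = np.concatenate([
    rng.normal(loc=60, scale=5, size=100),  # EmpType1
    rng.normal(loc=50, scale=5, size=100),  # EmpType2
])
type_int = np.repeat([1, 0], 100)  # 1 = EmpType1, 0 = EmpType2

# Point biserial correlation and its p-value.
r_pb, p_value = stats.pointbiserialr(type_int, salary)

# Pearson correlation on the same 0/1 encoding, computed with Pandas.
r_pearson = pd.Series(salary).corr(pd.Series(type_int))

print(r_pb, r_pearson, p_value)
```

The two coefficients should agree up to floating-point error, since the point biserial correlation is the Pearson correlation applied to a binary variable.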

## Case 2: When Independent Variables Have More Than Two Values

### ANOVA (Analysis of Variance)

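When a categorical variable has more than two values, a single 0/1 encoding no longer works. Instead, we can use one-way ANOVA, which tests whether the mean of the continuous variable differs across the categories.

We will assign a high salary to `EmpType1`, an average salary to `EmpType2`, and a low salary to `EmpType3`. This way, we will get some correlation between EmpType and Salary.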

• The mean salary of `EmpType1` is 90 with a standard deviation of five.

• The mean salary of `EmpType2` is 70 with a standard deviation of five.

• The mean salary of `EmpType3` is 50 with a standard deviation of five.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Three employee types with clearly separated mean salaries (standard deviation 5).
num1 = np.random.normal(loc=90, scale=5, size=100)
df1 = pd.DataFrame(num1, columns=['Salary'])
df1['Type'] = 'EmpType1'

num2 = np.random.normal(loc=70, scale=5, size=100)
df2 = pd.DataFrame(num2, columns=['Salary'])
df2['Type'] = 'EmpType2'

num3 = np.random.normal(loc=50, scale=5, size=100)
df3 = pd.DataFrame(num3, columns=['Salary'])
df3['Type'] = 'EmpType3'

df = pd.concat([df1, df2, df3], axis=0)

# One-way ANOVA: do the mean salaries differ across the three employee types?
F, p = stats.f_oneway(
    df[df.Type == 'EmpType1'].Salary,
    df[df.Type == 'EmpType2'].Salary,
    df[df.Type == 'EmpType3'].Salary,
)

print(F)
```

The output we get is: 1443.6261
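To see where this number comes from, the F score can be computed by hand as the ratio of the between-group mean square to the within-group mean square. The following is a minimal sketch on freshly generated data with a fixed seed (an assumption made here for repeatability, so the value will not match the output above exactly); it checks the manual calculation against `stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Three groups with means 90, 70, 50 and standard deviation 5.
rng = np.random.default_rng(1)
groups = [rng.normal(loc=m, scale=5, size=100) for m in (90, 70, 50)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Variation of the group means around the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

f_scipy, p_value = stats.f_oneway(*groups)
print(f_manual, f_scipy)
```

The F score is large when the group means are far apart relative to the spread inside each group.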

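• With mean salaries of 90, 70, and 50 for the three employee types (standard deviation of five), the F score is about 1444.
• If the mean salaries of the three employee types are 60, 55, and 50, the F score drops to about 86.
• If the mean salaries of the three employee types are 51, 50, and 49 (almost the same), the F score will be small. When the group means are truly equal, F is expected to be around 1, which indicates no relationship.
• The greater the F score, the stronger the evidence that Salary depends on EmpType. F is a test statistic rather than a correlation coefficient, so it is not bounded by 1 and it grows with the sample size; check the p-value returned by `f_oneway` alongside it.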
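For a measure on a 0 to 1 scale, which is closer to what a correlation coefficient gives, one common option is eta squared: the share of the total variance in Salary that is explained by EmpType. This is an addition to the approach above rather than part of it. The sketch below uses a hypothetical helper, `f_and_eta_squared`, and a fixed seed to compare the three scenarios from the list.

```python
import numpy as np
from scipy import stats

def f_and_eta_squared(means, sd=5, size=100, seed=0):
    """Simulate one group per mean and return (F, p-value, eta squared)."""
    rng = np.random.default_rng(seed)
    groups = [rng.normal(loc=m, scale=sd, size=size) for m in means]
    f_stat, p_value = stats.f_oneway(*groups)

    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    ss_total = ((all_values - grand_mean) ** 2).sum()
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)

    # Eta squared: fraction of total variance explained by group membership.
    return f_stat, p_value, ss_between / ss_total

scenarios = [(90, 70, 50), (60, 55, 50), (51, 50, 49)]
results = {means: f_and_eta_squared(means) for means in scenarios}

for means, (f_stat, p_value, eta_sq) in results.items():
    print(means, round(f_stat, 2), round(p_value, 4), round(eta_sq, 3))
```

Both F and eta squared should shrink as the group means move closer together, but only eta squared stays between 0 and 1 regardless of sample size.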

Topics:
big data, correlation, data analysis, machine learning, python tutorial


Opinions expressed by DZone contributors are their own.