Correlations Made Easy

This article takes a closer look at correlation and shows how to calculate the correlation coefficient between two given variables.

By Rajesh Gaddipati · Chintamani Chhatre · Updated Mar. 01, 23 · Tutorial
Prelude

In a connected world, the internet and social networks drive the generation of enormous volumes of data. Storing all of that data is no longer the main challenge, since modern storage technology handles it well. The real challenge is to understand the data, find where it correlates, and draw meaningful insights from it. In this article, we look at correlation in more detail and at how to calculate the correlation coefficient between two variables.

Data Analytics

Data analytics plays a key role in exploring data in depth, identifying trends, discovering patterns, and extracting value from data. Instead of the usual correlation examples (umbrella sales during rainstorms, ice cream sales in summer), we will use a real-world example. The two variables we will correlate are:

  1. Life Expectancy
  2. Public Health Care Expenditure % of GDP

We analyze datasets from “Our World in Data” to determine whether the life expectancy of individuals in a country is correlated with that country's public health care expenditure as a percentage of GDP.

There are four types of analytics. The simple questions below illustrate each of them using the same example.

1. Descriptive Analytics: “What happened?” 

  • What is life expectancy in a specific continent or country? 
  • How much does the government spend on healthcare as a percentage of gross domestic product (GDP)?

2. Diagnostic Analytics: “What could be the root cause?”

  • Why is life expectancy very low in some countries?
  • Is there a correlation between life expectancy and government spending on public health?
  • Identifying data patterns and correlations is a key part of diagnostic analytics.

3. Predictive Analytics: “What might possibly happen in the future?”

  • What will be the life expectancy of a specific country in the next two years?
  • What percentage of GDP will be spent on public healthcare over the next five years?

4. Prescriptive Analytics: “What are the possible action items?”

  • Should we start a global program that helps countries make healthcare a priority, or a program that collaborates with NGOs to develop local health facilities?

Diagnostic Analytics — Correlation Analysis

Let’s look at correlation analysis in more detail and see how it is used to measure the strength of the relationship between two variables.

The two different variables we are considering here:

  1. Public Health Care Expenditure % of GDP (2019 — “Our World in Data”) — Independent Variable
  2. Life Expectancy (2019 — “Our World in Data”) — Dependent Variable

We will check whether the dependent variable changes as the value of the independent variable changes.

The correlation coefficient (r) quantifies the strength and direction of the linear relationship between the two variables. Its value always lies between -1 and 1.

Correlation is identified based on the r value:

  • r towards 1: Positive Correlation
  • r towards -1: Negative Correlation
  • r towards 0: No Correlation
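The three cases above can be illustrated with a minimal sketch. The function name `pearson_r` and the toy number lists are assumptions made for illustration only; they are not taken from the dataset used in this article.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)

# Made-up values, for illustration only.
xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # y rises with x: r towards 1
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # y falls as x rises: r towards -1
r_flat = pearson_r(xs, [1, 2, 3, 2, 1])   # no linear trend: r towards 0

print(r_up, r_down, r_flat)
```

The last list rises and then falls symmetrically, so it has a clear pattern but no linear trend. That is why its r is near 0: Pearson's r only captures linear relationships.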

Access the dataset here.

Sample data for a few countries, followed by the aggregate totals for all 52 countries:

| Country | Public Health Care Expenditure % of GDP, 2019 (X) | Life Expectancy, 2019 (Y) | X * X | Y * Y | X * Y |
|---|---|---|---|---|---|
| Argentina | 5.954 | 77.3 | 35.45012 | 5975.29 | 460.2442 |
| Australia | 7.361 | 83.1 | 54.18432 | 6905.61 | 611.6991 |
| Austria | 7.865 | 81.9 | 61.85823 | 6707.61 | 644.1435 |
| Belgium | 8.107 | 81.8 | 65.72345 | 6691.24 | 663.1526 |
| Brazil | 3.93 | 75.3 | 15.4449 | 5670.09 | 295.929 |
| Bulgaria | 4.295 | 75.1 | 18.44703 | 5640.01 | 322.5545 |
| Canada | 7.641 | 82.4 | 58.38488 | 6789.76 | 629.6184 |
| Chile | 5.656 | 80.3 | 31.99034 | 6448.09 | 454.1768 |
| China | 3.002 | 78 | 9.012004 | 6084 | 234.156 |
| Colombia | 6.284 | 76.8 | 39.48866 | 5898.24 | 482.6112 |
| Costa Rica | 5.339 | 79.4 | 28.50492 | 6304.36 | 423.9166 |
| Croatia | 5.579 | 78.7 | 31.12524 | 6193.69 | 439.0673 |
| Cyprus | 3.857 | 81.4 | 14.87645 | 6625.96 | 313.9598 |
| Czechia | 6.463 | 79.2 | 41.77037 | 6272.64 | 511.8696 |
| Denmark | 8.473 | 81.4 | 71.79173 | 6625.96 | 689.7022 |
| Estonia | 5.081 | 78.7 | 25.81656 | 6193.69 | 399.8747 |
| Finland | 7.136 | 81.9 | 50.9225 | 6707.61 | 584.4384 |
| France | 9.273 | 82.7 | 85.98853 | 6839.29 | 766.8771 |
| Germany | 9.827 | 81.6 | 96.56993 | 6658.56 | 801.8832 |
| Total for all 52 Countries | 312.279 | 4141.4 | 2125.139 | 330504.1 | 25103.71 |

We now have all the X and Y values needed to calculate the correlation coefficient r between the two variables.
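As a rough sketch, the derived columns X * X, Y * Y, and X * Y can be rebuilt from the X and Y values of the sample rows in the table. The helper name `derived_columns` and the variable names are assumptions for illustration. Only the 19 sample rows are included, so the column sums computed here are partial and will not match the 52-country totals.

```python
# (country, X, Y) for the sample rows shown in the table.
# X = public health care expenditure % of GDP (2019), Y = life expectancy (2019).
rows = [
    ("Argentina", 5.954, 77.3), ("Australia", 7.361, 83.1),
    ("Austria", 7.865, 81.9), ("Belgium", 8.107, 81.8),
    ("Brazil", 3.93, 75.3), ("Bulgaria", 4.295, 75.1),
    ("Canada", 7.641, 82.4), ("Chile", 5.656, 80.3),
    ("China", 3.002, 78.0), ("Colombia", 6.284, 76.8),
    ("Costa Rica", 5.339, 79.4), ("Croatia", 5.579, 78.7),
    ("Cyprus", 3.857, 81.4), ("Czechia", 6.463, 79.2),
    ("Denmark", 8.473, 81.4), ("Estonia", 5.081, 78.7),
    ("Finland", 7.136, 81.9), ("France", 9.273, 82.7),
    ("Germany", 9.827, 81.6),
]

def derived_columns(x, y):
    """Return the three products used by the sum-based Pearson formula."""
    return x * x, y * y, x * y

# One tuple per country: (country, X, Y, X*X, Y*Y, X*Y).
table = [(c, x, y, *derived_columns(x, y)) for c, x, y in rows]

for country, x, y, xx, yy, xy in table:
    print(f"{country:<12}{x:>8.3f}{y:>7.1f}{xx:>11.5f}{yy:>10.2f}{xy:>10.4f}")

# Column sums over these 19 rows only (order: X, Y, X*X, Y*Y, X*Y).
sums = [sum(col) for col in zip(*[row[1:] for row in table])]
print("partial sums:", [round(s, 4) for s in sums])
```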

There are several ways to calculate a correlation coefficient (for example, Pearson, rank, and intra-class).

In this article, we will use the Pearson Correlation coefficient. 

Pearson Correlation Coefficient Formulae

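Using the column totals from the table, the Pearson correlation coefficient can be written in its sum-based form:

$$
r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left(n\sum X^2 - \left(\sum X\right)^2\right)\left(n\sum Y^2 - \left(\sum Y\right)^2\right)}}
$$

where n is the number of (X, Y) pairs, here the 52 countries.

r = 0.61564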
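The sum-based form of the Pearson formula can be applied directly to the column totals printed in the table, as in the sketch below. The function name `pearson_from_sums` is an assumption for illustration. The rounded totals as printed give roughly 0.57 rather than the 0.61564 reported in this article. The sketch does not try to explain that gap, which would require the full 52-country dataset.

```python
from math import sqrt

# Column totals for all 52 countries, copied from the table.
n = 52
sum_x = 312.279     # sum of X (public health care expenditure % of GDP)
sum_y = 4141.4      # sum of Y (life expectancy)
sum_xx = 2125.139   # sum of X * X
sum_yy = 330504.1   # sum of Y * Y
sum_xy = 25103.71   # sum of X * Y

def pearson_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Pearson r computed from aggregate sums instead of raw rows."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

r = pearson_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy)
print(f"r from printed totals: {r:.4f}")
```

This form needs only the five running sums and the row count, which is why the table carries the X * X, Y * Y, and X * Y columns.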

The result, r = 0.61564, lies toward 1, which indicates a positive correlation between the two variables. We can quantify the correlation between any two given variables in the same way. Identifying correlations is only part of the work: visualizing them with clear storytelling plays a key role in decision-making.

Let us see another example where we can visualize two different variables.

  1. Life Expectancy
  2. Number of people with access to safe water

Here, we not only quantify the correlation between having access to safe water and the life expectancy of the population in any country, but we also visualize those correlations. Visualizations were created using Tableau.

Visualize

[Figure: Life Expectancy]

[Figure: Safe Water]

The complete interactive dashboard can be accessed here.

Correlation analysis plays a vital role in diagnostic analytics. The Pearson correlation coefficient is widely used to identify and quantify the correlation between two given variables.
