Exploratory and Confirmatory Analysis: What's the Difference?
Learn about the differences and uses of exploratory data analysis and confirmatory analysis by considering the process a detective goes through.
How does a detective solve a case? She pulls together all the evidence she has, all the data that's available to her, and she looks for clues and patterns.
At the same time, she takes a good hard look at individual pieces of evidence. What supports her hypothesis? What bucks the trend? Which factors work against her narrative? What questions does she still need to answer... and what does she need to do next in order to answer them?
Then, adding to the mix her wealth of experience and ingrained intuition, she builds a picture of what really took place — and perhaps even predicts what might happen next.
But that's not the end of the story. We don't simply take the detective's word for it that she's solved the crime. We take her findings to a court and make her prove it.
In a nutshell, that's the difference between exploratory and confirmatory analysis.
Data analysis is a broad church, and managing this process successfully involves several rounds of testing, experimenting, hypothesizing, checking, and interrogating both your data and approach.
Putting your case together, and then ripping apart what you think you're certain about to challenge your own assumptions, are both crucial to Business Intelligence.
Before you can do either of these things, however, you have to be sure that you can tell them apart.
What Is Exploratory Data Analysis?
Exploratory data analysis (EDA) is the first part of your data analysis process. There are several important things to do at this stage, but it boils down to this: figuring out what to make of the data, establishing the questions you want to ask and how you're going to frame them, and coming up with the best way to present and manipulate the data you have to draw out those important insights.
That's what it is, but how does it work?
As the name suggests, you're exploring — looking for clues. You're teasing out trends and patterns, as well as deviations from the model, outliers, and unexpected results, using quantitative and visual methods. What you find out now will help you decide the questions to ask, the research areas to explore and, generally, the next steps to take.
Exploratory data analysis involves things like: establishing the data's underlying structure, identifying mistakes and missing data, establishing the key variables, spotting anomalies, checking assumptions and testing hypotheses in relation to a specific model, estimating parameters, establishing confidence intervals and margins of error, and figuring out a "parsimonious model" — i.e. one that you can use to explain the data with the fewest possible predictor variables.
In this way, your exploratory data analysis is your detective work. To make it stick, though, you need confirmatory data analysis.
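A few of the exploratory checks listed above (missing data, spread, anomalies) can be sketched in a handful of lines. This is a minimal illustration using only Python's standard library. The `usage_hours` values and the `summarize` helper are made up for the example, and the 1.5 x IQR rule is just one common convention for flagging outliers, not the only one.

```python
import statistics

# Hypothetical monthly usage hours per customer; None marks a missing value.
usage_hours = [12.0, 15.5, None, 14.2, 13.8, 95.0, 16.1, None, 12.9, 14.7]

def summarize(values):
    """Return basic exploratory facts: missing count, median, IQR, outliers."""
    present = [v for v in values if v is not None]
    # quantiles(n=4) returns the three quartile cut points.
    q1, median, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    # Flag anything more than 1.5 * IQR beyond the quartiles (a common rule of thumb).
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n_missing": len(values) - len(present),
        "median": median,
        "iqr": iqr,
        "outliers": [v for v in present if v < low or v > high],
    }

summary = summarize(usage_hours)
print(summary)
```

A flagged value is a prompt for a question ("is 95 hours a data entry error or a power user?"), not a verdict. That is the exploratory mindset in miniature.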
What Is Confirmatory Data Analysis?
Confirmatory data analysis is the part where you evaluate your evidence using traditional statistical tools such as significance, inference, and confidence.
At this point, you're really challenging your assumptions. A big part of confirmatory data analysis is quantifying how likely it is that any deviation from the model you've built happened by chance, and deciding at what point you need to start questioning your model.
Confirmatory data analysis involves things like testing hypotheses, producing estimates with a specified level of precision, regression analysis, and variance analysis. In this way, your confirmatory data analysis is where you put your findings and arguments to trial.
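To make "producing estimates with a specified level of precision" concrete, here is a small sketch of a confidence interval for a proportion, such as a cancellation rate. It assumes the normal (Wald) approximation, which is reasonable for large samples but is only one of several interval methods. The function name and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p_hat = successes / n
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to [0, 1], since a proportion cannot fall outside that range.
    return max(0.0, p_hat - margin), min(1.0, p_hat + margin)

# Hypothetical data: 150 cancellations among 1,000 customers.
low, high = proportion_confidence_interval(150, 1000)
print(f"Cancellation rate: 15.0% (95% CI {low:.1%} to {high:.1%})")
```

The width of the interval is the "margin of error" in practice: a larger sample narrows it, and a higher confidence level widens it.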
Uses of Confirmatory and Exploratory Data Analysis
In reality, exploratory and confirmatory data analyses aren't performed one after another, but continually intertwine to help you create the best possible model for analysis.
Let's take an example of how this might look in practice.
Imagine that in recent months you've seen a surge in the number of users canceling their product subscriptions. You want to find out why, so that you can tackle the underlying cause and reverse the trend.
This would begin with exploratory data analysis. You'd take all of the data you have on the defectors, as well as on happy customers of your product, and start to sift through looking for clues. After plenty of time spent manipulating the data and looking at it from different angles, you notice that the vast majority of people that defected had signed up during the same month.
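One of the "different angles" described here is grouping customers by signup month and comparing cancellation rates across those cohorts. The sketch below shows the idea with Python's standard library. The records and the `cancellation_rate_by_cohort` helper are invented for illustration, and a real dataset would have many more fields to slice by.

```python
from collections import Counter

# Hypothetical records: (signup_month, cancelled?)
customers = [
    ("2023-01", False), ("2023-01", False), ("2023-01", True),
    ("2023-02", True), ("2023-02", True), ("2023-02", True), ("2023-02", False),
    ("2023-03", False), ("2023-03", False), ("2023-03", True),
]

def cancellation_rate_by_cohort(records):
    """Share of customers in each signup month who later cancelled."""
    totals = Counter(month for month, _ in records)
    cancelled = Counter(month for month, did_cancel in records if did_cancel)
    return {month: cancelled[month] / totals[month] for month in sorted(totals)}

rates = cancellation_rate_by_cohort(customers)
worst_month = max(rates, key=rates.get)

for month, rate in rates.items():
    print(f"{month}: {rate:.0%} cancelled")
print("Cohort worth a closer look:", worst_month)
```

A cohort that stands out like this is a clue, not yet an explanation. It tells you where to dig next.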
On closer investigation, you find out that during the month in question, your marketing team was shifting to a new customer management system and, as a result, the introductory documentation that you usually send to new customers wasn't always going through. That documentation would have helped new users troubleshoot many of the teething problems they face.
Now you have a hypothesis: people are defecting because they didn't get the welcome pack (and the easy solution is to make sure they always get a welcome pack!).
But first, you need to be sure that you are right about this cause. Based on your exploratory data analysis, you now build a model that lets you compare defection rates between customers who received the welcome pack and those who did not. This is rooted in confirmatory data analysis.
The results show a clear association between the two: customers who missed the welcome pack were noticeably more likely to defect. Bingo! You have your answer.
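One simple way to run this kind of comparison is a two-proportion z-test, which asks whether a gap in defection rates between the two groups is larger than chance alone would plausibly produce. The sketch below is an assumption about how such a check might look, not the specific model described in the example. The counts are hypothetical, and the test relies on a large-sample normal approximation.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: both groups share the same underlying rate."""
    p1, p2 = x1 / n1, x2 / n2
    # Under H0 the best estimate of the common rate pools both groups.
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 60 of 200 customers without a welcome pack defected,
# versus 90 of 800 customers who received one.
z, p_value = two_proportion_z_test(60, 200, 90, 800)
print(f"z = {z:.2f}, p-value = {p_value:.2g}")
```

A small p-value says the gap is unlikely to be a fluke. It does not by itself prove that the missing welcome pack caused the defections, because the customers were not randomly assigned to the two groups. Deliberately sending the pack to a random subset of new customers would be a stronger test.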
Exploratory Data Analysis and Big Data
Getting a feel for the data is one thing, but what about when you're dealing with enormous data pools?
After all, there are already so many different ways you can approach exploratory data analysis: transforming the data through nonlinear operators, projecting it into a different subspace and examining the resulting distribution, or slicing and dicing it along different combinations of dimensions... add sprawling amounts of data into the mix and suddenly the whole "playing detective" element feels a lot more daunting.
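As a rough illustration of "slicing and dicing along different combinations of dimensions", the sketch below computes an average for every combination of grouping columns. The rows, the column names (`plan`, `region`, `cancelled`), and the `slice_rates` helper are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

# Hypothetical customer rows; "cancelled" is 1 for a defector, 0 otherwise.
rows = [
    {"plan": "basic", "region": "EU", "cancelled": 1},
    {"plan": "basic", "region": "US", "cancelled": 1},
    {"plan": "basic", "region": "EU", "cancelled": 0},
    {"plan": "pro", "region": "EU", "cancelled": 0},
    {"plan": "pro", "region": "US", "cancelled": 0},
    {"plan": "pro", "region": "US", "cancelled": 1},
]

def slice_rates(rows, dimensions, target):
    """Mean of `target` for every non-empty combination of `dimensions`."""
    results = {}
    for size in range(1, len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            groups = defaultdict(list)
            for row in rows:
                groups[tuple(row[d] for d in dims)].append(row[target])
            results[dims] = {key: mean(vals) for key, vals in groups.items()}
    return results

slices = slice_rates(rows, ["plan", "region"], "cancelled")
for dims, table in slices.items():
    print(dims, table)
```

With two dimensions there are only three ways to slice. With ten there are over a thousand, and each one has to scan every row, which is why the number of rows and dimensions quickly becomes the limiting factor.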
The important thing is to ensure that you have the right tech stack in place to cope with this, and to make sure you have access to the data you need in real time.
Two of the best statistical programming packages available for conducting exploratory data analysis are R and S-Plus; R is particularly powerful and easily integrated with many BI platforms. That's the first thing to consider.
The next step is ensuring that your BI platform has a comprehensive set of data connectors that, crucially, allow data to flow in both directions. This means that you can keep importing exploratory analyses and models from, for example, R to visualize and interrogate the results, and also send data back from your BI solution into R so that your model and results update automatically as new information arrives.
In this way, you not only strengthen your exploratory data analysis, you incorporate confirmatory data analysis, too — covering all your bases of collecting, presenting and testing your evidence to help reach a genuinely insightful conclusion.
Your honor, we rest our case.
Published at DZone with permission of Shelby Blitz, DZone MVB. See the original article here.