We talk a lot about the science side of data analysis and BI — the calculations and algorithms needed to perform complex queries. Sure, a big part of BI is math, but making sense of data — planning how to structure your analysis at one end and interpreting the results at the other — is very much an art form, too.
What Is Exploratory Data Analysis?
Exploratory data analysis (EDA) is the first step in your data analysis process. Here, you make sense of the data you have and then figure out what questions you want to ask and how to frame them, as well as how best to manipulate your available data sources to get the answers you need.
You do this by taking a broad look at patterns, trends, outliers, unexpected results, and so on in your existing data, using visual and quantitative methods to get a sense of the story the data tells. You’re looking for clues that suggest logical next steps: new questions to ask, or areas for further research.
Developed by John Tukey in the 1970s, exploratory data analysis is often described as a philosophy rather than a fixed procedure: there are no hard-and-fast rules for how you approach it. That said, it gave rise to a whole family of statistical-computing environments that both helped define what EDA is and are used to tackle specific tasks such as:
- Spotting mistakes and missing data.
- Mapping out the underlying structure of the data.
- Identifying the most important variables.
- Listing anomalies and outliers.
- Testing hypotheses and checking assumptions related to a specific model.
- Establishing a parsimonious model (one that can be used to explain the data with minimal predictor variables).
- Estimating parameters and figuring out the associated confidence intervals or margins of error.
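A couple of the tasks above — spotting missing data and listing outliers — can be sketched in a few lines. This example uses Python with pandas rather than R (an assumption for illustration; the same checks are one-liners in R), and the sales dataset and column names are entirely hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical sales data with one missing value and one suspicious entry
df = pd.DataFrame({
    "region": ["North", "South", "South", "North", "West"],
    "revenue": [120.0, 95.0, np.nan, 110.0, 9000.0],
})

# Spotting missing data: count NaNs per column
missing = df.isna().sum()

# Listing outliers with the 1.5 * IQR rule (Tukey's own fence):
# anything beyond 1.5 interquartile ranges outside the quartiles is flagged
rev = df["revenue"].dropna()
q1, q3 = rev.quantile([0.25, 0.75])
fence = 1.5 * (q3 - q1)
outliers = rev[(rev < q1 - fence) | (rev > q3 + fence)]
```

Here `missing` reports one absent revenue figure, and the 9000.0 entry lands outside the upper fence — exactly the kind of anomaly you would want to investigate (data-entry error? genuine outlier?) before modeling.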
Tools and Techniques
Among the most important statistical programming packages used to conduct exploratory data analysis are S-Plus and R. The latter is a powerful, versatile, open-source programming language that can be integrated with many BI platforms... but more on that in a moment.
Specific statistical functions and techniques you can perform with these tools include:
- Clustering and dimension reduction techniques, which help you to create graphical displays of high-dimensional data containing many variables.
- Univariate visualization of each field in the raw dataset, with summary statistics.
- Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
- Multivariate visualizations for mapping and understanding interactions between different fields in the data.
- K-means clustering, which assigns each data point to the cluster with the nearest mean and recomputes each cluster’s “center” from those assignments.
- Predictive models, for example, linear regression.
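The last two items can be sketched on synthetic data. Again this is Python with NumPy rather than R or S-Plus (an assumption for illustration), with a hand-rolled k-means loop so the “nearest mean” mechanic is visible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated blobs: k-means should land a center near each blob's mean
pts = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

def kmeans(data, k, iters=10):
    # Deterministic sketch: seed the centers with the first and last points
    centers = data[[0, -1]]
    for _ in range(iters):
        # Assign each point to its nearest center...
        labels = np.argmin(np.linalg.norm(data[:, None] - centers, axis=2), axis=1)
        # ...then move each center to the mean of its assigned points
        centers = np.array([data[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels

centers, labels = kmeans(pts, 2)

# A minimal predictive model: a least-squares line through noisy y = 2x + 1
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, 50)
slope, intercept = np.polyfit(x, y, 1)
```

The recovered centers sit near (0, 0) and (5, 5), and the fitted slope and intercept come out close to the true 2 and 1 — the point being that EDA lets you see whether structure like this exists in your data before you commit to a model.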
Where It Fits Into Your BI
With the right data connectors, you can incorporate EDA data directly into your BI platform, helping to inform your analysis. What’s more, you can set this up to allow data to flow the other way too, building and running statistical models in (for example) R that use BI data and automatically update as new information flows into the model.
For example, you could use EDA to map your lead-to-cash process, tracking the journey through every step and department that converts a marketing lead into a customer — with a view to streamlining handoffs so leads move smoothly from one stage to the next.
Potential uses of this are wide-ranging, but ultimately, it boils down to this: exploratory data analysis is about getting to know and understand your data before you make any assumptions about it. It helps you avoid accidentally creating inaccurate models, or building accurate models on the wrong data.
Do it right, and you gain the confidence in your data you need to start deploying powerful machine learning algorithms with huge implications for business performance — but skip this crucial step, and you’ll later find your BI built on shaky foundations.