
What Is Exploratory Data Analysis?


Ignore the crucial step of exploratory data analysis and you will likely find that your business intelligence is built on shaky foundations later on.


We talk a lot about the science side of data analysis and BI — the calculations and algorithms needed to perform complex queries. Sure, a big part of BI is math, but making sense of data — planning how to structure your analysis at one end and interpreting the results at the other — is very much an art form, too.

What Is Exploratory Data Analysis?

Exploratory data analysis (EDA) is the first step in your data analysis process. Here, you make sense of the data you have and then figure out what questions you want to ask and how to frame them, as well as how best to manipulate your available data sources to get the answers you need.

You do this by taking a broad look at patterns, trends, outliers, unexpected results, and so on in your existing data, using visual and quantitative methods to get a sense of the story this tells. You’re looking for clues that suggest your logical next steps, questions or areas of research.
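To make this concrete, here is a minimal first-look sketch in plain Python. The "sales" figures are made-up illustrative data, not from the article; the point is that even basic summary statistics immediately surface a missing value and a suspicious outlier worth a closer look.

```python
# A first pass over raw data: count missing values and compare
# summary statistics. Uses only the Python standard library.
import statistics

# Hypothetical daily sales figures; one missing, one suspect.
sales = [120, 135, 128, 131, 940, 126, 133, None, 129, 124]

# Separate missing values from real observations before summarizing.
observed = [x for x in sales if x is not None]
missing = len(sales) - len(observed)

mean = statistics.mean(observed)
median = statistics.median(observed)
stdev = statistics.stdev(observed)

print(f"n={len(observed)}, missing={missing}")
print(f"mean={mean:.1f}, median={median:.1f}, stdev={stdev:.1f}")

# A large gap between mean and median is a classic clue that an
# outlier (here, the 940) is distorting the distribution.
```

Nothing here is conclusive on its own; the mean/median gap is simply the kind of "clue" the exploratory phase is designed to surface before you commit to a model.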

Developed by John Tukey in the 1970s, exploratory data analysis is often described as a philosophy, and there are no hard-and-fast rules for how you approach it. That said, it gave rise to a whole family of statistical computing environments that both helped define what EDA is and are used to tackle specific tasks such as:

  • Spotting mistakes and missing data.
  • Mapping out the underlying structure of the data.
  • Identifying the most important variables.
  • Listing anomalies and outliers.
  • Testing hypotheses and checking assumptions related to a specific model.
  • Establishing a parsimonious model (one that can be used to explain the data with minimal predictor variables).
  • Estimating parameters and figuring out the associated confidence intervals or margins of error.
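The last task in the list, estimating a parameter and its confidence interval, can be sketched in a few lines of plain Python. The measurements below are illustrative only, and the 1.96 critical value assumes a normal approximation; with a sample this small, a t-distribution value would be more precise, but the structure of the estimate is the same.

```python
# Estimating a mean and an approximate 95% confidence interval
# using only the standard library.
import math
import statistics

# Hypothetical repeated measurements of the same quantity.
measurements = [9.8, 10.2, 10.0, 9.9, 10.3, 10.1, 9.7, 10.0]

n = len(measurements)
mean = statistics.mean(measurements)
# Standard error of the mean: sample stdev / sqrt(n).
sem = statistics.stdev(measurements) / math.sqrt(n)

# 1.96 is the z critical value for 95% coverage; an assumption
# that the sampling distribution is roughly normal.
half_width = 1.96 * sem
print(f"mean = {mean:.2f} +/- {half_width:.2f}")
```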

Tools and Techniques

Among the most important statistical programming packages used to conduct exploratory data analysis are S-Plus and R. The latter is a powerful, versatile, open-source programming language that can be integrated with many BI platforms... but more on that in a moment.

Specific statistical functions and techniques you can perform with these tools include:

  • Clustering and dimension reduction techniques, which help you to create graphical displays of high-dimensional data containing many variables.
  • Univariate visualization of each field in the raw dataset, with summary statistics.
  • Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
  • Multivariate visualizations for mapping and understanding interactions between different fields in the data.
  • K-means clustering, which partitions observations into k groups by assigning each point to the cluster whose mean (“center”) is nearest.
  • Predictive models, for example, linear regression.
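As an example of the last technique, a simple linear regression can be fitted by hand with ordinary least squares. The (x, y) pairs below are toy data chosen for illustration; in practice you would reach for R or a statistics library rather than compute this manually.

```python
# Ordinary least-squares fit of y = slope * x + intercept,
# computed directly from the closed-form formulas.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]  # roughly y = 2x, with noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(f"y = {slope:.2f}x + {intercept:.2f}")
```

The fitted slope landing close to 2 confirms what a quick bivariate scatter plot would already suggest, which is exactly the interplay between visualization and modeling that EDA encourages.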

Where It Fits Into Your BI

With the right data connectors, you can incorporate EDA data directly into your BI platform, helping to inform your analysis. What’s more, you can set this up to allow data to flow the other way too, building and running statistical models in (for example) R that use BI data and automatically update as new information flows into the model.

For example, you could use EDA to map your lead-to-cash process, tracking the journey through every step and department that converts a marketing lead into a customer, with a view to streamlining the handoffs along the way.

Potential uses of this are wide-ranging, but ultimately, it boils down to this: exploratory data analysis is about getting to know and understand your data before you make any assumptions about it. It helps you avoid accidentally creating inaccurate models, or building accurate models on the wrong data.

Do it right, and you gain the necessary confidence in your data to start deploying powerful machine learning algorithms with huge implications for business performance — but ignore this crucial step, and you’ll find that your BI is built on shaky foundations later on.



Published at DZone with permission of Gur Tirosh, DZone MVB. See the original article here.

