The Data Preparation Checklist: 6 Questions to Ask

This article covers best practices to simplify the data preparation process and increase its effectiveness.

If you’re interested in discovering additional best-in-class strategies to manage and enhance data, join our upcoming webinar with Aberdeen Group: Priming the Analytical Engine – Accelerated Insight with Efficient Data Prep

Data preparation is perhaps the most important step in any type of serious data analysis. And while it would be ludicrous to attempt to cover such a broad field of knowledge in one article, we’ve prepared a quick checklist that you can run through when preparing data for analysis. Hopefully, this will help you optimize the data preparation process and make sure you have all the important steps and bases covered.

Before You Start: Define the Business Questions

We’ve written before about questions to ask during requirements elicitation. As a general guideline, any type of data analysis starts with becoming familiar with the business questions you’ll want to answer and the KPIs you intend to measure.

A firm understanding of the business requirements will enable you to later map these demands back to the data and the types of analyses you’ll want to perform. Failing to understand at the beginning what the business expects to see can lead to lots of wasted time and effort later down the line, so don’t skip this step!

Once you have a firm grasp of what your business expects to see as the final product of the analysis, you’ll want to start diving into the data. The first thing you’ll want to do is to find it.

1. Where is the Data?

The first set of questions refers to the physical locations in which your organization’s data is stored. For a small deployment, this could be as simple as a series of spreadsheets; for larger ones, you might be looking at multiple databases, Hadoop data lakes, cloud sources or a data warehouse (read about the differences between databases, data marts, and data warehouses here).

You will also need to find out whether you have the required permissions to access the data, and which types or formats of data you’ll be dealing with.

The questions you want to ask in this stage are:

  • Which data sources does my organization work with?
  • Do I have the required permissions or credentials to access the data?
  • What is the size of each dataset and how much data will I need to get from each one?
  • How familiar am I with the underlying tables and schema in each database?
  • Do I need all the data for more granular analysis, or do I need a subset to ensure faster performance?
  • Will the data need to be standardized due to disparity, e.g., when combining data from a SQL database with a NoSQL source such as MongoDB?
  • Will I need to analyze data from external sources that reside outside of my organization’s data stores?
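
Several of the questions above (dataset size, familiarity with tables and schema) can be answered with a quick profiling pass. The following is a minimal sketch, assuming the source is a SQLite database; the `customers` and `orders` tables and the `profile_tables` helper are hypothetical and exist only for illustration. Other databases expose their catalogs differently.

```python
import sqlite3

def profile_tables(conn):
    """Return {table_name: (row_count, [column names])} for every user table."""
    profile = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (name,) in tables:
        # Identifiers cannot be bound as parameters, so quote the name instead.
        quoted = '"' + name.replace('"', '""') + '"'
        row_count = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        # Column 1 of each PRAGMA table_info row is the column name.
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({quoted})")]
        profile[name] = (row_count, columns)
    return profile

# Hypothetical demo data standing in for a real source database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 1, 35.5), (3, 2, 12.0)],
)

table_profile = profile_tables(conn)
for name, (rows, cols) in table_profile.items():
    print(name, rows, cols)
```

A listing like this gives a first impression of which tables are large enough to warrant sampling a subset.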

2. Do You Need to Change the Data?

Often data needs to be manually transformed or manipulated for effective analysis. This could be relevant when various tables or datasets use different formats for the same information, when the data is inconsistent or contains duplicate information, or when you want to group data in new ways.

Here’s what you’ll want to be asking:

  • For each individual source – is it complete? Accurate? Up to date?
  • In its current state, can I use the data to answer my business questions?
  • If there are inconsistencies or redundant values, what do I need to do to clean the data? Is it a matter of manually changing a few values or will a more systematic approach be necessary?
  • Will I be able to change the data in its original location, or would this need to be done in a secondary environment (e.g. cases where you do not have permissions to alter production data)?
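
When a few manual fixes are not enough, a more systematic approach might look like the sketch below. It assumes hypothetical records with an `email` and a `signup` field, and it assumes that slash-separated dates are day-first; both the field names and the list of accepted formats are illustrative and would need to match your actual sources. The input list is left untouched, so the cleaning happens on a copy rather than in the original location.

```python
from datetime import datetime

# Assumed formats for this example; "%d/%m/%Y" treats 01/03/2016 as 1 March.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")

def normalize_date(raw):
    """Return the date as an ISO string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean_records(records):
    """Standardize fields, drop duplicates, and set aside unparseable rows."""
    seen = set()
    cleaned, rejected = [], []
    for rec in records:
        email = rec["email"].strip().lower()
        signup = normalize_date(rec["signup"])
        if signup is None:
            rejected.append(rec)  # keep for manual review instead of guessing
            continue
        key = (email, signup)
        if key in seen:
            continue  # same information already kept once
        seen.add(key)
        cleaned.append({"email": email, "signup": signup})
    return cleaned, rejected

raw_records = [
    {"email": "Ada@Example.com ", "signup": "2016-03-01"},
    {"email": "ada@example.com", "signup": "01/03/2016"},
    {"email": "grace@example.com", "signup": "2016/03/02"},
    {"email": "linus@example.com", "signup": "sometime in March"},
]
cleaned, rejected = clean_records(raw_records)
print(len(cleaned), "clean rows,", len(rejected), "rejected")
```

Rows that cannot be parsed are returned separately, which makes it easier to decide whether the remaining problems are a matter of a few values or something larger.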

3. How Will You Connect the Data?

If you’re working with many different data sources and tables, you’ll need to model the data in a way that enables dashboard users to quickly receive answers to ad-hoc queries by connecting related fields in different tables. The relationships between the various entities in your data model will determine the types of queries your future analysis will be able to answer, as well as the efficiency with which it does so.

Start by asking:

  • Which fields are appropriate to connect data together, from a business viewpoint?
  • What relationship will occur once these fields are connected? You’ll want to avoid many-to-many relationships.
  • Will my data model scale?
  • How easy will it be to add data sources and make changes to the model further down the road?
  • Can we simplify the relationship without affecting performance? Note that this might depend on the data preparation and analytics tools that you’re using.
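
One way to see which relationship will occur before connecting two fields is to check whether the key is unique on each side. This is a minimal sketch; the `customers`, `orders`, and `tickets` rows and the `relationship_type` helper are hypothetical, and the result only reflects the rows you pass in, so a sample may miss duplicates that exist in the full data.

```python
from collections import Counter

def key_is_unique(rows, key):
    """True if no value of `key` appears more than once in `rows`."""
    counts = Counter(row[key] for row in rows)
    return all(n == 1 for n in counts.values())

def relationship_type(left_rows, left_key, right_rows, right_key):
    """Classify the relationship created by connecting the two key fields."""
    left_unique = key_is_unique(left_rows, left_key)
    right_unique = key_is_unique(right_rows, right_key)
    if left_unique and right_unique:
        return "one-to-one"
    if left_unique:
        return "one-to-many"
    if right_unique:
        return "many-to-one"
    return "many-to-many"

# Hypothetical tables represented as lists of dicts.
customers = [{"id": 1}, {"id": 2}]
orders = [{"customer_id": 1}, {"customer_id": 1}, {"customer_id": 2}]
tickets = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 2}]

print(relationship_type(customers, "id", orders, "customer_id"))
print(relationship_type(orders, "customer_id", tickets, "customer_id"))
```

In this example, connecting `orders` directly to `tickets` yields a many-to-many relationship, whereas routing both through `customers` keeps each link one-to-many.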

4. Do You Need to Further Consolidate the Data?

For certain types of more complex analyses, you might want to create new tables on top of your existing ones. One example of this can be a funnel analysis, in which you would want to take the basic information about an ongoing, multi-stage process and create various buckets into which each record would be categorized. Examples of questions that can help you understand whether you’re ready to go include:

  • Do I need to create summary tables for the types of analysis I want to perform?
  • Do I need to join data from the tables I’m working with using an inner or outer join, or to combine these tables to create a new one?
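
To make the funnel example more concrete, here is a rough sketch of building a summary table from per-user events. The stage names and the `user_events` data are hypothetical, and the sketch assumes that reaching a later stage implies having passed through the earlier ones.

```python
from collections import Counter

# Hypothetical funnel stages, in order.
STAGES = ["visited", "signed_up", "trial", "purchased"]

def furthest_stage(events):
    """Return the latest stage present in `events`, or None if there is none."""
    reached = [stage for stage in STAGES if stage in events]
    return reached[-1] if reached else None

def funnel_summary(user_events):
    """Bucket each user by furthest stage, then count users reaching each stage."""
    buckets = Counter()
    for events in user_events.values():
        stage = furthest_stage(events)
        if stage is not None:
            buckets[stage] += 1
    summary = []
    remaining = sum(buckets.values())
    for stage in STAGES:
        summary.append((stage, remaining))  # users who got at least this far
        remaining -= buckets[stage]
    return summary

user_events = {
    "u1": {"visited"},
    "u2": {"visited", "signed_up"},
    "u3": {"visited", "signed_up", "trial", "purchased"},
    "u4": {"visited", "signed_up", "trial"},
}
summary = funnel_summary(user_events)
for stage, users in summary:
    print(stage, users)
```

The resulting list of (stage, user count) pairs is the kind of small summary table a dashboard can query directly instead of scanning raw events.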

5. How Will You Import the Data?

While there are certain situations in which you would create reports and analysis by querying the production databases, most BI tools and implementations will rely on creating an amalgamation of the data in a secondary environment which will serve as your analytical database. The questions you want to ask include:

  • Does the local or cloud server I move my data to have sufficient software and hardware to crunch the amounts of data I’m dealing with? The two are somewhat dependent, as the right software can reduce hardware costs.
  • At what frequency do I need to import the data? This depends on the rate at which the original data changes or grows.
  • How will importing the data affect my production environment?
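
If the source data mostly grows rather than changes, one common way to limit the load on production is to import only the rows added since the last run. The sketch below is one possible approach, not a description of any particular BI tool. It assumes an append-only `orders` table with an ever-increasing `id`; the table names and the `incremental_import` helper are hypothetical, and updates or deletes in the source would not be picked up this way.

```python
import sqlite3

def incremental_import(source, target, last_seen_id):
    """Copy rows with id > last_seen_id and return the new high-water mark."""
    rows = source.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    target.executemany(
        "INSERT INTO orders_copy (id, amount) VALUES (?, ?)", rows
    )
    target.commit()
    return rows[-1][0] if rows else last_seen_id

# Hypothetical production database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, 20.0), (2, 35.5), (3, 12.0)]
)

# Hypothetical analytical database.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders_copy (id INTEGER PRIMARY KEY, amount REAL)")

watermark = incremental_import(source, target, 0)          # first full load
source.executemany("INSERT INTO orders VALUES (?, ?)", [(4, 8.0), (5, 50.0)])
watermark = incremental_import(source, target, watermark)  # only the new rows

copied = target.execute("SELECT COUNT(*) FROM orders_copy").fetchone()[0]
print(watermark, copied)
```

How often such a job runs would follow from the rate at which the original data changes or grows.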

6. How Will You Verify the Results?

Before you can proudly announce that the data preparation is complete, you’ll want to make sure that the end result is accurate and that you haven’t made any mistakes along the way. To verify the data, ask questions such as:

  • Does it make sense on a general level?
  • Are the measures I’m seeing in line with what I already know about the business?
  • Do calculations in my analytical environment return the same results as the same calculations performed manually on the original data?
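
The last question can be partly automated by comparing a few simple aggregates between the original data and the analytical environment. This is a minimal sketch with hypothetical rows and a hypothetical `reconcile` helper; matching counts and totals do not prove the data is correct, but a mismatch is a reliable sign that something went wrong along the way.

```python
import math

def reconcile(source_rows, analytical_rows, measure, tolerance=1e-9):
    """Compare row counts and the total of `measure` between two datasets."""
    checks = {
        "row_count": (len(source_rows), len(analytical_rows)),
        "total": (
            math.fsum(row[measure] for row in source_rows),
            math.fsum(row[measure] for row in analytical_rows),
        ),
    }
    # Each check passes when both sides agree within the tolerance.
    return {
        name: math.isclose(a, b, abs_tol=tolerance)
        for name, (a, b) in checks.items()
    }

source_rows = [{"amount": 20.0}, {"amount": 35.5}, {"amount": 12.0}]
complete_copy = [{"amount": 20.0}, {"amount": 35.5}, {"amount": 12.0}]
partial_copy = [{"amount": 20.0}, {"amount": 35.5}]  # one row lost on import

ok_result = reconcile(source_rows, complete_copy, "amount")
bad_result = reconcile(source_rows, partial_copy, "amount")
print(ok_result)
print(bad_result)
```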

Start Analyzing!

After you’ve gone through the entire checklist above, you’ll have identified the data, transformed it, built your data model, moved the data to an analytical database and verified the results. This could be a matter of hours, days or more – depending on the amount of data you’re working with and its complexity.

If everything went well, you’re good to go – so go ahead and start building some dashboards! And read our guide to dashboard design to make sure you follow the core principles that will help you tell a clear and understandable story with your data.

Published at DZone with permission of Eran Levy, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.