What Is Data Validation?
A data expert answers this question, looking into the ways data validation is performed, why it's useful, and the challenges inherent in the process.
Data validation is a method for checking the accuracy and quality of your data, typically performed prior to importing and processing. It can also be considered a form of data cleansing. Data validation ensures that your data is complete (no blank or null values), unique (contains distinct values that are not duplicated), and the range of values is consistent with what you expect. Often, data validation is used as a part of processes such as ETL (Extract, Transform, and Load) where you move data from a source database to a target data warehouse so that you can join it with other data for analysis. Data validation helps ensure that when you perform analysis, your results are accurate.
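The three properties above — completeness, uniqueness, and an expected range — can be expressed as simple checks. Here is a minimal sketch in plain Python; the sample records and the age range are illustrative assumptions, not part of any real schema:

```python
# Illustrative records; in practice these would come from your source database.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 51},
    {"id": 3, "age": 28},
]

# Completeness: no blank or null values in any field.
complete = all(v is not None and v != "" for r in records for v in r.values())

# Uniqueness: IDs are distinct, not duplicated.
ids = [r["id"] for r in records]
unique = len(ids) == len(set(ids))

# Range: values fall within the band we expect (0-120 is an assumed bound).
in_range = all(0 <= r["age"] <= 120 for r in records)

print(complete, unique, in_range)
```

Real validation suites wrap checks like these in reusable rules, but the underlying questions stay the same.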
Steps to Data Validation
Step 1: Determine Data Sample
Determine the data to sample. If you have a large volume of data, you will probably want to validate a sample of your data rather than the entire set. You’ll need to decide what volume of data to sample, and what error rate is acceptable to ensure the success of your project.
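One way to put this step into practice is to draw a random sample and compare its observed error rate against the threshold you chose. The sketch below assumes a synthetic data set with a `valid` flag; the sample size and acceptable error rate are placeholder choices you would set per project:

```python
import random

random.seed(42)  # reproducible sample for illustration

# Hypothetical full data set: 10,000 records, a small fraction flagged invalid.
population = [{"id": i, "valid": i % 250 != 0} for i in range(10_000)]

SAMPLE_SIZE = 500          # the volume of data you decided to sample
ACCEPTABLE_ERROR = 0.01    # the error rate acceptable for your project (1%)

sample = random.sample(population, SAMPLE_SIZE)
error_rate = sum(not r["valid"] for r in sample) / SAMPLE_SIZE

print(f"error rate in sample: {error_rate:.2%}")
if error_rate > ACCEPTABLE_ERROR:
    print("sample failed: investigate the source data before migrating")
```

A larger sample narrows the uncertainty around the estimated error rate, at the cost of more validation time.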
Step 2: Validate the Database
Before you move your data, you need to ensure that all the required data is present in your existing database. Determine the number of records and unique IDs, and compare the source and target data fields.
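Comparing record counts and unique IDs between source and target can be sketched as a set comparison. The two in-memory "tables" below are stand-ins for query results from your actual databases:

```python
# Hypothetical source and target tables as lists of rows.
source = [{"id": 1}, {"id": 2}, {"id": 3}]
target = [{"id": 1}, {"id": 2}]

source_ids = {r["id"] for r in source}
target_ids = {r["id"] for r in target}

# Record counts should match the number of unique IDs (no duplicates).
assert len(source) == len(source_ids), "duplicate IDs in source"

# Any ID present in the source but absent from the target signals lost data.
missing = source_ids - target_ids
print(f"records: source={len(source)}, target={len(target)}")
print(f"IDs missing from target: {sorted(missing)}")
```

Against real databases, the ID sets would come from `SELECT id FROM ...` queries rather than literals, but the comparison logic is the same.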
Step 3: Validate the Data Format
Determine the overall health of the data and the changes that will be required of the source data to match the schema in the target. Then search for incongruent or incomplete counts, duplicate data, incorrect formats, and null field values.
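The searches named above — duplicates, incorrect formats, and null field values — can each be a one-line scan. This sketch assumes a `signup` date column that should follow ISO `YYYY-MM-DD` format; the rows are fabricated to show one of each problem:

```python
import re
from collections import Counter

rows = [
    {"id": 1, "signup": "2019-03-01"},
    {"id": 2, "signup": "03/01/2019"},   # incorrect format
    {"id": 2, "signup": "2019-04-15"},   # duplicate ID
    {"id": 3, "signup": None},           # null field value
]

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # assumed target schema format

dupes = [k for k, n in Counter(r["id"] for r in rows).items() if n > 1]
nulls = [r["id"] for r in rows if r["signup"] is None]
bad_fmt = [r["id"] for r in rows if r["signup"] and not DATE_RE.match(r["signup"])]

print("duplicate ids:", dupes)
print("null fields:", nulls)
print("bad date formats:", bad_fmt)
```

Each list tells you which records need repair before — or during — the move to the target schema.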
Methods for Data Validation
You can perform data validation in one of the following ways:
Scripting: Data validation is commonly performed using a scripting language such as Python to write scripts for the validation process. For example, you can create an XML file with source and target database names, table names, and columns to compare. The Python script can then take the XML as input and process the results. However, this can be very time-intensive, as you must write the scripts and verify the results by hand.
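The XML-driven approach described above might look like the following sketch. The element and attribute names (`pair`, `source_db`, `table`, `column`) are this example's own invention, not a standard format:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML naming the databases, tables, and columns to compare.
CONFIG = """
<validation>
  <pair source_db="prod" target_db="warehouse">
    <table name="customers">
      <column>id</column>
      <column>email</column>
    </table>
  </pair>
</validation>
"""

root = ET.fromstring(CONFIG)
for pair in root.iter("pair"):
    for table in pair.iter("table"):
        cols = [c.text for c in table.iter("column")]
        # In a real script, this is where you would query both databases
        # and diff the listed columns; here we just echo the plan.
        print(pair.get("source_db"), "->", pair.get("target_db"),
              table.get("name"), cols)
```

Keeping the comparison plan in a config file means non-programmers can extend the validation coverage without touching the script itself.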
Enterprise tools: Enterprise tools are available to perform data validation. For example, FME data validation tools can validate and repair data. Enterprise tools have the benefit of being more stable and secure, but can require infrastructure and are costlier than open source options.
Open source tools: Open source options are cost-effective, and if they are cloud-based, can also save you money on infrastructure costs. However, they still require a level of knowledge and hand-coding to use effectively. Examples include OpenRefine and tools hosted on SourceForge.
Challenges in Data Validation
Data validation can be challenging for a couple of reasons:
- Validating the database can be challenging because data may be distributed in multiple databases across your organization. The data may be siloed, or it may be outdated.
- Validating the data format can be an extremely time-consuming process, especially if you have large databases and you intend to perform the validation manually. However, sampling the data for validation can help to reduce the time needed.
Data Validation and ETL
Whether you validate data manually or via scripting, it can be very time-consuming. However, after you have validated your data, a modern ETL tool can help you to expedite the process. As a part of your assessment of your data, you can determine which errors can be fixed at the source, and which errors an ETL tool can repair while the data is in the pipeline. You can then automatically integrate, clean, and transform data as it is moved to your data warehouse.
Published at DZone with permission of Garrett Alley, DZone MVB. See the original article here.