Data Validation To Improve the Data Quality
This article emphasizes the importance of data validation and the checklist of validation rules that can be utilized to improve the data quality.
Join the DZone community and get the full member experience.Join For Free
In today’s world, a lot of businesses are deriving value from data and making key decisions based on the results of analytical insights, predictions, and forecasting. In such cases, data quality stands as an important factor. As important as it can be, there are various scenarios of compromising data quality, which may be of any form, such as miscalculations, invalid transformations and business logic, data skewness, missing data points, etc. This may result in generating out-of-sync analytic reports and invalid predictions, which may go unnoticed for a longer time. Even if it is identified at later stages, the damage might have already happened, and fixing production data is an expensive operation in terms of time, money, criticality, and any additional manpower it may require.
Performing data validation is the key to maintaining data quality and helps enhance the trustworthiness of the data. It can be adopted while building data pipelines and making it a part of the data processing phase. This will ensure minimal to no gaps between the source and target systems. Data validation can be implemented in two ways. One way is by building an automated script that accepts database, schema, and table names as parameters and can be utilized across multiple data pipelines. The other way is to build customized scripts for each pipeline with specific business logic. In this article, we focus on a checklist of validation rules that can be utilized while building such scripts to improve the quality of the data. Let’s dive into it.
When moving the data around, the first and most basic rule of validation is to match the counts between the source and target systems. If we are performing aggregation of numbers, the aggregated numbers should match the counts of the underlying source.
When working with historical tables, it is required to validate if there are any duplicate records and delete them. In cases where deleting is not an option, a new boolean column can be added to flag them as duplicated.
When a data type is being converted between source and target systems, ensure that the data is not lost. Validation needs to be in place to check if the conversion is resulting in nulls or if the data is truncated or defaulting to start values. For example, when converting the datetime2 to datetime format in SQL Server, if a column has empty space, the default value of ‘1900-01-01 00:00:00.000’ is resulted from the conversion.
If a column has more than 50% NULLs, it should be validated at the source. If the source is the same, there could be a potential bug in capturing the data at the source. Either way, question the use of bringing this column to the target.
String columns can have any data. A validation needs to be in place to check if the data is blank and has only spaces or special characters. To limit the inaccurate data, ENUM can be adopted wherever possible and define the set of values a column can have. In cases where ENUM is not supported, it can be useful to filter the values that are not expected using the WHERE clause. For example, this query can be used to check a valid weekday: select weekday from the calendar where weekday is in (‘Sunday,’’ Monday,’ ‘Tuesday,’ ‘Wednesday,’ ‘Thursday,’ ‘Friday,’ ‘Saturday’)
Another technique is to use regex to allow only valid strings. For example, an email column can be validated using:
Handling Decimals and Floats
Although both store decimal values, they do not yield the same result when calculating percentages, averages, rounding, etc. Choose the appropriate one based on your needs. For exact values, decimals can be used. For approximate values, a float can be used. Validation needs to be in place to check that when these data types are converted, the resulting value is not losing precision or overflowing.
When bringing a datetime from a source, check the format. If it has an offset or the source system is in a different timezone, it needs to be converted to the target timezone. In cases where a time difference at a smaller granularity is calculated for further analysis between two datetime columns, milliseconds need to be preserved.
Flag columns are the easiest to handle. Assigning a bit or boolean data type for flag columns makes it easy to validate as they allow only True/False and 1/0 apart from nulls.
In some cases, imposing data integrity constraints and validating the above rules alone does not guarantee clean data. Following best practices while building, scheduling, and maintaining data pipelines also plays a key role in ensuring data quality. I’m going to shed some light on those best practices in my next article. Stay tuned!
Opinions expressed by DZone contributors are their own.