
Data Validation To Improve the Data Quality

This article emphasizes the importance of data validation and the checklist of validation rules that can be utilized to improve the data quality.

By Priyanka Kadiyala · Nov. 07, 23 · Tutorial


In today’s world, many businesses derive value from data and make key decisions based on the results of analytical insights, predictions, and forecasting. In such cases, data quality becomes a critical factor. Yet data quality can be compromised in many ways, such as miscalculations, invalid transformations or business logic, data skewness, missing data points, etc. This may result in out-of-sync analytic reports and invalid predictions, which may go unnoticed for a long time. Even if the issue is identified at a later stage, the damage might already be done, and fixing production data is an expensive operation in terms of time, money, criticality, and any additional manpower it may require.

Performing data validation is key to maintaining data quality and enhances the trustworthiness of the data. It can be adopted while building data pipelines by making it part of the data processing phase, which ensures minimal to no gaps between the source and target systems. Data validation can be implemented in two ways. One is to build an automated script that accepts database, schema, and table names as parameters and can be reused across multiple data pipelines. The other is to build customized scripts for each pipeline with its specific business logic. In this article, we focus on a checklist of validation rules that can be applied while building such scripts to improve the quality of the data. Let’s dive into it.
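
As a rough illustration of the first approach, here is a minimal sketch, assuming SQL Server and hypothetical database, schema, and table names, of how a single check can be driven entirely by parameters:

DECLARE @db sysname = N'target_db', @schema sysname = N'dbo', @table sysname = N'orders';
DECLARE @sql nvarchar(max) =
    N'SELECT COUNT(*) AS row_count FROM '
    + QUOTENAME(@db) + N'.' + QUOTENAME(@schema) + N'.' + QUOTENAME(@table) + N';';
-- QUOTENAME guards the dynamically built identifiers; the same pattern can wrap any of the checks below.
EXEC sp_executesql @sql;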

Matching Counts

When moving data around, the first and most basic validation rule is to match the counts between the source and target systems. If the target stores aggregated numbers, those aggregates should reconcile with the counts of the underlying source records.
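
A minimal sketch of such a check, assuming hypothetical source and target tables that should hold the same rows:

-- Row counts on both sides of the load; any difference flags a gap to investigate.
SELECT (SELECT COUNT(*) FROM source_db.dbo.orders) AS source_count,
       (SELECT COUNT(*) FROM target_db.dbo.orders) AS target_count;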

Uniqueness

When working with historical tables, validate whether there are any duplicate records and delete them. In cases where deleting is not an option, a new boolean column can be added to flag them as duplicates.
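
A simple duplicate check, assuming a hypothetical history table whose business key is order_id plus order_date:

-- Any key returned here appears more than once and needs to be deleted or flagged.
SELECT order_id, order_date, COUNT(*) AS copies
FROM orders_history
GROUP BY order_id, order_date
HAVING COUNT(*) > 1;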

Type Conversion

When a data type is converted between the source and target systems, ensure that no data is lost. Validation needs to be in place to check whether the conversion results in NULLs, truncates the data, or defaults to start values. For example, when converting datetime2 to datetime format in SQL Server, if a column holds an empty value, the conversion results in the default value of '1900-01-01 00:00:00.000'.
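
One way to validate this, assuming a hypothetical target column that went through such a conversion, is to count rows that collapsed to NULL or to the known default:

-- Rows whose converted value fell back to NULL or to the SQL Server default datetime.
SELECT COUNT(*) AS suspect_rows
FROM target_db.dbo.orders
WHERE order_ts IS NULL
   OR order_ts = '1900-01-01 00:00:00.000';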

NULLs

If a column has more than 50% NULLs, it should be validated at the source. If the source shows the same pattern, there could be a bug in how the data is captured at the source. Either way, question the value of bringing this column to the target.
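
A quick way to measure this, with a hypothetical table and column:

-- Percentage of NULLs in the column; anything above 50% deserves a closer look at the source.
SELECT 100.0 * SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS null_pct
FROM customers;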

Handling Strings

String columns can hold any data. A validation needs to be in place to check whether the data is blank or contains only spaces or special characters. To limit inaccurate data, ENUM can be adopted wherever possible to define the set of values a column can have. In cases where ENUM is not supported, it can be useful to filter out unexpected values using a WHERE clause. For example, this query can be used to check for a valid weekday: select weekday from calendar where weekday in ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday')
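
Where the database supports it natively (MySQL is assumed in this sketch), an ENUM column enforces the allowed values at write time:

-- Inserts with any weekday value outside this list are rejected in strict SQL mode.
CREATE TABLE calendar (
    calendar_date DATE NOT NULL,
    weekday ENUM ('Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday') NOT NULL
);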

Another technique is to use regex to allow only valid strings. For example, an email column can be validated using:
^(([^<>()[\]\\.,;:\s@"]+(\.[^<>()[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$

Handling Decimals and Floats

Although both decimal and float types store fractional values, they do not yield the same results when calculating percentages, averages, rounding, etc. Choose the appropriate one based on your needs: decimals for exact values, floats for approximate values. Validation needs to be in place to check that when these data types are converted, the resulting value does not lose precision or overflow.
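
The difference is easy to demonstrate; the sketch below assumes SQL Server (T-SQL) syntax:

-- FLOAT is binary and approximate; DECIMAL is exact at the declared scale.
DECLARE @f FLOAT = 0.1, @d DECIMAL(10, 2) = 0.1;
SELECT @f * 3 AS float_result,    -- ~0.30000000000000004, not exactly 0.3
       @d * 3 AS decimal_result;  -- exactly 0.30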

Handling DateTimes

When bringing a datetime from a source, check the format. If it carries an offset or the source system is in a different timezone, it needs to be converted to the target timezone. In cases where a fine-grained time difference between two datetime columns is calculated for further analysis, milliseconds need to be preserved.
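
A sketch of both points, assuming SQL Server and hypothetical column names (a UTC event timestamp plus a request/response pair):

-- Convert UTC to the target timezone and compute a fine-grained difference in milliseconds.
SELECT event_ts_utc AT TIME ZONE 'UTC' AT TIME ZONE 'Central Standard Time' AS event_ts_target,
       DATEDIFF(MILLISECOND, request_ts, response_ts) AS latency_ms
FROM events;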

Handling Flags

Flag columns are the easiest to handle. Assigning a bit or boolean data type to flag columns makes them easy to validate, as they allow only True/False or 1/0 values, apart from NULLs.
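
When a flag arrives as free text instead of a bit, a check like the following (hypothetical table and column) catches unexpected values before the type is tightened:

-- Any non-zero count means the column holds values outside the expected flag set.
SELECT COUNT(*) AS invalid_flags
FROM subscriptions
WHERE is_active NOT IN ('0', '1', 'True', 'False')
  AND is_active IS NOT NULL;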

Conclusion

In some cases, imposing data integrity constraints and validating the above rules alone does not guarantee clean data. Following best practices while building, scheduling, and maintaining data pipelines also plays a key role in ensuring data quality. I’m going to shed some light on those best practices in my next article. Stay tuned!

Data processing Data quality Data validation Data (computing)

Opinions expressed by DZone contributors are their own.
