Content sponsored by iceDQ

Data Testing: The Missing Component of Data Quality

Data quality in production is incomplete without proper data testing. In this article, you will learn the basics of data testing and how to get started.

By Sandesh Gawande · Jul. 01, 23 · Analysis

Data quality is crucial for systems such as data warehouses, MDM, CRM, and other data-centric projects. However, data quality is often ignored during development and only addressed once the system is fully operational in production. This leaves a large gap in data quality, because minimal or no testing was done during the development phase of the project.

Data is like a product, and the data system is like a factory that produces it. In a factory, quality is split into two components: quality assurance and quality control. Let’s dive deeper and understand these concepts and how they apply to data quality.

Quality Assurance (QA) vs. Quality Control (QC)

QA: During product manufacturing, QA processes verify the quality of the raw materials and the processing methods to keep defects in the final product to a minimum.

QC: Even with QA in place, some manufacturing defects are not caught by the QA processes, which is why a final QC step checks the finished product.

QUALITY ASSURANCE

  • Focus on processing to prevent defects in the product
  • Proactive
  • Quality of the raw material input
  • Focus on processes
  • Process orchestration

QUALITY CONTROL

  • Identifying defects in the product before delivery
  • Reactive
  • Focus on the final product


How do the QA and QC concepts from manufacturing apply to the data domain? And what are the shortcomings of traditional data quality methods, in which data testing is often ignored?

Limitations of the Legacy Data Quality Approach

The comparison above shows that data quality in the post-production phase is QC, which is just 50% of the data quality process. The other 50%, quality assurance in the form of data testing during development, is missing. Here are a few shortcomings of the traditional data quality approach:

  1. ETL Testing: Data quality tools are designed to test the data, not the ETL processes that produce it.
  2. Too Little, Too Late: Data quality processes are only applied once the final data is delivered in production. By that time, the defective data system is already live.
  3. Garbage In, Garbage Out: The raw data used during development was never tested, so the developer was not aware of all the permutations and combinations of data that are possible.
  4. Bad Process Results in Bad Data: The data processes that transform the data might themselves be incorrect, and that will result in poor-quality data.
  5. Incorrect Orchestration of Data Processes: Data processing is usually the execution of various data processes in a particular order and at particular times, which is required to integrate data from multiple sources into a single unified view. If that orchestration is wrong, even correct individual processes will produce bad data.
  6. Incorrect Data Schema: If the data model has issues such as incorrect data types, wrong lengths or precisions, or missing constraints, then many data issues will later pop up in production data.

Remember, these issues can only be dealt with during the development phase of the project by implementing proper QA, i.e., data testing. As a simple illustration of the last point, a schema check along the lines of the sketch below can catch data model issues before the code reaches production.
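The following is a minimal, generic sketch (not an iceDQ feature) of such a schema check using SQLAlchemy; the connection string, table name, and expected column rules are placeholders chosen for illustration.

```python
# Minimal schema check: verify the deployed table matches the expected data model
# before any ETL code that depends on it is promoted. All names are placeholders.
from sqlalchemy import create_engine, inspect

engine = create_engine("postgresql://user:pass@edw-host/edw_db")  # hypothetical target DB

EXPECTED_COLUMNS = {
    "employee_id":   {"nullable": False},
    "employee_name": {"nullable": False},
    "hire_date":     {"nullable": True},
}

actual = {col["name"]: col for col in inspect(engine).get_columns("employee_dim")}

problems = []
for name, rules in EXPECTED_COLUMNS.items():
    if name not in actual:
        problems.append(f"missing column: {name}")
    elif actual[name]["nullable"] != rules["nullable"]:
        problems.append(f"nullability mismatch on: {name}")

print("Schema problems:", problems or "none")
```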

Understanding Data Testing

Data testing is a method to test and certify the data and the data processes before deploying the code to production. It consists of raw data testing, ETL testing, and process orchestration testing.

Data testing consists of the following activities:

  • Testing raw data to ensure the data is as expected (a generic sketch follows this list).
  • Testing the ETL process to ensure it is coded correctly and transforms the data as per the requirements.
  • Business validation of the output data generated by the ETL process.
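For the first activity, a raw data check could be as simple as the following pandas sketch; the file name, column names, and date bounds are assumptions made for illustration, not part of iceDQ.

```python
# Profile a raw source extract before ETL development starts, so unexpected
# permutations and combinations of data surface early rather than in production.
import pandas as pd

src = pd.read_csv("employee_extract.csv", parse_dates=["hire_date"])  # hypothetical extract

issues = []
if src["employee_id"].duplicated().any():
    issues.append("duplicate employee_id values")
if src[["first_name", "last_name"]].isna().any().any():
    issues.append("NULL first_name/last_name values")
if not src["hire_date"].between(pd.Timestamp("1980-01-01"), pd.Timestamp.today()).all():
    issues.append("hire_date outside the plausible range")

print("Raw data issues:", issues or "none")
```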

Getting Started With Data Testing

A major part of data testing involves testing ETL processes. In iceDQ, this is done by implementing the following steps:

1. Identify the Process for Testing

  • Identify the data process to be tested. In this case, we are taking the “LOAD_EMP_DATA” process.
  • Find the source database. Here, it is “ADVENTURE WORKS DB,” and the schema is “HR.”
  • Find the destination database, “EDW_DB.”

2. Find the Source and Target Tables Used by the Data Process

  • The source table is “EMPLOYEE.”
  • The destination table is “EMPLOYEE_DIM.”
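Steps 1 and 2 boil down to a small piece of test metadata. iceDQ captures this through its own interface; the dictionary below is simply a generic way to note the same information for the sketches that follow.

```python
# What the process-identification steps capture, written down as plain metadata.
process_under_test = {
    "process": "LOAD_EMP_DATA",
    "source": {"database": "ADVENTURE WORKS DB", "schema": "HR", "table": "EMPLOYEE"},
    "target": {"database": "EDW_DB", "table": "EMPLOYEE_DIM"},
}
```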


3. Connecting to Databases

  • Create a connection to the source database, from which the process reads the data.
  • Create another connection to the target database, where the processed data is loaded.
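In generic terms (iceDQ manages connections through its own connection setup), the two connections could look like the SQLAlchemy sketch below; drivers, hosts, and credentials are placeholders.

```python
# One engine per side: the source the process reads from, and the target it loads into.
from sqlalchemy import create_engine, text

source_engine = create_engine(
    "mssql+pyodbc://user:pass@aw-host/AdventureWorks?driver=ODBC+Driver+17+for+SQL+Server"
)
target_engine = create_engine("postgresql://user:pass@edw-host/edw_db")

# Smoke-test both endpoints before defining any rules.
for label, engine in [("source", source_engine), ("target", target_engine)]:
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
        print(f"{label} connection OK")
```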


4. Creating and Executing Data Testing Rules

  • In this case, most of the data transformation happens on the employee’s name.
  • The source has “first name” and “last name” columns.
  • However, the destination has a single complete name column.
  • The data transformation combines the first and last names to produce the complete name.
  • Hence, a check is created to reconcile the data transformation (a generic version is sketched below).
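A hand-rolled equivalent of that reconciliation rule, reusing the engines from the earlier connection sketch, is shown below; iceDQ expresses this as a rule definition rather than Python, and the exact column names (employee_id, first_name, last_name, employee_name) are assumptions.

```python
# Recompute the expected full name from the source and compare it with what the
# ETL process actually loaded into the target dimension table.
import pandas as pd

src = pd.read_sql("SELECT employee_id, first_name, last_name FROM HR.EMPLOYEE", source_engine)
tgt = pd.read_sql("SELECT employee_id, employee_name FROM EMPLOYEE_DIM", target_engine)

src["expected_name"] = src["first_name"].str.strip() + " " + src["last_name"].str.strip()

# Outer merge so missing or extra rows are caught as well as value mismatches.
merged = src.merge(tgt, on="employee_id", how="outer", indicator=True)
exceptions = merged[
    (merged["_merge"] != "both") | (merged["expected_name"] != merged["employee_name"])
]
print(f"{len(exceptions)} exception row(s) found")
```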


5. Reviewing the Output of Data

  • Once the test is executed, iceDQ generates data exceptions that show the defects in the data transformations.
  • Based on these exceptions, the data process is certified or rejected (a simple automated gate is sketched below).
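Continuing the sketch, the certify-or-reject decision can be automated as a simple gate, for example in a CI pipeline; the exceptions DataFrame comes from the previous block, and the file name and exit behavior are illustrative.

```python
# Fail fast when the rule produced exceptions, so a defective process cannot be
# certified and promoted to production.
import sys

if len(exceptions) > 0:
    exceptions.to_csv("load_emp_data_exceptions.csv", index=False)  # hand off to developers
    sys.exit(f"LOAD_EMP_DATA rejected: {len(exceptions)} data exception(s)")

print("LOAD_EMP_DATA certified: no exceptions")
```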


Conclusion


We at iceDQ believe that data quality is not just a production concern; equal or greater effort should go into data testing during development. A combined approach of data testing during development and data monitoring in production will provide the best outcome for data quality, and do so in a very effective way.

