DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
  1. DZone
  2. Data Engineering
  3. Big Data
  4. Identifying Data Warehouse Quality Issues During Staging and Loads to the DWH

Identifying Data Warehouse Quality Issues During Staging and Loads to the DWH

Before looking into data quality problems during data staging, we need to know how the ETL system handles data rejections, substitutions, cleansing, and enrichment.

Wayne Yaddow user avatar by
Wayne Yaddow
·
Jan. 18, 19 · Tutorial
Like (2)
Save
Tweet
Share
11.03K Views

Join the DZone community and get the full member experience.

Join For Free

This is the fourth blog in a series on Identifying Data Integrity Issues at Every DWH Phase.

Before looking into data quality problems during data staging, we need to know how the ETL system handles data rejections, substitutions, cleansing, and enrichment. To ensure success in testing data quality, include as many data scenarios as possible. Typically, data quality rules are defined during design. For example:

  • Reject non-numeric data to be loaded in a numeric field.
  • Load fields with full contents; i.e. data fields are not truncated/trimmed as specified in data mappings.
  • Ensure that no duplicate records or field contents are loaded.
  • Substitute NULL if a decimal field has non-numeric data.
  • Verify and correct the "city/state fields" if necessary based on the ZIP code.
  • Ensure that data integrity constraints are properly handled (, rejected, replaced, etc.).

For another example, consider the following business rule: "If no match is found in the look-up table, then load it nonetheless, but report the error in the error log/report." Depending on the data quality rules of the application being tested, scenarios to test might include NULL key values, duplicate records in source data, and invalid data types in fields (e.g., alphabetic characters in a decimal field).

A data cleaning process may be executed in the data staging area in order to improve the correctness of the data warehouse. The staging and DWH load phases are considered a most crucial point of data warehousing where the full responsibility of data quality efforts exist. It is a prime ETL phase for validating data quality from source or auditing and cleaning data issues. Other factors that can contribute to data quality problems during staging and subsequent ETLs:

  • Conflicting business rules used by various data sources.
  • The inability to schedule extracts on time, or within the allotted time interval when updating the DWH.
  • Inability to capture all changes (ex., inserts, updates, deletes) in source files.
  • The absence of an effective and centralized source metadata repository.
  • Misinterpretation of slowly changing dimensions (SCDs) in ETL code.
  • Errors in the transformation or substitution values for NULL values in the ETL process.
  • The absence of automated or effective unit testing facility in ETL tools.
  • The dearth of effective error reporting, validation, and metadata updates in ETL code.
  • Inappropriate ETL process for data insert/update/delete functions.
  • Loss of data during the ETL process (rejected records, dropped fields).
  • An inability to restart the ETL process from checkpoints without losing data.
  • Lack of automatic and effective data defect and correction functions in the ETL code.
  • Inability to incorporate profile, cleansing, and ETL tools to compare and reconcile data and associated metadata.
  • Misaligned primary and foreign key strategies for the same type of entity (e.g., one table stores customer information using the Social Security Number as the key, another uses the CustomerID as the key, and still another uses a surrogate key).

It is always important to review detailed test scenarios with business users and technical designers to ensure that everyone is in agreement. Data quality rules applied to the data will usually be invisible to the users once an application is in production; users will only see what has been loaded in the data warehouse. For this reason, it is important to ensure that what is done with invalid data is reported to users for verifications. These data quality reports present valuable information that sometimes reveals systematic errors.

[Read Wayne's paper on BI and Data Warehouse Testing: Identifying Data Integrity Issues at Every DWH Phase]

Data integrity Data warehouse Data quality Staging (data) Extract, transform, load Database

Published at DZone with permission of Wayne Yaddow. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • What Should You Know About Graph Database’s Scalability?
  • 9 Ways You Can Improve Security Posture
  • Top Five Tools for AI-based Test Automation
  • Learning by Doing: An HTTP API With Rust

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: