
Part 2: Focus on Quality Risks for Data Integration Projects


Part two of this 3-part article series covers various examples of quality risks for data integration projects and the best practices for tackling them.


Introduction

In part one of this 3-part series, we covered why assessing risks early and often is key and presented best practices for addressing and mitigating common risks. Part two covers examples of quality risks for data integration projects and best practices for tackling them.

Identify Data Integration Risks

The first step in solving problems is to identify them. Doing so on a data integration project entails registering every issue that might disrupt the normal flow of project events. A primary outcome of this step is a list of risks. 

A QA lead commonly heads a risk-based testing practice. However, this person alone will not be able to devise the entire list; the whole QA team’s input makes a significant impact on the risk identification process.
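As an illustration, a risk register can start as little more than a structured list that the whole QA team contributes to. The sketch below is a minimal, hypothetical Python example; the field names and entries are illustrative, not prescribed.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    """One entry in a data integration risk register (fields are illustrative)."""
    risk_id: str
    description: str
    impact: str        # e.g., "high", "medium", "low"
    likelihood: str
    mitigation: str
    owner: str = "QA lead"

# Example entries gathered from the whole QA team, not the QA lead alone.
risk_register = [
    Risk("R-001", "Source data quality is unknown", "high", "likely",
         "Profile source data during requirements gathering"),
    Risk("R-002", "Data dictionaries are missing or outdated", "high", "possible",
         "Assign data stewards to maintain documentation"),
]

for r in risk_register:
    print(f"{r.risk_id}: {r.description} (impact: {r.impact})")
```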

The extraction, transformation, and loading (ETL) and testing processes remain the most underestimated and under-budgeted phases of most data integration projects. A primary reason the ETL portion of a project often raises more questions than it resolves is a lack of understanding of source data quality.

During data integration, large amounts of data are transformed, integrated, structured, cleansed, and grouped into a single structure. Each of these movements and changes carries a risk of data corruption; therefore, data testing is a critical endeavor.

Examples of Data Integration Risks and Associated Best Practices

The following are typical data integration project risks and challenges, accompanied by recommended best practices. The compilation is not exhaustive; it covers only risks commonly known to have a "high" impact. Each risk is listed with a short description of its potential project impact, followed by best practice recommendations.

UNCERTAIN SOURCE DATA QUALITY

The data integration effort may not meet the planned schedule because the quality of source data is unknown or inadequate.

  • Conduct formal data profiling of source data early (i.e., during requirements gathering) to understand whether data quality meets project needs. Inaccuracies, omissions, cleanliness issues, duplicates, and inconsistencies in the source data can then be identified and resolved before or even during the extract/transform process (a profiling sketch follows this list).
  • Since duplicate data often exists in multiple source systems, identify all sources and discuss with the users which are most applicable.
  • Implement commercial/automated data quality tools, accompanied by consultation and training.
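A minimal profiling sketch, assuming the pandas library, a hypothetical extract file, and a customer_id business key, might look like the following; adapt the names to your own sources.

```python
import pandas as pd

# Hypothetical source extract; adjust the path and key columns to your project.
source = pd.read_csv("customer_source_extract.csv")

# Basic profile: type, completeness, and cardinality per column.
profile = pd.DataFrame({
    "dtype": source.dtypes.astype(str),
    "null_count": source.isna().sum(),
    "null_pct": (source.isna().mean() * 100).round(2),
    "distinct_values": source.nunique(),
})
print(profile)

# Flag duplicate records on the assumed business key.
dupes = source[source.duplicated(subset=["customer_id"], keep=False)]
print(f"{len(dupes)} rows share a customer_id with another row")
```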

DATA DICTIONARIES AND DATA MODELS ARE FLAWED

Data within sources and targets cannot be easily interpreted by developers and QA.

Ensure that data models and mapping documents are accurate, current, and well documented:

  • Use automated documentation tools.
  • Create meaningful documentation of data definitions and data descriptions in a data dictionary (a minimal example follows this list).
  • Create procedures for maintaining documentation in accordance with changes to the source systems.
  • Have data stewards/owners provide training to the QA team.
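As one possible approach, a data dictionary can be kept as structured data so QA can check incoming columns against it. The entry below is a hypothetical sketch; the field names and values are illustrative.

```python
# A minimal, machine-readable data dictionary (column and field names are illustrative).
data_dictionary = {
    "customer_id": {
        "description": "Unique identifier assigned by the CRM source system",
        "type": "integer",
        "nullable": False,
        "source_system": "CRM",
        "example": 10482,
    },
    "signup_date": {
        "description": "Date the customer account was created",
        "type": "date (YYYY-MM-DD)",
        "nullable": False,
        "source_system": "CRM",
        "example": "2018-06-01",
    },
}

# QA can use the same structure to check that incoming columns are documented.
incoming_columns = ["customer_id", "signup_date", "region_code"]
undocumented = [c for c in incoming_columns if c not in data_dictionary]
print("Columns missing from the data dictionary:", undocumented)
```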

SOURCE & TARGET DATA MAPPING ISSUES

Source data may be inaccurately mapped due to the absence of data dictionaries and data models.

  • Data dictionaries and data catalogs should be maintained to support all data associated with the project. Higher-quality data mapping documents will be the result.
  • Implement a data mapping change control system to keep a record of every change (a simple example follows this list).
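One lightweight way to keep a record of every mapping change is to attach a change log to each mapping record. The sketch below is hypothetical; the identifiers, columns, and transformation rule are illustrative.

```python
# One source-to-target mapping record, with its change history attached (illustrative only).
mapping = {
    "mapping_id": "MAP-042",
    "source": "crm.customers.cust_nm",
    "target": "dw.dim_customer.customer_name",
    "transformation": "TRIM and title-case the source value",
    "change_log": [
        {"version": 1, "date": "2020-01-15", "author": "data architect",
         "note": "Initial mapping from requirements workshop"},
        {"version": 2, "date": "2020-02-03", "author": "QA lead",
         "note": "Added TRIM after profiling found trailing spaces"},
    ],
}

current_version = mapping["change_log"][-1]["version"]
print(f"{mapping['mapping_id']} is at version {current_version}")
```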

EXCESSIVE DEFECTS IN DATA SOURCES

Data defects are found late in development.

  • Ensure that data requirements are complete and that data dictionaries are available and current.
  • Profile all data sources and target data after each ETL run.
  • Prepare to cleanse and fix dirty data.
  • Assure the continued maintenance of data mapping and all other specification documents.

NO MASTER TEST PLAN OR TEST STRATEGY ESTABLISHED

A master test plan/strategy does not exist or is inadequate in scope.

A test strategy/master test plan (MTP) documents the overall structure and objectives of all project testing, from unit and component tests to system and performance tests. The MTP covers activities across the data integration lifecycle and identifies evaluation criteria for the testers.

EXCESSIVE DEFECTS IN TARGET DATA

Much of the loaded target data is in error after ETL runs.

  • Ensure that the target data sampling process is high quality.
  • Use test tools that provide extensive data coverage.
  • Choose a data sampling approach that is extensive enough to avoid missing defects in both source and target data.
  • Choose an appropriate technology to compare source and target data and determine whether target values match the source or have been transformed as specified (a comparison sketch follows this list).
  • Verify that no data or metadata has been lost during ETL processes. The data warehouse must load all relevant data from the source application into the target according to business rules.
  • Check the correctness of surrogate keys that uniquely identify rows of data.
  • Check data loading status and error messages after ETL runs.
  • Verify that data types and formats are as specified during database design.
  • Verify that every ETL session completed with only planned exceptions.
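For fields that are loaded without transformation, one way to compare source and target is to checksum each row on both sides. The sketch below assumes pandas, hypothetical extract files, and a customer_id business key; transformed fields would instead be checked against the mapping rules.

```python
import hashlib
import pandas as pd

def row_checksums(df: pd.DataFrame, key: str) -> pd.Series:
    """Return an MD5 checksum per row, indexed by the business key."""
    # Sort columns so column order differences do not affect the checksum.
    concatenated = df.sort_index(axis=1).astype(str).agg("|".join, axis=1)
    return pd.Series(
        [hashlib.md5(v.encode()).hexdigest() for v in concatenated],
        index=df[key].values,
    )

# Hypothetical extracts of the same entity from source and target.
source = pd.read_csv("source_customers.csv")
target = pd.read_csv("target_dim_customer.csv")

# Row counts should match before comparing content.
assert len(source) == len(target), "Row counts differ after ETL"

src_sums = row_checksums(source, "customer_id")
tgt_sums = row_checksums(target, "customer_id")

missing = src_sums.index.symmetric_difference(tgt_sums.index)
common = src_sums.index.intersection(tgt_sums.index)
changed = common[(src_sums.loc[common] != tgt_sums.loc[common]).values]
print(f"{len(missing)} keys present on only one side, {len(changed)} rows differ in content")
```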

PROJECT-WIDE TESTING NOT COORDINATED

Source-to-target data flow tests are inadequate or nonexistent.

A data quality audit should include validating that the information in a source system (such as a CSV file) is accurate, so that it can be trusted with a high level of confidence when it is added to the integration target (a minimal audit sketch follows).
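A minimal sketch of such an audit for a CSV source, using only the Python standard library, might look like this; the file name, required columns, and rules are assumptions for illustration.

```python
import csv

REQUIRED_COLUMNS = {"customer_id", "customer_name", "signup_date"}  # assumed schema

def audit_csv(path: str) -> list[str]:
    """Return a list of findings for a CSV source file before it is loaded."""
    findings = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            findings.append(f"Missing columns: {sorted(missing)}")
        seen_ids = set()
        # Data rows start at line 2; line 1 is the header.
        for line_no, row in enumerate(reader, start=2):
            if not row.get("customer_id"):
                findings.append(f"Line {line_no}: empty customer_id")
            elif row["customer_id"] in seen_ids:
                findings.append(f"Line {line_no}: duplicate customer_id {row['customer_id']}")
            else:
                seen_ids.add(row["customer_id"])
    return findings

print(audit_csv("customer_source_extract.csv"))
```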

STAFF TESTING SKILLS ARE INSUFFICIENT

Qualified resources with the required knowledge of data integration testing are unavailable.

  • Invest in data integration testing courses and training resources, recruit staff with data testing experience, and engage services consultants.
  • Invest in specialized data roles: data analysts, data quality analysts, data engineers, and staff skilled with data testing toolsets.

SOURCE TO TARGET DATA TRANSFORMATION CODE IS COMPLEX

Complex transformations may be difficult to test without the required test tools or tester skills.

  • Validate table join complexity, queries, and resulting business reports early.
  • Clarify business requirements; develop and test pseudo queries before programming data extractions and loads (a sketch follows this list).
  • Validate the number and accessibility of source data fields.
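One way to develop and test a pseudo query before coding the production extraction is to prototype it against hand-built fixture data in an in-memory SQLite database. The tables, columns, and expected values below are illustrative.

```python
import sqlite3

# Prototype a transformation query against small, hand-built data before coding the ETL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO orders VALUES (1, 10, 99.50), (2, 10, 15.00), (3, 20, 42.00);
    INSERT INTO customers VALUES (10, 'EMEA'), (20, 'APAC');
""")

# The "pseudo query" for a target fact: total order amount per region.
rows = conn.execute("""
    SELECT c.region, ROUND(SUM(o.amount), 2) AS total_amount
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# Expected values are computed by hand from the fixture rows above.
assert rows == [("APAC", 42.0), ("EMEA", 114.5)], rows
print("Pseudo query produced expected results:", rows)
```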

PLANNED TESTING IS PREDOMINANTLY MANUAL

Minimal test automation has been adopted for ETL runs, data profiling, unit tests, and regression tests.

  • Invest in automated unit and regression test tools for faster, reusable testing suites.
  • Consider automated testing tools for metadata verification, data format checking, row counts, uniqueness checks, data cleansing, load tests, performance tests, and smoke tests (sample automated checks follow this list).
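As a sketch of what such automated checks can look like, the pytest tests below verify a row count against a control table and the uniqueness of a surrogate key. The warehouse connection, table names, and control table are assumptions for illustration.

```python
# test_load_checks.py -- run with `pytest` after each ETL load (names are illustrative).
import sqlite3
import pytest

@pytest.fixture
def dw():
    """Connection to the target database; replace with your warehouse connection."""
    conn = sqlite3.connect("warehouse.db")
    yield conn
    conn.close()

def test_row_count_matches_source(dw):
    # The expected count would normally come from a staging area or an ETL control table.
    expected = dw.execute(
        "SELECT row_count FROM etl_control WHERE table_name = 'dim_customer'"
    ).fetchone()[0]
    actual = dw.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
    assert actual == expected

def test_customer_key_is_unique(dw):
    dupes = dw.execute("""
        SELECT customer_key, COUNT(*) FROM dim_customer
        GROUP BY customer_key HAVING COUNT(*) > 1
    """).fetchall()
    assert dupes == [], f"Duplicate surrogate keys: {dupes}"
```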

DATA VOLUME SCALABILITY CAUSES PERFORMANCE ISSUES 

Data volumes grow due to changing requirements.

  • Employ skills and toolsets for data volume estimation.
  • Load the database with the peak expected production volumes to help ensure that the amount of data can be loaded by the ETL process within the agreed-on time window (a timing sketch follows this list).
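A rough way to rehearse the load window is to generate synthetic rows at the expected peak volume and time a bulk load. The sketch below uses an in-memory SQLite database purely for illustration; the volume, schema, and window are assumptions to replace with project estimates.

```python
import sqlite3
import time
import random

PEAK_ROWS = 1_000_000           # assumed peak daily volume; adjust to your estimates
LOAD_WINDOW_SECONDS = 600       # assumed agreed-on load window

def synthetic_rows(n):
    """Generate n synthetic order rows at peak volume."""
    for i in range(n):
        yield (i, f"customer_{i}", round(random.uniform(1, 500), 2))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER, customer_name TEXT, amount REAL)")

start = time.perf_counter()
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", synthetic_rows(PEAK_ROWS))
conn.commit()
elapsed = time.perf_counter() - start

print(f"Loaded {PEAK_ROWS:,} rows in {elapsed:.1f}s")
assert elapsed < LOAD_WINDOW_SECONDS, "Peak-volume load exceeds the agreed window"
```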

 

Conclusion

Modern systems have made it easier to access data from a wide variety of sources. With that capability comes the challenge of integrating disparate data in various formats from multiple data streams, possibly with duplicate and poor data quality, into a single data repository. 

To address these challenges, data integration risks must be understood and mitigated so that corrupt, inaccurate, irrelevant, or duplicate data can be identified and resolved. That is best accomplished by applying sound QA and testing processes throughout the data integration effort.

Part three is already published on DZone! The final part discusses exactly how to develop a data integration master test plan.

Wayne Yaddow is an independent consultant with 20 years of experience leading data migration/integration/ETL testing projects at organizations such as J.P. Morgan Chase, Credit Suisse, Standard and Poor's, and IBM. He has taught courses and coached IT staff on data warehouse, ETL, and data integration testing. Wayne continues to lead ETL testing and coaching projects on a consulting basis. [wyaddow.academia.edu/research#papersandresearch]


