Part 2: Focus on Quality Risks for Data Integration Projects
Part two of this 3-part article series covers various examples of quality risks for data integration projects and the best practices for tackling them.
Introduction
In part one of this 3-part series, we covered why assessing risks early and often is key and reviewed best practices for addressing and mitigating common risks. Part two covers examples of quality risks for data integration projects and best practices for tackling them.
Identify Data Integration Risks
The first step in solving problems is to identify them. Doing so on a data integration project entails registering every issue that might disrupt the normal flow of project events. A primary outcome of this step is a list of risks.
A QA lead commonly heads a risk-based testing practice. However, this person alone will not be able to devise the entire list; the whole QA team’s input makes a significant impact on the risk identification process.
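To make the idea of a risk register concrete, here is a minimal sketch of how such a list might be captured in Python; the entries, field names, and ratings shown are hypothetical examples rather than anything prescribed by this article.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    """One entry in a project risk register (illustrative fields only)."""
    title: str        # short name of the risk
    description: str  # what could disrupt the project
    impact: str       # "high", "medium", or "low"
    likelihood: str   # "high", "medium", or "low"
    owner: str        # who tracks and mitigates the risk

# Hypothetical entries compiled by the QA team during risk identification.
risk_register = [
    Risk("Uncertain source data quality",
         "Schedule may slip because source data quality is unknown",
         impact="high", likelihood="medium", owner="QA lead"),
    Risk("No master test plan established",
         "Testing scope and objectives are undefined",
         impact="high", likelihood="low", owner="QA lead"),
]

# Review the highest-impact risks first.
for risk in sorted(risk_register, key=lambda r: r.impact != "high"):
    print(f"[{risk.impact.upper()}] {risk.title} (owner: {risk.owner})")
```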
The extraction, transformation, and loading (ETL) and testing processes remain the most underestimated and under-budgeted phases of most data integration projects. A primary reason the ETL portion of a project often raises more questions than it resolves is a lack of understanding of source data quality.
During data integration, data is transformed, integrated, restructured, cleansed, and grouped into a single target structure. Each of these movements and changes carries a risk of data corruption, which makes data testing a highly critical endeavor.
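As a small illustration of what early data testing can look like, the following sketch profiles a source extract with pandas; the file name, column names, and checks are assumptions for the example and only hint at the breadth of a real source-data assessment.

```python
import pandas as pd

# Hypothetical source extract; file and column names are examples only.
source = pd.read_csv("customers_source.csv")

profile = {
    "row_count": len(source),
    # Missing values per column often signal incomplete source feeds.
    "null_counts": source.isnull().sum().to_dict(),
    # Duplicate business keys can corrupt downstream joins and loads.
    "duplicate_keys": int(source["customer_id"].duplicated().sum()),
    # A crude format check that flags obviously malformed email values.
    "bad_emails": int((~source["email"].astype(str).str.contains("@", na=False)).sum()),
}

for check, result in profile.items():
    print(f"{check}: {result}")
```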
Examples of Data Integration Risks and Associated Best Practices
The following are typical data integration project risks and challenges, accompanied by recommended best practices. This compilation is not exhaustive, and the risks in this table represent only those commonly known to be of high impact. A minimal automation sketch for source-to-target reconciliation testing follows the table.
| Potential Project Risks | Best Practice Recommendations |
| --- | --- |
| UNCERTAIN SOURCE DATA QUALITY: The data integration effort may not meet the planned schedule because the quality of source data is unknown or inadequate. | |
| DATA DICTIONARIES AND DATA MODELS ARE FLAWED: Data within sources and targets cannot be easily interpreted by developers and QA. | Ensure accurate and current documentation of data models and mapping documents. |
| SOURCE & TARGET DATA MAPPING ISSUES: Source data may be inaccurately mapped due to the absence of data dictionaries and data models. | |
| EXCESSIVE DEFECTS IN DATA SOURCES: Data defects are found at a late stage during development. | |
| NO MASTER TEST PLAN OR TEST STRATEGY ESTABLISHED: A master test plan/strategy does not exist or is inadequate in scope. | A test strategy/master test plan (MTP) documents the overall structure and objectives of all project testing, from unit and component tests to system and performance tests. The MTP covers activities over the data integration lifecycle and identifies evaluation criteria for the testers. |
| EXCESSIVE DEFECTS IN TARGET DATA: Much of the loaded target data is in error after ETL runs. | |
| PROJECT-WIDE TESTING NOT COORDINATED: Source-to-target data flow tests are inadequate or nonexistent. | A data quality audit should validate that the information in a source system (such as a CSV file) is accurate, so there is a high level of confidence that it can be trusted when it is added to the target integration. |
| STAFF TESTING SKILLS ARE INSUFFICIENT: Qualified resources with the required knowledge of data integration testing are unavailable. | |
| SOURCE-TO-TARGET DATA TRANSFORMATION CODE IS COMPLEX: Complex transformations may not be easy to test without the required test tools or tester skills. | |
| PLANNED TESTING IS PREDOMINANTLY MANUAL: Minimal test automation has been adopted for ETL runs, data profiling, unit tests, and regression tests. | |
| DATA VOLUME SCALABILITY CAUSES PERFORMANCE ISSUES: Data volumes grow due to changing requirements. | |
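To show the kind of automation that addresses the last few risks above, in particular predominantly manual testing and source-to-target mapping issues, here is a minimal reconciliation sketch built on pandas and pytest; the file names, key column, and amount column are assumptions for the example, not prescriptions from this article.

```python
import pandas as pd
import pytest

# Hypothetical extracts of the same business entity from source and target systems.
SOURCE_CSV = "orders_source.csv"
TARGET_CSV = "orders_target.csv"


@pytest.fixture(scope="module")
def frames():
    return pd.read_csv(SOURCE_CSV), pd.read_csv(TARGET_CSV)


def test_row_counts_match(frames):
    """Every source row should arrive in the target: no drops, no spurious inserts."""
    source, target = frames
    assert len(source) == len(target)


def test_no_orphan_keys(frames):
    """Every business key loaded into the target should exist in the source."""
    source, target = frames
    orphans = set(target["order_id"]) - set(source["order_id"])
    assert not orphans, f"Target keys missing from source: {sorted(orphans)[:10]}"


def test_amount_totals_reconcile(frames):
    """Aggregate totals should survive the transformations within a tiny tolerance."""
    source, target = frames
    assert target["amount"].sum() == pytest.approx(source["amount"].sum(), rel=1e-9)
```

Checks like these can run with pytest on every ETL change, turning the reconciliation rules in the mapping documents into a repeatable regression suite.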
Conclusion
Modern systems have made it easier to access data from a wide variety of sources. With that capability comes the challenge of integrating disparate data in various formats from multiple data streams, potentially containing duplicates and poor-quality records, into a single data repository.
To address these challenges, data integration risks must be understood and mitigated so that corrupt, inaccurate, irrelevant, or duplicate data can be identified and resolved. This is best accomplished by applying sound QA and testing processes throughout the data integration effort.
Part three is already published on DZone! The final part discusses exactly how to develop a data integration master test plan.
Wayne Yaddow is an independent consultant with 20 years of experience leading data migration/integration/ETL testing projects at organizations such as J.P. Morgan Chase, Credit Suisse, Standard and Poor's, and IBM. He has taught courses and coached IT staff on data warehouse, ETL, and data integration testing. Wayne continues to lead ETL testing and coaching projects on a consulting basis. (wyaddow.academia.edu/research#papersandresearch)