Building an End-to-End Data Warehouse Testing Strategy and Test Plan
Editor's note: Wayne Yaddow is an independent consultant with over 20 years' experience leading data migration/integration/ETL testing projects at organizations including J.P. Morgan Chase, Credit Suisse, Standard and Poor's, AIG, Oppenheimer Funds, IBM, and Achieve3000. Additionally, Wayne has taught IIST (International Institute of Software Testing) courses on data warehouse, ETL, and data integration testing, and he leads numerous ETL testing and coaching projects on a consulting basis.
In my previous blog, I looked at why and how to test a data warehouse project. This time, let's focus on how to build an end-to-end data warehouse testing strategy and test plan. Ultimately, the success of a data warehouse solution is highly dependent upon your ability to plan, design, and execute a set of effective tests that expose issues with data inconsistency, data quality, data security, the ETL process, performance, business flow accuracy, and the end user experience.
Overall, the primary focus of testing should be on the end-to-end ETL process. This includes validating that all required rows are loaded, that all transformations execute correctly, and that the cleansing operation completes successfully. The team also needs to thoroughly test the SQL queries and stored procedures that produce aggregate or summary tables. Finally, with digital transformation initiatives focusing on delivering exceptional customer engagement, it is becoming increasingly critical for the test team to design and execute tests that accurately mirror the customer perspective.
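As an illustration of validating that all required rows were loaded, here is a minimal sketch that reconciles row counts and a summed measure between a staging table and a warehouse table. It uses an in-memory SQLite database as a stand-in for the real platforms; the table names (stg_orders, dw_fact_orders), the amount column, and the reconcile helper are all hypothetical and chosen only for this example.

```python
import sqlite3


def reconcile(conn, source_table, target_table, amount_column):
    """Compare row counts and a summed measure between two tables.

    Identifiers are interpolated directly into the SQL, so pass only
    trusted table and column names (this is a test harness sketch).
    """
    results = {}
    for label, table in (("source", source_table), ("target", target_table)):
        row_count, total = conn.execute(
            f"SELECT COUNT(*), COALESCE(SUM({amount_column}), 0) FROM {table}"
        ).fetchone()
        results[label] = {"rows": row_count, "total": total}
    results["rows_match"] = results["source"]["rows"] == results["target"]["rows"]
    results["totals_match"] = results["source"]["total"] == results["target"]["total"]
    return results


def build_orders_db():
    """Create hypothetical staging and warehouse tables for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE stg_orders (order_id INTEGER, amount INTEGER);
        CREATE TABLE dw_fact_orders (order_id INTEGER, amount INTEGER);
        INSERT INTO stg_orders VALUES (1, 100), (2, 250), (3, 75);
        -- Order 3 is deliberately missing to simulate a dropped row.
        INSERT INTO dw_fact_orders VALUES (1, 100), (2, 250);
        """
    )
    return conn


if __name__ == "__main__":
    report = reconcile(build_orders_db(), "stg_orders", "dw_fact_orders", "amount")
    print(report)
```

Amounts are stored as integers (for example, cents) so that the totals can be compared exactly. A real reconciliation would likely run one query per platform and compare the results in the test harness.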
Building an End-to-End Data Warehouse Testing Strategy
The following figure shows a representative data warehouse implementation, from identification of source data (lower left) to report and portal reporting (upper left). In between, it depicts several typical phases of the end-to-end data warehouse development process: source extract to staging, dimension data to the operational data store (ODS), fact data to the data warehouse, and report and portal functions extracting data for display and reporting. The graphic illustrates that all data load programs and the resulting data loads should be verified throughout the end-to-end QA process.
Figure 1: End-to-end data warehouse process and associated testing.
An end-to-end data warehouse test strategy documents a high-level understanding of the anticipated testing workflow. The strategy will be used to verify that the data warehouse system meets its design specifications and other requirements.
Successfully executing this strategy requires the following skills:
- Understanding fundamental concepts of data warehousing and its place in an information management environment.
- Understanding how the testing process fits into the data warehouse development process.
- Developing data warehouse test strategies, test plans and test cases: knowing what they are and how to develop them specifically for data warehouses and decision support systems.
- Creating effective test cases and scenarios based on business and user requirements for the data warehouse.
- Participating in reviews of the data models, data mapping documents, ETL design and ETL coding; providing feedback to designers and developers.
- Participating in the change management process and documenting relevant changes to decision support requirements.
Additionally, you will want to follow core best practices such as:
- Formal QA data track verifications should begin early in the ETL design and data load process and continue through deployment and into production.
- Testers should be given early access to the ETL development environment so they can assess the quality of early data loads and offer valuable feedback to development teams. Such early access can dramatically aid preparations for formal testing and identify issues early.
- Projects that use offshore or contract test teams may find that those teams need adequate, representative samples of data (production data, if possible) for test planning and test case design.
- Data models, database design documents (LLDs), ETL design documents and source-to-target mapping documents need to be kept in sync for all project stakeholders until transition.
- Data warehouse test automation (particularly for regression testing) and associated tools are critical for supporting agile and iterative development processes.
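One simple way to approach automated regression testing is to fingerprint a table after a known-good load and compare that fingerprint after each later run. The sketch below is only one possible approach, not a description of any particular tool. The sales and sales_summary tables, and the load_summary function standing in for an ETL step, are hypothetical.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table, key_column):
    """Return a SHA-256 digest over all rows, read in key_column order."""
    digest = hashlib.sha256()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def load_summary(conn):
    """Stand-in for an ETL step: rebuild a summary table from detail rows."""
    conn.executescript(
        """
        DROP TABLE IF EXISTS sales_summary;
        CREATE TABLE sales_summary AS
            SELECT region, SUM(amount) AS total
            FROM sales
            GROUP BY region;
        """
    )


def build_sales_db():
    """Create a hypothetical detail table for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sales (region TEXT, amount INTEGER);
        INSERT INTO sales VALUES ('east', 10), ('east', 5), ('west', 7);
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_sales_db()
    load_summary(conn)
    baseline = table_fingerprint(conn, "sales_summary", "region")

    # Re-running the load on unchanged input should reproduce the baseline.
    load_summary(conn)
    unchanged = table_fingerprint(conn, "sales_summary", "region") == baseline
    print("regression check passed:", unchanged)
```

A fingerprint only says that something changed, not what. In practice it would be paired with a row-level diff to locate the difference.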
Writing an Effective Data Warehouse Test Plan
An effective test plan is the cornerstone of the entire data warehouse testing effort. The plan will help test engineers validate and verify data requirements from end to end (source to target data warehouse). A primary purpose of a formal test program is to verify data requirements as stated in the:
- Business requirements document
- Data models for source and target schemas
- Source to target mappings
- ETL design documents
Just as requirements documents and specifications are the "what" for ETL development, the test plan can serve as the "what" for the test process. It describes how the QA staff will verify that the data warehouse meets its requirements. Properly constructed, a test plan is a contract between the QA team and all other project stakeholders.
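To make the idea of verifying source-to-target mappings concrete, here is a minimal sketch that expresses a mapping document as data and checks each target row against it. Everything in it is an assumption made for illustration: the src_customer and dim_customer tables, their columns, the two transformation rules, and the verify_mapping helper.

```python
import sqlite3

# Hypothetical mapping: target column -> (source column, transformation rule).
COUNTRY_CODES = {"United States": "US", "Germany": "DE"}
MAPPING = {
    "customer_name": ("cust_nm", lambda value: value.strip().upper()),
    "country_code": ("country", lambda value: COUNTRY_CODES.get(value)),
}


def verify_mapping(conn, source_table, target_table, key, mapping):
    """Return a list of (key, target_column, expected, actual) mismatches."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    targets = {
        row[key]: row for row in conn.execute(f"SELECT * FROM {target_table}")
    }
    mismatches = []
    for src in conn.execute(f"SELECT * FROM {source_table} ORDER BY {key}"):
        tgt = targets.get(src[key])
        if tgt is None:
            # The source row never reached the target table.
            mismatches.append((src[key], None, "row present", "row missing"))
            continue
        for target_col, (source_col, rule) in mapping.items():
            expected = rule(src[source_col])
            if tgt[target_col] != expected:
                mismatches.append((src[key], target_col, expected, tgt[target_col]))
    return mismatches


def build_customer_db():
    """Create hypothetical source and target tables for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE src_customer (id INTEGER, cust_nm TEXT, country TEXT);
        CREATE TABLE dim_customer (id INTEGER, customer_name TEXT, country_code TEXT);
        INSERT INTO src_customer VALUES (1, ' alice ', 'United States'), (2, 'Bob', 'Germany');
        -- Row 2 was loaded without the upper-casing rule applied.
        INSERT INTO dim_customer VALUES (1, 'ALICE', 'US'), (2, 'Bob', 'DE');
        """
    )
    return conn


if __name__ == "__main__":
    for mismatch in verify_mapping(
        build_customer_db(), "src_customer", "dim_customer", "id", MAPPING
    ):
        print(mismatch)
```

Keeping the mapping as data means the same check can be re-run whenever the mapping document changes, which helps when mappings have to stay in sync across the development lifecycle.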
In addition to the data requirements, the test planning effort should also consider:
- Configuration management system
- Project schedule
- Data quality verification process
- Incident and error handling system
- QA staff resources estimates and training needs
- Testing environment budget and plan
- Test tools
- Test objectives
- QA roles and responsibilities
- Test deliverables
- Test tasks
- Defect reporting requirements
- Entrance criteria that should be met before formal testing commences
- Exit criteria that should be met before formal testing is completed
Common Data Warehouse Issues to Watch For
As you design and execute the test plan, think about identifying issues such as:
- Inadequate ETL and stored procedures (use design documentation to aid in test planning).
- Field values that are null when specified as "not null."
- Field constraints and SQL that are not coded correctly for the ETL tool.
- Excessive ETL errors after entry to formal QA.
- Source data that does not meet table mapping specifications (e.g. dirty data).
- Source to target mappings that 1) were not reviewed before implementation, 2) are incorrect or 3) are not consistently maintained throughout the development lifecycle.
- Data models that are not adequately maintained during the development lifecycle.
- Duplicate field values (in either source or target data) that were defined as DISTINCT in mapping specifications.
- ETL SQL/transformation errors leading to missing rows and invalid field values.
- Constraint violations in source data (these can often be found through data profiling).
- Target data that is incorrectly stored in non-standard formats.
- Primary or foreign key values that are incorrect for important relationship linkages.
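Several of the issues above (unexpected nulls, duplicate values and broken key relationships) lend themselves to simple SQL checks. The following is a minimal sketch against an in-memory SQLite database; the dim_customer and fact_orders tables, their columns, and the three helper functions are hypothetical and would need to be adapted to the actual schema and database.

```python
import sqlite3


def count_nulls(conn, table, column):
    """Count rows where a column specified as NOT NULL is actually null."""
    return conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    ).fetchone()[0]


def duplicate_values(conn, table, column):
    """List non-null values that occur more than once in a column."""
    rows = conn.execute(
        f"SELECT {column} FROM {table} WHERE {column} IS NOT NULL "
        f"GROUP BY {column} HAVING COUNT(*) > 1 ORDER BY {column}"
    ).fetchall()
    return [row[0] for row in rows]


def orphan_keys(conn, child_table, fk_column, parent_table, pk_column):
    """List foreign key values in the child table with no matching parent row."""
    rows = conn.execute(
        f"SELECT c.{fk_column} FROM {child_table} c "
        f"LEFT JOIN {parent_table} p ON c.{fk_column} = p.{pk_column} "
        f"WHERE c.{fk_column} IS NOT NULL AND p.{pk_column} IS NULL "
        f"ORDER BY c.{fk_column}"
    ).fetchall()
    return [row[0] for row in rows]


def build_quality_db():
    """Create hypothetical dimension and fact tables containing known defects."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_customer (customer_id INTEGER, email TEXT);
        CREATE TABLE fact_orders (order_id INTEGER, customer_id INTEGER);
        INSERT INTO dim_customer VALUES (1, 'a@x.com'), (2, 'a@x.com'), (3, NULL);
        -- Order 12 points at a customer that does not exist.
        INSERT INTO fact_orders VALUES (10, 1), (11, 2), (12, 99);
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_quality_db()
    print("null emails:", count_nulls(conn, "dim_customer", "email"))
    print("duplicate emails:", duplicate_values(conn, "dim_customer", "email"))
    print(
        "orphan customer ids:",
        orphan_keys(conn, "fact_orders", "customer_id", "dim_customer", "customer_id"),
    )
```

Checks like these are cheap to run after every load, so they fit naturally into an automated regression suite.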
Read more from Wayne's Introduction to Data Warehouse Testing paper.