How to Test Your Data Warehouse?
This article is not going to explain what Big Data, a Data Warehouse, Business Intelligence, or the ETL process is. There are a lot of articles, slides, and books out there... but sadly not a single open-source, straightforward testing framework. The truth is that (most likely) you'll have to build it yourself. My approach is the same as with a custom mobile testing framework: spin testing around the developers' tools (SQL Server Management Studio, MySQL Workbench, Pentaho, Jaspersoft, BIRT, etc.). And here tSQLt is a light in the dark.
There is one key point to remember: DW testing is data-centric, while software testing is code-centric. The connections between the DW components are groups of transformations that take place on data, and these transformation processes should be tested as well to ensure data quality is preserved. Established DW testing and validation techniques break the testing and validation process into four well-defined, high-level processes:
- Integration testing
- System testing
- Data validation
- Acceptance testing
Each of these types of testing is in turn divided into the following sub-phases:
- Understanding of Requirements
- Development of Test Plan and Test Design
- Preparation of Test Case
- Execution of Test Case
Working from a prototype model, the testing activities for the four items under test can be summarized as:
- Multidimensional Schema – The various testing activities carried out for this item are: Workload Test, Hierarchy Test, Conformity Test, Usability Test, Nomenclature check, Performance Test, Early loading Test, Security Test and Maintainability Test.
- ETL Procedures – The various testing activities involved for this item are: Code Test, Integrity Test, Integration Test, Administrability Test, Performance/Stress Test, Recovery Test, Security Test and Maintainability Test.
- Physical Schema – The various testing activities carried out for this item are: Performance/Stress Test, Recovery Test and Security Test.
- Front-end – The various testing activities involved are: Balancing Test, Usability Test, Performance/Stress Test and Security Test.
The approaches that can be used here are mainly:
Manual Sampling – The testing activities covered under this approach are:
- End-to-End Testing – It checks that data is properly loaded into the systems from which the data warehouse will extract data for report generation.
- Row count Testing – To avoid any loss of data, all rows of data are counted after the ETL process to ensure that all the data is properly loaded.
- Field size Testing – It checks that each data warehouse field is at least as large as the corresponding source field being loaded; if this is not verified, loading will lead to data truncation.
- Sampling – The sample used for testing must be a good representation of whole data.
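Row count and field size checks like the above are straightforward to automate. A minimal sketch follows, using Python with an in-memory SQLite database as a stand-in for the real source and warehouse (the table names `stg_customer` and `dw_customer` are hypothetical; in practice you would point the same queries at SQL Server or MySQL):

```python
import sqlite3

def row_count_matches(conn, source_table, target_table):
    """Row count test: verify the ETL loaded every source row into the target."""
    cur = conn.cursor()
    src = cur.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt = cur.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    return src == tgt

def oversized_values(values, max_len):
    """Field size test: flag values that would be truncated by the target column."""
    return [v for v in values if v is not None and len(v) > max_len]

# Hypothetical staging/warehouse tables, in-memory for the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customer (name TEXT);
    CREATE TABLE dw_customer  (name TEXT);
    INSERT INTO stg_customer VALUES ('Alice'), ('Bob');
    INSERT INTO dw_customer  VALUES ('Alice'), ('Bob');
""")
assert row_count_matches(conn, "stg_customer", "dw_customer")

names = [r[0] for r in conn.execute("SELECT name FROM stg_customer")]
assert oversized_values(names, max_len=50) == []  # nothing would be truncated
```

The same two checks translate directly into tSQLt assertions when the tests live inside SQL Server itself.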
Reporting Tool – The testing activities covered under this approach are:
- Report Testing – The reports are checked to see that the data displayed in the reports are correct and can be used for decision-making purpose.
A major phase is Requirements Definition. Data used in the warehouse originates from multiple source systems, so defining good requirements can be extremely challenging. Successful requirements are structured closely around business rules and address both functionality and performance; they provide a solid foundation for the data architects and are one of the keys to success. The whole team should share the same tools from the project toolbox, and it's recommended that everyone work from the same set of documents and requirements (the same versions of the files). A rich set of test data must be created while avoiding a combinatorial explosion, so the automation approach is a combination of task automation and test automation. A different strategy is to have the source systems themselves create the test data, producing realistic but artificial source data. The testers should be skilled as well. The main areas to verify are:
- Data completeness. Ensure that all expected data is loaded.
- Data transformation. Ensure that all data is transformed correctly according to business rules and/or design specifications.
- Data quality. Ensure that the ETL application correctly rejects, substitutes default values, corrects or ignores and reports invalid data.
- Performance and scalability. Ensure that data loads and queries perform within expected time frames and that the technical architecture is scalable.
- Integration testing. Ensure that the ETL process functions well with other upstream and downstream processes.
- User-acceptance testing. Ensure the solution meets users’/clients’ current expectations and anticipates their future expectations.
- Regression testing. Ensure existing functionality remains intact each time a new release is completed.
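The transformation and quality checks above boil down to asserting business rules against the transform's output. A toy sketch, assuming an entirely hypothetical rule set (trim names, default a missing country, reject negative amounts):

```python
DEFAULT_COUNTRY = "US"  # hypothetical business rule: missing country defaults to US

def transform(row):
    """Toy ETL transform: trim names, substitute defaults, reject invalid rows."""
    name = row["name"].strip()
    country = row.get("country") or DEFAULT_COUNTRY
    amount = row["amount"]
    if amount < 0:
        raise ValueError(f"rejected row, negative amount: {amount}")
    return {"name": name, "country": country, "amount": round(amount, 2)}

# Data transformation: business rules applied correctly.
assert transform({"name": " Ada ", "country": None, "amount": 10.239}) == \
       {"name": "Ada", "country": "US", "amount": 10.24}

# Data quality: invalid data is rejected, not silently loaded.
rejected = False
try:
    transform({"name": "Bob", "country": "BG", "amount": -1})
except ValueError:
    rejected = True
assert rejected
```

In a real suite each business rule from the requirements document becomes one such assertion, which is exactly what makes good requirements definition so valuable to the testers.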
The test fixture strategy is to use a Fresh Fixture, ensuring that tests do not depend on anything they did not set up themselves, combined with:
- Lazy Setup: if we are happy with creating the test fixture the first time any test needs it, we can use Lazy Setup in the setUp method of the corresponding Testcase Class to create it as part of running the first test. Subsequent tests will then see that the fixture already exists and reuse it.
- Shared Fixture: we partition the fixture needed by tests into two logical parts. The first part is the stuff every test needs present but never modifies, that is, the Immutable Shared Fixture. The second part is the objects that any test needs to modify or delete; these should be built by each test as Fresh Fixtures. Most commonly, the Immutable Shared Fixture consists of reference data needed by the actual per-test fixtures, which are then built as Fresh Fixtures on top of it.
- Minimal Fixture: use the smallest and simplest fixture possible for each test.
Another option (although it's considered an anti-pattern) is Chained Tests, which let the other tests in a test suite set up the test fixture.
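A minimal sketch of Lazy Setup combined with an Immutable Shared Fixture, using Python's unittest and an in-memory SQLite database as a stand-in for the warehouse (table names `dim_country` and `fact_sales` are hypothetical):

```python
import sqlite3
import unittest

class CustomerEtlTest(unittest.TestCase):
    _shared = None  # Immutable Shared Fixture, built lazily by the first test

    def setUp(self):
        # Lazy Setup: build the shared reference data only once, on first use;
        # later tests see that it already exists and reuse it.
        if CustomerEtlTest._shared is None:
            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE dim_country (code TEXT PRIMARY KEY)")
            conn.executemany("INSERT INTO dim_country VALUES (?)",
                             [("US",), ("BG",), ("DE",)])
            CustomerEtlTest._shared = conn
        self.conn = CustomerEtlTest._shared
        # Fresh Fixture: the mutable part is rebuilt by every test for itself.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS fact_sales (country TEXT, amount REAL)")
        self.conn.execute("DELETE FROM fact_sales")

    def test_insert_references_shared_dimension(self):
        self.conn.execute("INSERT INTO fact_sales VALUES ('US', 9.99)")
        n = self.conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
        self.assertEqual(n, 1)

    def test_shared_dimension_is_never_modified(self):
        # The reference data is read-only by convention: still exactly 3 rows.
        n = self.conn.execute("SELECT COUNT(*) FROM dim_country").fetchone()[0]
        self.assertEqual(n, 3)

if __name__ == "__main__":
    unittest.main()
```

The split keeps the expensive reference data build out of every test while the per-test rows stay fresh, which is the main payoff of the hybrid strategy described above.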
Pesticide Paradox: often test automation suffers from static test data. Without variation in the data or execution path, bugs are located with decreasing frequency. Automation should be designed so that a test-case writer can provide a common equivalence class or data type as a parameter, and the automation framework will vary the data randomly within that class and apply it on each test run. (The automation framework should record the seed values or the actual data that was used, to allow reruns and retesting with the same data for debugging purposes.)
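The seed-recording idea can be sketched in a few lines of Python; the equivalence class here ("positive order amounts") and the function name are illustrative assumptions:

```python
import random

def make_order_amounts(n, seed=None):
    """Generate varied test data within an equivalence class
    (positive order amounts), recording the seed so a failing
    run can be reproduced exactly."""
    if seed is None:
        seed = random.randrange(2**32)   # fresh variation on every run
    rng = random.Random(seed)            # isolated, reproducible RNG
    data = [round(rng.uniform(0.01, 10_000.0), 2) for _ in range(n)]
    return seed, data

seed, amounts = make_order_amounts(100)
print(f"test data seed: {seed}")          # log the seed for reruns
assert all(a > 0 for a in amounts)        # every value stays inside the class
_, replay = make_order_amounts(100, seed) # same seed -> identical data
assert replay == amounts
```

Logging the seed in the test report is what turns random variation from a debugging liability into a repeatable test run.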
A TestFixtureRegistry can be implemented via a dedicated table; it exposes the various parts of a fixture (needed by suites) either through discrete fixture-holding class variables or through Finder Methods. Finder Methods help us avoid hard-coded values in DB lookups when accessing the fixture objects. These methods are very similar to Creation Methods, but they return references to existing fixture objects rather than building brand-new ones, and those references should be made immutable. NOTE: the use of Intent-Revealing Names for the fixture objects should be enforced, both to support the framework's lookup functionality and for better readability. To keep the current OO design, the implementation can check whether such a registry already exists and remove it first. NOTE: the Data Sensitivity and Context Sensitivity of the tests that will rely on this module must be taken into consideration.
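A sketch of such a registry in Python over SQLite (the table `fixture_registry`, the `customer` table, and the finder name are all hypothetical; the real implementation would live against the warehouse database):

```python
import sqlite3
from types import MappingProxyType

class TestFixtureRegistry:
    """Registry backed by a dedicated table; Finder Methods return
    references to existing fixture rows instead of hard-coded IDs."""

    def __init__(self, conn):
        self._conn = conn
        # Per the design note: if such a registry already exists, remove it.
        conn.execute("DROP TABLE IF EXISTS fixture_registry")
        conn.execute(
            "CREATE TABLE fixture_registry (name TEXT PRIMARY KEY, row_id INTEGER)")

    def register(self, name, row_id):
        # Intent-Revealing Names make lookups self-documenting.
        self._conn.execute(
            "INSERT INTO fixture_registry VALUES (?, ?)", (name, row_id))

    def find_customer(self, name):
        """Finder Method: look up by name, return an immutable view of the row."""
        row_id = self._conn.execute(
            "SELECT row_id FROM fixture_registry WHERE name = ?",
            (name,)).fetchone()[0]
        row = self._conn.execute(
            "SELECT id, email FROM customer WHERE id = ?", (row_id,)).fetchone()
        return MappingProxyType({"id": row[0], "email": row[1]})

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO customer VALUES (1, 'premium@example.com')")
registry = TestFixtureRegistry(conn)
registry.register("premium_customer_with_email", 1)

fixture = registry.find_customer("premium_customer_with_email")
assert fixture["email"] == "premium@example.com"
```

`MappingProxyType` gives the read-only view the text calls for: a test that tries to mutate the returned fixture object gets a `TypeError` instead of silently corrupting shared state.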
The Test Approach
The suggested test phases are based on the development schedule per project, along with the need to comply with the data requirements that need to be in place when the new DWH goes live.
Data Warehouse (internal testing within ETL validating data stage jobs)
Data validation should start early in the test process and be completed before phase 2 testing begins. Some data validation testing should occur in the remaining test phases, but to a much lesser extent.
Important business processes where performance matters should be identified and tested (when available) in phase 1. Performance testing should continue in the later test phases, as the application will be continuously enhanced throughout the project. In addition to phase 1 testing, there will also be unit and functional testing. As unit testing is completed for a program, the tester will perform functional tests on it. While functional testing takes place on one program, the developer continues with redeveloping and unit testing the next program.
Toward the end of phase 1, the data warehouse group will be testing the data stage jobs. Redevelopment and unit testing should be completed first, with functional testing finishing a couple of weeks afterwards. A final formal test will cap the end of phase 1 testing.
Data Warehouse (Repository testing and validation)
In addition to the above tests, phase 2 should also cover any remaining test items that may not have been tested in phase 1.
Phase 2 testing will be important because it is the final testing opportunity that the functional area testers will have to make sure the DW load works as expected before moving to regression testing in phase 3. Some performance tests and data validation should be included in this phase.
Phase 3 testing comprises regression test periods to test updates that are required as part of the company's gaming platform. The functional area testers should have sufficient time to test in each regression test period.
Business and client Acceptance
Phase 4 testing is limited. In addition to the functional area testers, end users will probably be involved in this final test before the new system goes live. No new tests should be introduced during customer acceptance testing; the customer acceptance tests should already have been run in prior test phases.
Opinions expressed by DZone contributors are their own.