Ensuring Data Integrity In DW/BI Systems
This article focuses on DW/BI data integrity testing, which should help you gain and maintain data accuracy and consistency throughout the project lifecycle.
- Ensuring Data Integrity in DW/BI Systems
- Planning for Three Categories of DW/BI Data Integrity Testing
- Classifications of Data Integrity for DW/BI Systems
- Verifying Data Integrity in Schemas, ETL Processes, and BI Reports
- Key Takeaways
Planning for Three Categories of DW/BI Data Integrity Testing
Data warehousing and business intelligence users need reliable data.
In the Gartner Group’s Online IT Glossary, data integrity and data integrity testing are defined as follows:
Data Integrity: 1) The quality of the data residing in data repositories and database objects. 2) A measure that users consider when analyzing the reliability of data.
Data Integrity Testing: Testing to ensure that DW/BI data is accurately moved, copied, derived, or converted.
Data integrity processes should not only help you understand a project’s data integrity but also help you gain and maintain data accuracy and consistency throughout its lifecycle. This includes data management best practices that prevent data from being erroneously altered each time it is copied or moved. Processes should be established to maintain DW/BI data integrity at all times.
Data, in its production state, is the driving force behind industry decision making. Errors with data integrity commonly arise from human error, noncompliant operating procedures, errors in data transfers, software defects, compromised hardware, and physical compromise to devices.
This article focuses on DW/BI “data integrity testing” – that is, testing processes that verify:
- Data warehouse source and target schemas
- ETL processing of source and target data
- Business intelligence front-end report applications
We describe how data integrity testing is addressed in each of the above categories and provide a checklist in Figure 3 to aid in test planning and test case development.
Classifications of Data Integrity for DW/BI Systems
Data Integrity is an umbrella term that refers to the consistency, accuracy, and correctness of data stored in a database. There are three primary types of data integrity: entity, domain, and referential.
- Entity Integrity ensures that each row in a table (for example) is uniquely identified and without unintended duplication. Entity integrity can be enforced by placing primary key and unique constraints on specific columns. Testing may be achieved by including duplicate or null values in test data.
- Domain Integrity requires that each set of data values/columns falls within a specific permissible defined range. Examples of domain integrity include correct data type, data format, and data length. Testing may be accomplished, in part, using null, default, and invalid values.
- Referential Integrity is concerned with keeping the relationships between tables synchronized. For example, referential integrity can be enforced with Primary Key (PK) and Foreign Key (FK) relationships.
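The three integrity types above can be exercised directly against a database. The following is a minimal sketch using an in-memory SQLite database; the `customer` and `orders` tables, their columns, and the `status` value set are illustrative, not taken from the article:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,                         -- entity integrity
        email       TEXT NOT NULL,
        status      TEXT CHECK (status IN ('active', 'closed'))  -- domain integrity
    )""")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
                    REFERENCES customer (customer_id)            -- referential integrity
    )""")

conn.execute("INSERT INTO customer VALUES (1, 'a@example.com', 'active')")

def violates(sql: str) -> bool:
    """Return True if the statement is rejected by an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# Entity: a duplicate primary key is rejected.
print(violates("INSERT INTO customer VALUES (1, 'b@example.com', 'active')"))  # True
# Domain: a value outside the permitted set is rejected.
print(violates("INSERT INTO customer VALUES (2, 'c@example.com', 'pending')"))  # True
# Referential: an order pointing at a nonexistent customer is rejected.
print(violates("INSERT INTO orders VALUES (10, 999)"))                         # True
```

Deliberately inserting duplicate, out-of-range, and orphaned rows, as above, is exactly the kind of negative test data the three bullet points recommend.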
Verifying Data Integrity in Schemas, ETL Processes, and BI Reports
The framework in Figure 1 illustrates the primary DW/BI components that are generally tested in each phase of end-to-end DW/BI testing. Data integrity testing often requires a major portion of project time and resources.
Figure 1: A general framework for DW/BI end-to-end data verifications
1) Verifying data source/target requirements and schema designs
Requirements and schema-level tests confirm to what extent the design of each data component matches the targeted business requirements.
This process should include the ability to verify:
- Business and technical requirements for all source and target data
- Data integrity specifications technically implemented (DBMS, file systems, text files, etc.)
- Data models for each implemented data schema
- Source to target data mappings vs. data actually loaded into DW targets. Examples of sources and associated targets include source data that are loaded to staging targets as well as staging data that are loaded to data warehouse or data mart targets
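A common way to verify source-to-target mappings is to reconcile row counts and content fingerprints between each source and its target after a load. Below is a minimal sketch assuming a staging table (`stg_sales`) loaded from a source table (`src_sales`); the table and column names are hypothetical:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_sales (sale_id INTEGER, amount REAL);
    CREATE TABLE stg_sales (sale_id INTEGER, amount REAL);
    INSERT INTO src_sales VALUES (1, 10.0), (2, 25.5), (3, 7.25);
    INSERT INTO stg_sales SELECT * FROM src_sales;  -- the load step under test
""")

def row_count(table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

def checksum(table: str) -> str:
    """Order-independent fingerprint of a table's full contents."""
    rows = sorted(map(repr, conn.execute(f"SELECT * FROM {table}")))
    return hashlib.sha256("\n".join(rows).encode()).hexdigest()

assert row_count("src_sales") == row_count("stg_sales"), "row counts differ"
assert checksum("src_sales") == checksum("stg_sales"), "contents differ"
print("source-to-target reconciliation passed")
```

The same pattern extends to staging-to-warehouse and warehouse-to-mart loads; when a mapping includes transformations, the source-side query must apply the same transformation logic before comparing.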
Schema quality represents the ability of a schema to thoroughly and efficiently define information/data. A schema in this definition refers to the schema of a data warehouse, regardless of whether it is a conceptual, logical, or physical schema, and whether it is a star, constellation, or normalized schema. The definition is extended here to include the schemas of all data storage used in the entire data warehouse system, including data sources, data staging, the operational data store, and the data marts. It is advantageous to assess schema quality in the design phase of the data warehouse.
Detecting, analyzing, and correcting schema deficiencies will boost the quality of the DW/BI system. Schema quality can be validated from various dimensions: schema correctness, schema completeness, schema conformity, schema integrity, interpretability, tractability, understandability, and concise representation.
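One mechanical piece of this validation, checking that a deployed physical schema matches the data model, can be automated. A minimal sketch follows, using SQLite's `PRAGMA table_info` to compare a deployed table against an expected definition; the `dim_customer` table and its columns are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        full_name    TEXT NOT NULL,
        signup_date  TEXT
    )""")

# Expected column names and types, as stated in the (hypothetical) data model.
expected = {
    "customer_key": "INTEGER",
    "full_name": "TEXT",
    "signup_date": "TEXT",
}

actual = {
    row[1]: row[2]  # table_info rows: (cid, name, type, notnull, dflt_value, pk)
    for row in conn.execute("PRAGMA table_info(dim_customer)")
}

missing = expected.keys() - actual.keys()
extra = actual.keys() - expected.keys()
mismatched = {c for c in expected.keys() & actual.keys() if expected[c] != actual[c]}

assert not (missing or extra or mismatched), (missing, extra, mismatched)
print("deployed schema matches the data model")
```

On other DBMSs, the same check would query `information_schema.columns` instead; nullability, defaults, and key constraints can be compared the same way.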
2) Testing source and target data integrity
A properly designed ETL system:
- Extracts data from source systems
- Enforces data quality and consistency standards
- Conforms data so that data sources can be used collectively
- Delivers data in a format that enables application developers to build applications
DW integrity testing and evaluations focus on ETL processes. Various functional and non-functional testing methods are applied to the ETL process logic. The goal is to verify that valid and invalid conditions are correctly processed for all source and target data: primary and foreign key integrity (i.e., referential integrity), the correctness of data transformations, data cleansing, and the application of business rules.
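Testing valid and invalid conditions typically means feeding the transformation both good and bad records and verifying each lands where it should. The sketch below illustrates the idea with a hypothetical cleansing rule (reject rows with a missing or non-positive amount); the rule, field names, and record shapes are illustrative, not from the article:

```python
def transform(record: dict):
    """Cleanse one source record; return the target row, or None to reject it."""
    amount = record.get("amount")
    if amount is None or amount <= 0:
        return None  # would be routed to a reject/error table in a real pipeline
    return {"sale_id": record["sale_id"], "amount_usd": round(amount, 2)}

source = [
    {"sale_id": 1, "amount": 19.999},  # valid: rounded on the way in
    {"sale_id": 2, "amount": -5.0},    # invalid: negative amount
    {"sale_id": 3, "amount": None},    # invalid: missing amount
]

loaded = [t for r in source if (t := transform(r)) is not None]
rejected = [r["sale_id"] for r in source if transform(r) is None]

# Valid rows are transformed correctly; invalid rows are rejected, not loaded.
assert loaded == [{"sale_id": 1, "amount_usd": 20.0}]
assert rejected == [2, 3]
print("valid and invalid conditions processed correctly")
```

Keeping the transformation a pure function, as here, makes it easy to drive with tables of valid/invalid test cases independently of the surrounding extract and load steps.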
3) Testing BI report data integrity
BI applications provide an interface that helps users interact with the back-end data. The design of these reports is critical for understanding and planning the data integrity tests.
Complex business rules are often applied to the original source data before it is loaded to the DW. The DW information is then delivered in the form of BI dashboards/reports and is available for business executives to make crucial decisions with confidence. It is therefore essential to assure that DW data (that is further processed in BI applications) is accurate. Those conducting BI application tests must be prepared to answer the business teams’ questions: “Do these reports reflect the DW numbers correctly? If so, or if not, how can we make sure we are looking at the right data to make the right decisions?”
The traditional approach to BI report testing has several flaws. Identifying and fixing issues discovered in BI reports often takes considerable time. The numbers on BI dashboards are typically aggregated to a high level from the lower granularity of the DW, and the raw data is transformed at each iteration on its journey to the dashboards. Choosing any value from a BI dashboard and reconciling it against DW data tables is a complicated and tedious process.
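That reconciliation can nevertheless be automated for the values that matter most: recompute each dashboard aggregate directly from the warehouse detail tables and compare. A minimal sketch, where the `fact_sales` table, the region values, and the dashboard figure are all hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (region TEXT, amount REAL);
    INSERT INTO fact_sales VALUES
        ('EMEA', 100.0), ('EMEA', 250.0), ('APAC', 75.0);
""")

# Value shown on the (hypothetical) dashboard tile for EMEA revenue.
dashboard_emea_revenue = 350.0

# Recompute the same aggregate from the warehouse detail rows.
recomputed = conn.execute(
    "SELECT SUM(amount) FROM fact_sales WHERE region = 'EMEA'"
).fetchone()[0]

assert abs(recomputed - dashboard_emea_revenue) < 1e-9, (
    f"report shows {dashboard_emea_revenue}, warehouse says {recomputed}")
print("dashboard value reconciles with the DW")
```

In practice, the dashboard-side value would be pulled from the BI tool's API or export rather than hard-coded, and a small tolerance (as above) absorbs floating-point rounding in the aggregation.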
A DW/BI Data Integrity Testing Framework and a Checklist
DW/BI data integrity verification is categorized here as follows. Figure 2 shows a verification classification framework for the techniques applicable to sources and targets in DW, ETL process, and BI report applications.
Figure 2: Framework for DW/BI Data Integrity Verifications
The “what,” “when,” and “where” of the DW/BI data integrity testing framework are represented in Figure 3. Information in Figure 3 can be used as a checklist when planning DW and BI tests.
- Column headings in Figure 3 represent the when and where of data integrity testing
- Rows represent what data-related item should be tested
Figure 3: A Sampling of verifications in the three categories of data integrity testing: schemas, ETL processes, and BI reports
The column headings in Figure 3 correspond to the three categories of data integrity testing:
- Verifications of data source requirements and schema designs
- ETL source and target data integrity tests
- BI report data integrity tests
Key Takeaways
- Data in its final state is the driving force behind organizational decision making.
- Raw data is often changed and processed to reach a usable format for BI reports. Data integrity practices ensure that this DW/BI information is attributable and accurate.
- Data can quickly become compromised if proper verification measures are not taken while moving it from each environment to become available to DW/BI projects. Errors with data integrity commonly arise from human error, noncompliant operating procedures, errors in data transfers, software defects, and compromised hardware.
- By applying the strategies introduced here, you should improve quality and reduce time and costs when developing a DW/BI project.
Opinions expressed by DZone contributors are their own.