Why and How to Test Data Warehouse Projects

Testing data warehouse implementations has become an imperative for instilling trust in the data that the business relies upon.

Now that every company strives to be data-driven, data quality is more critical than ever. According to Gartner, poor data quality costs the average organization $8.2 million annually. Moreover, a recent Experian study found that 83% of businesses see data as an integral part of forming a business strategy, yet they suspect that 30% of their contact and prospect data may be inaccurate. With improving the customer experience called out as a top priority for 2018, the same research reports that 69% of businesses believe inaccurate data is undermining their ability to deliver it.

Testing data warehouse implementations has therefore become an imperative for instilling trust in the data the business relies upon. Complex business rules and transformation logic are implemented in ETL code and thus demand diligent, thorough testing.
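
For example, each transformation rule can be unit-tested in isolation long before it runs against full production volumes. The following is a minimal sketch in Python; the function, field names, and business rules shown here are hypothetical stand-ins, not taken from any particular project:

    # Hypothetical transformation rule: standardize a raw customer
    # record before it is loaded into the warehouse.
    def transform_customer(record: dict) -> dict:
        return {
            "customer_id": int(record["customer_id"]),
            # Assumed business rule: names are trimmed and upper-cased.
            "name": record["name"].strip().upper(),
            # Assumed business rule: a missing country defaults to "US".
            "country": (record.get("country") or "US").strip().upper(),
        }

    def test_transform_customer_applies_business_rules():
        raw = {"customer_id": "42", "name": " ada lovelace ", "country": None}
        out = transform_customer(raw)
        assert out == {"customer_id": 42, "name": "ADA LOVELACE", "country": "US"}

Run under a test runner such as pytest, checks like this pin down each business rule before the heavier end-to-end data comparisons begin.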

Why Test Data Warehouse Projects?

Here are just a few of the many reasons to thoroughly test the data warehouse and apply a QA process that is specific to data and ETL testing:

  • There is often a massive amount of source data from varied types of data repositories.
  • The quality of source data cannot be assumed; the data should be profiled and cleaned (see the profiling sketch after this list).
  • Source data may be inconsistent and/or redundant.
  • Many source data records may be rejected; ETL/stored procedure logs will contain messages that must be acted upon.
  • Required source field values may be missing.
  • Source data history, business rules, and audits of source data may not be available.
  • Enterprise-wide data knowledge and business rules may not be available to verify data.
  • Since data must often pass through multiple ETL phases before it is loaded into the data warehouse, the extraction, transformation, and loading components must each be tested thoroughly to ensure the variety of data behaves as expected in every phase.
  • Heterogeneous data sources (e.g., mainframe, spreadsheets, Unix files) will be updated asynchronously over time, then loaded incrementally.
  • Transaction-level traceability will be difficult to attain in a data warehouse.
  • The data warehouse will be a strategic enterprise resource and heavily relied upon.
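
Because the quality of source data cannot be assumed, profiling is typically the first hands-on testing activity. Below is a minimal sketch of what that can look like in Python with pandas; the file name and column names are placeholders for the real staging extract:

    import pandas as pd

    # Placeholder extract; substitute the real staging file or query.
    df = pd.read_csv("customer_extract.csv")

    # Null rate per column: required fields with a non-zero rate are defects.
    print(df.isna().mean().sort_values(ascending=False))

    # Exact duplicate records, a common symptom of redundant source feeds.
    print("duplicate rows:", df.duplicated().sum())

    # Cardinality check: a candidate key should be unique across all rows.
    print("distinct ids:", df["customer_id"].nunique(), "of", len(df))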

What Does Data Warehouse Verification Involve?

A good understanding of data modeling and source-to-target data mappings equips the tester to develop an appropriate test strategy. It is therefore important that, during the project's requirements analysis phase, the QA team works to understand the data warehouse implementation as thoroughly as possible. Because of the nature of data warehouse implementations, the overall testing strategy will in most cases be a combination of several smaller strategies.

Different stages of the data warehouse implementation (source data profiling, data warehouse design, ETL development, data loading and transformations, etc.) require the testing team's participation and support. Unlike traditional testing, test execution does not begin only at the end of the implementation; it has multiple phases of its own and is staggered throughout the lifecycle of the data warehouse implementation.

Verification across data warehouse project testing phases can include:

  • Data completeness: Ensure that all expected data is loaded by each ETL procedure (a minimal completeness check is sketched after this list).
  • Data transformations: Ensure that all data is transformed correctly according to business rules and design specifications.
  • Data quality: Ensure that the ETL process correctly rejects, substitutes default values for, corrects, ignores, and reports invalid data.
  • Performance and scalability: Ensure that data loads and queries perform within expected time frames and that the technical architecture is scalable.
  • Integration testing: Ensure that the ETL process functions well with other upstream and downstream processes.
  • User-acceptance testing: Ensure the data warehousing solution meets users' current expectations and anticipates their future expectations.
  • Regression testing: Ensure existing functionality remains intact each time a new release of ETL code and data is completed.
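
As a concrete illustration, the data-completeness check above often reduces to reconciling row counts and simple aggregates between source and target, net of legitimately rejected records. The sketch below uses Python's built-in sqlite3 module purely for illustration; in practice the two connections would point at the source system and the warehouse, and every table and column name here is a placeholder:

    import sqlite3

    # Placeholder connections; point these at the real source and target.
    source = sqlite3.connect("source.db")
    target = sqlite3.connect("warehouse.db")

    def scalar(conn, sql):
        """Run a query that returns a single value and return it."""
        return conn.execute(sql).fetchone()[0]

    # Completeness: every source row should reach the target, minus rows
    # legitimately rejected by data-quality rules.
    src_rows = scalar(source, "SELECT COUNT(*) FROM orders")
    tgt_rows = scalar(target, "SELECT COUNT(*) FROM fact_orders")
    rejected = scalar(target, "SELECT COUNT(*) FROM etl_reject_log")
    assert src_rows == tgt_rows + rejected, (src_rows, tgt_rows, rejected)

    # A cheap aggregate "checksum" catches silent truncation or
    # transformation errors that row counts alone would miss.
    src_sum = scalar(source, "SELECT SUM(order_amount) FROM orders")
    tgt_sum = scalar(target, "SELECT SUM(order_amount) FROM fact_orders")
    assert src_sum == tgt_sum, (src_sum, tgt_sum)

The same pattern extends naturally to regression testing: run the comparison suite after every new ETL release and alert on any drift.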
