Quality Testing Requires Quality Data: A Checklist for Generating Better Test Data
Quality testing requires a robust testing strategy that makes use of both production and synthetic testing.
My two most recent blogs have made the case for a new TDM paradigm called “Test Data Automation,” considering how a logistical approach to TDM (subset, mask, clone) undermines testing speed and quality. Last week I considered the value of introducing synthetic test data generation from the perspective of compliance.
This week’s blog considers synthetic test data from the perspective of testing rigor. Next week will conclude the discussion by considering further how “Test Data Automation” also drives up testing agility.
Copying Production Data Alone Cannot Drive Rigorous Testing
In 2019, the fundamental approach to test data creation remains largely the same as it has been for decades. TDM “best practice” for most vendors and organizations amounts to moving copies of production data to test environments as quickly as possible. In fact, 65% of organizations use production data in testing, while just 36% mask it and 30% subset it before provisioning it to test environments.
Using production data in QA is not itself a problem – the test data will often reflect complex system logic accurately, and effective masking can alleviate compliance concerns. The challenge for testing rigor arises when QA teams depend on production data alone. This remains the case for many organizations, and only 18% synthesize data using automated techniques.
The challenge to test quality is seen clearly by considering what production data copies do not contain. In terms of achieving full functional test coverage, test data must include:
- Data for testing new or updated functionality: This is by definition missing from production data, which is drawn from user behavior exercised against a previously released system. Production data therefore lacks the data needed to test new functionality, and will be out of date for system logic that has been updated in the latest release. Imagine that a new field has been added to a web registration form: how can production data already contain values for this field, let alone the full spread of valid and invalid values required to rigorously test the form?
- Outliers and edge cases: Production data is typically narrow and highly repetitious, focusing on expected user behavior. Users almost always behave in the way the system intends, and so do not produce the unusual data combinations needed to test outliers and edge cases.
- Negative scenarios: For the same reason, production data copies tend to be “happy path” in focus, lacking the data needed to test negative scenarios. Yet it is these negative scenarios that cause the most severe defects. It is the role of QA to test unexpected behavior in advance of a release, for instance by testing error handling and data validation.
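To make the coverage gap concrete, the sketch below synthesizes valid, boundary, and negative values for a hypothetical “nickname” field (2–30 characters) newly added to a registration form. The field name and length limits are illustrative assumptions; the point is that none of these values can exist in a production copy taken before the field shipped.

```python
import random
import string


def rand_str(n):
    """Return a random alphabetic string of length n."""
    return "".join(random.choices(string.ascii_letters, k=n))


def generate_nickname_values(min_len=2, max_len=30):
    """Synthesize valid, boundary, and negative inputs for a new form field."""
    valid = [
        rand_str(min_len),                    # shortest valid value
        rand_str(max_len),                    # longest valid value
        rand_str((min_len + max_len) // 2),   # typical mid-range value
    ]
    negative = [
        "",                               # empty input
        rand_str(min_len - 1),            # one character below the minimum
        rand_str(max_len + 1),            # one character above the maximum
        "Robert'); DROP TABLE users;--",  # injection-style input
        "     ",                          # whitespace only
    ]
    return valid, negative


valid, negative = generate_nickname_values()
```

Even this trivial generator covers boundary and malicious inputs that user behavior in production will almost never produce.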
This low coverage allows defects to slip through QA, where they can impact user experience and a business’s bottom line. Late detection also drives up the time and expense of fixing bugs: research has long established that the effort of fixing defects increases the later they are detected in the software delivery lifecycle. To truly assure quality, a new approach to test data provisioning is required.
A Hybrid Approach to Creating Quality Test Data
My blog last week set out how synthetic test data generation can help alleviate compliance concerns in testing. The same approach is capable of generating data combinations not found in production, enabling rigorous testing of fast-changing systems.
Last week’s blog also observed that organizations cannot swap all their copied data for synthetic data at once. Instead, a “hybrid approach” should be taken, strategically augmenting existing test data with rich synthetic test data. Some requirements for adopting this approach are set out below.
(Some) Requirements for Synthetic Test Data Generation
Generating complex data to test complex systems requires a flexible approach. The good news is that many of the skills and techniques required build naturally on effective masking and subsetting. All three tasks require a solid understanding of the relationships that exist in production data and must reflect complex data models accurately. Synthetic test data generation is therefore not an impossible leap for organizations that currently mask and subset.
Some TDM tools additionally provide automated data modeling, further simplifying and accelerating the process of synthetic test data generation. This automated modeling identifies the trends in data that must be retained for testing, establishing the relationships within relational databases, files, and mainframe data sources. It builds the rules needed to retain referential integrity, including, for example, complex numerical and temporal patterns within data.
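The core of retaining referential integrity can be sketched in a few lines: generated child rows may only reference primary keys that actually exist in the parent table. This is a minimal illustration, not the output of any specific TDM tool; the table and column names are invented for the example.

```python
import random


def generate_customers(n):
    """Generate a parent table of customers with sequential primary keys."""
    return [{"customer_id": i, "name": f"Customer {i}"} for i in range(1, n + 1)]


def generate_orders(customers, n):
    """Generate child rows whose foreign keys are drawn only from
    existing parent keys, preserving referential integrity."""
    valid_ids = [c["customer_id"] for c in customers]
    return [
        {"order_id": i, "customer_id": random.choice(valid_ids)}
        for i in range(1, n + 1)
    ]


customers = generate_customers(5)
orders = generate_orders(customers, 20)
```

Automated modeling tools extend this same idea across whole schemas, discovering the parent–child relationships rather than having them hand-coded.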
QA teams and DBAs then only need to provide definitions for the additional data variables required in testing, generating a coverage-optimized set of data that accurately matches production data models.
Different techniques can be used in this “fill-in-the-blanks” approach to defining the data combinations needed for rigorous QA. Options range from custom scripting to built-in data generation functions; testing complex applications demands this flexibility, and some generation tools provide a combination of methods for defining synthetic variables.
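A hedged sketch of what mixing built-in generators with custom scripting might look like. The generator names and the account-number format are illustrative assumptions, not features of any particular tool.

```python
import datetime
import random

# "Out-of-the-box" style generators a tool might provide.
BUILTIN = {
    "hex-id": lambda: f"{random.getrandbits(32):08x}",
    "past-date": lambda: (
        datetime.date.today() - datetime.timedelta(days=random.randint(1, 365))
    ).isoformat(),
}


def custom_account_number():
    """Custom scripted generator: 8 digits plus a simple mod-10 check digit."""
    digits = [random.randint(0, 9) for _ in range(8)]
    check = sum(digits) % 10
    return "".join(map(str, digits)) + str(check)


# A synthetic row combining built-in and scripted variable definitions.
row = {
    "id": BUILTIN["hex-id"](),
    "opened": BUILTIN["past-date"](),
    "account": custom_account_number(),
}
```

The built-in functions cover common patterns quickly, while scripting handles domain-specific formats like the check-digit rule above.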
Generating data to test complex systems also requires a range of different techniques for creating the resultant data. These techniques include:
- Direct to Databases: This generates data directly into relational databases.
- Via the front-end: There is not always direct access to the back-end database, and sometimes it is simpler to push data through the front-end. Data entered this way will conform to the data model of the back-end databases. When performed as a step during automated testing, it also prepares accurate data on the fly.
- Via the middle layer: Alternatively, you might want to leverage the API layer, inputting data via SOAP and REST calls.
- Using files: Testing certain systems will require data to be sent via files, including XML, EDI, and more.
- Data for the mainframe: Mainframe data can be particularly difficult to create, and organizations still often laboriously input data via green screens. Tools like Ostia’s Portus create a transparency layer over a wide range of mainframe data types, while data can also be sent in files via FTP or via REST calls. Sometimes, it might be preferable to send data via generated scripts and files (XML, EDI, JCL, IDCAMS, etc.).
- Virtual data generation: This generates synthetic Request-Response pairs for service and message virtualization. Generating accurate virtual data is useful for creating complete test environments when system components are missing or unavailable.
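The first of these techniques, direct-to-database generation, can be illustrated with a minimal sketch using SQLite as a stand-in for any relational back-end. The table, column, and value ranges are assumptions made for the example.

```python
import random
import sqlite3

# Generate data directly into a relational database (here, in-memory SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER NOT NULL)")

# A coverage-oriented spread of ages: explicit boundary values
# plus random in-range fillers, rather than production-like repetition.
ages = [0, 17, 18, 65, 120] + [random.randint(18, 65) for _ in range(20)]
conn.executemany("INSERT INTO users (age) VALUES (?)", [(a,) for a in ages])
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

The same pattern scales to any database with a driver; the essential point is that boundary values are inserted deliberately rather than hoped for in a production copy.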
Finally, synthetic test data generation for complex systems must be able to reflect complex business logic and rules. The resultant data must reflect any instances where a prior event in the system influences subsequent events. For instance, if a value entered by a user on one screen is aggregated on a subsequent screen, this business rule must be reflected in the synthetic test data. Data generation must therefore include “event hooks” to define this chronology, and the rules should be easy to define.
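The screen-aggregation example above can be sketched as a simple event hook: later values are computed from earlier events rather than generated independently, so the synthetic data respects the system’s chronology. The event structure below is an illustrative assumption.

```python
import random


def generate_order_events(n_items=3):
    """Generate a chronological sequence of screen events where a later
    event (the confirmation total) is derived from earlier ones."""
    amounts = [round(random.uniform(1, 100), 2) for _ in range(n_items)]
    events = [{"screen": "line_item", "amount": a} for a in amounts]
    # Event hook: the confirmation screen's total is aggregated from the
    # line-item events that preceded it, enforcing the business rule.
    events.append({"screen": "confirmation", "total": round(sum(amounts), 2)})
    return events


events = generate_order_events()
```

Generating the total independently would produce rows that are individually valid but violate the cross-screen business rule, which is exactly the class of defect this data is meant to catch.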
Driving Rigorous Testing for Complex Systems
Creating combinations of data not found in production is essential for rigorously testing complex and fast-changing systems. Creating this complex data requires a flexible approach. For organizations today, this might include one or all of the following characteristics:
- Data generation should build on the skills already required for masking and subsetting, as many organizations already perform these processes;
- Automated data modeling removes much of the complexity of defining generation rules for complex test data;
- A combination of techniques should be available for defining the data combinations needed by QA, including scripting alongside both out-of-the-box and custom expressions;
- The methods for creating data from these rules and definitions must also be flexible, for instance generating data directly to databases, via the front-end, via the middle layer, or in files;
- Data generation must also reflect business rules accurately, for instance using easy-to-define “event hooks.”
Synthetic test data generation can produce the negative scenarios and outliers needed to maximize test coverage. Generating these missing data combinations can furthermore improve QA agility, the subject of next week’s blog.
Published at DZone with permission of Thomas Pryce. See the original article here.