The Cost of Bad Data
Learn about the hidden sinkholes of bad data that can absorb a feature team's productivity and success rate.
When developing new functionality or expanding existing features, members of the feature team can find themselves battling a hidden enemy — bad test data. In this article, I am going to talk about the challenges of not having adequate test data for feature validation and offer some options to avoid falling into this hidden sinkhole that can absorb a team's success rate.
How Bad Is Bad Data?
One of the first topics of discussion when starting a new project is understanding the non-production data sources that will be used to validate the feature team's work. Nothing scares me more than hearing something like, "We have a test database that was created from production a few years ago."
When I ask if the test database can be recreated, I typically receive an unfavorable response with a comment that basically states that refreshing the database "would have a significant impact on the testers using the system." Or that doing so will "just take too much time."
In my view, the feature team targeting this data source is already working at a disadvantage because they likely have unknown updates to the existing data, which could generate false negatives (and even false positives) when the testing begins.
How Can the Data Get Bad?
Consider a very simple example where there is a CUSTOMER table in the database. As one might expect, this simple example becomes less simple when advanced needs find their way (or try to find their way) into the CUSTOMER table.
Perhaps a feature team was expanding the CUSTOMER table with some aspects that were tested, only to later be removed in order to focus on higher priorities. In my experience, when this happens, the data in the database is often left as-is, with attributes or values in place that do not match the current state of the application.
Now, when a new feature team begins work on that same CUSTOMER table, the orphaned data can lead to results that will not match the same use cases when executed against the production data source. The new feature developer is often unaware of this extra data, especially if something like a stored procedure is making changes that are not readily apparent.
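To make this concrete, here is a minimal sketch of the scenario, using an in-memory SQLite database. The schema, the `loyalty_tier` column, and the row values are illustrative assumptions, not taken from a real system: the point is simply that a leftover column and its orphaned rows skew any test that asserts against totals or row shapes.

```python
import sqlite3

# Hypothetical test database: CUSTOMER was extended with a
# "loyalty_tier" column for a feature that was later abandoned.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE CUSTOMER (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    status TEXT NOT NULL,
    loyalty_tier TEXT  -- orphaned column; does not exist in production
)""")

# A row left behind by the abandoned feature's tests.
conn.execute("INSERT INTO CUSTOMER VALUES (1, 'Acme', 'inactive', 'GOLD')")
# A normal row that matches what production would actually contain.
conn.execute("INSERT INTO CUSTOMER VALUES (2, 'Widgets Inc', 'active', NULL)")

# A new feature's validation query: count all customers. The orphaned
# row silently inflates the result, so an assertion tuned to this test
# database will not hold against the production data source.
(total,) = conn.execute("SELECT COUNT(*) FROM CUSTOMER").fetchone()
print(total)  # 2 — one of which is orphaned test residue
```

The developer running this query has no obvious signal that row 1 is residue from an abandoned feature; it looks like any other customer until a test fails for reasons that production data would never produce.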
How to Avoid Bad Data
The easiest way to avoid this kind of bad data is to remember to keep your data clean. Of course, there are several factors that can cause this task to be left uncompleted (contract ends unexpectedly, individual changes jobs, etc.).
One of the best ways to avoid this scenario is to follow a pattern similar to what Salesforce does, which is to provide the ability to refresh a non-production data source from the production instance. Of course, this functionality will need to handle confidential data — obfuscating or replacing real data with fake data that is impossible to cross-reference.
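A minimal sketch of the obfuscation step might look like the following. The field names and masking rules are illustrative assumptions, not a specific vendor's API; the idea is to derive a stable fake value from the record's id so foreign keys still line up, while the real name and email cannot be recovered.

```python
import hashlib

def obfuscate_customer(row: dict) -> dict:
    """Return a copy of a customer row with confidential fields replaced
    by fake, non-reversible stand-ins that keep referential shape."""
    # A stable token derived from the id: deterministic, so related rows
    # obfuscate consistently, but the original values are not recoverable.
    token = hashlib.sha256(str(row["id"]).encode()).hexdigest()[:8]
    masked = dict(row)
    masked["name"] = f"Customer-{token}"
    masked["email"] = f"user-{token}@example.test"
    return masked

rows = [{"id": 42, "name": "Jane Doe", "email": "jane@real.com"}]
print([obfuscate_customer(r) for r in rows])
```

In a real refresh pipeline this transform would run between the production extract and the non-production load, and would need to cover every confidential column, not just the two shown here.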
Taking this approach is expensive, with few, if any, stakeholders lining up to fund this kind of effort. Keep in mind that once the database replacement process is figured out, items like end-to-end tests will need to be updated to handle data refreshes as well. Another aspect is that finding the time to actually perform the updates is often just as big a challenge as funding the refresh automation process.
If taking the wholesale replacement approach is not viable, Plan B provides a mechanism to introduce new sets of data into the database that can be used by the validation aspect of the feature team.
This basically means that a new CUSTOMER would be added into the test environment, using only production instance attributes plus any newly added items as a part of the current feature branch or parent develop branch.
If for some reason more bad data is introduced, this approach can still be effective, as future feature tests will be completed against another new CUSTOMER record. Of course, database schema changes must be treated like program code — reverting anything that does not link to a valid feature.
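The Plan B approach above can be sketched as a small test helper. The schema and helper name are illustrative assumptions; the point is that each validation run inserts a uniquely named CUSTOMER row and asserts only against data it created itself, so lingering bad data in other rows cannot blindside it.

```python
import sqlite3
import uuid

def create_test_customer(conn: sqlite3.Connection, status: str = "active") -> int:
    """Insert a uniquely named customer and return its id, so each test
    runs against a fresh row rather than pre-existing data."""
    name = f"test-customer-{uuid.uuid4().hex[:8]}"
    cur = conn.execute(
        "INSERT INTO CUSTOMER (name, status) VALUES (?, ?)", (name, status)
    )
    return cur.lastrowid

# Illustrative test setup: production-shaped schema, no inherited rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMER (id INTEGER PRIMARY KEY, name TEXT, status TEXT)")

cid = create_test_customer(conn)
(status,) = conn.execute(
    "SELECT status FROM CUSTOMER WHERE id = ?", (cid,)
).fetchone()
print(status)  # "active" — the test reads back only the row it created
```

Because every run targets its own record, an orphaned row elsewhere in the table cannot turn a passing feature into a false negative.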
The recent Frontline film Amazon Empire: The Rise and Reign of Jeff Bezos spends a decent amount of time discussing the power data has over the customer. One data scientist described how predictable customers become after taking the time to study every page view, cart action, and product search. As the presentation notes, the one with (good) data is certainly king.
I feel like having bad data always puts one at a disadvantage. This certainly rings true for the feature team developer who is blindsided as a result of something failing due to bad data that was left lingering long after the original feature was backlogged.
This disadvantage often leads to additional debugging and eventually scanning of the database, only to realize there is a data issue from the data being in an unpredictable state. In this case, the feature team often takes a hit for something that was outside of their control, causing other features to pause while the team responds in their own implementation of production support.
Have a really great day!