Tips for Eliminating Poor Data
Accumulated bad data increases the number of errors in a system, so it is important to build a continuous process for eliminating it.
The Best Approach To Handling Poor Data
There are many ways to evaluate poor data, but the following approach has proved to be the most effective and universal in practice.
To weed out poor data, you need to:
- Clearly define criteria for poor data
- Perform data analysis against these criteria
- Find out the sources of this poor data
- Fix poor data
- Fix poor data sources
Criteria can include whether the data matches an expected type or format, falls within an allowed range, is complete, and is free of duplicates, among others.
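As a minimal sketch, criteria of this kind can be written as small predicate functions and run against each record. The field names (`id`, `email`, `age`), the sample records, and the 0 to 120 age range are assumptions made for illustration; real criteria would come from your own business rules.

```python
import re
from collections import Counter

# Hypothetical records; the field names are illustrative only.
records = [
    {"id": 1, "email": "ann@example.com", "age": 34},
    {"id": 2, "email": "not-an-email", "age": 29},
    {"id": 3, "email": "bob@example.com", "age": 240},
    {"id": 4, "email": None, "age": 41},
    {"id": 1, "email": "ann@example.com", "age": 34},
]

# Deliberately simple pattern; it is not a full email validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_complete(rec):
    """Completeness: every required field is present and not None."""
    return all(rec.get(field) is not None for field in ("id", "email", "age"))


def check_email_format(rec):
    """Format: the email field is a string that looks like an address."""
    email = rec.get("email")
    return isinstance(email, str) and EMAIL_RE.match(email) is not None


def check_age_range(rec):
    """Range: age is an integer between 0 and 120 (assumed limits)."""
    age = rec.get("age")
    return isinstance(age, int) and 0 <= age <= 120


CRITERIA = {
    "complete": check_complete,
    "email_format": check_email_format,
    "age_range": check_age_range,
}


def find_violations(recs):
    """Return (record index, criterion name) pairs for every failed check."""
    violations = []
    for index, rec in enumerate(recs):
        for name, check in CRITERIA.items():
            if not check(rec):
                violations.append((index, name))
    # Duplicates are a property of the whole set, so they are checked separately.
    id_counts = Counter(rec.get("id") for rec in recs)
    for index, rec in enumerate(recs):
        if id_counts[rec.get("id")] > 1:
            violations.append((index, "duplicate_id"))
    return violations


for index, name in find_violations(records):
    print(f"record {index}: failed {name}")
```

Keeping each criterion as a separate named function makes it easy to report which rule failed, which in turn helps trace the violation back to its source.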
Next, check all of the data, or a subset of it, against these criteria.
If the dataset is large, it makes sense to check only a sample at the initial stages, since most sources of errors can be identified and corrected even on a small sample.
Once these errors are corrected, the entire dataset can be checked.
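The sample-first idea can be sketched roughly as follows. The `is_valid` criterion (a non-negative number), the sample size, and the data are all hypothetical; the point is only that a cheap pass over a random sample comes before the full scan.

```python
import random


def is_valid(value):
    # Hypothetical criterion: the value must be a non-negative number.
    return isinstance(value, (int, float)) and value >= 0


def error_rate(values):
    """Share of values that fail the criterion (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(1 for v in values if not is_valid(v)) / len(values)


def sample_error_rate(values, sample_size, seed=0):
    """Estimate the error rate from a random sample instead of the full set."""
    rng = random.Random(seed)  # fixed seed so repeated runs are comparable
    sample = rng.sample(values, min(sample_size, len(values)))
    return error_rate(sample)


# Illustrative dataset: 3 of every 8 values violate the criterion.
data = [10, -3, 25, None, 7, 12, -1, 40] * 1000

# Early stage: a quick estimate on 200 values points to the problem.
print("sample estimate:", sample_error_rate(data, 200))

# After the sources of errors are fixed, run the full check.
print("full check:", error_rate(data))
```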
The source of poor data can be a person who made a mistake during data entry, such as a point-of-sale (POS) employee.
It can also be an external information system or some process performing internal calculations in your own information system.
After identifying poor data and its sources, there are two directions to work on:
- Fix already existing poor data
- Prevent the appearance of such data in the future
For the first, depending on the source, you need to correct the data manually, reload it from the external system, or rerun the calculations correctly in your information system.
For the second, you usually need to correct the processes that produce poor data.
In the case of staff errors, you can train the staff or add input validation.
If poor data comes from an external system, you need to discuss the exchange format with the counterparty.
If poor data is the result of internal calculations, you need to correct the corresponding algorithms in your information system.
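For the staff-error case, input validation might look something like the sketch below. The form fields (`sku`, `quantity`, `sold_on`) and the rules are assumptions for a hypothetical sales entry form, not a prescribed schema.

```python
from datetime import date


def validate_sale(form):
    """Return a list of error messages; an empty list means the input is accepted."""
    errors = []

    # Completeness: the SKU must not be blank.
    sku = str(form.get("sku", "")).strip()
    if not sku:
        errors.append("sku is required")

    # Type and range: the quantity must be a positive whole number.
    try:
        quantity = int(form.get("quantity", ""))
        if quantity <= 0:
            errors.append("quantity must be positive")
    except (TypeError, ValueError):
        errors.append("quantity must be a whole number")

    # Format and range: the sale date must be an ISO date that is not in the future.
    try:
        sold_on = date.fromisoformat(str(form.get("sold_on", "")))
        if sold_on > date.today():
            errors.append("sold_on cannot be in the future")
    except ValueError:
        errors.append("sold_on must be an ISO date such as 2024-01-31")

    return errors


print(validate_sale({"sku": "A-1", "quantity": "2", "sold_on": "2020-01-15"}))
print(validate_sale({"sku": " ", "quantity": "x", "sold_on": "15/01/2020"}))
```

Rejecting the record at entry time, with a message the operator can act on, stops the error before it reaches storage and has to be corrected later.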
The Efficiency of This Technique
This technique is effective because the criteria are clearly defined and based on the needs of the business.
Clear criteria also let you automate the identification of poor data, so that the people responsible are notified quickly when it appears and can respond in a timely manner.
For example, you can organize an email notification with the results of data quality checks.
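A notification of this kind could be assembled with Python's standard `email` package, as in the sketch below. The addresses, the shape of the `results` dictionary (criterion name mapped to a violation count), and the subject line are assumptions; actually sending the message depends on your mail server, so that step is only indicated in a comment.

```python
from email.message import EmailMessage


def build_quality_report(results, sender, recipient):
    """Build an email summarizing violation counts per criterion."""
    any_failed = any(count > 0 for count in results.values())
    status = "FAILED" if any_failed else "OK"

    message = EmailMessage()
    message["Subject"] = f"Data quality check: {status}"
    message["From"] = sender
    message["To"] = recipient

    lines = [f"{name}: {count} violation(s)" for name, count in sorted(results.items())]
    message.set_content("\n".join(lines))
    return message


# Hypothetical results from a check run and placeholder addresses.
report = build_quality_report(
    {"age_range": 2, "duplicate_id": 0},
    sender="dq@example.com",
    recipient="team@example.com",
)
print(report["Subject"])
print(report.get_content())

# Sending is deployment-specific, for example with smtplib:
#   with smtplib.SMTP("your.mail.host") as server:
#       server.send_message(report)
```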
Frequency of the Bad Data Evaluation Process
It is very important that this data evaluation process be run regularly.
This will allow you to correct errors in data sources in a timely manner and, as a result, avoid time-consuming manual corrections, as well as minimize business risks associated with the use of poor data.
How often the data evaluation process should run depends on the type of information system and on the format of the data itself.
In an analytical system, part of the data may never change; it is enough to check such data once, eliminate the errors, and then repeat the process only for new data.
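One way to check only new data is to remember how far the previous run got, as in this sketch. It assumes rows carry a monotonically increasing `id` that can serve as a watermark; the `amount` field and the `non_negative` criterion are hypothetical.

```python
def check_new_rows(rows, last_checked_id, is_valid):
    """Validate only rows added since the previous run.

    Returns (ids of invalid new rows, new watermark).
    """
    new_rows = [row for row in rows if row["id"] > last_checked_id]
    bad_ids = [row["id"] for row in new_rows if not is_valid(row)]
    # If nothing new arrived, the watermark stays where it was.
    watermark = max((row["id"] for row in new_rows), default=last_checked_id)
    return bad_ids, watermark


def non_negative(row):
    # Hypothetical criterion: amounts must not be negative.
    return row["amount"] >= 0


rows = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": -5},  # assumed to have been handled in an earlier run
    {"id": 3, "amount": 7},
    {"id": 4, "amount": -1},
]

# The previous run ended at id 2, so only rows 3 and 4 are checked now.
bad_ids, watermark = check_new_rows(rows, 2, non_negative)
print(bad_ids, watermark)
```

The watermark would normally be stored between runs, for example in a small state table, so that each run picks up where the last one stopped.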
In an online system, large amounts of data can change, so in principle the entire dataset needs checking. In this case, you can continuously check part of the dataset and, only when errors are found, check the entire dataset for those specific errors.
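The partial-check-then-escalate approach can be sketched as below. The slicing scheme (every `slices`-th row, rotated by run number), the fields, and the criteria are assumptions chosen to keep the example short; a real system might partition by date or key range instead.

```python
def rolling_check(rows, run_number, slices, criteria):
    """Check one slice per run; escalate to a full scan only for failing criteria.

    Returns a dict mapping each failing criterion name to all rows
    in the full dataset that violate it.
    """
    # Each run looks at a different slice, so over `slices` runs every row is visited.
    part = rows[run_number % slices::slices]
    failing = [
        name for name, check in criteria.items()
        if not all(check(row) for row in part)
    ]
    # Full-dataset scan, restricted to the criteria that failed on the slice.
    return {
        name: [row for row in rows if not criteria[name](row)]
        for name in failing
    }


rows = [
    {"qty": 1, "price": 5.0},
    {"qty": -2, "price": 3.0},
    {"qty": 4, "price": 2.5},
    {"qty": -7, "price": 1.0},
]
criteria = {
    "qty_positive": lambda row: row["qty"] > 0,
    "price_positive": lambda row: row["price"] > 0,
}

print(rolling_check(rows, 0, 2, criteria))  # slice with rows 0 and 2
print(rolling_check(rows, 1, 2, criteria))  # slice with rows 1 and 3
```

Note the trade-off: a run whose slice happens to be clean reports nothing even if other slices contain errors, so the number of slices bounds how long an error can go unnoticed.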
In concrete terms, the process can run anywhere from daily, for systems that are sensitive to any error, to once a month, if the business processes can tolerate a certain number of errors without significant impact.
Regardless of the chosen launch frequency, the process of improving data quality itself must exist throughout the entire lifecycle of an information system.
What Can Lead to the Accumulation of Bad Data?
The absence of a process for handling poor data leads to an accumulation of errors.
Moreover, errors in the original data can lead to secondary errors resulting from working with poor data.
Eventually, these factors turn into a persistent cost that reduces the company's profit.
Building data quality management processes at the early stages of system deployment is very important.
Doing so while data volumes are still small yields, with the least effort, a set of evaluation criteria and procedures for analyzing poor data, notifying users, and working with the sources of poor data.
After all, it is better to prevent errors than to eliminate them.