DZone

Tips for Eliminating Poor Data

Accumulated bad data increases the number of errors in a system, so it is important to build a continuous process for eliminating it.

By Yuri Danilov · Jul. 05, 23 · Analysis

The Best Approach To Handling Poor Data

There are many ways to evaluate poor data, but the following approach has proved effective and widely applicable in practice.

To weed out poor data, you need to:

  • Clearly define criteria for poor data
  • Perform data analysis against these criteria
  • Find out the sources of this poor data
  • Fix poor data
  • Fix poor data sources

Criteria can include whether the data matches an expected type or format, falls within a permitted range, is complete, and contains no duplicates, among others. Data that fails any of these criteria counts as poor.
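The criteria listed above can be written down as small, named checks. The following is a minimal sketch in Python, not a definitive implementation: the record fields (`id`, `email`, `age`), the email pattern, and the 0 to 120 age range are assumptions made for illustration only.

```python
import re

# Hypothetical records; the field names are assumptions for illustration.
records = [
    {"id": 1, "email": "ann@example.com", "age": 34},
    {"id": 2, "email": "not-an-email", "age": 29},
    {"id": 3, "email": "bob@example.com", "age": 212},
    {"id": 4, "email": None, "age": 41},
    {"id": 1, "email": "ann@example.com", "age": 34},
]

# Deliberately simple pattern; real email validation is more involved.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_complete(rec):
    """Completeness: every expected field is present and not None."""
    return all(rec.get(field) is not None for field in ("id", "email", "age"))

def has_valid_format(rec):
    """Type/format: the email is a string that matches the pattern."""
    email = rec.get("email")
    return isinstance(email, str) and EMAIL_RE.match(email) is not None

def is_in_range(rec):
    """Range: the age is an integer between 0 and 120 (assumed limits)."""
    age = rec.get("age")
    return isinstance(age, int) and 0 <= age <= 120

CRITERIA = {
    "completeness": is_complete,
    "format": has_valid_format,
    "range": is_in_range,
}

def find_violations(recs):
    """Return (record position, criterion name) for every failed check."""
    violations = []
    for index, rec in enumerate(recs):
        for name, check in CRITERIA.items():
            if not check(rec):
                violations.append((index, name))
    # Duplicates: a record is flagged when an identical one was seen earlier.
    seen = set()
    for index, rec in enumerate(recs):
        key = (rec.get("id"), rec.get("email"), rec.get("age"))
        if key in seen:
            violations.append((index, "duplicate"))
        seen.add(key)
    return violations

violations = find_violations(records)
```

Keeping each criterion as a separate named function makes it easy to report which rule a record broke, which helps later when tracing poor data back to its source.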

Next, check all of the data, or a sample of it, against these criteria.

If the amount of data is large, it makes sense to check only part of it at first, since most sources of errors can be identified and corrected even from a small sample.

Once these errors are corrected, the entire dataset can be checked.
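The sample-first approach described above might look like the sketch below. The criterion (`is_valid`), the synthetic `amounts` data, the 10% sample fraction, and the fixed seed are all assumptions chosen for illustration.

```python
import random

def is_valid(amount):
    """Hypothetical criterion: an amount must be a non-negative number."""
    return isinstance(amount, (int, float)) and amount >= 0

def failing_positions(values, positions):
    """Return the positions (from the given ones) whose value fails the criterion."""
    return [i for i in positions if not is_valid(values[i])]

def sample_positions(total, fraction, seed):
    """Pick a reproducible random subset of positions to check first."""
    rng = random.Random(seed)
    size = min(total, max(1, int(total * fraction)))
    return sorted(rng.sample(range(total), size))

# Synthetic dataset: every seventh amount is negative, simulating a recurring error.
amounts = [-1.0 if i % 7 == 0 else float(i) for i in range(1000)]

# Stage 1: check a small sample to discover what kinds of errors exist.
sampled = sample_positions(len(amounts), fraction=0.1, seed=42)
sample_failures = failing_positions(amounts, sampled)

# Stage 2: after the error sources are fixed, check the entire dataset.
full_failures = failing_positions(amounts, range(len(amounts)))
```

A fixed seed keeps the sample reproducible, so a failure found during the first stage can be looked up again while its source is being investigated.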

The source of poor data can be a person who made a mistake during data entry, such as a point-of-sale (POS) employee.

It can also be an external information system or some process performing internal calculations in your own information system.
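If each record carries a tag naming where it came from, failed checks can be grouped by source to see which one deserves attention first. The sketch below assumes such a `source` field exists; the source names and the `quantity` criterion are hypothetical.

```python
from collections import Counter

# Hypothetical records, each tagged with the source that produced it.
records = [
    {"source": "pos_terminal", "quantity": 3},
    {"source": "pos_terminal", "quantity": -2},
    {"source": "supplier_feed", "quantity": None},
    {"source": "pos_terminal", "quantity": -1},
    {"source": "nightly_batch", "quantity": 10},
]

def is_poor(rec):
    """Assumed criterion: quantity must be a non-negative integer."""
    quantity = rec["quantity"]
    return not isinstance(quantity, int) or quantity < 0

def failures_by_source(recs):
    """Count poor records per source."""
    return Counter(rec["source"] for rec in recs if is_poor(rec))

counts = failures_by_source(records)
worst_source, worst_count = counts.most_common(1)[0]
```

Here the manual-entry source accounts for most of the failures, which points toward training or input validation rather than a change to the external feed.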

After identifying poor data and its sources, there are two directions to work on:

  • Fix already existing poor data
  • Prevent the appearance of such data in the future

In the first direction, depending on the source, you need to correct the data manually, reload it from the external system, or rerun the corrected calculations in your information system.

In the second direction, it is usually necessary to correct the processes that produce poor data.

For staff errors, you can train the employees or add input validation.

If poor data comes from an external system, you need to agree on the exchange format with the counterparties responsible for it.

If poor data is the result of internal calculations, then you need to correct the corresponding algorithms in your information system.
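Of the preventive measures above, input validation is the simplest to illustrate. The following sketch rejects a bad entry before it is stored; the form fields (`sku`, `quantity`, `sold_on`) and the rules applied to them are assumptions, not part of any particular system.

```python
from datetime import date

def validate_sale(raw):
    """Validate raw form input (all values are strings).

    Returns a list of error messages; an empty list means the entry is accepted.
    """
    errors = []

    if not raw.get("sku", "").strip():
        errors.append("sku is required")

    try:
        quantity = int(raw.get("quantity", ""))
        if quantity <= 0:
            errors.append("quantity must be positive")
    except ValueError:
        errors.append("quantity must be a whole number")

    try:
        sold_on = date.fromisoformat(raw.get("sold_on", ""))
        if sold_on > date.today():
            errors.append("sold_on cannot be in the future")
    except ValueError:
        errors.append("sold_on must be a date in YYYY-MM-DD format")

    return errors

accepted = validate_sale({"sku": "A-100", "quantity": "3", "sold_on": "2023-07-05"})
rejected = validate_sale({"sku": " ", "quantity": "two", "sold_on": "05/07/2023"})
```

Returning every error at once, rather than stopping at the first, lets the person entering the data fix all problems in a single pass.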

The Efficiency of This Technique

This technique is effective because the criteria are clearly defined and based on the needs of the business.

Clear criteria also make it possible to automate the detection of poor data, so that you are notified quickly when it appears and can respond in a timely manner.

For example, you can organize an email notification with the results of data quality checks.
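One possible shape for such a notification, using only the Python standard library, is sketched below. The check names, the addresses, and the SMTP host and port are placeholders; `send_report` is defined but not called here because it needs a reachable mail server.

```python
import smtplib
from email.message import EmailMessage

def build_report(results, sender, recipient):
    """Build an email summarizing violation counts per check.

    `results` maps a check name to the number of violations it found.
    """
    any_failed = any(count > 0 for count in results.values())
    status = "FAILED" if any_failed else "OK"

    message = EmailMessage()
    message["Subject"] = f"Data quality checks: {status}"
    message["From"] = sender
    message["To"] = recipient
    lines = [f"{name}: {count} violation(s)" for name, count in sorted(results.items())]
    message.set_content("\n".join(lines))
    return message

def send_report(message, host="localhost", port=25):
    """Send the report; host and port are assumptions for a local relay."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(message)

# Hypothetical results from a run of the checks.
report = build_report(
    {"duplicates": 0, "range": 4},
    sender="dq@example.com",
    recipient="team@example.com",
)
```

Putting the overall status in the subject line lets recipients filter out clean runs and open only the reports that need action.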

Periodicity of the Bad Data Evaluation Process

It is very important that this data evaluation process be run regularly.

This will allow you to correct errors in data sources in a timely manner and, as a result, avoid time-consuming manual corrections, as well as minimize business risks associated with the use of poor data.

How often to run the data evaluation process depends on the type of information system and on the data itself.

In an analytical system, part of the data may remain unchanged, and it is enough to check such data once and eliminate errors, then repeat the process only for new data.
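Checking only new data can be done by remembering how far the previous run got. The sketch below assumes rows have a monotonically increasing `id` and never change once loaded; both the field names and the criterion are hypothetical.

```python
def check_new_rows(rows, last_checked_id, is_valid):
    """Check only rows added since the previous run.

    Returns the ids that failed and the new high-water mark to store
    for the next run.
    """
    new_rows = [row for row in rows if row["id"] > last_checked_id]
    failures = [row["id"] for row in new_rows if not is_valid(row)]
    watermark = max((row["id"] for row in new_rows), default=last_checked_id)
    return failures, watermark

def non_negative_total(row):
    """Assumed criterion: an order total cannot be negative."""
    return row["total"] >= 0

rows = [
    {"id": 1, "total": 10.0},
    {"id": 2, "total": -5.0},
    {"id": 3, "total": 7.5},
]

# First run: nothing has been checked yet, so every row is examined.
first_failures, watermark = check_new_rows(rows, 0, non_negative_total)

# New data arrives; the second run looks only at rows past the watermark.
rows.append({"id": 4, "total": -1.0})
second_failures, watermark = check_new_rows(rows, watermark, non_negative_total)
```

Row 2 is reported once, in the first run, and is not rechecked later, which matches the idea of checking unchanged data a single time.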

In an online system, large amounts of data can change, so in principle the entire dataset needs to be checked. In this case, you can continuously check part of the dataset and, only when errors are found, check the entire dataset for those specific errors.
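The escalation described above, a partial check followed by a full scan limited to the checks that failed, could be sketched as follows. The check names, row fields, and sample size are assumptions; the sample here covers all four rows only so that the example is deterministic, and in practice it would be a small fraction of the dataset.

```python
import random

# Hypothetical checks; each returns True when the row is acceptable.
CHECKS = {
    "non_negative_price": lambda row: row["price"] >= 0,
    "has_currency": lambda row: bool(row["currency"]),
}

def spot_check(rows, sample_size, rng):
    """Check a random part of the dataset; return the names of failed checks."""
    sample = rng.sample(rows, min(sample_size, len(rows)))
    return {
        name
        for name, check in CHECKS.items()
        if any(not check(row) for row in sample)
    }

def targeted_full_check(rows, failed_checks):
    """Scan the whole dataset, but only for the checks that failed in the sample."""
    return {
        name: [row["id"] for row in rows if not CHECKS[name](row)]
        for name in sorted(failed_checks)
    }

rows = [
    {"id": 1, "price": 9.99, "currency": "USD"},
    {"id": 2, "price": -4.00, "currency": "USD"},
    {"id": 3, "price": 2.50, "currency": "USD"},
    {"id": 4, "price": -1.00, "currency": "USD"},
]

failed = spot_check(rows, sample_size=4, rng=random.Random(0))
report = targeted_full_check(rows, failed) if failed else {}
```

Because the full scan runs only the failed checks, its cost grows with the number of distinct problems found rather than with the total number of criteria.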

In concrete terms, the process can run anywhere from daily, for systems that are sensitive to any error, to once per month, if the business can tolerate a certain number of errors without a significant impact on its processes.

Regardless of the chosen frequency, the process of improving data quality must continue throughout the entire lifecycle of an information system.

What Can Lead to the Accumulation of Bad Data?

The absence of a process for handling poor data leads to an accumulation of errors.

Moreover, errors in the original data can lead to secondary errors in anything derived from that data.

Eventually, these factors create a persistent drag on the company's profit.

Summary

Building data quality management processes at the early stages of system deployment is very important.

Starting early, while data volumes are still small, yields with the least effort a set of evaluation criteria and algorithms for analyzing poor data, informing users, and working with the sources of poor data.

After all, it is better to prevent errors than to eliminate them.

Data analysis Data quality Data (computing)

Opinions expressed by DZone contributors are their own.
