What Is Data Cleansing?

In this post, we discuss the concepts of dirty data and data cleansing, and take a look at some factors to keep in mind when exploring data cleansing tools.

by Garrett Alley · Aug. 14, 18 · Analysis

As a business grows and matures, the size, number, formats, and types of its data assets change along with it. Evolutions in payroll systems, new network hardware and software, emerging supply-chain technologies, and the like can all create the need to migrate, merge, and combine data from multiple sources. "Dirty" data (data that contains redundant or duplicate records, is missing information, or has otherwise been corrupted while being imported or merged) is one inevitable result. Data transformation, which involves "massaging" data to make its fields and formats conform to those of its destination, can also be the source of hair pulling and sleepless nights.

The art and science of handling these odious tasks is called "data cleansing."

Clean Up Your Dirty Data

The goal of data cleansing is to improve data quality and utility by catching and correcting errors before the data is transferred to a target database or data warehouse. Manual data cleansing may or may not be realistic, depending on the amount of data and the number of data sources your company has (this blog post has some good information on how to approach manual data cleansing). There are also data cleansing tools designed to take some of the difficulty out of the process.

Regardless of the methodology, data cleansing presents a handful of challenges, such as correcting mismatches, ensuring that columns are in the same order, and checking that data (such as date or currency) is in the same format. Depending on the situation, other difficulties may include enriching data with supplementary information on the fly, revising or updating schema, and detecting errors. These data discrepancies may have originated from human error, aging (data such as contact information degrades over time), omissions due to optional fields in forms, or merge errors.

Both manual and automatic data cleansing execute the same basic steps, in varying order:

  1. Import data via an API or in .csv (or another delimited text format).
  2. Format data to match the destination database.
  3. Re-create missing data, wherever possible.
  4. Correct errors, such as spelling.
  5. Reorder columns and rows to match the target database.
  6. Compare and delete duplicate records.
  7. Enrich data by merging in additional information (such as adding data from purchased marketing and sales databases), if desired.
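To make the steps above concrete, here is a minimal sketch in Python that uses only the standard library. The column names, the `TARGET_COLUMNS` layout, the `SPELLING_FIXES` lookup, and the sample rows are all made up for illustration, and filling a missing country with a placeholder is just one simple stand-in for "re-creating" missing data. Step 7 (enrichment) is left out here.

```python
import csv
import io
from datetime import datetime

# Hypothetical column order expected by the destination table.
TARGET_COLUMNS = ["customer_id", "name", "signup_date", "country"]

# Hypothetical lookup of known misspellings and their corrections.
SPELLING_FIXES = {"Untied States": "United States", "Germny": "Germany"}

# Made-up sample input: columns out of order, a duplicate row, a blank field.
RAW_CSV = """name,customer_id,country,signup_date
Ada Lovelace,1,Untied States,08-14-2018
Ada Lovelace,1,Untied States,08-14-2018
Alan Turing,2,,07-01-2018
"""


def cleanse(raw_text, default_country="Unknown"):
    """Run steps 1-6 over delimited text and return a list of clean rows."""
    rows = []
    seen = set()
    # Step 1: import the delimited text.
    for record in csv.DictReader(io.StringIO(raw_text)):
        # Step 2: format dates the way the destination expects (ISO 8601 here).
        parsed = datetime.strptime(record["signup_date"], "%m-%d-%Y")
        record["signup_date"] = parsed.date().isoformat()
        # Step 3: fill in missing data (a placeholder in this sketch).
        if not record["country"]:
            record["country"] = default_country
        # Step 4: correct known errors, such as spelling.
        record["country"] = SPELLING_FIXES.get(record["country"], record["country"])
        # Step 5: reorder columns to match the target.
        ordered = tuple(record[column] for column in TARGET_COLUMNS)
        # Step 6: skip exact duplicates of rows already kept.
        if ordered in seen:
            continue
        seen.add(ordered)
        rows.append(dict(zip(TARGET_COLUMNS, ordered)))
    return rows


clean_rows = cleanse(RAW_CSV)
```

In this sketch, duplicates are compared after formatting and correction, so two rows that differ only by a known misspelling still collapse into one.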

Remove the Headaches From the Process

Some "data wranglers" are born do-it-yourselfers and prefer to detect and correct their data inconsistencies manually. However, this methodology is error-prone and does not scale as a company grows. For those seeking an automated solution, selecting the right data-cleansing software should be a top priority.

Here are some features and functionality to look for when comparing vendors of automated data-cleansing software:

The ability to catch and correct errors in real time. Errors are inevitably introduced in the process of integrating data from multiple sources. Premium data-cleansing solutions include the ability to catch these errors and restream the affected records for exactly-once processing.
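As a rough illustration of the idea (not how any particular product works), the sketch below sets bad records aside in a quarantine and keys every load on a record ID, so replaying a batch after a fix does not load anything twice. The `load` and `validate_amount` functions, the `id` and `amount` fields, and the in-memory dictionaries standing in for a warehouse are all assumptions made for this example.

```python
def load(records, warehouse, quarantine, validate):
    """Load valid records once; park invalid ones for later correction."""
    for record in records:
        key = record["id"]
        if key in warehouse:
            # Already loaded, so a replayed record is a no-op.
            continue
        try:
            validate(record)
        except ValueError as exc:
            # Keep the bad record and the reason so it can be fixed and resent.
            quarantine[key] = (record, str(exc))
            continue
        warehouse[key] = record
        quarantine.pop(key, None)


def validate_amount(record):
    """Raise ValueError if the hypothetical 'amount' field is not numeric."""
    float(record["amount"])


warehouse, quarantine = {}, {}
batch = [{"id": "a", "amount": "10.5"}, {"id": "b", "amount": "ten"}]

# First pass: record "a" loads, record "b" is quarantined.
load(batch, warehouse, quarantine, validate_amount)

# Correct the quarantined records, then replay them with the original batch.
fixed = [dict(record, amount="10") for record, _reason in quarantine.values()]
load(batch[:1] + fixed, warehouse, quarantine, validate_amount)
```

Strictly speaking, this is idempotent loading keyed on record ID, which is a common way to get an exactly-once effect on top of retries.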

Parse formats to enable proper merging. In merging data from multiple sources, there are often inconsistencies in data format. Your data-cleansing software must be able to match one format (such as M-D-Y) to another (such as D-M-Y).
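A small sketch of that kind of format matching, using Python's standard `datetime` module. The source names (`crm`, `billing`) and the `SOURCE_DATE_FORMATS` mapping are hypothetical. The format is declared per source because a value such as 03-04-2018 is valid in both M-D-Y and D-M-Y, so it cannot be told apart by inspection.

```python
from datetime import datetime

# Hypothetical mapping from each data source to the date format it uses.
SOURCE_DATE_FORMATS = {
    "crm": "%m-%d-%Y",      # M-D-Y
    "billing": "%d-%m-%Y",  # D-M-Y
}


def normalize_date(value, source):
    """Parse a date using its source's declared format and return ISO 8601."""
    parsed = datetime.strptime(value, SOURCE_DATE_FORMATS[source])
    return parsed.date().isoformat()


# The same calendar day arrives in two layouts and merges to one value.
from_crm = normalize_date("08-14-2018", "crm")
from_billing = normalize_date("14-08-2018", "billing")
```

`datetime.strptime` raises `ValueError` when a value does not fit the declared format, which is a convenient hook for flagging records that need attention.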

Automatically revise or update schema. A master database can be an important tool for deriving insights from your data. However, integrating all of your data in this way requires every data source to have the same schema. The best data-cleansing software will detect schema mismatches between sources and revise the schema wherever necessary.
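One simplified way to picture schema reconciliation: rename known column variants to a single name, take the union of all columns seen, and fill the gaps with empty values. The `COLUMN_ALIASES` table, the `unify_schema` function, and the sample rows below are assumptions for illustration. Real tools also have to deal with type changes and nested fields, which this sketch ignores.

```python
# Hypothetical aliases: column names that mean the same thing across sources.
COLUMN_ALIASES = {"e_mail": "email", "zip": "postal_code"}


def unify_schema(sources):
    """Return (columns, rows) with every row conforming to one shared schema."""
    # Rename aliased columns so equivalent fields share a single name.
    renamed = [
        [{COLUMN_ALIASES.get(key, key): value for key, value in row.items()}
         for row in rows]
        for rows in sources
    ]
    # Build the unified column list in first-seen order.
    columns = []
    for rows in renamed:
        for row in rows:
            for column in row:
                if column not in columns:
                    columns.append(column)
    # Fill columns a source does not provide with None.
    unified = [
        {column: row.get(column) for column in columns}
        for rows in renamed
        for row in rows
    ]
    return columns, unified


source_a = [{"id": 1, "e_mail": "x@example.com"}]
source_b = [{"id": 2, "email": "y@example.com", "zip": "27709"}]
columns, rows = unify_schema([source_a, source_b])
```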

Enrich data before merging it into a data warehouse. It is not unusual for companies to add supplementary data from a commercial source to incoming data. An example is to enrich incoming demographic data with corrected address data (such as Zip+4 information) before transporting it to the data warehouse.
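Here is a minimal sketch of that enrichment step. The `ZIP4_REFERENCE` table stands in for a purchased address database, and its single entry is fake data invented for the example, as are the `enrich` function and the `street` and `zip` field names.

```python
# Stand-in for a commercial address database. The entry is made up.
ZIP4_REFERENCE = {
    ("1 example street", "00000"): "00000-0001",
}


def enrich(record, reference):
    """Return a copy of the record with a ZIP+4 code added when one is known."""
    # Normalize the street so trivial differences do not block a match.
    key = (record["street"].strip().lower(), record["zip"])
    # Leave zip_plus4 as None when the reference has no match.
    return dict(record, zip_plus4=reference.get(key))


incoming = [
    {"name": "Ada", "street": " 1 Example Street", "zip": "00000"},
    {"name": "Alan", "street": "2 Unknown Road", "zip": "00000"},
]
enriched = [enrich(record, ZIP4_REFERENCE) for record in incoming]
```

Returning a copy rather than changing the record in place keeps the original input available for comparison if an enrichment turns out to be wrong.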

The ability to view data in real time. Ideally, data scientists and engineers can monitor data streams in real time, enabling them to detect and correct problems before the data is merged into the data warehouse.

Published at DZone with permission of Garrett Alley, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
