What Is Data Cleansing?
In this post, we discuss the concepts of dirty data and data cleansing, and take a look at some factors to keep in mind when exploring data cleansing tools.
As a business grows and matures, the size, number, formats, and types of its data assets change along with it. Evolutions in payroll systems, new network hardware and software, emerging supply-chain technologies, and the like can all create the need to migrate, merge, and combine data from multiple sources. "Dirty" data — data that contains redundancies, includes duplicate records, is missing information, or has been otherwise corrupted in the process of being imported or merged — is one inevitable result. Data transformation, which involves "massaging" data to make its fields and formats conform to those of its destination, can also be the source of hair-pulling and sleepless nights.
The art and science of handling these odious tasks is called "data cleansing."
Clean Up Your Dirty Data
The goal of data cleansing is to improve data quality and utility by catching and correcting errors before the data is transferred to a target database or data warehouse. Manual data cleansing may or may not be realistic, depending on the amount of data and the number of data sources your company has (this blog post has some good information on how to approach manual data cleansing). For larger workloads, there are data cleansing tools designed to take some of the difficulty out of the process.
Regardless of the methodology, data cleansing presents a handful of challenges, such as correcting mismatches, ensuring that columns are in the same order, and checking that data (such as date or currency) is in the same format. Depending on the situation, other difficulties may include enriching data with supplementary information on the fly, revising or updating schema, and detecting errors. These data discrepancies may have originated from human error, aging (data such as contact information degrades over time), omissions due to optional fields in forms, or merge errors.
Both manual and automatic data cleansing execute the same basic steps, in varying order:
- Import data via an API or in .csv (or another delimited text format).
- Format data to match the destination database.
- Re-create missing data, wherever possible.
- Correct errors, such as spelling.
- Reorder columns and rows to match the target database.
- Compare and delete duplicate records.
- Enrich data by merging in additional information (such as adding data from purchased marketing and sales databases), if desired.
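The core of these steps — import, format, and deduplicate — can be sketched in a few lines of Python. The column names and sample records below are hypothetical, and real pipelines would add error handling and configurable match keys:

```python
import csv
from io import StringIO

# Hypothetical raw export with a near-duplicate row and a missing field.
RAW_CSV = """name,email,signup_date
Alice Smith,alice@example.com,2023-01-15
alice smith,alice@example.com,2023-01-15
Bob Jones,bob@example.com,
"""

def cleanse(raw_text):
    """Import, normalize, and deduplicate delimited records."""
    rows = list(csv.DictReader(StringIO(raw_text)))
    cleaned, seen = [], set()
    for row in rows:
        # Format data: normalize casing and whitespace to match the destination.
        row["name"] = " ".join(row["name"].split()).title()
        row["email"] = row["email"].strip().lower()
        # Compare and delete duplicate records on a chosen key.
        key = (row["name"], row["email"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned

records = cleanse(RAW_CSV)
print(len(records))  # the two "Alice Smith" variants collapse into one record
```

Note that the deduplication key is a design choice: matching on name alone would be too aggressive, while matching on every field would miss rows that differ only in casing or whitespace.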
Remove the Headaches From the Process
Some "data wranglers" are born do-it-yourselfers and prefer to detect and correct their data inconsistencies manually. However, this methodology is error-prone and does not scale as a company grows. For those seeking an automated solution, selecting the ideal data-cleansing software should be top priority.
Here are some features and functionality to look for when comparing vendors of automated data-cleansing software:
The ability to catch and correct errors in real time. Errors are inevitably introduced in the process of integrating data from multiple sources. Premium data-cleansing solutions include the ability to catch these errors and restream them for exactly-once processing.
Parse formats to enable proper merging. In merging data from multiple sources, there are often inconsistencies in data format. Your data-cleansing software must be able to parse one format (such as M-D-Y) and convert it to another (such as D-M-Y).
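A minimal version of this format parsing can be done with the standard library's `datetime` module, trying each known layout in turn. The list of candidate formats here is an assumption about the sources being merged:

```python
from datetime import datetime

# Hypothetical layouts seen across the sources being merged.
CANDIDATE_FORMATS = ["%m-%d-%Y", "%d-%m-%Y", "%Y-%m-%d"]

def normalize_date(value, formats=CANDIDATE_FORMATS):
    """Try each known layout and emit one canonical ISO-style date."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

print(normalize_date("25-12-2023"))  # D-M-Y input -> "2023-12-25"
```

One caveat: a value like "03-04-2023" is valid under both M-D-Y and D-M-Y, so the order of the format list encodes an assumption about which convention each source uses. Ambiguity like this is exactly why format parsing belongs in the cleansing step rather than downstream.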
Automatically revise or update schema. A master database can be an important tool for deriving insights from your data. However, integrating all of your data in this way requires every data source to have the same schema. The best data-cleansing software will detect this and revise schema, wherever necessary.
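At its simplest, conforming a record to a master schema means adding missing columns, dropping unknown ones, and fixing column order. The target schema below is hypothetical, and a production tool would also reconcile column types:

```python
TARGET_SCHEMA = ["id", "name", "email", "phone"]  # hypothetical master columns

def conform_to_schema(record, schema=TARGET_SCHEMA, default=None):
    """Add missing columns, drop unknown ones, and emit schema order."""
    extras = set(record) - set(schema)
    if extras:
        print(f"dropping unknown columns: {sorted(extras)}")
    return {col: record.get(col, default) for col in schema}

row = {"name": "Alice", "email": "alice@example.com", "fax": "n/a"}
print(conform_to_schema(row))
# {'id': None, 'name': 'Alice', 'email': 'alice@example.com', 'phone': None}
```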
Enrich data before merging it into a data warehouse. It is not unusual for companies to add supplementary data from a commercial source to incoming data. An example is to enrich incoming demographic data with corrected address data (such as Zip+4 information) before transporting it to the data warehouse.
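Conceptually, this kind of enrichment is a lookup join against the supplementary source before loading. The Zip+4 table below is a stand-in for a purchased dataset:

```python
# Hypothetical supplementary table keyed by five-digit ZIP code.
ZIP_PLUS4 = {"10001": "10001-1234", "94105": "94105-5678"}

def enrich(records, lookup=ZIP_PLUS4):
    """Attach Zip+4 data where a match exists; leave other records untouched."""
    for rec in records:
        rec["zip_plus4"] = lookup.get(rec.get("zip"))
    return records

customers = [{"name": "Alice", "zip": "10001"},
             {"name": "Bob", "zip": "60601"}]
print(enrich(customers))
```

Doing this before the data reaches the warehouse means downstream consumers see one consistent, already-enriched record rather than having to repeat the join themselves.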
The ability to view data in real time. Ideally, data scientists and engineers can monitor data streams in real time, enabling them to detect and correct problems before the data is merged into the data warehouse.
Published at DZone with permission of Garrett Alley, DZone MVB. See the original article here.