Dirty, Disparate, Duplicated Data - Stop the 3 Bad Ds from Destroying Your Business

In this article, we discuss how to prevent your organization from building up a data source with dirty, duplicate, and disparate data.

Imagine this.

Your boss wants to implement a new CRM for managing customer information. The new system will incorporate information in the form of a complete customer journey – from the point they first make contact to post-acquisition. Everyone’s excited and eager to test out new possibilities. Lead generation will now be automated. Email marketing will be centralized and automated. Customer service will be automated. Everything looks good. The timeline for migration is 2 months.

Wait for it.

You and your team begin interacting with IT to study the data. And that’s when all hell breaks loose.

You discover that nearly 60% of your records have some sort of quality issue.

The records you currently have do not even map well onto the new system’s strict data governance rules. The same customer information is stored multiple times across different data sources in varying formats. Sales stores it in Excel, marketing stores it in an online project management tool, customer service stores it in a ticketing tool. The data is duplicated to kingdom come. You’re officially in trouble.

You have no other choice but to redefine the project and let higher-ups know.

Migration is halted. There’s tension in the workplace. Fingers are pointed. It hits your company like a brick that, all this while, the data you relied on for key business decisions was flawed.

Dirty, disparate, duplicated data is a crisis.

And here’s how you can resolve it.

Stop Blaming IT and Take Action at Business Levels 

It’s easy to blame IT for a data migration project gone wrong or even for bad data. But here’s the deal. Bad data is NOT an IT problem. It’s a business problem. Until companies realize that bad data is a business problem, no solution can be effective enough.

Why is it a business problem? Simply because data impacts business decisions, not IT decisions. Also, IT is not present at the point of data capture or at the point of data usage. Let’s understand this with a real-life example. Customers fill in information on the company’s landing page, which, by the way, does not have proper data controls in place. People fill in random phone numbers, incomplete addresses, or nicknames instead of their complete names.

Given that they need to fill this information several times over the course of their interaction with the company, there is a high chance for this data to be flawed. This data will be used by several departments to make strategic business decisions – such as running a new campaign, promoting a new product, etc. 

Because data directly impacts the business, it must be managed by business users who know the data well. For this reason, several data quality solutions on the market allow business users to work directly with data, taking the load off IT.

Understand the Type of Data Crisis You’re Dealing With 

How do you explain a data quality crisis? Well, you’ll need to identify how damaging it is to your organization and the costs associated with it. Evaluate your data against these three categories. If you’ve checked off most of what’s given here, you have to come up with a strong data quality improvement plan.

Dirty Data: Also commonly known as messy data, it includes problems like:

  • Typos in names, addresses, and other data fields.
  • Punctuation, such as full stops or dashes, in alphabet-only or numbers-only fields.
  • Unwanted spacing (leading, trailing, or extra spaces) in data fields.
  • Messy upper/lower casing of names.
  • Non-standardized data, such as writing Street as Strt., Street, or St.
  • Use of abbreviations and nicknames instead of actual names.
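
To make these problems concrete, here is a minimal Python sketch of how a few of them could be cleaned up. The lookup tables STREET_VARIANTS and NICKNAMES and the three function names are hypothetical and made up for illustration; a real project would need much larger reference lists and careful handling of edge cases (for example, "St" can also mean "Saint", and simple capitalization mishandles names like "McDonald").

```python
import re

# Hypothetical lookup tables, for illustration only.
STREET_VARIANTS = {"strt": "Street", "st": "Street", "street": "Street"}
NICKNAMES = {"bob": "Robert", "liz": "Elizabeth"}


def clean_name(raw):
    """Collapse extra spaces, fix casing, and expand a known nickname."""
    words = re.sub(r"\s+", " ", raw).strip().split(" ")
    words = [w.capitalize() for w in words if w]
    if words and words[0].lower() in NICKNAMES:
        words[0] = NICKNAMES[words[0].lower()]
    return " ".join(words)


def clean_phone(raw):
    """Keep digits only, dropping dashes, dots, spaces, and brackets."""
    return re.sub(r"\D", "", raw)


def clean_address(raw):
    """Collapse spaces and standardize street-type abbreviations."""
    words = re.sub(r"\s+", " ", raw).strip().split(" ")
    cleaned = []
    for word in words:
        # Compare without trailing punctuation and without regard to case.
        key = word.rstrip(".,").lower()
        cleaned.append(STREET_VARIANTS.get(key, word))
    return " ".join(cleaned)
```

With these assumptions, clean_name("  bob   SMITH ") gives "Robert Smith" and clean_address("12  Main Strt.") gives "12 Main Street".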

Disparate Data: Information stored in clusters over different sources in different formats.

  • Different departments storing the same information in different ways.
  • No unified entity information.
  • Clusters of flawed information.
  • Lack of a 360-degree customer view.
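
As a rough illustration of what unifying disparate data involves, the sketch below renames department-specific fields to one shared schema. The source names, field names, and the FIELD_MAPS table are all assumptions invented for this example; your own exports will look different.

```python
# Hypothetical field names for each department's export.
FIELD_MAPS = {
    "sales": {"Customer Name": "name", "E-mail": "email", "Phone No": "phone"},
    "marketing": {"full_name": "name", "email_address": "email", "mobile": "phone"},
    "support": {"requester": "name", "requester_email": "email", "contact": "phone"},
}


def to_common_schema(source, record):
    """Rename a source-specific record's fields to the shared schema."""
    mapping = FIELD_MAPS[source]
    # Start with every shared field empty so gaps stay visible.
    unified = {"name": None, "email": None, "phone": None, "source": source}
    for field, value in record.items():
        if field in mapping:
            unified[mapping[field]] = value
    return unified
```

Keeping the source name on each unified record makes it easier to trace a bad value back to the department that captured it.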

Duplicated Data: The mother of all data quality problems.

  • A new record is created every time an entity’s information is updated. One entity may have three records based on the kind of information they provide.
  • An entity recorded under different versions of its name.
  • Users re-registering using different email IDs or phone numbers.
  • Partial duplicates: records that appear to be duplicates but are not carbon copies.
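
Partial duplicates are the hardest to spot because an exact comparison misses them. The following is a minimal sketch of fuzzy matching using Python's standard difflib module. The 0.85 threshold and the "name" and "email" field names are assumptions for illustration, and comparing every pair of records like this only suits small data sets; dedicated matching tools use more scalable techniques.

```python
from difflib import SequenceMatcher
from itertools import combinations


def _norm(value):
    """Lowercase and trim a value, treating None as empty."""
    return (value or "").strip().lower()


def similarity(a, b):
    """Return a 0..1 similarity ratio; empty values never match."""
    a, b = _norm(a), _norm(b)
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def find_likely_duplicates(records, threshold=0.85):
    """Return index pairs with the same email or very similar names."""
    pairs = []
    for (i, left), (j, right) in combinations(enumerate(records), 2):
        left_email = _norm(left.get("email"))
        same_email = left_email != "" and left_email == _norm(right.get("email"))
        similar_name = similarity(left.get("name"), right.get("name")) >= threshold
        if same_email or similar_name:
            pairs.append((i, j))
    return pairs
```

The pairs this returns are candidates only. Someone who knows the data should review them before any records are merged.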

Unless your data has been regularly updated or sorted, it’s highly possible that your organization is storing inherently bad data. The cost of this kind of bad data? It’s huge.

Highlight the Cost of Bad Data to Executives 

Your business leaders are more interested in the business cost of bad data. Interestingly, the cost is not limited to financial losses. It’s also a reputation and credibility cost.

Imagine accidentally sending out a subscription renewal to a list that chose to unsubscribe from your service.

Or imagine the cost of returned mail when addresses are not verified or validated.

And these are just minor problems. The loss is more severe when the company plans a migration or an upgrade, or wants to make a business decision, but the data does not support it. If you want your executives to take the matter seriously, you will have to dive deeper and show them concrete reasons they need to immediately sign off on a data quality improvement plan.

Invest in Tools that Provide a Complete Data Quality Framework

This one is important. There are dozens of data cleaning tools, hacks, algorithms, and whatnot out there. But remember, you are not just cleaning or matching data. You need to implement a data quality framework.

A data quality framework is a tool, a protocol, a system that you can use to assess and fix data quality issues within an organization. The framework becomes a part of your organizational process.

This framework should help you:

  • Integrate Your Data: Connect your data from different sources into one centralized system.
  • Profile Your Data: Give you an in-depth review of problems plaguing your data.
  • Clean Your Data: Fix errors such as typos, inconsistent casing, and stray punctuation.
  • Match Your Data: Consolidate data from disparate sources and give a single source of truth.
  • Merge Your Data: Help you create a master record of your data.
  • Make Sense of Your Data: You need data you can trust! 
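
To give a feel for two of these steps, here is a small Python sketch of profiling and merging. It is not how any particular tool works; the function names and the "first non-empty value wins" merge rule are assumptions chosen to keep the example short, and all values are assumed to be strings.

```python
from collections import Counter


def _text(value):
    """Trim a value, treating None as empty."""
    return (value or "").strip()


def profile(records, fields):
    """Return per-field counts of missing and distinct values."""
    report = {}
    for field in fields:
        values = [_text(r.get(field)) for r in records]
        present = [v for v in values if v]
        report[field] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(1),
        }
    return report


def merge_records(duplicates, fields):
    """Build a master record from the first non-empty value per field."""
    master = {}
    for field in fields:
        candidates = (_text(r.get(field)) for r in duplicates)
        master[field] = next((v for v in candidates if v), None)
    return master
```

In practice the merge rule is a business decision. You might prefer the most recent value, or the value from the most trusted source, over the first one found.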

Whichever tool you propose must support this framework to help you achieve your data quality goals, which leads us to the final point:

Achieving Data Quality Goals

The goal is not to have 100% perfect data because that is unlikely to happen. The goal is to make sure that your data is fit for its intended use. This would mean data that is:

  • Accurate: Information that is not duplicated or defective.
  • Complete: Has all the right information (complete address, complete phone numbers, etc.)
  • Valid: Information that is verified and valid (no fake addresses, fake names, etc.)
  • Accessible: Can be accessed by key stakeholders in an organization.
  • Standardized: Data should have the same standards and formats across the data source.
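
Goals like these are easier to track when you can put a number on them. As a hedged sketch, two of them (completeness and validity) could be measured as below. The function names and the deliberately loose EMAIL_PATTERN are assumptions for illustration; a pattern check only catches malformed values and does not confirm that an address really exists.

```python
import re

# Deliberately simple pattern: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def _text(value):
    """Trim a value, treating None as empty."""
    return (value or "").strip()


def completeness(records, required_fields):
    """Share of records in which every required field is non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(_text(r.get(f)) for f in required_fields) for r in records
    )
    return complete / len(records)


def validity(records, field, pattern):
    """Share of non-empty values in a field that match the pattern."""
    values = [_text(r.get(field)) for r in records]
    values = [v for v in values if v]
    if not values:
        return 0.0
    return sum(bool(pattern.match(v)) for v in values) / len(values)
```

Tracking scores like these over time shows whether a data quality program is actually improving the data.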

Establishing a data quality program will help your organization become operationally efficient, reduce costs, keep customers happy, and, most importantly, set you up for success. Don’t let bad data take your business down.

Opinions expressed by DZone contributors are their own.
