It's Time to End Bad Data. Here's How Data Quality Can Help.
Data is your most valuable asset. It’s time to look at all data with a data quality lens and combat any existing data myopia in your organization.
Bad data has never been such a big deal. Why? Well, according to IDC’s latest report, "Data Age 2025," the projected size of the global data sphere in 2025 would be the equivalent of watching the entire Netflix catalog 489 million times. In a nutshell, the global data sphere is expected to reach 10x its 2016 volume by 2025. As the total volume of data continues to increase, we can infer that the volume of bad data will increase as well unless something is done about it.
No doubt, every data professional will incessantly chase bad data, as it's the bane of every digital transformation. Bad data leads to bad insight and ultimately to biased decisions. That’s why it’s crucial to spot bad data in your organization. But it’s also hard to do.
How to Spot Bad Data
Bad data can come from any area of your organization and take many forms, whether it originates in sales, marketing, or engineering. Let's take a look at a few common categories of bad data:
- Inaccurate: Data that contains misspellings, wrong numbers, missing information, blank fields, etc.
- Non-compliant: Data not meeting regulatory standards.
- Uncontrolled: Data left without continuous monitoring becomes polluted over time.
- Unsecured: Data left without control and vulnerable to access by hackers.
- Static: Data that is not updated and becomes obsolete and useless.
- Dormant: Data that is left inactive and unused in a repository loses its value as it’s neither updated nor shared.
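Two of the categories above lend themselves to simple automated checks. The following is a minimal Python sketch, not a reference implementation: the field names (`name`, `email`, `last_updated`), the one-year staleness threshold, and the `flag_record` helper are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical settings: adjust to your own schema and freshness policy.
REQUIRED_FIELDS = ("name", "email")
MAX_AGE = timedelta(days=365)


def flag_record(record, today):
    """Return the bad-data categories that apply to a single record."""
    flags = []

    # Inaccurate: a required field is missing or blank.
    if any(not str(record.get(field) or "").strip() for field in REQUIRED_FIELDS):
        flags.append("inaccurate")

    # Static: the record has no update date or was last updated too long ago.
    last_updated = record.get("last_updated")
    if last_updated is None or today - last_updated > MAX_AGE:
        flags.append("static")

    return flags


if __name__ == "__main__":
    sample = {"name": "Ada", "email": "", "last_updated": date(2017, 1, 1)}
    print(flag_record(sample, today=date(2018, 6, 1)))
```

Checks like these only cover what can be measured mechanically. Categories such as non-compliant or unsecured data depend on policies and access controls that a record-level script cannot see.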
If Data Fuels Your Business Strategy, Bad Data Could Kill It
If data is the gasoline that fuels your business strategy, bad data can be compared to poor-quality oil in a car engine. Frankly, there is no chance you’ll go far or fast if you fill the tank with poor-quality oil. The same logic applies to your organization. With poor data, results can be disastrous and cost millions.
Let's take a look at a recent "bad data" example from the news. A group of vacationers in the United States followed their GPS application to go sight-seeing. Because of bad data, they wound up driving directly into a lake rather than reaching their intended destination. Now, let’s visualize a future where your car is powered by machine learning capabilities. It will be fully autonomous, choosing directions and optimizing routes on its own. If the car drives you into the lake because of poor geo-positioning data, this will end up costing the carmaker quite a bit in repairs and even more in brand reputation. According to Gartner, the cost of poor data quality rose by 50% in 2017, reaching 15 million dollars per year per company. You can imagine this cost will explode in the upcoming years if nothing is done.
Time for a wakeup call.
Results from the 2017 Third Gartner Chief Data Officer (CDO) survey show that the data quality role is again ranked as the top full-time role staffed in the office of the CDO. But the truth is that little has been done to solve the issue. Data quality has always been perceived by organizations as a difficult play. In the past, the general opinion was that achieving better data quality is “too lengthy” and “complicated.” Fortunately, things have changed. Over the last two years, data quality tooling and procedures have improved dramatically. And it’s time for you to take the data bull by the horns.
Let’s take a closer look at a few common data quality misconceptions.
"Data Quality Is Just for Traditional Data Warehouses."
Today, data is coming from everywhere, and data quality tools are evolving. They are now expanding to cover any data, whatever its type, nature, and source. And it's not only data warehouses. It can be on-premises data or cloud data, data coming from traditional systems, and data coming from IoT systems. Faced with data complexity and growing data volume, modern data quality tooling uses machine learning and natural language processing capabilities to ease your work and separate the wheat from the chaff. My advice is to start early. Solving data quality downstream, at the far end of the information chain, is difficult and expensive. It’s 10x cheaper to fix data quality issues at the beginning of the chain than at the end.
"Once You Solve Your Data Quality, You’re Done."
Data management is not a one-time operation. To illustrate, let's look at the example of social networks. The number of social media posts, videos, tweets, and pictures added per day is in excess of several billion entries, and this rate continues to increase at lightning speed. The same is true for business operations. Data is becoming more and more real time, so you need “in-flight data quality.” Data quality is becoming an always-on operation, a continuous and iterative process where you constantly control, validate, and enrich your data, smooth your data flows, and get better insights. You also simplify your work if you link all your data operations together on a single managed data platform.
Let’s take Travis Perkins as an example. Rather than trying to fix inaccuracies in their product data for multi-channel retailers after the fact, they built a data quality firewall into their supplier portals. When suppliers enter their products' characteristics, they have no choice but to enter data that meets Travis Perkins' data quality standards.
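The idea of a data quality firewall, rejecting bad data at the point of entry instead of cleaning it afterwards, can be sketched in a few lines of Python. This is only an illustration of the principle and says nothing about how Travis Perkins actually implemented it: the `sku` and `weight_kg` fields, the rules, and the `validate_product` and `submit` functions are all hypothetical.

```python
def validate_product(entry):
    """Return a list of rule violations for one supplier submission."""
    errors = []

    # Hypothetical rule: every product needs a non-blank SKU.
    sku = str(entry.get("sku") or "").strip()
    if not sku:
        errors.append("sku is required")

    # Hypothetical rule: weight must be a positive number (booleans excluded).
    weight = entry.get("weight_kg")
    is_number = isinstance(weight, (int, float)) and not isinstance(weight, bool)
    if not is_number or weight <= 0:
        errors.append("weight_kg must be a positive number")

    return errors


def submit(entry, catalog):
    """Add the entry to the catalog only if it passes every rule."""
    errors = validate_product(entry)
    if not errors:
        catalog.append(entry)
    return errors


if __name__ == "__main__":
    catalog = []
    print(submit({"sku": "TP-001", "weight_kg": 2.5}, catalog))  # accepted
    print(submit({"sku": "", "weight_kg": -1}, catalog))         # rejected
    print(len(catalog))
```

Returning the full list of violations, instead of stopping at the first one, lets the portal tell the supplier everything that needs fixing in a single round trip.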
"Data Quality Falls Under IT Responsibility."
Gone is the time when data was simply an IT function. As a matter of fact, data is now a major business priority across all lines of business. A security breach, data loss, or data mismanagement may lead your company to bankruptcy. Data is the whole company’s priority, as well as a shared responsibility. No central organization (whether it's IT, compliance, or the office of the CDO) can magically cleanse and qualify all the data. The top-down approach is showing its limits. This is all about accountability. Like the cleanliness of public spaces, it all starts with citizenship.
Let's look at a recent example, the Alteryx leak. A cloud-based data repository containing data from Alteryx, a California-based data analytics firm, was left publicly exposed, revealing massive amounts of sensitive personal information for 123 million American households. This is what happens when you fail to establish a company-wide data governance approach in which data has to pass through data quality and security controls and processes before it can be published widely.
Bad data management has immediate negative business consequences. Today, good data management requires company-wide accountability. Otherwise, it leads to penalties, bad reputation, and negative brand impact.
"It’s Hard to Control Data Quality."
Data management isn't just a matter of control anymore, but a matter of governance. IT should understand that it's better to delegate some data quality operations to business because they’re the data owners. Business users then become data stewards. They feel engaged and play an active role in the whole data management process. It’s only by moving from an authoritative mode to a more collaborative role that you will succeed in your modern data strategy.
"But It’s Still Hard to Make All Data Operations Work Together."
IT and business may have their own separate tools to manage data operations. But having the right tools for relevant roles is not enough. You will still need a control center to manage your data flows. You need a unified data platform where all the data operations are linked and operationalized together. Otherwise, you will risk breaking your data chains and ultimately fail to optimize your data quality.
Building a solid data quality strategy with the right platform is not complicated anymore. However, it still requires all data professionals in your organization to react and establish a clear, transparent and governed data strategy.
Data is your most valuable asset. It’s time to look at all data with a "data quality lens" and combat any existing data myopia in your organization.
To go further into data quality, I recommend taking a look at a recent Gartner report that reflects eight changing trends that shape data quality tooling.
Published at DZone with permission of David Talaga, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.