What Is Data Quality?
If the data you're using isn't of high quality, your insights will be skewed. We discuss a few tactics for ensuring the quality of your data sets.
Data is everywhere. As the volume, sources, and velocity of data creation increase, businesses are grappling with the reality of figuring out what to do with it all and how to do it. And if your business hasn't determined the most effective way to use its own data, then you're missing out on critical opportunities to transform your business and gain a decisive advantage.
Of course, without good data, it's a heck of a lot harder to do what you want to do. Whether you're launching a new product or service, or simply responding to the moves of your biggest competitor, making smart, timely business decisions depends almost entirely on the quality of data you have at hand.
People try to describe data quality using terms like complete, accurate, accessible, and de-duped. And while each of these words describes a specific element of data quality, the larger concept of data quality is really about whether or not that data fulfills the purpose or purposes you want to use it for.
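To make the "fit for purpose" idea concrete, here is a minimal sketch in Python. It assumes records are plain dictionaries; the function names, field names, sample records, and thresholds are all hypothetical and chosen only for illustration. It measures two of the elements mentioned above (completeness and duplication) and shows how the same data set can be good enough for one purpose and not for another.

```python
def is_present(value):
    """Treat None and blank strings as missing; anything else counts as present."""
    return value is not None and str(value).strip() != ""


def completeness(records, required_fields):
    """Share of records that have a value for every required field."""
    if not records:
        return 0.0
    complete = sum(
        1 for r in records if all(is_present(r.get(f)) for f in required_fields)
    )
    return complete / len(records)


def duplicate_rate(records, key_fields):
    """Share of records whose key repeats one seen earlier in the list."""
    if not records:
        return 0.0
    seen = set()
    repeats = 0
    for r in records:
        key = tuple(r.get(f) for f in key_fields)
        if key in seen:
            repeats += 1
        else:
            seen.add(key)
    return repeats / len(records)


def fits_purpose(records, required_fields, key_fields, min_complete, max_duplicates):
    """True when the data meets the thresholds a given purpose demands."""
    return (
        completeness(records, required_fields) >= min_complete
        and duplicate_rate(records, key_fields) <= max_duplicates
    )


# Hypothetical sample data.
records = [
    {"email": "a@example.com", "country": "US"},
    {"email": "a@example.com", "country": "US"},
    {"email": "b@example.com", "country": ""},
    {"email": None, "country": "DE"},
]

# An email campaign only needs an address and tolerates some gaps.
campaign_ok = fits_purpose(records, ["email"], ["email"], 0.7, 0.3)

# A regional report needs email and country, with a stricter completeness bar.
report_ok = fits_purpose(records, ["email", "country"], ["email"], 0.9, 0.3)
```

With these made-up thresholds, the sample data passes for the campaign but fails for the report, even though nothing about the data itself changed.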
Why Data Quality Is So Tough
Nearly 85% of CEOs say they're worried about the quality of data they're using to make decisions. Part of that fear comes from the fact that poor data has proven to cost companies up to 25% of their annual revenue in lost sales, lost productivity, or bad decisions.
Clearly, achieving data quality is still a challenge for many organizations, but the solutions are not as elusive as they seem. Most businesses experience some or all of these problems, each of which directly impacts the quality of their data:
- Isolated data. Otherwise known as "data silos," these separate data groups are either owned by a particular business unit or contained within a particular piece of software. The problem with siloed data is that it's inaccessible to the rest of the organization because the software may not be compatible with anything else or the business unit tightly controls user permissions. And while the data may offer useful, even extraordinarily valuable insight, since it can't be easily accessed, the business can't form a complete picture of it, let alone benefit from it.
- Outdated data. Enterprise structures are large and complex, with multiple teams and departments, so gathering data across the organization is often a slow and laborious process. By the time all the data is collected, some of it, if not most, has already lost relevance, which greatly reduces its value to the organization.
- Complex data. Data comes from many different sources and in many different forms. Data is generated from smartphones, laptops, websites, customer service interactions, sales and marketing, databases, and more. And it can be structured or unstructured. Making sense of the volume and variety of data coming in and standardizing it for everyone to use is a resource-intensive process many organizations don't have the bandwidth or expertise to keep up with.
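The outdated-data problem is one of the easier ones to check for mechanically. The sketch below assumes each record carries a timezone-aware `updated_at` timestamp; that field name, the `split_by_freshness` helper, the sample records, and the seven-day cutoff are all hypothetical. How old is "too old" depends entirely on what the data is used for.

```python
from datetime import datetime, timedelta, timezone


def split_by_freshness(records, now, max_age):
    """Separate records into (fresh, stale) based on their updated_at timestamp."""
    fresh, stale = [], []
    for record in records:
        age = now - record["updated_at"]
        if age <= max_age:
            fresh.append(record)
        else:
            stale.append(record)
    return fresh, stale


# Hypothetical sample data with a fixed "now" so the result is repeatable.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
records = [
    {"id": 1, "updated_at": datetime(2024, 1, 30, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
]

fresh, stale = split_by_freshness(records, now, max_age=timedelta(days=7))
```

Passing `now` in as an argument, rather than reading the clock inside the function, keeps the check repeatable and easy to test.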
How to Achieve Quality Data
Like any worthwhile business endeavor, improving the quality and utility of your data is a multi-step, multi-method process. Here are four common methods, each with its own trade-offs:
- Method 1: Big data scripting takes a large volume of data and uses a scripting language that can communicate and combine with other existing languages to clean and process the data for analysis. While engineers appreciate the agility of scripting, it does require a significant understanding of the types of data that need to be synthesized and the specific contexts in which the data exists to know which scripting language to use. Errors in judgment and execution can trip up the whole process.
- Method 2: Traditional ETL (extract, transform, load) tools integrate data from various sources, transform it, and load it into a data warehouse for analysis. But this usually requires a team of skilled, in-house data scientists to manually scrub the data first in order to address any incompatibilities in schemas and formats between the sources and the destination. Even less convenient, these tools often process in batches instead of in real time. Traditional ETL requires the type of infrastructure, on-site expertise, and time commitment that few organizations want to invest in.
- Method 3: Open source tools offer data quality services like de-duping, standardizing, enrichment, and real-time cleansing along with quick signup and a lower cost than other solutions. However, most open source tools still require some level of customization before any real benefit is realized. Support may be limited for getting the services up and running, which means organizations once again have to fall back on their existing IT team to make it work.
- Method 4: Modern data integration removes the manual work of traditional ETL tools by automatically integrating, cleaning, and transforming data before storing it in a data warehouse or data lake. The organization defines the data types and destinations and can enrich the data stream as needed with, for example, updated customer details, IP geolocation data, or other information. The transformation process standardizes data from all sources and in all formats to make it usable to anyone in the organization. And because it processes data in real time, users can check the data stream and correct any errors as they're happening.
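Whichever method you choose, the core cleaning steps look similar: map each source's schema onto a shared one, standardize the values, and remove duplicates. The following is a minimal scripted sketch of those steps, not how any particular tool implements them. The source names (`crm`, `shop`), field maps, and sample records are hypothetical.

```python
def standardize(record, field_map):
    """Rename source-specific fields to the shared schema and tidy string values."""
    out = {}
    for source_field, target_field in field_map.items():
        value = record.get(source_field)
        if isinstance(value, str):
            value = value.strip()
        out[target_field] = value
    # Emails are compared case-insensitively here, so store them lowercased.
    if isinstance(out.get("email"), str):
        out["email"] = out["email"].lower()
    return out


def dedupe(records, key):
    """Keep the first record for each key value; keep records that lack the key."""
    seen = set()
    unique = []
    for record in records:
        value = record.get(key)
        if value is not None:
            if value in seen:
                continue
            seen.add(value)
        unique.append(record)
    return unique


# Hypothetical sources that describe the same customers with different schemas.
CRM_MAP = {"Email": "email", "FullName": "name"}
SHOP_MAP = {"email_address": "email", "customer": "name"}

crm = [{"Email": " Ada@Example.com ", "FullName": "Ada Lovelace"}]
shop = [
    {"email_address": "ada@example.com", "customer": "Ada Lovelace"},
    {"email_address": "grace@example.com", "customer": "Grace Hopper"},
]

combined = [standardize(r, CRM_MAP) for r in crm] + [
    standardize(r, SHOP_MAP) for r in shop
]
clean = dedupe(combined, key="email")
```

Standardizing before de-duping matters: the two Ada records only match once whitespace and letter case have been normalized. Real pipelines usually need fuzzier matching than exact key equality, which is part of why the methods above differ so much in effort.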
Published at DZone with permission of Garrett Alley, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.