Understanding Data Quality
Data has always been at the heart of organizations. It is the keystone for running day-to-day business smoothly and for implementing new strategies. The ability to analyze data and make data-driven decisions is becoming increasingly important.
Individuals also benefit greatly from data. Whether investing in stocks or finding a suitable house to buy, data provides a wealth of information on which to base decisions. Data is the foundation of decision-making: it provides information, yields insights, and supports the predictions required for effective decisions. Data is collected from multiple sources. For example:
- Internal databases: These constitute an organization’s most relevant and reliable data source. They are usually in a structured format and commonly record data from various internal applications like ERP (Enterprise Resource Planning), CRM (Customer Relationship Management), and HCM (Human Capital Management).
- Flat files: Flat files are one of the most commonly used data sources for an organization. They arise from sources external to an organization, or when there is no proper mechanism to integrate various internal data sources. For example, a vendor can send periodic flat files that are uploaded to an organization’s internal databases. Also, where there is no integration between two or more applications in an organization, flat files serve as a medium for exchanging data. Most of the time, the data in a flat file is considered unreliable, and several checks are performed to verify and validate it.
- Web services and APIs: Web services are a highly preferred medium for communication and data exchange between different applications. They provide a standardized way to communicate and exchange data. They are reliable, and data validation can be embedded easily.
- Other sources: Data from social media, blog posts, audio, and video is gradually becoming a vital source of information that needs to be stored and analyzed.
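Of the sources above, flat files typically need the most verification before being loaded. The following is a minimal sketch of row-level checks on a hypothetical vendor feed; the field names and validation rules are illustrative assumptions, not a real vendor format:

```python
import csv
import io

# Hypothetical vendor feed: required fields are an assumption for illustration.
REQUIRED_FIELDS = ["customer_id", "order_date", "amount"]

def validate_row(row):
    """Return a list of validation errors for one flat-file row."""
    errors = []
    for field in REQUIRED_FIELDS:
        if not row.get(field, "").strip():
            errors.append(f"missing {field}")
    try:
        if float(row.get("amount", "0")) < 0:
            errors.append("negative amount")
    except ValueError:
        errors.append("non-numeric amount")
    return errors

# Simulated file contents: one clean row, one row with two problems.
feed = io.StringIO(
    "customer_id,order_date,amount\n"
    "C001,2023-05-01,19.99\n"
    ",2023-05-02,abc\n"
)
for row in csv.DictReader(feed):
    print(row["customer_id"] or "<blank>", validate_row(row))
```

Rows that fail such checks would typically be quarantined for review rather than loaded into the internal database.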
However, not all data is useful or serves a given need. For instance, say I am looking to buy a house, but the data I find shows historical purchase trends for an area other than the one I am considering. The data does not fit my need, and no amount of it will serve the purpose unless the information is relevant.
Data that is fit for its intended use is termed useful data. Bad data inhibits analysis, and finding a reliable dataset straight away is difficult; good data has to be crafted and nurtured. In this Refcard, we will discuss various techniques to manage, monitor, and improve data quality in an organization. Some of these techniques can also be useful for individuals who rely on data for their activities.
High-quality data has the following properties:
- Fit for use: correct and complete
- A proper representation of the real-world scenario to which it refers
- Usable, consistent, and accessible
Data quality can be measured based on the following dimensions:
- Completeness: Is there any missing or unusable data?
- Conformity: Does the data conform to a standard format?
- Consistency: Are the data values providing consistent information or giving conflicting information?
- Accuracy: Is the data correct, or is it out of date?
- Duplicates: Are the data records or attributes repeated where they should not be repeated?
- Integrity: Is the data referenceable, or are there missing constraints?
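Several of these dimensions can be turned into simple, measurable scores. Below is a minimal sketch that computes completeness, conformity, and duplicate counts over a handful of hypothetical records; the field names and the expected date pattern are assumptions for illustration:

```python
import re

# Illustrative records: one missing email, one nonconforming date, one duplicate.
records = [
    {"id": 1, "email": "a@example.com", "signup": "2023-01-15"},
    {"id": 2, "email": "", "signup": "2023-02-10"},
    {"id": 3, "email": "c@example.com", "signup": "15/03/2023"},
    {"id": 1, "email": "a@example.com", "signup": "2023-01-15"},
]

DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    return sum(1 for r in rows if r.get(field)) / len(rows)

def conformity(rows, field, pattern):
    """Share of rows whose field matches the expected format."""
    return sum(1 for r in rows if pattern.match(r.get(field, ""))) / len(rows)

def duplicate_count(rows):
    """Number of fully repeated records."""
    seen, dupes = set(), 0
    for r in rows:
        key = tuple(sorted(r.items()))
        if key in seen:
            dupes += 1
        seen.add(key)
    return dupes

print(completeness(records, "email"))               # 0.75
print(conformity(records, "signup", DATE_PATTERN))  # 0.75
print(duplicate_count(records))                     # 1
```

Note that a format check like the date pattern above measures conformity only; a value can conform to the pattern and still be inaccurate.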
Two main characteristics define data quality:
1. Data Usability
Usability means the data can contribute relevant content for a particular task. For example, data on customer age or location might contribute well to a customer retention program in the consumer packaged goods industry, whereas data on the weather or soil quality at customer locations would not be usable for that retention activity. The same weather or soil quality data, however, might be useful for targeting customers in the floral industry. Data usability therefore correlates with its ability to drive action or insight for particular tasks, and the data needs to be an accurate representation relevant to the work. When similar data is present in multiple locations, such as different databases and data warehouses, it needs to be synchronized so that all copies represent the data the same way.
2. Data Quantity
Data quantity is the amount of data required for an analysis. Estimating and assessing data quantity at the beginning of a data quality initiative is crucial to the success of the program. How much data do we need? How many observations are there? What are the drawbacks of having too little data? These questions help us decide which tools and techniques are required to drive the data quality initiative.
Manually inspecting the data to ensure it is fit for use is the most reliable way to ensure data quality, but this is only feasible when the data quantity is small. With the volumes of data organizations handle today, we cannot rely solely on a manual process. To eliminate human error and reduce data inaccuracies, we have to depend on various technologies and techniques, following a data quality strategy to ensure the data is of high quality. The different phases that provide the ability to manage, monitor, and improve data quality are given below:
- Parsing and standardization: A process to extract pieces from data to validate whether it follows a specific pattern. If it doesn’t fit the pattern, the data is reformatted to provide consistent values.
- Generalized cleansing: A process to remove errors and inconsistencies in data.
- Matching: A process to compare, identify, or merge related entities across two or more sets of data.
- Profiling: A process to analyze the content of a dataset for validating the accuracy, consistency, and uniqueness of data.
- Monitoring: A process to continuously assess and evaluate the data to ensure it remains fit for purpose.
- Enrichment: A process to enhance data quality by using data from various internal and external sources.
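To make the parsing/standardization and matching phases concrete, here is a minimal sketch that standardizes free-form phone numbers and then matches records from two hypothetical systems on the standardized value. The record shapes and the US-style ten-digit rule are assumptions for illustration:

```python
import re

def standardize_phone(raw):
    """Parse the digits out of a free-form phone string and emit a
    consistent NNN-NNN-NNNN format; return None if it doesn't fit."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop a leading country code
    if len(digits) != 10:
        return None  # doesn't fit the pattern; flag for cleansing
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

def match_by_phone(left, right):
    """Pair records from two datasets on the standardized phone number."""
    index = {standardize_phone(r["phone"]): r for r in right}
    return [(l, index.get(standardize_phone(l["phone"]))) for l in left]

# Hypothetical records from two systems that format phones differently.
crm = [{"name": "Ada Lovelace", "phone": "(555) 010-2222"}]
erp = [{"vendor_ref": "V-17", "phone": "1-555-010-2222"}]

pairs = match_by_phone(crm, erp)
print(pairs[0][1]["vendor_ref"])  # both phones standardize to 555-010-2222
```

Standardizing first is what makes the match possible: the raw strings differ, but the normalized values are identical. Values that return None would be routed to the cleansing step instead.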