Data Integrity: A Data-Driven Organization's Biggest Concern
Although many organizations have implemented a system for data collection and analysis, it's crucial that they can also maintain the integrity of their data.
In the InsideView Alignment Report 2020, more than 70 percent of revenue leaders ranked data management as their highest priority. Although many organizations have implemented systems for data collection and analysis, their biggest concern now is maintaining the integrity of their data.
The term "data integrity" is used to describe either a process or a state of data. Either way, it refers to data being accurate, valid, and consistent across all data sources.
In layman's terms, data integrity means data that your team can trust, knows is protected, and can use for any purpose without worrying about its quality.
These aspects are extremely important, especially for data analysts who bring together data from multiple sources to derive useful insights, for example to enrich customer data and retain customers.
Types of Data Integrity
Data integrity has various aspects, but at a high level, it can be divided into two types: physical and logical. Both of these types define a number of methods and constraints that enforce integrity in datasets.
Physical Data Integrity
Physical data integrity relates to protecting data against external or physical threats, such as power outages, natural disasters, and hackers. These problems can make it impossible for users to access data in the database, and they are usually triggered by human error, storage reductions, security breaches, malware, and similar causes.
Logical Data Integrity
Logical data integrity relates to how the data is stored and modeled within the database, as well as the logical constraints implemented to keep the data accurate, valid, and consistent across multiple sources.
Logical data integrity is further divided into four types:
1. Entity Integrity
Entity integrity means uniquely identifying each entity in your database. This helps to avoid duplicate records since every new record must have a unique identifier. These identifiers — also called primary keys in relational databases — cannot be null and are usually referenced in other datasets to prevent data duplication. For example, in a customer database, SSN can be used as the unique identifier that ensures the entity integrity of the dataset.
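The primary-key rule described above can be sketched with Python's built-in sqlite3 module. The table name, the column names, and the customer_id identifier are hypothetical, chosen only for illustration.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        customer_id TEXT PRIMARY KEY NOT NULL,  -- unique, non-null identifier
        full_name   TEXT NOT NULL
    )
    """
)
conn.execute("INSERT INTO customers VALUES (?, ?)", ("C-001", "Ada Lovelace"))


def try_insert(customer_id, full_name):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(
            "INSERT INTO customers VALUES (?, ?)", (customer_id, full_name)
        )
        return True
    except sqlite3.IntegrityError:
        return False


duplicate_accepted = try_insert("C-001", "A. Lovelace")  # key already exists
null_accepted = try_insert(None, "Unknown Person")       # key is missing
new_accepted = try_insert("C-002", "Grace Hopper")       # fresh key
row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

Only the row with a fresh, non-null key is stored; the duplicate and the null key are both rejected by the database itself rather than by application code.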
In the absence of uniquely identifying attributes, complex data matching and fuzzy matching algorithms are required to match data accurately and find out which records belong to the same entity.
2. Referential Integrity
Referential integrity refers to the use of foreign keys in a relational database. A foreign key refers to an existing entity in another table. Relating records in this way avoids creating duplicate entries, because information is reused from an existing table instead of being stored again. For example, an employee database can hold employee information in one table and job role information in another, and a foreign key in the employee table references the relevant job role.
An important thing to note here is that the primary key values in a table must be unique and non-null, but multiple records can share the same foreign key value (as multiple people can have the same job role in the organization).
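The employee and job role example can be sketched as follows, again with sqlite3. The table and column names are hypothetical, and the PRAGMA line is specific to SQLite, which does not enforce foreign keys unless asked to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE job_roles (role_id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
)
conn.execute(
    """
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        role_id     INTEGER NOT NULL REFERENCES job_roles (role_id)
    )
    """
)
conn.execute("INSERT INTO job_roles VALUES (1, 'Data Analyst')")

# Two different employees may share the same foreign key value.
conn.execute("INSERT INTO employees VALUES (1, 'Ada', 1)")
conn.execute("INSERT INTO employees VALUES (2, 'Grace', 1)")

# A row pointing at a job role that does not exist is rejected.
try:
    conn.execute("INSERT INTO employees VALUES (3, 'Alan', 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

analyst_count = conn.execute(
    "SELECT COUNT(*) FROM employees WHERE role_id = 1"
).fetchone()[0]
```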
3. Domain Integrity
Domain integrity means that correct (domain-specific) values have been used in each column of the database. For example, in an employee database that stores address information, the column Country can have a list of possible values. Any value that does not fall in that list is incorrect and must be corrected, which can be done through address scrubbing.
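A minimal sketch of such a domain check is shown below. The allowed list and the sample rows are made up for illustration; a real list would come from a reference dataset.

```python
# Illustrative domain for the Country column (an assumption, not a full list).
ALLOWED_COUNTRIES = {"United States", "Canada", "United Kingdom"}


def find_invalid_countries(rows):
    """Return (row index, value) pairs whose country is outside the domain."""
    return [
        (index, row["country"])
        for index, row in enumerate(rows)
        if row["country"] not in ALLOWED_COUNTRIES
    ]


employees = [
    {"name": "Ada", "country": "United Kingdom"},
    {"name": "Grace", "country": "USA"},  # abbreviation not in the domain
    {"name": "Alan", "country": "Canada"},
]
invalid = find_invalid_countries(employees)
```

Here the check flags the second row, whose value "USA" would need to be scrubbed to the listed form.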
4. User-Defined Integrity
When users define their own custom rules or constraints on a column, it is termed user-defined integrity. For example, if a user defines that the lead source in a prospects database can be Google Adwords, Website, or Cold Call, then any value outside of these three is invalid.
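One way to express a rule like this is a CHECK constraint, sketched here with sqlite3. The prospects table and its columns are hypothetical; the three allowed values come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE prospects (
        prospect_id INTEGER PRIMARY KEY,
        lead_source TEXT NOT NULL
            CHECK (lead_source IN ('Google Adwords', 'Website', 'Cold Call'))
    )
    """
)


def lead_source_accepted(value):
    """Return True if the database accepts the value, False otherwise."""
    try:
        conn.execute("INSERT INTO prospects (lead_source) VALUES (?)", (value,))
        return True
    except sqlite3.IntegrityError:
        return False


website_ok = lead_source_accepted("Website")
referral_ok = lead_source_accepted("Referral")  # not one of the three values
```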
Threats to Data Integrity
The terms "data integrity" and "data security" are often used interchangeably, but they are not the same. Data security is one means of achieving data integrity, but it is not the only one, because maintaining data integrity is a complex task and a number of other things threaten it. A few of these threats are highlighted below:
Human Error
About 400 in every 10,000 entries (4 percent) in a database are inaccurate due to human error. This is a significant number: unique identifiers, integrity constraints, and other validation checks can all be overridden by human mistakes.
Inconsistencies in Data Formats
Without proper data formats and types defined, data values within the same column end up stored in different patterns and formats, which leads to inconsistencies in the database. To prevent such inconsistencies, it is important to define validation patterns and the correct data types.
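A validation pattern can be as simple as a regular expression applied to every value in a column. The sketch below assumes, purely for illustration, that a date column is supposed to follow the YYYY-MM-DD pattern.

```python
import re

# Assumed validation pattern: dates written as YYYY-MM-DD.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def nonconforming(values, pattern=ISO_DATE):
    """Return the values that do not fully match the column's pattern."""
    return [value for value in values if not pattern.fullmatch(value)]


signup_dates = ["2021-03-04", "03/04/2021", "4 March 2021", "2021-11-30"]
bad_values = nonconforming(signup_dates)
```

This only checks the shape of each value. A string such as 2021-99-99 would pass the pattern, so a separate validity check is still needed.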
Integration Error
Data integrity is often compromised while integrating data from multiple sources. The reason is the difference in data structures, validation checks, and integrity constraints across the sources. One data source may store Phone Number as a char data type with a maximum length of 15, while another stores it as a number data type with a maximum length of 13.
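The phone number mismatch can be handled by converting both representations to one target form before merging. The function below is a hypothetical helper, and the choice of a digits-only string capped at 15 characters is an assumption based on the example above.

```python
import string


def normalize_phone(value, max_digits=15):
    """Convert a text or numeric phone value to a digits-only string."""
    digits = "".join(ch for ch in str(value) if ch in string.digits)
    if not digits or len(digits) > max_digits:
        raise ValueError(f"cannot normalize phone value: {value!r}")
    return digits


from_text_source = normalize_phone("+1 (555) 010-9999")  # stored as char
from_numeric_source = normalize_phone(15550109999)       # stored as number
same_number = from_text_source == from_numeric_source
```

Note that a numeric column cannot preserve leading zeros or a leading plus sign, which is one reason text is often the safer target type for phone numbers.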
Internal Privacy Breaches
This usually happens when your data lands in the wrong hands — either an employee who is misusing the company’s data repository, or hackers trying to break through your firewall to get to the data. In any case, securing the data from such privacy breaches is an important task.
Signs of Data Integrity
To understand whether your data has integrity, you need to look for the following signs:
Accessibility
Is your data stored in the right place and accessible whenever it is needed? If there is no proper or easy access to your data, it may be at risk of losing its integrity. Fast, optimized retrieval of data from the database is a key sign that the data's integrity is being maintained.
Validity
Do the values of a column in your dataset have the same data type and format? Validity is easily assessed by counting how many values in your database fail the appropriate validation checks. For example, consider a record with the creation date "21.21.21." The day and year 21 make sense, but the month is invalid.
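The creation date example can be checked with the standard library. This sketch assumes the column uses a day.month.year layout with two-digit years; the function name is hypothetical.

```python
from datetime import datetime


def is_valid_date(text, fmt="%d.%m.%y"):
    """Return True if text parses as a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


creation_dates = ["21.12.21", "21.21.21", "30.02.21"]
# Keep only the values that fail the check.
invalid_dates = [d for d in creation_dates if not is_valid_date(d)]
```

Besides the impossible month, this also catches dates with the right shape but no calendar meaning, such as 30 February.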
Completeness
Does your database contain a lot of null values? If your dataset is missing certain values, it is better to use a generic non-null term (such as Not provided or N/A) than to leave the column values empty. This helps you understand whether values are missing or incomplete, or were deliberately left empty.
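A rough completeness check counts the empty values per column and then substitutes the placeholder. The sample rows and helper names below are illustrative assumptions.

```python
def is_missing(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def null_counts(rows, columns):
    """Count missing values for each named column."""
    return {
        col: sum(1 for row in rows if is_missing(row.get(col)))
        for col in columns
    }


def fill_missing(rows, placeholder="Not provided"):
    """Return copies of the rows with missing values replaced."""
    return [
        {key: (placeholder if is_missing(val) else val) for key, val in row.items()}
        for row in rows
    ]


rows = [
    {"name": "Ada", "email": "ada@example.com", "phone": None},
    {"name": "Grace", "email": "", "phone": None},
    {"name": "Alan", "email": "alan@example.com", "phone": "555-0100"},
]
counts = null_counts(rows, ["name", "email", "phone"])
filled = fill_missing(rows)
```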
Uniqueness
Do the records in your dataset uniquely identify entities? You can assess this by checking that every record in the database represents a unique identity and that no single entity's information spans multiple records. If your dataset contains duplicates, you will need data matching algorithms to identify which records belong to the same entity. If the duplicates are not exact, you may need a combination of fuzzy matching algorithms to compute match confidence levels before making a decision.
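As a small illustration of fuzzy matching, the sketch below scores name pairs with difflib.SequenceMatcher from the standard library. This is only one of many possible similarity measures, and the 0.85 threshold is an arbitrary assumption that would need tuning on real data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Similarity between two strings, from 0.0 to 1.0, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_likely_duplicates(names, threshold=0.85):
    """Return (name_a, name_b, score) for pairs at or above the threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = similarity(a, b)
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


customers = ["Jon Smith", "John Smith", "Maria Garcia"]
likely_duplicates = find_likely_duplicates(customers)
```

The score acts as the match confidence level: pairs above the threshold are candidates for review or merging, not proven duplicates.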
Scenarios Where Data Integrity Becomes Crucial
Now that we have covered the basics of data integrity, let’s discuss the real-world scenarios where data integrity plays a significant role.
Masking Personally Identifiable Information
A common practice for hiding personally identifiable information is to mask the actual data with dummy data. This process is used extensively in healthcare and government institutions to protect individual privacy. If data integrity is not maintained across the dataset, it can be very difficult to recover the actual data from the dummy data, since the original data was inaccurate to begin with.
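The article does not name a specific masking technique. One possible approach, shown here only as a sketch, is a token vault: each real value is replaced with a random token, and a protected lookup table allows the original to be recovered. The TokenVault class and the tok_ prefix are hypothetical.

```python
import secrets


class TokenVault:
    """Replace values with random tokens and allow authorized recovery."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def mask(self, value):
        # Reuse the existing token so the same value always masks the same way.
        if value not in self._token_by_value:
            token = "tok_" + secrets.token_hex(8)
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def unmask(self, token):
        return self._value_by_token[token]


vault = TokenVault()
masked = vault.mask("123-45-6789")
masked_again = vault.mask("123-45-6789")
recovered = vault.unmask(masked)
```

Because the same input always maps to the same token, inconsistent source values (for example, one identifier typed two different ways) produce two unrelated tokens, which is how poor integrity in the original data carries over into the masked data.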
Ensuring Compliance with Data Standards
Compliance standards such as HIPAA and GDPR state the importance of data integrity. For example, GDPR Article 5(1)(d) states that personal data should be:
“Accurate and, where necessary, kept up to date; every reasonable step must be taken to ensure that personal data that are inaccurate, having regard to the purposes for which they are processed, are erased or rectified without delay.”
This shows how crucial it is to maintain data integrity in your database in order to comply with the necessary standards.
Driving Business Intelligence
Reliable data insights are the biggest benefit of capturing data and maintaining it in a clean, standardized manner. Data analysts spend 80 percent of their time managing data and correcting data mistakes, and only 20 percent actually analyzing it and deriving insights. Organizations increasingly employ systems that process their data and produce descriptive insights, but they still find it difficult to trust those insights. The reason goes back to how the data is captured, structured, and related across tables in the database.
Conclusion: Data Integrity Produces Reliable Insights
In this article, we covered basic and advanced aspects of data integrity and looked at a few scenarios where it becomes crucial. Although sustaining the integrity of your data may seem like a resource- and time-intensive initiative, it saves you time in the long run, because your data-driven insights become more reliable, accurate, and actionable.
Published at DZone with permission of Zara Ziad. See the original article here.