Masking Data in Practice — Part 1
Read this article in order to learn about data masking, why to use it, what needs to be masked, and where masking takes place.
Join the DZone community and get the full member experience.Join For Free
Even small extracts of data need to be created with caution if they are for public consumption. Sensitive data can 'hide' in unexpected places, and apparently innocuous data can be combined with other information to expose information about identifiable individuals. If we need to deliver an entire database in obfuscated form, the problems can get harder. Phil Factor examines some of the basic data masking techniques and the challenges inherent in masking certain types of sensitive and personal data while ensuring it still looks like the real data and preserving its referential integrity and distribution characteristics.
This article describes the practicalities of data masking, the various methods we can use, and the potential pitfalls. In subsequent articles, I'll demonstrate how we can mask or sanitize different types of data using tools such as SQL Clone, Data Masker for SQL Server, and SQL Data Generator.
Basics of Data Masking
Data masking, sometimes called data sanitization or data protection, is a term for the technology and processes that are used to anonymize or pseudonymize personal, private, or sensitive data. Anonymization refers to the process of entirely removing all real data values while maintaining the overall verisimilitude of the data so that it looks real. Pseudonymization attempts to maintain a part of the data while anonymizing the rest of the attributes because they are sensitive or irrelevant.
Why Mask Data?
The custodian of data has a difficult balancing act between conforming with privacy laws, discretion, and the obligations of confidentiality, on one hand, and the need to make certain parts of that data available for development work, training, testing, medical research, open government, security, and the law. It is no longer legally possible to do this if the data contains information that must be protected. The data professional is faced with the practical task of providing data, legally, in a way that cannot possibly reveal, by inference attacks, private information about individuals, or confidential business information.
There wasn't much call for data masking until recently. If you were required to provide a data extract for research, legal evidence, public information, or security, it was sufficient to create a view, or set of views, that included only the required information. Sensitive information, and information beyond the remit of the information request, simply wasn't included. It was this view that was exported. One of the main drivers for masking has been the increasing demand from developers to use all the data from the live database of the application to develop with as part of a DevOps process.
What Can Go Wrong?
It is relatively simple to pseudonymize the contents of a single table, or even a handful of tables so that the real data cannot be gleaned by casual inspection. It is more difficult to achieve a robust pseudonymization that is proof against decoding by a skilled person.
It is a frighteningly complex task to anonymize or pseudonymize the data of an entire database in order to do development or testing, and there have been many cases where the personal information about individuals has been extracted from public, 'pseudonymized' data. You also need to mask data that, by itself, seems innocuous, but when used with other publicly-available data, can identify individuals.
It is relatively easy to entirely anonymize data, but this doesn't meet many requirements. To be useful for development or test work, for example, the masked data must have the same distribution and characteristics as the original data. For medical, epidemiological research, it is often vital that the medical information and locations are correct. For example, if the long-term effects of a nuclear reactor accident are being investigated, the locations cannot be masked. We cannot tamper with information that is required for court proceedings. Data for staff training must be as realistic to casual inspection as possible. Generally, data must be pseudonymized, and partially anonymized, to fit the purpose for which the data is required.
Databases aren't designed to be able to erase data without a trace. Quite the contrary; databases carefully log every change. They also are happy to leave old data on the data pages until the space is required by fresh data. If you are intent on producing an anonymized version of a database, the safest approach is to entirely re-create the database by the build and bulk insert or to use Common Criteria Evaluation Assurance Level 4+ (EAL4+) with SQL Server Enterprise Edition.
What Sort of Data Needs to Be Masked?
Although personally-identifiable data is currently the focus of legislation, there are plenty of other types of data that need to be masked before it can be made public, such as business and financial information or information that can aid the carrying out of a criminal offense. The organization that processes or stores the data must determine which data elements need to be masked and how based on the legislative framework within which it operates.
Personally-identifiable information (PII) is common to most data masking requirements. PII is any data that can be used to identify a living person and includes such elements as name, date of birth, National Identification Number, address details, phone numbers or email addresses, disabilities, gender identity or sexual orientation, court orders, electronic wage slips, union affiliations, biometric and 'distinguishing feature' information, and references to the serial number of devices such as laptops that are associated with, or assigned to, a person.
However, though a single data set can be successfully anonymized, the more data sets that are combined, the easier it is to identify individuals. An example of this is data that identifies where an individual was at a given time. This can be a hotel or travel booking, internet access record, meeting appointment, or GPS data. By combining two or more of these datasets, we can often identify a record as belonging to an individual.
To protect data, we need to know in which columns, of which tables, and in which databases it is held. Data masking is relatively easy when each item of data is held in just one column, but real life isn't like that. If the data has been denormalized to speed up reporting, then we must find every table to which the data we want to mask has been added.
Where Does Masking Take Place?
Data Masking software either does its obfuscation in-situ, in passage, or uses the 'On-The-Fly' approach, sometimes known as the 'Server-To-Server' technique.
On-The-Fly Data Masking
In the 'On-The-Fly' technique, the obfuscation is done as part of copying or cloning the source database. The pseudonymization rules are applied as part of the process of moving the data from the source database to the target. This has the advantage that the data is never present in an un-masked form in the target database, so there is no danger of the original data being recoverable from the target database's transaction log. If the changes were audited, that part of the database that does the audit can and should be withheld.
Although there are great security advantages to the obfuscation being performed on the original, security-fenced server, the process becomes 'all-or-nothing' and must be repeated if it proves to be inadequate. Dynamic data masking can sometimes be used to copy versions of the data via a login for which dynamic masking has been applied, but this type of masking does not produce realistic data and can easily fall foul of constraint checks when the data is imported into the target.
In-Passage Data Masking
In-passage data masking takes place after the data is exported from the source and before it is imported. This usually works on the securely-held text files of the contents in comma-separated value (CSV) format, XML, JSON, or another transfer format. The process can involve Regex at its simplest but is more usually read into some form of a document database.
This avoids the problems of artifacts of the data appearing in logs and audit trails. It also side-steps the interference of constraints, rules, and triggers. Unfortunately, this will only be a postponement of the consistency checks on the masked data because the masked data will be checked on import and will need to comply with database rules once these checks are re-enabled.
In-situ data-masking takes place after the database is copied so that the software simply operates on the copied database. The masking rules are executed and controlled by software that then connects to the target and controls the execution of the masking rules. Although this allows incremental application of masking rules over time, the original data will remain in the logs and on the database pages.
There are several methods for masking data depending on the type of data being masked, and on the masking requirements.
Data in one or more columns of a table can be randomly substituted with values from an appropriate list, or by using a data generator that can supply credible values that pass check constraints. Credit card numbers can, for example, be substituted with numbers that obey the validation rules of the provider. This preserves the look and feel of the existing data but requires a library of credible values. The disadvantage of this is that it is bound to alter the way that the data is distributed unless the data generation technique can check the original distribution and conform to it.
Data in one or more columns can be shuffled randomly against the key fields, much like a deck of cards. This might require certain rows to be respected or some other condition to be applied. Although this technique gets over the problem of data distribution, the obvious problem is that the individuals with unusual surnames such as Jack O'Diamonds are still there. Shuffling would not have cheered the victims of Ashley Madison if they had unusual names. Also, unless the shuffling is truly random, card-sharp techniques can be employed to unshuffle the pack.
Search and Replace Masking
This is used to modify text data by means of rules and regular expressions. It is often used to remove names or identifiers from text, substituting random length '###' strings.
Each number or date value in a column can be varied by some random percentage, whilst maintaining the original variance, range, and distribution. This can useful where it would otherwise be possible to identify individuals by an exact match, such as date of birth, but is otherwise more of a blurring than an obfuscation. It will, for example, give an approximate value for salary information.
A technique commonly used by dynamic data masking within the RDBMS. Certain fields are replaced with a mask character (such as an '###'). This removes the actual content while preserving the same formatting. For example, dates such as these:
If accessed by a login for whom masking is appropriate, it would appear as:
The masking characters effectively remove much of the sensitive content from the record while still preserving as much as possible of the look and feel. However, dynamic data masking can be defeated if the user has direct access to the database to execute queries because the masking rules are held at table-level, and the masking software cannot track the data from the original table. However, it is useful where users must be trained on the live system, and can interact with the data only via an application.
Stay tuned for part two of "Masking Data in Practice."
Published at DZone with permission of Phil Factor. See the original article here.
Opinions expressed by DZone contributors are their own.