
Masking Data in Practice — Part 2


Read part two of "Masking Data in Practice" to learn about difficulties with masking, such as denormalization and row-level data consistency.



This is part two of "Masking Data in Practice." Part one can be found here.

Practical Difficulties With Masking

All this might sound straightforward to achieve, but in practice data masking is as complicated as the database being obfuscated, and the details also depend on the purpose for which the masked data is needed.


If a database has been denormalized, the sensitive data will be stored in several tables and won't always be in an obvious place. The name of a customer, for example, will appear against addresses, phone numbers, invoice headers, correspondence, JSON documents, XML fragments, references, logs, transcriptions of conversations, and so on. Masking even a simple customer name can therefore be close to impossible. Even a well-normalized database can accidentally reveal personal information if it stores XML or free-text fields.

In some cases, before a value can be changed, several arcane rules must be consulted that specify what else needs to be altered, and where.

Row-Level Data Consistency

Column data is very rarely unrelated. An obvious example is that a male customer is unlikely to have the title of 'Lady' or 'Mrs' or to be called 'Joyce'. The last names of customers will vary greatly according to where they live. Company names vary considerably with their line of business: The Red Lion, White Hart, Eaton Burger Bar, or Wheatsheaf are unlikely to be banks, for example. Context is important.

A single column cannot even be shuffled without harming the verisimilitude of the data. To ensure that data is correct within its context, data-masking software must relate columns both internally, within a table, and externally, from table to table. Successful masking requires that when you change a value, you can modify the matching columns in other rows of the same table, and in other tables, so that they contain the same new value. You sometimes also need to substitute different values according to a value in another column.
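The requirement that every occurrence of a given value receives the same substitute can be met with a deterministic mapping. Below is a minimal sketch in Python, assuming a hash-based lookup into a substitute list; the surname pool, seed, and function name are invented for illustration:

```python
import hashlib

# Pool of replacement surnames; in practice this would be a large lookup list.
SURNAMES = ["Archer", "Bennett", "Castillo", "Dvorak", "Ekwueme", "Fontaine"]

def mask_surname(original: str, seed: str = "per-run-secret") -> str:
    """Map each original surname to the same substitute every time it
    appears, so related rows in different tables stay consistent."""
    digest = hashlib.sha256((seed + original).encode()).hexdigest()
    return SURNAMES[int(digest, 16) % len(SURNAMES)]

# The same input always yields the same masked value, in any table.
assert mask_surname("Smith") == mask_surname("Smith")
```

Keeping the seed secret for each masking run matters: without it, an attacker could hash a dictionary of common names and reverse the mapping.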


Databases have many constraints, rules, and triggers that are there to ensure that data is consistent and reliable. In other words, they are there to restrict the very activity you are attempting: the direct alteration of data in the database tables.

  • A CHECK constraint can do basic checks on the value in a column but can also ensure that there is consistency at the table level. Altering one or more values can violate these CHECK constraint rules.
  • You can also run into problems if the column containing the value you are altering participates in a PRIMARY KEY or FOREIGN KEY constraint. This can mean that many masking rules can only be executed in a particular order or in a particular way.

You can, of course, temporarily disable triggers, unique keys, check constraints, or foreign key constraints while you perform the masking. This can be extremely useful, but disabling triggers can conceivably introduce inconsistencies, depending on the operations the triggers perform. You will also, of course, have to re-enable the constraints and keys at some point, and that can be the moment you discover a lot of inconsistencies that need mending.

Faced with the difficulties of altering data within a database, you might think that a better approach is to apply in-passage masking to text versions of the base tables, using scripts and regexes, or by reading the files into a document database and processing them there.

This gets over the problem of log artifacts, which can reveal the original sensitive data, but the problems begin when the sanitized data is imported into the target: the empty copy of the database that is destined to become the masked version. At some point, the triggers, rules, and constraints will have to be enabled, and the chances of the data being gracefully absorbed into the new database can be frighteningly remote.

Distributed Databases

Another problem arises if your extracted data set originates in more than one database. Masking software tends to work on only a single database at a time, so it can be difficult to mask the data in several databases in a way that yields mutually consistent results.

Masking Data That Is Part of a Primary Key

If you attempt to alter the data in a column that participates in a PRIMARY KEY, you are likely to destroy the referential integrity of the database. To do this effectively, using a substitution strategy within the database, you will need to create a correlation table, which contains copies of the before-and-after values of the column to be masked. The correlation table is then used to relate the masked, or substituted, key values to the original ones, making sure that the new values are unique.

After temporarily disabling all the relevant UNIQUE and FOREIGN KEY constraints on the target table, you use the correlation table first to alter the values in the PRIMARY KEY of the target table, and then to replace the original key values in the FOREIGN KEY columns of any referencing tables with the new values. The constraints on the target table can then be re-enabled. Finally, the temporary correlation table must be dropped, for security reasons, since it forms a lookup between the new values and the old.

For each referencing table, the FOREIGN KEY columns will generally have an index which, to speed the process, should remain in place during the update. Likewise, the column containing the original PRIMARY KEY values in the correlation table should be indexed.
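The correlation-table workflow described above can be sketched with SQLite standing in for any RDBMS; the table names, column names, and key values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    PRAGMA foreign_keys = OFF;   -- suspend FK enforcement while masking
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice  (id INTEGER PRIMARY KEY,
                           customer_id INTEGER REFERENCES customer(id));
    INSERT INTO customer VALUES (101, 'Alice'), (102, 'Bob');
    INSERT INTO invoice  VALUES (1, 101), (2, 102), (3, 101);

    -- Correlation table: each old key paired with a new, unique key.
    CREATE TABLE correlation AS
        SELECT id AS old_id,
               1000 + ROW_NUMBER() OVER (ORDER BY RANDOM()) AS new_id
        FROM customer;
    CREATE INDEX ix_corr ON correlation(old_id);

    -- Rewrite the primary key, then every referencing foreign key.
    UPDATE customer SET id = (SELECT new_id FROM correlation WHERE old_id = id);
    UPDATE invoice  SET customer_id =
        (SELECT new_id FROM correlation WHERE old_id = customer_id);

    DROP TABLE correlation;      -- the old-to-new lookup must not survive
    PRAGMA foreign_keys = ON;
""")

joined = conn.execute(
    "SELECT COUNT(*) FROM invoice i JOIN customer c ON i.customer_id = c.id"
).fetchone()[0]
print(joined)  # 3: every invoice still joins to its (re-keyed) customer
```

The same sequence applies in any full-scale RDBMS; only the syntax for disabling and re-enabling constraints differs.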

Free Text

Documents, memos, transcripts, and disciplinary notes are difficult to sanitize in situ. Although it is sometimes possible just to use 'Lorem ipsum dolor sit amet, no pro mundi propriae signiferumque, tale perfecto urbanitas duo et. Ne errem apeirian euripidis nam, an mazim necessitatibus per. Vim salutandi assentior cotidieque et, vim ne ullum putent…' and so on, this is unlikely to help with training.

The most useful approach is likely to be the use of Markov chains to produce meaningless text that mimics the style of the original. Homer, for example, can be obfuscated to:

'Nor yet e'en so Achilles let his counsel take:
who should strike the bird?
As less in skill not one was absent;
nor of his lofty vessel's prow. The dark whirlwind's force
as this wine we pour their hearts' best blood
Theirs and their fam'd Allies who freely in thy house
receiv'd, for twenty days, the heav'nly Maid; Daughter of Cisseus.'

Structured Text (XML/JSON)

Structured text, such as in an XML column, can contain the very values that are being obfuscated so they cannot be left alone. There is no point altering only the name column in a table if the names also appear in an XML document that is supposed to hold supplementary information.

This structured text can contain data that is used by the application under test, training, or development, so it will need to be parsed and, if necessary, subject to field-by-field masking just as if it were normal relational data. There is no short-cut to this chore.
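For JSON, the field-by-field pass can be done by walking the parsed document recursively. A sketch, assuming the sensitive field names are known in advance (the field list and placeholder are invented):

```python
import json

SENSITIVE = {"name", "email", "phone"}

def mask_json(node, placeholder="XXXX"):
    """Recursively walk a parsed JSON document, masking sensitive
    fields wherever they occur, at any nesting depth."""
    if isinstance(node, dict):
        return {k: placeholder if k in SENSITIVE else mask_json(v, placeholder)
                for k, v in node.items()}
    if isinstance(node, list):
        return [mask_json(item, placeholder) for item in node]
    return node

doc = '{"order": 7, "customer": {"name": "Joyce Hart", "email": "j@example.com"}}'
print(json.dumps(mask_json(json.loads(doc))))
# {"order": 7, "customer": {"name": "XXXX", "email": "XXXX"}}
```

XML needs the equivalent treatment with an XML parser; in both cases, masking by field name only works if the document schema is known and consistent.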

Intelligent Keys

Some data items, such as postal codes, employee numbers, credit card numbers, and bank codes, have a structure containing several attributes. Unless the substitution follows the same rules, the masked values are likely to fail any validation checks performed within the application and to violate CHECK constraints within the database.

The problem arises in the front-end application data-entry screens, which will check the content of these 'intelligent keys' prior to an update. If we've masked the data in these columns in a way that violates the internal rules governing its attributes, then any screen that displays the value will never permit an update, because the validity checks fail.

You can, of course, try shuffling the data in these columns, but a postal code, for example, could easily be checked against an address in the same row, so it is possible that the shuffled data will be rejected as inconsistent.
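Credit card numbers illustrate the point: data-entry screens commonly validate them with the Luhn checksum, so any substitute value must carry a correct check digit. A sketch that generates obviously fake, but Luhn-valid, numbers in the Visa-style '4…' range:

```python
import random

def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit so a generated number passes the
    standard validity test that data-entry screens apply."""
    digits = [int(d) for d in partial]
    # Double every second digit from the right, subtracting 9 when > 9.
    for i in range(len(digits) - 1, -1, -2):
        digits[i] *= 2
        if digits[i] > 9:
            digits[i] -= 9
    return str((10 - sum(digits) % 10) % 10)

def fake_card() -> str:
    """A 16-digit, Luhn-valid number starting with 4 (Visa-style)."""
    body = "4" + "".join(random.choice("0123456789") for _ in range(14))
    return body + luhn_check_digit(body)

assert luhn_check_digit("7992739871") == "3"  # the classic worked example
```

Other intelligent keys (postal codes, bank codes) need their own format rules applied in the same spirit; the checksum is only part of the structure.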

Training and Demonstration

Sometimes we need to anonymize a database for training or demonstration purposes in such a way that certain data is consistent enough to support a training 'scenario'. If you need to demonstrate an application process involving a certain user with, for example, certain accounts or other records, then the obfuscation process must take this into account. The easiest approach to this problem is to introduce consistent records into the obfuscated version of the database after the obfuscation is completed.

Specialist Data

There is certainly data that is common to many databases and can, therefore, be served by a list such as names of countries, currencies, and regions. Likewise, we can perform substitution from lists containing typical forenames or last names, valid credit card numbers from the most popular providers, and so on.

However, substitution of data soon gets complicated when we consider how to mask specialized terms, such as the names of manufacturing parts or zoological taxonomies, because now we must deal with the diversities of culture and language. Even in Europe, the lists would need to exist in the sixteen most common languages.

There comes a time when lists need a supplement, which in its simplest form might be a 'reverse regex': a generator that produces random numbers or strings conforming to a given regular expression.

This technique is most effective if the proportion of NULL values introduced into the output can be specified and the distribution of random values can be altered to fit the distribution of the data. Ideally, there should be a way of scripting random data generators to accommodate any sort of condition constraints or data consistency.
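Such a generator is straightforward to script. A sketch that yields phone-like strings with a configurable proportion of NULLs, so the masked column keeps a realistic rate of missing values (the UK-mobile pattern is just an example):

```python
import random
import string

def phone_generator(null_fraction: float = 0.1):
    """Yield UK-mobile-style strings, emitting None (NULL) at the
    requested rate to mimic the distribution of the real column."""
    while True:
        if random.random() < null_fraction:
            yield None
        else:
            yield "07" + "".join(random.choice(string.digits)
                                 for _ in range(9))

gen = phone_generator(0.25)
sample = [next(gen) for _ in range(8)]
```

The same pattern extends to skewed value distributions: replace `random.choice` with `random.choices` and per-value weights drawn from the real data's frequency profile.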


Quite often, there is no need at all to mask data. The most effective way to protect sensitive data is simply not to include it in the data set that you wish to distribute. Many requirements for masked data, such as for distributing reports publicly, can be accommodated far better by creating a view or set of views that give the required data without any of the unnecessary sensitive data fields. Why extract a credit card number, for example, if it must be obfuscated anyway? Why disguise people's names when they need not be in the data extract?
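In SQL terms, this is simply a view that projects only the non-sensitive columns, so the sensitive fields never leave the database. A sketch using SQLite, with invented table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payment (id INTEGER PRIMARY KEY, amount REAL,
                          card_number TEXT, customer_name TEXT);
    INSERT INTO payment VALUES (1, 19.99, '4532015112830366', 'Joyce Hart');

    -- Expose only what the report needs; the sensitive columns stay behind.
    CREATE VIEW payment_report AS
        SELECT id, amount FROM payment;
""")
print(conn.execute("SELECT * FROM payment_report").fetchall())  # [(1, 19.99)]
```

Granting the report consumer access to the view, and not the base table, is what makes this safe; a view alone is not a security boundary without the matching permissions.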

Even data extracts need to be created with great caution. Apparently innocuous data can be combined with other information to expose information about identifiable individuals. Location information, identifiers, or descriptive information such as racial origins can be used to unmask data. Fields containing compound identifiers, such as bank codes, can contain location information. XML fields can occasionally compromise a data extract by containing some or all of the original record in denormalized form.

In cases where an entire database must be delivered in obfuscated form, the problems can get worse. If the log file is included, the obfuscated data can be either partially or completely recovered, and it is also possible to reconstruct some of the original data by examining the data pages, unless the database provides Residual Information Protection (RIP).

Also, the referential nature of data in the average database means that the apparently simple matter of changing the data in databases can become complicated quickly. Where data has been denormalized, a record can store the same value in several places, so changing the value is a matter that needs caution and research. Replacing a value in a column that is participating in a primary key can be almost impossible without a great deal of skill and knowledge.

There are times when development and testing require data that is as close as legally possible to the real production data, and this is likely to be a regular demand. If the creation of a masked or anonymized database must be done repeatedly, it is therefore best to automate the process, so that the effort isn't wasted and the routine can be reused. The routine can also be tested alongside the database as it is extended and developed, to catch the problems that are occasionally introduced into the obfuscation process during development work.


