Open-Sourcing Datanymizer: In-flight Template-Driven Data Anonymization
Enter Datanymizer: your flexible privacy-preserving friend
Production systems often need to store sensitive data, including personally identifiable information (PII). Developers often need their test systems to hold data that is as close as reasonably possible to the data in production. Although this was always best practice, data protection regimes such as HIPAA, HITECH, CPRA, and GDPR make it even more important to ensure that personal data remains only where it is strictly needed, and is properly masked or anonymized when it is transferred elsewhere.
There are a number of different ways to bridge this gap, such as designing a strict separation between database tables that hold PII and those which don’t, allowing the PII tables to be skipped on export and replaced with synthetic data on the development systems. This approach can certainly work, but it relies on the system adhering to this design pattern, and the synthetic data being kept closely enough in step with the production equivalents to not cause problems.
An alternative might be to generate a special kind of “cleansed” dump on the production system, with PII already masked or replaced with synthetic data, ready for developers to import, keeping the risk of any sensitive data ever leaving the production environment low.
This is the approach Datanymizer takes.
There are various free and open-source fakers, anonymizers, and obfuscators that have been around for a long time and work pretty well, so why did we create a new one? We wanted one that supports globals, uniqueness constraints, inline rules, and other cool features.
We had some particular requirements we wanted our tool to meet. We didn't want the anonymizer to take a "raw" dump and mutate it. Instead, we needed it to produce an already anonymized dump, so that nobody handling the dump has access to the real data. The configuration that determines how the real system data is anonymized should be kept separate from that data.
We also wanted a tool that was flexible about how the anonymization itself takes place, ideally allowing the use of templates to populate field contents.
Enter Datanymizer: Your Flexible Privacy-Preserving Friend
Datanymizer does all of these things. You define a configuration that specifies what to do (and what not to do), and it dumps data directly from your database, applying the rules you define. It also integrates the Tera templating engine so that complex values can be synthesized.
The output is an anonymized SQL dump, written either to a file or directly to standard output, ready to be imported into a database using your normal tools.
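The idea of anonymizing rows while they are being dumped, rather than mutating a raw dump afterwards, can be illustrated with a minimal Python sketch. It is not Datanymizer's implementation: the function names, the per-column rule mapping, and the use of plain INSERT statements are all assumptions made for illustration.

```python
import io

def sql_literal(value):
    """Render a value as a SQL literal (deliberately simplified)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Quote strings and escape embedded single quotes.
    return "'" + str(value).replace("'", "''") + "'"

def dump_anonymized(table, columns, rows, rules, out):
    """Write one INSERT per row, transforming any column that has a rule.

    `rules` is a hypothetical mapping of column name -> function; columns
    without a rule are written unchanged.
    """
    for row in rows:
        values = [
            rules[col](val) if col in rules else val
            for col, val in zip(columns, row)
        ]
        out.write(
            f"INSERT INTO {table} ({', '.join(columns)}) "
            f"VALUES ({', '.join(sql_literal(v) for v in values)});\n"
        )

# The real e-mail addresses are replaced before anything is written out.
rules = {"email": lambda _original: "user@example.com"}
rows = [(1, "alice@corp.test"), (2, "o'neil@corp.test")]
out = io.StringIO()
dump_anonymized("users", ["id", "email"], rows, rules, out)
dump_sql = out.getvalue()
```

Because the rules run before each row is written, the original values never appear in the output stream.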
There are several ways to install pg_datanymizer. Choose whichever option is most convenient for you.
Homebrew / Linuxbrew:
The README contains an example configuration that you can use as a starting point.
Now you can invoke Datanymizer to generate a cleansed dump of your data:
It creates a new dump file, /tmp/dump.sql, containing a native SQL dump for a PostgreSQL database. You can import the fake data from this dump into a new PostgreSQL database with the command:
You can specify a list of tables which should never be included in a dump:
For dumping only
To ignore those tables and dump data from the others:
You can also specify data and schema filters separately.
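The "only these tables" versus "all tables except these" choice described above comes down to a small predicate. The following Python sketch shows that logic only. The function and parameter names (`should_dump`, `only`, `exclude`) are hypothetical and are not Datanymizer's configuration keys.

```python
def should_dump(table, only=None, exclude=None):
    """Decide whether a table's data belongs in the dump.

    `only`    - if given, dump just these tables (hypothetical name).
    `exclude` - if given, dump every table except these (hypothetical name).
    With neither set, every table is dumped.
    """
    if only is not None:
        return table in only
    if exclude is not None:
        return table not in exclude
    return True

tables = ["public.users", "public.markets", "public.orders"]

# Include list: keep only the named tables.
only_selected = [t for t in tables if should_dump(t, only={"public.users", "public.markets"})]

# Exclude list: keep everything except the named tables.
all_but_ignored = [t for t in tables if should_dump(t, exclude={"public.users", "public.markets"})]
```

In this sketch the include list takes precedence if both are given; that is a design choice of the example, not a statement about the tool.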
You can specify global variables available from any
Datanymizer includes built-in support (“rules”) for certain types of value, including a pipeline filter which allows multiple rules to be executed in sequence. Other filters include email, ip, words, first_name, last_name, city, phone, capitalize, template, digit, random_number, password, datetime and more.
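Executing several rules in sequence means feeding the output of each rule into the next. The Python sketch below illustrates that idea. The two toy rules are stand-ins written for this example (a real first-name rule would generate a random name), and the `pipeline` helper is hypothetical rather than Datanymizer's own code.

```python
def first_name(_value):
    """Stand-in rule: ignore the real value and return a fake name.

    A fixed lowercase name keeps the example deterministic.
    """
    return "alex"

def capitalize(value):
    """Stand-in rule: upper-case the first letter of the value."""
    return value.capitalize()

def pipeline(*rules):
    """Combine rules so that each one receives the previous rule's output."""
    def run(value):
        for rule in rules:
            value = rule(value)
        return value
    return run

# The real name is discarded by the first rule, then the fake one is capitalized.
anonymize_name = pipeline(first_name, capitalize)
result = anonymize_name("Elizabeth")
```

Here `result` is `"Alex"`: the order of the rules matters, since `capitalize` operates on whatever `first_name` produced.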
Uniqueness is supported by the email, ip, phone, and
Uniqueness is ensured by keeping track of values that have been generated where uniqueness is required, and re-generating any which are duplicates of those in the list.
You can customize the number of attempts with try_count. This is an optional field; the default number of tries depends on the rule.
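The track-and-retry approach described above can be sketched in a few lines of Python. Only the `try_count` name comes from the text. The `make_unique` helper, the default of 3 attempts, and raising an error once the attempts are exhausted are assumptions of this sketch, not documented Datanymizer behavior.

```python
class UniquenessError(Exception):
    """Raised when no unused value is found within try_count attempts."""

def make_unique(generate, try_count=3):
    """Wrap a value generator so that it never returns the same value twice.

    Previously returned values are remembered in a set; a duplicate
    candidate is discarded and a new one is generated, up to try_count times.
    """
    seen = set()

    def next_value():
        for _ in range(try_count):
            candidate = generate()
            if candidate not in seen:
                seen.add(candidate)
                return candidate
        raise UniquenessError(f"no unique value after {try_count} attempts")

    return next_value

# A deterministic generator that repeats itself, to show the retry in action.
candidates = iter(["a@example.com", "a@example.com", "b@example.com"])
unique_email = make_unique(lambda: next(candidates), try_count=3)

first = unique_email()   # "a@example.com" is new, so it is accepted
second = unique_email()  # the repeated "a@example.com" is rejected, then "b@example.com" is accepted
```

The smaller the space of possible values relative to the number of rows, the more attempts are likely to be needed, which is why a configurable attempt limit is useful.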
We plan to implement the following additional features soon:
- Pre-filtering: dumping only the rows that match specific criteria rather than all of them (e.g., 100 users aged 27 or older and named Alexander), with support for arbitrary SQL queries for filtering.
- Data generation: when you don’t need to anonymize existing data, but instead generate synthetic data based upon certain rules.
Datanymizer currently supports PostgreSQL databases, although MySQL (and so also MariaDB) support is planned. Contributions are of course very welcome!
Published at DZone with permission of Elizabeth Lvova. See the original article here.