Backup and Anonymize Your Cosmos Collections With the Cosmic Clone Tool
Learn about the Cosmic Clone tool and see how you can back up and anonymize your Cosmos collections.
As part of an application lifecycle, we are periodically required to refresh our non-production (dev/test) environments with production data. This helps us test applications with the right data and ensures that no obvious defects slip through to a release. It also enables us to test the performance of our application, as we will have the same quantity of data as in production. Further, testing on real data is bound to inspire confidence in an application release.
But copying live data increases the risk of exposing confidential information. A non-production database is likely to be accessed by developers and business analysts who may not have the same access in a live environment. They might only be interested in testing a feature and should not be exposed to the confidential information in the live system itself. To reduce such risks, data needs to be anonymized, i.e., personally identifiable or confidential information is removed or replaced with dummy values.
Thus, restoring data from production to a test environment is a two-part exercise. The first part involves copying the database with all its data, code (procedures/views), and settings (indexes/RUs) to a test environment. The second part involves the anonymization of confidential or sensitive information in the copied content.
Cosmic Clone is a utility that was developed to ease the above process and aid in the copy and anonymization of a Cosmos collection. This tool helps create a backup copy of your Cosmos collection in a few clicks and provides options to anonymize data in attributes that may contain personally identifiable or sensitive information.
Cosmic Clone provides options to create a new collection as an exact replica of the source collection, with all the settings, code, and documents intact. One can also opt out of copying any of these.
To anonymize various attributes, the tool allows us to provide rules to indicate the attributes to anonymize and the possible values to replace them with. There are also options to perform a random shuffle of the data.
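To make the idea of anonymization rules concrete, here is a minimal Python sketch of the two behaviors described above: replacing an attribute with a fixed value and randomly shuffling an attribute's values across documents. It is not Cosmic Clone's implementation. The rule keys (`attribute`, `action`, `value`), the `anonymize` function, and the sample documents are all assumptions made for illustration, and the sketch works on plain in-memory dictionaries rather than a real Cosmos collection.

```python
import copy
import random


def anonymize(documents, rules, seed=None):
    """Return an anonymized copy of documents; the input is left untouched."""
    rng = random.Random(seed)
    result = copy.deepcopy(documents)
    for rule in rules:
        attr = rule["attribute"]
        # Only touch documents that actually carry the attribute.
        targets = [doc for doc in result if attr in doc]
        if rule["action"] == "replace":
            # Overwrite every occurrence with the same dummy value.
            for doc in targets:
                doc[attr] = rule["value"]
        elif rule["action"] == "shuffle":
            # Keep the real values but reassign them to random documents.
            values = [doc[attr] for doc in targets]
            rng.shuffle(values)
            for doc, value in zip(targets, values):
                doc[attr] = value
        else:
            raise ValueError(f"unknown action: {rule['action']}")
    return result


# Hypothetical sample data and rules.
employees = [
    {"id": "1", "name": "Asha", "email": "asha@example.com", "salary": 100},
    {"id": "2", "name": "Ben", "email": "ben@example.com", "salary": 200},
    {"id": "3", "name": "Chen", "email": "chen@example.com", "salary": 300},
]
rules = [
    {"attribute": "email", "action": "replace", "value": "user@example.invalid"},
    {"attribute": "salary", "action": "shuffle"},
]
masked = anonymize(employees, rules, seed=7)
```

Shuffling keeps the overall distribution of values, which can matter for realistic testing, while breaking the link between a value and the person it belongs to. Replacement is the safer choice when the value itself is sensitive.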
With a few clicks, the tool begins to copy the collection. It also allows us to save the anonymization rules used for the copy of data, such that they can be reused in a subsequent run of the tool.
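As a rough illustration of why saved rules are convenient, the following Python sketch writes a set of rules to a JSON file and reads them back for a later run. The file name, the rule structure, and the `save_rules` and `load_rules` helpers are hypothetical; the format Cosmic Clone actually uses to save its rules is not described here and may differ.

```python
import json
import tempfile
from pathlib import Path


def save_rules(rules, path):
    """Write the rules to a JSON file so a later run can reuse them."""
    Path(path).write_text(json.dumps(rules, indent=2), encoding="utf-8")


def load_rules(path):
    """Read previously saved rules back from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Hypothetical rules; the attribute names are examples only.
saved_rules = [
    {"attribute": "email", "action": "replace", "value": "user@example.invalid"},
    {"attribute": "mobile", "action": "shuffle"},
]
rules_file = Path(tempfile.gettempdir()) / "anonymization_rules.json"
save_rules(saved_rules, rules_file)
reloaded = load_rules(rules_file)
```

Keeping the rules in a file also means they can be reviewed and versioned alongside the rest of the refresh procedure.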
Similar backup-and-anonymize scenarios apply in various cases, such as:
Reporting and Analytics
Consider, as an example, that you need to generate analytics on the number of people in different departments of your company. Your Cosmos collection also holds mobile numbers and other contact details of employees, which are irrelevant to this analysis. It is in your best interest to anonymize such fields in the copy of your data. You could define a simple rule for those attributes and run the tool to anonymize them.
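The department example can be sketched in a few lines of Python. This is an illustration under assumed field names (`department`, `mobile`, `email`) and hypothetical helpers (`mask_fields`, `headcount_by_department`), not the tool's rule syntax. It shows that masking the contact fields leaves the headcount per department unchanged.

```python
from collections import Counter


def mask_fields(documents, fields, placeholder="REDACTED"):
    """Return copies of the documents with the given fields overwritten."""
    return [
        {key: (placeholder if key in fields else value) for key, value in doc.items()}
        for doc in documents
    ]


def headcount_by_department(documents):
    """Count documents per department."""
    return Counter(doc["department"] for doc in documents)


# Hypothetical employee documents.
staff = [
    {"id": "1", "department": "Finance", "mobile": "555-0101", "email": "a@example.com"},
    {"id": "2", "department": "Finance", "mobile": "555-0102", "email": "b@example.com"},
    {"id": "3", "department": "Legal", "mobile": "555-0103", "email": "c@example.com"},
]
masked_staff = mask_fields(staff, {"mobile", "email"})
original_counts = headcount_by_department(staff)
masked_counts = headcount_by_department(masked_staff)
```

Because the analysis reads only `department`, the masked copy yields the same counts as the original while exposing no contact details.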
In most cases, it would be wise to anonymize data that is irrelevant to the analysis or analytics at hand.
Data Validation Post-Release
Consider scenarios where you need a copy of the data to compare its state before and after a change. For example, you have rolled out a few major changes to your collection structure, including changes to the partition key and to the indexes on a few attributes, and have added a few new object types to the same collection. You would need a backup copy to validate the new data against the old.
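One way to use such a backup is a simple document-level comparison. The Python sketch below assumes both the backup and the current data have been exported as lists of dictionaries and that every document carries an `id`. The `diff_collections` helper and the sample data are hypothetical and only illustrate the kind of check a backup makes possible.

```python
def diff_collections(before, after, key="id"):
    """Report which document ids were removed, added, or changed."""
    before_by_id = {doc[key]: doc for doc in before}
    after_by_id = {doc[key]: doc for doc in after}
    shared = before_by_id.keys() & after_by_id.keys()
    return {
        "removed": sorted(before_by_id.keys() - after_by_id.keys()),
        "added": sorted(after_by_id.keys() - before_by_id.keys()),
        "changed": sorted(k for k in shared if before_by_id[k] != after_by_id[k]),
    }


# Hypothetical snapshots taken before and after a release.
backup_docs = [
    {"id": "1", "team": "A"},
    {"id": "2", "team": "B"},
    {"id": "3", "team": "C"},
]
current_docs = [
    {"id": "1", "team": "A"},
    {"id": "2", "team": "B2"},
    {"id": "4", "team": "D"},
]
report = diff_collections(backup_docs, current_docs)
```

For large collections a full in-memory comparison like this may not be practical, so a sampled or per-partition comparison would be a reasonable variation.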
Consider also scenarios where you need to debug a production issue that cannot be replicated in a non-production environment. Such an issue is likely to be caused by a rare data condition that was unaccounted for in testing. You would need to restore a copy of the collection to debug rather than risk modifying live data with test values.
Data protection regulations such as GDPR now mandate data anonymization in all non-production environments. Microsoft's core services engineering teams have a mandatory task to anonymize their lower environments, which recurs every 90 days. In such cases, the Cosmic Clone tool can save developers manual effort, as they will no longer need to write, test, update, or maintain their own anonymization scripts.
As the usage of Azure Cosmos DB continues to rise, self-serve capabilities such as backup, restore, and anonymization of a data collection become more essential. Cosmic Clone is a handy utility that aids in this endeavor. The out-of-the-box anonymization options are a huge advantage and make it possible to perform data masking tasks on a Cosmos database without custom scripts. The tool is sure to save time on routine backup and restore tasks, time that can be spent on more productive work. Cosmic Clone has the potential to become a handy tool in every Cosmos developer's or DBA's arsenal.
For a complete walkthrough of the tool, visit the GitHub page, as the tool is now publicly available.
Disclaimer: Please note this is not an official tool from the Azure Cosmos DB team, but a utility developed by an independent developer within Microsoft IT.