An Introduction to Data Masking
Dealing with production data is a challenge, but most organizations have safeguards in place that properly secure the production environment. However, when it comes to non-production environments such as development (Dev) or test environments, many still do not have proper security in place. Protecting sensitive data is not only an organization’s moral responsibility; in certain cases it is also demanded by governing standards. This data can belong to customers or to the organization’s own employees. Either way, proper protections should be in place to ensure that data residing with the organization is secure. For a deeper look into data masking, check out the information security courses offered by the InfoSec Institute.
Objective & Scope:
This article gives readers an overview of data masking. Implementation strategies that focus on the “how to” of data masking solutions are out of scope. That said, we will still walk through one example that shows how the process works, to make the concept concrete. Readers should note that this is not the only way in which data masking can be performed.
The aim of this article is to introduce users to the concept of data masking and what it can achieve for an organization. We will also list some of the commercial products which can be used for masking sensitive organization data.
What is Sensitive Data?
The definition of sensitive data is broad and changes from country to country, organization to organization, and even individual to individual. In some countries, such as the United States, data like the Social Security Number (SSN) is considered extremely sensitive. Similarly, health records are considered sensitive information.
Globally, credit/debit card data is accepted as sensitive data, specifically the card number and PIN/CVV/CVV2 details.
SSNs and card data are examples of data that is sensitive at a regional or global level. Every organization also classifies certain data of its own as sensitive. For example, an employee’s salary details can be considered sensitive data. Similarly, intellectual property or research data is also considered sensitive in nature. This changes from organization to organization.
Why Secure Data?
There have been cases where an organization that lost critical customer data faced lawsuits and spent millions of dollars to settle them. Losing critical customer data can therefore be a huge cost to any organization.
Certain compliance standards, such as PCI DSS, have specific requirements that deal with data security. I will not cover every compliance standard; instead, I will use PCI DSS as one example.
One of the PCI DSS requirements, 6.3.4, says: “Production data (live PANs) are not used for testing or development”
The requirement stated above is clear enough: an organization can’t use live PANs (Primary Account Numbers). The trickier part is implementation. Without this data, how do we develop and test the application? As a matter of fact, we just need a PAN; it need not be a valid one.
Understanding this helps us mask our data by mapping existing live PANs to dummy PANs. Testers only need production-like data that lets them simulate real usage, not the live data itself. The PCI DSS requirement emphasizes PANs because the PAN is one of the most sensitive pieces of cardholder data. The requirement further adds that production data should not be used in development or testing environments. This is where data masking can be helpful.
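The PAN-to-dummy-PAN mapping described above can be sketched in Python. This is a minimal illustration, not a PCI DSS-approved procedure: the function names `random_digits` and `build_pan_mapping` and the sample card numbers are hypothetical, and the dummy values only preserve length and digit-only format (they are not checked against any card checksum, in keeping with the point that a test PAN need not be valid).

```python
import random


def random_digits(length, rng):
    """Return a string of `length` random decimal digits."""
    return "".join(rng.choice("0123456789") for _ in range(length))


def build_pan_mapping(live_pans, seed=None):
    """Map each live PAN to a unique dummy PAN of the same length.

    Hypothetical helper: the same live PAN always maps to the same dummy
    within one mapping, so references between test tables stay consistent.
    """
    rng = random.Random(seed)
    taken = set(live_pans)  # a dummy must never equal any real PAN
    mapping = {}
    for pan in live_pans:
        if pan in mapping:
            continue  # already mapped (duplicate input)
        dummy = random_digits(len(pan), rng)
        while dummy in taken:
            dummy = random_digits(len(pan), rng)
        taken.add(dummy)
        mapping[pan] = dummy
    return mapping


# Example with made-up card numbers (assumed sample data).
live = ["4111111111111111", "5500000000000004", "4111111111111111"]
mapping = build_pan_mapping(live, seed=42)
masked = [mapping[p] for p in live]
```

Note that the mapping itself links real and dummy values, so it would have to be protected or discarded once the test data set has been produced.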
What is Data Masking?
Data masking is the practice of obscuring specific data within a database. Masking ensures that sensitive data is replaced with realistic but not real data in the testing environment, thus achieving both aims: protecting sensitive data and ensuring that test data is valid and testable.
Data masking can be implemented in many ways. It could be as simple as substituting existing records with expected test data, or shuffling certain characters or numbers to generate a new record. Alternatively, it could be as complex as using proprietary algorithms to scramble or obfuscate part of a record with random data that has all the properties of the original data.
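As a rough illustration of the character-shuffling approach mentioned above, the Python sketch below shuffles only the digits of a value and leaves separators where they are. The function name `shuffle_digits` and the sample value are assumptions made for this example. Shuffling keeps the same set of digits, so it is weaker than substitution when used alone.

```python
import random


def shuffle_digits(value, rng):
    """Shuffle the digit characters of `value`, keeping separators in place."""
    digits = [c for c in value if c.isdigit()]
    rng.shuffle(digits)
    remaining = iter(digits)
    # Walk the original string: digits are replaced in order from the
    # shuffled pool, every other character is copied through unchanged.
    return "".join(next(remaining) if c.isdigit() else c for c in value)


# Made-up value in a typical NNN-NN-NNNN layout (assumed sample data).
masked_ssn = shuffle_digits("123-45-6789", random.Random(7))
```

The result has the same length and layout as the input, which is what lets it pass format checks in a test environment.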
Data masking is not just about test data. The concept can be applied to any situation where an organization does not want to reveal real data. Example: the salary information of every employee. Whether or not any compliance standard explicitly considers this sensitive, salary-related information remains sensitive from an organization’s standpoint, so protecting it makes sense. Data masking techniques can be applied here as well, and there can be many such scenarios. The following walks through this example in detail.
Let’s expand on the problem statement discussed previously. How can we ensure that valid test data is present while not leaking any employee’s salary? We can’t change employee IDs or numbers, because they can be primary keys in the database. Messing around with these records, applying encryption, or changing the primary key would render the data records useless.
We can solve this problem by scrambling the salary details of employees. However, there are a couple of challenges to consider before we scramble the details. If the scrambled data is expected to be recoverable, the process may not be straightforward. Assuming that live data is independent and we are not concerned with what happens to the test data as long as length, data type, and other business constraints are met, we can use Data Field Substitution to change the test data.
We can create a copy of the employees’ salary field and randomize the records. Once we have the randomized list of salaries, we can use it to replace the existing salary values. Alternatively, we can create a list of salaries of our own and substitute those. The end goal is to ensure that each employee’s salary is scrambled with realistic data, but not real data.
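The two options described above (randomizing the existing salary column, or substituting values from a list of our own) might look like this in Python. It is only a sketch under assumptions: the record layout (`id`, `name`, `salary`), the function names, and the sample figures are all hypothetical, and employee IDs are left untouched because they may be primary keys.

```python
import random


def shuffle_salaries(employees, seed=None):
    """Reassign the existing salaries among employees at random."""
    rng = random.Random(seed)
    salaries = [e["salary"] for e in employees]
    rng.shuffle(salaries)
    # Build new records; IDs and other fields are copied unchanged.
    return [{**e, "salary": s} for e, s in zip(employees, salaries)]


def substitute_salaries(employees, low, high, step=500, seed=None):
    """Replace each salary with a made-up value between low and high."""
    rng = random.Random(seed)
    return [
        {**e, "salary": rng.randrange(low, high + 1, step)}
        for e in employees
    ]


# Made-up sample records (assumed data, not from any real system).
staff = [
    {"id": 101, "name": "A", "salary": 52000},
    {"id": 102, "name": "B", "salary": 61000},
    {"id": 103, "name": "C", "salary": 75000},
]
shuffled = shuffle_salaries(staff, seed=1)
substituted = substitute_salaries(staff, low=40000, high=90000, seed=1)
```

Shuffling keeps the real salary figures in the data set and can hand an employee their own salary back, so substitution with generated values may be the safer of the two when the figures themselves are sensitive.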
Published at DZone with permission of Ryan Fahey, DZone MVB. See the original article here.