Data masking

When talking about data masking, most people think of encryption or pseudonymization. These are just two of the techniques used to mask data. Data masking means creating a substitute for certain data in such a way that the substitute does not reveal the original. It is primarily used to protect personally identifiable information (PII), but it can also protect other sensitive information. This does not mean that PII or other sensitive information should be masked at all times. Data masking should be used when people who do not have the right clearance or security level get access to the data.

For example, you have a customer support application that supports your incident, change and problem management processes. It also contains information about your customers, because they can log an incident and follow its progress. When you upgrade that application to a new version, you want to test the upgrade in your test environment first, but you are not allowed to simply copy your production database into your test environment (because of privacy laws and missing consent from the customers). Your testers should not be able to access the PII in that database.

Data masking lets you use that database, with all the data required to test the new version, while remaining compliant with the laws and regulations on the use of PII.

How

Data masking can be performed in multiple ways. The most commonly used methods are:

Pseudonymization
This is where the data, like a name, IP address or personal identification number, is replaced by or mapped to a pseudonym or alias. Each pseudonym or alias is used for only one value within a single data set, so anyone who holds the mapping can reverse the pseudonymization. It is therefore important to keep the pseudonymization secret very secure and accessible only to a very limited number of people, using strong authentication methods.
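
As an illustration only, the mapping idea can be sketched in a few lines of Python. The class name Pseudonymizer, the "user-" alias prefix and the sample names are assumptions made for this example and do not come from any particular masking tool. The two dictionaries play the role of the pseudonymization secret, so in practice they would have to be stored and protected as described above.

```python
import secrets


class Pseudonymizer:
    """Hypothetical helper: maps each original value to one random alias."""

    def __init__(self):
        # Together these dictionaries are the "pseudonymization secret":
        # whoever can read them can reverse the masking.
        self._forward = {}  # original value -> alias
        self._reverse = {}  # alias -> original value

    def mask(self, value: str) -> str:
        """Return the alias for a value, creating one on first use."""
        if value not in self._forward:
            alias = "user-" + secrets.token_hex(8)
            while alias in self._reverse:  # retry on the (unlikely) collision
                alias = "user-" + secrets.token_hex(8)
            self._forward[value] = alias
            self._reverse[alias] = value
        return self._forward[value]

    def unmask(self, alias: str) -> str:
        """Reverse the pseudonymization; only possible with the mapping."""
        return self._reverse[alias]


pseudonymizer = Pseudonymizer()
customers = ["Alice Jansen", "Bob de Vries", "Alice Jansen"]  # made-up sample data
masked = [pseudonymizer.mask(name) for name in customers]
```

The same customer always receives the same alias, so relations between records survive the masking, while the alias itself says nothing about the original name.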

Anonymization
This is where the data that you need to anonymize, like a name, IP address or personal identification number, is removed or changed. With this method, it is not possible to retrieve the original data.
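
A minimal sketch of this, assuming records are plain Python dictionaries and that the field names below (hypothetical, chosen for this example) are the sensitive ones. Each sensitive value is overwritten with a fixed placeholder and no mapping is kept, so the original cannot be recovered from the result. Whether a placeholder is sufficient for your data is something your own risk analysis has to confirm.

```python
# Hypothetical field names; adjust to the columns that hold sensitive data.
SENSITIVE_FIELDS = {"name", "ip_address", "national_id"}


def anonymize(record: dict) -> dict:
    """Return a copy in which sensitive fields hold a fixed placeholder."""
    return {
        key: ("REDACTED" if key in SENSITIVE_FIELDS else value)
        for key, value in record.items()
    }


ticket = {  # made-up sample record
    "ticket_id": 1042,
    "name": "Alice Jansen",
    "ip_address": "203.0.113.7",
    "status": "open",
}
anonymous_ticket = anonymize(ticket)
```

The function returns a new dictionary and leaves the input untouched, which keeps the production record intact while the copy goes to the test environment.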

Encryption
Encrypted data cannot be read without the decryption key. In most cases you can encrypt a whole file or database, but not specific data within that file or database. Whether this technique is suitable therefore depends on whether the data can still be used in its encrypted form.

Shuffling
In some cases, just shuffling data, like a name, IP address or personal identification number, can be enough. With this technique, you shuffle all the values in a column and make sure that no value ends up back in its original location. For example, a shuffled salary table still lists all actual salaries, but it does not reveal which salary belongs to which employee.
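
One possible way to guarantee that no value stays in its original position is Sattolo's algorithm, a variant of the usual shuffle that always produces a single cycle. The sketch below is an assumption about how this could be done, not a description of a specific product. The function name shuffle_column and the sample data are made up for the example.

```python
import random


def shuffle_column(values, rng=None):
    """Shuffle a column so that no entry keeps its original position.

    Uses Sattolo's algorithm. Note that it moves positions, not values:
    if the column contains duplicates, a row can still receive a value
    equal to its original one.
    """
    rng = rng or random.Random()
    result = list(values)
    for i in range(len(result) - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, so position i is never kept
        result[i], result[j] = result[j], result[i]
    return result


employees = ["Alice", "Bob", "Carol", "Dave"]  # made-up sample data
salaries = [4200, 3900, 5100, 4700]
shuffled_salaries = shuffle_column(salaries)
masked_table = list(zip(employees, shuffled_salaries))
```

All actual salaries are still present in masked_table, so totals and averages remain valid for testing, but no employee is paired with their own salary.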

Risks

Certain information can only be used when the data is properly masked. Which method to use depends on your risk analysis and the related laws and regulations. Always check with your compliance department or your legal advisor whether the proposed method is sufficient for your purpose.

To prevent the masked data from being linked to the original data, it is important to have a procedure in place that describes how data is masked and how someone can start this process. Bypassing this process should be prevented or made impossible.

The last risk I want to mention is access to the original data: people working with the masked data should not have access to the original data unless there is a very good reason for it that you can explain to your customers or auditors.

Example Control Ruleset

When the following controls are used, you should be compliant for this topic:

  • A procedure is set up that describes when and how to mask data
  • When data is masked, it is verified that the masking was performed successfully
  • Unmasked data is not used outside the production environment/applications (such as in acceptance, test or development)
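
The verification control could, for instance, be supported by an automated check that looks for original sensitive values in the masked output. The following is a rough sketch under the assumption that both data sets are available as lists of dictionaries. The function name find_leaks and the sample rows are hypothetical. A check like this only catches exact matches, so it complements a review of the masking procedure rather than replacing it.

```python
def find_leaks(original_rows, masked_rows, sensitive_fields):
    """Return (row_index, field) pairs where an original sensitive value
    still appears unchanged somewhere in the masked rows."""
    original_values = {
        str(row[field])
        for row in original_rows
        for field in sensitive_fields
        if field in row
    }
    leaks = []
    for index, row in enumerate(masked_rows):
        for field, value in row.items():
            if str(value) in original_values:
                leaks.append((index, field))
    return leaks


original = [  # made-up sample data
    {"ticket_id": 1, "name": "Alice Jansen", "ip_address": "203.0.113.7"},
    {"ticket_id": 2, "name": "Bob de Vries", "ip_address": "198.51.100.23"},
]
masked = [
    {"ticket_id": 1, "name": "user-3f9a", "ip_address": "REDACTED"},
    {"ticket_id": 2, "name": "Bob de Vries", "ip_address": "REDACTED"},  # masking missed
]
leaks = find_leaks(original, masked, ["name", "ip_address"])
# leaks == [(1, "name")]
```

An empty result means no original sensitive value was found verbatim in the masked data; any entry in the list points at a row and field that should be masked again before the data leaves production.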

Related links

Data masking tools from Gartner
Data masking at Wikipedia
Data masking at SQL on Azure
Data masking using AWS DMS or Lake Formation
Data masking on Google Cloud