Pseudonymization is a term that has gained popularity since the introduction of the General Data Protection Regulation (GDPR). In simple terms, pseudonymization means processing or transforming data in a way that preserves the privacy of individuals while retaining the data's statistical relevance for data analysis and data science use cases.

Pseudonymization vs Anonymization

Pseudonymization and anonymization are often confused and used interchangeably. Anonymization aims to remove the identifying data altogether from a dataset. Ideally, anonymized data cannot be reverse engineered to extract the original data. Pseudonymization, on the other hand, replaces the identifying data with aliases, pseudonyms, hashed values and so on. It hides or obscures personally identifiable information (PII) so that a person cannot be identified just by looking at the data. Unlike anonymization, it may still be possible to reverse engineer pseudonymized data and reveal the original data, for example with access to the key or mapping that was used.
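The difference can be illustrated with a minimal Python sketch. The function names `pseudonymize` and `anonymize`, the `user_` alias prefix and the record layout are assumptions made for illustration only, not part of any standard or library.

```python
import secrets


def pseudonymize(records, field):
    """Replace `field` with a random alias and return the records plus the alias table.

    The alias table is what makes the result reversible, so it has to be
    stored separately and protected.
    """
    original_to_alias = {}
    alias_to_original = {}
    result = []
    for record in records:
        value = record[field]
        if value not in original_to_alias:
            alias = "user_" + secrets.token_hex(4)
            while alias in alias_to_original:  # retry on the unlikely collision
                alias = "user_" + secrets.token_hex(4)
            original_to_alias[value] = alias
            alias_to_original[alias] = value
        result.append({**record, field: original_to_alias[value]})
    return result, alias_to_original


def anonymize(records, field):
    """Drop `field` entirely, keeping nothing that could restore it."""
    return [{k: v for k, v in record.items() if k != field} for record in records]
```

With the alias table, a pseudonymized record can be mapped back to the original person; the output of `anonymize` offers no such path. Note that removing a single field is rarely enough for real anonymization, since the remaining fields may still identify a person in combination.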

Let us consider a dataset containing information about actual people, with explicit details such as names, email IDs, phone numbers and social security numbers. If we apply a keyed hashing technique to the PII, the pseudonymized data can be reverse engineered to reveal the original data if the key is compromised or the technique is known. On the other hand, if we add random noise to the PII, it is nearly impossible to decipher the original data by means of reverse engineering.
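A minimal sketch of both ideas, using only the Python standard library. HMAC-SHA256 is chosen here as one possible keyed hashing technique; the helper names, the Gaussian noise model and its scale are illustrative assumptions.

```python
import hashlib
import hmac
import random


def keyed_pseudonym(value: str, key: bytes) -> str:
    """Pseudonymize a value with a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def reidentify(pseudonym: str, candidates, key: bytes):
    """Show why a leaked key is dangerous: hash each guess and compare."""
    for candidate in candidates:
        if hmac.compare_digest(keyed_pseudonym(candidate, key), pseudonym):
            return candidate
    return None


def add_noise(value: float, scale: float, rng: random.Random) -> float:
    """Perturb a numeric value with Gaussian noise (hypothetical noise model)."""
    return value + rng.gauss(0.0, scale)
```

Anyone who holds the key and a list of plausible values (for example, known email addresses) can recompute the hashes and match them against the pseudonyms. The noisy value, by contrast, keeps no deterministic link to the original, although the achievable privacy depends on how much noise is added.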

In general, both pseudonymization and anonymization keep data in a format that is not easily decipherable by people with unauthorized access, which enhances security in case of a data compromise. Both methods help an organization comply with GDPR and reduce its legal liability and compliance burden. These methods also help end users feel safer and gain trust in the service, because they know their data is going to be handled with care.

Ways to implement pseudonymization

There are four common ways of implementing pseudonymization:

  1. Masking: In masking, the PII records are either completely or partially altered without changing the characteristics of the data. For example, an original phone number could be 123-456-789, but a masked phone number might look like 435-362-357. This method is typically realized with the help of shuffling or substituting characters. The advantage of this approach is that it retains the usability of the data to some extent while removing PII.
  2. Clustering: In this method, the data is grouped so that each value is replaced by a range of values that contains it. For example, a date of birth of "12-Jan-1990" could be converted into "1-Jan-1990 to 31-Jan-1990".
  3. Encryption: In this method, the data is encrypted so that the original data can only be extracted with a secured decryption key.
  4. Hashing: This is the most popular way to implement pseudonymization, where a particular value corresponds to a particular hash value. Payment processing solutions such as Apple Pay and Google Pay use a closely related technique, tokenization, in which the card number is replaced by a substitute token. The advantage of hashing is that the hash value for a particular input value remains the same throughout the dataset, enabling statistical analysis on the data without revealing PII.
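Three of the methods above can be sketched with the Python standard library alone. Encryption is left out because the standard library does not ship a symmetric cipher. The function names, the month-level range and the salt handling are assumptions for illustration, not a reference implementation.

```python
import calendar
import hashlib
import random
from datetime import date


def mask_digits(value: str, rng: random.Random) -> str:
    """Masking: substitute every digit with a random one, keeping length and separators."""
    return "".join(str(rng.randrange(10)) if ch.isdigit() else ch for ch in value)


def generalize_to_month(d: date):
    """Clustering: replace a date with the first and last day of its month."""
    last_day = calendar.monthrange(d.year, d.month)[1]
    return d.replace(day=1), d.replace(day=last_day)


def salted_hash(value: str, salt: bytes) -> str:
    """Hashing: the same value and salt always give the same SHA-256 digest."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()
```

For example, `generalize_to_month(date(1990, 1, 12))` returns the pair 1 January 1990 and 31 January 1990, matching the date-of-birth example in the list. Because `salted_hash` is deterministic for a fixed salt, records belonging to the same person can still be grouped and counted without exposing the underlying value.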

In practice, pseudonymization lowers the risk of identifying PII in a dataset, whereas anonymization minimizes that risk to the maximum extent practically possible. At the end of the day, pseudonymization is not foolproof, but it acts as a major deterrent in the case of data compromise while retaining statistical significance for analytics purposes.


Swati Choudhary

Machine Learning Engineer

Swati Choudhary works in the field of Machine Learning Engineering at CGI Advanced Analytics Solutions.