Common misconceptions

The following are some common misconceptions about anonymization and pseudonymization. The content is based on “10 misunderstandings related to anonymization”, a joint paper produced by the European Data Protection Supervisor (EDPS) and the Spanish Data Protection Authority (AEPD).

“Pseudonymization is the same as anonymization”

Pseudonymization involves processing personal data in such a way that they can no longer be linked to a specific individual without the use of additional information stored separately from the original dataset. This means that with access to supplementary information (such as a code key), individuals could potentially be identified. Therefore, pseudonymized data are still considered personal data.

Anonymization, on the other hand, involves permanently removing all identifying information from a dataset and irreversibly breaking any link to supplementary data sources that could potentially identify an individual. Once research data have been fully anonymized, they can no longer be traced back to any specific person and are no longer classified as personal data.
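The distinction can be illustrated with a small sketch in Python. The field names and record structure here are hypothetical; the point is that pseudonymization produces a separately stored code key, which is exactly the "supplementary information" that keeps the data personal:

```python
import secrets

def pseudonymize(records, id_field="name"):
    """Replace the direct identifier in each record with a random code.

    Returns the pseudonymized records and a separate code key mapping
    codes back to identities. Illustrative sketch only; field names
    are hypothetical.
    """
    code_key = {}          # stored separately, under strict access control
    pseudonymized = []
    for record in records:
        code = secrets.token_hex(8)
        code_key[code] = record[id_field]
        new_record = dict(record)
        new_record[id_field] = code
        pseudonymized.append(new_record)
    return pseudonymized, code_key

records = [{"name": "Alice", "diagnosis": "J45"},
           {"name": "Bob", "diagnosis": "E11"}]
pseudo, key = pseudonymize(records)

# Without the key, the records no longer name anyone directly...
assert all(r["name"] not in ("Alice", "Bob") for r in pseudo)
# ...but with the key, identities can be restored, which is why
# pseudonymized data are still personal data.
assert key[pseudo[0]["name"]] == "Alice"
```

Anonymization, by contrast, would require irreversibly destroying the code key and any other means of re-linking the records to individuals.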

“Encryption is anonymization”

Encryption uses cryptographic keys – either a shared secret key or a combination of private and public keys – to transform information in a way that reduces the risk of misuse while maintaining confidentiality over a limited period. However, because the transformation is reversible (the data must be decryptable), encryption is not the same as anonymization.

The keys used for decryption are a form of “supplementary information” (see above) that can make personal data readable and thereby re-identifiable. In theory, you might think that encrypted research data become anonymous if you erase the decryption key, but this is not necessarily the case. You cannot assume that encrypted data are undecryptable just because the key is said to be erased or unknown.

Many factors affect the long-term confidentiality of encrypted data, including the strength of the encryption algorithm and key, possible data leaks, implementation flaws, the volume of encrypted data, and future technological advancements. While encryption is not anonymization, it is a useful tool for pseudonymizing research data containing personal information.
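The reversibility point can be shown with a deliberately insecure toy transformation. This is not a real cipher (production systems use vetted algorithms such as AES); it only illustrates that encryption is, by design, reversible by anyone who holds the key:

```python
import secrets

def xor_transform(data: bytes, key: bytes) -> bytes:
    """Toy XOR 'cipher' (NOT secure) used only to illustrate that
    encryption is a reversible transformation, not anonymization."""
    return bytes(b ^ k for b, k in zip(data, key))

plaintext = b"patient: Alice, diagnosis: J45"
key = secrets.token_bytes(len(plaintext))

ciphertext = xor_transform(plaintext, key)
# The ciphertext is unreadable without the key...
assert ciphertext != plaintext

# ...but anyone who obtains the key -- now or in the future --
# can reverse the transformation and re-identify the individual:
assert xor_transform(ciphertext, key) == plaintext
```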

“Research data can always be anonymized”

No. It is not always possible to reduce or eliminate the risk of re-identification to an acceptable level while preserving the usefulness of the dataset for the intended research purpose. Anonymization is about finding a balance between reducing the risk of re-identification and maintaining data utility.

Some types of research data, or specific research contexts, make sufficient anonymization difficult or impossible – for example, when very few individuals share a particular characteristic or variable value; when the data are so distinctive and vary so much between individuals that individuals can be singled out; or when the dataset contains many demographic variables or geographic information.

“Anonymization is forever”

No. How anonymization is implemented affects the risk of re-identification. While 100% anonymization may be ideal from a data protection perspective, it is not always achievable, so there is often a residual risk of re-identification.

Anonymization is not only about removing direct identifiers from a dataset, but also about breaking links to supplementary data sources that could enable re-identification. However, contexts change over time. New knowledge, advances in AI, increased computational power, or new applications of existing technologies may make it possible to re-identify individuals from datasets previously thought to be anonymous.

Additionally, future data leaks or the release of new supplementary data sources could retrospectively compromise anonymity. For these reasons, anonymization may not remain future-proof.

“There is no risk of re-identification in anonymized data”

The term “anonymous data” should not be understood as a binary concept where data are either anonymous or not. Rather, it exists on a spectrum. Except in specific cases where data are extremely generalized, the risk of re-identification is never zero. Each record in a dataset has a probability of re-identification, depending on how easily it can be distinguished from others. There are methods to assess this risk, and such assessments should be conducted when anonymization is first implemented and followed up over time.
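One common way to assess this risk is to group records by their quasi-identifiers and treat each record's risk as one divided by the size of its group – the view underlying k-anonymity. A minimal sketch with hypothetical column names; a real assessment would also weigh the context and the supplementary data available to an attacker:

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Estimate per-record re-identification risk as 1/k, where k is
    the number of records sharing the same quasi-identifier values
    (the k-anonymity view). Sketch only; column names are hypothetical."""
    class_sizes = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return [1 / class_sizes[tuple(r[q] for q in quasi_identifiers)]
            for r in records]

data = [
    {"age_band": "30-39", "zip3": "123", "diagnosis": "J45"},
    {"age_band": "30-39", "zip3": "123", "diagnosis": "E11"},
    {"age_band": "80-89", "zip3": "456", "diagnosis": "I10"},  # unique!
]
risks = reidentification_risk(data, ["age_band", "zip3"])
# The unique record is fully distinguishable and thus highest-risk:
assert risks == [0.5, 0.5, 1.0]
```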

Read more about methods to reduce the risk of re-identification.

“Anonymization can be fully automated”

Tools can assist in automating parts of the anonymization process, especially the identification and removal of direct identifiers in the material. But complete automation is unlikely to be enough. Manual review and expert input are necessary, particularly from researchers or support staff familiar with the material. Anonymization is not only about the data’s internal properties but also about the broader context in which they are used, and human judgment is needed to evaluate the contextual risk of indirect identification.
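The automatable part might look something like the following sketch, with hypothetical regex patterns for two kinds of direct identifier. A real tool would need many more patterns, tuned to the material and jurisdiction, and it still cannot judge contextual clues ("the only left-handed surgeon at the clinic") – which is why manual expert review remains necessary:

```python
import re

# Hypothetical patterns for two common direct identifiers.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def flag_direct_identifiers(text):
    """Flag likely direct identifiers for removal or replacement.

    Catches pattern-like identifiers only; indirect, contextual
    identification must be assessed by a human reviewer."""
    return [(label, m.group()) for label, rx in PATTERNS.items()
            for m in rx.finditer(text)]

note = "Contact alice@example.org or +46 70 123 4567 for follow-up."
print(flag_direct_identifiers(note))
# [('email', 'alice@example.org'), ('phone', '+46 70 123 4567')]
```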

“Anonymization makes research data useless”

The goal of anonymization is to prevent identification of individuals within a dataset. While anonymization techniques may limit how the resulting dataset can be used, this does not make research data useless. How useful the data are depends more on the research purpose and what level of re-identification risk is considered acceptable.
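For example, generalization removes exact values while keeping the data usable for aggregate analysis. A minimal sketch:

```python
from collections import Counter

def generalize_age(age: int, band_width: int = 10) -> str:
    """Generalize an exact age into a band, e.g. 34 -> '30-39'."""
    low = (age // band_width) * band_width
    return f"{low}-{low + band_width - 1}"

ages = [34, 37, 52, 58, 61]
bands = [generalize_age(a) for a in ages]
assert bands == ["30-39", "30-39", "50-59", "50-59", "60-69"]

# Exact ages are gone, but aggregate analyses remain possible:
assert Counter(bands)["50-59"] == 2
```

Whether such a dataset is still useful depends, as the text notes, on the research purpose: band-level statistics survive, individual-level age analyses do not.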

In some cases, anonymization may not be possible due to the research purpose. In such situations, researchers may need to either work with personal data under appropriate safeguards (e.g., by pseudonymizing them) or refrain from processing the data altogether.

“An anonymization process that worked for one research project will work for mine”

Anonymization processes must be tailored to the nature, scope, and context of the data, as well as to the objectives of the research project. There is no universal, one-size-fits-all solution.