Common misconceptions
The following are some common misconceptions about anonymisation and pseudonymisation. The content is based on “10 misunderstandings related to anonymisation”, a joint paper produced by the European Data Protection Supervisor (EDPS) and the Spanish Data Protection Authority (AEPD).
“Pseudonymisation is the same as anonymisation”
Pseudonymisation involves processing personal data in such a way that they can no longer be linked to a specific individual without the use of additional information stored separately from the original dataset. This means that with access to supplementary information (such as a code key), individuals could potentially be identified. Therefore, pseudonymised data are still considered personal data.
Anonymisation, on the other hand, involves permanently removing all identifying information from a dataset and irreversibly breaking any link to supplementary data sources that could potentially identify an individual. Once research data have been fully anonymised, they can no longer be traced back to any specific person and are no longer classified as personal data.
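To make the distinction concrete, the following minimal sketch (in Python, with hypothetical file and column names) pseudonymises a dataset by replacing names with random codes and storing the code key in a separate file. As long as the code key exists, the data can be re-linked to individuals and therefore remain personal data; anonymisation would additionally require destroying the key and removing or generalising any other identifying information.

```python
# Minimal pseudonymisation sketch (hypothetical file and column names).
# The code key is stored separately from the pseudonymised dataset;
# whoever holds both can re-identify participants, so the data remain personal data.
import csv
import secrets

code_key = {}  # name -> pseudonym (the "additional information")

with open("interviews.csv", newline="") as src, \
     open("interviews_pseudonymised.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["participant_id", "answer"])
    writer.writeheader()
    for row in reader:
        name = row["name"]
        if name not in code_key:
            code_key[name] = secrets.token_hex(4)  # random, non-derivable code
        writer.writerow({"participant_id": code_key[name], "answer": row["answer"]})

# The code key is written to a separate file, to be stored apart from the dataset.
with open("code_key.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["name", "participant_id"])
    w.writerows(code_key.items())
```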
“Encryption is anonymisation”
Encryption uses cryptographic keys – either a shared secret key or a combination of private and public keys – to transform information in a way that reduces the risk of misuse while maintaining confidentiality over a limited period. However, because the transformation is reversible (the data must be decryptable), encryption is not the same as anonymisation.
The keys used for decryption are a form of “supplementary information” (see above) that can make personal data readable and thereby re-identifiable. In theory, you might think that encrypted research data become anonymous if you erase the decryption key, but this is not necessarily the case. You cannot assume that encrypted data are undecryptable just because the key is said to be erased or unknown.
Many factors affect the long-term confidentiality of encrypted data, including the strength of the encryption algorithm and key, possible data leaks, implementation flaws, the volume of encrypted data, and future technological advancements. While encryption is not anonymisation, it is a useful tool for pseudonymising research data containing personal information.
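The reversibility is easy to see in practice. The following minimal sketch uses the third-party Python package cryptography (an illustrative choice, not a tool prescribed here) to encrypt and then decrypt a record: anyone who obtains the key can restore the original personal data.

```python
# Minimal sketch of symmetric encryption with the third-party "cryptography" package.
# The key is supplementary information: whoever holds it can restore the plaintext,
# so encrypted personal data remain pseudonymised, not anonymised.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # must be stored somewhere if decryption is ever needed
cipher = Fernet(key)

token = cipher.encrypt(b"Jane Doe, born 1980-05-17")
print(token)                         # unreadable without the key

original = cipher.decrypt(token)     # reversible: the link to the person is intact
print(original.decode())             # "Jane Doe, born 1980-05-17"
```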
“Research data can always be anonymised”
No. It is not always possible to reduce the risk of re-identification to an acceptable level while preserving the usefulness of the dataset for the intended research purpose. Anonymisation is about striking a balance between reducing the risk of re-identification and maintaining data utility.
Some types of research data, or specific research contexts, make sufficient anonymisation difficult or impossible – for example, when very few individuals share a particular characteristic or variable value; when the data are so distinctive and vary so much between individuals that records can be singled out; or when the dataset contains many demographic variables or geographic information.
“Anonymisation is forever”
No. How anonymisation is implemented affects the risk of re-identification. While 100% anonymisation may be ideal from a data protection perspective, it is not always achievable, so there is often a residual risk of re-identification.
Anonymisation is not only about removing direct identifiers from a dataset, but also about breaking links to supplementary data sources that could enable re-identification. However, contexts change over time. New knowledge, advances in AI, increased computational power, or new applications of existing technologies may make it possible to re-identify individuals from datasets previously thought to be anonymous.
Additionally, future data leaks or the release of new supplementary data sources could retrospectively compromise anonymity. For these reasons, anonymisation may not remain future-proof.
“There is no risk of re-identification in anonymised data”
The term “anonymous data” should not be understood as a binary concept where data are either anonymous or not. Rather, anonymity exists on a spectrum. Except in specific cases where data are extremely generalised, the risk of re-identification is never zero. Each record in a dataset has a probability of re-identification, depending on how easily it can be distinguished from others. There are methods to assess this risk, and such assessments should be conducted when anonymisation is first implemented and followed up over time.
Read more about methods to reduce the risk of re-identification.
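One common way to make the probability of re-identification concrete is to count how many records share the same combination of quasi-identifiers, in the spirit of k-anonymity. The sketch below uses pandas and hypothetical column names; which quasi-identifiers are relevant depends on your data and context.

```python
# Rough re-identification risk check (k-anonymity style) with pandas.
# Column names are hypothetical; choose the quasi-identifiers relevant to your data.
import pandas as pd

df = pd.read_csv("survey_responses.csv")
quasi_identifiers = ["birth_year", "postcode", "occupation"]

# Size of each equivalence class: records sharing the same quasi-identifier values.
class_size = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")

k = class_size.min()                       # smallest class = weakest link
unique_records = (class_size == 1).sum()   # records that stand alone

print(f"k-anonymity of the dataset: {k}")
print(f"Records unique on {quasi_identifiers}: {unique_records}")
```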
“Anonymisation can be fully automated”
Tools can assist in automating parts of the anonymisation process, especially the identification and removal of direct identifiers in the material. But complete automation is unlikely to be enough. Manual review and expert input are necessary, particularly from researchers or support staff familiar with the material. Anonymisation is not only about the data’s internal properties but also about the broader context in which they are used, and human judgment is needed to evaluate the contextual risk of indirect identification.
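As a rough illustration of what such tools handle well, the sketch below (plain Python with illustrative, non-exhaustive patterns) redacts obvious direct identifiers such as e-mail addresses and phone numbers. What it cannot do is judge whether a phrase like “the only female professor at the department” is identifying in context, which is exactly where human review is needed.

```python
# Sketch of the kind of direct-identifier scrubbing a tool can automate.
# Patterns are illustrative, not exhaustive; indirect identification
# still needs human judgement of the surrounding context.
import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}

def redact_direct_identifiers(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text

print(redact_direct_identifiers(
    "Contact me at jane.doe@example.org or +46 70 123 45 67."
))
# -> "Contact me at [EMAIL REDACTED] or [PHONE REDACTED]."
```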
“Anonymisation makes research data useless”
The goal of anonymisation is to prevent identification of individuals within a dataset. While anonymisation techniques may limit how the resulting dataset can be used, this does not make research data useless. How useful the data are depends more on the research purpose and what level of re-identification risk is considered acceptable.
In some cases, anonymisation may not be possible due to the research purpose. In such situations, researchers may need to either work with personal data under appropriate safeguards (e.g., by pseudonymising them) or refrain from processing the data altogether.
“An anonymisation process that worked for one research project will work for mine”
Anonymisation processes must be tailored to the nature, scope, and context of the data, as well as to the objectives of the research project. There is no universal, one-size-fits-all solution.