Methods for qualitative data

There are many ways to pseudonymize qualitative data, and as the researcher you are best placed to determine which methods suit your research data. Below are some general tips and commonly used techniques for pseudonymizing qualitative research data.¹

7 tips for working with qualitative data

1. Use digital tools for qualitative data

There are various tools and resources available for structuring qualitative data. While these may not be specifically designed for pseudonymization, they can be helpful in organizing and gaining an overview of the material to be processed. We discuss several potentially useful tools in the section Tools.

2. Never work on the original data

You will likely have to experiment and refine your methods to find a solution that fits your needs. Therefore, you should always work on a copy of your data instead of on the original data to avoid any irreversible loss of important information.
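As a minimal sketch of this habit, the function below copies a data file into a separate working folder before any editing starts. The function name and the folder layout are hypothetical; adapt them to how your project is organized.

```python
import shutil
from pathlib import Path


def make_working_copy(original: Path, workdir: Path) -> Path:
    """Copy `original` into `workdir` and return the path of the copy."""
    workdir.mkdir(parents=True, exist_ok=True)
    copy_path = workdir / original.name
    # Refuse to continue if the "copy" would overwrite the original itself.
    if copy_path.resolve() == original.resolve():
        raise ValueError("the working folder must differ from the original's folder")
    shutil.copy2(original, copy_path)  # copy2 also preserves file timestamps
    return copy_path
```

All later processing would then read and write only the returned path, leaving the original file untouched.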

3. Document changes

Keep thorough, updated records of your changes. For example, create a codebook for pseudonyms or other codes used during the pseudonymization process.
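One simple way to keep such a codebook is a two-column CSV file. The sketch below assumes a plain mapping from original value to pseudonym; the column names and function names are illustrative, not a prescribed format.

```python
import csv
from pathlib import Path


def save_codebook(codebook: dict[str, str], path: Path) -> None:
    """Write original -> pseudonym pairs to a CSV file, sorted for easy review."""
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["original", "pseudonym"])
        for original, pseudonym in sorted(codebook.items()):
            writer.writerow([original, pseudonym])


def load_codebook(path: Path) -> dict[str, str]:
    """Read the CSV file back into a dictionary."""
    with path.open(newline="", encoding="utf-8") as f:
        return {row["original"]: row["pseudonym"] for row in csv.DictReader(f)}
```

Because a codebook of this kind links pseudonyms to real identities, it works as a code key and should be stored separately from the pseudonymized material.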

4. Be consistent

Apply your pseudonymization strategy consistently. For example, use [square brackets] or another special character to mark masked information. Avoid formatting such as italics or bold, which may be lost when text files are exported or converted. When you transcribe interviews, consider marking each proper name with a special character that you do not use elsewhere in the text (e.g., #). This will simplify the process of finding and replacing names later on.
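To illustrate why a marker character helps, the sketch below lists every name that was marked with # during transcription, together with how often it occurs. It assumes the marker is placed directly before the name and that # is used nowhere else in the text; the pattern and function name are hypothetical.

```python
import re
from collections import Counter

# "#" followed by a name; hyphenated names such as "Karl-Erik" are kept whole.
NAME_MARK = re.compile(r"#(\w+(?:-\w+)*)")


def marked_names(transcript: str) -> Counter:
    """Count each name that was marked with '#' in the transcript."""
    return Counter(NAME_MARK.findall(transcript))


text = "I met #Anna and #Karl-Erik at work. Later #Anna called me."
print(marked_names(text))
```

The resulting list can serve as a checklist when you assign pseudonyms and fill in the codebook.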

5. Review background material

Review any documentation or other background material associated with your research data – such as sampling details, contact lists, or method descriptions – for identifying information. Ensure that the material does not contain any accessible code keys and that method descriptions and metadata do not contain details that could directly or indirectly identify a participant.

6. Evaluate the outcome

Verify that your protective measures effectively protect the participants’ identities. Test whether individuals can still be identified using reasonable effort and check whether there are any external data sources that might be used for re-identification.
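One partial check, sketched below, is to count how many participants share each combination of background variables. A combination held by a single participant makes that person easier to single out. The field names and the threshold k are assumptions for illustration, and a check like this does not replace a manual review of the material.

```python
from collections import Counter


def small_groups(records: list[dict], keys: list[str], k: int = 2) -> dict[tuple, int]:
    """Return each combination of `keys` that is shared by fewer than k records."""
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return {combo: n for combo, n in counts.items() if n < k}


# Hypothetical participant table after generalization.
participants = [
    {"age": "30–39", "county": "Skåne"},
    {"age": "30–39", "county": "Skåne"},
    {"age": "60–69", "county": "Västerbotten"},
]
print(small_groups(participants, ["age", "county"]))
```

Any combination that is reported may need broader categories or removal before the data are shared.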

7. Avoid collecting personal data unless necessary

If you plan to make your data publicly available, carefully consider which personal data you need to collect. In most cases, personal data cannot be shared openly, even if they are only indirectly identifying. At the same time, research data – including personal data – may not be deleted unless a formal disposal decision has been made and the retention period has expired. (For more information, see the section on legal aspects.)

Some research questions require personal data; others do not. Remember that personal data include both directly identifying information and any information that can identify an individual indirectly, in combination with additional data. For example, if you do not plan to collect direct identifiers such as names or personal ID numbers but want to collect age and income data, consider using ranges instead of exact values (e.g., age 18–29 or 30–39; income up to £2,000 or £2,001–3,000) to reduce the risk of re-identification.
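If exact ages have already been recorded, they can be converted to ranges before the data are stored or shared. The sketch below uses the 18–29 and 30–39 ranges from the example; the "under 18" and "80+" groups are assumptions added to make the function complete, and income could be handled the same way.

```python
def age_bracket(age: int) -> str:
    """Map an exact age to a coarse age range."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age < 18:
        return "under 18"  # assumed lowest group
    if age <= 29:
        return "18–29"
    if age >= 80:
        return "80+"  # assumed top group, avoids very small groups of the oldest
    lower = (age // 10) * 10
    return f"{lower}–{lower + 9}"


print([age_bracket(a) for a in (17, 18, 29, 30, 45, 83)])
```

Collecting the range directly, for example through a multiple-choice question, avoids storing the exact value at all.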

Common methods for qualitative data

1. Replace proper names with aliases or pseudonyms

Perhaps the most common method of handling proper names in qualitative data is to replace them with aliases or pseudonyms. Although this might seem simple, a dataset can quickly become difficult to navigate – for you, your collaborators, or secondary users – if the changes are not made systematically and documented in detail.

Plan a coherent coding system for names in advance, especially if you will be working with large datasets or in projects involving multiple researchers. Carefully document all changes in a codebook, so that you can keep track of your pseudonymized material.

Typically, occurrences of first and last names are replaced with a single first name alias or a pseudonym. However, in datasets with many participants, it is important to create unique aliases or pseudonyms to avoid confusion. In some cases, using an alias with a first and last name may improve the material's clarity and coherence. Alternatively, you can use a more descriptive pseudonym or combination of pseudonyms (e.g., [teacher, school 2, region 4]). It is up to you as a researcher to assess the balance between the risk of re-identification and the usefulness of the data.

Note that using aliases instead of real names results in pseudonymized – not anonymized – data, as long as any supplementary information (e.g., a code key linking aliases to real names) exists. Such datasets contain personal data and must be handled accordingly.
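A systematic replacement of names can be sketched as follows, assuming a codebook that maps each real name to its alias. The function name and the sample names are hypothetical. Matching is exact and case-sensitive, so misspelled names and nicknames would still need a manual check.

```python
import re


def apply_aliases(text: str, codebook: dict[str, str]) -> str:
    """Replace every whole-word occurrence of a real name with its alias."""
    if not codebook:
        return text
    # Try longer names first, so that "Anna Berg" is matched before "Anna".
    names = sorted(codebook, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: codebook[match.group(0)], text)


codebook = {"Anna Berg": "Maria", "Anna": "Maria", "Karl": "Johan"}
text = "Anna Berg met Karl. Later Anna called Karlsson."
print(apply_aliases(text, codebook))
```

The word boundaries keep "Karlsson" unchanged, because only whole names listed in the codebook are replaced.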

2. Categorize general names and nouns

Names and designations that appear only once or a few times and that are not essential to understanding the data can usually be replaced with a general pseudonym rather than a unique alias. General names, designations, and nouns that describe entities, places, or people are often replaced by broad categories such as [sister], [grandfather], [man], [politician, man, municipality 5], or [teacher, woman, primary school].

As noted under 1 above, you may sometimes need to use multiple pseudonyms to keep the pseudonymized data coherent and meaningful. When using multiple categories of pseudonyms, always consider the broader context of your data and assess which other data sources could be used to re-identify participants.

If your study population and sample size are small and relate to a specific location – for example, employees at a single workplace – be restrictive about the level of detail revealed through pseudonyms. If your population is larger – for example, all adults in Sweden aged 18–85 – you can generally afford to be less restrictive.

3. Change or remove sensitive information

Sensitive personal data are subject to special rules under the GDPR. Examples of information that could cause a person harm if disclosed include medical diagnoses, political opinions, statements about colleagues, drug use, and sexual activity.

If your data contain sensitive personal data, you must be particularly careful in how you generalize and categorize the data. The choices you make will of course depend on the purpose of your research. If your research focuses on drug use, for example, it would not be appropriate to remove details of drug use. In such cases, consider the following alternatives:

reduce the level of detail in other parts of the dataset, for example by generalizing other indirect identifiers into very broad categories
omit certain indirect identifiers entirely
be particularly cautious with geographic data and information about third parties
apply additional technical safeguards, such as encryption and data access agreements, if you wish to or must share the data.

Note that it is not always obvious what information is sensitive or might risk harming an individual. For example, information about bark beetle infestations is not inherently sensitive, but if it can be linked to a specific geographical area and landowner, it could have serious financial consequences.

4. Generalize and categorize background information

Background variables and indirect identifiers – such as gender, age, education, income, political affiliation, occupation, and municipality of residence – are often important for understanding the data. These variables may also play a key role in analytical comparisons.

However, detailed background variables increase the risk of re-identification: if the information is detailed enough, it can be used to single out an individual in the material. You should therefore identify such variables and generalize or group them into broader categories wherever possible.

This is similar to pseudonymization techniques used for quantitative data, where information is generalized into categories to reduce the level of detail. You might, for instance, reduce the level of detail in the data by:

converting age to age brackets
recoding municipality into county
classifying income as low, medium, or high
recoding political affiliation as right or left
grouping occupation or workplace into public or private sector.

How you recode background variables and indirect identifiers will depend on how the data will be used, what analyses you want to be able to make, and how openly you intend to share the data. A helpful starting point is to follow established classifications and standards, such as those published by Statistics Sweden (SCB), to guide your recoding.
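Two of the recodings listed above, municipality to county and income to a broad level, can be sketched as follows. The lookup table covers only three municipalities for illustration, and the income thresholds and field names are made up; in a real project they would come from an established classification and your own analysis needs.

```python
# Illustrative lookup; a complete table would follow an official classification.
MUNICIPALITY_TO_COUNTY = {
    "Lund": "Skåne",
    "Malmö": "Skåne",
    "Umeå": "Västerbotten",
}


def income_level(income: float, low_max: float = 2000, medium_max: float = 3000) -> str:
    """Classify income as low, medium, or high (thresholds are hypothetical)."""
    if income <= low_max:
        return "low"
    if income <= medium_max:
        return "medium"
    return "high"


def generalize(record: dict) -> dict:
    """Return a copy of `record` with detailed fields replaced by broad ones."""
    generalized = {k: v for k, v in record.items() if k not in ("municipality", "income")}
    # An unmapped municipality becomes a placeholder; the raw value is never kept.
    generalized["county"] = MUNICIPALITY_TO_COUNTY.get(record["municipality"], "[unknown county]")
    generalized["income_level"] = income_level(record["income"])
    return generalized


print(generalize({"id": "P07", "municipality": "Lund", "income": 2500}))
```

Dropping the detailed fields in the same step means the exact municipality and income do not remain in the generalized record.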

References

1. The information on this page is partly based on Data Management Guidelines [Online]. Tampere: Finnish Social Science Data Archive [distributor and producer]. <https://www.fsd.tuni.fi/en/services/data-management-guidelines/anonymisation-and-identifiers/#anonymising-qualitative-data> (Retrieved 2025-07-15.)