Resources
Below is a collection of links to guides, handbooks, research articles, and videos that may be helpful when working with data containing personal information.
Guides and handbooks
Anonymisation and personal data
This guide, produced by the Finnish Social Science Data Archive (FSD), provides a basic introduction to anonymization and pseudonymization. It includes practical recommendations for working with both quantitative and qualitative data.
Handbok i statistisk röjandekontroll (Handbook on statistical disclosure control; in Swedish)
This handbook from Statistics Sweden (SCB) is primarily intended as a guide for statistical agencies in applying statistical disclosure control when producing and publishing official or other statistics. However, it can also be a useful resource for researchers working with microdata, for example through SCB’s microdata platform MONA. For researchers using microdata, anonymization or de-identification is typically not sufficient – statistical disclosure control is also required before data can be disclosed and shared.
Guide to de-identification methods: Opinion 05/2014 on anonymisation techniques
This opinion from the Article 29 Working Party within the EU (since replaced by the European Data Protection Board) addresses the two main strategies for de-identification, or anonymization: randomization and generalization. It explores specific methods within each category and outlines their respective strengths and weaknesses.
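To make the two strategies concrete, here is a minimal Python sketch, not taken from the opinion itself. The function names, the ten-year band width, and the noise range are all hypothetical choices for illustration. Generalization coarsens a value (an exact age becomes an age band), while randomization perturbs it (here by adding uniform noise).

```python
import random


def generalize_age(age, width=10):
    """Generalization: replace an exact age with a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def add_noise(value, scale, rng):
    """Randomization: shift a numeric value by uniform noise in [-scale, scale]."""
    return value + rng.uniform(-scale, scale)


# Made-up example values; a fixed seed keeps the run repeatable.
rng = random.Random(0)
ages = [23, 37, 38, 61]
incomes = [31000, 42000]

age_bands = [generalize_age(a) for a in ages]
noisy_incomes = [add_noise(x, 500, rng) for x in incomes]

print(age_bands)
print(noisy_incomes)
```

Neither step guarantees anonymity on its own. The opinion discusses how each family of methods can still leave residual risk, so a sketch like this is only a starting point for a proper risk assessment.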
Encryption for researchers
Encryption can provide an additional layer of access control for researchers working with sensitive data. This guide, created at Ghent University, explains what encryption is, when to consider using it, and how to apply it in practice.
Data Privacy Handbook
Developed at Utrecht University, this handbook on data containing personal information can be seen as the Dutch counterpart to SND's handbook. It provides basic legal knowledge and an overview of useful methods and tools for researchers working with personal data.
Data Management Expert Guide
This general guide to data management is produced by CESSDA, a European research infrastructure working to improve access to social science research data. It is primarily intended for social science researchers and provides best practices and strategies for effective data management, with an emphasis on the FAIR principles. In other words: how can researchers make their data Findable, Accessible, Interoperable, and Reusable?
Anonymisering av personopplysninger (Anonymization of personal data; in Norwegian)
This guide from the Norwegian Data Protection Authority (Datatilsynet) is for individuals and organizations seeking support in anonymizing personal data. It covers key legal principles, highlights risk factors, and discusses the pros and cons of various anonymization techniques.
Videos
Practical introduction to the sdcMicro tool
A walkthrough demonstrating the sdcMicro tool in RStudio from a CESSDA Train the Trainer Workshop, “Anonymisation for data sharing in practice”. The video shows how to identify variables, or combinations of variables, that pose a risk of re-identification and demonstrates how various aggregations affect that risk.
Data Anonymization Workshop Series
A series of four workshops from McGill University introducing and exploring anonymization of research data. The first two workshops focus on quantitative research; the last two focus on qualitative research.
Workshop 1: Reducing Risk: An Introduction to Data Anonymization
Workshop 2: ARX – Anonymizing data in theory and practice
Workshop 3: Ethically sharing qualitative data
Workshop 4: Qualitative data sharing: A roadmap and resources to facilitate responsible and ethical data sharing
Amnesia: High-accuracy Data Anonymization
This webinar from OpenAIRE serves as both an introduction to anonymizing research data and a demonstration of the Amnesia tool. Amnesia transforms research data containing personal information to provide k-anonymity and k^m-anonymity.
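As background to the term: a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The following Python sketch shows one way to measure this. It is not how Amnesia works internally; the function name, the column names, and the records are hypothetical and made up for illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values on all quasi-identifiers (the dataset's k)."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values())


# Made-up records with already generalized age and postcode values.
records = [
    {"age_band": "30-39", "postcode": "411", "diagnosis": "A"},
    {"age_band": "30-39", "postcode": "411", "diagnosis": "B"},
    {"age_band": "40-49", "postcode": "412", "diagnosis": "A"},
    {"age_band": "40-49", "postcode": "412", "diagnosis": "C"},
    {"age_band": "40-49", "postcode": "412", "diagnosis": "A"},
]

k = k_anonymity(records, ["age_band", "postcode"])
print(k)
```

Which variables count as quasi-identifiers is a judgment call that depends on what an intruder could plausibly know. Adding more variables to the list can only keep k the same or lower it.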
Anonymisation in theory and practice
Three videos from the British National Centre for Research Methods (NCRM) that offer an introduction and practical guide to anonymization, statistical disclosure control, and how to assess disclosure risk in research data.
Five-Step Guide to Statistical Disclosure Control
A series of five videos from the United Nations OCHA Centre for Humanitarian Data. The videos outline the process of assessing disclosure risk in research data:
Step 1: Prepare the Disclosure Risk Assessment
Step 2: Selecting Your Key Variables
Step 3: Run the Assessment
Step 4: Read the Assessment Results
Step 5: Manage Data Responsibly
Research articles
A tutorial in assessing disclosure risk in microdata
Statistical agencies and other public bodies take legal and ethical measures to protect the confidentiality of data shared with researchers, but disclosure risks may remain. This tutorial provides an overview of what to consider when assessing and preventing such risks, with a particular focus on quantitative risk measures.
Taylor, L., Zhou, X.-H., & Rise, P. (2018). A tutorial in assessing disclosure risk in microdata. Statistics in Medicine, 37(25), 3693–3706. https://doi.org/10.1002/sim.7667
Factors that affect the likelihood of survey participation
This study used a vignette experiment to investigate how likely research subjects were to participate in surveys with varying topic sensitivity and risk of disclosure.
Couper, M. P., Singer, E., Conrad, F. G., & Groves, R. M. (2008). Risk of disclosure, perceptions of risk, and concerns about privacy and confidentiality as factors in survey participation. Journal of Official Statistics, 24(2), 255–275. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3096944/
Anonymisation of unstructured data
Much of the literature on data anonymization has focused on structured data (such as tables) used in quantitative research. This study examines two approaches to anonymizing unstructured data, such as text documents or images, which are often used in qualitative research: a risk-based approach and a strict approach. Two case studies illustrate the challenges encountered when applying each approach to unstructured datasets.
Weitzenboeck, E. M., Lison, P., Cyndecka, M., & Langford, M. (2022). The GDPR and unstructured data: is anonymization possible? International Data Privacy Law, 12(3), 184–206. https://doi.org/10.1093/idpl/ipac008
Anonymisation of big data
“Big data” refers to large amounts of data that can be analyzed to reveal patterns, trends, and associations. It is increasingly common in the social sciences, for example in studies of online behaviour. This study discusses challenges and issues in researching big data, including anonymization and re-identification.
Weinhardt, M. (2021). Big data: Some ethical concerns for the social sciences. Social Sciences, 10(2), 36. https://doi.org/10.3390/socsci10020036
An introduction to synthetic data
Synthetic data are artificially generated, fictitious data. Instead of modifying an existing dataset to make it less identifiable, a new dataset is generated, containing fictitious individuals and values. This study introduces synthetic data by explaining what they are, why they may be useful, and how to use them.
Jordon, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C., Cohen, S. N., & Weller, A. (2022). Synthetic Data – what, why and how? arXiv:2205.03257. https://doi.org/10.48550/arXiv.2205.03257