Tools
When working with research data that contain personal or sensitive information, it can be helpful to use various tools to manage and protect the data. For example, there are tools that help you assess the risk of re-identification, systematically prepare a dataset for disclosure, or generate synthetic data. On this page, we introduce several types of tools and software that can be useful for managing quantitative and qualitative data, as well as tools for encryption, for generating synthetic data, and for working in secure computing environments.
Tools for quantitative data
There is a variety of tools for statistical disclosure control in quantitative data – that is, tools that help you assess and minimize the risk of re-identification within your dataset. Many of these tools also offer protective measures for reducing this risk, as well as functions to evaluate the usability of data after applying these measures. Below are some of the most common tools for statistical disclosure control.
sdcMicro
sdcMicro helps identify variables or combinations of variables that pose a risk of re-identification. It provides users with a quick overview of a dataset and makes it possible to aggregate variables, evaluate the impact of changes on the re-identification risk, and measure how data modifications affect further analysis.
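The idea of checking combinations of variables can be illustrated with a minimal Python sketch. It is not sdcMicro code and does not use sdcMicro's risk measures; the records, variable names, and the aggregation rule are made up for illustration. It counts how many records share each combination of key variables and reports the size of the smallest group, before and after aggregating age and postcode.

```python
from collections import Counter

# Hypothetical toy records; age, postcode and sex act as key variables.
records = [
    {"age": 34, "postcode": "41101", "sex": "F"},
    {"age": 36, "postcode": "41102", "sex": "F"},
    {"age": 35, "postcode": "41103", "sex": "F"},
    {"age": 52, "postcode": "41101", "sex": "M"},
    {"age": 58, "postcode": "41105", "sex": "M"},
    {"age": 45, "postcode": "41107", "sex": "M"},
]

def smallest_group(rows, key_vars):
    """Return the size of the smallest group of rows sharing the same key values."""
    counts = Counter(tuple(row[v] for v in key_vars) for row in rows)
    return min(counts.values())

def aggregate(row):
    """Coarsen age into 20-year bands and keep only the first three postcode digits."""
    low = (row["age"] // 20) * 20
    return {**row, "age": f"{low}-{low + 19}", "postcode": row["postcode"][:3]}

key_vars = ["age", "postcode", "sex"]
k_before = smallest_group(records, key_vars)                        # every record is unique
k_after = smallest_group([aggregate(r) for r in records], key_vars)  # records now share groups
print(k_before, k_after)
```

A smallest group of size 1 means that at least one record is unique on the chosen variables. After aggregation, each record in this toy example shares its key values with two others, at the cost of less detailed age and postcode information.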
sdcMicro can be run locally on your computer as an add-on package for the R software environment. It is free to download and includes a browser-based graphical user interface called sdcApp, which provides instructions and information about the changes made throughout the process – making it suitable for users who are not experts in microdata management. sdcMicro automatically documents all changes in a script, making it easy to trace and replicate your actions.
To use sdcMicro, you need access to an R environment on your computer or on a server (beginners are recommended to install RStudio). sdcMicro can then be installed as an R package from the CRAN repository with the command: install.packages("sdcMicro").
Amnesia
Amnesia is a tool from OpenAIRE that pseudonymizes data by using a predefined algorithm to convert personal data into pseudonyms. The algorithm can be based on encryption or hashing. Like sdcMicro, Amnesia allows you to aggregate variables and evaluate the re-identification risk.
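As a rough illustration of hashing-based pseudonymization in general, the following Python sketch derives a pseudonym from an identifier with a keyed hash (HMAC-SHA256). This is an assumption about one possible approach, not a description of Amnesia's own algorithm. The function name pseudonymize, the key, and the identifiers are hypothetical.

```python
import hashlib
import hmac

def pseudonymize(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]  # shortened for readability; shorter values collide more easily

# Hypothetical key; in practice it should be random and stored apart from the data.
key = b"replace-with-a-randomly-generated-secret"

# Made-up identifiers for illustration only.
p1 = pseudonymize("19850312-1234", key)
p2 = pseudonymize("19850312-1234", key)
p3 = pseudonymize("19900101-5678", key)
print(p1, p3)
```

The same identifier always yields the same pseudonym, so records can still be linked across files. A secret key is used because identifiers with a known format could otherwise be recovered by hashing every possible value and comparing the results.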
Amnesia is Java-based and can be downloaded and run locally on your computer, but it is also available as a web application.
Download Amnesia
ARX
ARX is open-source software for anonymizing personal data. It is available both as a standalone graphical application and as a programming library. ARX supports several privacy models (including l-diversity and t-closeness) and provides metrics for measuring information loss and estimating residual re-identification risk.
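To give a sense of what a privacy model such as l-diversity checks, here is a minimal Python sketch of its simplest variant (distinct l-diversity): within every group of records that share the same quasi-identifiers, it counts the distinct values of a sensitive attribute. This is not ARX code and does not use the ARX library; the data and the function name l_diversity are hypothetical.

```python
from collections import defaultdict

def l_diversity(rows, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values found in any group."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].add(row[sensitive])
    return min(len(values) for values in groups.values())

# Hypothetical, already aggregated records.
rows = [
    {"age_band": "20-39", "region": "West", "diagnosis": "asthma"},
    {"age_band": "20-39", "region": "West", "diagnosis": "diabetes"},
    {"age_band": "20-39", "region": "West", "diagnosis": "asthma"},
    {"age_band": "40-59", "region": "West", "diagnosis": "depression"},
    {"age_band": "40-59", "region": "West", "diagnosis": "depression"},
]

l_value = l_diversity(rows, ["age_band", "region"], "diagnosis")
print(l_value)
```

In this toy dataset the second group contains only one diagnosis, so anyone who knows that a person belongs to that group also learns the diagnosis, even though the group has more than one member.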
ARX is Java-based and can be run locally on your computer via a compatible Java environment.
Download ARX
µ-ARGUS
µ-ARGUS is standalone graphical software developed by and for statisticians at Statistics Netherlands. The software is well-established, offers a wide range of functions, and supports SPSS file formats. The project behind µ-ARGUS has been active for a long time, and thanks to its open-source nature, many functions developed for it have been reused in other projects, such as sdcMicro. As a result, the functionalities of µ-ARGUS are also accessible for programmatic use beyond the graphical interface.
µ-ARGUS is Java-based and can be run locally on your computer via a compatible Java environment.
Download µ-ARGUS
Tools for qualitative data
There are several digital tools to help manage qualitative data, particularly for assisting in anonymizing and structuring materials such as interview transcripts. QualiAnon is one such tool, designed to help protect personal data while preserving the analytical value of the material.
QualiAnon
QualiAnon assists in identifying personal data and sensitive information in textual data, such as interview transcripts. It allows users to work systematically with different forms of markup and the use of stop terms, which can support pseudonymization of qualitative data. This can be particularly helpful when preparing data for transfer or sharing.
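The general principle of replacing marked terms with consistent placeholders can be sketched in Python as follows. This is not how QualiAnon is implemented. The replacement list, the placeholder format, and the function name pseudonymize_text are assumptions made for illustration.

```python
import re

# Hypothetical replacement list: each marked term maps to a placeholder.
replacements = {
    "Anna Lindqvist": "[PERSON_1]",
    "Uppsala": "[CITY_1]",
    "Volvo": "[EMPLOYER_1]",
}

def pseudonymize_text(text, mapping):
    """Replace every listed term in the text with its placeholder."""
    # Longest terms first, so a full name wins over any shorter overlapping term.
    terms = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(0)], text)

transcript = "Anna Lindqvist moved to Uppsala when she started at Volvo."
result = pseudonymize_text(transcript, replacements)
print(result)
```

Using the same placeholder for every occurrence of a term keeps the text analyzable, since a reader can still follow who did what. A fixed list only catches the terms it contains, so indirect identifiers still require manual review.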
QualiAnon is Java-based and can be run locally on your computer using a compatible Java environment.
Tools for encryption
Encryption is a protective measure that adds an extra layer of access control to sensitive data. It can be particularly useful for file transfers, temporary storage in environments with limited security, or as part of a systematic access control strategy within a research project. Below you will find some commonly used encryption tools.
Microsoft Office and LibreOffice
Office suites such as Microsoft Office and LibreOffice include functions for encrypting documents. In Microsoft Office, for example, you can find this under the File menu, then Info, and then Protect Document. The encryption algorithm used is considered strong in reasonably recent versions (2007 and later), meaning that overall security largely depends on the strength of the chosen password.
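Why password strength matters can be illustrated with a small calculation. The Python sketch below estimates how many bits of entropy a password has, assuming every character is chosen uniformly at random from a given alphabet. Passwords chosen by people are usually weaker than this estimate suggests. The function name entropy_bits is hypothetical.

```python
import math

def entropy_bits(length: int, alphabet_size: int) -> float:
    """Entropy in bits of a password with randomly chosen characters."""
    return length * math.log2(alphabet_size)

short_password = entropy_bits(8, 26)   # 8 random lowercase letters
long_password = entropy_bits(16, 62)   # 16 random letters (both cases) and digits

print(round(short_password, 1), round(long_password, 1))
```

Each additional bit doubles the number of guesses an attacker needs, so the 16-character password in this example is vastly harder to guess than the 8-character one, regardless of how strong the encryption algorithm itself is.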
7-Zip
7-Zip is an open-source compression tool that allows you to encrypt files during compression. It uses the AES-256 encryption algorithm, which is considered highly secure, so the overall protection depends mainly on choosing a strong password. 7-Zip is best suited for backing up files, storing raw data, or protecting files at rest, as the process of decrypting, extracting, and then re-encrypting and compressing files can be time-consuming.
A limitation of 7-Zip is that it is only available for Windows and Linux. Mac users can extract and decrypt 7-Zip archives using The Unarchiver.
Download 7-Zip
VeraCrypt
VeraCrypt is open-source software that supports AES-256 and several other encryption algorithms. It creates an encrypted “container”, which looks like a regular file in a directory (without a file extension; you may add one manually, for example .pdf, to help disguise the file). When a container is decrypted in VeraCrypt, it behaves like a mounted network volume where files can be stored and accessed.
Unlike the Office suites or 7-Zip, VeraCrypt is a specialized encryption application. It is available for Windows, Linux, and macOS. One downside is that it is more resource-intensive in terms of memory and processing.
Download VeraCrypt
Tools for generating synthetic data
Synthetic data are artificially generated data based on statistical models. The data may be generated based on real datasets or created entirely from predefined rules and input values.
Mockaroo
Mockaroo is a simple web-based tool for creating fully generative synthetic test data that follow typical distributions for different variable types, for example, background variables for fictitious individuals. It supports around 170 variable types and users can define distributions using a custom formula language.
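The principle of generating test data purely from predefined rules can be sketched in a few lines of Python. This does not use Mockaroo or its formula language. The variables, distributions, and parameter values are arbitrary assumptions chosen for illustration.

```python
import random

def make_person(rng):
    """Create one fictitious individual from predefined rules, not from real data."""
    return {
        # Age: normal distribution (mean 45, sd 15), limited to the range 18-90.
        "age": max(18, min(90, round(rng.gauss(45, 15)))),
        # Sex: categorical variable with assumed proportions.
        "sex": rng.choices(["F", "M"], weights=[0.51, 0.49])[0],
        # Income: right-skewed, drawn from a log-normal distribution.
        "income": round(rng.lognormvariate(10.3, 0.5)),
    }

rng = random.Random(42)  # fixed seed so the same dataset can be regenerated
synthetic_people = [make_person(rng) for _ in range(1000)]
print(synthetic_people[0])
```

Because no real dataset is involved, data generated this way carry no re-identification risk, but they also preserve none of the relationships between variables found in real data.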
Although Mockaroo is primarily intended for software testing, it is useful in many other scenarios. It is a commercial product, but the free version does not require registration and allows users to generate synthetic datasets of up to 1,000 rows, which can then be downloaded.
Download Mockaroo
synthpop
Synthpop is a tool for programmatic generation of synthetic data by analyzing real datasets and then modelling and mimicking them. It also allows for integration of various generic statistical distributions in the output. Synthpop is open source and provided as a package for the R programming environment. Most use cases currently require some basic programming knowledge, although a web interface is in development.
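The general idea of modelling a real dataset and then sampling from the model can be shown with a deliberately simplified Python sketch. It is not synthpop and does not reproduce synthpop's methods: it draws the first variable from its observed distribution and the second from the values observed together with the first. The data and the function names fit and sample are hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical "real" data: (sex, occupation) for six individuals.
real = [
    ("F", "employed"), ("F", "employed"), ("F", "student"),
    ("M", "employed"), ("M", "retired"), ("M", "retired"),
]

def fit(rows):
    """Store the observed first variable and the second variable given the first."""
    first = [a for a, _ in rows]
    second_given_first = defaultdict(list)
    for a, b in rows:
        second_given_first[a].append(b)
    return first, second_given_first

def sample(model, n, rng):
    """Draw n synthetic rows, one variable at a time."""
    first, second_given_first = model
    rows = []
    for _ in range(n):
        a = rng.choice(first)                   # follows the observed proportions
        b = rng.choice(second_given_first[a])   # conditional on the first variable
        rows.append((a, b))
    return rows

rng = random.Random(1)
synthetic_rows = sample(fit(real), 500, rng)
print(synthetic_rows[:3])
```

Generating variables one at a time, each conditional on those already generated, keeps relationships between variables in the synthetic output. Note that this simple version can only reproduce combinations that occur in the real data, so on its own it offers no protection for rare combinations.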
To use synthpop, first install R and – especially if you are new to R – RStudio. You can access synthpop in R or RStudio by installing it from the CRAN repository with the command: install.packages("synthpop").
Development of a Python version is also underway but remains in its early stages.
Download and access synthpop
Secure computing environments
A secure computing environment is designed to protect sensitive or confidential information and research data from unauthorized access, data leakage, or other security threats. These environments are especially important in research involving personal data or other sensitive information. Many universities offer secure local computing environments for researchers.
MONA
MONA (Microdata Online Access) is Statistics Sweden’s platform for access to microdata. MONA allows users to process data online and download the results, without the microdata ever leaving Statistics Sweden. The system offers a selection of software tools (including statistical and word processing software), and users can upload their own material to dedicated storage space.
Read more and request access to MONA
Bianca
Bianca is a research system for analyzing sensitive personal data, which is available at no cost to all Swedish academic researchers. Bianca is operated by UPPMAX at Uppsala University as a part of the NAISS-SENS project. The system offers a Linux-based environment with extensive storage and processing capabilities, making it ideal for analyzing pseudonymized sensitive data.
The research infrastructure SIMPLER and the SWEGEN project both use Bianca to provide their data, but most researchers upload their own data or access data imported directly from NGI.