Tools

When working with research data containing personal or sensitive information, it can be helpful to use various tools to manage and protect the data. These may assist in assessing the risk of re-identification or in systematically preparing a dataset for disclosure. On this page, we introduce several types of tools and software that can be useful for handling quantitative and qualitative data, as well as for encryption, generating synthetic data, and working in secure computing environments.

Tools for quantitative data

There are several tools available for statistical disclosure control of quantitative data – that is, tools that help you assess and minimise the risk of re-identification within your dataset. Many of these tools offer protective measures for reducing this risk, as well as functions to evaluate the usability of data after applying these measures. Below are some of the most common tools for statistical disclosure control. 

sdcMicro

sdcMicro helps identify variables or combinations of variables that pose a risk of re-identification. It provides users with a quick overview of a dataset and makes it possible to aggregate variables, evaluate the impact of changes on the re-identification risk, and measure how data modifications affect further analysis.
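
The idea of finding risky combinations of variables can be illustrated without sdcMicro itself. The following is a minimal sketch in Python, not sdcMicro's implementation: it counts how many records share each combination of key variables and flags records whose combination occurs fewer than k times. The records, the variable names, and the helper functions `class_sizes` and `risky_rows` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy records; the variable names are illustrative only.
records = [
    {"age_group": "30-39", "sex": "F", "municipality": "A"},
    {"age_group": "30-39", "sex": "F", "municipality": "A"},
    {"age_group": "30-39", "sex": "M", "municipality": "A"},
    {"age_group": "40-49", "sex": "F", "municipality": "B"},
    {"age_group": "40-49", "sex": "F", "municipality": "B"},
]


def class_sizes(rows, key_vars):
    """Count how many rows share each combination of key-variable values."""
    return Counter(tuple(row[v] for v in key_vars) for row in rows)


def risky_rows(rows, key_vars, k):
    """Return rows whose combination of key values occurs fewer than k times."""
    sizes = class_sizes(rows, key_vars)
    return [row for row in rows if sizes[tuple(row[v] for v in key_vars)] < k]


key_vars = ["age_group", "sex", "municipality"]
unique = risky_rows(records, key_vars, k=2)
print(len(unique), "record(s) fall below k=2 on", key_vars)

# Leaving out one key variable shows how the risk changes.
without_sex = risky_rows(records, ["age_group", "municipality"], k=2)
print(len(without_sex), "record(s) fall below k=2 without 'sex'")
```

Comparing the two counts mirrors, in a very simplified way, the workflow of changing a dataset and then re-evaluating the re-identification risk.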

sdcMicro can be run locally on your computer as an add-on package for the R software environment. It is free to download and includes a browser-based graphical user interface called sdcApp, which provides instructions and information about the changes made throughout the process – making it suitable even for users who are not experts in microdata management. sdcMicro automatically documents all changes in a script, making it easy to trace and replicate your actions.

To use sdcMicro, you need access to an R environment on your computer or on a server (RStudio is recommended for beginners). sdcMicro can then be installed as an R package from the CRAN repository using the command: install.packages("sdcMicro").

Download sdcMicro

Amnesia

Amnesia is a tool from OpenAIRE, which pseudonymises data using a predefined algorithm to convert personal data into pseudonyms. The algorithm can be based on encryption or hashing. Like sdcMicro, Amnesia allows you to aggregate variables and evaluate the re-identification risk.
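
As an illustration of hash-based pseudonymisation in general, the sketch below uses a keyed hash (HMAC-SHA256) from the Python standard library. It is not Amnesia's algorithm. The function name `pseudonymise`, the placeholder key, and the fictitious names are assumptions made for the example.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes, length: int = 12) -> str:
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    # Truncated for readability; a shorter pseudonym increases collision risk.
    return digest.hexdigest()[:length]


# Placeholder key for illustration; a real key should be random and kept secret.
secret_key = b"replace-with-a-randomly-generated-secret"

# Fictitious identifiers; the first and last refer to the same person.
names = ["Anna Andersson", "Bertil Berg", "Anna Andersson"]
pseudonyms = [pseudonymise(name, secret_key) for name in names]
print(pseudonyms)
```

The same identifier always yields the same pseudonym under the same key, so records can still be linked. Anyone holding the key can recompute the mapping, which is why the key needs to be stored separately from the data.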

The software is Java-based and can be downloaded and run locally on your computer, but it is also available as a web application.

Download Amnesia

ARX

ARX is open-source software for anonymising personal data. It is available both as a standalone graphical application and as a programming library. ARX supports established privacy models (including l-diversity and t-closeness) and provides metrics for measuring information loss and estimating residual re-identification risk.
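
To give a sense of what a privacy model such as l-diversity checks, here is a minimal sketch of the simplest variant (distinct l-diversity). It does not use the ARX library. The toy data, the variable names, and the function `distinct_l` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: two quasi-identifiers and one sensitive variable.
rows = [
    {"age_group": "30-39", "region": "North", "diagnosis": "asthma"},
    {"age_group": "30-39", "region": "North", "diagnosis": "diabetes"},
    {"age_group": "40-49", "region": "South", "diagnosis": "asthma"},
    {"age_group": "40-49", "region": "South", "diagnosis": "asthma"},
]


def distinct_l(rows, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values in any group
    of rows that share the same quasi-identifier values."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].add(row[sensitive])
    return min(len(values) for values in groups.values())


l_value = distinct_l(rows, ["age_group", "region"], "diagnosis")
print("The data satisfy distinct l-diversity for l =", l_value)
```

In this toy dataset both groups contain two records, yet everyone in the second group has the same diagnosis, so the sensitive value is revealed for that group even though no record is unique.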

ARX is Java-based and can be run locally via a compatible Java environment.

Download ARX

µ-ARGUS

µ-ARGUS is standalone graphical software developed by and for statisticians at Statistics Netherlands. The software is well established, offers a wide range of functions, and supports SPSS file formats. The project behind µ-ARGUS has been active for a long time and, thanks to its open-source nature, many functions developed for µ-ARGUS have been reused in other projects, such as sdcMicro. The functionality in µ-ARGUS is therefore also accessible for programmatic use beyond the graphical interface.

µ-ARGUS is Java-based and runs locally on your computer via a compatible Java environment. 

Download µ-ARGUS

Tools for qualitative data

Several digital tools are available to help handle qualitative data, particularly for assisting in anonymising and structuring materials such as interview transcripts. QualiAnon is one such tool, designed to help protect personal data while preserving the analytical value of the material.

QualiAnon

QualiAnon assists in identifying personal data and sensitive information in textual data, such as interview transcripts. It allows users to work systematically with different forms of markup and the use of stop terms, which can support pseudonymisation of qualitative data. This can be particularly helpful when preparing data for transfer or sharing.
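
The general principle of replacing marked terms with consistent placeholders can be sketched in a few lines of Python. This is not how QualiAnon works internally. The replacement table, the placeholder labels, the fictitious names, and the function `pseudonymise_text` are assumptions for illustration.

```python
import re

# Hypothetical replacement table: identifying term -> placeholder.
replacements = {
    "Anna Svensson": "[PERSON_1]",
    "Northfield": "[CITY_1]",
    "Lakeside Clinic": "[WORKPLACE_1]",
}


def pseudonymise_text(text, replacements):
    """Replace every listed term in the text with its placeholder."""
    # Longest terms first, so a longer term is never partly replaced by a shorter one.
    terms = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))
    return pattern.sub(lambda match: replacements[match.group(0)], text)


transcript = "Anna Svensson moved to Northfield and now works at Lakeside Clinic."
print(pseudonymise_text(transcript, replacements))
```

A plain term list like this only catches exact matches. Misspellings, nicknames, and indirect identifiers still require manual review of the material.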

QualiAnon is Java-based and can be run locally on your computer using a compatible Java environment.

Download QualiAnon

Tools for encryption

Encryption is a protective measure that adds an extra layer of access control to sensitive data. It can be particularly useful for file transfers, temporary storage in environments with limited security, or as part of a systematic access control strategy within a research project. Below are some commonly used encryption tools. 

Microsoft Office and LibreOffice

Office suites such as Microsoft Office and LibreOffice include functions for encrypting documents. In Microsoft Office, for example, you can find this under the File menu, Info, and then Protect document. The encryption algorithm used is considered strong in reasonably recent versions (post-2007), meaning that overall security largely depends on the strength of the chosen password.
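
Since the protection depends on the password, one option is to generate it randomly rather than invent it. The sketch below uses Python's `secrets` module. The function names, the default length of 20 characters, and the choice of alphabet are assumptions, not a recommendation from the tools described here.

```python
import math
import secrets
import string

# Letters and digits only, to avoid characters that are awkward to type.
ALPHABET = string.ascii_letters + string.digits


def generate_password(length: int = 20) -> str:
    """Generate a random password using a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def entropy_bits(length: int, alphabet_size: int) -> float:
    """Estimate the entropy of a uniformly random password, in bits."""
    return length * math.log2(alphabet_size)


password = generate_password()
print(password)
print(round(entropy_bits(len(password), len(ALPHABET))), "bits of entropy (estimate)")
```

The entropy estimate applies only to passwords drawn uniformly at random. A password chosen by a person is usually far weaker than its length suggests.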

7-Zip

7-Zip is an open-source compression tool that allows you to encrypt files during compression. It uses the AES-256 encryption algorithm, which is considered highly secure, so the overall security depends mainly on choosing a strong password. 7-Zip is best suited for backing up files, storing raw data, or protecting files at rest, as the process of decrypting, extracting, and then re-encrypting and compressing files can be cumbersome.

A limitation of 7-Zip is that it is only available for Windows and Linux. Mac users can extract and decrypt 7-Zip archives using The Unarchiver.

Download 7-Zip

VeraCrypt

VeraCrypt is open-source software that supports AES-256 and several other encryption algorithms. It creates an encrypted “container”, which appears as a regular file in a directory (without a file extension – you may add one manually, such as .pdf, to help disguise the file). When a container is decrypted in VeraCrypt, it behaves like a mounted network volume where files can be stored and accessed.

Unlike the Office suites or 7-Zip, VeraCrypt is a specialised encryption application and is available for Windows, Linux, and macOS. One downside is that it is more resource-intensive in terms of memory and processing.

Download VeraCrypt

Tools for generating synthetic data

Synthetic data are artificially generated data based on statistical models. The data may be generated based on real datasets or created entirely from predefined rules and input values.
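
As a minimal sketch of the second approach, data created entirely from predefined rules, the example below generates fictitious individuals with Python's `random` module. The variables, the value ranges, and the distribution parameters are assumptions chosen for illustration and are not based on any real dataset.

```python
import random


def make_person(rng):
    """Create one fictitious individual from fixed, predefined rules."""
    return {
        "age": rng.randint(18, 90),                     # uniform over 18-90, inclusive
        "sex": rng.choice(["F", "M"]),                  # equal probability
        "income": round(rng.lognormvariate(10.0, 0.5)), # right-skewed, always positive
    }


# A fixed seed makes the generated dataset reproducible.
rng = random.Random(42)
synthetic = [make_person(rng) for _ in range(5)]

for person in synthetic:
    print(person)
```

Because no real data are involved, records generated this way cannot disclose information about actual individuals, but they also carry no information about real relationships between variables.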

Mockaroo

Mockaroo is a simple web-based tool for creating fully generative synthetic test data that follow typical distributions for different variable types, such as background variables for fictitious individuals. It supports around 170 variable types, and users can define distributions using a custom formula language.

Although Mockaroo is primarily intended for software testing, it is useful in many other scenarios. It is a commercial product, but the free version does not require registration and allows users to generate synthetic datasets of up to 1,000 rows, which can then be downloaded.

Download Mockaroo

synthpop

Synthpop is a tool for generating synthetic data programmatically: it analyses a real dataset, models it, and produces data that mimic it. It can also incorporate various generic statistical distributions in the output.
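
The general idea of modelling a real dataset and then sampling from the model can be sketched as follows. This is a deliberately simplified illustration in Python and not synthpop's method: each variable is modelled independently, so relationships between variables are lost. The toy data and the functions `fit` and `sample` are hypothetical.

```python
import random
import statistics
from collections import Counter

# Hypothetical stand-in for a real dataset.
real = [
    {"sex": "F", "age": 34},
    {"sex": "F", "age": 41},
    {"sex": "M", "age": 29},
    {"sex": "F", "age": 52},
    {"sex": "M", "age": 47},
    {"sex": "F", "age": 38},
]


def fit(rows):
    """Estimate simple per-variable summaries from the real data."""
    sex_counts = Counter(row["sex"] for row in rows)
    ages = [row["age"] for row in rows]
    return {
        "sex_values": list(sex_counts.keys()),
        "sex_weights": list(sex_counts.values()),
        "age_mean": statistics.mean(ages),
        "age_sd": statistics.stdev(ages),
    }


def sample(model, n, rng):
    """Draw n synthetic rows from the fitted summaries."""
    rows = []
    for _ in range(n):
        sex = rng.choices(model["sex_values"], weights=model["sex_weights"])[0]
        age = max(0, round(rng.gauss(model["age_mean"], model["age_sd"])))
        rows.append({"sex": sex, "age": age})
    return rows


model = fit(real)
synthetic = sample(model, 10, random.Random(1))
print(synthetic[:3])
```

Dedicated tools model the dependencies between variables as well, which is what makes their output useful for analysis rather than only for testing.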

Synthpop is open source and provided as a package for the R programming environment. Most use cases currently require some basic programming knowledge, although a web interface is in development. 

To use synthpop, first install R and – especially if you are new to R – RStudio. You can access synthpop in R or RStudio by installing it from the CRAN repository with the command: install.packages("synthpop").

Development of a Python version is also underway but remains in its early stages.

Download and access synthpop

Secure computing environments

A secure computing environment is designed to protect sensitive or confidential information and research data from unauthorised access, data leakage, or other security threats. These environments are especially important in research involving personal data or other sensitive information. Many universities offer secure local computing environments for researchers.

MONA

MONA (Microdata Online Access) is Statistics Sweden’s platform for access to microdata. MONA allows users to process data online and download the results, without the microdata ever leaving Statistics Sweden. The system offers a selection of software tools (including statistical and word processing software), and users can upload their own material to dedicated storage space.

Read more and request access to MONA

Bianca

Bianca is a research system dedicated to analysing sensitive personal data. It is freely available to all Swedish academic researchers. Bianca is operated by UPPMAX at Uppsala University and is part of the NAISS-SENS project. Bianca offers a Linux-based environment with extensive storage and processing capabilities, making it ideal for analysing pseudonymised sensitive data.

The research infrastructure SIMPLER and the SWEGEN project use Bianca to provide their data, but individual researchers can also upload their own data or access data imported directly from NGI.

Read more and request access to Bianca