Synthetic data

Synthetic data refers to artificially generated data. Instead of modifying an existing dataset to make it less identifiable, a completely new dataset is generated, containing fictitious individuals and values. These data may be partially or entirely generated from artificial sources such as statistical distribution models or random generators. Therefore, “synthetic data” does not refer to a specific type of data found in files with particular formats; it is a category of data created through specific techniques.

    When synthetic data are generated for the purpose of protecting personal data, sensitive values in the original dataset are replaced with values generated from a statistical model. Synthetic data can be created in various ways – for example, based on rules or by using trained machine learning models – and for a range of purposes, including privacy protection, data validation, and software testing.

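    To make this concrete, here is a minimal sketch in Python (using NumPy and pandas; all column names and parameter values are invented for illustration) that generates fictitious individuals from statistical distributions and a simple derivation rule:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = 100

# Distribution-based: draw values from statistical distributions.
ages = rng.normal(loc=45, scale=12, size=n).round().clip(18, 90).astype(int)
incomes = rng.lognormal(mean=10.5, sigma=0.4, size=n).round(-2)

# Rule-based: derive a field deterministically from another field.
age_groups = pd.cut(ages, bins=[17, 29, 49, 69, 90],
                    labels=["18-29", "30-49", "50-69", "70+"])

# Assemble a dataset of entirely fictitious individuals.
synthetic = pd.DataFrame({
    "id": range(1, n + 1),
    "age": ages,
    "age_group": age_groups,
    "income": incomes,
})
print(synthetic.head())
```
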
    Considerations for synthetic data based on personal information

    Synthetic datasets that are generated from original data containing personal or sensitive information are often referred to, somewhat paradoxically, as “synthetic personal data.” Creating such data requires additional safeguards.

    A key concern with synthetic personal data is the risk of re-identification. In some cases, synthetic data may be so realistic that it becomes possible to re-identify individuals from the real data used to train the model. To reduce this risk, you should:

    • Document re-identification risk assessments using measures such as k-anonymity, and quantify how the synthetic data differ from the original dataset (a sketch of a k-anonymity check follows this list).
       
    • Consider how outliers may affect the re-identification risk.
       
    • Re-evaluate your requirements for data fidelity. High fidelity to the original dataset can increase the risk of re-identification and may not be necessary or even desirable.

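    A minimal sketch of such a check, assuming a pandas DataFrame and hypothetical quasi-identifier columns (age_group and postcode):

```python
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_identifiers: list) -> int:
    """Return the size of the smallest group of records sharing the
    same combination of quasi-identifier values (the dataset's k)."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical synthetic dataset with two quasi-identifiers.
synthetic = pd.DataFrame({
    "age_group": ["18-29", "18-29", "30-49", "30-49", "30-49"],
    "postcode":  ["0150",  "0150",  "0150",  "5003",  "5003"],
})

k = k_anonymity(synthetic, ["age_group", "postcode"])
print(f"k = {k}")  # k = 1: one record is unique on these attributes
```

    A low k (especially k = 1) signals records that stand out, which may warrant lower fidelity or suppression of outliers.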

    You should also consider the following:

    • Folder structure: If the original data are sensitive and cannot be shared, consider providing an empty placeholder file or a low-fidelity synthetic dataset.
       
    • Provision of sample data: When access to a dataset is restricted, a risk-free sample dataset can help users understand its structure and content before placing a request for full access.
       
    • Metadata and codebooks: You can improve the reusability of synthetic survey data by describing the variables in a standard-format codebook rather than in a generic text file (a minimal sketch follows this list).
       
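    As a minimal illustration of the codebook point above (the variable names and CSV layout are hypothetical, not a specific metadata standard):

```python
import pandas as pd

# Illustrative codebook: one row per variable in the synthetic dataset.
codebook = pd.DataFrame({
    "variable":    ["age", "age_group", "income"],
    "type":        ["integer", "categorical", "numeric"],
    "description": ["Age in whole years",
                    "Age bracket derived from age",
                    "Gross annual income, rounded to the nearest 100"],
    "values":      ["18-90", "18-29 / 30-49 / 50-69 / 70+", ">= 0"],
})
codebook.to_csv("codebook.csv", index=False)
```
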
    When should I use synthetic data?

    • As an interim step: Synthetic data can serve as a preliminary version of a dataset when you want to share data that contain personal information. For instance, it can help recipients explore the contents and decide which variables, or how many observations, they need from the actual dataset.
       
    • For exploratory analysis: Synthetic data can be used to test statistical relationships without accessing the actual dataset. This requires that the variables in the synthetic dataset reasonably reflect the distributions of the real data. A synthetic data tool can achieve this by taking the original data as input and generating output that is statistically similar but contains no real individuals and cannot be linked to rows in the original dataset (the tool-based sketch at the end of this section illustrates this).
       
    • As dummy data: Synthetic data may also be used as “dummy data” to develop or test methods or code without accessing real data. This type of synthetic data is typically generated using strictly generative tools. In such cases, the synthetic dataset does not need to be statistically similar to the real data, only structurally similar (i.e., containing the same variable names and data types). If the data mimic anything statistically, it might be generalisable distributions, such as a normal distribution within a population (see the sketch after this list).
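    Below is a minimal sketch of strictly generative dummy data: it reuses only the column names and dtypes of a hypothetical real dataset and fills them with random values, so nothing statistical carries over:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

def make_dummy(real: pd.DataFrame, n_rows: int = 50) -> pd.DataFrame:
    """Generate data with the same columns and dtypes as `real`,
    but with purely random contents."""
    columns = {}
    for name, dtype in real.dtypes.items():
        if pd.api.types.is_integer_dtype(dtype):
            columns[name] = rng.integers(0, 100, size=n_rows)
        elif pd.api.types.is_float_dtype(dtype):
            columns[name] = rng.random(size=n_rows)
        else:  # treat everything else as strings/categories
            columns[name] = [f"value_{i % 5}" for i in range(n_rows)]
    return pd.DataFrame(columns)

# Hypothetical "real" structure; only names and dtypes are reused.
real = pd.DataFrame({"age": [34], "score": [0.7], "region": ["north"]})
print(make_dummy(real).dtypes)
```
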
    How can I create synthetic data?

    Creating synthetic data typically requires specialised software tools. These tools use statistical models and machine learning algorithms to generate datasets that preserve the statistical characteristics of the original data while protecting sensitive information.

    The general process involves the following steps (an end-to-end sketch follows the list):

    • Data preparation: Prepare the original dataset by identifying and handling missing values, cleaning the data, and ensuring they are formatted correctly for modelling. 
       
    • Model training: Train a statistical or machine learning model on the original data. The model learns the underlying patterns and distributions in the data. 
       
    • Data generation: Use the trained model to generate a new dataset that reflects the statistical properties of the original dataset but contains entirely fictitious values. 
       
    • Evaluation and validation: Evaluate the quality of the synthetic data by comparing their statistical properties with those of the original dataset to ensure that both privacy and usability are preserved.

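    As one concrete illustration of these four steps, the sketch below uses the open-source SDV library for Python (API as in SDV 1.x; the input file name is hypothetical), with a simple describe() comparison standing in for a fuller evaluation:

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# 1. Data preparation: load and clean the (hypothetical) original data.
real = pd.read_csv("original_data.csv").dropna()

# 2. Model training: fit a synthesizer to the original data.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real)

# 3. Data generation: sample an entirely fictitious dataset.
synthetic = synthesizer.sample(num_rows=len(real))

# 4. Evaluation and validation: compare basic statistical properties.
print(real.describe())
print(synthetic.describe())
```
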
    Two examples of tools are described in the Tools section. You can also read more about synthetic data in a research article referenced in the Resources section.