ACROBAT - a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology

SND-ID: 2022-190-1. Version: 1. DOI: https://doi.org/10.48723/w728-p041

Citation

Alternative title

ACROBAT

Creator/Principal investigator(s)

Mattias Rantalainen - Karolinska Institutet, Department of Medical Epidemiology and Biostatistics

Johan Hartman - Karolinska Institutet, Department of Oncology-Pathology

Research principal

Karolinska Institutet - Department of Medical Epidemiology and Biostatistics

Description

The ACROBAT data set consists of 4,212 whole slide images (WSIs) from 1,153 female primary breast cancer patients. The WSIs in the data set are available at 10X magnification and show tissue sections from breast cancer resection specimens stained with hematoxylin and eosin (H&E) or immunohistochemistry (IHC). For each patient, one WSI of H&E-stained tissue and at least one, and up to four, WSIs of corresponding tissue stained with the routine diagnostic stains ER, PGR, HER2 and KI67 are available. The data set was acquired as part of the CHIME study (chimestudy.se) and its primary purpose was to facilitate the ACROBAT WSI registration challenge (acrobat.grand-challenge.org). The histopathology slides originate from routine diagnostic pathology workflows and were digitised for research purposes at Karolinska Institutet (Stockholm, Sweden). The image acquisition process resembles the routine digital pathology image digitisation workflow, using three different Hamamatsu WSI scanners, specifically one NanoZoomer S360 and two NanoZoomer XR. The WSIs in this data set are accompanied by a data table with one row for each WSI, specifying an anonymised patient ID, the stain or IHC antibody type of each WSI, as well as the magnification and microns per pixel at each available resolution level. The performance of automated registration algorithms can be evaluated through the ACROBAT challenge website, based on over 37,000 landmark pair annotations from 13 annotators. While the primary purpose of this data set was the development and evaluation of WSI registration methods, it has the potential to facilitate further research in computational pathology, for example in the areas of stain-guided learning, virtual staining, unsupervised learning and stain-independent models.
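
As an illustration of the one-row-per-WSI data table described above, the following minimal sketch groups WSIs by anonymised patient ID and checks the expected stain composition per case. The column names `anon_id` and `stain`, the stain label `HE`, and the sample rows are assumptions made for this example only; the actual headers and values in the distributed table may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt: the column names "anon_id" and "stain" and the label
# "HE" are assumptions, not the actual headers/values of the ACROBAT table.
SAMPLE_TABLE = """anon_id,stain
0,HE
0,ER
0,KI67
1,HE
1,HER2
"""


def stains_by_patient(table_file):
    """Map each anonymised patient ID to the list of stains of its WSIs."""
    grouped = defaultdict(list)
    for row in csv.DictReader(table_file):
        grouped[row["anon_id"]].append(row["stain"])
    return dict(grouped)


stains = stains_by_patient(io.StringIO(SAMPLE_TABLE))

# Per the description: one H&E WSI and one to four IHC WSIs per case.
for patient, patient_stains in stains.items():
    ihc = [s for s in patient_stains if s != "HE"]
    assert patient_stains.count("HE") == 1 and 1 <= len(ihc) <= 4
```

With the real table, `io.StringIO(SAMPLE_TABLE)` would be replaced by an open file handle, and the column names adjusted to match.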

The data set consists of three subsets, the training, validation and test set, based on the ACROBAT WSI registration challenge. There are 750 cases in the training set, for each of which one H&E WSI and one to four IHC WSIs are available, with 3406 WSIs in total. The validation set consists of 100 cases with 200 WSIs in total and the test set of 303 cases with 606 WSIs in total. Both for the validation and test set, one H&E WSI as well as one randomly selected IHC WSI is available.
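
The subset sizes can be cross-checked against the totals given in the description. A minimal sketch using only the figures quoted in this record:

```python
# Case and WSI counts per subset, as quoted in this record.
SUBSETS = {
    "train": {"cases": 750, "wsis": 3406},
    "valid": {"cases": 100, "wsis": 200},
    "test": {"cases": 303, "wsis": 606},
}

total_cases = sum(s["cases"] for s in SUBSETS.values())
total_wsis = sum(s["wsis"] for s in SUBSETS.values())

# Validation and test cases each have one H&E and one IHC WSI,
# so their WSI count is exactly twice the number of cases.
paired_ok = all(
    SUBSETS[name]["wsis"] == 2 * SUBSETS[name]["cases"]
    for name in ("valid", "test")
)

print(total_cases)  # 1153
print(total_wsis)   # 4212
print(paired_ok)    # True
```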

WSIs were anonymised by deleting the associated macro images, by generating filenames with random case IDs and by overwriting metadata fields that could contain personal information. Hamamatsu NDPI files were then converted using libvips (libvips.org/). WSIs are available as generic tiled TIFF WSIs (openslide.org/formats/generic-tiff/) at 10X magnification and lower image levels.

The data set is available for download in seven separate ZIP archives: five for the training data (train_part1.zip (71.47 GB), train_part2.zip (70.59 GB), train_part3.zip (75.91 GB), train_part4.zip (71.63 GB) and train_part5.zip (69.09 GB)), one for the validation data (valid.zip (21.79 GB)) and one for the test data (test.zip (68.11 GB)).
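
For planning storage before downloading, the archive sizes listed above can be summed. This is a minimal sketch based only on the sizes quoted in this record; the space needed for the unpacked files is not stated here and should be budgeted separately.

```python
# Archive sizes in GB, as listed in this record.
ARCHIVE_SIZES_GB = {
    "train_part1.zip": 71.47,
    "train_part2.zip": 70.59,
    "train_part3.zip": 75.91,
    "train_part4.zip": 71.63,
    "train_part5.zip": 69.09,
    "valid.zip": 21.79,
    "test.zip": 68.11,
}

train_gb = sum(
    size for name, size in ARCHIVE_SIZES_GB.items() if name.startswith("train_")
)
total_gb = sum(ARCHIVE_SIZES_GB.values())

print(f"training archives: {train_gb:.2f} GB")  # training archives: 358.69 GB
print(f"all archives: {total_gb:.2f} GB")       # all archives: 448.59 GB
```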

File listings and checksums in SHA1 format are available for checking archive/data integrity when downloading.
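
A downloaded archive can be checked against its published SHA1 checksum along the following lines. This is a minimal sketch: the layout of the checksum listing is not described in this record, so the expected digest is passed in as a plain hex string, and the helper names are illustrative.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the SHA1 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha1()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare a file's SHA1 digest with an expected hex string (case-insensitive)."""
    return sha1_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_checksum("valid.zip", expected)` would return `True` only if the local file is byte-identical to the published archive, where `expected` is the digest copied from the checksum listing.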

SND would appreciate being notified of any publications using this data set by email to request@snd.gu.se, but this is not required to use the data.

Data contains personal data

No

Language

Method and outcome

Unit of analysis

Population

Anonymised female primary breast cancer patients from the Stockholm region

Study design

Observational study

Sampling procedure

A subset of the whole-slide-images generated as part of the CHIME study was randomly selected for the ACROBAT data set. Training and validation data are a random subset, whereas the test data were selected using stratified sampling, taking into account biomarker statuses and the scanner model used to generate the respective whole-slide-image.

Time period(s) investigated

2012 – 2018

Number of individuals/objects

1153

Data format / data structure

Data collection
  • Description of the mode of collection: Archived routine clinical diagnostic tissue slides with tissue material were scanned using whole-slide-image scanners at Karolinska Institutet.
  • Time period(s) for data collection: 2012 – 2018
  • Data collector: Karolinska Institutet
  • Instrument: NanoZoomer XR (Technical instrument(s)) - Hamamatsu whole-slide-imaging scanner
  • Instrument: NanoZoomer S360 (Technical instrument(s)) - Hamamatsu whole-slide-imaging scanner

Geographic coverage

Geographic spread

Geographic location: Stockholm County

Administrative information

Responsible department/unit

Department of Medical Epidemiology and Biostatistics

Contributor(s)

Masi Valkonen - University of Turku, Institute of Biomedicine

Kimmo Kartasalo - Karolinska Institutet, Department of Medical Epidemiology and Biostatistics

Kajsa Ledesma Eriksson - Karolinska Institutet, Department of Medical Epidemiology and Biostatistics

Leena Latonen - University of Eastern Finland, Institute of Biomedicine

Constance Boissin - Karolinska Institutet, Department of Medical Epidemiology and Biostatistics

Yanbo Feng - Karolinska Institutet, Department of Medical Epidemiology and Biostatistics

Philippe Weitz - Karolinska Institutet, Department of Medical Epidemiology and Biostatistics

Dusan Rasic - Zealand University Hospital, Department of Surgical Pathology

Sonja Koivukoski - University of Eastern Finland, Institute of Biomedicine

Pekka Ruusuvuori - University of Turku, Institute of Biomedicine

Circe Carr - University of Turku, Institute of Biomedicine

Sandra Pouplier - Zealand University Hospital, Department of Surgical Pathology

Leslie Solorzano - Karolinska Institutet, Department of Medical Epidemiology and Biostatistics

Abhinav Sharma - Karolinska Institutet, Department of Medical Epidemiology and Biostatistics

Anne-Vibeke Laenkholm - Zealand University Hospital, Institute of Biomedicine

Aino Kuusela - University of Turku, Institute of Biomedicine

Funding 1

  • Funding agency: Swedish Research Council

Funding 2

  • Funding agency: ERA PerMed
  • Funding agency's reference number: ERAPERMED2019-224-ABCAP
  • Project name on the application: Advancing Breast Cancer histopathology towards AI-based Personalised medicine

Funding 3

  • Funding agency: Swedish Cancer Society

Ethics Review

Stockholm - Ref. 2017/2106-31

Amendment: 2018/1462-32

Topic and keywords

Research area

Science and technology (CESSDA Topic Classification)

Information technology (CESSDA Topic Classification)

Medical image processing (Standard för svensk indelning av forskningsämnen 2011)

Medical and health sciences (Standard för svensk indelning av forskningsämnen 2011)

Cancer and oncology (Standard för svensk indelning av forskningsämnen 2011)

Publications

Weitz, P. et al. (2022). ACROBAT - a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology. arXiv:2211.13621.
DOI: https://doi.org/10.48550/ARXIV.2211.13621

Weitz P, Valkonen M, Solorzano L, Carr C, Kartasalo K, Boissin C, Koivukoski S, Kuusela A, Rasic D, Feng Y, Sinius Pouplier S, Sharma A, Ledesma Eriksson K, Latonen L, Laenkholm AV, Hartman J, Ruusuvuori P, Rantalainen M. A Multi-Stain Breast Cancer Histological Whole-Slide-Image Data Set from Routine Diagnostics. Sci Data. 2023 Aug 24;10(1):562.
DOI: https://doi.org/10.1038/s41597-023-02422-6

License

CC BY 4.0

Versions

Version 1: 2023-01-02

DOI: https://doi.org/10.48723/w728-p041

Contacts for questions about the data

Philippe Weitz

philippe.weitz@ki.se

Mattias Rantalainen

mattias.rantalainen@ki.se

Published: 2023-01-02
Last updated: 2023-10-20