Tokenized product information for centrally approved medicines within EU (extracted May 3, 2022)

SND-ID: 2022-157-1. Version: 1. DOI: https://doi.org/10.57804/ggrw-hr06

Citation

Creator/Principal investigator(s)

Gabriel Westman - Uppsala University

Research principal

Uppsala University - Department of Medical Sciences

Description

The text corpus was compiled on May 3, 2022, by scripted downloading of all available English language product information files for all centrally approved medicinal products within the EU, from the European Medicines Agency website. Package Leaflet (PL) and Summary of product characteristics (SmPC) documents for each medicinal product, excluding multiplicate documents for medicinal products with more than one strength or pharmaceutical preparation, were used. The PDF files were scraped using the pdfplumber version 0.6.1 package in Python 3.8.10 to extract all text except page numbering, headers, and footers.
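The extraction step described above could be sketched roughly as follows. This is a minimal illustration, not the script used to build the corpus: the helper names `body_bbox` and `extract_body_text`, and the idea of dropping headers, footers, and page numbers by cropping a fixed fraction (`margin_frac`) off the top and bottom of each page, are assumptions made for this example. Only `pdfplumber.open`, `page.bbox`, `page.crop`, and `extract_text` are actual pdfplumber calls.

```python
def body_bbox(page_bbox, margin_frac=0.08):
    """Return a crop box that excludes the top and bottom margins.

    page_bbox is (x0, top, x1, bottom) in page coordinates.
    margin_frac is a hypothetical tuning value, not taken from the dataset.
    """
    x0, top, x1, bottom = page_bbox
    margin = (bottom - top) * margin_frac
    return (x0, top + margin, x1, bottom - margin)


def extract_body_text(pdf_path, margin_frac=0.08):
    """Extract text from every page, skipping the header and footer bands."""
    import pdfplumber  # third-party package, imported lazily

    parts = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            body = page.crop(body_bbox(page.bbox, margin_frac))
            # extract_text() may return None or an empty string for blank pages
            parts.append(body.extract_text() or "")
    return "\n".join(parts)
```

A call such as `extract_body_text("example_product_information.pdf")` (a placeholder file name) would return the body text of one document as a single string. A suitable margin depends on the page layout of the source files and would need to be checked against real documents.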

Line breaks and special characters (excluding punctuation characters) were removed, and punctuation was added to sentences where this was missing (such as headings) to avoid false aggregation. All paragraphs were tokenized at the sentence level using the Natural Language Toolkit (NLTK) version 3.7 tokenizer.
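The cleaning and sentence tokenization described above might look something like the sketch below. It is illustrative only: `normalize_paragraph` and `split_sentences` are hypothetical names, the set of characters treated as terminal punctuation is an assumption, and the removal of special characters is left out because the catalogue entry does not say which characters were affected. `sent_tokenize` is the NLTK sentence tokenizer; in NLTK 3.7 it needs the Punkt model data, which can be fetched with `nltk.download("punkt")`.

```python
import re


def normalize_paragraph(text):
    """Flatten line breaks and make sure the paragraph ends with punctuation."""
    # Collapse line breaks and other whitespace runs into single spaces.
    flat = re.sub(r"\s+", " ", text).strip()
    # Headings often lack a final full stop; add one so the tokenizer
    # does not merge the heading with the sentence that follows it.
    if flat and flat[-1] not in ".!?:;":
        flat += "."
    return flat


def split_sentences(paragraphs):
    """Tokenize cleaned paragraphs into a flat list of sentences."""
    from nltk.tokenize import sent_tokenize  # third-party, needs Punkt data

    sentences = []
    for paragraph in paragraphs:
        cleaned = normalize_paragraph(paragraph)
        if cleaned:
            sentences.extend(sent_tokenize(cleaned))
    return sentences
```

Normalizing each paragraph before tokenizing keeps a heading such as "4.1 Therapeutic indications" as a separate unit instead of letting it run into the text below it.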

This database contains sentence-level tokenized product information from all centrally approved medicinal products within the EU (as of May 3, 2022), including Summary of product characteristics (SmPC) and Package leaflet (PL) documents.

A total of 1258 medicinal products were initially included, of which 5 were subsequently excluded due to document compatibility issues. From the remaining 1253 products, approximately 783,000 sentences were extracted from the PL and SmPC documents.

Data contains personal data

No

Language

Method and outcome

Population

All centrally approved medicinal products within the EU

Study design

Observational study

Description of study design

Health informatics study on information about approved medicinal products.

Administrative information

Responsible department/unit

Department of Medical Sciences

Commissioning organisation

Swedish Medical Products Agency

Topic and keywords

Research area

Computer and information science (Standard för svensk indelning av forskningsämnen 2011)

Basic medicine (Standard för svensk indelning av forskningsämnen 2011)

Publications

Bergman E, Sherwood K, Forslund M, Arlett P, Westman G (2022) A natural language processing approach towards harmonisation of European medicinal product information. PLoS ONE 17(10): e0275386. https://doi.org/10.1371/journal.pone.0275386

Published: 2022-09-29
Last updated: 2022-10-25