Code and Data for “Classification of Medieval Documents: Determining the Issuer, Place of Issue, and Decade for Old Swedish Charters”

SND-ID: 2024-283. Version: 1. DOI: https://doi.org/10.57804/e9cs-gh75

Creator/Principal investigator(s)

Mats Dahllöf - Uppsala University

Research principal

Uppsala University

Description

Code and data for the article “Classification of Medieval Documents: Determining the Issuer, Place of Issue, and Decade for Old Swedish Charters” (to appear in DHN2020, Digital Humanities in the Nordic Countries, Riga, 17–20 March 2020).
The zip file contains Python code, an XML data file, and a PDF document.

The study based on this code and dataset is a comparative exploration of different classification tasks for Swedish medieval charters (transcriptions from the SDHK collection) and different classifier setups. In particular, we explore the identification of the issuer, place of issue, and decade of production. The experiments used features based on lowercased words and character 3- and 4-grams. We evaluated the performance of two learning algorithms: linear discriminant analysis and decision trees. Evaluation used five-fold cross-validation, and we report accuracy and macro-averaged F1 score. The validation made use of six labeled subsets of SDHK, combining the three tasks with two languages, Old Swedish and Latin. Issuer identification for the Latin dataset (595 charters from 12 issuers) reached the highest scores, above 0.9, for the decision tree classifier using word features. The best corresponding accuracy for Old Swedish was 0.81. Place and decade identification produced lower performance scores for both languages. Which classifier design works best seems to depend on the peculiarities of the dataset and the classification task. The study does, however, support the idea that text classification is also useful for medieval documents characterized by extreme spelling variation.

The dataset was originally published in DiVA and moved to SND in 2024.
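
The feature types and evaluation measures named in the description can be illustrated with a minimal, standard-library-only Python sketch. It is not the code shipped in the zip file: all function names are hypothetical, the whitespace tokenization, the lowercasing of text before n-gram extraction, and the round-robin fold assignment are assumptions, and the classifiers themselves (linear discriminant analysis, decision trees) are left out.

```python
from collections import Counter


def word_features(text):
    """Count lowercased tokens (assumption: tokens are whitespace-separated)."""
    return Counter(text.lower().split())


def char_ngram_features(text, sizes=(3, 4)):
    """Count character n-grams of the given sizes.

    Assumption: the text is lowercased first; the description does not say
    whether the character n-grams were case-normalized.
    """
    lowered = text.lower()
    counts = Counter()
    for n in sizes:
        for i in range(len(lowered) - n + 1):
            counts[lowered[i:i + n]] += 1
    return counts


def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1.

    Simplification: only classes that occur in the gold labels are averaged.
    """
    scores = []
    for label in sorted(set(gold)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def five_fold_indices(n_items, k=5):
    """Yield (train, test) index lists; item i is tested in fold i % k.

    Assumption: plain round-robin folds, with no shuffling or stratification.
    """
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test
```

In such a setup, each charter transcription would be mapped to one of the two feature dictionaries, a classifier would be trained on the training indices of each fold, and accuracy and macro-averaged F1 would be computed on the held-out fold.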

Data contains personal data

No

Administrative information

Funding

  • Funding agency: Riksbankens Jubileumsfond
  • Funding agency's reference number: NHS 14-2068:1

Topic and keywords

Research area

Language technology (computational linguistics) (Standard för svensk indelning av forskningsämnen 2011)

Versions

Version 1. 2020-01-03

DOI: https://doi.org/10.57804/e9cs-gh75

Published: 2020-01-03
Last updated: 2024-08-22