Uppsala University Publications
Code and Data for “Classification of Medieval Documents: Determining the Issuer, Place of Issue, and Decade for Old Swedish Charters”
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
ORCID iD: 0000-0002-4990-7880
2020 (English). Data set, Aggregated data
Abstract [en]

Code and data for the article “Classification of Medieval Documents: Determining the Issuer, Place of Issue, and Decade for Old Swedish Charters” (to appear in DHN2020, Digital Humanities in the Nordic Countries, Riga, 17–20 March 2020).

The study based on this code and dataset is a comparative exploration of different classification tasks and classifier setups for Swedish medieval charters (transcriptions from the SDHK collection). In particular, we explore the identification of the issuer, place of issue, and decade of production. The experiments used features based on lowercased words and character 3- and 4-grams. We evaluated the performance of two learning algorithms: linear discriminant analysis and decision trees. For evaluation, five-fold cross-validation was performed, and we report accuracy and macro-averaged F1 score. The validation made use of six labeled subsets of SDHK, combining the three tasks with the two languages, Old Swedish and Latin. Issuer identification for the Latin dataset (595 charters from 12 issuers) reached the highest scores, above 0.9, for the decision tree classifier using word features. The best corresponding accuracy for Old Swedish was 0.81. Place and decade identification produced lower performance scores for both languages. Which classifier design is best seems to depend on peculiarities of the dataset and the classification task. The present study does, however, support the idea that text classification is also useful for medieval documents characterized by extreme spelling variation.
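The feature types and evaluation measures named in the abstract can be illustrated with a minimal, standard-library sketch. It is not the code shipped in the dataset: the function names, whitespace tokenization, lowercasing of the character n-grams, and the choice to average F1 over the union of gold and predicted labels are all assumptions made for illustration.

```python
from collections import Counter


def word_features(text):
    """Count lowercased, whitespace-separated words (tokenization is an assumption)."""
    return Counter(text.lower().split())


def char_ngram_features(text, sizes=(3, 4)):
    """Count character n-grams of the given sizes over the lowercased text."""
    lowered = text.lower()
    counts = Counter()
    for n in sizes:
        for start in range(len(lowered) - n + 1):
            counts[lowered[start:start + n]] += 1
    return counts


def accuracy(gold, predicted):
    """Share of items whose predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all labels seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        denominator = 2 * tp + fp + fn
        # A class with no true or predicted instances contributes an F1 of 0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)
```

Because macro-averaging weights every class equally, a classifier that does well on frequent issuers but poorly on rare ones can show a high accuracy alongside a noticeably lower macro-averaged F1, which is one reason to report both.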

Place, publisher, year
2020.
Version
1.0
National Category
Language Technology (Computational Linguistics)
Research subject
History; Computational Linguistics; Scandinavian Languages; Latin
Identifiers
URN: urn:nbn:se:uu:diva-400834
OAI: oai:DiVA.org:uu-400834
DiVA, id: diva2:1382485
Projects
New Eyes on Sweden’s Medieval Scribes. Scribal Attribution using Digital Palaeography in the Medieval Gothic Script (Riksbankens Jubileumsfond, Dnr NHS14-2068:1)
Funder
Riksbankens Jubileumsfond, NHS 14-2068:1
Available from: 2020-01-03 Created: 2020-01-03 Last updated: 2020-01-10 Bibliographically approved

Open Access in DiVA

dhn2020supplement (7917 kB)
File information
File name: DATASET01.zip
File size: 7917 kB
Description: Zip-file containing Python code, an XML data file, and a pdf document.
Checksum (SHA-512): 90951a31d2c26bc81c7e9d0bdf2c80b4941d387f329bea81fe3b935c92226b86109d216430b30b318f9b0e2fab495b9935e37435cbe8370ccb6ebe8823508fc6
Type: dataset
Mimetype: application/zip
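
A downloaded copy of the archive can be checked against the SHA-512 checksum listed in the file information. The following is a minimal sketch using Python's standard `hashlib` module; the function names and the chunk size are illustrative choices, and the local file path is an assumption.

```python
import hashlib

# SHA-512 checksum as listed in the file information for DATASET01.zip.
EXPECTED_SHA512 = (
    "90951a31d2c26bc81c7e9d0bdf2c80b4941d387f329bea81fe3b935c92226b86"
    "109d216430b30b318f9b0e2fab495b9935e37435cbe8370ccb6ebe8823508fc6"
)


def sha512_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected=EXPECTED_SHA512):
    """Compare a file's digest with the expected value, ignoring letter case."""
    return sha512_of_file(path) == expected.lower()
```

With the archive saved in the working directory, `matches_checksum("DATASET01.zip")` should return `True` for an intact download.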

Author
Dahllöf, Mats
