Uppsala University Publications
Dahllöf, Mats
Publications (10 of 22)
Dahllöf, M. (2020). Code and Data for “Classification of Medieval Documents: Determining the Issuer, Place of Issue, and Decade for Old Swedish Charters”.
2020 (English) Data set, Aggregated data
Abstract [en]

Code and data for the article “Classification of Medieval Documents: Determining the Issuer, Place of Issue, and Decade for Old Swedish Charters” (to appear in DHN2020 Digital Humanities in the Nordic Countries, Riga, 17–20 March 2020).

The study based on this code and dataset is a comparative exploration of different classification tasks for Swedish medieval charters (transcriptions from the SDHK collection) and different classifier setups. In particular, we explore the identification of the issuer, place of issue, and decade of production. The experiments used features based on lowercased words and character 3- and 4-grams. We evaluated the performance of two learning algorithms: linear discriminant analysis and decision trees. For evaluation, five-fold cross-validation was performed. We report accuracy and macro-averaged F1 score. The validation made use of six labeled subsets of SDHK combining the three tasks with Old Swedish and Latin. Issuer identification for the Latin dataset (595 charters from 12 issuers) reached the highest scores, above 0.9, for the decision tree classifier using word features. The best corresponding accuracy for Old Swedish was 0.81. Place and decade identification produced lower performance scores for both languages. Which classifier design is the best one seems to depend on peculiarities of the dataset and the classification task. The present study does however support the idea that text classification is useful also for medieval documents characterized by extreme spelling variation.
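The feature extraction and one of the reported metrics can be illustrated with a minimal sketch. It is not the published code: the function names `char_ngrams` and `macro_f1` are hypothetical, and the handling of whitespace and of classes with no predictions is an assumption made for the example.

```python
from collections import Counter


def char_ngrams(text, sizes=(3, 4)):
    """Count character n-grams of the given sizes in lowercased text."""
    text = text.lower()
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    scores = []
    for label in sorted(set(gold)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0  # assumption: 0 when undefined
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)
```

Because macro-averaging weights every class equally, a rare issuer or place counts as much as a frequent one, which is why it is a useful complement to accuracy on unbalanced label sets.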

National Category
Language Technology (Computational Linguistics)
Research subject
History; Computational Linguistics; Scandinavian Languages; Latin
Identifiers
urn:nbn:se:uu:diva-400834 (URN)
Projects
New Eyes on Sweden’s Medieval Scribes. Scribal Attribution using Digital Palaeography in the Medieval Gothic Script (Riksbankens Jubileumsfond, Dnr NHS14-2068:1)
Funder
Riksbankens Jubileumsfond, NHS 14-2068:1
Available from: 2020-01-03 Created: 2020-01-03 Last updated: 2020-01-10 Bibliographically approved
Berglund, K., Dahllöf, M. & Määttä, J. (2019). Apples and Oranges? Large-Scale Thematic Comparisons of Contemporary Swedish Popular and Literary Fiction. Samlaren: tidskrift för svensk litteraturvetenskaplig forskning, 140, 228-260
2019 (English) In: Samlaren: tidskrift för svensk litteraturvetenskaplig forskning, ISSN 0348-6133, E-ISSN 2002-3871, Vol. 140, p. 228-260. Article in journal (Refereed) Published
Abstract [en]

Karl Berglund, Department of Literature, Uppsala University

Mats Dahllöf, Department of Linguistics and Philology, Uppsala University

Jerry Määttä, Department of Literature, Uppsala University

The aim of this article is to compare thematic trends in contemporary Swedish bestselling and literary fiction with the help of a computational method—topic modelling—which extracts content themes based on statistical patterns of word usage. This procedure allows us to identify trends and patterns that are not easily discovered through manual reading. We track topics in two subsets of Swedish fiction from the period 2004–2017: 1) prose fiction on the Swedish bestseller charts, and 2) prose fiction shortlisted for the August Prize (arguably the most prestigious Swedish literary prize). The results confirm several assumptions about contemporary popular and literary fiction, such as more plot-focused themes in popular fiction and themes more connected to settings in literary fiction. But the outcomes also provide new, and more surprising knowledge, such as food and economy being the most biased themes among the non-crime fiction bestsellers, whereas themes concerning nature are most biased in the literary realm. Moreover, themes relating to sex, intimacy, and violence are biased towards literary fiction rather than popular fiction. In the light of our findings, we argue that both popular fiction and literary fiction seem to be characterised by certain thematic attributes that make it relevant to discuss them as genres also on a textual-thematic level.

Place, publisher, year, edition, pages
Uppsala: Svenska Litteratursällskapet, 2019
Keywords
Bestsellers, The August Prize, Swedish literary fiction, distant reading, topic modelling, popular fiction
National Category
General Literature Studies
Research subject
Literature
Identifiers
urn:nbn:se:uu:diva-406939 (URN)
Available from: 2020-03-16 Created: 2020-03-16 Last updated: 2020-03-19 Bibliographically approved
Dahllöf, M. (2019). Clustering writing components from medieval manuscripts. In: Michael Piotrowski (Ed.), Proceedings of the Workshop on Computational Methods in the Humanities 2018. Paper presented at COMHUM 2018 Workshop on Computational Methods in the Humanities 2018, Lausanne, Switzerland, June 4–5, 2018 (pp. 23-32).
2019 (English) In: Proceedings of the Workshop on Computational Methods in the Humanities 2018 / [ed] Michael Piotrowski, 2019, p. 23-32. Conference paper, Published paper (Refereed)
Abstract [en]

This article explores a minimally supervised method for extracting components, mostly letters, from historical manuscripts, and clustering them into classes capturing linguistic equivalence. The clustering uses the DBSCAN algorithm and an additional classification step. This pipeline gives us cheap, but partial, manuscript transcription in combination with human annotation. Experiments with different parameter settings suggest that a system like this should be tuned separately for different categories, rather than rely on one-pass application of algorithms partitioning the same components into non-overlapping clusters. The method could also be used to extract features for manuscript classification, e.g. dating and scribe attribution, as well as to extract data for further palaeographic analysis.
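For readers unfamiliar with DBSCAN, the clustering step named in the abstract can be sketched in plain Python. This is a textbook version of the algorithm, not the article's implementation: the parameter names `eps` and `min_pts`, the Euclidean metric, and the representation of components as numeric tuples are assumptions for illustration.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels
```

DBSCAN leaves sparse points unclustered (label -1) instead of forcing every component into a class, which fits a pipeline that aims at cheap but partial transcription.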

Series
CEUR Workshop Proceedings, ISSN 1613-0073 ; 2314
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-375111 (URN)
Conference
COMHUM 2018 Workshop on Computational Methods in the Humanities 2018, Lausanne, Switzerland, June 4–5, 2018.
Funder
Swedish Research Council, 2012-5743; Riksbankens Jubileumsfond, NHS14-2068:1
Available from: 2019-01-26 Created: 2019-01-26 Last updated: 2019-05-09 Bibliographically approved
Dahllöf, M. & Berglund, K. (2019). Faces, Fights, and Families: Topic Modeling and Gendered Themes in Two Corpora of Swedish Prose Fiction. In: Constanza Navaretta et al. (Ed.), DHN 2019 Copenhagen, Proceedings of 4th Conference of The Association Digital Humanities in the Nordic Countries Copenhagen, March 6-8 2019. Paper presented at DHN 2019, 4th Digital Humanities in the Nordic Countries 2019, University of Copenhagen, Copenhagen, Denmark, March 6–8, 2019 (pp. 92-111).
2019 (English) In: DHN 2019 Copenhagen, Proceedings of 4th Conference of The Association Digital Humanities in the Nordic Countries Copenhagen, March 6-8 2019 / [ed] Constanza Navaretta et al., 2019, p. 92-111. Conference paper, Published paper (Refereed)
Abstract [en]

This paper explores topic modeling (TM) as a tool for “distant reading” of two Swedish literary corpora. We investigate what kinds of insight and knowledge a TM-based approach can provide to Swedish literary history, and which methodological difficulties are associated with this endeavour. The TM is based on 12- and 24-term chunks of selected verb and common noun lemmas. We generate models with 20, 40, and 100 topics. We also propose a method for a quantitative and qualitative gendered thematic analysis by combining TM with a study of how the topics relate to gender in characters and authors. The two corpora contain, respectively, Swedish classics (1821–1941) and recent bestsellers (2004–2017). We find that most of the topics proposed by the TM are easy to interpret as conceptual themes, and that the “same” themes appear for the two corpora and for different TM settings. The study allows us to make interesting observations concerning different aspects of gender and topic distribution.
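The chunking of lemma sequences into fixed-size pseudo-documents can be sketched as follows. The abstract does not say how a trailing remainder shorter than the chunk size is treated; this sketch assumes it is dropped, and the function name `chunk_terms` is hypothetical.

```python
def chunk_terms(lemmas, size=12):
    """Split a lemma sequence into consecutive chunks of exactly `size` terms.

    Assumption: a trailing remainder shorter than `size` is discarded.
    """
    full = len(lemmas) - len(lemmas) % size
    return [lemmas[i:i + size] for i in range(0, full, size)]
```

Each chunk would then be passed to the topic model as one document; calling the function with `size=24` gives the second setting mentioned in the abstract.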

Keywords
Topic Modeling; Distant Reading; Gender Analysis; Literary Methodology; Swedish Prose Fiction; Bestsellers
National Category
General Literature Studies; Language Technology (Computational Linguistics)
Research subject
Literature; Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-382230 (URN)
Conference
DHN 2019, 4th Digital Humanities in the Nordic Countries 2019, University of Copenhagen, Copenhagen, Denmark, March 6–8, 2019
Available from: 2019-04-23 Created: 2019-04-23 Last updated: 2019-09-12 Bibliographically approved
Berglund, K., Dahllöf, M. & Määttä, J. (2019). Supplementary material for “Apples and Oranges? Large-Scale Thematic Comparisons of Contemporary Swedish Popular and Literary Fiction” (Samlaren, 2019).
2019 (English) Report (Other academic)
Abstract [en]

The report provides raw listings of the results of topic modeling experiments, intended for readers interested in taking a closer look at these. Explanations and discussion are found in the main article: “Apples and Oranges? Large-Scale Thematic Comparisons of Contemporary Swedish Popular and Literary Fiction” published in the journal Samlaren, 2019.

Publisher
p. 674
National Category
Specific Literatures
Identifiers
urn:nbn:se:uu:diva-397762 (URN)
Available from: 2019-11-25 Created: 2019-11-25 Last updated: 2020-02-18 Bibliographically approved
Dahllöf, M. (2018). Automatic Scribe Attribution for Medieval Manuscripts. Digital Medievalist, 11(1), 1-26, Article ID 6.
2018 (English) In: Digital Medievalist, ISSN 1715-0736, E-ISSN 1715-0736, Vol. 11, no 1, p. 1-26, article id 6. Article in journal (Refereed) Published
Abstract [en]

We propose an automatic method for attributing manuscript pages to scribes. The system uses digital images as published by libraries. The attribution process involves extracting from each query page approximately letter-size components. This is done by means of binarization (ink-background separation), connected component labelling, and further segmentation, guided by the estimated typical stroke width. Components are extracted in the same way from the pages of known scribal origin. This allows us to assign a scribe to each query component by means of nearest-neighbour classification. Distance (dissimilarity) between components is modelled by simple features capturing the distribution of ink in the bounding box defined by the component, together with Euclidean distance. The set of component-level scribe attributions, which typically includes hundreds of components for a page, is then used to predict the page scribe by means of a voting procedure. The scribe who receives the largest number of votes from the 120 strongest component attributions is proposed as its scribe. The scribe attribution process allows the argument behind an attribution to be visualized for a human reader. The writing components of the query page are exhibited along with the matching components of the known pages. This report is thus open to inspection and analysis using the methods and intuitions of traditional palaeography. The present system was evaluated on a data set covering 46 medieval scribes, writing in Carolingian minuscule, Bastarda, and a few other scripts. The system achieved a mean top-1 accuracy of 98.3% as regards the first scribe proposed for each page, when the labelled data comprised one randomly selected page from each scribe and nine unseen pages for each scribe were to be attributed in the validation procedure. The experiment was repeated 50 times to even out random variation effects.
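The page-level voting step can be sketched as below. The sketch assumes that each component-level attribution is available as a (scribe, distance) pair and that "strongest" means smallest nearest-neighbour distance; the function name `predict_page_scribe` and the tie-breaking behaviour are likewise assumptions, not details taken from the article.

```python
from collections import Counter


def predict_page_scribe(attributions, top_n=120):
    """Predict the scribe of a page from component-level attributions.

    attributions: list of (scribe, distance) pairs, one per query component.
    Assumption: a smaller distance means a stronger attribution.
    Returns None when there are no attributions.
    """
    if not attributions:
        return None
    strongest = sorted(attributions, key=lambda pair: pair[1])[:top_n]
    votes = Counter(scribe for scribe, _ in strongest)
    # Ties are resolved in favour of the scribe seen first among the strongest.
    return votes.most_common(1)[0][0]
```

Restricting the vote to the strongest attributions means that the many weakly matched components on a page do not dilute the evidence from the clearly matched ones.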

National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-371700 (URN); 10.16995/dm.67 (DOI)
Funder
Swedish Research Council, 2012-5743; Riksbankens Jubileumsfond, NHS14-2068:1
Available from: 2018-12-27 Created: 2018-12-27 Last updated: 2019-04-25 Bibliographically approved
Dahllöf, M. (2018). Clustering Writing Components from Medieval Manuscripts. In: Piotrowski, Michael (Ed.), COMHUM 2018: Book of Abstracts for the Workshop on Computational Methods in the Humanities 2018. Paper presented at Workshop on Computational Methods in the Humanities, Lausanne, 4 June - 5 June, 2018 (pp. 11-13). Lausanne
2018 (English) In: COMHUM 2018: Book of Abstracts for the Workshop on Computational Methods in the Humanities 2018 / [ed] Piotrowski, Michael, Lausanne, 2018, p. 11-13. Conference paper, Oral presentation with published abstract (Refereed)
Place, publisher, year, edition, pages
Lausanne, 2018
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-356601 (URN); 10.5281/zenodo.1312779 (DOI)
Conference
Workshop on Computational Methods in the Humanities, Lausanne, 4 June - 5 June, 2018
Projects
Searching and datamining in Large Collections of Historical Handwritten Documents (Vetenskapsrådet, Dnr 2012-5743, PI Anders Brun)
New Eyes on Sweden's Medieval Scribes. Scribal Attribution using Digital Palaeography in the Medieval Gothic Script (Riksbankens Jubileumsfond, Dnr NHS14-2068:1, PI Lasse Mårtensson)
Funder
Swedish Research Council, 2012-5743; Riksbankens Jubileumsfond, NHS14-2068:1
Available from: 2018-08-02 Created: 2018-08-02 Last updated: 2018-11-01 Bibliographically approved
Dahllöf, M. (2014). Predicting the Scribe Behind a Page of Medieval Handwriting. Paper presented at SLTC 2014, The Fifth Swedish Language Technology Conference.
2014 (English) Conference paper, Published paper (Refereed)
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:uu:diva-241362 (URN)
Conference
SLTC 2014, The Fifth Swedish Language Technology Conference
Available from: 2015-01-11 Created: 2015-01-11 Last updated: 2018-01-11
Dahllöf, M. (2014). Scribe attribution for early medieval handwriting by means of letter extraction and classification and a voting procedure for larger pieces. In: 22nd International Conference on Pattern Recognition (ICPR). Paper presented at 22nd International Conference on Pattern Recognition (ICPR), 24-28 Aug, 2014, Stockholm, Sweden (pp. 1910-1915).
2014 (English) In: 22nd International Conference on Pattern Recognition (ICPR), 2014, p. 1910-1915. Conference paper, Published paper (Refereed)
Abstract [en]

The present study investigates a method for the attribution of scribal hands, inspired by traditional palaeography in being based on comparison of letter shapes. The system was developed for and evaluated on early medieval Caroline minuscule manuscripts. The generation of a prediction for a page image involves writing identification, letter segmentation, and letter classification. The system then uses the letter proposals to predict the scribal hand behind a page. Letters and sequences of connected letters are identified by means of connected component labeling and split into letter-size pieces. The hand (and character) prediction makes use of a dataset containing instances of the letters b, d, p, and q, cut out from manuscript pages whose scribal origin is known. Letters are represented by features capturing the distribution of foreground. Cosine similarity is used for nearest neighbor classification. The hand behind a page is finally predicted by means of a voting procedure taking the highest scoring letter-level hits as its input. This hand prediction method was evaluated on pages from five different hands and reached an accuracy above 99% for four of them and 87% for a fifth significantly more difficult one. The hand behind single toplisted letters was correctly predicted in 83% of the cases.
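The letter-level classification step, nearest-neighbour search under cosine similarity, can be sketched as below. How the foreground-distribution features are computed is not specified in the abstract, so the sketch takes ready-made numeric vectors as input; the function names and the example labels are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0  # assumption: 0 for an all-zero vector


def nearest_neighbour(query, labelled):
    """Return (label, similarity) of the labelled vector most similar to query.

    labelled: non-empty list of (feature_vector, label) pairs.
    """
    best_vec, best_label = max(
        labelled, key=lambda item: cosine_similarity(query, item[0])
    )
    return best_label, cosine_similarity(query, best_vec)
```

The returned similarity can serve as the score by which letter-level hits are ranked before the page-level vote.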

Series
International Conference on Pattern Recognition, ISSN 1051-4651
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:uu:diva-241360 (URN); 10.1109/ICPR.2014.334 (DOI); 000359818002005 (); 978-1-4799-5208-3 (ISBN)
Conference
22nd International Conference on Pattern Recognition (ICPR), 24-28 Aug, 2014, Stockholm, Sweden
Available from: 2015-01-11 Created: 2015-01-11 Last updated: 2018-01-11 Bibliographically approved
Wahlberg, F., Dahllöf, M., Mårtensson, L. & Brun, A. (2014). Spotting words in medieval manuscripts. Studia Neophilologica, 86, 171-186
2014 (English) In: Studia Neophilologica, ISSN 0039-3274, E-ISSN 1651-2308, Vol. 86, p. 171-186. Article in journal (Refereed) Published
Abstract [en]

This article discusses the technology of handwritten text recognition (HTR) as a tool for the analysis of historical handwritten documents. We give a broad overview of this field of research, but the focus is on the use of a method called ‘word spotting’ for finding words directly and automatically in scanned images of manuscript pages. We illustrate and evaluate this method by applying it to a medieval manuscript. Word spotting uses digital image analysis to represent stretches of writing as sequences of numerical features. These are intended to capture the linguistically significant aspects of the visual shape of the writing. Two potential words can then be compared mathematically and their degree of similarity assigned a value. Our version of this method gives a false positive rate of about 30%, when the true positive rate is close to 100%, for an application where we search for very frequent short words in a 16th-century Old Swedish cursiva recentior manuscript. Word spotting would be of use e.g. to researchers who want to explore the content of manuscripts when editions or other transcriptions are unavailable.
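The two rates quoted in the abstract depend on where the similarity threshold is set. A minimal sketch of how they could be computed from scored candidates follows; the function name `rates_at_threshold`, the convention that a higher score means greater similarity, and the toy numbers in the usage note are assumptions, not data from the article.

```python
def rates_at_threshold(scores, is_match, threshold):
    """Return (true positive rate, false positive rate) at a score threshold.

    scores: similarity score per candidate word image (higher = more similar).
    is_match: True where the candidate really is the searched word.
    A candidate is retrieved when its score is at least `threshold`.
    """
    pairs = list(zip(scores, is_match))
    tp = sum(1 for s, m in pairs if s >= threshold and m)
    fp = sum(1 for s, m in pairs if s >= threshold and not m)
    positives = sum(1 for m in is_match if m)
    negatives = len(is_match) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr
```

Lowering the threshold retrieves more true instances but also more false hits, which is the trade-off behind reporting a false positive rate at a fixed, near-complete true positive rate.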

National Category
Computer and Information Sciences; General Language Studies and Linguistics; Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-227725 (URN); 10.1080/00393274.2013.871975 (DOI); 000335850200012 ()
Available from: 2014-01-20 Created: 2014-06-30 Last updated: 2018-01-11 Bibliographically approved