2023 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]
Document image processing and handwritten text recognition have been applied to a variety of materials, scripts, and languages, both modern and historic. They are crucial building blocks in the ongoing digitisation efforts of archives, where they aid in preserving archival materials and foster knowledge sharing. The latter is especially facilitated by making document contents available to interested readers who may have little to no practice in, for example, reading a specific script type, and might therefore face challenges in accessing the material.
The first part of this dissertation focuses on reducing editorial artefacts, specifically in the form of struck-through words, in manuscripts. The main goal of this process is to identify struck-through words and remove as much of the strikethrough as possible in order to regain access to the original word. This step can serve both as preprocessing to aid human annotators and readers and as a stage in computerised pipelines, such as handwritten text recognition. Two deep learning-based approaches, exploring paired and unpaired data settings, are examined and compared. Furthermore, an approach for generating synthetic strikethrough data, for example for training and testing purposes, is presented alongside three novel datasets.
The second part of this dissertation is centred around applying handwritten text recognition to the stenographic manuscripts of Swedish children's book author Astrid Lindgren (1907–2002). Manually transliterating stenography, also known as shorthand, requires special domain knowledge of the script itself. Therefore, the main focus of this part is to reduce the required manual work, aiming to increase the accessibility of the material. In this regard, a baseline for handwritten text recognition of Swedish stenography is established. Two approaches for improving upon this baseline are examined. Firstly, a variety of data augmentation techniques commonly used in handwritten text recognition are studied. Secondly, different target sequence encoding methods, which aim to approximate diplomatic transcriptions, are investigated. The latter, in combination with a pre-training approach, significantly improves the recognition performance. In addition to the two presented studies, the novel LION dataset is published, consisting of excerpts from Astrid Lindgren's stenographic manuscripts.
Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2023. p. 87
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 2294
Series
Skrifter utgivna av Svenska barnboksinstitutet, ISSN 0347-5387 ; 166
Keywords
document image processing, handwritten text recognition, stenography, strikethrough
National Category
Computer graphics and computer vision
Research subject
Computerized Image Processing
Identifiers
urn:nbn:se:uu:diva-509138 (URN)
978-91-513-1873-8 (ISBN)
Public defence
2023-10-04, Room 101121, Ångströmlaboratoriet, Lägerhyddsvägen 1, Uppsala, 09:15 (English)
Opponent
Supervisors
2023-09-11; 2023-08-16; 2025-02-07