Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Computational Linguistics)
2016 (English) Doctoral thesis, monograph (Other academic)
Abstract [en]

Historical text constitutes a rich source of information for historians and other researchers in the humanities. Many texts are, however, not available in an electronic format, and even when they are, there is a lack of NLP tools designed to handle historical text. In my thesis, I aim to provide a generic workflow for automatic linguistic analysis and information extraction from historical text, with spelling normalisation as a core component in the pipeline. In the spelling normalisation step, the historical input text is automatically normalised to a more modern spelling, enabling the use of existing taggers and parsers trained on modern language data in the succeeding linguistic analysis step. In the final information extraction step, certain linguistic structures are identified based on the annotation labels given by the NLP tools, and ranked in accordance with the specific information need expressed by the user.
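
The keywords of this record list Levenshtein edit distance as one of the techniques involved in spelling normalisation. The following is a minimal sketch of that general idea, not the implementation used in the thesis: each historical token is mapped to the closest entry in a modern word list, provided the distance stays under a threshold. The function names, the toy lexicon, and the threshold of 2 are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalise(token: str, lexicon: list[str], max_distance: int = 2) -> str:
    """Return the closest lexicon entry within max_distance, else the token unchanged."""
    if token in lexicon:
        return token
    best = min(lexicon, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= max_distance else token


# Hypothetical toy lexicon of modern Swedish word forms (illustration only).
MODERN_LEXICON = ["skriva", "hade", "över"]

print(normalise("skrifwa", MODERN_LEXICON))
print(normalise("hafwa", MODERN_LEXICON))
```

With this toy lexicon, a token that has no close modern counterpart is left unchanged, which keeps the sketch from forcing bad matches onto unknown words.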

An important consideration in my implementation is that the pipeline should be applicable to different languages, time periods, genres, and information needs by simply substituting the language resources used in each module. Furthermore, the reuse of existing NLP tools developed for the modern language is crucial, considering the lack of linguistically annotated historical data combined with the high variability in historical text, making it hard to train NLP tools specifically aimed at analysing historical text.
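
The modular design described above, where each step can be adapted by substituting its language resources, can be illustrated with a small sketch. This is an assumption about one possible structure, not the architecture of the thesis: the `Pipeline` class, the stage functions, and the toy dictionaries are hypothetical and stand in for real normalisers, taggers, and extraction rules.

```python
from dataclasses import dataclass
from typing import Callable

Tagged = tuple[str, str]  # (word form, part-of-speech label)


@dataclass
class Pipeline:
    """Three swappable stages; each stage is an ordinary function."""
    normalise: Callable[[list[str]], list[str]]
    tag: Callable[[list[str]], list[Tagged]]
    extract: Callable[[list[Tagged]], list[str]]

    def run(self, tokens: list[str]) -> list[str]:
        return self.extract(self.tag(self.normalise(tokens)))


# Hypothetical toy resources; a real setup would load language-specific data.
SPELLING = {"at": "att", "skrifwa": "skriva"}
POS = {"att": "PART", "skriva": "VERB"}


def dictionary_normaliser(tokens: list[str]) -> list[str]:
    return [SPELLING.get(token, token) for token in tokens]


def lookup_tagger(tokens: list[str]) -> list[Tagged]:
    # Unknown words receive the placeholder label "X".
    return [(token, POS.get(token, "X")) for token in tokens]


def verb_extractor(tagged: list[Tagged]) -> list[str]:
    return [word for word, pos in tagged if pos == "VERB"]


pipeline = Pipeline(dictionary_normaliser, lookup_tagger, verb_extractor)
print(pipeline.run(["at", "skrifwa"]))
```

Targeting another language or period would, in this sketch, mean passing different functions or dictionaries to `Pipeline` while leaving `run` untouched.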

In my evaluation, I show that spelling normalisation can be a very useful technique for easy access to historical information content, even in cases where there is little (or no) annotated historical training data available. For the specific information extraction task of automatically identifying verb phrases describing work in Early Modern Swedish text, 91 out of the 100 top-ranked instances are true positives in the best setting. 
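
The reported figure of 91 true positives among the 100 top-ranked instances corresponds to a precision of 0.91 at rank 100. A minimal sketch of that calculation follows; the function name is an assumption, and the ranked list is hypothetical, since the record does not say where in the ranking the nine false positives fall.

```python
def precision_at_k(is_relevant: list[bool], k: int) -> float:
    """Share of the k top-ranked instances that are true positives."""
    if k <= 0 or k > len(is_relevant):
        raise ValueError("k must be between 1 and the length of the ranking")
    return sum(is_relevant[:k]) / k


# Hypothetical ranking: 91 true positives and 9 false positives in the top 100.
# The placement of the false positives is made up for illustration.
ranking = [True] * 91 + [False] * 9

print(precision_at_k(ranking, 100))
```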

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2016, p. 147
Series
Studia Linguistica Upsaliensia, ISSN 1652-1366 ; 17
Keywords [en]
NLP for historical text, spelling normalisation, digital humanities, information extraction, character-based statistical machine translation, SMT, Levenshtein edit distance, language technology, computational linguistics
National Category
Humanities
Research subject
Computational Linguistics
Identifiers
URN: urn:nbn:se:uu:diva-269753
ISBN: 978-91-554-9443-8 (print)
OAI: oai:DiVA.org:uu-269753
DiVA, id: diva2:885117
Public defence
2016-03-05, Ihresalen, 21-0011, Engelska Parken, Thunbergsvägen 3H, Uppsala, 10:15 (English)
Opponent
Supervisors
Part of project
Gender and Work: A Research and Digitisation Project at the Department of History, Uppsala University, Swedish Research Council, Riksbankens Jubileumsfond, Knut and Alice Wallenberg Foundation
Available from: 2016-02-12 Created: 2015-12-18 Last updated: 2024-10-22

Open Access in DiVA

fulltext (1043 kB), 2208 downloads
File information
File name: FULLTEXT01.pdf
File size: 1043 kB
Checksum (SHA-512): 012bcf851fc5ec4de9a1e989cee13fc8536d410a06c234a721c26e81dc3d005acccede0a209c9a6043763a076c14b27b1ae9b5bb193b9efbcb4f6f69d8528a24
Type: fulltext
Mimetype: application/pdf

Authority records

Pettersson, Eva

Total: 2210 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 4978 hits