Normalization of Historical Text Using Context-Sensitive Weighted Levenshtein Distance and Compound Splitting
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Computational Linguistics)
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. ORCID iD: 0000-0002-4838-6518
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
2013 (English). In: Proceedings of the 19th Nordic Conference on Computational Linguistics, Oslo, Norway, 2013. Conference paper, Published paper (Refereed)
Abstract [en]

Natural language processing for historical text poses a variety of challenges, such as dealing with a high degree of spelling variation. Furthermore, there is often not enough linguistically annotated data available for training part-of-speech taggers and other tools aimed at handling this specific kind of text. In this paper we present a Levenshtein-based approach to normalisation of historical text to a modern spelling. This enables us to apply standard NLP tools trained on contemporary corpora to the normalised version of the historical input text. In its basic version, no annotated historical data is needed, since the only data used for the Levenshtein comparisons are a contemporary dictionary or corpus. In addition, a (small) corpus of manually normalised historical text can optionally be included to learn normalisations for frequent words and weights for edit operations in a supervised fashion, which improves precision. We show that this method is successful both in terms of normalisation accuracy and in terms of the performance of a standard modern tagger applied to the historical text. We also compare our method to a previously implemented approach using a set of hand-written normalisation rules, and we see that the Levenshtein-based approach clearly outperforms the hand-crafted rules. The experiments were carried out on Swedish data with promising results, and we believe that our method could be successfully applied to the analysis of historical text in other languages, including those with fewer resources.
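The basic idea of the abstract, looking up each historical word form in a contemporary lexicon by edit distance, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `edit_distance` and `normalise`, the tiny lexicon, the distance threshold, and the weight value are all hypothetical choices made for illustration. The weights here apply to single-character substitutions only, a simplification of the context-sensitive weighting named in the title, and compound splitting is not covered.

```python
from typing import Dict, List, Optional, Set, Tuple

Weights = Dict[Tuple[str, str], float]


def edit_distance(source: str, target: str,
                  sub_weights: Optional[Weights] = None) -> float:
    """Levenshtein distance with optional per-pair substitution weights.

    Insertions and deletions always cost 1.0. A substitution costs 1.0
    unless the (source_char, target_char) pair has an entry in sub_weights.
    """
    sub_weights = sub_weights or {}
    # prev[j] is the cost of turning the processed prefix of source
    # into the first j characters of target.
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, 1):
        curr = [float(i)]
        for j, t in enumerate(target, 1):
            sub_cost = 0.0 if s == t else sub_weights.get((s, t), 1.0)
            curr.append(min(prev[j] + 1.0,            # delete s
                            curr[j - 1] + 1.0,        # insert t
                            prev[j - 1] + sub_cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalise(word: str, lexicon: Set[str], max_distance: float = 1.0,
              sub_weights: Optional[Weights] = None) -> List[str]:
    """Return the closest modern candidates, or the word itself if none qualify."""
    if word in lexicon or not lexicon:
        return [word]
    scored = [(edit_distance(word, cand, sub_weights), cand) for cand in lexicon]
    best = min(dist for dist, _ in scored)
    if best > max_distance:
        return [word]  # nothing close enough: leave the form unchanged
    return sorted(cand for dist, cand in scored if dist == best)


# Hypothetical toy lexicon of modern Swedish word forms.
modern_lexicon = {"vad", "av", "skriva", "var"}
print(normalise("hvad", modern_lexicon))     # historical "hvad", modern "vad"
print(normalise("skrifva", modern_lexicon))  # historical "skrifva", modern "skriva"
# An assumed learned weight that makes the substitution f -> v cheap.
print(edit_distance("af", "av", {("f", "v"): 0.2}))
```

In a supervised setting such as the one the abstract describes, entries like the `("f", "v")` weight would be estimated from a manually normalised corpus rather than set by hand.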

Place, publisher, year, edition, pages
Oslo, Norway, 2013.
Keyword [en]
historical texts, normalization, digital humanities, natural language processing, levenshtein edit distance, compound splitting, part-of-speech tagging, underresourced languages, less-resourced languages
National Category
Languages and Literature
Research subject
Computational Linguistics
Identifiers
URN: urn:nbn:se:uu:diva-205141
OAI: oai:DiVA.org:uu-205141
DiVA: diva2:640841
Conference
The 19th Nordic Conference on Computational Linguistics
Available from: 2013-08-14 Created: 2013-08-14 Last updated: 2017-01-25

Authors
Pettersson, Eva; Megyesi, Beata
