An SMT Approach to Automatic Annotation of Historical Texts
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Computational Linguistics)
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. ORCID iD: 0000-0002-4838-6518
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
2013 (English). In: Workshop on Computational Historical Linguistics, Nodalida 2013, 2013. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we propose an approach to tagging and parsing of historical text, using character-based SMT methods for translating the historical spelling to a modern spelling before applying the NLP tools. This way, existing modern taggers and parsers may be used to analyse historical text instead of training new tools specialised in historical language, which might be hard considering the lack of linguistically annotated historical corpora. We show that our approach to spelling normalisation is successful even with small amounts of training data, and that it is generalisable to several languages. For the two languages presented in this paper, the proportion of tokens with a spelling identical to the modern gold standard spelling increases from 64.8% to 83.9%, and from 64.6% to 92.3% respectively, which has a positive impact on subsequent tagging and parsing using modern tools.
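To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common convention for character-based SMT in which each word is written as a sequence of space-separated characters, so that a phrase-based system treats characters as its translation units. It also computes the evaluation measure named in the abstract, the proportion of tokens whose spelling is identical to the modern gold standard. The function names and the toy tokens are hypothetical and invented for illustration; they are not taken from the paper's data.

```python
def to_char_tokens(word: str) -> str:
    """Write a word as space-separated characters (assumed SMT input format)."""
    return " ".join(word)


def from_char_tokens(chars: str) -> str:
    """Rejoin space-separated characters into a word after translation."""
    return chars.replace(" ", "")


def normalisation_accuracy(predicted: list[str], gold: list[str]) -> float:
    """Proportion of tokens whose spelling is identical to the gold spelling."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must be token-aligned")
    if not gold:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Invented toy tokens, for illustration only (not from the paper).
historical = ["vppon", "the", "daye"]
modern_gold = ["upon", "the", "day"]

# Baseline: how many historical spellings already match the modern gold form.
baseline = normalisation_accuracy(historical, modern_gold)

# Character-level view of one token, as it might be passed to an SMT system.
smt_input = to_char_tokens("daye")
```

Running the same accuracy function on the normaliser's output and on the untouched historical tokens gives the kind of before and after comparison the abstract reports.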

Place, publisher, year, edition, pages
2013.
Keyword [en]
historical texts, normalization, statistical machine translation, digital humanities, natural language processing
National Category
General Language Studies and Linguistics
Research subject
Computational Linguistics
Identifiers
URN: urn:nbn:se:uu:diva-205142
OAI: oai:DiVA.org:uu-205142
DiVA: diva2:640842
Conference
The 19th Nordic Conference on Computational Linguistics (Nodalida) 2013.
Available from: 2013-08-14 Created: 2013-08-14 Last updated: 2017-01-25

Open Access in DiVA

No full text

Authority records BETA

Pettersson, Eva; Megyesi, Beata; Tiedemann, Jörg
