Uppsala University Publications
Extending the View: Explorations in Bootstrapping a Swedish PoS Tagger
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Språkteknologi)
2009 (English). In: Proceedings of the 17th Nordic Conference on Computational Linguistics NODALIDA 2009, Tartu, Estonia: Tartu University Library, 2009, 34-40 p. Conference paper, Published paper (Refereed)
Abstract [en]

State-of-the-art statistical part-of-speech taggers mainly use information on tag bi- or trigrams, depending on the size of the training corpus. Some also use lexical emission probabilities above unigrams with beneficial results. In both cases, a wider context usually gives better accuracy for a large training corpus, which in turn gives better accuracy than a smaller one. Large corpora with validated tags, however, are scarce, so a bootstrap technique can be used. As the corpus grows, it is probable that a widened context would improve results even further.
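
To make the notion of a wider tag context concrete, the following is a minimal sketch of how tag transition probabilities could be estimated by relative frequency for bigrams (one previous tag) and trigrams (two previous tags). The function name `transition_probs`, the `<s>` padding symbol, and the toy English corpus are assumptions made for illustration; they are not taken from the paper, which works on Swedish data and may estimate and smooth its probabilities differently.

```python
from collections import Counter


def transition_probs(tagged_sents, n):
    """Relative-frequency estimates of P(tag | previous n-1 tags).

    tagged_sents: list of sentences, each a list of (word, tag) pairs.
    n: order of the tag n-gram (2 = bigram, 3 = trigram).
    Returns a dict mapping tag n-gram tuples to probabilities.
    """
    context_counts = Counter()
    ngram_counts = Counter()
    for sent in tagged_sents:
        # Pad the start so the first real tag also has a full context.
        tags = ["<s>"] * (n - 1) + [tag for _, tag in sent]
        for i in range(n - 1, len(tags)):
            context = tuple(tags[i - n + 1:i])
            ngram_counts[context + (tags[i],)] += 1
            context_counts[context] += 1
    return {ng: c / context_counts[ng[:-1]] for ng, c in ngram_counts.items()}


# Toy data invented for this sketch (hypothetical tags, not the paper's tagset).
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("old", "JJ"), ("dog", "NN"), ("sleeps", "VB")],
    [("dogs", "NN"), ("bark", "VB")],
]

bigram = transition_probs(corpus, 2)    # conditions on one previous tag
trigram = transition_probs(corpus, 3)   # conditions on two previous tags

print(bigram[("DT", "NN")])             # P(NN | DT)
print(trigram[("<s>", "DT", "NN")])     # P(NN | <s>, DT)
```

The trigram table has more distinct contexts than the bigram table, so each context is seen less often. This is the usual reason a wider view needs a larger training corpus before it pays off.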

In this paper, we looked at the contribution to accuracy of such an extended view for both tag transitions and lexical emissions, applied to both a validated Swedish source corpus and a raw bootstrap corpus. We found that the extended view was more important for tag transitions, in particular if applied to the bootstrap corpus. For lexical emission, it was also more important if applied to the bootstrap corpus than to the source corpus, although it was beneficial for both. The overall best tagger had an accuracy of 98.05%.
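
The abstract does not spell out the bootstrap procedure, so the sketch below shows only one common reading of the idea: train on the validated source corpus, tag the raw corpus automatically, and retrain on the union. The most-frequent-tag model used here is a deliberately simple stand-in, not the paper's tagger, and the names `train`, `tag`, `bootstrap`, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def train(tagged_sents):
    """Stand-in model: most frequent tag per word plus a global fallback tag."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sent in tagged_sents:
        for word, t in sent:
            per_word[word][t] += 1
            all_tags[t] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    fallback = all_tags.most_common(1)[0][0]
    return lexicon, fallback


def tag(model, words):
    """Tag a list of words; unknown words receive the fallback tag."""
    lexicon, fallback = model
    return [(w, lexicon.get(w, fallback)) for w in words]


def bootstrap(source, raw_sents):
    """Tag the raw corpus with a source-trained model, then retrain on both."""
    base_model = train(source)
    auto_tagged = [tag(base_model, s) for s in raw_sents]
    return train(source + auto_tagged), auto_tagged


# Toy data invented for this sketch (not the paper's corpora).
source = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
    [("dogs", "NN"), ("chase", "VB"), ("cats", "NN")],
]
raw = [["the", "bird", "sleeps"]]

model, auto_tagged = bootstrap(source, raw)
print(auto_tagged[0])
```

In a setup like the one the abstract describes, the automatically tagged bootstrap corpus would supply additional counts for the transition and emission models. Its tags are not validated, so any tagging errors are carried into the retrained model.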

Place, publisher, year, edition, pages
Tartu, Estonia: Tartu University Library, 2009. 34-40 p.
Series
NEALT Proceedings Series, ISSN 1736-6305 ; 4
Keyword [en]
part-of-speech tagging, machine learning
Keyword [sv]
ordklassuppmärkning, maskininlärning
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics (datorlingvistik)
Identifiers
URN: urn:nbn:se:uu:diva-103321
OAI: oai:DiVA.org:uu-103321
DiVA: diva2:218046
Available from: 2009-05-18 Created: 2009-05-18 Last updated: 2009-05-19 Bibliographically approved

Open Access in DiVA

No full text

Other links

http://hdl.handle.net/10062/9549