Extending the View: Explorations in Bootstrapping a Swedish PoS Tagger
2009 (English). In: Proceedings of the 17th Nordic Conference on Computational Linguistics NODALIDA 2009, Tartu, Estonia: Tartu University Library, 2009, pp. 34-40. Conference paper (Refereed)
State-of-the-art statistical part-of-speech taggers mainly use information on tag bi- or trigrams, depending on the size of the training corpus. Some also use lexical emission probabilities above unigrams with beneficial results. In both cases, a wider context usually gives better accuracy for a large training corpus, which in turn gives better accuracy than a smaller one. Large corpora with validated tags, however, are scarce, so a bootstrap technique can be used. As the corpus grows, it is probable that a widened context would improve results even further.
In this paper, we looked at the contribution to accuracy of such an extended view for both tag transitions and lexical emissions, applied to both a validated Swedish source corpus and a raw bootstrap corpus. We found that the extended view was more important for tag transitions, in particular if applied to the bootstrap corpus. For lexical emission, it was also more important if applied to the bootstrap corpus than to the source corpus, although it was beneficial for both. The overall best tagger had an accuracy of 98.05%.
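The abstract does not give the tagger's exact model, so the following is only a minimal sketch of what a wider tag context means in practice. It assumes plain maximum-likelihood estimates with no smoothing; the function name `transition_probs`, the `<s>` padding symbol, and the three toy sentences are illustrative inventions, not taken from the paper or its corpora.

```python
from collections import Counter

def transition_probs(tagged_sents, n):
    """MLE estimate of P(tag | previous n-1 tags) from tagged sentences."""
    context_counts = Counter()
    ngram_counts = Counter()
    for sent in tagged_sents:
        # Pad with start symbols so every tag has a full-length history.
        tags = ["<s>"] * (n - 1) + [tag for _, tag in sent]
        for i in range(n - 1, len(tags)):
            context = tuple(tags[i - n + 1:i])
            ngram_counts[context + (tags[i],)] += 1
            context_counts[context] += 1
    # Relative frequency of each tag n-gram given its (n-1)-tag history.
    return {ng: c / context_counts[ng[:-1]] for ng, c in ngram_counts.items()}

# Toy tagged corpus, invented for illustration only.
corpus = [
    [("hon", "PN"), ("ser", "VB"), ("huset", "NN")],
    [("hon", "PN"), ("ser", "VB"), ("honom", "PN")],
    [("huset", "NN"), ("ser", "VB"), ("stort", "JJ")],
]

bigram = transition_probs(corpus, 2)   # P(t_i | t_{i-1})
trigram = transition_probs(corpus, 3)  # P(t_i | t_{i-2}, t_{i-1})
```

On this toy data the bigram estimate for JJ after VB is 1/3, while the trigram estimate for JJ after NN VB is 1.0: the longer history separates cases that the shorter one merges. The cost is sparser counts, which is one reason a larger (for example bootstrapped) corpus would be expected to matter more for wider contexts.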
Series: NEALT Proceedings Series, ISSN 1736-6305; 4
Keywords: part-of-speech tagging, machine learning
National Category: Language Technology (Computational Linguistics)
Research subject: Computational Linguistics
Identifiers: URN: urn:nbn:se:uu:diva-103321; OAI: oai:DiVA.org:uu-103321; DiVA: diva2:218046