Generation of Compound Words in Statistical Machine Translation into Compounding Languages
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
2013 (English)
In: Computational linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 39, no 4, 1067-1108 p.
Article in journal (Refereed) Published
Abstract [en]

In this article we investigate statistical machine translation (SMT) into Germanic languages, with a focus on compound processing. Our main goal is to enable the generation of novel compounds that have not been seen in the training data. We adopt a split-merge strategy, where compounds are split before training the SMT system, and merged after the translation step. This approach reduces sparsity in the training data, but runs the risk of placing translations of compound parts in non-consecutive positions. It also requires a postprocessing step of compound merging, where compounds are reconstructed in the translation output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order and show that it can lead to improvements both by direct inspection and in terms of standard translation evaluation metrics. We also propose several new methods for compound merging, based on heuristics and machine learning, which outperform previously suggested algorithms. These methods can produce novel compounds and a translation with at least the same overall quality as the baseline. For all subtasks we show that it is useful to include part-of-speech based information in the translation process, in order to handle compounds.
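The split-merge idea described in the abstract can be illustrated with a small sketch. This is not the article's method: the dictionary-based splitter, the `@@` marker, and the function names are assumptions made for illustration only, and the sketch handles just two-part compounds with no linking elements and no part-of-speech information.

```python
# Toy illustration of a split-merge pipeline. The marker symbol and the
# vocabulary-based splitter are assumptions, not the article's algorithms.
MARK = "@@"  # hypothetical suffix marking a non-final compound part


def split_compound(word, vocab, min_len=3):
    """Split a word into two known parts if possible, else return it whole."""
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in vocab and tail in vocab:
            return [head + MARK, tail]
    return [word]


def split_sentence(tokens, vocab):
    """Apply compound splitting to every token (done before SMT training)."""
    out = []
    for tok in tokens:
        out.extend(split_compound(tok, vocab))
    return out


def merge_sentence(tokens):
    """Merge each marked part with the token that follows it (postprocessing)."""
    out, pending = [], ""
    for tok in tokens:
        if tok.endswith(MARK):
            pending += tok[: -len(MARK)]
        else:
            out.append(pending + tok)
            pending = ""
    if pending:  # a marked part at the end of the sentence stays a plain word
        out.append(pending)
    return out


vocab = {"fot", "boll", "match"}  # hypothetical list of known simple words
split_tokens = split_sentence(["fotboll", "match"], vocab)
merged_tokens = merge_sentence(["boll" + MARK, "match"])
```

Here `merge_sentence` joins `boll@@ match` into `bollmatch` even if that compound never occurred in the training data, which is the kind of novel compound the abstract refers to. The merge only works when the translated parts come out adjacent and in the right order, which is the risk the abstract mentions.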

Place, publisher, year, edition, pages
2013. Vol. 39, no 4, 1067-1108 p.
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
URN: urn:nbn:se:uu:diva-212834
DOI: 10.1162/COLI_a_00162
ISI: 000327124700008
OAI: oai:DiVA.org:uu-212834
DiVA: diva2:680463
Available from: 2013-12-18 Created: 2013-12-16 Last updated: 2016-02-18

By author/editor
Stymne, Sara