Approaching a New Language in Machine Translation: Considerations in Choosing a Strategy
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Datorlingvistik, Computational Linguistics)
2006 (English). In: Proceedings of the workshop on 'Strategies for developing machine translation for minority languages' (5th SALTMIL Workshop on Minority Languages), May 23rd 2006, Genoa, Italy. Satellite workshop of LREC, International Conference on Language Resources and Evaluation, 2006. Conference paper (Other academic)
Abstract [en]

As a contribution to the ongoing discussion of what strategy to use when approaching a new language, we present our experience from working with Swedish in the rule-based and statistical paradigms. We outline the development of Convertus, a robust transfer-based system equipped with techniques for using partial analyses, external dictionaries, statistical models and fall-back strategies. We also present a number of experiments with statistical translation of Swedish involving several languages. We observe that the concrete language pair, the translation direction and the corpus characteristics have an impact on translation quality in terms of the BLEU score. In particular, we study the effects of the openness or closedness of the domain, and introduce the concept of corpus density to measure this aspect. Density is based on repetition and overlap of text segments, and it is demonstrated that density correlates with BLEU. We also compare a statistical and a rule-based approach to the translation of a Swedish corpus. The rule-based approach, for which we use Convertus, modestly outperforms the statistical one. For both systems there is much room for improvement, and it is likely that both can be further developed to a BLEU score of 0.4 to 0.5, which seems good enough for post-editing to pay off. However, a major difference concerns the kinds of errors that are made and how they can be identified. The errors caused by Convertus can be easily traced and explained in linguistic terms, and hence also avoided by extensions and modifications of the dictionaries and the grammars. The errors produced by the statistical system are, however, less predictable and difficult to pinpoint and eliminate by further training. In particular, the many cases of omission constitute a serious problem. Our conclusion is that the investment made in developing a rule-based system, preferably backed up by a statistical system, will pay off in the long run. Thus it becomes an urgent issue to make rule-based systems available as open source, so that the development of new systems can be focused on creating the language resources.

Place, publisher, year, edition, pages
2006.
Keyword [en]
machine translation, statistical machine translation, rule-based translation, minority languages
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:uu:diva-287990
OAI: oai:DiVA.org:uu-287990
DiVA: diva2:923701
Conference
'Strategies for developing machine translation for minority languages' (5th SALTMIL Workshop on Minority Languages), May 23rd 2006, Genoa, Italy. Satellite workshop of the LREC International Conference on Language Resources and Evaluation, Genoa, Italy, May 2006
Available from: 2016-04-27. Created: 2016-04-27. Last updated: 2016-08-22. Bibliographically approved

Open Access in DiVA

No full text

By author/editor
Sågvall Hein, Anna