Approaching a New Language in Machine Translation: Considerations in Choosing a Strategy
2006 (English) In: Proceedings of the workshop on 'Strategies for developing machine translation for minority languages' (5th SALTMIL Workshop on Minority Languages), May 23rd 2006, Genoa, Italy. Satellite workshop of LREC (International Conference on Language Resources and Evaluation), 2006. Conference paper (Other academic)
As a contribution to the ongoing discussions concerning what strategy to use when approaching a new language, we present our experience from working with Swedish in the rule-based and statistical paradigms. We outline the development of Convertus, a robust transfer-based system equipped with techniques for using partial analyses, external dictionaries, statistical models and fall-back strategies. We also present a number of experiments with statistical translation of Swedish involving several languages. We observe that the concrete language pair, translation direction and corpus characteristics have an impact on translation quality in terms of the BLEU score. In particular, we study the effects of the openness/closeness of the domain, and introduce the concept of corpus density to measure this aspect. Density is based on repetition and overlap of text segments, and it is demonstrated that density correlates with BLEU. We also compare a statistical and a rule-based approach to the translation of a Swedish corpus. The rule-based approach, for which we use Convertus, modestly outperforms the statistical one. For both systems there is much room for improvement, and it is likely that both can be further developed to a BLEU score of 0.4–0.5, which seems good enough for post-editing to pay off. However, a major difference concerns the kinds of errors that are made and how they can be identified. The errors caused by Convertus can be easily traced and explained in linguistic terms, and hence also avoided by extensions and modifications of the dictionaries and the grammars. The errors produced by the statistical system are, however, less predictable and difficult to pinpoint and eliminate by further training. In particular, the many cases of omissions constitute a serious problem. Our conclusion is that the investment made in developing a rule-based system, preferably backed up by a statistical system, will pay off in the long run.
Thus it becomes an urgent issue to make rule-based systems available as open-source so that the development of new systems can be focused on creating the language resources.
Keywords: machine translation, statistical machine translation, rule-based translation, minority languages
National Category: Language Technology (Computational Linguistics)
Identifiers: URN: urn:nbn:se:uu:diva-287990; OAI: oai:DiVA.org:uu-287990; DiVA: diva2:923701
'Strategies for developing machine translation for minority languages' (5th SALTMIL Workshop on Minority Languages), May 23rd 2006, Genoa, Italy. Satellite workshop of the LREC International Conference on Language Resources and Evaluation, Genoa, Italy, May 2006.