Recycling Translations : Extraction of Lexical Data from Parallel Corpora and their Application in Natural Language Processing
2003 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

The focus of this thesis is on re-using translations in natural language processing. It involves the collection of documents and their translations in an appropriate format, the automatic extraction of translation data, and the application of the extracted data to different tasks in natural language processing.

Five parallel corpora containing more than 35 million words in 60 languages have been collected within co-operative projects. All corpora are sentence aligned and parts of them have been analyzed automatically and annotated with linguistic markup.

Lexical data are extracted from the corpora by means of word alignment. Two automatic word alignment systems have been developed, the Uppsala Word Aligner (UWA) and the Clue Aligner. UWA implements an iterative "knowledge-poor" word alignment approach using association measures and alignment heuristics. The Clue Aligner provides an innovative framework for the combination of statistical and linguistic resources in aligning single words and multi-word units. Both aligners have been applied to several corpora. Detailed evaluations of the alignment results have been carried out for three of them using fine-grained evaluation techniques.

A corpus processing toolbox, Uplug, has been developed. It includes the implementation of UWA and is freely available for research purposes. A new version, Uplug II, includes the Clue Aligner. It can be used via an experimental web interface (UplugWeb).

Lexical data extracted by the word aligners have been applied to different tasks in computational lexicography and machine translation. The use of word alignment in monolingual lexicography has been investigated in two studies. In a third study, the feasibility of using the extracted data in interactive machine translation has been demonstrated. Finally, extracted lexical data have been used for enhancing the lexical components of two machine translation systems.

Place, publisher, year, pages
Uppsala: Acta Universitatis Upsaliensis, 2003. 130 p.
Series
Studia Linguistica Upsaliensia, ISSN 1652-1366 ; 1
Keyword [en]
Computational linguistics, word alignment, parallel corpora, translation corpora, computational lexicography, machine translation
Keyword [sv]
Datorlingvistik
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:uu:diva-3791 (URN)
91-554-5815-7 (ISBN)
oai:DiVA.org:uu-3791 (OAI)
diva2:163715 (DiVA)
Public defence
2003-12-12, sal IX, Universitetshuset, Uppsala, 10:15
Opponent
Supervisors
Available from: 2003-11-20 Created: 2003-11-20 Bibliographically approved

Open Access in DiVA

fulltext (2112 kB), 4236 downloads
File information
File name: FULLTEXT01.pdf
File size: 2112 kB
Checksum MD5: 73d9ec2198317afff16eea4594abe58507f05374b5dcf61a6d9ae1b87e65608c1fcb45f2
Type: fulltext
Mimetype: application/pdf

Author
Tiedemann, Jörg
Organisation
Department of Linguistics
