Multilingual Word Embeddings Based on Cross-Lingual Annotations
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
2019 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Multilingual word embeddings are elements of a vector space that represents the meanings of words across different languages. They are effective in a range of Natural Language Processing (NLP) applications, including, but not limited to, machine translation, information retrieval, and dependency parsing.
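
The vector-space view can be made concrete with a small sketch: in a shared multilingual embedding space, a translation pair such as English "dog" and German "Hund" should lie closer together than two unrelated words. The language-prefixed keys and the vector values below are illustrative assumptions, not data or code from the thesis.

```python
# Minimal sketch of similarity in a shared multilingual embedding space.
# The vectors are toy values chosen for illustration only.
import numpy as np

embeddings = {
    "en:dog":  np.array([0.8, 0.1, 0.3]),
    "de:Hund": np.array([0.7, 0.2, 0.3]),
    "de:Haus": np.array([0.1, 0.9, 0.2]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embeddings["en:dog"], embeddings["de:Hund"]))  # high: translation pair
print(cosine(embeddings["en:dog"], embeddings["de:Haus"]))  # lower: unrelated words
```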

In this thesis, we aim to evaluate multilingual word embeddings trained on syntactic grammatical features provided by cross-lingually annotated corpora. A presupposition is that such word embeddings capture more syntactic than semantic information about words. We therefore address to what extent word embeddings trained on syntactic features capture semantic information in addition to syntactic information. We conduct experiments on five Indo-European and non-Indo-European languages: Chinese, English, German, Hebrew, and Italian. We evaluate the word embeddings on word similarity benchmarks, named entity recognition (NER), and dependency parsing. The experimental results demonstrate that using cross-lingually annotated corpora to train word embeddings can be useful for capturing both semantic and syntactic information. Combining word embeddings trained by two algorithms improves their semantic quality but shows limitations in their syntactic quality.
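
The abstract does not state how the embeddings from the two algorithms are combined. The sketch below assumes a simple per-word concatenation of the two vector spaces, which is one common way to combine embedding sets and is shown only as an illustration, not as the method used in the thesis.

```python
# Hedged sketch: combine two embedding sets by concatenating each word's vectors.
# Function name and toy vectors are hypothetical, not taken from the thesis.
import numpy as np

def combine_by_concatenation(emb_a, emb_b):
    """Concatenate vectors for words present in both embedding sets."""
    shared = emb_a.keys() & emb_b.keys()
    return {word: np.concatenate([emb_a[word], emb_b[word]]) for word in shared}

# Toy vectors standing in for the output of two different embedding algorithms.
emb_algo1 = {"en:dog": np.array([0.8, 0.1]), "de:Hund": np.array([0.7, 0.2])}
emb_algo2 = {"en:dog": np.array([0.3, 0.5]), "de:Hund": np.array([0.4, 0.4])}

combined = combine_by_concatenation(emb_algo1, emb_algo2)
print(combined["en:dog"])  # a 4-dimensional concatenated vector
```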

The multilingual word embedding approach based on cross-lingually annotated corpora is useful for the following reasons. First, it is flexible with respect to the choice of algorithm, as long as the algorithm can accommodate different grammatical features. Second, the approach does not require any bilingual dictionary or parallel corpus. Third, the word embeddings are simple to train, as the embeddings for all languages can be obtained simultaneously. However, one limitation of the approach is that the annotation of training corpora can be expensive.

Place, publisher, year, edition, pages
2019.
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:uu:diva-394888
OAI: oai:DiVA.org:uu-394888
DiVA, id: diva2:1359740
Subject / course
Language Technology
Educational program
Master Programme in Language Technology
Supervisors
Examiners
Available from: 2019-10-10. Created: 2019-10-10. Last updated: 2019-10-10. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

By organisation
Department of Linguistics and Philology
Language Technology (Computational Linguistics)
