Uppsala University Publications
Principal Word Vectors
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Computational Linguistics)
2018 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

Word embedding is a technique for associating the words of a language with real-valued vectors, enabling us to use algebraic methods to reason about their semantic and grammatical properties. This thesis introduces a word embedding method called principal word embedding, which uses principal component analysis (PCA) to train a set of word embeddings for the words of a language. The method performs a PCA on a data matrix whose elements are the frequencies with which words occur in different contexts. We address two challenges that arise in the application of PCA to create word embeddings. The first challenge concerns the size of the data matrix and affects the efficiency of the word embedding method: the data matrix is usually so large that processing it requires a very large amount of memory and CPU time. The second challenge concerns the distribution of word frequencies in the data matrix and affects the quality of the word embeddings. We provide an extensive study of the distribution of the elements of the data matrix and show that, in its unmodified form, it is unsuitable for PCA.
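The core idea described above, a PCA applied to a word-context frequency matrix, can be sketched in a few lines of NumPy. This is an illustration only, not the thesis's implementation: the toy corpus, the window-1 context definition, and the target dimensionality are all made up for demonstration.

```python
import numpy as np

# Toy corpus; a real application uses a much larger corpus and vocabulary.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

# Build the vocabulary and a word-context frequency matrix, where the
# context of a word is taken to be its immediate left/right neighbours.
vocab = sorted({w for line in corpus for w in line.split()})
idx = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))
for line in corpus:
    words = line.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                M[idx[w], idx[words[j]]] += 1

# PCA: centre the columns, then project onto the top-k principal directions
# (the leading right singular vectors of the centred matrix).
k = 2
M_centred = M - M.mean(axis=0)
U, S, Vt = np.linalg.svd(M_centred, full_matrices=False)
word_vectors = M_centred @ Vt[:k].T  # one k-dimensional vector per word

print(word_vectors.shape)  # → (8, 2): one 2-dimensional vector per vocabulary word
```

Even at this toy scale, the sketch exposes the first challenge mentioned above: the frequency matrix grows with the vocabulary times the number of contexts, so a full SVD quickly becomes infeasible for realistic corpora.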

We overcome these two challenges with a generalized PCA method. The problem with the size of the data matrix is mitigated by a randomized singular value decomposition (SVD) procedure, which improves the performance of PCA on the data matrix. The data distribution is reshaped by an adaptive transformation function, which makes it more suitable for PCA. These techniques, together with a weighting mechanism that generalizes many of the weighting and transformation approaches used in the literature, enable principal word embedding to train high-quality word embeddings efficiently.
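The two ingredients mentioned above can be sketched generically. The function below follows the standard randomized SVD recipe (random sketching, power iterations, exact SVD of a small projected matrix, in the spirit of Halko, Martinsson & Tropp, 2011); it is not the thesis's exact procedure, and the `log1p` transformation is only one simple example of a function that reshapes heavy-tailed count data, not the adaptive transformation proposed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_svd(A, k, n_oversamples=10, n_iter=4):
    """Approximate rank-k SVD of A via random projection."""
    m, n = A.shape
    # Sketch the column space of A with a random Gaussian test matrix.
    Omega = rng.standard_normal((n, k + n_oversamples))
    Y = A @ Omega
    # Power iterations sharpen the sketch when singular values decay slowly;
    # re-orthonormalizing each step keeps the iteration numerically stable.
    for _ in range(n_iter):
        Y, _ = np.linalg.qr(A @ (A.T @ Y))
    Q, _ = np.linalg.qr(Y)
    # Project A onto the sketch and take an exact SVD of the small matrix.
    B = Q.T @ A
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], S[:k], Vt[:k]

# A matrix of skewed, count-like entries stands in for word-context data;
# log1p is one simple reshaping choice before the decomposition.
A = np.log1p(rng.poisson(1.0, size=(500, 300)).astype(float))
U, S, Vt = randomized_svd(A, k=50)
```

The cost advantage comes from never decomposing the full matrix: the exact SVD is applied only to the small projected matrix `B`, whose row count is `k + n_oversamples` rather than the full dimension of the data.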

We also provide a study on how principal word embedding is connected to other word embedding methods. We compare it to a number of word embedding methods and study how the two challenges in principal word embedding are addressed in those methods. We show that the other word embedding methods are closely related to principal word embedding and, in many instances, they can be seen as special cases of it.

The principal word embeddings are evaluated both intrinsically and extrinsically. The intrinsic evaluations study the distribution of the word vectors, while the extrinsic evaluations measure the contribution of principal word embeddings to standard NLP tasks. The experimental results confirm that the newly proposed features of principal word embedding (i.e., the randomized SVD algorithm, the adaptive transformation function, and the weighting mechanism) benefit the method and lead to significant improvements in the results. A comparison with other popular word embedding methods shows that, in many instances, the proposed method generates word embeddings that are better than or as good as other word embeddings, while being faster than several popular methods.
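A common building block of such intrinsic evaluations is the cosine similarity between word vectors, which lets geometric closeness be compared against human similarity judgements. The vectors below are hypothetical, chosen only to illustrate the measure; they do not come from the thesis.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors, in [-1, 1]."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word vectors for illustration only.
v_cat = np.array([0.9, 0.1, 0.3])
v_dog = np.array([0.8, 0.2, 0.4])
v_car = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(v_cat, v_dog))  # semantically related pair scores higher
print(cosine_similarity(v_cat, v_car))  # unrelated pair scores lower
```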

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2018, p. 159
Series
Studia Linguistica Upsaliensia, ISSN 1652-1366 ; 22
Keywords [en]
word, context, word embedding, principal component analysis, PCA, sparse matrix, singular value decomposition, SVD, entropy
National Category
General Language Studies and Linguistics; Language Technology (Computational Linguistics); Computer Systems
Identifiers
URN: urn:nbn:se:uu:diva-353866
ISBN: 978-91-513-0365-9 (print)
OAI: oai:DiVA.org:uu-353866
DiVA, id: diva2:1219609
Public defence
2018-09-08, Room 22-0008, Humanistiska teatern, 752 38, Uppsala, 09:00 (English)
Available from: 2018-08-14 Created: 2018-06-17 Last updated: 2018-08-27

Open Access in DiVA

File name: FULLTEXT01.pdf
File size: 1305 kB
Mimetype: application/pdf
Checksum (SHA-512): 070a2edac73998c276edbb7ee199420e7162ba4fd0e18c0023e874c62af7e3c55becb973b59a1a1d3c300dc78a46452c802d3c04879a969066245aaa1feca8a5

Author
Basirat, Ali
Organisation
Department of Linguistics and Philology
