Principal Word Vectors
Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi. (Computational Linguistics)
2018 (English) Doctoral thesis, monograph (Other academic)
Abstract [en]

Word embedding is a technique for associating the words of a language with real-valued vectors, enabling us to use algebraic methods to reason about their semantic and grammatical properties. This thesis introduces a word embedding method called principal word embedding, which makes use of principal component analysis (PCA) to train a set of word embeddings for the words of a language. The principal word embedding method involves performing a PCA on a data matrix whose elements are the frequencies with which words occur in different contexts. We address two challenges that arise in the application of PCA to create word embeddings. The first challenge is related to the size of the data matrix on which PCA is performed and affects the efficiency of the word embedding method. The data matrix is usually a large matrix that requires large amounts of memory and CPU time to process. The second challenge is related to the distribution of word frequencies in the data matrix and affects the quality of the word embeddings. We provide an extensive study of the distribution of the elements of the data matrix and show that it is unsuitable for PCA in its unmodified form.
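The basic setup described above can be illustrated with a toy sketch: build a word-context frequency matrix from a small corpus and project it with PCA. This is not the thesis's actual pipeline (the corpus, window size, and dimensionality here are arbitrary stand-ins), just a minimal illustration of PCA-based word embedding.

```python
import numpy as np

# Toy corpus; in practice the matrix is built from a large corpus.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

# Word-by-context frequency matrix using a symmetric window of size 1.
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                M[idx[w], idx[sent[j]]] += 1

# PCA: center the columns, then project onto the top-k principal directions.
k = 2
Mc = M - M.mean(axis=0)
U, S, Vt = np.linalg.svd(Mc, full_matrices=False)
embeddings = Mc @ Vt[:k].T   # one k-dimensional vector per word
print(embeddings.shape)      # (vocabulary size, k)
```

Even at this scale the two challenges are visible: the matrix grows with the square of the vocabulary, and its entries are dominated by a few high-frequency words such as "the".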

We overcome the two challenges in principal word embedding by using a generalized PCA method. The problem with the size of the data matrix is mitigated by a randomized singular value decomposition (SVD) procedure, which improves the performance of PCA on the data matrix. The data distribution is reshaped by an adaptive transformation function, which makes it more suitable for PCA. These techniques, together with a weighting mechanism that generalizes many different weighting and transformation approaches used in the literature, enable principal word embedding to train high-quality word embeddings efficiently.
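The two ingredients named above can be sketched as follows. The randomized SVD follows the standard random-projection recipe (project onto a random subspace, orthonormalize, decompose the small projected matrix); the transformation shown is a simple log dampening of counts, used here only as a stand-in for the thesis's more general adaptive transformation function.

```python
import numpy as np

def randomized_svd(A, k, n_oversample=5, n_iter=2, seed=0):
    """Approximate top-k SVD of A via random projection."""
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((A.shape[1], k + n_oversample))
    Y = A @ Omega                    # sample the range of A
    for _ in range(n_iter):          # power iterations sharpen the subspace
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)           # orthonormal basis for the sampled range
    B = Q.T @ A                      # small (k+p) x n matrix
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], S[:k], Vt[:k]

# Heavy-tailed synthetic "counts" reshaped by a log transformation before
# decomposition; the thesis's adaptive transformation is more general.
counts = np.random.default_rng(1).poisson(2.0, size=(1000, 500)).astype(float)
X = np.log1p(counts)
U, S, Vt = randomized_svd(X, k=50)
print(U.shape, S.shape)   # (1000, 50) (50,)
```

The efficiency gain comes from never decomposing the full matrix: the expensive SVD is applied only to the small projected matrix `B`.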

We also provide a study on how principal word embedding is connected to other word embedding methods. We compare it to a number of word embedding methods and study how the two challenges in principal word embedding are addressed in those methods. We show that the other word embedding methods are closely related to principal word embedding and, in many instances, they can be seen as special cases of it.

The principal word embeddings are evaluated both intrinsically and extrinsically. The intrinsic evaluations focus on the distribution of the word vectors. The extrinsic evaluations measure the contribution of principal word embeddings to standard NLP tasks. The experimental results confirm that the newly proposed features of principal word embedding (i.e., the randomized SVD algorithm, the adaptive transformation function, and the weighting mechanism) are beneficial to the method and lead to significant improvements in the results. A comparison between principal word embedding and other popular word embedding methods shows that, in many instances, the proposed method generates word embeddings that are better than or as good as other word embeddings while being faster than several popular word embedding methods.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2018. p. 159
Series
Studia Linguistica Upsaliensia, ISSN 1652-1366 ; 22
Keywords [en]
word, context, word embedding, principal component analysis, PCA, sparse matrix, singular value decomposition, SVD, entropy
Identifiers
URN: urn:nbn:se:uu:diva-353866
ISBN: 978-91-513-0365-9 (print)
OAI: oai:DiVA.org:uu-353866
DiVA, id: diva2:1219609
Public defence
2018-09-08, Room 22-0008, Humanistiska teatern, 752 38 Uppsala, 09:00 (English)
Available from: 2018-08-14 Created: 2018-06-17 Last updated: 2018-08-27

Open Access in DiVA

fulltext (1305 kB), 216 downloads
File information
File: FULLTEXT01.pdf, Size: 1305 kB, Checksum: SHA-512
070a2edac73998c276edbb7ee199420e7162ba4fd0e18c0023e874c62af7e3c55becb973b59a1a1d3c300dc78a46452c802d3c04879a969066245aaa1feca8a5
Type: fulltext, Mimetype: application/pdf

Author
Basirat, Ali