Publications from Uppsala University (uu.se)
Searching and Recommending Texts Related to Climate Change
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology.
2021 (English)
Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]

This project considers the design of a machine learning system for efficiently searching a database of texts related to climate change. Efficient search and navigation of such a database make it easier to find actionable information, detect trends, or derive other useful information. A key feature of such an information retrieval system is the numerical representation of a text. This project implements and compares three different ways to represent a text in a vector space. Specifically, we contrast Bag-of-Words, Term Frequency - Inverse Document Frequency, and Doc2Vec in this context.
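To make the idea of a vector-space representation concrete, here is a minimal sketch of two of the three representations, Bag-of-Words and TF-IDF, ranked by cosine similarity. It is not the thesis code: the whitespace tokenizer, the smoothed IDF formula, the toy documents, and all function names are assumptions made for illustration. Doc2Vec is left out because it requires a trained model.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bow_vector(tokens):
    # Bag-of-Words: raw term counts, stored sparsely.
    return Counter(tokens)


def idf_weights(tokenized_docs):
    # Smoothed inverse document frequency (one common variant, assumed here).
    n = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(tokens, idf):
    # Term counts scaled by IDF; terms unseen in the corpus get weight 0.
    counts = Counter(tokens)
    return {t: c * idf.get(t, 0.0) for t, c in counts.items()}


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical, not from the thesis database).
docs = [
    "sea level rise threatens coastal cities",
    "carbon tax policy reduces emissions",
    "coastal flooding and sea level projections",
]
tokenized = [tokenize(d) for d in docs]
idf = idf_weights(tokenized)
query = tokenize("sea level projections")

bow_scores = [cosine(bow_vector(query), bow_vector(t)) for t in tokenized]
tfidf_scores = [
    cosine(tfidf_vector(query, idf), tfidf_vector(t, idf)) for t in tokenized
]

best_bow = max(range(len(docs)), key=bow_scores.__getitem__)
best_tfidf = max(range(len(docs)), key=tfidf_scores.__getitem__)
print(best_bow, best_tfidf)
```

In this toy corpus both representations rank the third document first, since it is the only one containing all three query terms. On a real collection the two can disagree, because TF-IDF down-weights terms that appear in many documents.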

The reported results cover two cases. First, we observe that all three embeddings outperform a naive (fixed, expert rule-based) method for retrieving a text. In this case, the query contains part of the text with a small modification, and the expected result of the query is the text itself. The Bag-of-Words approach turns out to be best in class for this task.
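The first task can be sketched as a small top-1 retrieval check. This is an illustration under stated assumptions, not the evaluation used in the thesis: the query is built by taking the first few words of a text and dropping one of them, and the exact-substring baseline is a hypothetical stand-in, since the abstract does not specify the naive rule-based method. The toy documents and every function name are likewise invented for this example.

```python
import math
from collections import Counter


def bow(text):
    # Bag-of-Words counts over lowercase, whitespace-split tokens.
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, docs):
    # Index of the document whose BoW vector is closest to the query.
    q = bow(query)
    scores = [cosine(q, bow(d)) for d in docs]
    return max(range(len(docs)), key=scores.__getitem__)


def make_query(text, drop_index=1, length=5):
    # Assumed modification: keep the first `length` words, drop one of them.
    words = text.split()[:length]
    del words[drop_index]
    return " ".join(words)


def substring_baseline(query, docs):
    # Hypothetical naive baseline: exact substring match, else no result.
    for i, d in enumerate(docs):
        if query.lower() in d.lower():
            return i
    return None


# Toy corpus (hypothetical, not from the thesis database).
docs = [
    "rising sea levels threaten low lying coastal cities worldwide",
    "a carbon tax can reduce industrial greenhouse gas emissions",
    "drought frequency increases as regional temperatures climb steadily",
]
queries = [make_query(d) for d in docs]

bow_hits = sum(retrieve(q, docs) == i for i, q in enumerate(queries))
baseline_hits = sum(
    substring_baseline(q, docs) == i for i, q in enumerate(queries)
)
bow_accuracy = bow_hits / len(docs)
baseline_accuracy = baseline_hits / len(docs)
print(bow_accuracy, baseline_accuracy)
```

Dropping a word breaks exact substring matching, while the Bag-of-Words vector of the query still overlaps strongly with its source text. The example shows only this mechanism and makes no claim about the numbers reported in the thesis.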

Second, we consider the task where the query is a random string and the desired result is determined by a manual comparison of the results. Here we observe that the Doc2Vec approach is best in class. If the random queries resemble abstracts, the Bag-of-Words approach performs almost as well.

Abstract [sv] (English translation)

This project considers the design of a machine learning system for efficiently searching a database of texts related to climate change. Efficient search and navigation of such a database make it easier to detect trends or find useful information. A key feature of such an information retrieval system is the numerical representation of a text. This project implements and compares three different ways to represent a text in a vector space. Specifically, we compare Bag-of-Words, Term Frequency - Inverse Document Frequency, and Doc2Vec in this context.

The reported results cover two cases. In the first case, we observe that all three implementations outperform a naive method for finding a text. Here the query contains part of the text with a minor modification, and the result should be the text itself. The Bag-of-Words method turns out to be best suited for this task.

In the second case, the query is a random string, and the desired result is based on a manual comparison of the results. Here we observe that the Doc2Vec method is best. If the query resembles an expected result, the Bag-of-Words method works almost as well.

Place, publisher, year, edition, pages
2021. p. 62
Series
UPTEC IT, ISSN 1401-5749 ; 21006
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:uu:diva-443535
OAI: oai:DiVA.org:uu-443535
DiVA, id: diva2:1558764
External cooperation
AFRY
Educational program
Master of Science Programme in Information Technology Engineering
Supervisors
Examiners
Available from: 2021-06-01 Created: 2021-05-31 Last updated: 2021-08-17
Bibliographically approved

Open Access in DiVA

fulltext (1254 kB), 394 downloads
File information
File name: FULLTEXT01.pdf
File size: 1254 kB
Checksum (SHA-512): fea2d55c6f8e83a299fd5c97bbc428994669775ca6d96043ad5b81578bd251d3205f5873f29ee8a3d56932a119e77ba7d4fa5f51b60186b9dcb577d32d1a8c85
Type: fulltext
Mimetype: application/pdf

Total: 396 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.