Publications from Uppsala University
Text Classification in Parliamentary Records with BERT and Active Learning
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Social Sciences, Department of Statistics.
2023 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

This thesis explores ways to efficiently annotate text segments in the records of the Swedish parliamentary proceedings, focusing on the classification of titles and decisions. As the corpus consists of more than 11.5 million sequences of varying structure and quality, spanning a period of more than a hundred years, curating the corpus is challenging and the cost of annotating the data is substantial. Therefore, active learning is used in conjunction with Bidirectional Encoder Representations from Transformers (BERT) models. The active learning strategy is uncertainty sampling with prediction entropy as the uncertainty measure. On average, the active learning strategy outperforms a random sampling strategy, suggesting that there is some benefit to combining deep pre-trained neural networks with active learning methods when classifying titles and decisions in parliamentary records. However, it did not substantially improve performance on the test set compared to a baseline model trained on the initial training set, which already performed well with a micro-averaged F1 score of 0.9556 and a subset accuracy of 0.9843. The resulting sets of selected unlabeled sequences with the highest entropy suggest that uncertainty sampling based on prediction entropy, used with BERT on records of parliamentary proceedings, reduced the model's uncertainty about text sequences from underrepresented periods with lower data quality.
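
The abstract does not spell out the sampling step in detail, but a minimal sketch of prediction-entropy uncertainty sampling for a multi-label classifier might look as follows. The per-label Bernoulli-entropy formulation, the batch size, and the function names are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def prediction_entropy(probs: np.ndarray) -> np.ndarray:
    """Sum of per-label Bernoulli entropies for a batch of multi-label predictions.

    probs: array of shape (n_sequences, n_labels) with sigmoid outputs in (0, 1).
    Returns an array of shape (n_sequences,) with one entropy value per sequence.
    """
    eps = 1e-12  # avoid log(0)
    p = np.clip(probs, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p)).sum(axis=1)

def select_most_uncertain(probs: np.ndarray, batch_size: int) -> np.ndarray:
    """Indices of the `batch_size` unlabeled sequences with the highest prediction entropy."""
    entropy = prediction_entropy(probs)
    return np.argsort(entropy)[::-1][:batch_size]

# Example: pick the 16 most uncertain of 1,000 unlabeled sequences
# (random numbers stand in for the BERT model's sigmoid outputs).
unlabeled_probs = np.random.rand(1000, 2)
query_indices = select_most_uncertain(unlabeled_probs, batch_size=16)
```

In an active learning loop, the selected sequences would be sent for annotation, added to the training set, and the model retrained before the next round of selection.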

Place, publisher, year, edition, pages
2023.
Keywords [en]
Parliamentary Records, Natural Language Processing, BERT, Active Learning, Uncertainty Sampling
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:uu:diva-506349
OAI: oai:DiVA.org:uu-506349
DiVA id: diva2:1775373
Subject / course
Statistics
Educational program
Master Programme in Statistics
Available from: 2023-06-27. Created: 2023-06-26. Last updated: 2023-06-27. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

