Publications from Uppsala University
1–50 of 936
  • 1.
    Abdou, Mostafa
    et al.
    Univ Copenhagen, Dept Comp Sci, Copenhagen, Denmark.
    Ravishankar, Vinit
    Univ Oslo, Dept Informat, Language Technol Grp, Oslo, Norway.
    Kulmizev, Artur
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Søgaard, Anders
    Univ Copenhagen, Dept Comp Sci, Copenhagen, Denmark.
    Word Order Does Matter (And Shuffled Language Models Know It). 2022. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Vol. 1: Long Papers, Association for Computational Linguistics, 2022, p. 6907–6919. Conference paper (Refereed)
    Abstract [en]

    Recent studies have shown that language models pretrained and/or fine-tuned on randomly permuted sentences exhibit competitive performance on GLUE, putting into question the importance of word order information. Somewhat counter-intuitively, some of these studies also report that position embeddings appear to be crucial for models' good performance with shuffled text. We probe these language models for word order information and investigate what position embeddings learned from shuffled text encode, showing that these models retain information pertaining to the original, naturalistic word order. We show this is in part due to a subtlety in how shuffling is implemented in previous work: before rather than after subword segmentation. Surprisingly, we find that even language models trained on text shuffled after subword segmentation retain some semblance of information about word order because of the statistical dependencies between sentence length and unigram probabilities. Finally, we show that beyond GLUE, a variety of language understanding tasks do require word order information, often to an extent that cannot be learned through fine-tuning.

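The subtlety noted in entry 1 above (shuffling before vs. after subword segmentation) is easy to miss, so here is a minimal toy sketch of the two setups. This is not the authors' code; the toy segmenter merely stands in for a real BPE/WordPiece tokenizer.

```python
import random

def segment(word):
    # Toy subword segmenter standing in for BPE/WordPiece: words longer
    # than 4 characters are split in half, with "@@" marking continuation.
    if len(word) <= 4:
        return [word]
    mid = len(word) // 2
    return [word[:mid] + "@@", word[mid:]]

def shuffle_before_segmentation(sentence, rng):
    # Permute whole words first, then segment: subwords of the same word
    # stay adjacent, leaking local word-order information.
    words = sentence.split()
    rng.shuffle(words)
    return [sw for w in words for sw in segment(w)]

def shuffle_after_segmentation(sentence, rng):
    # Segment first, then permute the subword units themselves.
    subwords = [sw for w in sentence.split() for sw in segment(w)]
    rng.shuffle(subwords)
    return subwords

rng = random.Random(0)
s = "shuffled language models retain word order information"
print(shuffle_before_segmentation(s, rng))
print(shuffle_after_segmentation(s, rng))
```
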
  • 2.
    Adams, Allison
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Dependency Parsing and Dialogue Systems: an investigation of dependency parsing for commercial application. 2017. Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    In this thesis, we investigate dependency parsing for commercial application, namely for future integration in a dialogue system. To do this, we conduct several experiments on dialogue data to assess parser performance on this domain, and to improve this performance over a baseline. This work makes the following contributions: first, the creation and manual annotation of a gold-standard data set for dialogue data; second, a thorough error analysis of the data set, comparing neural network parsing to traditional parsing methods on this domain; and finally, domain adaptation experiments showing how parsing on this data set can be improved over a baseline. We further show that dialogue data is characterized by questions in particular, and suggest a method for improving overall parsing on these constructions.

    Download full text (pdf)
    fulltext
  • 3.
    Adams, Allison
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Stymne, Sara
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Learning with learner corpora: Using the TLE for native language identification. 2017. In: Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition, 2017, p. 1–7. Conference paper (Refereed)
    Abstract [en]

    This study investigates the usefulness of the Treebank of Learner English (TLE) when applied to the task of Native Language Identification (NLI). The TLE is effectively a parallel corpus of Standard/Learner English, as there are two versions; one based on original learner essays, and the other an error-corrected version. We use the corpus to explore how useful a parser trained on ungrammatical relations is compared to a parser trained on grammatical relations, when used as features for a native language classification task. While parsing results are much better when trained on grammatical relations, native language classification is slightly better using a parser trained on the original treebank containing ungrammatical relations.

    Download full text (pdf)
    fulltext
  • 4.
    Adelani, David
    et al.
    Saarland Univ, Saarbrücken, Germany.
    Alabi, Jesujoba
    INRIA, Paris, France.
    Fan, Angela
    Meta AI, Menlo Pk, CA USA.
    Kreutzer, Julia
    Google Res, Mountain View, CA USA.
    Shen, Xiaoyu
    Amazon Alexa AI, Seattle, WA USA.
    Reid, Machel
    Univ Tokyo, Tokyo, Japan.
    Ruiter, Dana
    Saarland Univ, Saarbrücken, Germany.
    Klakow, Dietrich
    Saarland Univ, Saarbrücken, Germany.
    Nabende, Peter
    Makerere Univ, Kampala, Uganda.
    Chang, Ernie
    Saarland Univ, Saarbrücken, Germany.
    Gwadabe, Tajuddeen
    UCAS, Beijing, Peoples R China.
    Sackey, Freshia
    JKUAT, Juja, Kenya.
    Dossou, Bonaventure F. P.
    Jacobs Univ, Bremen, Germany.
    Emezue, Chris
    TUM, Munich, Germany.
    Leong, Colin
    Univ Dayton, Dayton, OH 45469 USA.
    Beukman, Michael
    Univ Witwatersrand, Johannesburg, South Africa.
    Muhammad, Shamsuddeen
    LIAAD INESC TEC, Porto, Portugal.
    Jarso, Guyo
    Yousuf, Oreen
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Rubungo, Andre Niyongabo
    UPC, Barcelona, Spain.
    Hacheme, Gilles
    Ai4Innov, Paris, France.
    Wairagala, Eric Peter
    Makerere Univ, Kampala, Uganda.
    Nasir, Muhammad Umair
    Ominor AI, Orlando, FL USA.
    Ajibade, Benjamin
    Ajayi, Tunde
    Gitau, Yvonne
    Abbott, Jade
    Ahmed, Mohamed
    Microsoft Africa Res Inst, Nairobi, Kenya.
    Ochieng, Millicent
    Microsoft Africa Res Inst, Nairobi, Kenya.
    Aremu, Anuoluwapo
    Ogayo, Perez
    CMU, Pittsburgh, PA USA.
    Mukiibi, Jonathan
    Makerere Univ, Kampala, Uganda.
    Kabore, Fatoumata Ouoba
    Kalipe, Godson
    Mbaye, Derguene
    Baamtu, Dakar, Senegal.
    Tapo, Allahsera Auguste
    RIT, Rochester, NY USA.
    Koagne, Victoire Memdjokam
    Munkoh-Buabeng, Edwin
    Wagner, Valencia
    SPU, Kimberley, South Africa.
    Abdulmumin, Idris
    ABU, Abuja, Nigeria.
    Awokoya, Ayodele
    UI Ibadan, Ibadan, Nigeria.
    Buzaaba, Happy
    Sibanda, Blessing
    NUST, Windhoek, Namibia.
    Bukula, Andiswa
    SADiLaR, Potchefstroom, South Africa.
    Manthalu, Sam
    Univ Malawi, Zomba, Malawi.
    A Few Thousand Translations Go A Long Way! Leveraging Pre-trained Models for African News Translation. 2022. In: NAACL 2022: The 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Stroudsburg: Association for Computational Linguistics, 2022, p. 3053–3070. Conference paper (Refereed)
    Abstract [en]

    Recent advances in the pre-training of language models leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out in these datasets. This is primarily because many widely spoken languages are not well represented on the web and therefore excluded from the large-scale crawls used to create datasets. Furthermore, downstream users of these models are restricted to the selection of languages originally chosen for pre-training. This work investigates how to optimally leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? and 2) How can the resulting translation models effectively transfer to new domains? To answer these questions, we create a new African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.

  • 5. Agić, Željko
    et al.
    Tiedemann, Jörg
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Merkler, Danijela
    Krek, Simon
    Dobrovoljc, Kaja
    Moze, Sara
    Cross-lingual Dependency Parsing of Related Languages with Rich Morphosyntactic Tagsets. 2014. In: Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants, 2014, p. 13–24. Conference paper (Refereed)
  • 6.
    Ahlbom, Viktoria
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    Sågvall Hein, Anna
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    Test Suites Covering the Functional Specifications of the Sub-components of the Swedish Prototype. 1999. In: Working Papers in Computational Linguistics & Language Engineering, ISSN 1401-923X, no 13, p. 28. Article in journal (Other academic)
  • 7.
    Ahlbom, Viktoria
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    Sågvall Hein, Anna
    Test Suites Covering the Functional Specifications of the Sub-components of the Swedish Prototype. 1999. In: Working Papers in Computational Linguistics & Language Engineering, ISSN 1401-923X, no 13, p. 28. Article in journal (Other scientific)
  • 8.
    Ahrenberg, Lars and Merkel, Magnus and Ridings, Daniel and Sågvall Hein, Anna and Tiedemann, Jörg
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    Automatic processing of parallel corpora: A Swedish perspective. 1999. Report (Other scientific)
    Abstract [en]

    As empirical methods have come to the fore in language technology and translation studies, the processing of parallel texts and parallel corpora has become a major issue. In this article we review the state of the art in alignment and data extraction techniques.

  • 9.
    Ahrenberg, Lars
    et al.
    Linköping University, Sweden.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Proceedings of the Workshop on NLP and Pseudonymisation. 2019. Conference proceedings (editor) (Refereed)
    Download full text (pdf)
    fulltext
  • 10. Ahrenberg, Lars
    et al.
    Merkel, Magnus
    Sågvall Hein, Anna
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    Tiedemann, Jörg
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    Evaluation of LWA and UWA. 1999. Report (Other academic)
  • 11.
    Ait-Mlouk, Addi
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computational Science.
    Alawadi, Sadi
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computational Science.
    Toor, Salman
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computational Science.
    Hellander, Andreas
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computational Science.
    FedQAS: Privacy-Aware Machine Reading Comprehension with Federated Learning. 2022. In: Applied Sciences, E-ISSN 2076-3417, Vol. 12, no 6, article id 3130. Article in journal (Refereed)
    Abstract [en]

    Machine reading comprehension (MRC) of text data is a challenging task in Natural Language Processing (NLP), with a lot of ongoing research fueled by the release of the Stanford Question Answering Dataset (SQuAD) and Conversational Question Answering (CoQA). It is considered to be an effort to teach computers how to "understand" a text, and then to be able to answer questions about it using deep learning. However, until now, large-scale training on private text data and knowledge sharing has been missing for this NLP task. Hence, we present FedQAS, a privacy-preserving machine reading system capable of leveraging large-scale private data without the need to pool those datasets in a central location. The proposed approach combines transformer models and federated learning technologies. The system is developed using the FEDn framework and deployed as a proof-of-concept alliance initiative. FedQAS is flexible, language-agnostic, and allows intuitive participation and execution of local model training. In addition, we present the architecture and implementation of the system, as well as provide a reference evaluation based on the SQuAD dataset, to showcase how it overcomes data privacy issues and enables knowledge sharing between alliance members in a Federated learning setting.

    Download full text (pdf)
    FULLTEXT01
  • 12.
    Aleksandrova, Anastasiia
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Exploring Language Descriptions through Vector Space Models. 2024. Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    The abundance of natural languages and the complexities involved in describing their structures pose significant challenges for modern linguists, not only in documentation but also in the systematic organization of knowledge. Computational linguistics tools hold promise in comprehending the “big picture”, provided existing grammars are digitized and made available for analysis using state-of-the-art language models. Extensive efforts have been made by an international team of linguists to compile such a knowledge base, resulting in the DReaM corpus – a comprehensive dataset comprising tens of thousands of digital books containing multilingual language descriptions. However, there remains a lack of tools that facilitate understanding of concise language structures and uncovering overlooked topics and dialects. This thesis represents a small step towards elucidating the broader picture by utilizing a subset of the DReaM corpus as a vector space capable of capturing genetic ties among described languages. To achieve this, we explore several encoding algorithms in conjunction with various segmentation strategies and vector summarization approaches for generating both monolingual and cross-lingual feature representations of selected grammars in English and Russian. Our newly proposed sentence-facets TF-IDF model shows promise in unsupervised generation of monolingual representations, conveying sufficient signal to differentiate historical linguistic relations among 484 languages from 26 language families based on their descriptions. However, the construction of a cross-lingual vector space necessitates further exploration of advanced technologies.

    Download full text (pdf)
    fulltext
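
As an editorial aside, the vector-space idea in entry 12 above can be illustrated with plain document-level TF-IDF and cosine similarity; the thesis's "sentence-facets" variant is not reproduced here, and the toy grammar descriptions are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented miniature "grammar descriptions" for three hypothetical languages.
grammars = {
    "lang_a": "nouns inflect for case and number; verbs agree with subject",
    "lang_b": "nouns carry case suffixes; verb agreement marks the subject",
    "lang_c": "tone distinguishes lexical items; no case marking on nouns",
}

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(grammars.values())

# Languages with similar descriptions should end up closer in the space.
sim = cosine_similarity(X)
names = list(grammars)
for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i < j:
            print(f"{a} ~ {b}: {sim[i, j]:.2f}")
```
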
  • 13.
    Aleksandrova, Anastasiia
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. RISE Res Inst Sweden, Stockholm, Sweden.
    Models and Strategies for Russian Word Sense Disambiguation: A Comparative Analysis. 2024. In: Text, Speech, and Dialogue: 27th International Conference, TSD 2024, Brno, Czech Republic, September 9–13, 2024, Proceedings, Part I / [ed] Elmar Nöth; Aleš Horák; Petr Sojka, Cham: Springer, 2024, p. 267–278. Conference paper (Refereed)
    Abstract [en]

    Word sense disambiguation (WSD) is a core task in computational linguistics that involves interpreting polysemous words in context by identifying senses from a predefined sense inventory. Despite the dominance of BERT and its derivatives in WSD evaluation benchmarks, their effectiveness in encoding and retrieving word senses, especially in languages other than English, remains relatively unexplored. This paper provides a detailed quantitative analysis, comparing various BERT-based models for Russian, and examines two primary WSD strategies: fine-tuning and feature-based nearest-neighbor classification. The best results are obtained with the ruBERT model coupled with the feature-based nearest neighbor strategy. This approach adeptly captures even fine-grained meanings with limited data and diverse sense distributions.

  • 14. Alemu, Atelach
    et al.
    Hulth, Anette
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Computational Linguistics.
    General-Purpose Text Categorization Applied to the Medical Domain. 2007. Report (Other academic)
    Abstract [en]

    This paper presents work where a general-purpose text categorization method was applied to categorize medical free-texts. The purpose of the experiments was to examine how such a method performs without any domain-specific knowledge, hand-crafting or tuning. Additionally, we compare the results from the general-purpose method with results from runs in which a medical thesaurus as well as automatically extracted keywords were used when building the classifiers. We show that standard text categorization techniques using stemmed unigrams as the basis for learning can be applied directly to categorize medical reports, yielding an F-measure of 83.9, and outperforming the more sophisticated methods.

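A hedged sketch of the kind of stemmed-unigram categorization pipeline described in entry 14 above; the toy clinical snippets, the Porter stemmer, and the logistic-regression learner are all assumptions for illustration, not the paper's exact setup.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()

def stem_tokens(text):
    # Stemmed unigrams as the basis for learning, as in the abstract above.
    return [stemmer.stem(tok) for tok in text.lower().split()]

# Invented toy medical free-texts and category labels.
docs = ["patient reports chest pain", "fracture of the left wrist",
        "persistent cough and fever", "wrist pain after a fall"]
labels = ["cardio", "ortho", "respiratory", "ortho"]

clf = make_pipeline(
    CountVectorizer(tokenizer=stem_tokens, token_pattern=None),
    LogisticRegression(max_iter=1000),
)
clf.fit(docs, labels)
print(clf.predict(["sharp pain in the chest"]))
```
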
  • 15. Almqvist, Ingrid
    et al.
    Sågvall Hein, Anna
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    Defining ScaniaSwedish - A Controlled Language for Truck Maintenance. 1996. In: Proceedings of the First International Workshop on Controlled Language Applications, Centre for Computational Linguistics, Katholieke Universiteit Leuven, 1996. Conference paper (Refereed)
    Abstract [en]

    An approach to integrated multilingual document production is proposed. The basic idea of this approach is to use the analyzer of a modular, transfer-based machine translation system as the core of a language checker. The checker generates grammatical structures to be forwarded to the transfer and generation components for the various target languages. A precondition for such an approach is a controlled source language. The source language in focus in this presentation is ScaniaSwedish, to be defined via a standardization of the language presently used by Scania in their truck maintenance documents. Here we concentrate on the identification of the vocabulary of current ScaniaSwedish and present the results achieved so far. In parallel with the inventory of the vocabulary, the competence of the language checker is being developed.

  • 16. Almqvist, Ingrid
    et al.
    Sågvall Hein, Anna
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    A Language Checker of Controlled Language and its Integration in a Documentation and Translation Workflow. 2000. In: Translating and the Computer 22: Proceedings of the Twenty-second International Conference, 16–17 November 2000, London. London: Aslib, 2000, Vol. 22. Conference paper (Refereed)
  • 17.
    Andersson, Alexander
    et al.
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Systems and Control.
    Caracolias, Vilma
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Systems and Control.
    Unifying a Large Language Model and Company Specific Terminology for Enhanced Translation in a Company’s Multilingual Landscape. 2024. Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    This thesis investigates the possibility of integrating a large language model with industry-specific terminology to enhance translation accuracy in Ahlsell’s daily operations. Additionally, the thesis seeks to identify effective strategies for implementing a translation tool within the organization’s operations by using the change management model ADKAR. Dictionaries and synonyms of industry-specific words were used to provide desired translations to an existing translation model. Translations were conducted on product descriptions and evaluated with BLEU scores. A test group was formed to evaluate translations and identify words to be added to the dictionaries. Furthermore, meetings and demos were held with the goal of spreading information and gathering feedback. Impact goals were also set at the beginning of the project to state the desired outcomes of the implementation. The research has yielded a functional and efficient translation model employing GPT-4, incorporating a dictionary and synonyms tailored for translation between English and Nordic languages within Ahlsell’s operations. This study initiated the establishment of a company-specific dictionary, aimed at further refining translation capabilities. The study also found that considerations such as establishing clear responsibility guidelines and effective information dissemination strategies should be prioritized for a successful future implementation of the translation tool.

    Download full text (pdf)
    fulltext
  • 18.
    Andréasson, Maia
    et al.
    Department of Swedish Language, University of Gothenburg.
    Borin, Lars
    Department of Swedish Language, University of Gothenburg.
    Forsberg, Markus
    Department of Swedish Language, University of Gothenburg.
    Beskow, Jonas
    School of Computer Science and Communication, KTH.
    Carlsson, Rolf
    School of Computer Science and Communication, KTH.
    Edlund, Jens
    School of Computer Science and Communication, KTH.
    Elenius, Kjell
    School of Computer Science and Communication, KTH.
    Hellmer, Kahl
    School of Computer Science and Communication, KTH.
    House, David
    School of Computer Science and Communication, KTH.
    Merkel, Magnus
    Department of Computer Science, Linköping University.
    Forsbom, Eva
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Eriksson, Anders
    Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg.
    Strömqvist, Sven
    Centre for Languages and Literature, Lund University.
    Swedish CLARIN Activities. 2009. In: Proceedings of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources / [ed] Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard, Eiríkur Rögnvaldsson and Koenraad de Smedt, Northern European Association for Language Technology (NEALT), 2009, p. 1–5. Conference paper (Refereed)
    Abstract [en]

    Although Sweden has yet to allocate funds specifically intended for CLARIN activities, there are some ongoing activities which are directly relevant to CLARIN, and which are explicitly linked to CLARIN. These activities have been funded by the Committee for Research Infrastructures and its subcommittee DISC (Database Infrastructure Committee) of the Swedish Research Council.

  • 19.
    Antomonov, Filip
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Automatic Morphosyntactic Analysis of Clinical Text. 2014. Conference paper (Refereed)
    Abstract [en]

    Electronic health records, also called clinical texts, have their own linguistic characteristics and have been shown to deviate from standard language. Therefore, computational linguistics tools trained on standard language presumably do not achieve the same accuracy when applied to clinical data. In this paper, we describe a pipeline of tools for the automatic processing of clinical texts in Swedish, from tokenization through part-of-speech tagging to dependency parsing. The evaluation of the components of the pipeline shows that existing NLP tools can be used, but performance drops greatly when models trained on standard language are applied to clinical data. We also present a small, syntactically annotated data set of clinical text to serve as a gold standard.

    Download full text (pdf)
    fulltext
  • 20.
    Arentzen, Thomas
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Mary Retold. 2019. In: Ancient Jew Review. Article in journal (Other academic)
  • 21.
    Arentzen, Thomas
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Johnsén, Henrik Rydell
    Westergren, Andreas
    Rubenson on the Move: A Biographical Journey. 2020. In: Wisdom on the Move: Late Antique Traditions in Multicultural Conversation. Essays in Honor of Samuel Rubenson / [ed] Susan Ashbrook Harvey, Thomas Arentzen, Henrik Rydell Johnsén and Andreas Westergren, Leiden: Brill Academic Publishers, 2020, p. 247–250. Chapter in book (Other academic)
  • 22.
    Arentzen, Thomas
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Westergren, Andreas
    Prolog. 2019. In: Patristica Nordica Annuaria, ISSN 2001-2365, Vol. 34, p. 3–4. Article in journal (Other academic)
  • 23.
    Axelsson, Hans
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Blom, Oskar
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Utveckling av ett svensk-engelskt lexikon inom tåg- och transportdomänen [Development of a Swedish-English lexicon for the train and transport domain]. 2006. Independent thesis, Advanced level (degree of Master (One Year)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    This paper describes the process of building a machine translation lexicon for use in the train and transport domain with the machine translation system MATS. The lexicon will consist of a Swedish part, an English part and links between them, and is derived from a Trados translation memory which is split into a training (90%) part and a testing (10%) part. The task is carried out mainly by using existing word linking software and recycling previous machine translation lexicons from other domains. In order to do this, a method is developed where the focus lies on automation by means of both existing and self-developed software, in combination with manual interaction. The domain-specific lexicon is then extended with a domain-neutral core lexicon and a less domain-neutral general lexicon. The different lexicons are automatically and manually evaluated through machine translation on the test corpus. The automatic evaluation of the largest lexicon yielded a NEVA score of 0.255 and a BLEU score of 0.190. The manual evaluation found 34% of the segments correctly translated, 37% imperfect but perfectly understandable, and 29% difficult to understand.

    Download full text (pdf)
    FULLTEXT01
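
For readers unfamiliar with the automatic metrics in entry 23 above, a minimal BLEU computation is sketched below with NLTK (NEVA is omitted for lack of a widely available implementation; the segments are invented).

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One reference list per hypothesis; tokens are pre-split.
references = [[["the", "brake", "pads", "must", "be", "replaced"]],
              [["check", "the", "coolant", "level"]]]
hypotheses = [["the", "brake", "pads", "should", "be", "replaced"],
              ["check", "coolant", "level"]]

# Smoothing avoids zero scores when higher-order n-grams have no matches,
# which is common for short segments like these.
score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```
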
  • 24. Baldwin, Timothy
    et al.
    Croft, William
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Savary, Agata
    Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics. 2021. In: Dagstuhl Reports, ISSN 2192-5283, Vol. 11, no 7, p. 89–138. Article in journal (Refereed)
    Abstract [en]

    Computational linguistics builds models that can usefully process and produce language and that can increase our understanding of linguistic phenomena. From the computational perspective, language data are particularly challenging notably due to their variable degree of idiosyncrasy (unexpected properties shared by few peer objects), and the pervasiveness of non-compositional phenomena such as multiword expressions (whose meaning cannot be straightforwardly deduced from the meanings of their components, e.g. red tape, by and large, to pay a visit and to pull one’s leg) and constructions (conventional associations of forms and meanings). Additionally, if models and methods are to be consistent and valid across languages, they have to face specificities inherent either to particular languages, or to various linguistic traditions. These challenges were addressed by the Dagstuhl Seminar 21351 entitled “Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics”, which took place on 30–31 August 2021. Its main goal was to create synergies between three distinct though partly overlapping communities: experts in typology, in cross-lingual morphosyntactic annotation and in multiword expressions. This report documents the program and the outcomes of the seminar. We present the executive summary of the event, reports from the 3 Working Groups and abstracts of individual talks and open problems presented by the participants.

  • 25. Baldwin, Timothy
    et al.
    Croft, William
    Nivre, Joakim
    Savary, Agata
    Stymne, Sara
    Vylomova, Ekaterina
    Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics. 2023. In: Dagstuhl Reports, Vol. 13, no 5, p. 22–70. Article in journal (Other academic)
    Abstract [en]

    The Dagstuhl Seminar 23191 entitled “Universals of Linguistic Idiosyncrasy in Multilingual Computational Linguistics” took place May 7–12, 2023. Its main objectives were to deepen the understanding of language universals and linguistic idiosyncrasy, to harness idiosyncrasy in treebanking frameworks in computationally tractable ways, and to promote a higher degree of convergence in universalism-driven initiatives to natural language morphology, syntax and semantics. Most of the seminar was devoted to working group discussions, covering topics such as: representations below and beyond word boundaries; annotation of particular kinds of constructions; semantic representations, in particular for multiword expressions; finding idiosyncrasy in corpora; large language models; and methodological issues, community interactions and cross-community initiatives. Thanks to the collaboration of linguistic typologists, NLP experts and experts in different annotation frameworks, significant progress was made towards the theoretical, practical and networking objectives of the seminar.

    Download full text (pdf)
    fulltext
  • 26. Ballesteros, Miguel
    et al.
    Gómez-Rodríguez, Carlos
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Optimizing Planar and 2-Planar Parsers with MaltOptimizer. 2012. In: Revista de Procesamiento de Lenguaje Natural (SEPLN), ISSN 1135-5948, E-ISSN 1989-7553, Vol. 49, p. 171–178. Article in journal (Refereed)
  • 27. Ballesteros, Miguel
    et al.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Going to the Roots of Dependency Parsing. 2013. In: Computational linguistics - Association for Computational Linguistics (Print), ISSN 0891-2017, E-ISSN 1530-9312, Vol. 39, no 1, p. 5–13. Article in journal (Refereed)
    Abstract [en]

    Dependency trees used in syntactic parsing often include a root node representing a dummy word prefixed or suffixed to the sentence, a device that is generally considered a mere technical convenience and is tacitly assumed to have no impact on empirical results. We demonstrate that this assumption is false and that the accuracy of data-driven dependency parsers can in fact be sensitive to the existence and placement of the dummy root node. In particular, we show that a greedy, left-to-right, arc-eager transition-based parser consistently performs worse when the dummy root node is placed at the beginning of the sentence (following the current convention in data-driven dependency parsing) than when it is placed at the end or omitted completely. Control experiments with an arc-standard transition-based parser and an arc-factored graph-based parser reveal no consistent preferences but nevertheless exhibit considerable variation in results depending on root placement. We conclude that the treatment of dummy root nodes in data-driven dependency parsing is an underestimated source of variation in experiments and may also be a parameter worth tuning for some parsers.

  • 28. Ballesteros, Miguel
    et al.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    MaltOptimizer: Fast and Effective Parser Optimization. 2016. In: Natural Language Engineering, ISSN 1351-3249, E-ISSN 1469-8110, Vol. 22, no 2, p. 187–213. Article in journal (Refereed)
    Abstract [en]

    Statistical parsers often require careful parameter tuning and feature selection. This is a nontrivial task for application developers who are not interested in parsing for its own sake, and it can be time-consuming even for experienced researchers. In this paper we present MaltOptimizer, a tool developed to automatically explore parameters and features for MaltParser, a transition-based dependency parsing system that can be used to train parsers given treebank data. MaltParser provides a wide range of parameters for optimization, including nine different parsing algorithms, an expressive feature specification language that can be used to define arbitrarily rich feature models, and two machine learning libraries, each with their own parameters. MaltOptimizer is an interactive system that performs parser optimization in three stages. First, it performs an analysis of the training set in order to select a suitable starting point for optimization. Second, it selects the best parsing algorithm and tunes the parameters of this algorithm. Finally, it performs feature selection and tunes machine learning parameters. Experiments on a wide range of data sets show that MaltOptimizer quickly produces models that consistently outperform default settings and often approach the accuracy achieved through careful manual optimization.

  • 29.
    Bartczak, Zuzanna
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    From RAG to Riches: Evaluating the Benefits of Retrieval-Augmented Generation in SQL Database Querying. 2024. Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
    Abstract [en]

    This thesis investigates the performance of large language models (LLMs) in generating SQL queries, with a particular focus on the impact retrieval-augmented generation (RAG) has on this task. The research explores how four LLMs (GPT-3.5, GPT-4, Llama2, and Llama3) perform in translating natural language inputs into SQL queries, a task that is critical for users with limited technical expertise in database management. The study is structured around a series of experiments that compare the accuracy and efficiency of LLMs with and without the integration of RAG, across both simple and complex query scenarios. The findings reveal significant variations in model performance, with RAG-enhanced models demonstrating improved accuracy in generating contextually appropriate SQL queries. However, these benefits are often accompanied by increased computational costs and longer query generation times, raising questions about the practical feasibility of RAG in resource-constrained environments. The thesis also discusses the challenges associated with integrating LLMs into existing database management systems. Through this analysis, we hope to contribute to a better understanding of the potential and limitations of using LLMs for SQL query generation in real-life scenarios, providing possible directions for future research in database management and natural language processing.

    Download full text (pdf)
    bartczak_rag_to_riches
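
A minimal, provider-agnostic sketch of the RAG setup evaluated in entry 29 above: retrieve schema snippets relevant to the question and prepend them to the prompt before asking an LLM for SQL. The commented-out `generate` call is a hypothetical stand-in for whichever model API (GPT-3.5/4, Llama2/3, ...) is used; the schema and TF-IDF retriever are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented database schema snippets serving as the retrieval corpus.
schema_docs = [
    "table orders(id, customer_id, total, created_at)",
    "table customers(id, name, country)",
    "table products(id, name, price)",
]

def retrieve(question, docs, k=2):
    # Rank schema snippets by TF-IDF cosine similarity to the question.
    vec = TfidfVectorizer().fit(docs + [question])
    sims = cosine_similarity(vec.transform([question]), vec.transform(docs))[0]
    return [docs[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(question):
    context = "\n".join(retrieve(question, schema_docs))
    return f"Schema:\n{context}\n\nWrite a SQL query: {question}"

print(build_prompt("total order value per customer"))
# sql = generate(build_prompt(...))  # hypothetical LLM call, API not specified
```
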
  • 30.
    Baró, Arnau
    et al.
    Computer Vision Center, Computer Science Department, Universitat Autònoma de Barcelona, Bellaterra, Spain.
    Chen, Jialuo
    Computer Vision Center, Computer Science Department, Universitat Autònoma de Barcelona, Bellaterra, Spain.
    Fornés, Alicia
    Computer Vision Center, Computer Science Department, Universitat Autònoma de Barcelona, Bellaterra, Spain.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Towards a Generic Unsupervised Method for Transcription of Encoded Manuscripts. 2019. In: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage: DATeCH2019, New York: ACM, 2019. Conference paper (Refereed)
    Abstract [en]

    Historical ciphers, a special type of manuscript, contain encrypted information important for the interpretation of our history. The first step towards decipherment is to transcribe the images, either manually or by automatic image processing techniques. Despite the improvements in handwritten text recognition (HTR) brought by deep learning methodologies, the need for labelled training data is an important limitation. Given that ciphers often use symbol sets across various alphabets and unique symbols without any transcription scheme available, these supervised HTR techniques are not suitable for transcribing ciphers. In this paper we propose an unsupervised method for transcribing encrypted manuscripts based on clustering and label propagation, which has been successfully applied to community detection in networks. We analyze the performance on ciphers with various symbol sets, and discuss the advantages and drawbacks compared to supervised HTR methods.

  • 31.
    Basirat, Ali
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Principal Word Vectors. 2018. Doctoral thesis, monograph (Other academic)
    Abstract [en]

    Word embedding is a technique for associating the words of a language with real-valued vectors, enabling us to use algebraic methods to reason about their semantic and grammatical properties. This thesis introduces a word embedding method called principal word embedding, which makes use of principal component analysis (PCA) to train a set of word embeddings for words of a language. The principal word embedding method involves performing a PCA on a data matrix whose elements are the frequency of seeing words in different contexts. We address two challenges that arise in the application of PCA to create word embeddings. The first challenge is related to the size of the data matrix on which PCA is performed and affects the efficiency of the word embedding method. The data matrix is usually a large matrix that requires a very large amount of memory and CPU time to be processed. The second challenge is related to the distribution of word frequencies in the data matrix and affects the quality of the word embeddings. We provide an extensive study of the distribution of the elements of the data matrix and show that it is unsuitable for PCA in its unmodified form.

    We overcome the two challenges in principal word embedding by using a generalized PCA method. The problem with the size of the data matrix is mitigated by a randomized singular value decomposition (SVD) procedure, which improves the performance of PCA on the data matrix. The data distribution is reshaped by an adaptive transformation function, which makes it more suitable for PCA. These techniques, together with a weighting mechanism that generalizes many different weighting and transformation approaches used in literature, enable the principal word embedding to train high quality word embeddings in an efficient way.

    We also provide a study on how principal word embedding is connected to other word embedding methods. We compare it to a number of word embedding methods and study how the two challenges in principal word embedding are addressed in those methods. We show that the other word embedding methods are closely related to principal word embedding and, in many instances, they can be seen as special cases of it.

    The principal word embeddings are evaluated in both intrinsic and extrinsic ways. The intrinsic evaluations are directed towards the study of the distribution of word vectors. The extrinsic evaluations measure the contribution of principal word embeddings to some standard NLP tasks. The experimental results confirm that the newly proposed features of principal word embedding (i.e., the randomized SVD algorithm, the adaptive transformation function, and the weighting mechanism) are beneficial to the method and lead to significant improvements in the results. A comparison between principal word embedding and other popular word embedding methods shows that, in many instances, the proposed method is able to generate word embeddings that are better than or as good as other word embeddings while being faster than several popular word embedding methods.

    Download full text (pdf)
    fulltext
    Download (jpg)
    presentationsbild
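
A hedged sketch of the core recipe in entry 31 above: build a word-context frequency matrix, reshape its skewed distribution, and take principal components via randomized SVD. The log transform and the one-word context window are simplifying assumptions, not the thesis's exact weighting mechanism.

```python
import numpy as np
from collections import Counter
from sklearn.utils.extmath import randomized_svd

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a +/-1 word window.
counts = Counter()
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            counts[(idx[w], idx[corpus[j]])] += 1

M = np.zeros((len(vocab), len(vocab)))
for (r, c), n in counts.items():
    M[r, c] = n

M = np.log1p(M)  # reshape the frequency distribution (an assumption)
U, S, Vt = randomized_svd(M, n_components=2, random_state=0)
word_vectors = U * S  # one low-dimensional vector per vocabulary word
print(dict(zip(vocab, np.round(word_vectors, 2))))
```
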
  • 32. Basirat, Ali
    Random Word Vectors. 2019. Conference paper (Other academic)
    Download full text (pdf)
    fulltext
  • 33.
    Basirat, Ali
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Linköping Univ, Dept Comp & Informat Sci, Linköping, Sweden.
    Allassonnière-Tang, Marc
    Univ Lyon, Lab Dynam Language, CNRS, UMR 5596, Lyon, France.
    Berdicevskis, Aleksandrs
    Univ Gothenburg, Dept Swedish, Språkbanken Text, Gothenburg, Sweden.
    An empirical study on the contribution of formal and semantic features to the grammatical gender of nouns. 2021. In: Linguistics Vanguard, E-ISSN 2199-174X, Vol. 7, no 1, article id 20200048. Article in journal (Refereed)
    Abstract [en]

    This study conducts an experimental evaluation of two hypotheses about the contributions of formal and semantic features to the grammatical gender assignment of nouns. One of the hypotheses (Corbett and Fraser, 2000) claims that semantic features dominate formal ones. The other hypothesis, formulated within the optimal gender assignment theory (Rice, 2006), states that form and semantics contribute equally. Both hypotheses claim that the combination of formal and semantic features yields the most accurate gender identification.  In this paper, we operationalize and test these hypotheses by trying to predict grammatical gender using only character-based embeddings (that capture only formal features), only context-based embeddings (that capture only semantic features) and the combination of both. We performed the experiment using data from three languages with different gender systems (French, German and Russian). Formal features are a significantly better predictor of gender than semantic ones, and the difference in prediction accuracy is very large. Overall, formal features are also significantly better than the combination of form and semantics, but the difference is very small and the results for this comparison are not entirely consistent across languages.

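A toy version of the comparison in entry 33 above, with character n-grams standing in for the paper's character-based (formal) embeddings; the French mini-lexicon is invented and far too small to support any real conclusion.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

nouns = ["voiture", "maison", "nation", "camion", "bateau", "chapeau"]
genders = ["f", "f", "f", "m", "m", "m"]  # invented French toy data

# Formal features: character n-grams capture endings like -ion / -eau,
# which are strong formal cues for French gender.
form_clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(2, 3)),
    LogisticRegression(max_iter=1000),
)
form_clf.fit(nouns, genders)
print(form_clf.predict(["portion", "gateau"]))  # expect ['f' 'm']
```
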
  • 34. Basirat, Ali
    et al.
    de Lhoneux, Miryam
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Kulmizev, Artur
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Kurfalı, Murathan
    Department of Linguistics, Stockholm University.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Östling, Robert
    Department of Linguistics, Stockholm University.
    Polyglot Parsing for One Thousand and One Languages (And Then Some). 2019. Conference paper (Other academic)
    Download full text (pdf)
    fulltext
  • 35. Basirat, Ali
    et al.
    Faili, Heshaam
    Constructing Linguistically Motivated Structures from Statistical Grammars. 2011. In: Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, 2011, p. 63–69. Conference paper (Refereed)
  • 36. Basirat, Ali
    et al.
    Faili, Heshaam
    Bridge the gap between statistical and hand-crafted grammars. 2013. In: Computer speech & language (Print), ISSN 0885-2308, E-ISSN 1095-8363, Vol. 27, no 5, p. 1085–1104. Article in journal (Refereed)
    Abstract [en]

    LTAG is a rich formalism for performing NLP tasks such as semantic interpretation, parsing, machine translation and information retrieval. Depending on the specific NLP task, different kinds of LTAGs for a language may be developed. Each of these LTAGs is enriched with specific features, such as semantic representation and statistical information, that make it suitable for that task. The distribution of these capabilities among the LTAGs makes it difficult to benefit from all of them in NLP applications.

    This paper discusses a statistical model to bridge between two kinds of LTAGs for a natural language in order to benefit from the capabilities of both. To do so, an HMM was trained that links an elementary tree sequence of a source LTAG onto an elementary tree sequence of a target LTAG. Training was performed using the standard HMM training algorithm, Baum–Welch. To lead the training algorithm to a better solution, the initial state of the HMM was also trained by a novel EM-based semi-supervised bootstrapping algorithm.

    The model was tested on two English LTAGs, XTAG (XTAG-Group, 2001) and MICA's grammar (Bangalore et al., 2009), as the target and source LTAGs, respectively. The empirical results confirm that the model provides a satisfactory way of linking these LTAGs to share their capabilities.

  • 37.
    Basirat, Ali
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. University of Tehran.
    Faili, Heshaam
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    A statistical model for grammar mapping. 2016. In: Natural Language Engineering, ISSN 1351-3249, E-ISSN 1469-8110, Vol. 22, no 2, p. 215–255. Article in journal (Refereed)
    Abstract [en]

    The two main classes of grammars are (a) hand-crafted grammars, which are developed by language experts, and (b) data-driven grammars, which are extracted from annotated corpora. This paper introduces a statistical method for mapping the elementary structures of a data-driven grammar onto the elementary structures of a hand-crafted grammar in order to combine their advantages. The idea is employed in the context of Lexicalized Tree-Adjoining Grammars (LTAG) and tested on two LTAGs of English: the hand-crafted LTAG developed in the XTAG project, and the data-driven LTAG, which is automatically extracted from the Penn Treebank and used by the MICA parser. We propose a statistical model for mapping any elementary tree sequence of the MICA grammar onto a proper elementary tree sequence of the XTAG grammar. The model has been tested on three subsets of the WSJ corpus that have average lengths of 10, 16, and 18 words, respectively. The experimental results show that full-parse trees with average F1-scores of 72.49, 64.80, and 62.30 points could be built from 94.97%, 96.01%, and 90.25% of the XTAG elementary tree sequences assigned to the subsets, respectively. Moreover, by reducing the amount of syntactic lexical ambiguity of sentences, the proposed model significantly improves the efficiency of parsing in the XTAG system.

  • 38.
    Basirat, Ali
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Greedy Universal Dependency Parsing with Right Singular Word Vectors. 2016. Conference paper (Refereed)
    Abstract [en]

    A set of continuous feature vectors formed by right singular vectors of a transformed co-occurrence matrix is used with the Stanford neural dependency parser to train parsing models for a limited number of languages in the corpus of universal dependencies. We show that the feature vectors can help the parser to remain greedy and be as accurate as (or even more accurate than) some other greedy and non-greedy parsers.

    Download full text (pdf)
    fulltext
  • 39.
    Basirat, Ali
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Real-valued syntactic word vectors. 2020. In: Journal of experimental and theoretical artificial intelligence (Print), ISSN 0952-813X, E-ISSN 1362-3079, Vol. 32, no 4, p. 557–579. Article in journal (Refereed)
    Abstract [en]

    We introduce a word embedding method that generates a set of real-valued word vectors from a distributional semantic space. The semantic space is built with a set of context units (words) which are selected by an entropy-based feature selection approach with respect to the certainty involved in their contextual environments. We show that the most predictive context of a target word is its preceding word. An adaptive transformation function is also introduced that reshapes the data distribution to make it suitable for dimensionality reduction techniques. The final low-dimensional word vectors are formed by the singular vectors of a matrix of transformed data. We show that the resulting word vectors are as good as other sets of word vectors generated with popular word embedding methods.

    Download full text (pdf)
    fulltext
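
The entropy-based context selection in entry 39 above can be sketched as follows; treating low entropy as high "certainty" and using only the preceding-word context are assumptions about the paper's criterion, and the corpus is a toy.

```python
import numpy as np
from collections import Counter, defaultdict

corpus = "the cat sat on the mat a cat lay on a rug".split()

# For each context word, the distribution of the words it immediately precedes.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def entropy(counter):
    p = np.array(list(counter.values()), dtype=float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

# Keep the contexts whose following-word distributions are most certain.
scores = {w: entropy(c) for w, c in following.items()}
selected = [w for w, h in sorted(scores.items(), key=lambda kv: kv[1])[:3]]
print(scores, "-> selected contexts:", selected)
```
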
  • 40.
    Basirat, Ali
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Real-valued Syntactic Word Vectors (RSV) for Greedy Neural Dependency Parsing. 2017. Conference paper (Refereed)
    Abstract [en]

    We show that a set of real-valued word vectors formed by right singular vectors of a transformed co-occurrence matrix is meaningful for determining different types of dependency relations between words. Our experimental results on the task of dependency parsing confirm the superiority of these word vectors over other sets of word vectors generated by popular word embedding methods. We also study the effect of using these vectors on the accuracy of dependency parsing in different languages, versus using more complex parsing architectures.

    Download full text (pdf)
    fulltext
  • 41.
    Basirat, Ali
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Syntactic Nuclei in Dependency Parsing: A Multilingual Exploration. 2021. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, 2021, p. 1376–1387. Conference paper (Refereed)
    Abstract [en]

    Standard models for syntactic dependency parsing take words to be the elementary units that enter into dependency relations. In this paper, we investigate whether there are any benefits from enriching these models with the more abstract notion of nucleus proposed by Tesnière. We do this by showing how the concept of nucleus can be defined in the framework of Universal Dependencies and how we can use composition functions to make a transition-based dependency parser aware of this concept. Experiments on 12 languages show that nucleus composition gives small but significant improvements in parsing accuracy. Further analysis reveals that the improvement mainly concerns a small number of dependency relations, including relations of coordination, direct objects, nominal modifiers, and main predicates.

    Download full text (pdf)
    fulltext
  • 42.
    Basirat, Ali
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Tang, Marc
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Lexical and Morpho-syntactic Features in Word Embeddings: A Case Study of Nouns in Swedish2018In: Proceedings of the 10th International Conference on Agents and Artificial Intelligence: Volume 2, Setubal: SciTePress, 2018, p. 663-674Chapter in book (Refereed)
  • 43.
    Basirat, Ali
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Tang, Marc
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Linguistic information in word embeddings2019In: Agents and Artificial Intelligence / [ed] Jaap van den Herik, Ana Paula Rocha, Cham: Springer, 2019, p. 492-513Conference paper (Refereed)
    Abstract [en]

    We study the presence of linguistically motivated information in the word embeddings generated with statistical methods. The nominal aspects of uter/neuter, common/proper, and count/mass in Swedish are selected to represent grammatical, semantic, and mixed types of nominal categories, respectively. Our results indicate that typical grammatical and semantic features are easily captured by word embeddings. In our experiments, based on a single-layer feed-forward neural network, classifying semantic features required significantly fewer neurons than classifying grammatical features. However, semantic features also generated higher entropy in the classification output despite the high accuracy. Furthermore, the count/mass distinction posed difficulties for the model, even when the number of neurons was tuned close to its maximum.
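
    A minimal sketch of the probing setup described above: a single-hidden-layer feed-forward network predicts a nominal category (e.g. uter vs. neuter) from a noun's embedding, and both the accuracy and the entropy of the output distribution are inspected. Data loading and the neuron budget are assumptions.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.neural_network import MLPClassifier

def probe(embeddings, labels, n_neurons=10):
    clf = MLPClassifier(hidden_layer_sizes=(n_neurons,), max_iter=1000)
    clf.fit(embeddings, labels)
    probs = clf.predict_proba(embeddings)
    # Accuracy shows whether the category is recoverable; mean output
    # entropy shows how certain the classifier is about its decisions.
    return clf.score(embeddings, labels), float(entropy(probs.T).mean())
```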

  • 44. Basirat, Ali
    et al.
    Tang, Marc
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Neural network and human cognition: A case study of grammatical gender in Swedish2017In: Proceedings of the 13th Swedish Cognitive Science Society (SweCog) national conference, Uppsala, 2017, p. 28-30Conference paper (Other academic)
    Download full text (pdf)
    fulltext
  • 45.
    Beck, Daniel
    et al.
    University of Sheffield.
    Cohn, Trevor
    University of Melbourne.
    Hardmeier, Christian
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Specia, Lucia
    University of Sheffield.
    Learning Structural Kernels for Natural Language Processing2015In: Transactions of the Association for Computational Linguistics, ISSN 2307-387X, Vol. 3, p. 461-473Article in journal (Refereed)
    Abstract [en]

    Structural kernels are a flexible learning paradigm that has been widely used in Natural Language Processing. However, the problem of model selection in kernel-based methods is usually overlooked. Previous approaches mostly rely on setting default values for kernel hyperparameters or using grid search, which is slow and coarse-grained. In contrast, Bayesian methods allow efficient model selection by maximizing the evidence on the training data through gradient-based methods. In this paper we show how to perform this in the context of structural kernels by using Gaussian Processes. Experimental results on tree kernels show that this procedure results in better prediction performance compared to hyperparameter optimization via grid search. The framework proposed in this paper can be adapted to other structures besides trees, e.g., strings and graphs, thereby extending the utility of kernel-based methods.

    Download full text (pdf)
    TACL2015
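
    A minimal sketch of the model-selection idea in this paper: rather than grid search, kernel hyperparameters are fit by maximizing the log marginal likelihood (the evidence) with a gradient-based optimizer. scikit-learn ships no tree kernel, so an RBF kernel stands in here for the structural kernels of the paper, and the data is a placeholder.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.random((50, 5))      # placeholder inputs (stand-in for trees)
y = rng.random(50)           # placeholder regression targets

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel)  # fit() maximizes the evidence
gp.fit(X, y)                                  # w.r.t. kernel hyperparameters
print(gp.kernel_)                             # the selected hyperparameters
print(gp.log_marginal_likelihood_value_)
```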
  • 46.
    Beloucif, Meriem
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Bansal, Mihir
    Carnegie Mellon University.
    Biemann, Chris
    Hamburg University.
    Using Wikidata for Enhancing Compositionality in Pre-trained Language Models2023In: Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing / [ed] Galia Angelova; Maria Kunilovskaya; Ruslan Mitkov, INCOMA, 2023, p. 170-178Conference paper (Refereed)
    Abstract [en]

    One of the many advantages of pre-trained language models (PLMs) such as BERT and RoBERTa is their flexibility and contextual nature. These features give PLMs strong capabilities for representing lexical semantics. However, PLMs seem incapable of capturing high-level semantics in terms of compositionality. We show that when augmented with relevant semantic knowledge, PLMs learn to capture a higher degree of lexical compositionality. We annotate a large dataset from Wikidata highlighting a type of semantic inference that is easy for humans to understand but difficult for PLMs, such as the correlation between age and date of birth. We use this resource for fine-tuning DistilBERT, BERT-large, and RoBERTa. Our results show that the performance of PLMs on the test data continuously improves when they are augmented with such a rich resource. Our results are corroborated by a consistent improvement over most GLUE benchmark natural language understanding tasks.
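
    A minimal sketch of one fine-tuning step in this spirit, using the Hugging Face transformers API; the premise/hypothesis format, the label scheme, and the example pair built from a Wikidata-style fact are assumptions, not the paper's dataset.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Hypothetical pair encoding the age / date-of-birth correlation.
batch = tok(["Alice was born in 1990."],
            ["Alice is 33 years old in 2023."],
            return_tensors="pt", padding=True)
labels = torch.tensor([1])                 # 1 = the inference holds
loss = model(**batch, labels=labels).loss
loss.backward()                            # one fine-tuning step
```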

  • 47.
    Beloucif, Meriem
    et al.
    MIN Faculty, Universität Hamburg.
    Biemann, Chris
    MIN Faculty, Universität Hamburg.
    Probing Pre-trained Language Models for Semantic Attributes and their Values2021In: Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021 / [ed] Association for Computational Linguistics, 2021, p. 2554-2559, article id 218Conference paper (Refereed)
    Abstract [en]

    Pretrained Language Models (PTLMs) yield state-of-the-art performance on many Natural Language Processing tasks, including syntax, semantics and commonsense reasoning. In this paper, we focus on identifying to what extent PTLMs capture semantic attributes and their values, e.g. the relation between rich and high net worth. We use PTLMs to predict masked tokens using patterns and lists of items from Wikidata in order to verify how well PTLMs encode semantic attributes along with their values. Such semantic inferences are intuitive for humans as part of our language understanding. Since PTLMs are trained on large amounts of Wikipedia data, we would expect them to generate similar predictions. However, our findings reveal that PTLMs still perform much worse than humans on this task. We present an analysis of how our methodology could be exploited to integrate better context and semantics into PTLMs using knowledge bases.

    Download full text (pdf)
    fulltext
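
    A minimal sketch of the probing recipe this abstract describes: a masked-token pattern is fed to a pre-trained model and the predicted fillers are compared against the expected attribute value. The pattern and model choice here are illustrative, not taken from the paper's resource.

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for pred in unmasker("Rich people usually have a [MASK] net worth."):
    print(pred["token_str"], round(pred["score"], 3))
```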
  • 48.
    Benali, B. Ait
    et al.
    Hassan First Univ Settat, Fac Sci & Tech, IR2M Lab, Settat, Morocco..
    Mihi, S.
    Hassan First Univ Settat, Fac Sci & Tech, IR2M Lab, Settat, Morocco..
    Ait-Mlouk, Addi
    Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology.
    El Bazi, I
    Sultan Moulay Slimane Univ, Natl Sch Business & Management, Beni Mellal, Morocco..
    Laachfoubi, N.
    Hassan First Univ Settat, Fac Sci & Tech, IR2M Lab, Settat, Morocco..
    Arabic named entity recognition in social media based on BiLSTM-CRF using an attention mechanism2022In: Journal of Intelligent & Fuzzy Systems, ISSN 1064-1246, E-ISSN 1875-8967, Vol. 42, no 6, p. 5427-5436Article in journal (Refereed)
    Abstract [en]

    Named Entity Recognition (NER) is a vitally important task of Natural Language Processing (NLP), which aims at finding named entities in natural language text and classifying them into predefined categories such as persons (PER), places (LOC), organizations (ORG), and so on. In the Arabic context, current deep-learning NER approaches mainly use word embeddings or character-level embeddings as input. However, using a representation of a single granularity suffers from out-of-vocabulary (OOV) items, word embedding errors, and relatively shallow semantic content. This paper presents a multi-headed self-attention mechanism implemented in a BiLSTM-CRF neural network structure to recognize Arabic named entities on social media using two embeddings. Unlike other state-of-the-art approaches, this approach combines character and word embeddings at the embedding layer, and the attention mechanism computes similarity over the entire sequence of characters and captures local context information. The proposed approach better recognizes NEs in Dialectal Arabic, reaching an F1 value of 74.15% on Darwish's dataset (a publicly available Arabic NER benchmark for social media). To the best of our knowledge, our findings outperform the current state-of-the-art models for Arabic Named Entity Recognition on social media.
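
    A minimal sketch of the tagger architecture described above, with assumed hyperparameters: character- and word-level embeddings are concatenated at the embedding layer, a BiLSTM encodes the sequence, and multi-head self-attention captures local context. The CRF decoding layer is omitted here for brevity; the final linear layer produces the emission scores a CRF would consume.

```python
import torch
import torch.nn as nn

class BiLstmAttnTagger(nn.Module):
    def __init__(self, vocab, char_vocab, n_tags,
                 word_dim=100, char_dim=50, hidden=128, heads=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, word_dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_dim, batch_first=True)
        self.lstm = nn.LSTM(word_dim + char_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_tags)  # emission scores for a CRF

    def forward(self, words, chars):
        # chars: (batch, seq_len, max_chars); use the final char-LSTM state
        # of each word as its character-level representation.
        b, s, c = chars.shape
        _, (h, _) = self.char_lstm(self.char_emb(chars.view(b * s, c)))
        char_repr = h[-1].view(b, s, -1)
        x = torch.cat([self.word_emb(words), char_repr], dim=-1)
        enc, _ = self.lstm(x)
        attended, _ = self.attn(enc, enc, enc)
        return self.out(attended)
```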

  • 49. Bengoetxea, Kepa
    et al.
    Agirre, Eneko
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Zhang, Yue
    Gojenola, Koldo
    On WordNet Semantic Classes and Dependency Parsing2014In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2014, p. 649-655Conference paper (Refereed)
  • 50.
    Bengtsson, Camilla
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    Borin, Lars
    Oxhammar, Henrik
    Comparing and combining part-of-speech taggers for multilingual parallel corpora2000Article in journal (Other scientific)