Uppsala University Publications
1 - 36 of 36
  • 1.
    af Geijerstam, Åsa
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Att skriva i naturorienterande ämnen i skolan (2006). Doctoral thesis, monograph (Other academic)
    Abstract [en]

    When children encounter new subjects in school, they are also faced with new ways of using language. Learning science thus means learning the language of science, and writing is one of the ways this is accomplished. The present study investigates writing in natural sciences in grades 5 and 8 in Swedish schools. Major theoretical influences for these investigations are found within the socio-cultural, dialogical and social semiotic perspectives on language use.

    The study is based on texts written by 97 students, interviews about these texts, and observations from 16 different classroom practices. Writing is seen as a situated practice; the analysis therefore also covers the activities surrounding the texts. The student texts are analysed in terms of genre and in relation to their abstraction, density and use of expansions. Among other things, the analysis shows that abstraction and density increase with age, whereas text structure and the use of expansions do not.

    It is also argued that a central point in school writing must be the students’ way of talking about their texts. Analysis of interviews with the students is thus carried out in terms of text movability. The results from this analysis indicate that students find it difficult to talk about their texts. They find it hard to express the main content of the text, as well as to discuss its function and potential readers.

    Previous studies argue that writing is a potential tool for learning. In the material studied in this thesis, this potential is not exploited to any large extent. To be able to participate in the natural sciences at higher levels, students need to take part in practices where the specialized language of natural science is used in writing as well as in speech.

  • 2.
    Basirat, Ali
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Principal Word Vectors (2018). Doctoral thesis, monograph (Other academic)
    Abstract [en]

    Word embedding is a technique for associating the words of a language with real-valued vectors, enabling us to use algebraic methods to reason about their semantic and grammatical properties. This thesis introduces a word embedding method called principal word embedding, which makes use of principal component analysis (PCA) to train a set of word embeddings for words of a language. The principal word embedding method involves performing a PCA on a data matrix whose elements are the frequencies with which words occur in different contexts. We address two challenges that arise in the application of PCA to create word embeddings. The first challenge is related to the size of the data matrix on which PCA is performed and affects the efficiency of the word embedding method. The data matrix is usually a large matrix that requires a very large amount of memory and CPU time to process. The second challenge is related to the distribution of word frequencies in the data matrix and affects the quality of the word embeddings. We provide an extensive study of the distribution of the elements of the data matrix and show that it is unsuitable for PCA in its unmodified form.

    We overcome the two challenges in principal word embedding by using a generalized PCA method. The problem with the size of the data matrix is mitigated by a randomized singular value decomposition (SVD) procedure, which improves the performance of PCA on the data matrix. The data distribution is reshaped by an adaptive transformation function, which makes it more suitable for PCA. These techniques, together with a weighting mechanism that generalizes many different weighting and transformation approaches used in the literature, enable the principal word embedding method to train high-quality word embeddings in an efficient way.

    We also provide a study on how principal word embedding is connected to other word embedding methods. We compare it to a number of word embedding methods and study how the two challenges in principal word embedding are addressed in those methods. We show that the other word embedding methods are closely related to principal word embedding and, in many instances, they can be seen as special cases of it.

    The principal word embeddings are evaluated in both intrinsic and extrinsic ways. The intrinsic evaluations are directed towards the study of the distribution of word vectors. The extrinsic evaluations measure the contribution of principal word embeddings to some standard NLP tasks. The experimental results confirm that the newly proposed features of principal word embedding (i.e., the randomized SVD algorithm, the adaptive transformation function, and the weighting mechanism) are beneficial to the method and lead to significant improvements in the results. A comparison between principal word embedding and other popular word embedding methods shows that, in many instances, the proposed method is able to generate word embeddings that are better than or as good as other word embeddings while being faster than several popular word embedding methods.
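    The pipeline described above — a word–context frequency matrix, a transformation to reshape the skewed count distribution, and PCA accelerated by a randomized SVD — can be sketched in a few lines. This is an illustrative sketch only: the toy corpus, the log1p transformation, and the fixed oversampling constant are stand-ins for the adaptive transformation and weighting mechanisms the thesis actually proposes.

```python
import numpy as np

def count_matrix(corpus, window=1):
    """Build a word-context co-occurrence frequency matrix from tokenized sentences."""
    vocab = sorted({w for sent in corpus for w in sent})
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        for i, w in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    M[idx[w], idx[sent[j]]] += 1
    return M, vocab

def principal_embeddings(M, dim=2, oversample=4, seed=0):
    """PCA via a randomized SVD: sketch the range of the transformed, centred
    data matrix with a random projection, then take an exact SVD of the small
    projected matrix instead of the full one."""
    rng = np.random.default_rng(seed)
    X = np.log1p(M)            # reshape the skewed count distribution (illustrative choice)
    X = X - X.mean(axis=0)     # centre columns, as PCA requires
    Q, _ = np.linalg.qr(X @ rng.standard_normal((X.shape[1], dim + oversample)))
    _, _, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return X @ Vt[:dim].T      # word vectors: data projected on the top principal directions
```

Words that occur in similar contexts (here, "cat" and "dog") receive nearby vectors; the randomized projection keeps the expensive SVD at the target dimensionality rather than the vocabulary size.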

  • 3.
    Björk, Ingrid
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Relativizing linguistic relativity: Investigating underlying assumptions about language in the neo-Whorfian literature (2008). Doctoral thesis, monograph (Other academic)
    Abstract [en]

    This work concerns the linguistic relativity hypothesis, also known as the Sapir-Whorf hypothesis, which, in its most general form, claims that ‘language’ influences ‘thought’. Past studies into linguistic relativity have treated various aspects of both thought and language, but a growing body of literature has recently emerged, in this thesis referred to as neo-Whorfian, that empirically investigates thought and language from a cross-linguistic perspective and claims that the grammar or lexicon of a particular language influences the speakers’ non-linguistic thought.

    The present thesis examines the assumptions about language that underlie this claim and criticizes the neo-Whorfian arguments from the point of view that they are based on misleading notions of language. The critique focuses on the operationalization of thought, language, and culture as separate variables in the neo-Whorfian empirical investigations. The neo-Whorfian studies explore language primarily as ‘particular languages’ and investigate its role as a variable standing in a causal relation to the ‘thought’ variable. Thought is separately examined in non-linguistic tests and found to ‘correlate’ with language.

    As a contrast to the neo-Whorfian view of language, a few examples of other approaches to language, referred to in the thesis as sociocultural approaches, are reviewed. This perspective on language places emphasis on practice and communication rather than on particular languages, which are viewed as secondary representations. It is argued that from a sociocultural perspective, language as an integrated practice cannot be separated from thought and culture. The empirical findings in the neo-Whorfian studies need not be rejected, but they should be interpreted differently. The findings of linguistic and cognitive diversity reflect different communicational practices in which language cannot be separated from non-language.

  • 4.
    Dahlqvist, Bengt
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nordenfors, Mikael
    Inst för svenska språket, Göteborgs universitet.
    Using the Text Processing Tool Textin to Examine Developmental Aspects of School Texts (2008). In: Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein / [ed] Nivre, Joakim, Dahllöf, Mats, Megyesi, Beáta, Uppsala: Uppsala universitet, 2008, p. 61-76. Chapter in book (Other academic)
    Abstract [en]

    The purpose of this article is first to give a brief presentation of the functions of the web-based text processing tool Textin 1.2, and then to illustrate these functions by putting the program to use within an ongoing research project concerning developmental aspects of texts written by Swedish pupils during school years 5 to 9. The text begins with a brief description of Textin's main functions, and then moves on to previous research on school texts where computational linguistic methods either were used or could have been used had the technology been available at the time. The article then continues with a presentation of the results that Textin delivers, and ends with a discussion of these findings.

  • 5.
    de Lhoneux, Miryam
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Linguistically Informed Neural Dependency Parsing for Typologically Diverse Languages (2019). Doctoral thesis, monograph (Other academic)
    Abstract [en]

    This thesis presents several studies in neural dependency parsing for typologically diverse languages, using treebanks from Universal Dependencies (UD). The focus is on informing models with linguistic knowledge. We first extend a parser to work well on typologically diverse languages, including morphologically complex languages and languages whose treebanks have a high ratio of non-projective sentences, a notorious difficulty in dependency parsing. We propose a general methodology where we sample a representative subset of UD treebanks for parser development and evaluation. Our parser uses recurrent neural networks which construct information sequentially, and we study the incorporation of a recursive neural network layer in our parser. This follows the intuition that language is hierarchical. This layer turns out to be superfluous in our parser and we study its interaction with other parts of the network. We subsequently study transitivity and agreement information learned by our parser for auxiliary verb constructions (AVCs). We suggest that a parser should learn similar information about AVCs as it learns for finite main verbs. This is motivated by work in theoretical dependency grammar. Our parser learns different information about these two if we do not augment it with a recursive layer, but similar information if we do, indicating that there may be benefits from using that layer and we may not yet have found the best way to incorporate it in our parser. We finally investigate polyglot parsing. Training one model for multiple related languages leads to substantial improvements in parsing accuracy over a monolingual baseline. We also study different parameter sharing strategies for related and unrelated languages. Sharing parameters that partially abstract away from word order appears to be beneficial in both cases but sharing parameters that represent words and characters is more beneficial for related than unrelated languages.
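    One of the difficulties named above, non-projectivity, has a concrete definition: a dependency tree is non-projective exactly when two of its arcs cross. A minimal check — assuming the common treebank convention of 1-based head indices with 0 for the root, and using brute-force pairwise comparison for clarity — might look like this:

```python
def is_projective(heads):
    """Check whether a dependency tree is projective, i.e. has no crossing arcs.
    `heads[i]` is the 1-based head of word i+1, with 0 denoting the root."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, j in arcs:
        for k, l in arcs:
            if i < k < j < l:  # arc (k, l) starts inside (i, j) but ends outside: they cross
                return False
    return True
```

Treebanks with a high ratio of sentences failing this test are the ones for which projective parsing algorithms systematically lose arcs.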

  • 6.
    Dubremetz, Marie
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Detecting Rhetorical Figures Based on Repetition of Words: Chiasmus, Epanaphora, Epiphora (2017). Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    This thesis deals with the detection of three rhetorical figures based on repetition of words: chiasmus (“Fair is foul, and foul is fair.”), epanaphora (“Poor old European Commission! Poor old European Council.”) and epiphora (“This house is mine. This car is mine. You are mine.”). For a computer, locating all repetitions of words is trivial, but locating just those repetitions that achieve a rhetorical effect is not. How can we make this distinction automatically?

     First, we propose a new definition of the problem. We observe that rhetorical figures are a graded phenomenon, with universally accepted prototypical cases, equally clear non-cases, and a broad range of borderline cases in between. This makes it natural to view the problem as a ranking task rather than a binary detection task. We therefore design a model for ranking candidate repetitions in terms of decreasing likelihood of having a rhetorical effect, which allows potential users to decide for themselves where to draw the line with respect to borderline cases.

     Second, we address the problem of collecting annotated data to train the ranking model. Thanks to a selective method of annotation, we reduce the annotation work for chiasmus by three orders of magnitude, and the work for epanaphora and epiphora by one order of magnitude. In this way, we prove that it is feasible to develop a system for detecting the three figures without an insurmountable amount of human work.

     Finally, we propose an evaluation scheme and apply it to our models. The evaluation reveals that, even with a very incompletely annotated corpus, a system for repetitive figure detection can be trained to achieve reasonable accuracy. We investigate the impact of different linguistic features, including length, n-grams, part-of-speech tags, and syntactic roles, and find that different features are useful for different figures. We also apply the system to four different types of text: political discourse, fiction, titles of articles and novels, and quotations. Here the evaluation shows that the system is robust to shifts in genre and that the frequencies of the three rhetorical figures vary with genre.
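    The first stage of such a system — enumerating candidate repetitions, which vastly outnumber genuine figures — can be sketched for chiasmus as follows. The brute-force enumeration and the window size are illustrative simplifications; the ranking features the thesis actually uses (length, n-grams, part-of-speech tags, syntactic roles) are not shown.

```python
from itertools import combinations

def chiasmus_candidates(tokens, window=30):
    """Enumerate criss-cross repetitions A ... B ... B ... A within a token
    window. Every such pattern is a candidate; a trained model (not shown)
    would then rank candidates by likelihood of rhetorical effect."""
    lower = [t.lower() for t in tokens]
    cands = []
    for i, j, k, l in combinations(range(len(tokens)), 4):
        if l - i > window:
            continue
        if lower[i] == lower[l] and lower[j] == lower[k] and lower[i] != lower[j]:
            cands.append((tokens[i], tokens[j], tokens[k], tokens[l]))
    return cands
```

Even the seven-word prototype "Fair is foul, and foul is fair" yields three candidate patterns, only one of which is the intended figure — which is why ranking, rather than binary detection, is the natural formulation.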

    List of papers
    1. Rhetorical Figure Detection: the Case of Chiasmus
    2015 (English). In: Proceedings of the Fourth Workshop on Computational Linguistics for Literature, 2015, p. 23-31. Conference paper, Published paper (Refereed)
    National Category
    Language Technology (Computational Linguistics)
    Research subject
    Computational Linguistics
    Identifiers
    urn:nbn:se:uu:diva-268899 (URN)
    Conference
    Fourth Workshop on Computational Linguistics for Literature Denver, Colorado, USA
    2. Syntax Matters for Rhetorical Structure: The Case of Chiasmus
    2016 (English). In: Proceedings of the Fifth Workshop on Computational Linguistics for Literature, 2016, p. 47-53. Conference paper, Published paper (Refereed)
    National Category
    Language Technology (Computational Linguistics)
    Research subject
    Computational Linguistics
    Identifiers
    urn:nbn:se:uu:diva-310579 (URN)
    Conference
    Fifth Workshop on Computational Linguistics for Literature
    3. Machine Learning for Rhetorical Figure Detection: More Chiasmus with Less Annotation
    2017 (English). In: Proceedings of the 21st Nordic Conference of Computational Linguistics, Gothenburg, Sweden, 2017, p. 37-45. Conference paper, Published paper (Refereed)
    Place, publisher, year, edition, pages
    Gothenburg, Sweden, 2017
    National Category
    Language Technology (Computational Linguistics)
    Research subject
    Computational Linguistics
    Identifiers
    urn:nbn:se:uu:diva-333324 (URN)
    Conference
    21st Nordic Conference on Computational Linguistics, NoDaLiDa, 22-24 May 2017, Gothenburg, Sweden
    4. Rhetorical Figure Detection: Chiasmus, Epanaphora, Epiphora
    (English). In: Frontiers in Digital Humanities, E-ISSN 2297-2668. Article in journal (Refereed), submitted
    National Category
    Language Technology (Computational Linguistics)
    Research subject
    Computational Linguistics
    Identifiers
    urn:nbn:se:uu:diva-333348 (URN)
  • 7.
    Edling, Agnes
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Abstraction and authority in textbooks: The textual paths towards specialized language (2006). Doctoral thesis, monograph (Other academic)
    Abstract [en]

    During a few hours of a school day, a student might read textbook texts which are highly diversified in terms of abstraction. Abstraction is a central feature of specialized language and the transition from everyday language to specialized language is one of the most important things formal education can offer students. That transition is the focus of this thesis.

    Based on a discussion of the concept of abstraction, this study introduces a new three-level classification comprising specificity, generalization and abstraction. The investigations performed, based on this classification, show that texts from different subject areas display distinct patterns of abstraction. The Swedish literary texts had the lowest degree of abstraction, the social science texts had an intermediate degree and the natural science texts were the most generalized and abstract. The results also show that the degree of abstraction in the textbook texts increases at later grade levels.

    The thesis presents a new way of analyzing shifts between levels of abstraction and their functions. Interestingly, the texts with a medium degree of abstraction, the social science texts, are the ones with the greatest variety in shifts. The functions of the shifts differ with respect to cultural domains. The shifts in the Swedish literary texts in general belong to the everyday domain while the shifts in the natural science texts belong to a specialized domain. The shifts in the social science texts had features of both domains.

    A secondary aim of the thesis is to develop the understanding of the relationship between author and reader in the texts. The results from my investigation of modality in the Swedish textbook texts confirm the earlier findings from English and Spanish textbooks. In comparison to other text types, textbook texts present knowledge in a more authoritative and less modalized way.

    Abstraction is sometimes described as a feature that hinders students from accessing texts. Some researchers even suggest removing features of specialized language from textbook texts in order to increase students’ understanding. However, in a society where specialized knowledge is necessary, access to specialized texts is important. A democratic view of education demands that children and adolescents be given the opportunity to encounter, and learn to engage with, specialized language in school. In analyzing the texts, special attention is paid to the relationship between the texts, their contexts of use and the student readers.

  • 8.
    Falk, Angela
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of English.
    Narrative Patterns in Monolingual and Bilingual Life-History Conversations (2009). In: Multilingualism: Proceedings of the 23rd Scandinavian Conference of Linguistics, Uppsala University, 1-3 October 2008 / [ed] Anju Saxena, Åke Viberg, Uppsala: Acta Universitatis Upsaliensis, 2009, p. 159-169. Conference paper (Refereed)
  • 9.
    Folkeryd, Jenny W.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Writing with an Attitude: Appraisal and student texts in the school subject of Swedish (2006). Doctoral thesis, monograph (Other academic)
    Abstract [en]

    Learning in school is in many respects done through language. However, it has been shown that the language of school assignments is seldom explicitly discussed in school. Writing tasks are furthermore assigned without clear guidelines for how certain lexical choices make one text more powerful than another. The present study is a contribution to a linguistic and pedagogical discussion of student writing. More specifically the focus is on the use of evaluative language in texts written by students in the school subject of Swedish in grades 5, 8 and 11.

    The major investigations of the study are conducted within the theoretical framework of Appraisal. An overview is given of the language resources used in the student texts for constructing emotion, judging behavior in ethical terms and valuing objects aesthetically. Another question addressed is how attitudinal meaning is intensified, creating greater or lesser degrees of positivity or negativity associated with the feelings.

    The results show that manifestations of attitude are found in practically all texts in the study. However, variations are noted in relation to different genres, age, proficiency level, language background and gender.

    A contribution of the study in relation to the theoretical framework upon which it draws is an extension of the system of Attitude as well as an identification of different patterns in the use of attitudinal resources. These patterns are furthermore discussed in relation to how students talk about their own written production in terms of text movability. Results indicate that students with a high degree of text movability also use attitudinal resources to a large extent.

    It is argued that applying the linguistic tool of Appraisal can facilitate a discussion of how to make one aspect of the hidden curriculum more visible, namely, how to write with an Attitude.

  • 10.
    Forsbom, Eva
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Good tag hunting: tagability of Granska tags (2008). In: Resourceful language technology: festschrift in honor of Anna Sågvall Hein, Uppsala: Acta Universitatis Upsaliensis, 2008, p. 77-85. Chapter in book (Other (popular science, discussion, etc.))
  • 11.
    Hardmeier, Christian
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Discourse in Statistical Machine Translation (2014). Doctoral thesis, monograph (Other academic)
    Abstract [en]

    This thesis addresses the technical and linguistic aspects of discourse-level processing in phrase-based statistical machine translation (SMT). Connected texts can have complex text-level linguistic dependencies across sentences that must be preserved in translation. However, the models and algorithms of SMT are pervaded by locality assumptions. In a standard SMT setup, no model has more complex dependencies than an n-gram model. The popular stack decoding algorithm exploits this fact to implement efficient search with a dynamic programming technique. This is a serious technical obstacle to discourse-level modelling in SMT.

    From a technical viewpoint, the main contribution of our work is the development of a document-level decoder based on stochastic local search that translates a complete document as a single unit. The decoder starts with an initial translation of the document, created randomly or by running a stack decoder, and refines it with a sequence of elementary operations. After each step, the current translation is scored by a set of feature models with access to the full document context and its translation. We demonstrate the viability of this decoding approach for different document-level models.

    From a linguistic viewpoint, we focus on the problem of translating pronominal anaphora. After investigating the properties and challenges of the pronoun translation task both theoretically and by studying corpus data, a neural network model for cross-lingual pronoun prediction is presented. This network jointly performs anaphora resolution and pronoun prediction and is trained on bilingual corpus data only, with no need for manual coreference annotations. The network is then integrated as a feature model in the document-level SMT decoder and tested in an English–French SMT system. We show that the pronoun prediction network model more adequately represents discourse-level dependencies for less frequent pronouns than a simpler maximum entropy baseline with separate coreference resolution.

    By creating a framework for experimenting with discourse-level features in SMT, this work contributes to a long-term perspective that strives for more thorough modelling of complex linguistic phenomena in translation. Our results on pronoun translation shed new light on a challenging, but essential problem in machine translation that is as yet unsolved.
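    The document-level decoding idea described above — keep a full document translation as the search state, apply elementary operations, and accept changes that improve a document-level score — can be caricatured in a few lines. Everything here is an illustrative simplification: the per-sentence candidate lists, the toy consistency score, and plain hill-climbing stand in for the stochastic local search and feature models of the actual decoder.

```python
import random

def local_search_decode(options, score, iters=500, seed=0):
    """Toy document-level decoder. `options[i]` lists candidate translations
    for sentence i; `score` sees the whole document at once, so it can model
    cross-sentence dependencies that sentence-by-sentence stack decoding cannot.
    The elementary operation replaces one sentence's translation; a change is
    kept if the document score improves."""
    rng = random.Random(seed)
    doc = [rng.choice(opts) for opts in options]  # random initial translation
    best = score(doc)
    for _ in range(iters):
        i = rng.randrange(len(doc))
        cand = doc[:i] + [rng.choice(options[i])] + doc[i + 1:]
        s = score(cand)
        if s > best:
            doc, best = cand, s
    return doc
```

With a score that rewards translating a repeated source word the same way in every sentence — exactly the kind of dependency an n-gram-local model cannot express — the search converges on a lexically consistent document.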

  • 12.
    Johansson, Christine
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of English.
    Geisler, Christer
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of English.
    The Uppsala Learner English Corpus: A new corpus of Swedish high school students' writing (2009). In: Multilingualism: Proceedings of the 23rd Scandinavian Conference of Linguistics, Uppsala University, 1-3 October 2008 / [ed] Anju Saxena & Åke Viberg, Uppsala: Acta Universitatis Upsaliensis, 2009, p. 181-190. Conference paper (Other academic)
  • 13.
    Liberg, Caroline
    et al.
    Uppsala University, Faculty of Educational Sciences, Department of Curriculum Studies.
    Forsbom, Eva
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Text and language in assessment of mathematics and science (2009). In: Multilingualism: Proceedings of the 23rd Scandinavian Conference of Linguistics / [ed] Anju Saxena, Åke Viberg, Uppsala: Uppsala universitet, 2009, p. 328-332. Conference paper (Refereed)
  • 14.
    Lindgren, Josefin
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Developing narrative competence: Swedish, Swedish-German and Swedish-Turkish children aged 4–6 (2018). Doctoral thesis, monograph (Other academic)
    Abstract [en]

    This thesis investigates the development of oral narrative competence from age 4 to 6 in Swedish monolinguals (N=72) and in both languages of Swedish-German (N=46) and Swedish-Turkish (N=48) bilinguals growing up in Sweden. Picture-based fictional narratives were elicited with Cat/Dog and Baby Birds/Baby Goats from the Multilingual Assessment Instrument for Narratives (MAIN, Gagarina et al. 2012) and A2/B2 from the Edmonton Narrative Norms Instrument (ENNI, Schneider et al., 2005). Vocabulary, character introduction and narrative macrostructure were studied. Vocabulary production scores on Cross-linguistic lexical tasks (CLTs, Haman et al., 2015) were compared to NDW (number of different words) in narratives. Production of macrostructural components, macrostructural complexity, and answers to comprehension questions were analyzed. Effects of age and differences in performance between groups, between the bilinguals’ two languages, and between narrative tasks were investigated.

    Narrative comprehension was already high at age 4, but still developed substantially with age. In contrast, macrostructure in narrative production was at a rudimentary level at age 4. Even at age 6, the narratives contained few complete episodic structures. Children mainly included actions visible in the stimuli and rarely verbalized goals and other macrostructural components that required inferencing. The ability to introduce story characters appropriately developed strongly from age 4 to 6, but stimuli had a large effect on performance. Vocabulary showed most improvement from age 5 to 6. Development with age was clearer for the majority language Swedish than for the minority languages German and Turkish, where individual variation was larger.

    In Swedish, pronounced differences were found between the bilingual groups. The Swedish-German bilinguals performed similarly to the monolinguals. On most measures, the Swedish-Turkish bilinguals performed lower than the other two groups, though precisely how much varied across measures. Generally, the Swedish-German children performed better in Swedish than in German, whereas the Swedish-Turkish children performed similarly in both languages or slightly higher in Turkish. The study shows that bilinguals’ two languages need not develop in parallel, and that results depend on the tasks and specific measures used. Bilingual groups differ from each other, and it is therefore not meaningful to compare all bilinguals to all monolinguals.


  • 15.
    Megyesi, Beata
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Dahlqvist, Bengt
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Pettersson, Eva
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Gustafson-Capkova, Sofia
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Supporting Research Environment for Less Explored Languages: A Case Study of Swedish and Turkish (2008). In: Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein / [ed] Nivre, Joakim, Dahllöf, Mats, Megyesi, Beáta, Uppsala: Uppsala universitet, 2008, p. 96-110. Chapter in book (Other academic)
  • 16.
    Nilsson, Mattias
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Computational Models of Eye Movements in Reading: A Data-Driven Approach to the Eye-Mind Link2012Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    This thesis investigates new methods for understanding eye movement behavior in reading based on the use of eye tracking corpora and data-driven modeling. Eye movement behavior is characterized by two basic, generally unconscious, decisions: where and when to move the eyes. We explore the idea that empirical eye movement data carries rich information about the processes that guide these decisions. Two methods are investigated, each addressing a different aspect of eye movements in reading. The role of prediction in eye movement modeling is emphasized, and new evaluation methods for assessing the predictive accuracy of models are proposed. 

    The decision of where to move the eyes is approached using standard machine learning methods. The model proposed learns where to move the eyes under different conditions associated with the words being read. Applied to new text, the model moves the eyes in ways it has learnt, showing characteristics similar to human readers. Furthermore, we propose the use of entropy to measure the similarity between observed and predicted eye movement behavior on held-out data. The main contribution is a flexible model, with few fixed parameters, that can be used to investigate decisions about where the eyes move during reading.     

    The decision of when to move the eyes is approached using time-to-event modeling (survival analysis). The model proposed learns the timing of eye movements under different conditions associated with the words being read. Applied to new text, the model estimates the probability that a fixation survives for any given length of time. We propose an entropy-related measure to assess the probabilistic temporal predictions of the model. The main contribution is the use of Cox hazards modeling to address questions about the strength, as well as the timing, of processes that influence the decision of when to move the eyes during reading.
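    The time-to-event approach described above can be made concrete with a toy empirical survival function over fixation durations. The durations below are invented, and the thesis fits Cox hazards models to eye-tracking corpora; this sketch only illustrates the basic survival-function idea of estimating the probability that a fixation lasts longer than a given time.

```python
# Toy empirical survival function for fixation durations (in ms).
# The data values are invented; the thesis uses Cox hazards models
# on eye-tracking corpora, which this sketch does not reproduce.

def survival(durations, t):
    """Estimate P(fixation lasts longer than t ms) from observed data."""
    return sum(1 for d in durations if d > t) / len(durations)

fixations = [180, 210, 240, 200, 320, 150, 260, 230]  # hypothetical sample

for t in (150, 200, 250, 300):
    print(t, survival(fixations, t))
```

A full model would condition this probability on word-level covariates (frequency, predictability) rather than pooling all fixations, which is what the Cox hazards framework provides.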

    List of papers
    1. Learning Where to Look: Modeling Eye Movements in Reading
    2009 (English)In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL), Association for Computational Linguistics , 2009, p. 93-101Conference paper, Published paper (Refereed)
    Place, publisher, year, edition, pages
    Association for Computational Linguistics, 2009
    National Category
    Language Technology (Computational Linguistics)
    Research subject
    Computational Linguistics
    Identifiers
    urn:nbn:se:uu:diva-111983 (URN)978-1-932432-29-9 (ISBN)
    Conference
    Thirteenth Conference on Computational Natural Language Learning (CoNLL), Boulder, USA, June 4-5, 2009
    Available from: 2010-01-05 Created: 2010-01-05 Last updated: 2018-01-12Bibliographically approved
    2. Towards a Data-Driven Model of Eye Movement Control in Reading
    2010 (English)In: Proceedings of the ACL Workshop on Cognitive Modeling and Computational Linguistics, Association for Computational Linguistics , 2010Conference paper, Published paper (Refereed)
    Place, publisher, year, edition, pages
    Association for Computational Linguistics, 2010
    National Category
    Language Technology (Computational Linguistics)
    Research subject
    Computational Linguistics
    Identifiers
    urn:nbn:se:uu:diva-129662 (URN)
    Conference
    ACL 2010, Uppsala, Sweden, July 11–16, 2010
    Available from: 2010-08-20 Created: 2010-08-20 Last updated: 2018-01-12Bibliographically approved
    3. Entropy-Driven Evaluation of Models of Eye Movement Control in Reading
    2011 (English)In: Proceedings of the 8th International NLPCS Workshop, 2011, p. 201-212Conference paper, Published paper (Refereed)
    National Category
    Language Technology (Computational Linguistics)
    Research subject
    Computational Linguistics
    Identifiers
    urn:nbn:se:uu:diva-167395 (URN)
    Conference
    The 8th International NLPCS Workshop, Copenhagen, Denmark, August 20–21, 2011
    Available from: 2012-01-26 Created: 2012-01-26 Last updated: 2018-01-12Bibliographically approved
    4. A Survival Analysis of Fixation Times in Reading
    2011 (English)In: Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics (ACL 2011), 2011, p. 107-115Conference paper, Published paper (Refereed)
    National Category
    General Language Studies and Linguistics
    Research subject
    Computational Linguistics
    Identifiers
    urn:nbn:se:uu:diva-167396 (URN)
    Conference
    The 2nd Workshop on Cognitive Modeling and Computational Linguistics (ACL 2011), Portland, Oregon, USA, June 23
    Available from: 2012-01-26 Created: 2012-01-26 Last updated: 2018-01-12Bibliographically approved
    5. Time-Varying Effects on Eye Movements during Reading
    (English)Manuscript (preprint) (Other academic)
    National Category
    Language Technology (Computational Linguistics)
    Research subject
    Computational Linguistics
    Identifiers
    urn:nbn:se:uu:diva-167397 (URN)
    Available from: 2012-01-26 Created: 2012-01-26 Last updated: 2018-01-12
  • 17.
    Nivre, Joakim
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Dahllöf, Mats
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein2008Collection (editor) (Other academic)
    Abstract [en]

    As the first holder of the first chair in computational linguistics in Sweden, Anna Sågvall Hein has played a central role in the development of computational linguistics and language technology both in Sweden and on the international scene. Besides her valuable contributions to research, which include work on machine translation, syntactic parsing, grammar checking, word prediction, and corpus linguistics, she has been instrumental in establishing a national graduate school in language technology as well as an undergraduate program in language technology at Uppsala University. It is with great pleasure that we present her with this Festschrift to honor her lasting contributions to the field and to commemorate her retirement from the chair in computational linguistics at Uppsala University. The contributions to the Festschrift come from Anna’s friends and colleagues around the world and deal with many of the topics that are dear to her heart. A common theme in many of the articles, as well as in Anna’s own scientific work, is the design, development and use of adequate language technology resources, epitomized in the title Resourceful Language Technology.

  • 18.
    Nivre, Joakim
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Gustafson-Capkova, Sofia
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Salomonsson, Filip
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Cultivating a Swedish Treebank2008In: Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein., Uppsala: Acta Universitatis Upsaliensis , 2008, p. 111-120Chapter in book (Other academic)
  • 19.
    Okati, Farideh
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    The Vowel Systems of Five Iranian Balochi Dialects2012Doctoral thesis, monograph (Other academic)
    Abstract [en]

    The vowel systems of five selected Iranian Balochi dialects are investigated in this study, which is the first work to apply empirical acoustic analysis to a large body of recorded data on the vowel inventories of different Balochi dialects spoken in Iran. The selected dialects are spoken in the five regions of Sistan (SI), Saravan (SA), Khash (KH), Iranshahr (IR), and Chabahar (CH) located in the province Sistan and Baluchestan in southeast Iran. The aim of the present fieldwork-based survey is to study how similar the vowel systems of these dialects are to the Common Balochi vowel system (i, iː, u, uː, a, aː, eː, oː), which is represented as the vowel inventory for the Balochi dialects in general, as well as how similar these dialects are to one another. 

    The investigation shows that length is contrastive in these dialects, although the durational differences between the long and short counterparts are quite small in some dialects. The study also reveals that there are some differences between the vowel systems of these dialects and the Common Balochi sound inventory. The Common Balochi short /i/ vowel is modified to short /e/ in these dialects, and a strong tendency for the long /eː/ and /oː/ to become the diphthongs ie and ue, respectively, is observed in some of the investigated dialects, specifically in KH, which shows heavier diphthongization than the other dialects. It is also observed, especially in SI, SA, and CH, that the short /u/ shows strong tendencies to shift towards a lower position of an [o] vowel. In SI and SA, this shift seems to be a correlate of syllable structure, with lowering occurring mostly in closed syllables. It is possible that Persian, as the dominant language in the area, has had an influence on these dialects and caused a lowering tendency among the higher vowels. 

    The vowel systems in these dialects differ slightly from each other. Phonemically, the pairs e/eː, a/aː, u/uː, and the long vowels /iː/ and /oː/ are suggested for IR; the pairs a/aː, u/uː, the short /e/ and the long /iː/ as well as the diphthongs /ie/ and /ue/ substituted for the long /eː/ and /oː/, respectively, are suggested for KH; and finally the pairs e/eː, a/aː, o/oː, and the long vowels /iː/ and /uː/, which make a more symmetrical inventory, are suggested for the SI, SA, and CH dialects. In general, the vowels in these dialects show a range of phonetic variations. In addition, processes of fronting, which is most common in coronal contexts, and nasalization, which mostly occurs in nasal environments, are observed in the data researched. 

  • 20.
    Pettersson, Eva
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Spelling Normalisation and Linguistic Analysis of Historical Text for Information Extraction2016Doctoral thesis, monograph (Other academic)
    Abstract [en]

    Historical text constitutes a rich source of information for historians and other researchers in humanities. Many texts are however not available in an electronic format, and even if they are, there is a lack of NLP tools designed to handle historical text. In my thesis, I aim to provide a generic workflow for automatic linguistic analysis and information extraction from historical text, with spelling normalisation as a core component in the pipeline. In the spelling normalisation step, the historical input text is automatically normalised to a more modern spelling, enabling the use of existing taggers and parsers trained on modern language data in the succeeding linguistic analysis step. In the final information extraction step, certain linguistic structures are identified based on the annotation labels given by the NLP tools, and ranked in accordance with the specific information need expressed by the user.

    An important consideration in my implementation is that the pipeline should be applicable to different languages, time periods, genres, and information needs by simply substituting the language resources used in each module. Furthermore, the reuse of existing NLP tools developed for the modern language is crucial, considering the lack of linguistically annotated historical data combined with the high variability in historical text, making it hard to train NLP tools specifically aimed at analysing historical text.

    In my evaluation, I show that spelling normalisation can be a very useful technique for easy access to historical information content, even in cases where there is little (or no) annotated historical training data available. For the specific information extraction task of automatically identifying verb phrases describing work in Early Modern Swedish text, 91 out of the 100 top-ranked instances are true positives in the best setting. 
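    The normalisation step at the core of the pipeline can be sketched as a simple lookup that maps historical spellings to modern forms before the text reaches a modern tagger or parser. The word pairs below are invented examples and do not reflect the thesis's actual (data-driven) normalisation component.

```python
# Minimal sketch of dictionary-based spelling normalisation:
# replace historical spellings with modern forms so that taggers
# and parsers trained on modern data can process the text.
# The historical/modern pairs are hypothetical illustrations.

NORM = {
    "hafva": "hava",   # invented historical -> modern mappings
    "gifva": "giva",
    "effter": "efter",
}

def normalise(tokens):
    """Return tokens with each known historical form modernised."""
    return [NORM.get(t, t) for t in tokens]

print(normalise(["hafva", "effter", "arbete"]))
```

In practice the thesis learns such mappings from data rather than listing them by hand, but the pipeline position is the same: normalised output feeds directly into off-the-shelf modern NLP tools.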

  • 21.
    Saers, Markus
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Translation as Linear Transduction: Models and Algorithms for Efficient Learning in Statistical Machine Translation2011Doctoral thesis, monograph (Other academic)
    Abstract [en]

    Automatic translation has seen tremendous progress in recent years, mainly thanks to statistical methods applied to large parallel corpora. Transductions represent a principled approach to modeling translation, but existing transduction classes are either not expressive enough to capture structural regularities between natural languages or too complex to support efficient statistical induction on a large scale. A common approach is to severely prune search over a relatively unrestricted space of transduction grammars. These restrictions are often applied at different stages in a pipeline, with the obvious drawback of committing to irrevocable decisions that should not have been made. In this thesis we will instead restrict the space of transduction grammars to a space that is less expressive, but can be efficiently searched.

    First, the class of linear transductions is defined and characterized. They are generated by linear transduction grammars, which represent the natural bilingual case of linear grammars, as well as the natural linear case of inversion transduction grammars (and higher order syntax-directed transduction grammars). They are recognized by zipper finite-state transducers, which are equivalent to finite-state automata with four tapes. By allowing this extra dimensionality, linear transductions can represent alignments that finite-state transductions cannot, and by keeping the mechanism free of auxiliary storage, they become much more efficient than inversion transductions.

    Secondly, we present an algorithm for parsing with linear transduction grammars that allows pruning. The pruning scheme imposes no restrictions a priori, but guides the search to potentially interesting parts of the search space in an informed and dynamic way. Being able to parse efficiently allows learning of stochastic linear transduction grammars through expectation maximization.

    All the above work would be for naught if linear transductions were too poor a reflection of the actual transduction between natural languages. We test this empirically by building systems based on the alignments imposed by the learned grammars. The conclusion is that stochastic linear inversion transduction grammars learned from observed data stand up well to the state of the art.

  • 22.
    Saxena, Anju
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    Corpora in grammar learning - Evaluation of ITG2008In: Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein, 2008Chapter in book (Other academic)
  • 23.
    Saxena, Anju
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Csató, Éva Ágnes
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology, Asian and African Languages and Cultures, Turkic languages.
    Dahlqvist, Bengt
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Using Parallel Corpora in Teaching and Research: The Swedish-Hindi-English and Swedish-Turkish-English Parallel Corpora2009In: Multilingualism: proceedings of the 23rd Scandinavian Conference of Linguistics : Uppsala University, 1-3 October 2008 / [ed] Anju Saxena, Åke Viberg, Uppsala: Acta Universitatis Upsaliensis , 2009Conference paper (Refereed)
  • 24.
    Saxena, Anju
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Viberg, Åke
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Multilingualism: Proceedings of the 23rd Scandinavian Conference of Linguistics, Uppsala University, 1-3 October 20082009Conference proceedings (editor) (Other academic)
  • 25.
    Seraji, Mojgan
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Morphosyntactic Corpora and Tools for Persian2015Doctoral thesis, monograph (Other academic)
    Abstract [en]

    This thesis presents open source resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, the resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian.

    In developing these resources and tools, two key requirements are observed: compatibility and reuse. The compatibility requirement encompasses two parts. First, the tools in the pipeline should be compatible with each other in such a way that the output of one tool is compatible with the input requirements of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in these. The reuse requirement means that all the components in the pipeline are developed by reusing resources, standard methods, and open source state-of-the-art tools. This is necessary to make the project feasible.

    Given these requirements, the thesis investigates two main research questions. The first is how we can develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse. The approach taken is to accept the tokenization variations in the corpora to achieve robustness. The tokenization variations in Persian texts are related to the orthographic variations of writing fixed expressions, as well as various types of affixes and clitics. Since these variations are inherent properties of Persian texts, it is important that the tools in the pipeline can handle them. Therefore, they should not be trained on idealized data.

    The second question concerns how accurately we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora. The experimental evaluation of the tools shows that the sentence segmenter and tokenizer achieve an F-score close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves a best labeled accuracy of over 82% (with unlabeled accuracy close to 87%).
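    The F-scores and accuracies cited above follow the standard precision/recall/F-score definitions, which can be computed directly from gold and predicted token spans. The spans in this sketch are invented, not taken from the Persian corpora.

```python
# Precision, recall and F-score for segmentation evaluation:
# compare predicted token spans against gold-standard spans.
# The (start, end) spans below are invented examples.

def f_score(gold, predicted):
    """Harmonic mean of precision and recall over matching spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)          # spans both agree on
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold_tokens = {(0, 4), (5, 7), (8, 13)}
pred_tokens = {(0, 4), (5, 7), (8, 10)}   # last token mis-segmented
print(round(f_score(gold_tokens, pred_tokens), 3))
```

An F-score "close to 100%" means the predicted span set almost exactly matches the gold set; tagging accuracy is the simpler fraction of tokens whose predicted label equals the gold label.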

  • 26.
    Serrander, Ulrika
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Bilingual lexical processing in single word production: Swedish learners of Spanish and the effects of L2 immersion2011Doctoral thesis, monograph (Other academic)
    Abstract [en]

    Bilingual speakers cannot suppress activation from their dominant language while naming pictures in a foreign and less dominant language. Previous research has revealed that this cross-language activation is manifested through phonological facilitation, semantic interference and between-language competition. However, this research is based exclusively on highly proficient bilinguals. The present study investigates cross-linguistic activation in Swedish learners of Spanish, grouped according to their length of Spanish immersion; one of the groups is in its very initial stages of learning. Participants named pictures in Spanish in two picture-word interference experiments, one with only non-cognates and one including cognates. This study addresses the following research questions: (1) do the two groups of participants differ significantly from one another in terms of cross-linguistic activation, (2) what does cross-language activation look like in the initial stages of L2 acquisition, and (3) how does cognate status affect cross-linguistic activation, and does this differ between participants depending on length of immersion?

    The experiments show that cross-linguistic influence is dependent on length of immersion. The more immersed participants performed very similarly to what is usually the case in highly proficient bilinguals while the less immersed participants did not. The results of the less immersed participants are interpreted as manifestations of lexical processing in initial stages of L2 acquisition. Since this type of learner has never been tested before, there are no previous results to compare to. The results are discussed in relation to the large tradition of offline research which has shown that beginning learners predominantly process their L2 phonologically, and that conceptual processing is something requiring more L2 development.

    Furthermore, the cognate words induced longer naming latencies in all participants, and it turned out that the cognate words were highly unfamiliar. Hence all participants are sensitive to word frequency effects, and this sensitivity is greater in early stages of learning. Finally, this study suggests that more research must be conducted to establish cross-linguistic influence between the many languages of multilingual subjects, even when these languages may not be present in the testing situation.

  • 27.
    Shao, Yan
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Segmenting and Tagging Text with Neural Networks2018Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    Segmentation and tagging of text are important preprocessing steps for higher-level natural language processing tasks. In this thesis, we apply a sequence labelling framework based on neural networks to various segmentation and tagging tasks, including sentence segmentation, word segmentation, morpheme segmentation, joint word segmentation and part-of-speech tagging, and named entity transliteration. We apply a general neural CRF model to different tasks by designing specific tag sets. In addition, we explore effective ways of representing input characters, such as utilising concatenated n-grams and sub-character features, and use ensemble decoding to mitigate the effects of random parameter initialisation.

    The segmentation and tagging models are evaluated in a truly multilingual setup with more than 70 datasets. The experimental results indicate that the proposed neural CRF model is effective for segmentation and tagging in general as state-of-the-art accuracies are achieved on datasets in different languages, genres, and annotation schemes for various tasks. For word segmentation, we propose several typological factors to statistically characterise the difficulties posed by different languages and writing systems. Based on this analysis, we apply language-specific settings to the segmentation system for higher accuracy. Our system achieves substantially better results on languages that are more difficult to segment when compared to previous work. Moreover, we investigate conventionally adopted evaluation metrics for segmentation tasks. We propose that precision should be excluded and using recall alone is more adequate for sentence segmentation and word segmentation. The segmentation and tagging tools implemented along with this thesis are publicly available as experimental frameworks for future development as well as preprocessing tools for higher-level NLP tasks.
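    The reformulation of segmentation as sequence labelling, which the abstract describes as designing specific tag sets for a general neural CRF, can be illustrated with the simplest such tag set: each character receives 'B' (begins a unit) or 'I' (inside a unit), and decoding converts tags back into units. The characters and tags here are invented examples; the tagger itself (the neural CRF) is not reproduced.

```python
# Segmentation as sequence labelling: a generic tagger predicts a
# tag per character ('B' = begins a word, 'I' = inside a word), and
# this decoding step turns the tag sequence back into words.
# Example characters and tags are invented.

def decode(chars, tags):
    """Group characters into words according to B/I tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "B" and current:
            words.append(current)
            current = ""
        current += ch
    if current:
        words.append(current)
    return words

print(decode(list("我爱北京"), ["B", "B", "B", "I"]))
```

Richer tag sets extend the same idea, e.g. attaching a part-of-speech label to each tag so that one labelling pass performs joint segmentation and POS tagging.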

    List of papers
    1. Boosting English-Chinese Machine Transliteration via High Quality Alignment and Multilingual Resources
    2015 (English)In: Proceedings of the Fifth Named Entity Workshop, Association for Computational Linguistics , 2015, p. 56-60Conference paper, Published paper (Refereed)
    Abstract [en]

    This paper presents our machine transliteration systems developed for the NEWS 2015 machine transliteration shared task. Our systems are applied to two tasks: English to Chinese and Chinese to English. For standard runs, in which only official data sets are used, we build phrase-based transliteration models with refined alignments provided by the M2M-aligner. For non-standard runs, we add multilingual resources to the systems designed for the standard runs and build different language specific transliteration systems. Linear regression is adopted to rerank the outputs afterwards, which significantly improves the overall transliteration performance.

    Place, publisher, year, edition, pages
    Association for Computational Linguistics, 2015
    National Category
    Language Technology (Computational Linguistics)
    Research subject
    Computational Linguistics
    Identifiers
    urn:nbn:se:uu:diva-268921 (URN)
    Conference
    Fifth Named Entity Workshop, joint with 53rd ACL and the 7th IJCNLP, July 31 2015, Beijing, China
    Available from: 2015-12-11 Created: 2015-12-11 Last updated: 2018-04-10Bibliographically approved
    2. Applying Neural Networks to English-Chinese Named Entity Transliteration
    2016 (English)In: Proceedings of the Sixth Named Entity Workshop, joint with 54th ACL, Association for Computational Linguistics, 2016Conference paper, Published paper (Refereed)
    Abstract [en]

    This paper presents the machine transliteration systems that we employ for our participation in the NEWS 2016 machine transliteration shared task. Based on the prevalent deep learning models developed for general sequence processing tasks, we use convolutional neural networks to extract character-level information from the transliteration units and stack a simple recurrent neural network on top for sequence processing. The systems are applied to the standard runs for both English to Chinese and Chinese to English transliteration tasks. Our systems achieve competitive results according to the official evaluation.

    Place, publisher, year, edition, pages
    Association for Computational Linguistics, 2016
    Keywords
    Transliteration, Neural Networks
    National Category
    Language Technology (Computational Linguistics)
    Identifiers
    urn:nbn:se:uu:diva-310209 (URN)
    Conference
    Sixth Named Entity Workshop, joint with 54th ACL
    Available from: 2016-12-12 Created: 2016-12-12 Last updated: 2018-04-10Bibliographically approved
    3. Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF
    2017 (English)In: Proceedings of the The 8th International Joint Conference on Natural Language Processing, Taipei: Asian Federation of Natural Language Processing, 2017, p. 173-183Conference paper, Published paper (Refereed)
    Abstract [en]

    We present a character-based model for joint segmentation and POS tagging for Chinese. The bidirectional RNN-CRF architecture for general sequence tagging is adapted and applied with novel vector representations of Chinese characters that capture rich contextual information and sub-character level features. The proposed model is extensively evaluated and compared with a state-of-the-art tagger respectively on CTB5, CTB9 and UD Chinese. The experimental results indicate that our model is accurate and robust across datasets in different sizes, genres and annotation schemes. We obtain state-of-the-art performance on CTB5, achieving 94.38 F1-score for joint segmentation and POS tagging.

    Place, publisher, year, edition, pages
    Taipei: Asian Federation of Natural Language Processing, 2017
    National Category
    Language Technology (Computational Linguistics)
    Identifiers
    urn:nbn:se:uu:diva-335923 (URN)
    Conference
    The 8th International Joint Conference on Natural Language Processing, Taipei, November 27 – December 1, 2017.
    Available from: 2017-12-11 Created: 2017-12-11 Last updated: 2018-04-13Bibliographically approved
    4. Recall is the Proper Evaluation Metric for Word Segmentation