Uppsala University Publications
  • 1. Alemu, Atelach
    et al.
    Hulth, Anette
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    General-Purpose Text Categorization Applied to the Medical Domain. 2007. Report (Other academic)
    Abstract [en]

    This paper presents work where a general-purpose text categorization method was applied to categorize medical free-texts. The purpose of the experiments was to examine how such a method performs without any domain-specific knowledge, hand-crafting or tuning. Additionally, we compare the results from the general-purpose method with results from runs in which a medical thesaurus as well as automatically extracted keywords were used when building the classifiers. We show that standard text categorization techniques using stemmed unigrams as the basis for learning can be applied directly to categorize medical reports, yielding an F-measure of 83.9, and outperforming the more sophisticated methods.

  • 2.
    Andréasson, Maia
    et al.
    Department of Swedish Language, University of Gothenburg.
    Borin, Lars
    Department of Swedish Language, University of Gothenburg.
    Forsberg, Markus
    Department of Swedish Language, University of Gothenburg.
    Beskow, Jonas
    School of Computer Science and Communication, KTH.
    Carlsson, Rolf
    School of Computer Science and Communication, KTH.
    Edlund, Jens
    School of Computer Science and Communication, KTH.
    Elenius, Kjell
    School of Computer Science and Communication, KTH.
    Hellmer, Kahl
    School of Computer Science and Communication, KTH.
    House, David
    School of Computer Science and Communication, KTH.
    Merkel, Magnus
    Department of Computer Science, Linköping University.
    Forsbom, Eva
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Eriksson, Anders
    Department of Philosophy, Linguistics and Theory of Science, University of Gothenburg.
    Strömqvist, Sven
    Centre for Languages and Literature, Lund University.
    Swedish CLARIN Activities. 2009. In: Proceedings of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources / [ed] Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard, Eiríkur Rögnvaldsson and Koenraad de Smedt, Northern European Association for Language Technology (NEALT), 2009, p. 1-5. Conference paper (Refereed)
    Abstract [en]

    Although Sweden has yet to allocate funds specifically intended for CLARIN activities, there are some ongoing activities which are directly relevant to CLARIN, and which are explicitly linked to CLARIN. These activities have been funded by the Committee for Research Infrastructures and its subcommittee DISC (Database Infrastructure Committee) of the Swedish Research Council.

  • 3. Berthelsen, Harald
    et al.
    Megyesi, Beata
    Ensemble of Classifiers for Noise Detection in PoS Tagged Corpora. 2000. In: Proceedings of the Third International Workshop on TEXT, SPEECH and DIALOGUE, 2000, p. 27-32. Conference paper (Refereed)
    Abstract [en]

    In this paper we apply the ensemble approach to the identification of incorrectly annotated items (noise) in a training set. In a controlled experiment, memory-based, decision tree-based and transformation-based classifiers are used as a filter to detect and remove noise deliberately introduced into a manually tagged corpus. The results indicate that the method can be successfully applied to automatically detect errors in a corpus.

  • 4.
    Borin, Lars
    et al.
    Språkbanken, Department of Swedish, University of Gothenburg.
    Tahmasebi, Nina
    Språkbanken, Department of Swedish, University of Gothenburg.
    Volodina, Elena
    Språkbanken, Department of Swedish, University of Gothenburg.
    Ekman, Stefan
    Swedish National Data Service, University of Gothenburg.
    Jordan, Caspar
    Swedish National Data Service, University of Gothenburg.
    Viklund, Jon
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Arts, Department of Literature.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Näsman, Jesper
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Palmér, Anne
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Scandinavian Languages.
    Wirén, Mats
    Department of Linguistics, Stockholm University.
    Björkenstam, Kristina
    Department of Linguistics, Stockholm University.
    Grigonytė, Gintaré
    Department of Linguistics, Stockholm University.
    Gustafson Capková, Sofia
    Department of Linguistics, Stockholm University.
    Kosiński, Tomasz
    Department of Applied IT, Chalmers University of Technology.
    Swe-Clarin: Language Resources and Technology for Digital Humanities. 2016. In: Extended Papers of the International Symposium on Digital Humanities, 2016, p. 29-51. Conference paper (Refereed)
    Abstract [en]

    CLARIN is a European Research Infrastructure Consortium (ERIC), which aims at (a) making extensive language-based materials available as primary research data to the humanities and social sciences (HSS); and (b) offering state-of-the-art language technology (LT) as an e-research tool for this purpose, positioning CLARIN centrally in what is often referred to as the digital humanities (DH). The Swedish CLARIN node Swe-Clarin was established in 2015 with funding from the Swedish Research Council.

    In this paper, we describe the composition and activities of Swe-Clarin, aiming at meeting the requirements of all HSS and other researchers whose research involves using text and speech as primary research data, and spreading the awareness of what Swe-Clarin can offer these research communities. We focus on one of the central means for doing this: pilot projects conducted in collaboration between HSS researchers and Swe-Clarin, together formulating a research question, the addressing of which requires working with large language-based materials. Four such pilot projects are described in more detail, illustrating research on rhetorical history, second-language acquisition, literature, and political science. A common thread to these projects is an aspiration to meet the challenge of conducting research on the basis of very large amounts of textual data in a consistent way without losing sight of the individual cases making up the mass of data, i.e., to be able to move between Moretti’s “distant” and “close reading” modes.

    While the pilot projects clearly make substantial contributions to DH, they also reveal some needs for more development, and in particular a need for document-level access to the text materials. As a consequence of this, work has now been initiated in Swe-Clarin to meet this need, so that Swe-Clarin together with HSS scholars investigating intricate research questions can take on the methodological challenges of big-data language-based digital humanities.

  • 5. Carlson, Rolf
    et al.
    Granström, Björn
    Heldner, Mattias
    House, David
    Megyesi, Beata
    Strangert, Eva
    Swerts, Mark
    Boundaries and groupings - the structuring of speech in different communicative situations: a description of the GROG project. 2002. In: Proceedings of Fonetik 2002, 2002. Conference paper (Refereed)
    Abstract [en]

    The goal of the project is to model the prosodic structuring of speech in terms of boundaries and groupings. The modeling will include different communicative situations and be based on existing as well as new speech corpora. Production and perception studies will be used in parallel with automatic methods developed for analysis, modeling and prediction of prosody. The model will be perceptually evaluated using synthetic speech.

  • 6.
    Csato, Éva Ágnes
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Kilimci, Songul
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Using Parallel Corpora in Data-Driven Teaching of Turkish in Sweden. 2010. Conference paper (Refereed)
    Abstract [en]

    The paper demonstrates how data-driven learning methods are applied in teaching Turkish as a foreign language at the Department of Linguistics and Philology, Uppsala University. In data-driven teaching, language corpora, concordance programs, and annotation tools developed in collaboration with computational linguists are employed. This paper illustrates how resources developed initially for research purposes in different subjects (such as Computational Linguistics, Linguistics, Turkic languages), are now being used in teaching environments.

    We present the Swedish-Turkish parallel corpus providing students and researchers with easily accessible annotated linguistic data. The web-based corpora can be used both by regular and distance students. They function also as learning tools for formulating and testing hypotheses concerning lexical, morphological and syntactic aspects of Turkish. Furthermore, they help the students to practice contrastive studies and translation between Swedish and Turkish.

  • 7.
    Csató, Éva
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology, Asian and African Languages and Cultures, Turkic languages.
    Dahlqvist, Bengt
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    Saxena, Anju
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    Sågvall Hein, Anna
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics.
    A Turkish-Swedish parallel corpus: Orhan Pamuk Beyaz Kale-Vita Borgen. 2006. Other (Other academic)
  • 8. Dahlqvist, Bengt
    et al.
    Megyesi, Beata
    Changing the tokenization in Talbanken to SUC2.0. 2007. Report (Other academic)
  • 9.
    Elenius, Kjell
    et al.
    Speech, Music and Hearing, KTH.
    Forsbom, Eva
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Language Resources and Tools for Swedish: A Survey. 2008. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC 2008), Paris: European Language Resources Association (ELRA), 2008. Conference paper (Refereed)
    Abstract [en]

    Language resources and tools to create and process these resources are necessary components in human language technology and natural language applications. In this paper, we describe a survey of existing language resources for Swedish, and the need for Swedish language resources to be used in research and real-world applications in language technology as well as in linguistic research. The survey is based on a questionnaire sent to industry and academia, institutions and organizations, and to experts involved in the development of Swedish language resources in Sweden, the Nordic countries and world-wide.

  • 10.
    Elenius, Kjell
    et al.
    Speech, Music and Hearing.
    Forsbom, Eva
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Survey on Swedish Language Resources. 2008. Report (Other academic)
    Abstract [en]

    Language resources, such as lexicons, databases, dictionaries, corpora, and tools to create and process these resources are necessary components in human language technology and natural language applications. In this survey, we describe the inventory process and the results of existing language resources for Swedish, and the need for Swedish language resources to be used in research and real-world applications in language technology as well as in linguistic research. The survey is based on an investigation sent to industry and academia, institutions and organizations, to experts involved in the development of Swedish language resources in Sweden, the Nordic countries and world-wide. This study is a result of the project called “An Infrastructure for Swedish language technology” supported by the Swedish Research Council's Committee for Research Infrastructures 2007-2008.

  • 11.
    Fornes, Alicia
    et al.
    Computer Vision Center, Universitat Autònoma de Barcelona, Spain.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Mas, Joan
    Computer Vision Center, Universitat Autònoma de Barcelona, Spain.
    Transcription of Encoded Manuscripts with Image Processing Techniques. 2017. In: Proceedings of Digital Humanities 2017, Canada, 2017. Conference paper (Refereed)
  • 12. Gustafson-Capkova, Sofia
    et al.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    A Comparative Study of Pauses in Dialogues and Read Speech. 2001. In: Proceedings of Eurospeech 2001, 2001, p. 931-935. Conference paper (Refereed)
    Abstract [en]

    This study aims to investigate the length, frequency and position of various types of pauses in three different speaking styles: elicited spontaneous dialogues, professional reading and non-professional reading.

  • 13. Gustafson-Capkova, Sofia
    et al.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Silence and Discourse Context in Read Speech and Dialogues in Swedish. 2002. In: Proceedings of the Speech Prosody 2002 conference, 2002, p. 363-366. Conference paper (Refereed)
    Abstract [en]

    In this study, we investigate the correlation between silent pauses and discourse boundaries in the notion of theme shift. We examine three speaking styles in Swedish: professional and non-professional reading, and elicited spontaneous dialogues. Considerable attention is given to the syntactic and discourse context in which pauses appear, as well as the characteristics of the discourse structure in terms of pauses.

  • 14. Hall, Johan
    et al.
    Nilsson, Jens
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    Eryigit, Gulsen
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    Nilsson, Mattias
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. datorlingvistik.
    Saers, Markus
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. datorlingvistik.
    Single Malt or Blended? A Study in Multilingual Parser Optimization. 2007. In: Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, 2007. Conference paper (Refereed)
  • 15. Hulth, Anette
    et al.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    A Study on Automatically Extracted Keywords in Text Categorization. 2006. In: Proceedings of International Conference of Association for Computational Linguistics, 2006. Conference paper (Refereed)
    Abstract [en]

    This paper presents a study on if and how automatically extracted keywords can be used to improve text categorization. In summary we show that a higher performance, as measured by micro-averaged F-measure on a standard text categorization collection, is achieved when the full-text representation is combined with the automatically extracted keywords. The combination is obtained by giving higher weights to words in the full-texts that are also extracted as keywords. We also present results for experiments in which the keywords are the only input to the categorizer, either represented as unigrams or intact. Of these two experiments, the unigrams have the best performance, although neither performs as well as headlines only.

  • 16.
    Knight, Kevin
    et al.
    University of Southern California.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Schaefer, Christiane
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    The Copiale Cipher. 2011. Conference paper (Refereed)
    Abstract [en]

    The Copiale cipher is a 105-page enciphered book dated 1866. We describe the features of the book and the method by which we deciphered it.

  • 17.
    Knight, Kevin
    et al.
    University of Southern California.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Schaefer, Christiane
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    The Secrets of the Copiale Cipher. 2011. In: Research into Freemasonry and Fraternalism, ISSN 1757-2460, Vol. 2, no 2, p. 314-324. Article in journal (Refereed)
    Abstract [en]

    The Copiale Cipher is a 105-page, hand-written encrypted manuscript from the mid-eighteenth century. Its code was cracked and the text was deciphered by using modern computational technology combined with philological methods. We describe the book, the features of the text, and give a brief summary of the method by which we deciphered it. Finally, we present the content and the secret society, namely the Oculists, who were hiding behind the cipher. 

  • 18. Heldner, Mattias
    et al.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Exploring the Prosody-Syntax Interface in Conversations. 2003. In: Proceedings of the 15th International Congress of Phonetic Sciences, 2003. Conference paper (Refereed)
    Abstract [en]

    The goal of this study is to investigate the structuring of speech in terms of prosodic boundaries in spontaneous dialogues in Swedish. In particular, the relation between boundaries as perceived by listeners, and their acoustic and linguistic realizations as uttered by the speakers is examined.

  • 19. Heldner, Mattias
    et al.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    The Acoustic and Morpho-Syntactic Context of Prosodic Boundaries in Dialogs. 2003. In: Proceedings of Fonetik 2003, 2003. Conference paper (Refereed)
    Abstract [en]

    This study investigates the structuring of speech in terms of prosodic boundaries. In particular, the relation between boundaries as perceived by the listeners, and their acoustic and linguistic realizations as uttered by speakers is examined.

  • 20.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Brill's PoS Tagger with Extended Lexical Templates for Hungarian. 1999. In: Proceedings of the Workshop (W01) on Machine Learning in Human Language Technology: ACAI'99, 1999, p. 22-28. Conference paper (Refereed)
  • 21.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Comparing Data-Driven Learning Algorithms for PoS Tagging of Swedish. 2001. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), 2001. Conference paper (Refereed)
    Abstract [en]

    The aim of this study is a systematic evaluation and comparison of four state-of-the-art data-driven learning algorithms applied to part of speech tagging of Swedish. The algorithms included in this study are Hidden Markov Model, Maximum Entropy, Memory-Based Learning, and Transformation-Based Learning. The systems are evaluated from several aspects. Both the effects of tag set and the effects of the size of training data are examined. The accuracy is calculated as well as the error rate for known and unknown tokens. The results show differences between the approaches due to the different linguistic information built into the systems.

  • 22.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Data-Driven Methods for PoS tagging and Chunking of Swedish. 2001. In: Proceedings of the Nordic Conference on Computational Linguistics, Nodalida 2001, 2001. Conference paper (Other (popular science, discussion, etc.))
    Abstract [en]

    In this paper well-known state-of-the-art data-driven algorithms are applied to part-of-speech tagging and shallow parsing of Swedish texts.

  • 23.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. datorlingvistik.
    Improving Brill's PoS Tagger for an Agglutinative Language. 1999. In: Proceedings of the Joint Sigdat Conference on Empirical Methods in Natural Language Processing and Very Large Corpora: EMNLP/VLC '99, 1999, p. 275-284. Conference paper (Refereed)
    Abstract [en]

    In this paper Brill's rule-based PoS tagger is tested and adapted for Hungarian. It is shown that the present system does not obtain as high accuracy for Hungarian as it does for English (and other Germanic languages) because of the structural difference between these languages. Hungarian, unlike English, has rich morphology, is agglutinative with some inflectional characteristics and has fairly free word order. The tagger has the greatest difficulties with parts-of-speech belonging to open classes because of their complicated morphological structure. It is shown that the accuracy of tagging can be increased from approximately 83% to 97% by simply changing the rule generating mechanisms, namely the lexical templates in the lexical training module.

  • 24. Megyesi, Beata
    Phrasal Parsing by Using Data-Driven PoS Taggers. 2001. In: Proceedings of the Conference on Recent Advances in Natural Language Processing: Euro Conference RANLP-2001, 2001, p. 166-173. Conference paper (Refereed)
    Abstract [en]

    Three data-driven algorithms are applied to shallow parsing of Swedish texts by using PoS taggers as the basis for parsing. The constituent structure is represented by nine types of phrases in a hierarchical structure containing labels for every constituent type the token belongs to. The results show that best performance can be obtained by training on the basis of PoS tags with labels marking the phrasal constituents without considering the words themselves. Transformation-based learning gives highest accuracy (94.44%) followed by the Maximum Entropy framework (mxpost) (92.47%) and the Hidden Markov model (TnT) (92.42%).

  • 25.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Proceedings of the 20th Nordic Conference of Computational Linguistics. 2015. Collection (editor) (Refereed)
  • 26.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    Production and Perception of Pauses and their Linguistic Context in Read and Spontaneous Speech in Swedish. 2002. In: Proceedings of ICSLP'2002 - 7th International Conference on Spoken Language Processing, 2002. Conference paper (Refereed)
    Abstract [en]

    We investigate the relationship between prosodic phrase boundaries in terms of pausing and the linguistic structure on morpho-syntactic and discourse levels in spontaneous dialogues as well as in read aloud speech in Swedish. Both the speakers' production and the listeners' perception of pausing are considered and mapped to the linguistic structure.

  • 27.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Shallow Parsing with PoS Taggers and Linguistic Features. 2002. In: Journal of Machine Learning Research: Special Issue on Shallow Parsing, Vol. 2, p. 639-668. Article in journal (Refereed)
    Abstract [en]

    Three data-driven publicly available part-of-speech taggers are applied to shallow parsing of Swedish texts. The phrase structure is represented by nine types of phrases in a hierarchical structure containing labels for every constituent type the token belongs to in the parse tree. The encoding is based on the concatenation of the phrase tags on the path from lowest to higher nodes. Various linguistic features are used in learning; the taggers are trained on the basis of lexical information only, part-of-speech only, and a combination of both, to predict the phrase structure of the tokens with or without part-of-speech. Special attention is directed to the taggers' sensitivity to different types of linguistic information included in learning, as well as the taggers' sensitivity to the size and the various types of training data sets. The method can be easily transferred to other languages.

  • 28.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    The Open Source Tagger HunPoS for Swedish. 2009. In: Proceedings of the 17th Nordic Conference on Computational Linguistics (NODALIDA), 2009. Conference paper (Refereed)
    Abstract [en]

    HunPoS, a freely available open source part-of-speech tagger (a reimplementation of one of the best performing taggers, TnT), is applied to Swedish and evaluated when the tagger is trained on various sizes of training data. The tagger’s accuracy is compared to other data-driven taggers for Swedish. The results show that the tagging performance of HunPoS is as accurate as TnT and can be used efficiently to tag running text.

  • 29.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    The Open Source Tagger HunPoS for Swedish. 2008. Report (Other academic)
    Abstract [en]

    HunPoS, a freely available open source part-of-speech tagger—a reimplementation of one of the best performing taggers, TnT—is applied to Swedish and evaluated when the tagger is trained on various sizes of training data. The tagger’s accuracy is compared to other data-driven taggers for Swedish. The results show that the tagging performance of HunPoS is as accurate as TnT and can be used efficiently to tag running text.

  • 30. Megyesi, Beata
    et al.
    Carlson, Rolf
    Data-Driven Methods for Building a Swedish Treebank. 2002. In: Swedish Treebank Symposium, 2002. Conference paper (Other academic)
  • 31.
    Megyesi, Beata
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Csató, Éva
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Dahlqvist, Bengt
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Gustafson-Capková, Sofia
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Pettersson, Eva
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Sågvall Hein, Anna
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Supporting Research Environment for Swedish and Turkish. 2008. Report (Other (popular science, discussion, etc.))
  • 32.
    Megyesi, Beata
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Dahlqvist, Bengt
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Converting SUC2.0 to XCES with stand-off annotation. 2007. Report (Other (popular science, discussion, etc.))
  • 33.
    Megyesi, Beata
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    Dahlqvist, Bengt
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    The Swedish-Turkish Parallel Corpus and Tools for its Creation. 2007. In: Proceedings of NoDaLida 2007, 2007. Conference paper (Refereed)
  • 34.
    Megyesi, Beata
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Dahlqvist, Bengt
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Pettersson, Eva
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Gustafson-Capkova, Sofia
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Supporting Research Environment for Less Explored Languages: A Case Study of Swedish and Turkish. 2008. In: Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein / [ed] Nivre, Joakim; Dahllöf, Mats; Megyesi, Beáta, Uppsala: Uppsala universitet, 2008, p. 96-110. Chapter in book (Other academic)
  • 35.
    Megyesi, Beata
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Dahlqvist, Bengt
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Pettersson, Eva
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Swedish-Turkish Parallel Treebank. 2008. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Paris: European Language Resources Association (ELRA), 2008. Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe our work on building a parallel treebank for a less studied and typologically dissimilar language pair, namely Swedish and Turkish. The treebank is a balanced, syntactically annotated corpus containing both fiction and technical documents. In total, it consists of approximately 160,000 tokens in Swedish and 145,000 in Turkish. The texts are linguistically annotated on different layers, from part-of-speech tags and morphological features to dependency annotation. Each layer is automatically processed using basic language resources for the languages involved. The sentences and words are aligned, and partly manually corrected. We create the treebank by reusing and adjusting existing tools for automatic annotation, alignment, and their correction and visualization. The treebank was developed within the project Supporting research environment for minor languages, which aims to create representative language resources for language pairs dissimilar in language structure. Therefore, effort is put into developing a general method for formatting and annotation, as well as into using tools that can easily be applied to other language pairs.

  • 36.
    Megyesi, Beata
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Gustafson-Capkova, Sofia
    Pausing in Dialogues and Read Speech: Speaker's Production and Listeners Interpretation. 2001. In: Proceedings of the Workshop on Prosody in Speech Recognition and Understanding, 2001. Conference paper (Refereed)
    Abstract [en]

    In this study, we investigate the characteristics of pausing in speakers' production and listeners' interpretation in three different speaking styles in Swedish: elicited spontaneous dialogues, professional and non-professional news reading. Considerable attention is given to the positions in which pauses can appear, in particular their discourse context regarding theme shift. We show that the acoustic silent intervals that are perceived by the listeners correlate with the discourse structure, while perceived pauses that have an acoustic silence in the speech signal correlate with the duration of the acoustic silence.

    The results show clear differences between the speaking styles. In reading, the majority of acoustic pauses are perceived and the majority of both the acoustic and perceived pauses are located at theme shift. In dialogues, on the other hand, few acoustic pauses are perceived by the listeners and the majority of both the acoustic and perceived pauses are positioned at theme continuation. Furthermore, where many pauses are perceived by the listeners, such as in non-professional reading and dialogues, we find long acoustic silent intervals.

  • 37.
    Megyesi, Beata
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Näsman, Jesper
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Palmér, Anne
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Scandinavian Languages.
    The Uppsala Corpus of Student Writings: Corpus Creation, Annotation, and Analysis. 2016. In: Language Resources and Evaluation, 2016. Conference paper (Refereed)
    Abstract [en]

    The Uppsala Corpus of Student Writings consists of Swedish texts produced as part of a national test of students ranging in age from nine (in year three of primary school) to nineteen (the last year of upper secondary school) who are studying either Swedish or Swedish as a second language. National tests have been collected since 1996. The corpus currently consists of 2,500 texts containing over 1.5 million tokens. Parts of the texts have been annotated on several linguistic levels using existing state-of-the-art natural language processing tools. In order to make the corpus easy to interpret for scholars in the humanities, we chose the CoNLL format instead of an XML-based representation. Since spelling and grammatical errors are common in student writings, the texts are automatically corrected while keeping the original tokens in the corpus. Each token is annotated with part-of-speech and morphological features as well as syntactic structure. The main purpose of the corpus is to facilitate the systematic and quantitative empirical study of the writings of various student groups based on gender, geographic area, age, grade awarded or a combination of these, synchronically or diachronically. The intention is for this to be a monitor corpus, currently under development.

  • 38.
    Megyesi, Beata
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Rydin, Sara
    Towards a Finite-State Parser for Swedish. 2000. In: Proceedings of NoDaLiDa 99, 2000, p. 115-123. Conference paper (Refereed)
    Abstract [en]

    In this study, we describe a method for parsing part-of-speech tagged unrestricted texts in Swedish using finite-state networks. We use the Xerox Finite-State Tool because of its expressiveness and power for writing and compiling regular expressions and relations. The parser is divided into four modules: i) contiguous phrase structure marker, ii) phrasal head marker, iii) syntactic function tagger, and iv) non-contiguous group boundary recognizer. The aim is to develop a parser that can be used as a light/shallow parser for marking phrase structure and, when needed, to label syntactic functions. We believe that modularity is necessary since different NLP tasks require various levels of analysis. The parser for Swedish is under development, but present-day results are promising.

  • 39.
    Megyesi, Beata
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    Sågvall Hein, Anna
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    Csató, Éva Ágnes
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. turkiska språk.
    Building a Swedish-Turkish Parallel Corpus. 2006. In: Proceedings of Language Resources and Evaluation Conference, 2006. Conference paper (Refereed)
    Abstract [en]

    We present a Swedish-Turkish Parallel Corpus intended for use in linguistic research, teaching, and applications in natural language processing, primarily machine translation. The corpus, which is under development, is built using a Basic LAnguage Resource Kit (BLARK) for the two languages, which is then used in the automatic alignment phase to improve alignment accuracy. The corpus is balanced with respect to source and target language and is automatically processed using the Uplug toolkit.

  • 40.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Proceedings of the 1st International Conference on Historical Cryptology: HistoCrypt 2018. 2018. Conference proceedings (editor) (Refereed)
  • 41.
    Megyesi, Beáta
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Dahlqvist, Bengt
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Csató, Éva Á.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    The English-Swedish-Turkish Parallel Treebank. 2010. In: Proceedings of Language Resources and Evaluation (LREC 2010), 2010. Conference paper (Refereed)
    Abstract [en]

    We describe a syntactically annotated parallel corpus containing typologically partly different languages, namely English, Swedish and Turkish. The corpus consists of approximately 300 000 tokens in Swedish, 160 000 in Turkish and 150 000 in English, containing both fiction and technical documents. We build the corpus by using the Uplug toolkit for automatic structural markup, such as tokenization and sentence segmentation, as well as sentence and word alignment. In addition, we use basic language resource kits for the linguistic analysis of the languages involved. The annotation is carried out on various layers, from morphological and part-of-speech analysis to dependency structures. The tools used for linguistic annotation, e.g. HunPos tagger and MaltParser, are freely available data-driven resources, trained on existing corpora and treebanks for each language. The parallel treebank is used in teaching and linguistic research to study the relationship between the structurally different languages. In order to study the treebank, several tools have been developed for the visualization of the annotation and alignment, allowing search for linguistic patterns.

  • 42.
    Megyesi, Beáta
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Granstedt, Lena
    Umeå University, Sweden.
    Johansson, Sofia
    Stockholm University, Sweden.
    Prentice, Julia
    University of Gothenburg, Sweden.
    Rosen, Dan
    University of Gothenburg, Sweden.
    Schenström, Carl-Johan
    University of Gothenburg, Sweden.
    Sundberg, Gunlög
    Stockholm University, Sweden.
    Wiren, Mats
    Stockholm University, Sweden.
    Volodina, Elena
    University of Gothenburg, Sweden.
    Learner Corpus Anonymization in the Age of GDPR: Insights from the Creation of a Learner Corpus of Swedish. 2018. In: Proceedings of the 7th NLP4CALL, Stockholm, 2018. Conference paper (Refereed)
    Abstract [en]

    This paper reports on the status of learner corpus anonymization for the ongoing research infrastructure project SweLL. The main project aim is to deliver and make available for research a well-annotated corpus of essays written by second language (L2) learners of Swedish. As practice shows, annotation of learner texts is a sensitive process demanding a lot of compromises between ethical and legal demands on the one hand, and research and technical demands on the other. Below is a concise description of the current status of pseudonymization of language learner data to ensure anonymity of the learners, with numerous examples of the above-mentioned compromises.

  • 43.
    Nivre, Joakim
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Dahllöf, Mats
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein. 2008. Collection (editor) (Other academic)
    Abstract [en]

    As the first holder of the first chair in computational linguistics in Sweden, Anna Sågvall Hein has played a central role in the development of computational linguistics and language technology both in Sweden and on the international scene. Besides her valuable contributions to research, which include work on machine translation, syntactic parsing, grammar checking, word prediction, and corpus linguistics, she has been instrumental in establishing a national graduate school in language technology as well as an undergraduate program in language technology at Uppsala University. It is with great pleasure that we present her with this Festschrift to honor her lasting contributions to the field and to commemorate her retirement from the chair in computational linguistics at Uppsala University. The contributions to the Festschrift come from Anna’s friends and colleagues around the world and deal with many of the topics that are dear to her heart. A common theme in many of the articles, as well as in Anna’s own scientific work, is the design, development and use of adequate language technology resources, epitomized in the title Resourceful Language Technology.

  • 44.
    Nivre, Joakim
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Hall, Johan
    Nilsson, Jens
    Eryigit, Gülsen
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nilsson, Mattias
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Saers, Markus
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Single Malt or Blended? A Study in Multilingual Parser Optimization. 2007. In: Proceedings of the CoNLL Shared Task Session of EMNLP-CoNLL 2007, Association for Computational Linguistics, 2007, p. 933-939. Conference paper (Refereed)
  • 45.
    Nivre, Joakim
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    Bootstrapping a Swedish Treebank Using Cross-Corpus Harmonization and Annotation Projection. 2007. In: Proceedings of Treebanks and Linguistic Theories, 2007. Conference paper (Refereed)
  • 46.
    Nivre, Joakim
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. Datorlingvistik.
    Bootstrapping a Swedish Treebank Using Cross-Corpus Harmonization and Annotation Projection. 2007. In: Proceedings of the 6th International Workshop on Treebanks and Linguistic Theories, 2007, p. 97-102. Conference paper (Refereed)
  • 47.
    Nivre, Joakim
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Gustafson-Capkova, Sofia
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Salomonsson, Filip
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Cultivating a Swedish Treebank. 2008. In: Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein, Uppsala: Acta Universitatis Upsaliensis, 2008, p. 111-120. Chapter in book (Other academic)
  • 48.
    Nivre, Joakim
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Gustafson-Capková, Sofia
    Salomonsson, Filip
    Dahlqvist, Bengt
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Cultivating a Swedish Treebank. 2008. In: Resourceful Language Technology. A Festschrift in Honor of Anna Sågvall Hein / [ed] Nivre, Joakim; Dahllöf, Mats; Megyesi, Beáta, Acta Universitatis Upsaliensis, 2008, p. 111-120. Chapter in book (Other academic)
  • 49.
    Näsman, Jesper
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beáta
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Palmér, Anne
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Scandinavian Languages.
    SWEGRAM: A Web-Based Tool for Automatic Annotation and Analysis of Swedish Texts. 2017. In: Proceedings of the 21st Nordic Conference on Computational Linguistics, Nodalida 2017, Göteborg, 2017, p. 132-141. Conference paper (Refereed)
    Abstract [en]

    We present SWEGRAM, a web-based tool for the automatic linguistic annotation and quantitative analysis of Swedish text, enabling researchers in the humanities and social sciences to annotate their own text and produce statistics on linguistic and other text-related features on the basis of this annotation. The tool allows users to upload one or several documents, which are automatically fed into a pipeline of tools for tokenization and sentence segmentation, spell checking, part-of-speech tagging and morpho-syntactic analysis as well as dependency parsing for syntactic annotation of sentences. The analyzer provides statistics on the number of tokens, words and sentences, the number of parts of speech (PoS), readability measures, the average length of various units, and frequency lists of tokens, lemmas, PoS, and spelling errors. SWEGRAM allows users to create their own corpus or compare texts on various linguistic levels.

  • 50.
    Pettersson, Eva
    et al.
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Megyesi, Beata
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    Nivre, Joakim
    Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
    A Multilingual Evaluation of Three Spelling Normalization Methods for Historical Text. 2014. In: Workshop on Language Technology for Cultural Heritage, Social Sciences and Humanities, LaTeCH 2014, 2014. Conference paper (Refereed)
    Abstract [en]

    We present a multilingual evaluation of approaches for spelling normalisation of historical text based on data from five languages: English, German, Hungarian, Icelandic, and Swedish. Three different normalisation methods are evaluated: a simplistic filtering model, a Levenshtein-based approach, and a character-based statistical machine translation approach. The evaluation shows that the machine translation approach often gives the best results, but also that all approaches improve over the baseline and that no single method works best for all languages.
