Uppsala University Publications
Publications (10 of 65)
Stymne, S., Pettersson, E., Megyesi, B. & Palmér, A. (2017). Annotating Errors in Student Texts: First Experiences and Experiments. In: Proceedings of Joint 6th NLP4CALL and 2nd NLP4LA Nodalida workshop. Paper presented at Joint 6th NLP4CALL and 2nd NLP4LA Nodalida workshop (pp. 47-60). Göteborg
2017 (English). In: Proceedings of Joint 6th NLP4CALL and 2nd NLP4LA Nodalida workshop, Göteborg, 2017, p. 47-60. Conference paper, Published paper (Refereed)
Abstract [en]

We describe the creation of an annotation layer for word-based writing errors for a corpus of student writings. The texts are written in Swedish by students between 9 and 19 years old. Our main purpose is to identify errors regarding spelling, split compounds and merged words. In addition, we also identify simple word-based grammatical errors, including morphological errors and extra words. In this paper we describe the corpus and the annotation process, including detailed descriptions of the error types and guidelines. We find that we can perform this annotation with a substantial inter-annotator agreement, but that there are still some remaining issues with the annotation. We also report results on two pilot experiments regarding spelling correction and the consistency of downstream NLP tools, to exemplify the usefulness of the annotated corpus.

Place, publisher, year, edition, pages
Göteborg, 2017
Keywords
error annotation, student writings
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-337518 (URN)
Conference
Joint 6th NLP4CALL and 2nd NLP4LA Nodalida workshop
Projects
Swe-CLARIN
Available from: 2017-12-30 Created: 2017-12-30 Last updated: 2018-01-13. Bibliographically approved
Näsman, J., Megyesi, B. & Palmér, A. (2017). SWEGRAM: A Web-Based Tool for Automatic Annotation and Analysis of Swedish Texts. In: Proceedings of the 21st Nordic Conference on Computational Linguistics, Nodalida 2017. Paper presented at 21st Nordic Conference on Computational Linguistics, Nodalida 2017 (pp. 132-141). Göteborg
2017 (English). In: Proceedings of the 21st Nordic Conference on Computational Linguistics, Nodalida 2017, Göteborg, 2017, p. 132-141. Conference paper, Published paper (Refereed)
Abstract [en]

We present SWEGRAM, a web-based tool for the automatic linguistic annotation and quantitative analysis of Swedish text, enabling researchers in the humanities and social sciences to annotate their own text and produce statistics on linguistic and other text-related features on the basis of this annotation. The tool allows users to upload one or several documents, which are automatically fed into a pipeline of tools for tokenization and sentence segmentation, spell checking, part-of-speech tagging and morpho-syntactic analysis as well as dependency parsing for syntactic annotation of sentences. The analyzer provides statistics on the number of tokens, words and sentences, the number of parts of speech (PoS), readability measures, the average length of various units, and frequency lists of tokens, lemmas, PoS, and spelling errors. SWEGRAM allows users to create their own corpus or compare texts on various linguistic levels.

Place, publisher, year, edition, pages
Göteborg, 2017
Series
Linköping Electronic Conference Proceedings, ISSN 1650-3686, E-ISSN 1650-3740 ; 131
Keywords
NLP, automatic linguistic annotation, quantitative text analysis
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-337519 (URN); 978-91-7685-601-7 (ISBN)
Conference
21st Nordic Conference on Computational Linguistics, Nodalida 2017
Projects
SWE-CLARIN
Available from: 2017-12-30 Created: 2017-12-30 Last updated: 2018-01-13. Bibliographically approved
Fornes, A., Megyesi, B. & Mas, J. (2017). Transcription of Encoded Manuscripts with Image Processing Techniques. In: Proceedings of Digital Humanities 2017. Paper presented at Digital Humanities Montreal, Canada, August 8-11, 2017. Canada
2017 (English). In: Proceedings of Digital Humanities 2017, Canada, 2017. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Canada, 2017
Keywords
image processing, hand-written manuscripts, automatic transcription, historical cryptology
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-337517 (URN)
Conference
Digital Humanities Montreal, Canada, August 8-11, 2017.
Projects
DECODE
Funder
Swedish Research Council, E0067801
Available from: 2017-12-30 Created: 2017-12-30 Last updated: 2018-01-13. Bibliographically approved
Volodina, E., Megyesi, B., Wirén, M., Granstedt, L., Prentice, J., Reichenberg, M. & Sundberg, G. (2016). A Friend in Need?: Research agenda for electronic Second Language infrastructure. In: Proceedings of SLTC 2016. Paper presented at Swedish Language Technology Conference (SLTC) 2016.
2016 (English). In: Proceedings of SLTC 2016, 2016. Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

In this article, we describe the research and societal needs as well as ongoing efforts to shape an infrastructure for Swedish as a Second Language (L2). Our aim is to develop an electronic research infrastructure that would stimulate empirical research into learners' language development by preparing data and developing language technology methods and algorithms that can successfully deal with deviations in the learner language.

Keywords
Second language acquisition, language technology, research infrastructure
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-337521 (URN)
Conference
Swedish Language Technology Conference (SLTC) 2016
Projects
SweLL
Available from: 2017-12-30 Created: 2017-12-30 Last updated: 2018-01-13. Bibliographically approved
Borin, L., Tahmasebi, N., Volodina, E., Ekman, S., Jordan, C., Viklund, J., . . . Kosiński, T. (2016). Swe-Clarin: Language Resources and Technology for Digital Humanities. In: Extended Papers of the International Symposium on Digital Humanities. Paper presented at International Symposium on Digital Humanities, Nov. 7-8, 2016, Växjö, Sweden (pp. 29-51).
2016 (English). In: Extended Papers of the International Symposium on Digital Humanities, 2016, p. 29-51. Conference paper, Published paper (Refereed)
Abstract [en]

CLARIN is a European Research Infrastructure Consortium (ERIC), which aims at (a) making extensive language-based materials available as primary research data to the humanities and social sciences (HSS); and (b) offering state-of-the-art language technology (LT) as an e-research tool for this purpose, positioning CLARIN centrally in what is often referred to as the digital humanities (DH). The Swedish CLARIN node Swe-Clarin was established in 2015 with funding from the Swedish Research Council.

In this paper, we describe the composition and activities of Swe-Clarin, aiming at meeting the requirements of all HSS and other researchers whose research involves using text and speech as primary research data, and spreading the awareness of what Swe-Clarin can offer these research communities. We focus on one of the central means for doing this: pilot projects conducted in collaboration between HSS researchers and Swe-Clarin, together formulating a research question, the addressing of which requires working with large language-based materials. Four such pilot projects are described in more detail, illustrating research on rhetorical history, second-language acquisition, literature, and political science. A common thread to these projects is an aspiration to meet the challenge of conducting research on the basis of very large amounts of textual data in a consistent way without losing sight of the individual cases making up the mass of data, i.e., to be able to move between Moretti’s “distant” and “close reading” modes.

While the pilot projects clearly make substantial contributions to DH, they also reveal some needs for more development, and in particular a need for document-level access to the text materials. As a consequence of this, work has now been initiated in Swe-Clarin to meet this need, so that Swe-Clarin together with HSS scholars investigating intricate research questions can take on the methodological challenges of big-data language-based digital humanities.

Keywords
Swe-Clarin, CLARIN, digital humanities, language technology
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-337520 (URN)
Conference
International Symposium on Digital Humanities, Nov. 7-8, 2016, Växjö, Sweden
Available from: 2017-12-30 Created: 2017-12-30 Last updated: 2018-01-13. Bibliographically approved
Megyesi, B., Näsman, J. & Palmér, A. (2016). The Uppsala Corpus of Student Writings: Corpus Creation, Annotation, and Analysis. In: Language Resources and Evaluation. Paper presented at Language Resources and Evaluation (LREC) 2016.
2016 (English). In: Language Resources and Evaluation, 2016. Conference paper, Published paper (Refereed)
Abstract [en]

The Uppsala Corpus of Student Writings consists of Swedish texts produced as part of a national test of students ranging in age from nine (in year three of primary school) to nineteen (the last year of upper secondary school) who are studying either Swedish or Swedish as a second language. National tests have been collected since 1996. The corpus currently consists of 2,500 texts containing over 1.5 million tokens. Parts of the texts have been annotated on several linguistic levels using existing state-of-the-art natural language processing tools. In order to make the corpus easy to interpret for scholars in the humanities, we chose the CoNLL format instead of an XML-based representation. Since spelling and grammatical errors are common in student writings, the texts are automatically corrected while keeping the original tokens in the corpus. Each token is annotated with part-of-speech and morphological features as well as syntactic structure. The main purpose of the corpus is to facilitate the systematic and quantitative empirical study of the writings of various student groups based on gender, geographic area, age, grade awarded or a combination of these, synchronically or diachronically. The intention is for this to be a monitor corpus, currently under development.

Keywords
student writings, digital humanities, educational applications
National Category
Specific Languages; Language Technology (Computational Linguistics)
Research subject
Computational Linguistics; Scandinavian Languages
Identifiers
urn:nbn:se:uu:diva-280192 (URN)
Conference
Language Resources and Evaluation (LREC) 2016
Projects
SWE-CLARIN
Funder
Swedish Research Council
Available from: 2016-03-08 Created: 2016-03-08 Last updated: 2018-01-10. Bibliographically approved
Megyesi, B. (Ed.). (2015). Proceedings of the 20th Nordic Conference of Computational Linguistics. Sweden: ACL Anthology
2015 (English). Collection (editor) (Refereed)
Place, publisher, year, edition, pages
Sweden: ACL Anthology, 2015. p. 320
Series
NEALT Proceedings Series, ISSN 1650-3638 ; 23
Keywords
computational linguistics, natural language processing
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-264779 (URN); 978-91-7519-098-3 (ISBN)
Available from: 2015-10-17 Created: 2015-10-17 Last updated: 2018-01-11
Pettersson, E., Megyesi, B. & Nivre, J. (2015). Ranking Relevant Verb Phrases Extracted from Historical Text. In: Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. Paper presented at ACL 2015.
2015 (English). In: Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, 2015. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we present three approaches to automatic ranking of relevant verb phrases extracted from historical text. These approaches are based on conditional probability, log likelihood ratio, and bag-of-words classification respectively. The aim of the ranking in our study is to present verb phrases that have a high probability of describing work at the top of the results list, but the methods are likely to be applicable to other information needs as well. The results are evaluated by use of three different evaluation metrics: precision at k, R-precision, and average precision. In the best setting, 91 out of the top-100 instances in the list are true positives.
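
The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. This uses the standard textbook definitions, not code from the paper; the function names and the toy ranking are assumptions made for illustration only.

```python
from typing import Sequence


def precision_at_k(relevant: Sequence[bool], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Ranks beyond the end of the list count as non-relevant.
    return sum(relevant[:k]) / k


def r_precision(relevant: Sequence[bool], total_relevant: int) -> float:
    """Precision at rank R, where R is the total number of relevant items."""
    if total_relevant == 0:
        return 0.0
    return precision_at_k(relevant, total_relevant)


def average_precision(relevant: Sequence[bool], total_relevant: int) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant item."""
    if total_relevant == 0:
        return 0.0
    hits = 0
    score = 0.0
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank
    return score / total_relevant


# Hypothetical ranked list: True marks a relevant item (toy data).
ranking = [True, True, False, True, False]
print(precision_at_k(ranking, 3))
print(r_precision(ranking, total_relevant=3))
print(average_precision(ranking, total_relevant=3))
```

Under these definitions, a result list with 91 true positives among its top 100 instances has a precision at 100 of 0.91.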

National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-264780 (URN)
Conference
ACL 2015
Available from: 2015-10-17 Created: 2015-10-17 Last updated: 2018-01-11
Pettersson, E., Megyesi, B. & Nivre, J. (2014). A Multilingual Evaluation of Three Spelling Normalization Methods for Historical Text. In: Workshop on Language Technology for Cultural Heritage, Social Sciences and Humanities, LaTeCH 2014. Paper presented at European Association for Computational Linguistics, EACL 2014.
2014 (English). In: Workshop on Language Technology for Cultural Heritage, Social Sciences and Humanities, LaTeCH 2014, 2014. Conference paper, Published paper (Refereed)
Abstract [en]

We present a multilingual evaluation of approaches for spelling normalisation of historical text based on data from five languages: English, German, Hungarian, Icelandic, and Swedish. Three different normalisation methods are evaluated: a simplistic filtering model, a Levenshtein-based approach, and a character-based statistical machine translation approach. The evaluation shows that the machine translation approach often gives the best results, but also that all approaches improve over the baseline and that no single method works best for all languages.

National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-264781 (URN)
Conference
European Association for Computational Linguistics, EACL 2014.
Available from: 2015-10-17 Created: 2015-10-17 Last updated: 2018-01-11
Pettersson, E., Megyesi, B. & Nivre, J. (2014). A Multilingual Evaluation of Three Spelling Normalization Methods for Historical Text. In: Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH). Paper presented at 14th Conference of the European Association for Computational Linguistics, EACL 2014, 26–30 April, Gothenburg, Sweden (pp. 32-41).
2014 (English). In: Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), 2014, p. 32-41. Conference paper, Published paper (Refereed)
Abstract [en]

We present a multilingual evaluation of approaches for spelling normalisation of historical text based on data from five languages: English, German, Hungarian, Icelandic, and Swedish. Three different normalisation methods are evaluated: a simplistic filtering model, a Levenshtein-based approach, and a character-based statistical machine translation approach. The evaluation shows that the machine translation approach often gives the best results, but also that all approaches improve over the baseline and that no single method works best for all languages.
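
As a rough illustration of what a Levenshtein-based normalisation step might look like, the sketch below maps a historical spelling to the closest entry in a modern lexicon by plain edit distance. It is a toy example, not the method evaluated in the paper: the `normalise` helper, the distance threshold, and the tiny lexicon are all assumptions.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalise(token: str, lexicon: Iterable[str], max_distance: int = 2) -> str:
    """Return the closest lexicon word, or the token itself if none is close.

    Ties are broken alphabetically; max_distance is a hypothetical threshold.
    """
    candidates = list(lexicon)
    if not candidates:
        return token
    best = min(candidates, key=lambda w: (levenshtein(token, w), w))
    return best if levenshtein(token, best) <= max_distance else token


# Hypothetical modern lexicon and historical spellings (toy data).
modern = ["have", "love", "give"]
print(normalise("haue", modern))
print(normalise("xyzzy", modern))
```

A threshold of this kind keeps tokens with no plausible modern counterpart unchanged rather than forcing a distant match.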

Keywords
spelling normalization, historical texts
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-239449 (URN); 978-1-937284-85-5 (ISBN)
Conference
14th Conference of the European Association for Computational Linguistics, EACL 2014, 26–30 April, Gothenburg, Sweden
Funder
Swedish Research Council
Available from: 2014-12-26 Created: 2014-12-26 Last updated: 2018-01-11. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-4838-6518