Uppsala University Publications
Publications (10 of 74)
Pettersson, E. & Megyesi, B. (2019). Matching Keys and Encrypted Manuscripts. In: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa '19). Paper presented at The 22nd Nordic Conference on Computational Linguistics (NoDaLiDa '19), 30 September–2 October 2019, Turku, Finland. Linköping: Linköping University Electronic Press
2019 (English). In: Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa '19), Linköping: Linköping University Electronic Press, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

Historical cryptology is the study of historical encrypted messages aiming at their decryption by analyzing the mathematical, linguistic and other coding patterns and their historical context. In libraries and archives we can find quite a lot of ciphers, as well as keys describing the method used to transform the plaintext message into a ciphertext. In this paper, we present work on automatically mapping keys to ciphers to reconstruct the original plaintext message, and use language models generated from historical texts to guess the underlying plaintext language.
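
As a rough illustration of the two steps named in the abstract (applying a key to a ciphertext, then guessing the plaintext language with language models), here is a minimal sketch. It is not the authors' implementation: it assumes a simple one-to-one substitution key, a toy character-bigram model with add-one smoothing, and tiny made-up training samples. All names (`apply_key`, `train_bigram_model`, `guess_language`) and all data are hypothetical.

```python
import math
from collections import Counter

def apply_key(cipher_symbols, key, unknown="?"):
    """Replace each cipher symbol with its plaintext value from the key."""
    return "".join(key.get(sym, unknown) for sym in cipher_symbols)

def train_bigram_model(text):
    """Collect character-bigram counts, context counts and the vocabulary."""
    bigrams = Counter(zip(text, text[1:]))
    contexts = Counter(text[:-1])
    return bigrams, contexts, set(text)

def avg_log_prob(text, model):
    """Average add-one-smoothed bigram log-probability of the text."""
    bigrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot for characters unseen in training
    pairs = list(zip(text, text[1:]))
    if not pairs:
        return float("-inf")
    total = sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + v)) for a, b in pairs
    )
    return total / len(pairs)

def guess_language(plaintext, models):
    """Return the language whose model scores the plaintext highest."""
    return max(models, key=lambda lang: avg_log_prob(plaintext, models[lang]))

# Hypothetical key and ciphertext (numeric codes; "_" encodes a space).
key = {"12": "t", "7": "h", "31": "e", "5": "a", "_": " "}
cipher = ["12", "7", "31", "_", "7", "5", "12"]

# Toy training samples standing in for historical text collections.
models = {
    "english": train_bigram_model("the cat sat on the mat and the dog ate the hat"),
    "latin": train_bigram_model(
        "gallia est omnis divisa in partes tres quarum unam incolunt belgae"
    ),
}

plaintext = apply_key(cipher, key)
print(plaintext)                          # the hat
print(guess_language(plaintext, models))  # english
```

Real keys for historical ciphers are often more complex than this (for example homophonic substitution or code tables), so the dictionary lookup above should be read only as the simplest possible case.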

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2019
Keywords
historical cryptology
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-394240 (URN); 978-91-7929-995-8 (ISBN)
Conference
The 22nd Nordic Conference on Computational Linguistics (NoDaLiDa '19), 30 September–2 October 2019, Turku, Finland.
Projects
DECRYPT
Funder
Swedish Research Council, 2018-06074
Available from: 2019-10-06. Created: 2019-10-06. Last updated: 2019-10-15. Bibliographically approved.
Ahrenberg, L. & Megyesi, B. (Eds.). (2019). Proceedings of the Workshop on NLP and Pseudonymisation. Paper presented at Workshop on NLP and Pseudonymisation, NODALIDA 2019, September 30, 2019, Turku, Finland. Linköping: Linköping Electronic Press
2019 (English). Conference proceedings (editor) (Refereed)
Place, publisher, year, edition, pages
Linköping: Linköping Electronic Press, 2019. p. 34
Series
Linköping Electronic Press, ISSN 1650-3686, E-ISSN 1650-3740 ; 66
Keywords
pseudonymisation
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-394239 (URN); 978-91-7929-996-5 (ISBN)
Conference
Workshop on NLP and Pseudonymisation, NODALIDA 2019, September 30, 2019, Turku, Finland
Projects
swe-clarin
Funder
Swedish Research Council
Available from: 2019-10-06. Created: 2019-10-06. Last updated: 2019-10-15. Bibliographically approved.
Megyesi, B. & Volodina, E. (2019). Pseudonymization of Language Learner Data. In: Workshop om pseudonymisering av textdata. Paper presented at Workshop om pseudonymisering av textdata, 22 March 2019, Stockholm, Sweden.
2019 (English). In: Workshop om pseudonymisering av textdata, 2019. Conference paper, Oral presentation with published abstract (Other academic)
Abstract [en]

We present de-identification and pseudonymization of a learner corpus within the ongoing research infrastructure project SweLL[1]. The main project aim is to make available a linguistically annotated corpus of essays written by second language (L2) learners of Swedish. To ensure that the data collected in the project can be used openly in research while protecting the subjects' integrity, we developed a data handling flow, a set of metadata about the learners, pseudonymization principles for learner texts, and tools in support of pseudonymization. During data collection and storage, the data needs to be handled in a secure way, and the participating subjects must be de-identified in the corpus: common personal identifiers such as names, ages, geographic places and dates must be identified, masked and eventually replaced. These identifiers might occur in metadata about the learner and in the learner's text(s).

The SweLL project adopted a rather restrictive approach to the metadata describing each produced text and learner, so that learners are de-identified while important information is still provided for research purposes: the learner's gender, age given in 5-year interval spans, total time in Sweden, education level, mother tongue, and languages spoken in various communicative situations. The metadata does not provide the exact date of birth, the date of arrival in Sweden, or the country of origin or nationality of the learner, and no information is given about the educational establishment where the essays were collected.

De-identification through metadata alone might not be sufficient, since the texts written by a learner may, and in fact often do, contain personal information about the learner. Pseudonymization involves the identification of personal information that can relate to the subject (e.g. My name is Ali), and the classification of that information, masked into certain predefined types (e.g. My name is first_name). As the first step, we manually mark up text segments that reveal personal information in the corpus data. The identified segments are categorized as personal names, institutions (referring to schools, workplaces, sports teams), geographic data (such as country, city, region, areas, street name, numbers), transportation types and line names/numbers, age, date, phone number, email address, personal web page, social security number, account number, certificate/license number, profession and education, and sensitive information revealing physical or mental disabilities, political views, unique family relations, and any other items not covered by the previous categories.

Each marked text string with a category is then replaced in a systematic way to reproduce a "natural" text and improve reading flow. This step includes assigning unique id-numbers to each entity within a certain category type, so that if a particular entity is repeated in the text, the same running number is assigned to it and it can be replaced by the same word. We also add morphological information to each masked entity to be able to replace it in the same morphological form as the original.

There are several ways to mask the sensitive information through substitution, either by rendering, or by replacement with another pre-defined token of the same category. Rendering is applied to information that can be collected from general resource lists, such as personal names and surnames; city and country names, nationalities and languages; geographic names; street names; names of schools, institutions, workplaces; etc. Replacement applies to strings containing information with certain formatting where general resource lists cannot suffice. Such cases include middle names or initials, and numerical information such as phone numbers or dates. In some cases, when the annotator does not know how to categorize a certain text string, the original text is kept but marked by a placeholder. A distinction is made between objects that need to be replaced because of sensitivity, and objects that might be sensitive but can be replaced or removed later.

The pseudonymized corpus is under development, as are the tools supporting the pseudonymization process.

We expect the corpus and the tools to be released as open source by the end of 2020.
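
The masking and substitution steps described in the abstract above (running numbers per category, with a repeated entity always receiving the same number and the same substitute) can be sketched as follows. This is a minimal illustration, not the SweLL tooling: the function names, the tuple representation of masked entities and the resource list are assumptions, and the morphological information mentioned in the abstract is left out.

```python
from collections import defaultdict

def mask_entities(tokens):
    """Mask annotated tokens.

    tokens: list of (text, category) pairs, where category is None for
    tokens that are not personal information.
    Returns a list whose items are either plain strings or
    (category, running_number) tuples.
    """
    ids = defaultdict(dict)  # category -> {original string: running number}
    masked = []
    for text, category in tokens:
        if category is None:
            masked.append(text)
        else:
            seen = ids[category]
            # A repeated entity gets the number it was given the first time.
            number = seen.setdefault(text, len(seen) + 1)
            masked.append((category, number))
    return masked

def render(masked, resources):
    """Substitute masked entities from resource lists where one exists.

    resources: hypothetical mapping from category to a list of substitutes.
    Categories without a list keep a placeholder such as "city_1".
    If a list is shorter than the number of entities, substitutes repeat.
    """
    out = []
    for item in masked:
        if isinstance(item, str):
            out.append(item)
            continue
        category, number = item
        candidates = resources.get(category, [])
        if candidates:
            out.append(candidates[(number - 1) % len(candidates)])
        else:
            out.append(f"{category}_{number}")
    return " ".join(out)

# Hypothetical annotated input and resource list.
tokens = [
    ("My", None), ("name", None), ("is", None), ("Ali", "first_name"),
    ("and", None), ("Ali", "first_name"), ("lives", None), ("in", None),
    ("Uppsala", "city"),
]
resources = {"first_name": ["Adam", "Omar"]}

masked = mask_entities(tokens)
print(render(masked, resources))  # My name is Adam and Adam lives in city_1
```

Keeping masked entities as tuples rather than strings avoids confusing an ordinary token that happens to contain an underscore with a placeholder.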

[1] https://spraakbanken.gu.se/eng/swell_infra

Keywords
pseudonymisation, GDPR, personal information
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-385921 (URN)
Conference
Workshop om pseudonymisering av textdata, 22 March 2019, Stockholm, Sweden
Projects
SweLL
Funder
Riksbankens Jubileumsfond, IN16-0464:1
Available from: 2019-06-18. Created: 2019-06-18. Last updated: 2019-08-27. Bibliographically approved.
Megyesi, B., Palmér, A. & Näsman, J. (2019). SWEGRAM: Annotering och analys av svenska texter. Uppsala: Uppsala universitet
2019 (Swedish). Report (Other academic)
Abstract [sv, translated to English]

This document describes the tool SWEGRAM, which you can use to carry out automatic annotation and linguistic analysis of Swedish and English texts, or to create your own linguistically annotated text collection, a so-called corpus. We present the components of the tool and suggest how large-scale, empirical linguistic analysis can be carried out with its help.

Place, publisher, year, edition, pages
Uppsala: Uppsala universitet, 2019. p. 42
Keywords
analysis of Swedish
National Category
Specific Languages; Language Technology (Computational Linguistics)
Research subject
Computational Linguistics; Scandinavian Languages
Identifiers
urn:nbn:se:uu:diva-385919 (URN)
Projects
swe-clarin
Available from: 2019-06-18. Created: 2019-06-18. Last updated: 2019-07-02. Bibliographically approved.
Megyesi, B., Blomqvist, N. & Pettersson, E. (2019). The DECODE Database: Collection of Historical Ciphers and Keys. In: Eugen Antal, Klaus Schmeh (Ed.), Proceedings of the 2nd International Conference on Historical Cryptology: HistoCrypt 2019. Paper presented at The 2nd International Conference on Historical Cryptology, HistoCrypt 2019, June 23-26 2019, Mons, Belgium (pp. 69-78). Linköping
2019 (English). In: Proceedings of the 2nd International Conference on Historical Cryptology: HistoCrypt 2019 / [ed] Eugen Antal, Klaus Schmeh, Linköping, 2019, p. 69-78. Conference paper, Published paper (Refereed)
Abstract [en]

We present DECODE, an online database of encrypted historical manuscripts that aims at the systematic collection of ciphers and keys to create infrastructural support for historical research in general, and historical cryptology in particular. The collected material is annotated with a metadata scheme developed specifically for historical ciphers. Information includes provenance and location of the manuscript, computer-readable transcription, possible decryption(s) of the ciphertext and translation(s) of the plaintext, images, and any additional materials of relevance to the particular manuscript. The database allows users to search the existing collection and to upload new encrypted sources.

Place, publisher, year, edition, pages
Linköping, 2019
Series
Linköping Electronic Conference Proceedings, NEALT Proceedings Series, ISSN 1650-3686, E-ISSN 1650-3740 ; 37
Keywords
database of ciphers and keys, historical cryptology, digital philology
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-385920 (URN); 978-91-7685-087-9 (ISBN)
Conference
The 2nd International Conference on Historical Cryptology, HistoCrypt 2019, June 23-26 2019, Mons, Belgium
Projects
DECRYPT
Funder
Swedish Research Council, 2018-06074
Available from: 2019-06-18. Created: 2019-06-18. Last updated: 2019-06-18. Bibliographically approved.
Baró, A., Chen, J., Fornés, A. & Megyesi, B. (2019). Towards a Generic Unsupervised Method for Transcription of Encoded Manuscripts. In: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage: DATeCH2019. Paper presented at the 3rd International Conference on Digital Access to Textual Cultural Heritage. New York: ACM
2019 (English). In: Proceedings of the 3rd International Conference on Digital Access to Textual Cultural Heritage: DATeCH2019, New York: ACM, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

Historical ciphers, a special type of manuscript, contain encrypted information that is important for the interpretation of our history. The first step towards decipherment is to transcribe the images, either manually or by automatic image processing techniques. Despite the improvements in handwritten text recognition (HTR) thanks to deep learning methodologies, the need for labelled training data is an important limitation. Given that ciphers often use symbol sets drawn from various alphabets, as well as unique symbols, without any transcription scheme available, these supervised HTR techniques are not suitable for transcribing ciphers. In this paper we propose an unsupervised method for transcribing encrypted manuscripts based on clustering and label propagation, a technique that has been successfully applied to community detection in networks. We analyze the performance on ciphers with various symbol sets, and discuss the advantages and drawbacks compared to supervised HTR methods.

Place, publisher, year, edition, pages
New York: ACM, 2019
Keywords
Handwritten text recognition, Encoded manuscripts, Unsupervised methods
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-385925 (URN); 10.1145/3322905.3322920 (DOI); 978-1-4503-7194-0 (ISBN)
Conference
the 3rd International Conference on Digital Access to Textual Cultural Heritage
Projects
DECRYPT
Funder
Swedish Research Council, 2018-06074
Available from: 2019-06-18. Created: 2019-06-18. Last updated: 2019-08-16. Bibliographically approved.
Volodina, E., Granstedt, L., Megyesi, B., Prentice, J., Rosén, D., Schenström, C.-J., . . . Wirén, M. (2018). Annotation of learner corpora: first SweLL insights. In: Abstracts of SLTC 2018. Paper presented at Seventh Swedish Language Technology Conference, Stockholm, 7-9 November 2018 (pp. 86-89). Stockholm: Department of Computer and Systems Sciences and the Department of Linguistics at Stockholm University
2018 (English). In: Abstracts of SLTC 2018, Stockholm: Department of Computer and Systems Sciences and the Department of Linguistics at Stockholm University, 2018, p. 86-89. Conference paper, Oral presentation with published abstract (Other academic)
Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences and the Department of Linguistics at Stockholm University, 2018
Keywords
second language learning, L2 infrastructure
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-394242 (URN)
Conference
Seventh Swedish Language Technology Conference, Stockholm, 7-9 November 2018
Available from: 2019-10-06. Created: 2019-10-06. Last updated: 2019-10-15. Bibliographically approved.
Megyesi, B., Granstedt, L., Johansson, S., Prentice, J., Rosen, D., Schenström, C.-J., . . . Volodina, E. (2018). Learner Corpus Anonymization in the Age of GDPR: Insights from the Creation of a Learner Corpus of Swedish. In: Proceedings of the 7th NLP4CALL. Paper presented at 7th NLP4CALL, SLTC workshop, Stockholm, Sweden, 7 November, 2018. Stockholm
2018 (English). In: Proceedings of the 7th NLP4CALL, Stockholm, 2018. Conference paper, Published paper (Refereed)
Abstract [en]

This paper reports on the status of learner corpus anonymization for the ongoing research infrastructure project SweLL. The main project aim is to deliver and make available for research a well-annotated corpus of essays written by second language (L2) learners of Swedish. As practice shows, annotation of learner texts is a sensitive process that demands many compromises between ethical and legal demands on the one hand, and research and technical demands on the other. Below is a concise description of the current status of pseudonymization of language learner data to ensure anonymity of the learners, with numerous examples of the above-mentioned compromises.

Place, publisher, year, edition, pages
Stockholm, 2018
Series
Linköping Electronic Conference Proceedings, ISSN 1650-3686, E-ISSN 1650-3740 ; 152
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-367312 (URN); 978-91-7685-173-9 (ISBN)
Conference
7th NLP4CALL, SLTC workshop, Stockholm, Sweden, 7 November, 2018
Funder
Riksbankens Jubileumsfond, IN16-0464:1
Available from: 2018-11-29. Created: 2018-11-29. Last updated: 2018-12-10. Bibliographically approved.
Megyesi, B. (Ed.). (2018). Proceedings of the 1st International Conference on Historical Cryptology: HistoCrypt 2018. Paper presented at 1st International Conference on Historical Cryptology: HistoCrypt 2018, Uppsala, June 18-20, 2018. Linköping: Linköping University Electronic Press
2018 (English). Conference proceedings (editor) (Refereed)
Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2018. p. 159
Series
Linköping Electronic Conference Proceedings, ISSN 1650-3686, E-ISSN 1650-3740 ; 149
Keywords
historical ciphers, cryptograms, decoding, codebooks
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-361779 (URN); 978-91-7685-252-1 (ISBN)
Conference
1st International Conference on Historical Cryptology: HistoCrypt 2018, Uppsala, June 18-20, 2018.
Funder
Swedish Research Council, E0067801
Available from: 2018-09-27. Created: 2018-09-27. Last updated: 2018-10-01. Bibliographically approved.
Stymne, S., Pettersson, E., Megyesi, B. & Palmér, A. (2017). Annotating Errors in Student Texts: First Experiences and Experiments. In: Proceedings of Joint 6th NLP4CALL and 2nd NLP4LA Nodalida workshop. Paper presented at Joint 6th NLP4CALL and 2nd NLP4LA Nodalida workshop (pp. 47-60). Göteborg
2017 (English). In: Proceedings of Joint 6th NLP4CALL and 2nd NLP4LA Nodalida workshop, Göteborg, 2017, p. 47-60. Conference paper, Published paper (Refereed)
Abstract [en]

We describe the creation of an annotation layer for word-based writing errors for a corpus of student writings. The texts are written in Swedish by students between 9 and 19 years old. Our main purpose is to identify errors regarding spelling, split compounds and merged words. In addition, we also identify simple word-based grammatical errors, including morphological errors and extra words. In this paper we describe the corpus and the annotation process, including detailed descriptions of the error types and guidelines. We find that we can perform this annotation with a substantial inter-annotator agreement, but that there are still some remaining issues with the annotation. We also report results on two pilot experiments regarding spelling correction and the consistency of downstream NLP tools, to exemplify the usefulness of the annotated corpus.

Place, publisher, year, edition, pages
Göteborg, 2017
Keywords
error annotation, student writings
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-337518 (URN)
Conference
Joint 6th NLP4CALL and 2nd NLP4LA Nodalida workshop
Projects
Swe-CLARIN
Available from: 2017-12-30. Created: 2017-12-30. Last updated: 2018-01-13. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-4838-6518
