Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection
2014 (English). In: LREC 2014 - Ninth International Conference On Language Resources And Evaluation, 2014. Conference paper (Refereed)
This paper presents the compilation of the DSL corpus collection created for the DSL (Discriminating Similar Languages) shared task to be held at the VarDial workshop at COLING 2014. The DSL corpus collection was merged from three comparable corpora to provide a suitable dataset for automatic classification to discriminate similar languages and language varieties. Along with the description of the DSL corpus collection, we also present results of baseline discrimination experiments, reporting performance of up to 87.4% accuracy.
Keywords: language identification, language discrimination, comparable corpus, similar languages, language varieties
Language Technology (Computational Linguistics)
Identifiers: URN: urn:nbn:se:uu:diva-264098; ISI: 000355611000019; ISBN: 978-2-9517408-8-4; OAI: oai:DiVA.org:uu-264098; DiVA: diva2:860454
9th International Conference on Language Resources and Evaluation (LREC), May 26-31, 2014, Reykjavik, Iceland