uu.se: Publications from Uppsala University
Morphosyntactic Corpora and Tools for Persian
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Computational Linguistics)
2015 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

This thesis presents open source resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, the resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian.

In developing these resources and tools, two key requirements are observed: compatibility and reuse. The compatibility requirement encompasses two parts. First, the tools in the pipeline should be compatible with each other in such a way that the output of one tool is compatible with the input requirements of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in these. The reuse requirement means that all the components in the pipeline are developed by reusing resources, standard methods, and open source state-of-the-art tools. This is necessary to make the project feasible.

Given these requirements, the thesis investigates two main research questions. The first is how to develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse. The approach taken is to accept the tokenization variations in the corpora to achieve robustness. The tokenization variations in Persian texts are related to the orthographic variations of writing fixed expressions, as well as various types of affixes and clitics. Since these variations are inherent properties of Persian texts, it is important that the tools in the pipeline can handle them. Therefore, they should not be trained on idealized data.

The second question concerns how accurately we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora. The experimental evaluation of the tools shows that the sentence segmenter and tokenizer achieve an F-score close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves a best labeled accuracy of over 82% (with unlabeled accuracy close to 87%).
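
The difference between the labeled and unlabeled parser figures can be illustrated with a small sketch. It assumes that these figures correspond to the standard labeled and unlabeled attachment scores used in dependency parsing evaluation; the function name, the toy sentence, and the dependency labels below are hypothetical and are not taken from the thesis or its treebank.

```python
from typing import List, Tuple

# Each token is represented as (head_index, dependency_label).
# Head index 0 denotes the artificial root. Labels here are illustrative only.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (unlabeled, labeled) attachment accuracy over aligned tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    # Unlabeled: only the predicted head has to match the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # Labeled: both the head and the dependency label have to match.
    full_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return head_hits / n, full_hits / n


# Hypothetical four-token sentence: one wrong label, one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "amod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")  # prints "UAS=0.75 LAS=0.50"
```

Because a token counts toward the labeled score only when its head is also correct, the labeled score can never exceed the unlabeled one, which is consistent with the gap between the two parser figures reported above.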

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2015, p. 191
Series
Studia Linguistica Upsaliensia, ISSN 1652-1366 ; 16
Keywords [en]
Persian, language technology, corpus, treebank, preprocessing, segmentation, part-of-speech tagging, dependency parsing
National Category
Natural Language Processing Engineering and Technology
Research subject
Computational Linguistics
Identifiers
URN: urn:nbn:se:uu:diva-248780
ISBN: 978-91-554-9229-8 (print)
OAI: oai:DiVA.org:uu-248780
DiVA, id: diva2:800998
Public defence
2015-05-27, Universitetshuset / IX, Uppsala, 10:15 (English)
Opponent
Supervisors
Available from: 2015-05-06. Created: 2015-04-08. Last updated: 2025-02-01. Bibliographically approved.

Open Access in DiVA

fulltext (12975 kB), 6893 downloads
File information
File name: FULLTEXT02.pdf
File size: 12975 kB
Checksum (SHA-512): 4700c9999ae0f1ea567541d093b9325aff73eb97de6ee362fab2734806c3d65bc907e118b5ea8141ce89c85f182ceda43aa63beec2ad91b905367dd66f0e2918
Type: fulltext
Mimetype: application/pdf

Authority records

Seraji, Mojgan

Total: 6900 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 9901 hits