Breaking Bad: Extraction of Verb-Particle Constructions from a Parallel Subtitles Corpus
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology (Computational linguistics). ORCID iD: 0000-0002-2837-3648
2014 (English). In: Proceedings of the 10th Workshop on Multiword Expressions, 2014. Conference paper (Refereed)
National Category
Language Technology (Computational Linguistics)
URN: urn:nbn:se:uu:diva-277876; OAI: oai:DiVA.org:uu-277876; DiVA: diva2:905864
EACL 2014

The automatic extraction of verb-particle constructions (VPCs) is of particular interest to the NLP community. Previous studies have shown that word alignment methods can be used with parallel corpora to successfully extract a range of multi-word expressions (MWEs). In this paper the technique is applied to a new type of corpus, made up of a collection of subtitles of movies and television series, which is parallel in English and Spanish. Building on previous research, it is shown that a precision level of 94 ± 4.7% can be achieved in English VPC extraction. This high level of precision is achieved despite the difficulties of aligning and tagging subtitles data. Moreover, a significant proportion of the extracted VPCs are not present in online lexical resources, highlighting the benefits of using this unique corpus type, which contains a large number of slang and other informal expressions. An added benefit of using the word alignment process is that translations are also automatically extracted for each VPC. A precision rate of 79.8 ± 8.1% is found for the translations of English VPCs into Spanish. This study thus shows that VPCs are a particularly good subset of the MWE spectrum to attack using word alignment methods, and that subtitles data provide a range of interesting expressions that do not exist in other corpus types.
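
The precision figures above are reported with a margin of error. The abstract does not say how those margins were computed or how many extracted items were evaluated, so the following is only a minimal sketch of one common approach: a 95% normal-approximation interval for a proportion. The function name `precision_with_margin` and the counts used in the example are hypothetical and are not taken from the paper.

```python
import math
from typing import Tuple


def precision_with_margin(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Return (precision, margin), both in percent.

    Uses the normal approximation to the binomial distribution;
    z = 1.96 corresponds to a 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    p = correct / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * margin


# Hypothetical counts for illustration only: 94 correct out of 100 evaluated.
precision, margin = precision_with_margin(94, 100)
print(f"{precision:.1f} ± {margin:.1f}%")
```

Under these assumed counts the margin comes out at roughly 4.7 percentage points. For small samples or proportions close to 0 or 1, a Wilson interval is generally a safer choice than the normal approximation.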

Available from: 2016-02-23 Created: 2016-02-23 Last updated: 2016-02-23

By author/editor
Smith, Aaron