Improved Text Extraction from PDF Documents for Large-Scale Natural Language Processing
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
2014 (English). In: Computational Linguistics and Intelligent Text Processing, CICLing 2014, Part I, 2014, p. 102-112. Conference paper (Refereed)
Abstract [en]

The lack of reliable text extraction from arbitrary documents is often an obstacle for large-scale NLP based on resources crawled from the Web. One of the largest problems in the conversion of PDF documents is the detection of the boundaries of common textual units such as paragraphs, sentences and words. PDF is a file format optimized for printing and encapsulates a complete description of the layout of a document, including text, fonts, graphics and so on. This paper describes a tool for extracting text from arbitrary PDF files to support large-scale data-driven natural language processing. Our approach combines the benefits of several existing solutions for the conversion of PDF documents to plain text and adds a language-independent post-processing procedure that cleans the output for further linguistic processing. In particular, we use the PDF-rendering libraries pdfXtk, Apache Tika and Poppler in various configurations. From the output of these tools we recover proper boundaries using on-the-fly language models and language-independent extraction heuristics. In our research, we looked especially at publications from the European Union, which constitute a valuable multilingual resource, for example for training statistical machine translation models. We use our tool to convert a large multilingual database crawled from the EU Bookshop with the aim of building parallel corpora. Our experiments show that our conversion software is capable of fixing various common issues, leading to cleaner data sets.

Place, publisher, year, edition, pages
2014. 102-112 p.
Lecture Notes in Computer Science, ISSN 0302-9743 ; 8403
Keyword [en]
noisy text processing, text normalization, parallel corpora
National Category
Computer and Information Science
URN: urn:nbn:se:uu:diva-236816
ISI: 000342989200009
ISBN: 978-3-642-54905-2; 978-3-642-54906-9
OAI: oai:DiVA.org:uu-236816
DiVA: diva2:766850
15th Annual Conference on Intelligent Text Processing and Computational Linguistics (CICLing), APR 06-12, 2014, Kathmandu, NEPAL
Available from: 2014-11-28. Created: 2014-11-24. Last updated: 2014-11-28. Bibliographically approved.

Author
Tiedemann, Jörg