HTSeq-Hadoop: Extending HTSeq for Massively Parallel Sequencing Data Analysis using Hadoop
2014 (English). In: Proc. 10th International Conference on e-Science, IEEE Computer Society, 2014, 317-323 p. Conference paper (Refereed)
Hadoop is a convenient framework in e-Science enabling scalable distributed data analysis. In molecular biology, next-generation sequencing produces vast amounts of data and requires flexible frameworks for constructing analysis pipelines. We extend the popular HTSeq package into the Hadoop realm by introducing massively parallel versions of short read quality assessment as well as functionality to count genes mapped by the short reads. We use the Hadoop Streaming library, which allows the components to run on both Hadoop and regular Linux systems, and evaluate their performance in two different execution environments: a single node on a computational cluster and a Hadoop cluster in a private cloud. We compare the implementations with Apache Pig and show improved runtime performance for our methods. We also integrate the components into the graphical platform Cloudgene to simplify user interaction.
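The streaming approach described in the abstract can be illustrated with a minimal sketch. This is not the HTSeq-Hadoop code: the function names and the input format (one tab-separated `read_id`, `gene_id` pair per line) are assumptions made for illustration only. It shows why a streaming component runs unchanged on Hadoop and on a plain Linux system: both steps only read lines from standard input and write lines to standard output.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Map step: emit 'gene_id<TAB>1' for every read assigned to a gene.

    Assumed (hypothetical) input format: read_id<TAB>gene_id per line.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[1]:
            continue  # skip malformed lines and reads without a gene
        yield f"{fields[1]}\t1"


def reduce_lines(sorted_lines):
    """Reduce step: sum the counts per gene.

    The input must be sorted by key, which is what the shuffle phase
    (or a plain `sort` on Linux) provides between map and reduce.
    """
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for gene, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{gene}\t{total}"


def main(argv):
    # argv[1] selects the step: "map" or "reduce"
    step = {"map": map_lines, "reduce": reduce_lines}[argv[1]]
    for record in step(sys.stdin):
        print(record)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Saved under a hypothetical name such as `count_genes.py`, the same script could be tested locally with a shell pipeline of the form `cat reads.tsv | python count_genes.py map | sort | python count_genes.py reduce`, and the two invocations could then be passed as the mapper and reducer commands of a Hadoop Streaming job.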
Place, publisher, year, edition, pages
IEEE Computer Society, 2014. 317-323 p.
Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:uu:diva-242917; DOI: 10.1109/eScience.2014.27; ISBN: 978-1-4799-4288-6; OAI: oai:DiVA.org:uu-242917; DiVA: diva2:785398
Funder
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience; Swedish National Infrastructure for Computing (SNIC), p2013023; eSSENCE - An eScience Collaboration; EU, European Research Council, BM1006