HiveQL analytical query performance on Hadoop/MapReduce: Evaluating the impact of a set of de-normalization patterns, ORC file format and Snappy compression.
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Social Sciences, Department of Informatics and Media.
2015 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

A data warehouse (DW) provides the underlying data infrastructure for reporting and analysis. Traditional DW tools cannot handle the large workloads of current enterprise DWs. One possible solution is a massively parallel processing (MPP) DW, but this powerful architecture comes at a prohibitive cost. A better approach is to augment the existing DW with big data technology, with Hadoop being the most mature and dominant platform. The result is referred to as a big DW system, which provides a good price-performance ratio.

However, in a highly globalized business environment, a good price-performance ratio alone is not sufficient to be competitive. Because of this, companies are still reluctant to shift from their traditional DW systems to a big DW solution. The aim of this work is to investigate how to improve the performance component and thereby provide a better price-performance ratio that keeps operational costs low while maximizing revenue.

There are several ways to improve query performance in a big DW system. This work evaluates a set of de-normalization patterns, a columnar file format (ORC), and a compression method (Snappy). The overall conclusion is that the ORC file format is a very efficient way to achieve significant query performance improvements, in particular due to its built-in encoding schemes. The “horizontal partitioning” pattern is also a good way to improve performance, in particular by helping to trigger a specific join optimization (map-join). As the size of the data sets increases, adding another layer of compression such as Snappy becomes crucial both to ensure that the map-join optimization occurs and to speed up data transfer across the network.
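
To make the map-join idea concrete, the following is a minimal Python sketch of a map-side (broadcast hash) join. It is an illustration only, not the thesis's implementation or Hive's actual code. The names `map_side_join`, `can_map_join`, and `SMALL_TABLE_LIMIT_BYTES`, as well as the sample rows, are hypothetical. The size check only mimics the general idea that the smaller table must fall under a configured limit before this optimization is chosen, which suggests why a smaller on-disk footprint can matter.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, object]

# Hypothetical size limit for the "small" side of the join. The value is an
# assumption made for this sketch, not a setting taken from the thesis.
SMALL_TABLE_LIMIT_BYTES = 25_000_000


def can_map_join(small_table_bytes: int, limit: int = SMALL_TABLE_LIMIT_BYTES) -> bool:
    """Return True when the smaller table is small enough to hold in memory."""
    return small_table_bytes <= limit


def map_side_join(small: Iterable[Row], large: Iterable[Row], key: str) -> Iterator[Row]:
    """Inner-join two row streams by building a hash table on the small side.

    The small table is loaded fully into memory; the large table is streamed
    once, so no shuffle or sort of the large table is needed.
    """
    lookup: Dict[object, List[Row]] = defaultdict(list)
    for row in small:
        lookup[row[key]].append(row)
    for row in large:
        # .get() avoids inserting empty entries for keys without a match.
        for match in lookup.get(row[key], []):
            yield {**match, **row}


if __name__ == "__main__":
    # Hypothetical dimension (small) and fact (large) rows.
    stores = [{"store_id": 1, "city": "Uppsala"}, {"store_id": 2, "city": "Lund"}]
    sales = [{"store_id": 1, "amount": 10}, {"store_id": 3, "amount": 5}]
    for joined in map_side_join(stores, sales, "store_id"):
        print(joined)
```

In this sketch, only the large table is streamed, which is the property that makes the approach attractive when one side of the join is small.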

The results of this work can be of value to small and medium-sized companies that cannot invest in an expensive MPP architecture. Instead, they can leverage the relatively low cost of big data technology and use these findings to gain adequate performance that drives their business forward.

National Category
Information Studies
URN: urn:nbn:se:uu:diva-255380
OAI: oai:DiVA.org:uu-255380
DiVA: diva2:821994
Subject / course
Information Systems
Educational program
Master programme in Information Systems
Available from: 2015-06-16 Created: 2015-06-16 Last updated: 2015-06-16 Bibliographically approved