HiveQL analytical query performance on Hadoop/MapReduce: Evaluating the impact of a set of de-normalization patterns, ORC file format and Snappy compression.
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
A data warehouse (DW) provides the underlying data infrastructure needed for reporting and analysis. Traditional DW tools cannot handle the large workloads of current enterprise DWs. One possible solution is a massively parallel processing (MPP) DW, but this powerful architecture comes at a prohibitive cost. A better approach is to augment the existing DW with big data technology, with Hadoop being the most mature and dominant platform. The resulting system is referred to as a big DW and offers a good price-performance ratio.
However, in a highly globalized business environment, a good price-performance ratio alone is not sufficient to be competitive. As a result, companies are still reluctant to shift from their traditional DW systems to a big DW solution. The aim of this work is to investigate how to improve the performance component and thereby provide a better price-performance ratio that keeps operational costs low while maximizing revenue.
There are several ways to improve query performance in a big DW system. This work evaluates a set of de-normalization patterns, a columnar file format (ORC), and a compression method (Snappy). The overall conclusion is that the ORC file format is a very efficient way to achieve significant query performance improvements, in particular due to its built-in encoding schemes. The “horizontal partitioning” pattern is also a good way to improve performance, in particular because it helps achieve a specific join optimization (map-join). As the size of the data sets increases, adding another layer of compression such as Snappy becomes crucial, both to ensure that the map-join optimization occurs and to speed up data transfer across the network.
The results of this work can be of value to small and medium-sized companies that cannot invest in an expensive MPP architecture. Instead, they can leverage the relatively low cost of big data technology and use these findings to gain adequate performance that drives their business forward.
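The link the abstract draws between table size and the map-join optimization can be illustrated with a small sketch. This is not the thesis's implementation or Hive's actual code: the function names, the sample rows, and the 25 MB limit are all assumptions chosen for illustration. The idea is that a join can be carried out on the map side only when the smaller input fits in memory, so anything that shrinks that input (horizontal partitioning, a compact file format, compression) makes the optimization more likely to apply.

```python
# Illustrative size limit in bytes. This value is an assumption for the
# sketch, not a figure taken from the thesis.
SMALL_TABLE_LIMIT_BYTES = 25_000_000


def can_map_join(small_table_bytes, limit=SMALL_TABLE_LIMIT_BYTES):
    """Return True when the smaller join input fits under the size limit."""
    return small_table_bytes <= limit


def horizontal_partition(rows, column, value):
    """Keep only the rows whose `column` equals `value` (one partition)."""
    return [row for row in rows if row[column] == value]


def map_join(large_rows, small_rows, key):
    """Hash join: load the small side into memory, then stream the large side."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for row in large_rows:
        # Rows from the large side without a match are dropped (inner join).
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined
```

In this sketch, a hypothetical 40 MB dimension table would fail `can_map_join`, while a 12 MB partition or compressed version of it would pass, and the join could then run without a shuffle of the large table.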
Identifiers
URN: urn:nbn:se:uu:diva-255380
OAI: oai:DiVA.org:uu-255380
DiVA: diva2:821994
Subject / course
Master programme in Information Systems