Optimizing Hadoop Parameters Based on the Application Resource Consumption
Independent thesis, Advanced level (degree of Master, Two Years), 20 credits / 30 HE credits, Student thesis
The interest in analyzing growing amounts of data has encouraged the deployment of large-scale parallel computing frameworks such as Hadoop. In other words, data analytics is the main reason behind the success of distributed systems; the data might not fit on a single disk, and processing can be very time consuming, which makes parallel input analysis very useful. Hadoop relies on the MapReduce programming paradigm to distribute work among the machines, so a good load balance will ultimately influence the execution time of these kinds of applications.
This paper introduces a technique to optimize some configuration parameters using the application's CPU utilization in order to tune Hadoop. The theories stated and proved in this paper rely on the fact that the CPUs should neither be over-utilized nor under-utilized; in other words, the conclusion will be a kind of equation expressing the parameter to be optimized in terms of the cluster infrastructure. Future research on this topic is planned to focus on tuning other Hadoop parameters and on using more accurate tools to analyze cluster performance; moreover, it would also be interesting to investigate possible ways to optimize Hadoop parameters based on other consumption criteria, such as input/output statistics and network traffic.
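To illustrate the tuning idea described in the abstract, a minimal sketch follows: pick the maximum number of concurrent map tasks per node from that node's CPU count, so cores are neither over- nor under-utilized. The parameter name (`mapred.tasktracker.map.tasks.maximum`, from Hadoop 1.x) is real, but the oversubscription factor and the helper function are illustrative assumptions, not the thesis's exact formula.

```python
# Hypothetical sketch: derive a per-node map-slot setting from the CPU
# count. The oversubscription factor is an assumed heuristic, not the
# formula derived in the thesis.
import os


def suggested_map_slots(cores: int, oversubscription: float = 1.5) -> int:
    """Suggest a value for mapred.tasktracker.map.tasks.maximum.

    A small oversubscription factor keeps cores busy while some map
    tasks wait on disk or network I/O; a factor of 1.0 would pin
    exactly one task per core.
    """
    return max(1, int(cores * oversubscription))


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    print("mapred.tasktracker.map.tasks.maximum =",
          suggested_map_slots(cores))
```

In practice such a value would be written into `mapred-site.xml` on each TaskTracker node; the point of the sketch is only that the parameter becomes a function of the cluster hardware rather than a fixed default.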
Series: IT, 13 034
Engineering and Technology
Identifiers
URN: urn:nbn:se:uu:diva-200144
OAI: oai:DiVA.org:uu-200144
DiVA: diva2:622285
Master Programme in Computer Science
Christoff, Ivan