Research and optimization of the Bloom filter algorithm in Hadoop
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
An increasing number of enterprises need to transfer data from a traditional database to a cloud-computing system. Big data in Teradata (a data warehouse) often needs to be transferred to Hadoop, a distributed system, for further computing and analysis. However, if the data stored in Teradata is not in sync with Hadoop, e.g. because of data loss during communication, synchronization or copying, the two data sets become inconsistent. A survey shows that, apart from the algorithm provided by Hadoop, the Bloom filter algorithm can be a good choice for data reconciliation. MD5 hashing is applied to reduce the amount of data transmitted. In the experiments, data from both sides was compared using a Bloom filter. If any data was lost during the process, the primary keys that differ could be identified. The result can be used to track changes to the original data.
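The reconciliation idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: the class `BloomFilter`, the function `missing_keys`, the filter size and the number of hash functions are hypothetical, and deriving the bit positions from an MD5 digest by double hashing is an assumption made here for illustration. The thesis may apply MD5 differently, for example to shorten rows before transmission.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; bit positions come from one MD5 digest (assumption)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Split the 16-byte MD5 digest into two integers and combine them
        # (double hashing) to obtain num_hashes bit positions.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force a non-zero stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def missing_keys(source_keys, target_keys, num_bits=1 << 16, num_hashes=4):
    """Return source keys that are definitely absent from the target side.

    A Bloom filter has no false negatives, so every key returned is truly
    missing; false positives mean some missing keys may go unreported.
    """
    target_filter = BloomFilter(num_bits, num_hashes)
    for key in target_keys:
        target_filter.add(key)
    return [key for key in source_keys if key not in target_filter]


if __name__ == "__main__":
    teradata_keys = ["pk-1", "pk-2", "pk-3"]  # hypothetical primary keys
    hadoop_keys = ["pk-1", "pk-3"]            # "pk-2" was lost in transfer
    print(missing_keys(teradata_keys, hadoop_keys))
```

Only the bit array has to be shipped between the two systems, not the keys themselves, which is where the saving in transmitted data comes from in this sketch.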
During this thesis project, an experimental system using the MapReduce mode of Hadoop was implemented. Real data was used for the implementation, and the parameters were adjustable so that different schemes (Basic join, CBF, SBF and IBF) could be analyzed.
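As a rough illustration of the baseline scheme, the following sketch simulates a join on the primary key in map and reduce steps using plain Python. It does not use the Hadoop API and is not taken from the thesis: the function names, the side labels and the record layout (a primary key paired with a row) are assumptions.

```python
from collections import defaultdict


def map_phase(records, side):
    """Emit (primary_key, side) for each record, as a mapper would."""
    for primary_key, _row in records:
        yield primary_key, side


def reduce_phase(mapped_pairs):
    """Group by primary key and keep keys that appear on only one side."""
    grouped = defaultdict(set)
    for primary_key, side in mapped_pairs:
        grouped[primary_key].add(side)
    return {pk: sides for pk, sides in grouped.items() if len(sides) < 2}


def basic_join_mismatches(teradata_rows, hadoop_rows):
    """Return primary keys present in one system but not in the other."""
    mapped = list(map_phase(teradata_rows, "teradata"))
    mapped += list(map_phase(hadoop_rows, "hadoop"))
    return reduce_phase(mapped)


if __name__ == "__main__":
    teradata_rows = [("1", "a"), ("2", "b"), ("3", "c")]  # hypothetical data
    hadoop_rows = [("1", "a"), ("3", "c")]
    print(basic_join_mismatches(teradata_rows, hadoop_rows))
```

In this sketch every key from both sides passes through the grouping step, which is the transfer cost that a filter-based scheme aims to reduce.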
The thesis first introduces the basic knowledge and key technology of the Bloom filter algorithm. It then systematically reviews the existing Bloom filter algorithms and the pros and cons of each, and introduces the principle of the MapReduce program in Hadoop. In the next part, three schemes, all in accordance with the requirements, are introduced in detail. The fourth part describes the implementation of the schemes in Hadoop as well as the design and implementation of the testing system. The fifth part covers the testing and analysis of each scheme; the feasibility of the schemes is analyzed with respect to performance and cost using experimental data. Finally, conclusions and ideas for further improvement of the Bloom filter are presented.
Place, publisher, year, edition, pages
IT, 13 020
Engineering and Technology
Identifiers
URN: urn:nbn:se:uu:diva-196637
OAI: oai:DiVA.org:uu-196637
DiVA: diva2:610552
Master Programme in Computer Science
Christoff, Ivan