Parallelizing k-means with hadoop/mahout for big data analytics

Cui, Jianbin

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/11317

Title:	Parallelizing k-means with hadoop/mahout for big data analytics
Authors:	Cui, Jianbin
Advisors:	Li, M; Meng, H
Keywords:	Cloud computing;Parallel computing;Parallel k-means algorithm;Hadoop parameter;Map reduce
Issue Date:	2015
Publisher:	Brunel University London
Abstract:	The rapid development of Internet and cloud computing technologies has led to the explosive generation and processing of huge amounts of data. The ever-increasing data volumes bring great value to society but also raise a number of challenges. Data mining techniques have been widely used for decision analysis in finance, medicine, management, business and many other fields. However, analysing and mining valuable information from massive data has become a crucial problem, as traditional methods can hardly achieve high scalability in data processing. Recently, MapReduce has emerged as a major programming model for big data analytics. Apache Hadoop, an open-source implementation of MapReduce, has been widely taken up by the community. Hadoop facilitates the utilization of a large number of inexpensive commodity computers. In addition, Hadoop provides fault-handling support, which is especially useful for long-running jobs. Mahout is a new open-source Apache project that provides a number of machine learning and data mining algorithms on the Hadoop platform. As a machine learning technique, K-means has been widely used in data analytics for clustering. However, K-means incurs high computational overhead when the data set to be analysed is large. This thesis parallelizes K-means using the MapReduce model and implements a parallel K-means with Mahout on the Hadoop platform. The parallel K-means reduces the computation time significantly in comparison with the standard K-means when dealing with a large data set. In addition, this thesis evaluates the impact of Hadoop parameters on the performance of the Hadoop framework.
Description:	This thesis was submitted for the degree of Master of Philosophy and awarded by Brunel University London
URI:	http://bura.brunel.ac.uk/handle/2438/11317
Appears in Collections:	Electronic and Electrical Engineering; Department of Electronic and Electrical Engineering Theses
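
The abstract describes expressing K-means in the MapReduce model. As a rough illustration only, the sketch below simulates one way to do this in a single Python process: the map step assigns each point to its nearest centroid, and the reduce step averages the points in each cluster. It is not the thesis's implementation and does not use the Hadoop or Mahout APIs; the function names (`map_point`, `reduce_cluster`, `kmeans_iteration`, `kmeans`) and the choice to leave an empty cluster's centroid unchanged are assumptions made for this example.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_point(point, centroids):
    """Map step (hypothetical): emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce step (hypothetical): average the points assigned to one centroid."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(s / count for s in total)


def kmeans_iteration(points, centroids):
    """Run one map/shuffle/reduce round and return the updated centroids."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = map_point(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid that received no points keeps its old position.
    return [
        reduce_cluster(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat iterations until no centroid moves more than tol, or max_iter is hit."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(updated, centroids)):
            return updated
        centroids = updated
    return centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
result = kmeans(points, [(0, 0), (10, 10)])
```

On a real cluster each iteration would typically run as a separate job, with the map calls distributed across the nodes that hold the input splits; the dictionary above only mimics the grouping that the framework performs between the two phases.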

Files in This Item:

File	Description	Size	Format
Jianbin Cui-1326088.pdf		1.03 MB	Adobe PDF
