Parallelizing k-means with hadoop/mahout for big data analytics

Cui, Jianbin

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/11317

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Li, M	-
dc.contributor.advisor	Meng, H	-
dc.contributor.author	Cui, Jianbin	-
dc.date.accessioned	2015-09-04T15:27:31Z	-
dc.date.available	2015-09-04T15:27:31Z	-
dc.date.issued	2015	-
dc.identifier.uri	http://bura.brunel.ac.uk/handle/2438/11317	-
dc.description	This thesis was submitted for the degree of Master of Philosophy and awarded by Brunel University London	en_US
dc.description.abstract	The rapid development of Internet and cloud computing technologies has led to the explosive generation and processing of huge amounts of data. Ever-increasing data volumes bring great value to society, but they also raise a number of challenges. Data mining techniques have been widely used for decision analysis in finance, medicine, management, business and many other fields. However, analysing and mining valuable information from massive data has become a crucial problem, as traditional methods can hardly achieve high scalability in data processing. Recently, MapReduce has emerged as a major programming model for big data analytics. Apache Hadoop, an open-source implementation of MapReduce, has been widely taken up by the community. Hadoop facilitates the use of a large number of inexpensive commodity computers. In addition, Hadoop provides fault tolerance, which is especially useful for long-running jobs. Mahout is a new open-source Apache project that provides a number of machine learning and data mining algorithms on the Hadoop platform. As a machine learning technique, K-means has been widely used for clustering in data analytics. However, K-means incurs high computational overhead when the data to be analysed is large. This thesis parallelizes K-means using the MapReduce model and implements a parallel K-means with Mahout on the Hadoop platform. The parallel K-means reduces computation time significantly compared with the standard K-means when dealing with a large data set. In addition, this thesis evaluates the impact of Hadoop parameters on the performance of the Hadoop framework.	en_US
dc.language.iso	en	en_US
dc.publisher	Brunel University London	en_US
dc.relation.uri	http://bura.brunel.ac.uk/bitstream/2438/11317/1/Jianbin%20Cui-1326088.pdf	-
dc.subject	Cloud computing	en_US
dc.subject	Parallel computing	en_US
dc.subject	Parallel k-means algorithm	en_US
dc.subject	Hadoop parameter	en_US
dc.subject	Map reduce	en_US
dc.title	Parallelizing k-means with hadoop/mahout for big data analytics	en_US
dc.type	Thesis	en_US
Appears in Collections:	Electronic and Computer Engineering; Dept of Electronic and Electrical Engineering Theses
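
To illustrate the MapReduce formulation of K-means that the abstract refers to, here is a minimal single-process sketch in Python. It is not the thesis's Hadoop/Mahout implementation: the function names (map_phase, reduce_phase, kmeans), the sample points and the stopping rule are assumptions made for illustration only. The map step assigns each point to its nearest centroid, and the reduce step averages the points assigned to each centroid.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(points, centroids):
    """Map: emit (index of nearest centroid, (point, 1)) for every point."""
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda i: squared_distance(p, centroids[i]),
        )
        yield nearest, (p, 1)


def reduce_phase(pairs, centroids):
    """Reduce: sum the points per centroid index and return the new means.

    A centroid that receives no points keeps its previous position
    (an assumption of this sketch, not a claim about Mahout's behaviour).
    """
    sums = {}
    counts = defaultdict(int)
    for cid, (p, n) in pairs:
        if cid not in sums:
            sums[cid] = list(p)
        else:
            sums[cid] = [s + x for s, x in zip(sums[cid], p)]
        counts[cid] += n
    updated = list(centroids)
    for cid, total in sums.items():
        updated[cid] = tuple(s / counts[cid] for s in total)
    return updated


def kmeans(points, centroids, max_iter=20):
    """Repeat map and reduce until the centroids stop changing."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups of 2-D points.
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

On a real cluster, each iteration would typically run as one MapReduce job, with the map step distributed over data splits and the current centroids shared with every mapper; this sketch runs both steps in a single process to keep the data flow visible.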

Files in This Item:

File	Description	Size	Format
Jianbin Cui-1326088.pdf		1.03 MB	Adobe PDF
