Please use this identifier to cite or link to this item:
Title: A resource aware distributed LSI algorithm for scalable information retrieval
Authors: Liu, Yang
Advisors: Li, M
Keywords: Mapreduce;Simulation;Load balancing
Issue Date: 2011
Publisher: Brunel University School of Engineering and Design PhD Theses
Abstract: Latent Semantic Indexing (LSI) is one of the popular techniques in the information retrieval fields. Different from the traditional information retrieval techniques, LSI is not based on the keyword matching simply. It uses statistics and algebraic computations. Based on Singular Value Decomposition (SVD), the higher dimensional matrix is converted to a lower dimensional approximate matrix, of which the noises could be filtered. And also the issues of synonymy and polysemy in the traditional techniques can be overcome based on the investigations of the terms related with the documents. However, it is notable that LSI suffers a scalability issue due to the computing complexity of SVD. This thesis presents a resource aware distributed LSI algorithm MR-LSI which can solve the scalability issue using Hadoop framework based on the distributed computing model MapReduce. It also solves the overhead issue caused by the involved clustering algorithm. The evaluations indicate that MR-LSI can gain significant enhancement compared to the other strategies on processing large scale of documents. One remarkable advantage of Hadoop is that it supports heterogeneous computing environments so that the issue of unbalanced load among nodes is highlighted. Therefore, a load balancing algorithm based on genetic algorithm for balancing load in static environment is proposed. The results show that it can improve the performance of a cluster according to heterogeneity levels. Considering dynamic Hadoop environments, a dynamic load balancing strategy with varying window size has been proposed. The algorithm works depending on data selecting decision and modeling Hadoop parameters and working mechanisms. Employing improved genetic algorithm for achieving optimized scheduler, the algorithm enhances the performance of a cluster with certain heterogeneity levels.
Description: This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.
Appears in Collections:Electronic and Computer Engineering
Dept of Electronic and Computer Engineering Theses

Files in This Item:
File Description SizeFormat 
FullTextThesis.pdf904.9 kBAdobe PDFView/Open

Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.