|Title: ||A resource aware distributed LSI algorithm for scalable information retrieval|
|Authors: ||Liu, Yang|
|Advisors: ||Li, M|
|Publication Date: ||2011|
|Publisher: ||Brunel University School of Engineering and Design PhD Theses|
|Abstract: ||Latent Semantic Indexing (LSI) is a popular technique in information retrieval. Unlike traditional information retrieval techniques, LSI does not rely on simple keyword matching; it uses statistical and algebraic computation. Using Singular Value Decomposition (SVD), a high-dimensional matrix is converted to a lower-dimensional approximation in which noise is filtered out. The problems of synonymy and polysemy that affect traditional techniques can also be overcome by examining the relationships between terms and documents. However, LSI suffers from a scalability problem because of the computational complexity of SVD.
This thesis presents MR-LSI, a resource-aware distributed LSI algorithm that addresses the scalability issue using the Hadoop framework, which is based on the MapReduce distributed computing model. It also addresses the overhead introduced by the clustering algorithm involved. The evaluations indicate that MR-LSI achieves a significant improvement over the other strategies when processing large document collections. One notable advantage of Hadoop is that it supports heterogeneous computing environments, which brings the issue of unbalanced load among nodes to the fore. Therefore, a load balancing algorithm based on a genetic algorithm is proposed for static environments. The results show that it can improve the performance of a cluster, depending on its level of heterogeneity.
For dynamic Hadoop environments, a dynamic load balancing strategy with a varying window size is proposed. The algorithm relies on data selection decisions and on modeling Hadoop parameters and working mechanisms. By employing an improved genetic algorithm to obtain an optimized scheduler, the algorithm enhances the performance of clusters with certain levels of heterogeneity.|
|Description: ||This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.|
|Appears in Collections:||School of Engineering and Design Theses|
Electronic and Computer Engineering
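
The LSI step summarized in the abstract (a rank-reduced SVD of a term-document matrix, followed by matching in the reduced space) can be illustrated with a minimal single-machine sketch. This is not the thesis's MR-LSI algorithm and involves no Hadoop or MapReduce; the toy matrix, the term list, and the function names `lsi_fit`, `fold_in_query`, and `cosine` are assumptions made for illustration only.

```python
import numpy as np

def lsi_fit(term_doc, k):
    """Return the rank-k SVD factors (U_k, s_k, Vt_k) of a term-document matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in_query(query_vec, U_k, s_k):
    """Project a term-space query vector into the k-dimensional latent space."""
    return (query_vec @ U_k) / s_k

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 1, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 0, 1],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

U_k, s_k, Vt_k = lsi_fit(A, k=2)
doc_vecs = Vt_k.T                                  # one k-dimensional row per document

query = np.array([1, 0, 0, 0, 0], dtype=float)     # query containing only "car"
q_vec = fold_in_query(query, U_k, s_k)
scores = [cosine(q_vec, d) for d in doc_vecs]
```

In this toy example the second document never contains the word "car", so keyword matching would miss it, yet it scores highly in the latent space because it shares "engine" with documents that do mention "car". The full SVD computed here is the step whose cost grows quickly with collection size, which is the scalability problem the abstract refers to.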
Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.