Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/13510
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Abbod, M | -
dc.contributor.author | Alsaad, Amal | -
dc.date.accessioned | 2016-11-17T11:14:56Z | -
dc.date.available | 2016-11-17T11:14:56Z | -
dc.date.issued | 2016 | -
dc.identifier.uri | http://bura.brunel.ac.uk/handle/2438/13510 | -
dc.description | This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London | en_US
dc.description.abstract | Many text extraction and classification systems have been developed for English and other international languages, most of which are written in Roman letters. Arabic, however, is a difficult language with special rules and morphology, and few systems have been developed for Arabic text categorization. Arabic is a Semitic language whose morphology is more complicated than that of English. Because of this complex morphology, pre-processing routines are needed to extract the roots of words and then classify them according to the group of acts or meaning. In this thesis, a system has been developed and tested for text classification. The system has two stages: the first extracts the roots from the text, and the second classifies the text according to predefined categories. The linguistic root extraction stage is composed of two main phases. The first phase handles the removal of affixes, including prefixes, suffixes and infixes. Prefixes and suffixes are removed depending on the length of the word, and the word's morphological pattern is checked after each deduction to remove infixes. In the second phase, the root extraction algorithm is formulated to handle weak, defined, eliminated-long-vowel and two-letter geminated words, as there is a substantial number of irregular Arabic words in texts. Once the roots are extracted, they are checked against a predefined list of 3800 triliteral and 900 quadriliteral roots. A series of experiments has been conducted to improve and test the performance of the proposed algorithm. The results show that the developed algorithm has better accuracy than the existing stemming algorithm. The second stage is the document classification stage. In this stage two non-parametric classifiers are tested, namely Artificial Neural Networks (ANN) and Support Vector Machine (SVM).
The system is trained on 6 categories: culture, economy, international, local, religion and sports. The system is trained on 80% of the available data. From each category, the 10 most frequent terms are selected as features. The classification algorithms are tested on the remaining 20% of the documents. The results of ANN and SVM are compared with the standard method used for text classification, the term frequency-based method. Results show that ANN and SVM have better accuracy (80-90%) than the standard method (60-70%). The proposed method demonstrates the ability to categorize Arabic text documents into the appropriate categories with a high precision rate. | en_US
dc.language.iso | en | en_US
dc.publisher | Brunel University London | en_US
dc.relation.uri | http://bura.brunel.ac.uk/bitstream/2438/13510/1/FulltextThesis.pdf | -
dc.subject | Data mining | en_US
dc.subject | Machine learning | en_US
dc.subject | AI | en_US
dc.subject | Information retrieval | en_US
dc.subject | NLP | en_US
dc.title | Enhanced root extraction and document classification algorithm for Arabic text | en_US
dc.type | Thesis | en_US
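
The abstract's term frequency baseline (train on 80% of the documents, take the 10 most frequent terms of each category as features, test on the remaining 20%) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names `top_terms`, `build_features`, `classify` and `split`, the scoring rule (count how often a category's feature terms occur in the document), and the toy English tokens standing in for extracted Arabic roots are all hypothetical.

```python
from collections import Counter


def top_terms(docs, k=10):
    """Return the k most frequent terms across a list of token lists."""
    counts = Counter(term for doc in docs for term in doc)
    return [term for term, _ in counts.most_common(k)]


def build_features(train, k=10):
    """Map each category to its k most frequent training terms.

    `train` is a list of (tokens, category) pairs.
    """
    by_category = {}
    for tokens, category in train:
        by_category.setdefault(category, []).append(tokens)
    return {cat: top_terms(docs, k) for cat, docs in by_category.items()}


def classify(tokens, features):
    """Pick the category whose feature terms occur most often in the document.

    Assumed scoring rule; ties go to the alphabetically later category
    because max() keeps the first maximum of the sorted names only when
    scores differ, so avoid relying on tie behaviour.
    """
    counts = Counter(tokens)
    scores = {cat: sum(counts[t] for t in terms) for cat, terms in features.items()}
    return max(sorted(scores), key=scores.get)


def split(data, train_fraction=0.8):
    """Split the data in order: first part for training, the rest for testing."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]


# Toy data: English placeholders for extracted roots (hypothetical).
data = [
    (["goal", "match", "team"], "sports"),
    (["market", "bank", "price"], "economy"),
    (["team", "goal", "player"], "sports"),
    (["bank", "price", "trade"], "economy"),
    (["match", "team", "goal"], "sports"),
]
train, test = split(data)
features = build_features(train, k=2)
predictions = [classify(tokens, features) for tokens, _ in test]
```

In practice the split would normally be shuffled and stratified by category, and the per-category term lists would feed a classifier such as an ANN or SVM instead of the simple count used here.
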
Appears in Collections: Electronic and Computer Engineering
Dept of Electronic and Electrical Engineering Theses

Files in This Item:
File | Description | Size | Format
FulltextThesis.pdf | - | 2.17 MB | Adobe PDF


Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.