Enhanced root extraction and document classification algorithm for Arabic text

Alsaad, Amal

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/13510

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Abbod, M	-
dc.contributor.author	Alsaad, Amal	-
dc.date.accessioned	2016-11-17T11:14:56Z	-
dc.date.available	2016-11-17T11:14:56Z	-
dc.date.issued	2016	-
dc.identifier.uri	http://bura.brunel.ac.uk/handle/2438/13510	-
dc.description	This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London	en_US
dc.description.abstract	Many text extraction and classification systems have been developed for English and other international languages, most of which are written in Roman letters. Arabic, however, is a difficult language with its own special rules and morphology, and few systems have been developed for Arabic text categorization. Arabic is a Semitic language whose morphology is more complicated than that of English. Because of this complex morphology, pre-processing routines are needed to extract the roots of words and then classify them according to the group of acts or meaning. In this thesis, a system has been developed and tested for text classification. The system has two stages: the first extracts the roots from the text, and the second classifies the text according to predefined categories. The linguistic root extraction stage is composed of two main phases. The first phase handles the removal of affixes, including prefixes, suffixes and infixes. Prefixes and suffixes are removed depending on the length of the word, and the word's morphological pattern is checked after each deduction in order to remove infixes. In the second phase, the root extraction algorithm is formulated to handle weak, defined, eliminated-long-vowel and two-letter geminated words, as there is a substantial number of irregular Arabic words in texts. Once the roots are extracted, they are checked against a predefined list of 3800 triliteral and 900 quadriliteral roots. A series of experiments has been conducted to improve and test the performance of the proposed algorithm. The results show that the developed algorithm has better accuracy than the existing stemming algorithm. The second stage is the document classification stage. In this stage two non-parametric classifiers are tested, namely Artificial Neural Networks (ANN) and Support Vector Machine (SVM).
The system is trained on 6 categories: culture, economy, international, local, religion and sports. It is trained on 80% of the available data, and from each category the 10 most frequent terms are selected as features. The classification algorithms are tested on the remaining 20% of the documents. The results of ANN and SVM are compared to the standard method used for text classification, the term frequency-based method. Results show that ANN and SVM have better accuracy (80-90%) than the standard method (60-70%). The proposed method demonstrates the ability to categorize Arabic text documents into the appropriate categories with a high precision rate.	en_US
dc.language.iso	en	en_US
dc.publisher	Brunel University London	en_US
dc.relation.uri	http://bura.brunel.ac.uk/bitstream/2438/13510/1/FulltextThesis.pdf	-
dc.subject	Data mining	en_US
dc.subject	Machine learning	en_US
dc.subject	AI	en_US
dc.subject	Information retrieval	en_US
dc.subject	NLP	en_US
dc.title	Enhanced root extraction and document classification algorithm for Arabic text	en_US
dc.type	Thesis	en_US
Appears in Collections:	Electronic and Electrical Engineering; Department of Electronic and Electrical Engineering Theses
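
The first phase of root extraction described in the abstract (affix removal driven by word length, a pattern check after each removal, then validation against a root list) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the affix lists, the two sample patterns, the tiny root list and the function names are hypothetical and are not taken from the thesis. The second phase (weak, eliminated-long-vowel and geminated words) is not covered.

```python
# Illustrative sketch only. The affixes, patterns and roots below are small
# hypothetical samples, not the resources or rules used in the thesis.
PREFIXES = ["ال", "و", "ب"]
SUFFIXES = ["ون", "ات", "ة"]
KNOWN_ROOTS = {"كتب", "درس", "علم"}  # stands in for the predefined root list


def match_pattern(word):
    """Return three candidate root letters if word fits a sample pattern."""
    if len(word) == 3:
        return word
    # Sample pattern فاعل: the infix alif sits in second position.
    if len(word) == 4 and word[1] == "ا":
        return word[0] + word[2] + word[3]
    # Sample pattern مفعول: leading م and و in fourth position.
    if len(word) == 5 and word[0] == "م" and word[3] == "و":
        return word[1] + word[2] + word[4]
    return None


def candidate_forms(word):
    """Yield the word, then the word after each affix removal."""
    yield word
    for prefix in PREFIXES:
        # Remove a prefix only if at least three letters would remain.
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            yield word
            break
    for suffix in SUFFIXES:
        # Same length condition for suffixes.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            yield word
            break


def extract_root(word):
    """Return a validated root, or None if no candidate is in the root list."""
    for form in candidate_forms(word):
        root = match_pattern(form)  # pattern check after each removal
        if root in KNOWN_ROOTS:
            return root
    return None


if __name__ == "__main__":
    for sample in ["الكاتبون", "مكتوب", "درس"]:
        print(sample, extract_root(sample))
```

In this sketch a candidate is accepted only when it appears in the root list, which mirrors the validation step the abstract mentions; a real system would need far larger affix and pattern inventories.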

Files in This Item:

File	Description	Size	Format
FulltextThesis.pdf		2.17 MB	Adobe PDF
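
The classification setup described in the abstract (an 80%/20% train/test split, the 10 most frequent terms per category as features, and a term frequency-based baseline) can be sketched in Python as follows. This is one plausible reading, not the thesis's implementation: the function names, the scoring rule of the baseline and the English placeholder tokens are all assumptions, and the ANN and SVM classifiers are not shown.

```python
from collections import Counter


def split(docs, train_fraction=0.8):
    """Split a list of documents into training and test portions."""
    cut = int(len(docs) * train_fraction)
    return docs[:cut], docs[cut:]


def top_terms(docs, k=10):
    """Return the k most frequent terms across tokenised documents."""
    counts = Counter(term for doc in docs for term in doc)
    return [term for term, _ in counts.most_common(k)]


def build_features(train_by_category, k=10):
    """Map each category to its k most frequent training terms."""
    return {cat: top_terms(docs, k) for cat, docs in train_by_category.items()}


def classify_by_frequency(doc, features):
    """Assumed baseline: choose the category whose feature terms
    occur most often in the document."""
    counts = Counter(doc)
    scores = {
        cat: sum(counts[term] for term in terms)
        for cat, terms in features.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical toy data; in the thesis the tokens would be extracted roots.
    train = {
        "sports": [["goal", "team", "goal"], ["team", "match"]],
        "economy": [["bank", "market"], ["market", "price", "bank"]],
    }
    features = build_features(train, k=2)
    print(classify_by_frequency(["team", "goal", "price"], features))
```

The same per-category term lists could be turned into count vectors and passed to an ANN or SVM; the baseline above simply compares raw counts.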
