Please use this identifier to cite or link to this item:
Title: Computational analysis of CpG site DNA methylation
Authors: Ghorbani, Mohammadmersad
Advisors: Payne, A
Taylor, SJE
Keywords: Grid computing;Scientific workflow;BOINC;Epigenetics;Distributed systems
Issue Date: 2013
Publisher: Brunel University, School of Information Systems, Computing and Mathematics
Abstract: Epigenetics is the study of factors that can change DNA and passed to next generation without change to DNA sequence. DNA methylation is one of the categories of epigenetic change. DNA methylation is the attachment of methyl group (CH3) to DNA. Most of the time it occurs in the sequences that G is followed by C known as CpG sites and by addition of methyl to the cytosine residue. As science and technology progress new data are available about individual’s DNA methylation profile in different conditions. Also new features discovered that can have role in DNA methylation. The availability of new data on DNA methylation and other features of DNA provide challenge to bioinformatics and the opportunity to discover new knowledge from existing data. In this research multiple data series were used to identify classes of methylation DNA to CpG sites. These classes are a) Never methylated CpG sites,b) Always methylated CpG sites, c) Methylated CpG sites in cancer/disease samples and non-methylated in normal samples d) Methylated CpG sites in normal samples and non-methylated in cancer/disease samples. After identification of these sites and their classes, an analysis was carried out to find the features which can better classify these sites a matrix of features was generated using four applications in EMBOSS software suite. Features matrix was also generated using the gUse/WS-PGRADE portal workflow system. In order to do this each of the four applications were grid enabled and ported to BOINC platform. The gUse portal was connected to the BOINC project via 3G-bridge. Each node in the workflow created portion of matrix and then these portions were combined together to create final matrix. This final feature matrix used in a hill climbing workflow. Hill climbing node was a JAVA program ported to BOINC platform. A Hill climbing search workflow was used to search for a subset of features that are better at classifying the CpG sites using 5 different measurements and three different classification methods: support vector machine, naïve bayes and J48 decision tree. Using this approach the hill climbing search found the models which contain less than half the number of features and better classification results. It is also been demonstrated that using gUse/WS-PGRADE workflow system can provide a modular way of feature generation so adding new feature generator application can be done without changing other parts. It is also shown that using grid enabled applications can speedup both feature generation and feature subset selection. The approach used in this research for distributed workflow based feature generation is not restricted to this study and can be applied in other studies that involve feature generation. The approach also needs multiple binaries to generate portions of features. The grid enabled hill climbing search application can also be used in different context as it only requires to follow the same format of feature matrix.
Description: This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.
Appears in Collections:Computer Science
Dept of Computer Science Theses

Files in This Item:
File Description SizeFormat 
FulltextThesis.pdf4.83 MBAdobe PDFView/Open

Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.