Ensemble learning for detecting gene-gene interactions in colorectal cancer

Dorani, Faramarz (2018) Ensemble learning for detecting gene-gene interactions in colorectal cancer. Masters thesis, Memorial University of Newfoundland.

Full text: PDF, Accepted Version (620kB)
Available under License - The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.


The fundamental task of human genetics is to detect the genetic variations that primarily contribute to a disease phenotype. The most popular approach to understanding the etiology of heritable human diseases (e.g., cancer) is the genome-wide association study (GWAS). Colorectal cancer (CRC) is a common cause of death in developed countries, and it has a particularly high incidence rate in the province of Newfoundland and Labrador. Identifying the genetic factors associated with CRC can therefore help us better understand the disease and treat and prevent it more effectively. This study seeks to identify genetic variations associated with CRC using machine learning, including feature selection and ensemble learning algorithms. We analyze a GWAS dataset on CRC collected from the Newfoundland population. First, we perform quality control steps on the raw genetic data and prepare it for the machine learning methods. Second, we compare six feature selection methods by applying them to a simulated dataset and to the CRC GWAS data. The feature selection method that performs best at detecting gene-gene interactions is then used to choose a subset of more relevant features for the next step of the analysis. Subsequently, two ensemble algorithms, Random Forests and Gradient Boosting Machine, are applied to the reduced data to identify significant interacting genetic markers associated with CRC. Last, the findings from the machine learning methods are biologically validated using online databases and enrichment analysis tools. From the results of the ensemble algorithms, 44 significant genetic markers are detected, of which 29 have corresponding genes. Among them, the genes DCC, ALK and ITGA1 have previously been found to be associated with CRC. In addition, the genes E2F3 and NID2 are potentially associated with CRC, given their already known associations with other types of cancer. Moreover, the biological interpretations of these genes reveal biological pathways that may help predict the risk of the disease and better understand its etiology.
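
The two-stage analysis described in the abstract (feature selection followed by Random Forests and Gradient Boosting) can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes scikit-learn and NumPy, uses synthetic genotypes rather than the CRC GWAS data, and uses a mutual-information filter as a stand-in because the abstract does not name the six feature selection methods. The causal SNP indices, sample sizes, and hyperparameters are all hypothetical.

```python
from functools import partial

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)

# Hypothetical toy data: 400 individuals x 50 SNPs, coded 0/1/2
# (number of minor alleles). This is NOT the CRC GWAS dataset.
n_samples, n_snps = 400, 50
X = rng.integers(0, 3, size=(n_samples, n_snps))

# Assumed disease model: a case needs at least one minor allele at
# BOTH SNP 3 and SNP 7 (a simple two-locus interaction), plus 5% label noise.
y = ((X[:, 3] >= 1) & (X[:, 7] >= 1)).astype(int)
flip = rng.random(n_samples) < 0.05
y = np.where(flip, 1 - y, y)

# Step 1: feature selection. A mutual-information filter is used here only
# as a placeholder for whichever method a comparative study would favour.
score = partial(mutual_info_classif, discrete_features=True, random_state=0)
selector = SelectKBest(score_func=score, k=10).fit(X, y)
kept = selector.get_support(indices=True)  # original SNP indices retained
X_reduced = X[:, kept]

# Step 2: fit both ensemble models on the reduced data and rank the
# retained SNPs by each model's impurity-based feature importance.
models = {
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(
        n_estimators=200, max_depth=3, random_state=0
    ),
}
top_markers = {}
for name, model in models.items():
    model.fit(X_reduced, y)
    order = np.argsort(model.feature_importances_)[::-1]
    top_markers[name] = [int(kept[i]) for i in order[:2]]
    print(name, "top-ranked SNP indices:", top_markers[name])
```

A high importance rank only suggests that a marker is useful to the model. Establishing statistical significance or a genuine interaction would need further steps, such as permutation testing and the biological validation the abstract mentions.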

Item Type: Thesis (Masters)
URI: http://research.library.mun.ca/id/eprint/13185
Item ID: 13185
Additional Information: Includes bibliographical references (pages 89-103).
Keywords: machine learning, gene-gene interactions, colorectal cancer, feature selection, random forests, gradient boosting machine, GWAS, ensemble algorithms
Department(s): Science, Faculty of > Computer Science
Date: February 2018
Date Type: Submission
Library of Congress Subject Heading: Colon (Anatomy) -- Cancer -- Genetic aspects; Artificial intelligence -- Medical applications
