Scalable feature selection methods by augmenting sparse least squares

Marvikhorasani, Hanieh (2019) Scalable feature selection methods by augmenting sparse least squares. Masters thesis, Memorial University of Newfoundland.

PDF (English) - Accepted Version
Available under License - The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.


Feature selection has been widely used to select a subset of genes (features) from microarray datasets that helps discriminate healthy samples from those with a particular disease. However, most feature selection methods suffer from high computational complexity when applied to these datasets because of the large number of genes present. Usually, only a small subset of these genes contributes to the disease, and the rest are irrelevant to the condition. This study proposes a sparse method, Sparse Least Squares (SLS), based on singular value decomposition and least squares, to filter out irrelevant features. The thesis also considers reducing the number of features by clustering genes and selecting representative genes from each cluster based on two different metrics. These dataset size-reduction methods are incorporated into three state-of-the-art feature selection methods, namely mRMR, SVM-RFE, and HSIC-Lasso. The resulting methods are applied to three Inflammatory Bowel Disease (IBD) datasets and combined with support vector machine and random forest classifiers. Experimental results show that the proposed SLS method significantly reduces the running time of feature selection algorithms and improves the predictive power of the machine learning models. SLS is integrated into a novel feature selection method (DRPT), which, when combined with a Support Vector Machine (SVM), is able to generate models that discriminate between healthy subjects and subjects with Ulcerative Colitis (UC) based on the expression values of genes in colon samples. The best models were validated on two validation datasets and achieved higher predictive performance than a model generated by a recently published biomarker discovery tool (BioDiscML).
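The abstract does not spell out the SLS procedure, so the following is only a minimal sketch of the general idea it names: fit a least-squares model through an SVD-based pseudoinverse and keep the features with the largest coefficients. The function name `sls_filter`, the `keep_ratio` parameter, and the magnitude-based ranking rule are assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np


def sls_filter(X, y, keep_ratio=0.1):
    """Hypothetical helper, not the thesis's algorithm.

    Ranks features by the magnitude of their least-squares coefficient
    and returns the indices of the top fraction, plus the coefficients.
    """
    # np.linalg.pinv computes the Moore-Penrose pseudoinverse via SVD,
    # so this is the minimum-norm least-squares solution of X w ~ y.
    w = np.linalg.pinv(X) @ y
    # Number of features to keep (at least one).
    k = max(1, int(round(keep_ratio * X.shape[1])))
    # Indices of the k coefficients with the largest absolute value.
    keep = np.argsort(-np.abs(w))[:k]
    return np.sort(keep), w


# Synthetic demo: 50 samples, 20 features, only features 0 and 1 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 20))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
kept_demo, w_demo = sls_filter(X_demo, y_demo, keep_ratio=0.1)
print("kept feature indices:", kept_demo)
```

The demo uses more samples than features so that the least-squares solution is unique. Microarray data typically has far more genes than samples, in which case the pseudoinverse returns the minimum-norm solution and the ranking is correspondingly less clear-cut.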

Item Type: Thesis (Masters)
Item ID: 14322
Additional Information: Includes bibliographical references (pages 52-60).
Keywords: Machine learning, feature selection
Department(s): Science, Faculty of > Computer Science
Date: November 2019
Date Type: Submission
Digital Object Identifier (DOI):
Library of Congress Subject Heading: Ranking and selection (Statistics); Least squares.
