Relevant gene subset selection: the maximum margin criterion in SVM and genetic algorithm

Huang, Xiao Bing (2006) Relevant gene subset selection: the maximum margin criterion in SVM and genetic algorithm. Masters thesis, Memorial University of Newfoundland.

[English] PDF - Accepted Version
Available under License - The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

Abstract

The data collected from a typical microarray experiment usually consist of tens of samples and thousands of genes (i.e., features). Usually only a small subset of the features is relevant to differentiating the samples. The problem of identifying an optimal subset of features for this differentiation is called Feature Subset Selection (FSS). The main purpose of this thesis is to develop a method for relevant gene subset selection using microarray gene expression data. Specifically, the thesis extends the classic Support Vector Machine (SVM) algorithm to present a new hill-climbing method for feature selection, Relevant Subset Selection Using the Maximum Margin Criterion (RSSMMC), together with its Genetic Algorithm (GA) version, RSSMMC-GA. The method identifies two factors, one biological and the other mathematical, that can affect the SVM margin value. Through an analytic process, we neutralize the mathematical factor, which contributes nothing to relevant gene selection, and use the biological factor to select genes that increase the SVM margin. The resulting subset, with a fixed number of features, is determined when the maximum accumulative margin value is achieved.

Experiments show that this method outperforms previous approaches, which select features using correlation techniques and Recursive Feature Elimination (RFE), at generating biologically relevant genes. In contrast to those methods, RSSMMC creates a unique and more compact gene subset. Moreover, because RSSMMC starts from an empty set and constructs a subset whose size is usually small, it consumes less computation time than the methods it is compared against. This improvement is especially evident in large data sets.
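The general idea of hill-climbing from an empty set toward a fixed-size subset with a large SVM margin can be illustrated with a minimal sketch. This is not the RSSMMC algorithm from the thesis: it omits the neutralization of the mathematical factor and the GA variant, and simply adds, at each step, the gene whose inclusion gives the largest hard-margin width. The function names (`hull_distance`, `greedy_margin_selection`) and the toy data are hypothetical. The margin is computed as the distance between the convex hulls of the two classes, which for linearly separable data equals the hard-margin SVM margin width.

```python
from itertools import product


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hull_distance(pos, neg, iters=2000, tol=1e-12):
    """Distance between the convex hulls of two point sets (Gilbert's algorithm).

    For linearly separable classes this equals the hard-margin SVM margin
    width (2 / ||w||); it is (approximately) 0 when the hulls overlap.
    """
    # Vertices of the Minkowski difference: each positive minus each negative point.
    diffs = [[a - b for a, b in zip(p, q)] for p, q in product(pos, neg)]
    z = diffs[0][:]  # current estimate of the minimum-norm point
    for _ in range(iters):
        s = min(diffs, key=lambda v: _dot(z, v))  # vertex most opposed to z
        d = [a - b for a, b in zip(z, s)]         # d = z - s
        gap, dd = _dot(z, d), _dot(d, d)
        if gap <= tol or dd == 0.0:
            break
        t = min(1.0, gap / dd)                    # exact line search on [z, s]
        z = [a - t * b for a, b in zip(z, d)]
    return _dot(z, z) ** 0.5


def greedy_margin_selection(X, y, k):
    """Forward selection of k feature indices; labels in y must be +1 / -1."""
    selected, remaining = [], list(range(len(X[0])))
    for _ in range(k):
        def margin_with(g):
            cols = selected + [g]
            pos = [[r[c] for c in cols] for r, lab in zip(X, y) if lab == 1]
            neg = [[r[c] for c in cols] for r, lab in zip(X, y) if lab == -1]
            return hull_distance(pos, neg)
        best = max(remaining, key=margin_with)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): 4 samples x 3 genes; gene 1 does not separate the classes.
X = [[1.0, 0.5, 2.0], [1.2, 0.1, 2.5], [3.0, 0.4, 3.0], [3.3, 0.2, 3.6]]
y = [1, 1, -1, -1]
print(greedy_margin_selection(X, y, 2))
```

Because the raw margin can only grow as dimensions are added, the sketch compares candidate subsets of equal size only; how margins of different subset sizes should be compared is exactly the kind of question the thesis addresses analytically.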

Item Type:	Thesis (Masters)
URI:	http://research.library.mun.ca/id/eprint/10701
Item ID:	10701
Additional Information:	Bibliography: leaves 105-114.
Department(s):	Science, Faculty of > Computer Science
Date:	2006
Date Type:	Submission
Library of Congress Subject Heading:	Mathematical optimization; Regression analysis.
