Using the Machine Learning Approach to Predict Patient Survival from High-Dimensional Survival Data

Zhang, Wenbin (2015) Using the Machine Learning Approach to Predict Patient Survival from High-Dimensional Survival Data. Masters thesis, Memorial University of Newfoundland.

PDF - Accepted Version
Available under License - The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.


Survival analysis with high-dimensional data deals with predicting patient survival from gene expression data and clinical data. A crucial task for the accuracy of survival analysis in this context is selecting the features that are highly correlated with the patient's survival time. Because the class labels are hidden, existing feature selection methods in machine learning are not applicable. In contrast to classical statistical methods, which address this issue with the Cox score, we propose to tackle the problem by discretizing patient survival time into a suitable number of subgroups, with the number chosen by the silhouette cluster validity index. To cope with censored patients, we use the k-nearest neighbor method based on clinical parameters that are truly associated with survival time. These parameters are selected using penalized logistic regression and the penalized proportional hazards model with the EM algorithm, and are then used to estimate censored survival times. Next, the estimated class labels are combined with feature selection to identify a list of genes that are correlated with survival time, and classifiers are applied to this subset of genes to determine which subtype is present in a future patient. By doing so, we expect the identified subgroups to be not only biologically meaningful but also different in terms of survival. The effectiveness and efficiency of the proposed method are demonstrated through comparisons with classical statistical methods on real-world and simulated datasets.
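
The discretization step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes all survival times are fully observed (censored times would first have to be estimated, as the abstract describes), it uses a plain one-dimensional k-means as the clustering method, which the abstract does not specify, and it picks the number of subgroups with the highest mean silhouette width. The function names `kmeans_1d`, `silhouette`, and `choose_num_subgroups`, the quantile-based initialization, and the example survival times are all hypothetical.

```python
from statistics import mean


def kmeans_1d(values, k, iters=100):
    """Lloyd's k-means on scalars; returns one cluster label per value.

    Assumption: centers start at evenly spaced quantiles of the sorted data,
    which keeps the result deterministic for this illustration.
    """
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Recompute each center as the mean of its members (keep it if empty).
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(values, labels):
    """Mean silhouette width: s(i) = (b - a) / max(a, b), 0 for singletons."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    if len(clusters) < 2:
        return 0.0  # the silhouette is undefined for a single cluster
    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(abs(v - u) for u in other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(scores)


def choose_num_subgroups(times, k_range=range(2, 6)):
    """Return the k with the highest mean silhouette, plus all scores."""
    scores = {k: silhouette(times, kmeans_1d(times, k)) for k in k_range}
    return max(scores, key=scores.get), scores


# Hypothetical survival times in months, for illustration only.
times = [2, 3, 4, 5, 20, 22, 24, 25, 58, 60, 61, 63]
best_k, scores = choose_num_subgroups(times)
print(best_k, kmeans_1d(times, best_k))
```

On this toy input the three well-separated groups of times give the highest mean silhouette at three subgroups, so each patient's cluster label can then serve as the class label for feature selection.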

Item Type: Thesis (Masters)
Item ID: 12733
Additional Information: Includes bibliographical references (pages 92-102).
Department(s): Science, Faculty of > Computer Science
Date: October 2015
Date Type: Submission
Library of Congress Subject Heading: Machine learning ; Patients -- Statistics, Medical
Medical Subject Heading: Survival Analysis ; Gene Expression Profiling
