Afshar, Majid (2021) Large-scale dimensionality reduction using perturbation theory and singular vectors. Doctoral (PhD) thesis, Memorial University of Newfoundland.
[English] PDF - Accepted Version
Available under License - The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Abstract
Massive volumes of high-dimensional data have become pervasive, with the number of features significantly exceeding the number of samples in many applications. This has created a bottleneck for data mining applications and increased the computational burden of machine learning algorithms that perform classification or pattern recognition. Dimensionality reduction can address this problem in two ways, namely feature selection (FS) and feature extraction. In this thesis, we focus on FS because, in many applications such as bioinformatics, domain experts need to validate a set of original features to corroborate the hypothesis of the prediction models. When processing high-dimensional data, FS mainly involves detecting a limited number of important features among tens or hundreds of thousands of irrelevant and redundant features.
We start by filtering out the irrelevant features using our proposed Sparse Least Squares (SLS) method, in which a score is assigned to each feature and the low-scoring features are removed using a soft threshold. To demonstrate the effectiveness of SLS, we used it to augment well-known FS methods, thereby achieving substantially reduced running times while improving or at least maintaining the prediction accuracy of the models. We developed a linear FS method (DRPT) which, after data reduction by SLS, clusters the reduced data using perturbation theory to detect correlations between the remaining features. Important features are ultimately selected from each cluster, and the redundant features are discarded. To extend the applicability of clustering to grouping redundant features, we proposed a new Singular Vectors FS (SVFS) method that is capable of both removing the irrelevant features and effectively clustering the remaining features. As a result, the features in each cluster are correlated only with one another. The important features selected independently from the different clusters make up the final ranking.
Devising thresholds for filtering irrelevant and redundant features makes our model adaptable to the particular needs of various applications. A comprehensive evaluation on benchmark biological and image datasets shows the superiority of our proposed methods over state-of-the-art FS methods in terms of classification accuracy, running time, and memory usage.
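As a rough illustration of the score-and-threshold filtering idea described in the abstract, the following Python sketch scores each feature by the magnitude of its coefficient in a minimum-norm least-squares fit and keeps only the top-scoring fraction. It is not the SLS algorithm from the thesis: the scoring rule, the quantile-based hard cut (used here in place of the soft threshold), the function names, the `keep_ratio` parameter, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def least_squares_feature_scores(X, y):
    """Score features by |coefficient| of a minimum-norm least-squares fit.

    Stand-in scoring rule (an assumption, not the thesis's SLS score).
    """
    # When X has more columns than rows, lstsq returns the minimum-norm solution.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.abs(w)


def filter_features(X, y, keep_ratio=0.1):
    """Keep the features whose score reaches the (1 - keep_ratio) quantile.

    keep_ratio is a hypothetical parameter standing in for a tunable threshold.
    """
    scores = least_squares_feature_scores(X, y)
    threshold = np.quantile(scores, 1.0 - keep_ratio)
    kept = np.flatnonzero(scores >= threshold)  # indices in ascending order
    return kept, scores


# Synthetic example: more features than samples, only two informative features.
rng = np.random.default_rng(0)
n_samples, n_features = 100, 300
X = rng.standard_normal((n_samples, n_features))
y = 3.0 * X[:, 7] - 3.0 * X[:, 42] + 0.01 * rng.standard_normal(n_samples)

kept, scores = filter_features(X, y, keep_ratio=0.1)
print(f"kept {kept.size} of {n_features} features")
```

In a pipeline of the kind the abstract describes, the retained column indices would then be passed to a downstream step that groups correlated features and picks representatives from each group.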
Item Type: | Thesis (Doctoral (PhD)) |
---|---|
URI: | http://research.library.mun.ca/id/eprint/15221 |
Item ID: | 15221 |
Additional Information: | Includes bibliographical references (pages 134-142). |
Keywords: | dimensionality reduction, feature selection, singular vectors, perturbation theory, least squares |
Department(s): | Science, Faculty of > Computer Science |
Date: | March 2021 |
Date Type: | Submission |
Digital Object Identifier (DOI): | https://doi.org/10.48336/DB3Z-BF31 |
Library of Congress Subject Heading: | Dimension reduction (Statistics); Perturbation (Mathematics); Least squares; Data mining; Data mining--Statistical method. |