Visual saliency prediction based on deep learning

Ghariba, Bashir (2020) Visual saliency prediction based on deep learning. Doctoral (PhD) thesis, Memorial University of Newfoundland.

[img] [English] PDF - Accepted Version
Available under License - The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

Download (2MB)

Abstract

The Human Visual System (HVS) has the ability to focus on specific parts of a scene, rather than the whole image. Human eye movement is also one of the primary functions used in our daily lives that helps us understand our surroundings. This phenomenon is one of the most active research topics in the computer vision and neuroscience fields. The outcomes that have been achieved by neural network methods in a variety of tasks have highlighted their ability to predict visual saliency. In particular, deep learning models have been used for visual saliency prediction. In this thesis, a deep learning method based on a transfer learning strategy is proposed (Chapter 2), wherein visual features in the convolutional layers are extracted from raw images to predict visual saliency (e.g., saliency map). Specifically, the proposed model uses the VGG-16 network (i.e., Pre-trained CNN model) for semantic segmentation. The proposed model is applied to several datasets, including TORONTO, MIT300, MIT1003, and DUT-OMRON, to illustrate its efficiency. The results of the proposed model are then quantitatively and qualitatively compared to classic and state-of-the-art deep learning models. In Chapter 3, I specifically investigate the performance of five state-of-the-art deep neural networks (VGG-16, ResNet-50, Xception, InceptionResNet-v2, and MobileNet-v2) for the task of visual saliency prediction. Five deep learning models were trained over the SALICON dataset and used to predict visual saliency maps using four standard datasets, namely TORONTO, MIT300, MIT1003, and DUT-OMRON. The results indicate that the ResNet-50 model outperforms the other four and provides a visual saliency map that is very close to human performance. In Chapter 4, a novel deep learning model based on a Fully Convolutional Network (FCN) architecture is proposed. The proposed model is trained in an end-to-end style and designed to predict visual saliency. The model is based on the encoder-decoder structure and includes two types of modules. The first has three stages of inception modules to improve multi-scale derivation and enhance contextual information. The second module includes one stage of the residual module to provide a more accurate recovery of information and to simplify optimization. The entire proposed model is fully trained from scratch to extract distinguishing features and to use a data augmentation technique to create variations in the images. The proposed model is evaluated using several benchmark datasets, including MIT300, MIT1003, TORONTO, and DUT-OMRON. The quantitative and qualitative experiment analyses demonstrate that the proposed model achieves superior performance for predicting visual saliency. In Chapter 5, I study the possibility of using deep learning techniques for Salient Object Detection (SOD) because this work is slightly related to the problem of Visual saliency prediction. Therefore, in this work, the capability of ten well-known pre-trained models for semantic segmentation, including FCNs, VGGs, ResNets, MobileNet-v2, Xception, and InceptionResNet-v2, are investigated. These models have been trained over an ImageNet dataset, fine-tuned on a MSRA-10K dataset, and evaluated using other public datasets, such as ECSSD, MSRA-B, DUTS, and THUR15k. The results illustrate the superiority of ResNet50 and ResNet18, which have Mean Absolute Errors (MAE) of approximately 0.93 and 0.92, respectively, compared to other well-known FCN models. Finally, conclusions are drawn, and possible future works are discussed in chapter 6.

Item Type: Thesis (Doctoral (PhD))
URI: http://research.library.mun.ca/id/eprint/14812
Item ID: 14812
Additional Information: Includes bibliographical references.
Keywords: Computer Vision, Visual Saliency, Semantic Segmentation, Deep Learning, CNNs
Department(s): Engineering and Applied Science, Faculty of
Date: October 2020
Date Type: Submission
Digital Object Identifier (DOI): https://doi.org/10.48336/bzj8-n176
Library of Congress Subject Heading: Visual discrimination--Computer simulation; Transfer learning (Machine learning).

Actions (login required)

View Item View Item

Downloads

Downloads per month over the past year

View more statistics