Promotech: a universal tool for promoter detection in bacterial genomes

Chevez-Guardado, Ruben (2020) Promotech: a universal tool for promoter detection in bacterial genomes. Masters thesis, Memorial University of Newfoundland.

Available under License - The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.


Abstract

A promoter is a genomic sequence where the transcription machinery binds to start copying a gene into an RNA molecule. Locating bacterial promoter sequences is essential in microbiology because promoters play a central role in regulating gene expression. Several tools exist to recognize promoters in bacterial genomes; however, most of them were trained on data from a single bacterium or a specific set of sigma factors. Promotech was developed to overcome this limitation: it is a machine-learning-based classifier trained to produce a model that generalizes and detects promoters across a wide range of bacterial species. Two model architectures were tested during the study, Random Forest and recurrent networks. The Random Forest model, trained on promoter sequences with a binary-encoded representation of each nucleotide, achieved the highest performance across nine different bacteria and could process both short 40 bp sequences and entire bacterial genomes using a sliding window. The selected model was evaluated on a validation set of four bacteria not used during training, composed of 50% positive and 50% negative promoter sequences, and obtained an average AUPRC of 0.73±0.13 and an average AUROC of 0.71±0.13. Across the entire genomes of the validation set, the Random Forest model achieved an average AUPRC of 0.14±0.1 and an average AUROC of 0.71±0.17; its performance increased to 0.75±0.18 AUPRC and 0.90±0.06 AUROC when it was configured to detect promoter clusters. Promotech was compared against state-of-the-art bacterial promoter detection programs on the balanced data set and outperformed them.
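The binary encoding and sliding-window scan mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the 4-bit one-hot code, the all-zero code for ambiguous bases, the step size of 1, and the names `encode` and `sliding_windows` are assumptions made for illustration. Only the 40 bp window length comes from the abstract.

```python
from typing import Iterator, List, Tuple

# Assumed 4-bit one-hot code per nucleotide; the thesis's exact scheme may differ.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
UNKNOWN = (0, 0, 0, 0)  # assumption: ambiguous bases such as N map to all zeros

WINDOW = 40  # window length in bp, as stated in the abstract


def encode(seq: str) -> List[int]:
    """Flatten a nucleotide sequence into a binary feature vector (4 bits per base)."""
    bits: List[int] = []
    for base in seq.upper():
        bits.extend(ONE_HOT.get(base, UNKNOWN))
    return bits


def sliding_windows(
    genome: str, size: int = WINDOW, step: int = 1
) -> Iterator[Tuple[int, str]]:
    """Yield (start, subsequence) for every full window of `size` bases."""
    for start in range(0, len(genome) - size + 1, step):
        yield start, genome[start:start + size]


if __name__ == "__main__":
    genome = "ACGT" * 12  # toy 48 bp sequence, not real data
    features = [encode(window) for _, window in sliding_windows(genome)]
    print(len(features), len(features[0]))
```

Each 40 bp window becomes a 160-element binary vector under this assumed encoding. Such vectors could then be scored one window at a time by a trained classifier, for example the `predict_proba` method of scikit-learn's `RandomForestClassifier`, although the abstract does not say which library was used.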

Item Type: Thesis (Masters)
URI: http://research.library.mun.ca/id/eprint/14767
Item ID: 14767
Additional Information: Includes bibliographical references (pages 60-65).
Keywords: Promoter, Machine Learning, Bacterial Genome, Promoter Detection, Bioinformatics
Department(s): Science, Faculty of > Computer Science
Date: September 2020
Date Type: Submission
Digital Object Identifier (DOI): https://doi.org/10.48336/e15b-s520
Library of Congress Subject Heading: Promoters (Genetics); Machine learning--Scientific applications; Bacterial genomes.
