N-gram methods of analyzing DNA sequence

Chilaka, Charles (2015) N-gram methods of analyzing DNA sequence. Masters thesis, Memorial University of Newfoundland.

[English] PDF - Accepted Version
Available under License - The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.



A DNA sequencing microarray experiment produces a 4 x N data matrix, comprising the signal strength of experimental DNA / reference oligonucleotide binding for each of four possible bases (A, C, G, T) at each of N positions. The highest of the four signals at each position is expected to result from the perfect match, and hence be the correct base call. Variation in absolute and relative signal strength may interfere with the reliability of base calling. Variable base composition of the reference oligonucleotides influences this in ways that are not fully understood. I used an n-gram representation of oligonucleotides in a neural network analysis to predict normalized signal intensities from an Affymetrix DNA sequencing experiment. For a DNA oligonucleotide, an n-gram can take on 4^n values, e.g., a 1-gram uses the frequencies of each base, a 2-gram the di-nucleotide frequencies, and so on. Neural networks use a variable number of neurons in the hidden layer to create a correspondence between an input data set (divided into Training, Validation, and Test sets) and an output target set. I used a data set reduced from a sequence of 15,392 bases to 594 lines. I used 1- and 2-grams and their composite, with 20 to 40 neurons in the hidden layers of the neural network. For all models, the base with a normalized value of 1.0 (that with the highest absolute value) was predicted 100% of the time. For 1-grams compared with the composite, regression values increased from 0.9898 to 0.9918, and measures of performance plots improved (decreased) from 3.195 x 10⁻³ to 2.525 x 10⁻³. For a 1- and 2-gram composition with 30 neurons in the hidden layer of the neural network, the diagonal of the confusion matrix had a high percentage value of 99.8%. Receiver Operating Characteristic (ROC) curves showed points in the upper-left corner, a sign of a good test.
The analysis suggests that neural network analysis of oligonucleotides as higher-order n-grams could be used to predict intensities in an Affymetrix or any other DNA sequencing microarray experiment.
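
To illustrate the n-gram representation described in the abstract, the following Python sketch turns an oligonucleotide into 1-gram and 2-gram frequency vectors and concatenates them into a composite. It is not taken from the thesis: the function names, the lexicographic ordering of n-grams, the use of overlapping windows, and the example sequence are all assumptions made for illustration, and the thesis may define or scale its features differently.

```python
from itertools import product

BASES = "ACGT"  # the four possible bases


def ngram_frequencies(oligo, n):
    """Relative frequencies of all 4**n possible n-grams in `oligo`.

    Assumption: n-grams are counted over overlapping windows and the
    output is ordered lexicographically (AA, AC, AG, AT, CA, ...).
    """
    keys = ["".join(p) for p in product(BASES, repeat=n)]
    counts = dict.fromkeys(keys, 0)
    total = len(oligo) - n + 1  # number of overlapping windows
    if total <= 0:
        return [0.0] * len(keys)
    for i in range(total):
        gram = oligo[i:i + n]
        if gram in counts:  # windows with non-ACGT symbols are not counted
            counts[gram] += 1
    return [counts[k] / total for k in keys]


def composite_features(oligo):
    """Concatenate 1-gram (4 values) and 2-gram (16 values) frequencies."""
    return ngram_frequencies(oligo, 1) + ngram_frequencies(oligo, 2)


# Hypothetical 10-base oligonucleotide, used only as an example input.
features = composite_features("ACGTACGTAC")
print(len(features))
```

A vector of this kind (4 values for 1-grams, 16 for 2-grams, 20 for the composite) is the sort of input a neural network could map to normalized signal intensities; the network architecture itself is not sketched here.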

Item Type: Thesis (Masters)
URI: http://research.library.mun.ca/id/eprint/12439
Item ID: 12439
Additional Information: Includes bibliographical references (pages 147-164).
Keywords: N-gram, Neural networks, Performance plots, Regression values, Confusion matrix
Department(s): Science, Faculty of > Computational Science
Date: June 2015
Date Type: Submission
Library of Congress Subject Heading: Nucleotide sequence--Mathematical models; DNA microarrays--Mathematical models; Neural networks (Computer science)--Data processing
