Data representation scheme and similarity measures for a comprehensive computational chemistry database

Staveley, Mark Sinclair (2009) Data representation scheme and similarity measures for a comprehensive computational chemistry database. Doctoral (PhD) thesis, Memorial University of Newfoundland.

PDF (English) - Accepted Version
Available under License - The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

Abstract

This thesis draws upon research in the areas of information retrieval, chemical informatics, and computational chemistry.

Many research initiatives deal with very large amounts of data, and as a result information retrieval systems are becoming increasingly necessary. Chemically based information retrieval systems are of particular interest to computational chemists, who not only produce large quantities of data but also use large quantities of computer processing power (CPU cycles).

Currently, no tools available through any of the Canadian High-Performance Computing consortia have been designed and implemented to support the data management activities of computational chemists. The only electronic resources that are publicly available contain information that has been obtained either experimentally or through patent and publication searches.

A system named Chem-DRSM has been designed and implemented to support the structuring and browsing of computational chemistry data. It has been implemented using principles and methods associated with various chemically based data representation schemes and similarity measures. This thesis presents and discusses the design, implementation, and evaluation of the Chem-DRSM system. The similarity measures in Chem-DRSM were evaluated using statistical information (precision and recall statistics), the distribution of similarity scores with test structures, and a human study involving subjects with an expert level of knowledge in chemistry.

Chem-DRSM contains three similarity measures (the contextual cosine measure, the standard cosine measure, and the Tanimoto measure), all of which have been adapted to use specialized chemical topological descriptors called Chemical Atom Topological Indices (CATI). The evaluation compares the performance of these measures not only with each other but also with a version of the Tanimoto measure that uses chemical fingerprints (which is considered to be an industry standard).

The statistical evaluation showed that the standard cosine measure had a higher average precision (with a lower standard deviation) than the other measures, including the Tanimoto measure with chemical fingerprints. In the evaluation of the distribution of similarity scores, the standard cosine measure assessed the similarity of chemical structures with the most granularity. This granularity is attributed, in part, to its use of statistical weighting information about the descriptors found within chemical structures. In contrast, the Tanimoto measure with chemical fingerprints considers only the presence or absence of properties when distinguishing chemical structures. Furthermore, the standard cosine measure identified more similar structures (as classified by the human study participants) than the Tanimoto measure.

Together, these evaluation results show that the standard cosine measure, which uses the CATI descriptors, defines a chemical information context for searching and browsing that is more appropriate than the context created by the Tanimoto measure with chemical fingerprints.
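The contrast the abstract draws between a weighted cosine measure and a presence/absence Tanimoto measure can be illustrated with a minimal sketch. This is not the Chem-DRSM implementation: the descriptor names (`d1`, `d2`) and their weights are hypothetical, the functions use the textbook definitions of each measure, and neither the thesis's CATI weighting scheme nor its contextual cosine measure is reproduced here.

```python
import math


def tanimoto(a: set, b: set) -> float:
    """Textbook Tanimoto coefficient on sets of features that are present."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(a & b) / len(a | b)


def cosine(u: dict, v: dict) -> float:
    """Textbook cosine similarity on sparse weighted descriptor vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical descriptor weights (illustrative only, not CATI values).
query = {"d1": 3.0, "d2": 1.0}
cand_a = {"d1": 3.0, "d2": 1.0}  # same descriptors, same weights
cand_b = {"d1": 1.0, "d2": 3.0}  # same descriptors, different weights

for name, cand in (("cand_a", cand_a), ("cand_b", cand_b)):
    t = tanimoto(set(query), set(cand))  # looks only at which descriptors occur
    c = cosine(query, cand)              # also looks at how strongly they are weighted
    print(f"{name}: tanimoto={t:.2f} cosine={c:.2f}")
```

In this toy case both candidates contain the same descriptors as the query, so the presence/absence Tanimoto score cannot tell them apart, while the weighted cosine score ranks `cand_a` above `cand_b`. That is one way a weighted measure can produce more finely graded similarity scores.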

Item Type: Thesis (Doctoral (PhD))
URI: http://research.library.mun.ca/id/eprint/11471
Item ID: 11471
Additional Information: Includes bibliographical references (leaves 148-156).
Department(s): Science, Faculty of > Computer Science
Date: 2009
Date Type: Submission
Library of Congress Subject Heading: Cheminformatics; Chemistry--Databases; Information storage and retrieval systems--Chemistry.
