Chen, Tao (2009) Integrating unstructured data using property precedence. Masters thesis, Memorial University of Newfoundland.
- Accepted Version
Available under License - The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Data integration involves combining data from a variety of independent data sources to provide a unified view of data in these sources. One of the challenging problems in data integration is to reconcile the structural and semantic differences among data sources. Many approaches have been introduced to resolve the problem. However, most of these models have difficulties in handling data with less structure and varying granularity. This thesis focuses on developing a novel data integration approach for unstructured data. To identify properties from unstructured data, we adapt a probability model to identify multi-term properties. To address the granularity issue, we use the concept of Property Precedence. Unlike other approaches, Property Precedence does not require that data be class-based and takes 'property' as the basic semantic construct. Considering that unstructured data might contain properties that are not explicitly revealed by the description, we design a model that derives knowledge about a property from the instances known to possess the property. We evaluate this model and the results indicate that it is capable of inferring that an instance possesses a property when this information is not explicit in the data. We build a property precedence schema using the above model to help decide the existence of a property in the instance. We compare the results with property precedence schemas built by other approaches and demonstrate that our approach performs better than the others. Finally, we implement queries based on property precedence and show that these queries overcome the semantic gap between data sources and can retrieve relevant data that cannot be retrieved using other approaches.
Item Type: Thesis (Masters)
Additional Information: Includes bibliographical references (leaves 79-82)
Department(s): Science, Faculty of > Computer Science
Library of Congress Subject Heading: Data mining; Database management; Semantic computing