Modeling and thematic analysis of neighborhood structures in the web and hierarchical identification of web communities

Nargis, Isheeta (2007) Modeling and thematic analysis of neighborhood structures in the web and hierarchical identification of web communities. Masters thesis, Memorial University of Newfoundland.

[img] [English] PDF - Accepted Version
Available under License - The author retains copyright ownership and moral rights in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.

Download (4MB)

Abstract

The web graph represents the structure of the World Wide Web by denoting each web page as a vertex and each hyperlink as an arc. The motivating goal behind the research constituting this thesis is twofold - firstly, to model the local structure of the web graph, and secondly, to discover communities of related web pages. -- We study the concept of the neighborhood graph to model the local structure surrounding a particular web page. We analyze some structural and statistical properties of the neighborhood graphs and perform a comparison with the corresponding properties of the whole web graph. In several aspects these neighborhood graphs show a similar characteristic to the entire web graph. Both the indegree and outdegree distribution of a sufficiently large neighborhood graph follow the power law phenomenon. We perform a thematic analysis of the local structure of the Web by discovering authorities and hubs in neighborhood graphs, where the set of authorities and hubs gives a representative flavor on the theme of the page in question. We also analyze temporal evolution of Hyperlinked communities and identify Core communities in neighborhood graphs. -- We devise an algorithm to extract communities solely based on the topology of the Web. Central to our approach is the innovative idea of Iterative Cycle Contraction to discover Web communities comprised of related web pages. The intuition behind this algorithm is that if two pages link to each other then they are thematically related. Successive iterations yield a hierarchical structure of communities and allow us to define a similarity measure between two web pages in the same community by noting at which iteration their corresponding vertices are first grouped into a single vertex. We apply the algorithm to some focused subgraphs of the web graph and evaluate its effectiveness by performing an investigation into the theme(s) of the putative communities that it finds. We find that the algorithm is successful at identifying communities and distinguishing communities with varying thematic content in neighborhood graphs. An examination of the distribution of community sizes in a particular iteration of the algorithm reveals that for sufficiently large neighborhood graphs a power law is observed.

Item Type: Thesis (Masters)
URI: http://research.library.mun.ca/id/eprint/11123
Item ID: 11123
Additional Information: Includes bibliographical references (leaves 150-154).
Department(s): Science, Faculty of > Computer Science
Date: 2007
Date Type: Submission
Library of Congress Subject Heading: Data structures (Computer science); Graph algorithms; Web sites--Classification; Web usage mining; World Wide Web--Mathematical models; World Wide Web--Statistical methods.

Actions (login required)

View Item View Item

Downloads

Downloads per month over the past year

View more statistics