A Study on Compositional Features of Graphs

Author: Prathyush Sambaturu
Date: 2017-05-15
Report no: IIIT/TH/2017/26
Advisor:Kamalakar Karlapalem,Kishore Kothapalli

Abstract

Many of the real world networks can be represented as graphs. One of the most important problems related to the graphs is to find similar graphs. Due to a large number of properties each graph can possess, it is often difficult to find a similarity measure that captures all the properties of graphs. In this thesis, we discuss a notion of graph similarity based on the composition. The composition of a graph is the set of its important subgraphs and their frequency of occurrence in the original graph. Existing measures for graphs or trees do not capture the idea of the compositions. Our major contributions are directed towards extraction of compositional features of the graphs and trees. To find compositional features we need to find efficient algorithms to enumerate the frequencies of the subgraph templates in the graph. In this work, we present four contributions: (i) Grammars to recognize and enumerate subtree templates, (ii) idea of compositional similarity, (iii) Clique Conversion Coefficient, and (iv) an effective pre-processing step for listing Maximal Cliques. We designed grammars to enumerate the sub-tree templates in ordered trees. These grammars represent a tree using a string of length O(n), where n is the number of nodes in the tree. The frequencies of sub-tree templates are deduced using an algorithm. The results show that with good design of a grammar that can encode the structure of trees and a compatible algorithm assist in efficient enumeration of such compositional features. In this work, we introduce a measure for graphs, Clique Conversion Coefficient(CCC), which captures the clique forming tendency of nodes in a graph. CCC can be used either as a weighted average of a sequence of values or the sequence of values itself. We define the measure with three variants: global, local, and average. CCC is closely related to both Clustering Coefficient and Density of a graph as in they are all designed to find how well the nodes in a graph group together. But as per our experimental results, CCC, in some cases, provided additional information about a graph, which is the measure of the extent to which the nodes in the graph tend to form cliques. Also, we motivate the need for CCC by clustering some real world graphs with CCC as a feature in combination with other features. Our results show that graphs with similar clique forming tendencies are clustered together with the use of CCC as a feature, while without CCC this kind of grouping is not necessarily possible. This kind of Clustering would have many applications in the areas like Social Network Analysis, Protein-Protein Interaction Analysis, etc., where the existence of Cliques plays an important role. In this thesis, we also discussthe clustering of ego networks of large YOUTUBE network using the sequence of values in CCC as features. The quality of these results is verified by contrasting the frequent subgraphs in each cluster obtained. This experiment highlights the utility of CCC in clustering subgraphs of a large graph based on their clique forming tendencies. Also, we designed an algorithm to list all the maximal cliques. Our contribution in this algorithm is the pre-processing step, where we break the graph into biconnected components and prune the nodes having a lower degree and do these two steps iteratively in this order. This method experimentally showed to be very effective for breaking the workload of listing maximal cliques in many sparse graphs.

Full thesis: pdf

Centre for Data Engineering

IIIT Hyderabad Publications

A Study on Compositional Features of Graphs

Abstract