IIIT Hyderabad Publications |
|||||||||
|
Models and Algorithms for Mining Subgraph Coverage Patterns and Graph Transactional Coverage PatternsAuthor: Srinivas Reddy Annappalli Date: 2022-12-22 Report no: IIIT/TH/2022/159 Advisor:P Krishna Reddy AbstractThe process of data mining discovers interesting knowledge from large amounts of data. Given a large amount of data, we have several data mining software systems for summarizing/characterizing, extracting frequent patterns and association rules, classification, and clustering tasks. The data mining based systems are employed to improve the performance of e-commerce systems, marketing, search systems, banking, healthcare, etc. Investigating data mining approaches to extract new knowledge from the given data is one of the active research areas. Graph model is widely employed to model the real world. For example, a chemical compound, a protein-ligand complex, and an XML document can be modeled as a graph or a graph transaction. In the literature, the topic of extracting the knowledge of frequent subgraphs from the given Graph Transactional Dataset (GTD) has been extensively studied. It has been shown that such knowledge could help in the drug discovery process. So far, in the literature, the issue of extracting coverage-related knowledge from GTD has not yet been explored. In this thesis, we propose new data mining models and approaches to extract two types of knowledge patterns from GTDs. Motivated by the issues in the process of drug discovery, as the first contribution, we propose the model and approach to extract the coverage-related knowledge of potential subgraph patterns from the given GTD. A subgraph pattern is a set of subgraphs. We consider that a subgraph covers the graph transaction if it belongs to that graph transaction. A subgraph pattern is considered as Subgraph Coverage Pattern (SCP) if the ratio of the union of the graph transactions covered by the subgraphs of the subgraph pattern to the size of GTD is not less than a user-given threshold value. We have defined three metrics to characterize an SCP. First, the relative frequency metric ensures that each subgraph of an SCP should appear in the threshold number of graph transactions. Second, the coverage support metric ensures that the union of graph transactions covered by each subgraph of SCP should satisfy the threshold percentage of GTD. Third, the overlap ratio metric ensures that the overlap ratio among the graph transactions covered by individual subgraphs of SCP satisfies overlap ratio constraint. Given a GTD and the user-given threshold values of relative frequency, coverage support and overlap ratio, the problem is to extract all SCPs from the given GTD. A straightforward approach requires the extraction of all possible subgraphs from GTD and generation of all possible subgraph patterns. In addition, for each subgraph pattern, it requires expensive graph-based computation of relative frequency, coverage support and overlap ratio to identify all SCPs of GTD. In this model, we propose the notions of the potential subgraphs, flat transactional modeling, and employ an overlap ratio-based candidate pruning heuristic for efficiently extracting SCPs from GTD. As the second contribution, with the motivation to improve the process of drug cocktail discovery, we have proposed the model and approach to extract the coverage-related knowledge of graph transactional patterns from the given GTD. A graph transactional pattern is a set of graph transactions. A drug compound can be modeled as a graph transaction, and its fragments can be modeled as subgraphs. A set of all fragments extracted from GTD is considered as a fragment set. We consider that a graph transaction covers a fragment if the fragment belongs to that graph transaction. A graph transactional pattern is considered as a Graph Transactional Coverage Pattern (GTCP) if the ratio of the union of fragments covered by each of its graph transactions to the size of the fragment set satisfies the coverage constraint. We have defined three metrics to characterize GTCP. First, the graph transactional coverage metric ensures that each graph of GTCP should cover at least threshold percentage of fragments in the fragment set. Second, the graph transactional pattern coverage metric ensures that the union of fragments covered by each graph transaction of GTCP should satisfy the coverage constraint. Third, the overlap ratio metric ensures that the overlap ratio among the fragments covered by individual graph transactions of GTCP should satisfy the overlap ratio constraint. Given a GTD and the threshold values of graph transactional coverage, graph transactional pattern coverage, and overlap ratio, the problem is to extract all GTCPs from given GTD. A straightforward approach involves extracting all of the possible fragments from GTD and generation of all possible graph transactional patterns, which increase exponentially as the size of GTD increases. In addition, for each candidate graph transactional pattern, requires expensive graph-based computation of graph transactional pattern coverage and overlap ratio for extracting all GTCPs from GTD. In this work, we propose the notions of the potential fragment, flat transactional modeling, and employ an overlap ratio-based candidate pruning technique for efficiently extracting GTCPs from a given GTD. The experiments on the GTDs formed from PubChem chemical compound dataset demonstrates the feasibility of extracting the knowledge of SCPs and GTCPs in terms of processing time. We have also shown the application of SCPs in drug discovery process and GTCPs in drug cocktail discovery process. Notably, the proposed models are general models and can be extended to any GTDs of any domain. We hope that the research encourages further research and leads to the development of decision support systems to improve performance in the domains such as computer-aided drug design, social networks and so on. Full thesis: pdf Centre for Data Engineering |
||||||||
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved. |