Improved Approaches to Discover Weighted and High Utility Itemsets

Author: Pradeep Pallikila
Date: 2022-02-10
Report no: IIIT/TH/2022/11
Advisor: P Krishna Reddy

Abstract

Extracting knowledge from large amounts of data has been an important research field for the past three decades. Data mining is one such area; it deals with the investigation of frameworks and algorithms to extract different kinds of knowledge from large databases. Pattern mining is a data mining task concerned with developing frameworks and algorithms for extracting interesting patterns (or itemsets) from large databases. In the literature, several pattern mining techniques have been employed to mine different kinds of patterns, such as frequent patterns, coverage patterns, periodic patterns, weighted patterns, and utility patterns. In this thesis, we investigate improved approaches for mining weighted and high utility itemsets. As a first contribution, we have proposed an improved approach to mine Weighted Frequent Itemsets (WFIs) from spatio-temporal databases. The WFI framework allows users to assign weights to the items in the database based on their importance, and discovers all itemsets whose weighted sum in a transactional database is no less than the user-specified minimum weighted sum (minWS) value. Existing works have focused on finding WFIs in a transactional database and have not considered the spatial characteristics of the items within the data. The spatial characteristics of an item refer to its geographical location. We have proposed an improved WFI Mining (WFIM) approach for spatio-temporal databases by introducing the notion of a neighborhood itemset, which is defined by considering the spatial characteristics of the items. An itemset is called a neighborhood itemset if the distance between each pair of its items is no more than the user-specified maximum distance (maxDist) value.
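The neighborhood itemset condition stated above can be illustrated with a minimal sketch. It assumes each item has a two-dimensional coordinate and that distance is Euclidean; the abstract does not specify the distance function, and the names `is_neighborhood_itemset`, `locations`, and `max_dist` are hypothetical, chosen only for illustration.

```python
from itertools import combinations
from math import dist


def is_neighborhood_itemset(itemset, locations, max_dist):
    """Return True if every pair of items lies within max_dist of each other.

    Assumption: `locations` maps each item to an (x, y) coordinate and the
    distance is Euclidean. The thesis may use a different distance function.
    """
    return all(
        dist(locations[a], locations[b]) <= max_dist
        for a, b in combinations(sorted(itemset), 2)
    )


# Hypothetical item locations, for illustration only.
locations = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (10.0, 0.0)}

near = is_neighborhood_itemset({"a", "b"}, locations, max_dist=5.0)
far = is_neighborhood_itemset({"a", "b", "c"}, locations, max_dist=5.0)
```

Here `{"a", "b"}` qualifies because the two items are exactly 5.0 apart, while adding `"c"` breaks the condition because `"c"` is farther than 5.0 from both other items.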
By adding the notion of a neighborhood itemset to WFIM, we have proposed a new model, which we call Weighted Frequent Neighborhood Itemset Mining (WFNIM), to mine Weighted Frequent Neighborhood Itemsets (WFNIs) from spatio-temporal data. The generated WFNIs do not satisfy the anti-monotonic property. To extract WFNIs efficiently, we have proposed new pruning measures that exploit the spatial nature of items to reduce the search space and the computational cost of finding the desired patterns. A new pattern-growth algorithm has been presented to find all WFNIs in a given spatio-temporal database. Experimental results demonstrate that the proposed algorithm is efficient. A case study on Japanese air pollution data is presented to demonstrate the usefulness of this model. As a second contribution, we have proposed an improved High Utility Itemset Mining (HUIM) approach. HUIM uses a utility measure to mine High Utility Itemsets (HUIs) from a transactional database. In HUIM, each item is associated with an internal utility (the frequency of the item within a transaction) and an external utility (the cost of the item). The utility of an item is defined as the product of its external and internal utilities. An itemset is called a high utility itemset if its utility is no less than the user-specified minimum utility (minUtil) value. HUIM, however, suffers from the low utility item problem. To overcome this problem, we introduce a new null-invariant measure, called utility ratio, to evaluate the interestingness of an itemset in the database. We have extended HUIM to propose a new model, Relative High Utility Itemset Mining (RHUIM), to mine Relative High Utility Itemsets (RHUIs) in transactional data. An itemset is called an RHUI if its utility is no less than the minUtil value and its utility ratio is no less than the user-specified minimum utility ratio (minUR) value. We have proposed two new pruning measures based on utility ratio to reduce the search space.
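The utility definitions above can be made concrete with a small sketch. It assumes the standard HUIM convention that the utility of an itemset in a database is the sum, over every transaction containing the whole itemset, of the utilities of its items; the abstract defines only the per-item product, so this aggregation is an assumption. The utility ratio is not defined in the abstract and is therefore not shown. The names `itemset_utility`, `database`, `external_utility`, and `min_util` are hypothetical.

```python
def itemset_utility(itemset, database, external_utility):
    """Sum the utility of `itemset` over all transactions that contain it.

    Assumption: each transaction maps an item to its internal utility
    (its quantity in that transaction), and `external_utility` maps an
    item to its unit cost. Item utility = internal * external.
    """
    items = set(itemset)
    total = 0
    for transaction in database:
        # Only transactions containing every item of the itemset contribute.
        if items.issubset(transaction):
            total += sum(transaction[i] * external_utility[i] for i in items)
    return total


# Hypothetical toy data, for illustration only.
database = [
    {"a": 2, "b": 1},
    {"a": 1, "c": 3},
    {"a": 1, "b": 2, "c": 1},
]
external_utility = {"a": 5, "b": 2, "c": 1}

min_util = 20
u_ab = itemset_utility({"a", "b"}, database, external_utility)
is_high_utility = u_ab >= min_util
```

In this toy database `{"a", "b"}` appears in the first and third transactions, contributing 12 and 9 respectively, so its utility is 21 and it passes a `min_util` of 20.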
We also present a fast algorithm to find all desired itemsets in a transactional database. Experimental results demonstrate that the proposed algorithm is efficient. A case study on real-world Yahoo! JAPAN retail data is presented to demonstrate the usefulness of this model. Overall, we have proposed improved approaches to discover weighted frequent neighborhood itemsets from spatio-temporal databases and relative high utility itemsets from transactional databases. We have demonstrated their efficiency through extensive experiments on real-world and synthetic databases, and their applicability through case studies in real-world scenarios.

Full thesis: pdf

Centre for Data Engineering

IIIT Hyderabad Publications
