IIIT Hyderabad Publications |
|||||||||
|
MapReduce based Approaches to Extract Correlated Patterns and Weighted Frequent PatternsAuthor: Amulya Kotani Date: 2021-11-25 Report no: IIIT/TH/2021/111 Advisor:P Krishna Reddy AbstractWith rapid advancements in technology, the volume of data generated keeps increasing at an incredible pace. Data mining is the process of discovering novel and potentially useful information from large amounts of data. Currently, data mining techniques are being employed by organizations to improve performance by extracting useful knowledge from large datasets. Pattern mining is one of the important data mining tasks. The pattern mining techniques are being applied in applications such as social media, e-commerce, advertising and so on. In the literature, models and algorithms have been proposed for mining different kinds of patterns such as frequent patterns, correlated patterns, utility patterns, weighted frequent patterns and coverage patterns. There is an ongoing research to develop efficient parallel algorithms to extract patterns from large datasets by employing parallel processing frameworks such as shared memory, shared nothing, GPUs, Grid computing and MapReduce. In this thesis, we proposed parallel algorithms by employing MapReduce framework to extract correlated patterns and weighted frequent patterns. Notably, MapReduce framework is a programming model that enables distributed processing of huge amounts of data on a very large number of commodity machines in geographically distributed environments in a reliable and fault-tolerant manner. Once an algorithm is developed on a MapReduce framework, it can be easily scaled to process large datasets by employing a large number of machines in a shared-nothing environment. The correlated pattern mining involves the extraction of patterns from transactional databases subject to the user specified minimum support and minimum all-confidence constraints. Typically, for any parallel algorithm that runs on multiple machines, the data is segmented into multiple partitions and distributed among the machines. This segmentation influences the data shuffle among the machines and runtime of the algorithm. We introduce two novel data segmentation techniques - database segmentation and transaction segmentation to discover the correlated patterns efficiently. The former technique involves segmenting the database into multiple sub-databases such that each sub-database can be mined independently. The latter technique involves segmenting a transaction into multiple sub-transactions such that each sub-transaction can be processed as an individual transaction. We employ these data segmentation techniques and leverage MapReduce framework to propose Parallel Correlated Patterngrowth algorithm.The proposed segmentation techniques are algorithm agnostic, and therefore, can be incorporated into any parallel correlated pattern mining algorithm. The process of weighted frequent itemset (WFI) mining involves the extraction of itemsets from a given transactional database subject to the user defined minimum support, minimum weight and min imum weighted support constraints. It has the following two challenges: (i) finding WFIs is a computationally expensive process as these itemsets do not satisfy the downward closure property and (ii) there are only few computationally efficient algorithms to find WFIs in very large databases. Firstly, we propose Sequential Weighted Frequent Pattern-growth algorithm to discover WFIs efficiently. It employs three novel pruning techniques to reduce the computational cost effectively. We also leverage the MapReduce framework to achieve further computational optimization and propose Parallel Weighted Frequent Pattern-growth algorithm. Overall, we propose MapReduce based approaches to extract correlated patterns and weighted frequent patterns from large transactional databases. We demonstrate that the proposed approaches are highly scalable, memory and runtime efficient through results of experiments conducted on both realworld and synthetic datasets. Full thesis: pdf Centre for Others |
||||||||
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved. |