Mining Periodic Frequent Patterns using Period Summary and Map-Reduce

Author: Anirudh Alampally
Date: 2017-12-08
Report no: IIIT/TH/2017/99
Advisor:P Krishna Reddy

Abstract

The field of data mining has evolved to help organizations find interesting knowledge from databases. Under data mining, different knowledge extraction approaches like association rules, classification and clustering have been proposed. Association rules are generated by analyzing the transactional databases using the support and confidence interestingness measures to identify the most important relationships among the items. Discovering frequent patterns is a key step in association rule mining. A pattern is called a frequent pattern if it satisfies the user-defined threshold on minimum support. Periodic-frequent patterns come under the class of frequent patterns which capture the temporal dimension of a pattern. In this thesis, we made efforts to improve the performance of mining periodic-frequent patterns. Periodic-frequent patterns are an important class of regularities that exist in a transactional database. A pattern is called periodic-frequent if it satisfies the thresholds of minimum support and maximum periodicity. Here, support is the frequency and periodicity is the maximum inter-arrival time of the pattern. In the literature, an algorithm called periodic-frequent pattern growth (PF-growth) was proposed which extracts the complete set of periodic-frequent patterns from transactional databases. Here, the transactional database is initially compressed into a prefix-tree like structure called periodic-frequent pattern tree (PF-tree), which is recursively mined to extract the patterns. The PF-tree explicitly maintains the transaction identifier information of each transaction. This enables the calculation of periodicity, but nevertheless the PF-tree size increases. In recent times, the amount of data generated from e-Commerce and social-networking sites are very huge. Mining periodic-frequent patterns from these datasets require large amounts of main memory and faster processing power. We propose the following two approaches to overcome these issues: (i) mining periodic-frequent patterns using the notion of period summary and (ii) a Map-Reduce based approach to extract periodic-frequent patterns. In the first approach, a novel compression technique called period summary is introduced to reduce the amount of memory consumed to extract periodic-frequent patterns. It is based on the observation that the periodic-frequent patterns can also be extracted by storing period summaries instead of transaction identifiers in the tree structure. The performance is improved as the memory required to store the period summaries is significantly less than the memory required to store the list of transaction identifiers. The process of merging period summaries is also introduced to find the support and periodicity values of a candidate pattern. Apart from memory optimization, the time consumed has also been lowered. This is because, the linear time-complexity for finding a candidate pattern’s periodicity using a list of transaction identifiers has been reduced to a constant time operation using period summaries. Experimental results on three real-world datasets show that the proposed approach using period summary performs better than the existing approach both in terms of memory and time. Further in the second approach, a Map-Reduce based approach to extract periodic-frequent patterns is proposed. As the real-world datasets are very huge, extraction of periodic-frequent patterns on a single machine takes a lot of time and in some cases it becomes difficult because of the memory constraints. The proposed approach is scalable and utilizes multiple machines by considering Map-Reduce framework. It consists of two Map-Reduce phases. In the first phase, all one-sized periodic-frequent patterns are derived and in the second phase, independent trees are constructed on different machines. These independent trees are mined parallely on different machines to find the complete set of periodic-frequent patterns. Further, we proposed the notion of partition summary to reduce the amount of data shuffled among the machines, which improves the performance. Experiments on Apache Spark’s distributed environment show that the proposed approach speeds up with increase in the number of machines. Also, the notion of partition summary significantly reduces the amount of data shuffled among the machines. Overall, the process of mining periodic-frequent patterns from transactional databases has been made efficient, both in terms of memory and time. This paves a way for various e-Commerce and social-networking sites to use the knowledge of periodic-frequent patterns in the areas of customer relation management, inventory management, recommendation systems and so on.

Full thesis: pdf

Centre for Data Engineering

IIIT Hyderabad Publications

Mining Periodic Frequent Patterns using Period Summary and Map-Reduce

Abstract