DISC: Data-Intensive Similarity Measure for Categorical Data

Authors: Aditya Desai, Himanshu Singh, Vikram Pudi
Conference: Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2011)

Date: 2011-05-21
Report no: IIIT/TR/2011/24

Abstract

The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification, regression and search are major data mining techniques that compute similarities between instances, so the choice of similarity measure can be a major cause of an algorithm's success or failure. The notion of similarity or distance is not as straightforward for categorical data as it is for continuous data, and this is a major challenge: the values taken by a categorical attribute are not inherently ordered, so two categorical values cannot be compared directly. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. In this paper we present a new similarity measure for categorical data, DISC (Data-Intensive Similarity Measure for Categorical Data). DISC captures the semantics of the data without any help from a domain expert in defining the similarity. It is also generic and simple to implement. These desirable features make it a very attractive alternative to existing approaches. Our experimental study compares it with 14 other similarity measures on 24 standard real datasets, of which 12 are used for classification and 12 for regression, and shows that it is more accurate than all its competitors.
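To make the difficulty with unordered categorical values concrete, the sketch below implements the simple overlap (matching) measure, a common baseline for categorical data. It is not DISC, whose definition is not given in this abstract. The function name `overlap_similarity` and the sample records are hypothetical and chosen only for illustration.

```python
def overlap_similarity(a, b):
    """Fraction of attributes on which two categorical records agree.

    Baseline overlap measure only; this is NOT the DISC measure.
    """
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    if not a:
        return 0.0
    # Each attribute contributes 1 on an exact match and 0 otherwise.
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Hypothetical records with two categorical attributes: (colour, shape).
r1 = ("red", "circle")
r2 = ("orange", "circle")
r3 = ("blue", "circle")

print(overlap_similarity(r1, r2))  # 0.5
print(overlap_similarity(r1, r3))  # 0.5
```

Because the values have no inherent order, the overlap measure scores "red" against "orange" exactly as it scores "red" against "blue". A measure that derives similarity between distinct values from the data itself, as the abstract describes, aims to tell such pairs apart.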

Full paper: pdf

Centre for Data Engineering

IIIT Hyderabad Publications
