Investigating Usage of Text Segmentation and Inter-passage Similarities to Improve Text Document Clustering

Authors: Shashank Paliwal, Vikram Pudi
Conference: 8th International Conference on Machine Learning and Data Mining (MLDM 2012)

Date: 2012-07-16
Report no: IIIT/TR/2012/133

Abstract

Measuring inter-document similarity is one of the most essential steps in text document clustering. Traditional methods represent text documents using the simple Bag-of-Words (BOW) model. A document, however, is an organized structure consisting of various text segments or passages. Single-term analysis of this kind treats the whole document as a single semantic unit and thus ignores other semantic units such as sentences and passages. In this paper, we attempt to take advantage of the underlying subtopic structure of text documents and investigate whether clustering can be improved if the text segments of two documents are used when calculating the similarity between them. We concentrate on examining the effects of combining the suggested inter-document similarities (based on inter-passage similarities) with traditional inter-document similarities, following a simple approach to the combination. Experimental results on standard data sets suggest an improvement in the clustering of text documents.
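
The idea of combining a whole-document similarity with a passage-based one can be illustrated with a minimal sketch. The abstract does not give the paper's segmentation method, aggregation rule, or weighting, so everything below is an assumption made for illustration: documents arrive already split into passages, passage scores are aggregated by a symmetrized best-match average, and the two similarities are mixed linearly with a hypothetical weight `alpha`.

```python
import math
from collections import Counter


def bow(text):
    """Term-frequency vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def passage_similarity(passages_a, passages_b):
    """Assumed aggregation (not taken from the paper): for each passage,
    take its best cosine match in the other document, average those
    scores, and symmetrize over both directions."""
    def one_way(ps, qs):
        if not ps or not qs:
            return 0.0
        best = [max(cosine(bow(p), bow(q)) for q in qs) for p in ps]
        return sum(best) / len(ps)

    return 0.5 * (one_way(passages_a, passages_b)
                  + one_way(passages_b, passages_a))


def combined_similarity(passages_a, passages_b, alpha=0.5):
    """Linear mix of whole-document BOW similarity and passage-based
    similarity. `alpha` is a hypothetical mixing weight."""
    doc_sim = cosine(bow(" ".join(passages_a)), bow(" ".join(passages_b)))
    seg_sim = passage_similarity(passages_a, passages_b)
    return alpha * doc_sim + (1 - alpha) * seg_sim
```

With `alpha=1.0` the score reduces to the traditional BOW cosine, and with `alpha=0.0` it depends only on passage-level matches. The resulting pairwise scores could then be passed to any clustering algorithm that accepts a similarity matrix.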


Centre for Data Engineering

IIIT Hyderabad Publications
