IIIT Hyderabad Publications
Wikipedia for Clustering Multilingual Documents

Authors: N Kiran Kumar, Santosh GSK, Vasudeva Varma
Conference: 16th International Conference on Applications of Natural Language to Information Systems (NLDB 2011)
Location: Universidad de Alicante, Alicante, Spain
Date: 2011-06-28
Report no: IIIT/TR/2011/18

Abstract

This paper presents Multilingual Document Clustering (MDC) using Wikipedia on comparable corpora. In particular, we use the cross-lingual links, categories, outlinks, and Infobox information present in Wikipedia to enrich the document representation. We use the Bisecting k-means algorithm to cluster multilingual documents based on document similarities. Experiments are conducted using the English and Hindi Wikipedia. We use the English and Hindi datasets provided by FIRE'10 for the Ad-hoc Cross-Lingual document retrieval task on Indian languages. No language-specific tools are used, which makes the proposed approach easy to extend to other languages. The system is evaluated using F-score and Purity measures, and the results obtained are encouraging.

Full paper: pdf

Centre for Search and Information Extraction Lab
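
The abstract names Purity and F-score as the evaluation measures but does not define them. The following is a minimal sketch, assuming the commonly used clustering definitions: purity as the fraction of documents that belong to the majority class of their cluster, and F-score as the class-size-weighted best F1 over clusters. The paper may use a different variant; the function names `purity` and `clustering_f_score` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def _contingency(cluster_ids, class_labels):
    """Count, per cluster, how many documents of each gold class it holds."""
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one document is required")
    table = {}
    for cluster, label in zip(cluster_ids, class_labels):
        table.setdefault(cluster, Counter())[label] += 1
    return table


def purity(cluster_ids, class_labels):
    """Fraction of documents that fall in the majority class of their cluster."""
    table = _contingency(cluster_ids, class_labels)
    majority_total = sum(max(counts.values()) for counts in table.values())
    return majority_total / len(cluster_ids)


def clustering_f_score(cluster_ids, class_labels):
    """Class-size-weighted best F1 over clusters (an assumed, common variant)."""
    table = _contingency(cluster_ids, class_labels)
    class_sizes = Counter(class_labels)
    total = len(class_labels)
    score = 0.0
    for label, class_size in class_sizes.items():
        best = 0.0
        for counts in table.values():
            overlap = counts.get(label, 0)
            if overlap == 0:
                continue
            precision = overlap / sum(counts.values())
            recall = overlap / class_size
            best = max(best, 2 * precision * recall / (precision + recall))
        score += (class_size / total) * best
    return score


# Toy example: six documents, two predicted clusters, two gold topics.
example_clusters = [0, 0, 0, 1, 1, 1]
example_labels = ["sports", "sports", "politics",
                  "politics", "politics", "sports"]
example_purity = purity(example_clusters, example_labels)
example_f = clustering_f_score(example_clusters, example_labels)
```

Both measures reach 1.0 only when every cluster contains documents from exactly one class and every class is contained in a single cluster. Purity alone rewards over-splitting, which is one reason it is typically reported alongside an F-score.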