IIIT Hyderabad Publications

Don't Use a Lot When Little Will Do: Genre Identification Using URLs

Authors: Nikhil Priyatam, Srinivasan Iyengar, Krish Perumal, Vasudeva Varma
Conference: 14th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2013)
Location: University of the Aegean, Samos, Greece
Date: 2013-03-24
Report no: IIIT/TR/2013/12

Abstract

The ever-increasing volume of data on the World Wide Web calls for the use of vertical search engines. Sandhan is one such search engine; it offers search in the tourism and health genres in more than 10 Indian languages. In this work we build a URL-based genre identification module for Sandhan. A direct application of this work is in building focused crawlers to gather Indian-language content. We conduct experiments on tourism and health web pages in the Hindi language, using three approaches: list-based, naive Bayes, and incremental naive Bayes. We evaluate these approaches against another web page classification algorithm built on the parsed text of manually labeled web pages. We find that the incremental naive Bayes approach outperforms the other two. In our experiments we work with different features: words, n-grams, and all-grams. Using n-gram features, we achieve classification accuracies of 0.858 and 0.873 for the tourism and health genres, respectively.

Full paper: pdf
Centre for Search and Information Extraction Lab
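
As a rough illustration of the URL-based naive Bayes idea described in the abstract, the sketch below classifies a URL using only n-grams drawn from the URL string. It is not the authors' implementation: the abstract does not give the feature definitions, the smoothing, or the incremental training scheme, so the character trigrams, the add-one (Laplace) smoothing, the `UrlNaiveBayes` class, the `url_ngrams` helper, and the example URLs are all assumptions made for illustration. The `update` method accepts one labeled URL at a time, which only loosely mirrors the idea of incremental training.

```python
import math
import re
from collections import Counter, defaultdict


def url_ngrams(url, n=3):
    """Split a URL into lowercase alphanumeric tokens and return their
    character n-grams (assumed feature choice; short tokens are kept whole)."""
    tokens = re.findall(r"[a-z0-9]+", url.lower())
    grams = []
    for tok in tokens:
        if len(tok) <= n:
            grams.append(tok)
        else:
            grams.extend(tok[i:i + n] for i in range(len(tok) - n + 1))
    return grams


class UrlNaiveBayes:
    """Hypothetical multinomial naive Bayes over URL n-grams."""

    def __init__(self, n=3):
        self.n = n
        self.class_counts = Counter()                # labeled URLs seen per genre
        self.feature_counts = defaultdict(Counter)   # n-gram counts per genre
        self.vocab = set()                           # all n-grams seen so far

    def update(self, url, label):
        """Add a single labeled URL to the model's counts."""
        grams = url_ngrams(url, self.n)
        self.class_counts[label] += 1
        self.feature_counts[label].update(grams)
        self.vocab.update(grams)

    def predict(self, url):
        """Return the genre with the highest log-posterior score."""
        grams = url_ngrams(url, self.n)
        total_urls = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, url_count in self.class_counts.items():
            counts = self.feature_counts[label]
            # Add-one smoothing (an assumption, not taken from the paper).
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(url_count / total_urls)
            for gram in grams:
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training URLs, for illustration only.
clf = UrlNaiveBayes(n=3)
clf.update("http://example.com/tourism/travel-guide-goa", "tourism")
clf.update("http://example.com/travel/hotels-in-jaipur", "tourism")
clf.update("http://example.com/health/diabetes-symptoms", "health")
clf.update("http://example.com/health/heart-disease-diet", "health")

print(clf.predict("http://example.org/travel/goa-hotels"))
print(clf.predict("http://example.org/health/diet-tips"))
```

Because the features come from the URL alone, a classifier of this kind can be applied before a page is fetched, which is what makes the approach attractive for a focused crawler.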

Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.