IIIT Hyderabad Publications
RBNBC: Repeat Based Naive Bayes Classifier for Biological Sequences

Authors: Pratibha Rani, Vikram Pudi
Conference: In Proc. of Intl. Conf. on Data Mining (ICDM-08), Pisa, Italy.
Date: 2008-12-15
Report no: IIIT/TR/2008/172

Abstract

In this paper, we present RBNBC, a Repeat Based Naive Bayes Classifier of bio-sequences that uses maximal frequent subsequences as features. RBNBC's design is based on generic ideas that can apply to other domains where the data is organized as collections of sequences. Specifically, RBNBC uses a novel formulation of Naive Bayes that incorporates repeated occurrences of subsequences within each sequence. Our extensive experiments on two collections of protein families show that it performs as well as existing state-of-the-art probabilistic classifiers for bio-sequences. This is surprising as it is a pure data mining based generic classifier that does not use domain-specific background knowledge. We note that domain-specific ideas could further increase its performance.

Centre for Data Engineering
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.