IIIT Hyderabad Publications
RBNBC: Repeat Based Naive Bayes Classifier for Biological Sequences

Authors: Pratibha Rani, Vikram Pudi
Date: 2008-09-27
Report no: IIIT/TR/2008/126

Abstract

In this paper, we present RBNBC, a Repeat Based Naive Bayes Classifier of bio-sequences that uses maximal frequent subsequences as features. The design of RBNBC is based on generic ideas that can apply to other domains where the data is organized as collections of sequences. Specifically, RBNBC uses a novel formulation of Naive Bayes that incorporates repeated occurrences of subsequences within each sequence. Our extensive experiments on two collections of protein families show that RBNBC performs as well as existing state-of-the-art probabilistic classifiers for bio-sequences. This is surprising as RBNBC is a pure data mining based generic classifier that does not require domain-specific background knowledge such as multiple alignment, data transformation and complex feature extraction methods. We note that such domain-specific ideas could further increase the performance of RBNBC.

Full report: pdf
Centre for Data Engineering
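The abstract does not spell out the RBNBC formulation, so the following is only a rough illustration of the general idea of a Naive Bayes classifier that counts repeated occurrences of subsequences within each sequence. It is not the authors' method: it swaps in fixed-length substrings (k-mers) for the maximal frequent subsequences the report uses, and applies a standard multinomial Naive Bayes with add-one smoothing. All function names, the parameter `k`, and the toy sequences and family labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def kmer_counts(seq, k=2):
    """Count every overlapping length-k substring, keeping repeats.

    Assumption: fixed-length k-mers stand in for the maximal frequent
    subsequences used as features in the report.
    """
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def train(labeled_seqs, k=2):
    """Build per-class feature counts from (sequence, family) pairs."""
    class_docs = Counter()                 # number of sequences per family
    feature_counts = defaultdict(Counter)  # family -> k-mer -> total count
    vocab = set()
    for seq, label in labeled_seqs:
        class_docs[label] += 1
        counts = kmer_counts(seq, k)
        feature_counts[label].update(counts)
        vocab.update(counts)
    return class_docs, feature_counts, vocab


def classify(seq, model, k=2):
    """Return the family with the highest log-posterior score."""
    class_docs, feature_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        # Add-one smoothing over the shared vocabulary.
        denom = sum(feature_counts[label].values()) + len(vocab)
        for kmer, n in kmer_counts(seq, k).items():
            if kmer not in vocab:
                continue  # ignore features never seen in training
            # A feature repeated n times contributes n times to the score.
            score += n * math.log((feature_counts[label][kmer] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: made-up strings and labels, not real protein families.
toy_training = [
    ("ABABAB", "fam1"),
    ("ABABCA", "fam1"),
    ("CDCDCD", "fam2"),
    ("CDCDAC", "fam2"),
]
toy_model = train(toy_training)
print(classify("ABAB", toy_model))
print(classify("DCDC", toy_model))
```

Because each feature's log-probability is multiplied by its count in the query sequence, a subsequence that repeats many times weighs more heavily than one that appears once, which is the intuition the abstract points at. The actual feature mining and probability estimates in the report may differ substantially from this sketch.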
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.