REBMEC: Repeat Based Maximum Entropy Classifier for Biological Sequences

Authors: Pratibha Rani, Vikram Pudi
Conference: In Proc. of Intl. Conf. on Management of Data (COMAD-08), Mumbai, India.

Date: 2008-12-18
Report no: IIIT/TR/2008/173

Abstract

An important problem in biological data analysis is to predict the family of a newly discovered sequence, such as a protein or DNA sequence, using the collection of available sequences. In this paper, we tackle this problem and present REBMEC, a Repeat Based Maximum Entropy Classifier of biological sequences. Maximum entropy models are known to be theoretically robust and to yield high accuracy, but they are slow. This makes them useful as benchmarks to evaluate other classifiers. Specifically, REBMEC is based on the classical Generalized Iterative Scaling (GIS) algorithm and incorporates repeated occurrences of subsequences within each sequence. REBMEC uses maximal frequent subsequences as features but can support other types of features as well. Our extensive experiments on two collections of protein families show that REBMEC performs as well as existing state-of-the-art probabilistic classifiers for biological sequences without using domain-specific background knowledge such as multiple alignment, data transformation and complex feature extraction methods. The design of REBMEC is based on generic ideas that can apply to other domains where data is organized as collections of sequences.
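To make the two ideas named in the abstract concrete, the following is a minimal sketch of a generic conditional maximum entropy classifier trained with GIS, using within-sequence repeat counts of subsequences as feature values. It is not the REBMEC implementation described in the paper: the function names, the toy sequences, the family labels and the hand-picked subsequences are all assumptions made for illustration, and no mining of maximal frequent subsequences is performed.

```python
import math

def count_occurrences(seq, sub):
    """Count occurrences of sub in seq, allowing overlaps (the repeat count)."""
    count, start = 0, seq.find(sub)
    while start != -1:
        count += 1
        start = seq.find(sub, start + 1)
    return count

def featurize(seq, motifs):
    """Feature vector: repeat count of each (hypothetical) motif in seq."""
    return [count_occurrences(seq, m) for m in motifs]

def predict_proba(weights, counts, classes):
    """p(class | sequence) proportional to exp(sum_k weight[class][k] * count_k)."""
    scores = {c: sum(w * n for w, n in zip(weights[c], counts)) for c in classes}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def train_gis(data, motifs, iterations=200):
    """Generic GIS training loop (a sketch, not the paper's exact procedure)."""
    classes = sorted({label for _, label in data})
    examples = [(featurize(seq, motifs), label) for seq, label in data]
    n = len(examples)
    # GIS constant: an upper bound on the total feature mass of any example.
    C = max(sum(counts) for counts, _ in examples) or 1

    # Empirical expectation of each (class, motif) feature.
    empirical = {c: [0.0] * len(motifs) for c in classes}
    for counts, label in examples:
        for k, v in enumerate(counts):
            empirical[label][k] += v / n

    weights = {c: [0.0] * len(motifs) for c in classes}
    for _ in range(iterations):
        # Expectation of each feature under the current model.
        expected = {c: [0.0] * len(motifs) for c in classes}
        for counts, _ in examples:
            probs = predict_proba(weights, counts, classes)
            for c in classes:
                for k, v in enumerate(counts):
                    expected[c][k] += probs[c] * v / n
        # GIS update. Assumption: features never seen with a class are left
        # at weight 0 instead of being driven toward minus infinity.
        for c in classes:
            for k in range(len(motifs)):
                if empirical[c][k] > 0 and expected[c][k] > 0:
                    weights[c][k] += math.log(empirical[c][k] / expected[c][k]) / C
    return weights, classes

# Toy data with made-up sequences and family labels.
toy_data = [("ABABAB", "fam1"), ("ABAB", "fam1"),
            ("CDCDCD", "fam2"), ("CDCD", "fam2")]
toy_motifs = ["AB", "CD"]
toy_weights, toy_classes = train_gis(toy_data, toy_motifs)
toy_probs = predict_proba(toy_weights, featurize("ABABCD", toy_motifs), toy_classes)
```

In this sketch a query such as "ABABCD" is scored by how many times each motif repeats, so a sequence with more "AB" repeats than "CD" repeats leans toward the family whose training sequences repeat "AB". The repeated pass over all training examples in every iteration also illustrates why iterative scaling is slow on large collections.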

Full paper: pdf

Centre for Data Engineering

IIIT Hyderabad Publications
