Automated Feature Construction and Selection

Author: saket.maheshwary
Date: 2019-05-01
Report no: IIIT/TH/2019/31
Advisor: Vikram Pudi

Abstract

The effectiveness of machine learning techniques depends strongly on the choice of data representation and the features available. In recent years, the importance of feature engineering has been confirmed by the exceptional performance of deep learning techniques, which automate this task for some applications. For others, feature engineering requires substantial manual effort in designing and selecting features, and is often tedious and non-scalable. An AI system must fundamentally understand the world around us, which is possible only if it can learn to identify and disentangle the underlying implicit patterns present in the observed low-level sensory data. Feature learning presents a way to take advantage of human ingenuity and prior knowledge to compensate for that weakness. To make it possible to apply machine learning to different domains, it is very important to make learning algorithms less dependent on manual feature engineering, so that novel AI-based applications can be developed faster.

This thesis is about automated feature learning, i.e., learning representations of the data that make it easier to extract useful information when building classifiers or other predictors. A good representation is one that is useful as input to a supervised predictor. Among the various ways of learning representations, this thesis focuses on automating the task of feature learning without any domain knowledge via regression models. Our proposed algorithm is a scalable, regression-based feature learning algorithm. Being data-driven, it requires no domain knowledge and is applicable to any dataset having numeric attributes. Such a generic representation is learnt by mining pairwise feature associations, identifying the linear or non-linear relationship between each pair, applying regularized regression, and selecting those relationships that are stable.

Our experimental evaluation on benchmark UC Irvine and gene expression datasets across different domains provides evidence that the features generated through our learning model can improve the prediction accuracy significantly for different classifiers without using any domain knowledge.
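
The pipeline outlined in the abstract (pairwise associations, regularized regression, stability-based selection) can be illustrated with a minimal sketch. It is not the thesis's implementation: the association measure (absolute Pearson correlation), the closed-form ridge fit, the bootstrap stability test, the use of regression residuals as new features, and all names and thresholds (`ridge_fit`, `is_stable`, `construct_features`, `min_corr`, `tol`) are assumptions made for illustration. Only the linear case is shown; the non-linear relationships mentioned in the abstract are omitted.

```python
import numpy as np


def ridge_fit(x, y, alpha=1.0):
    """Closed-form ridge regression of y on one feature x, with an unpenalized intercept."""
    xc, yc = x - x.mean(), y - y.mean()
    slope = (xc @ yc) / (xc @ xc + alpha)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept


def is_stable(x, y, rng, n_boot=30, tol=0.25, alpha=1.0):
    """Assumed stability test: the ridge slope varies little across bootstrap resamples."""
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        slopes.append(ridge_fit(x[idx], y[idx], alpha)[0])
    slopes = np.asarray(slopes)
    mean = abs(slopes.mean())
    return mean > 1e-12 and slopes.std() / mean < tol


def construct_features(X, min_corr=0.3, alpha=1.0, seed=0):
    """Append residual features for associated, stable ordered pairs (i, j).

    Assumes X is a 2-D float array whose columns are non-constant.
    Returns the augmented matrix and the list of pairs that were kept.
    """
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    new_cols, pairs = [], []
    for i in range(n_features):
        for j in range(n_features):
            if i == j:
                continue
            xi, xj = X[:, i], X[:, j]
            r = np.corrcoef(xi, xj)[0, 1]
            # Step 1 (assumed measure): keep only associated pairs.
            if not abs(r) >= min_corr:
                continue
            # Step 3 (assumed criterion): keep only stable relationships.
            if not is_stable(xi, xj, rng, alpha=alpha):
                continue
            # Step 2: regularized regression of feature j on feature i.
            slope, intercept = ridge_fit(xi, xj, alpha)
            # Assumed construction: the part of x_j not explained by x_i.
            new_cols.append(xj - (slope * xi + intercept))
            pairs.append((i, j))
    if not new_cols:
        return X, pairs
    return np.column_stack([X] + new_cols), pairs
```

Calling `construct_features(X)` on a numeric matrix returns the original columns followed by one residual column per retained pair, which could then be passed to any classifier. In this sketch the thresholds `min_corr` and `tol` control how many pairs survive and would need tuning per dataset.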

Full thesis: pdf

Centre for Others

IIIT Hyderabad Publications
