IIIT Hyderabad Publications
Articulatory Gesture Rich Representation Learning of Phonological Units in Low Resource Settings

Authors: Brij Mohan Lal, Manish Shrivastava
Conference: 4th International Conference on Statistical Language and Speech Processing (SLSP 2016)
Location: Pilsen, Czech Republic
Date: 2016-10-11
Report no: IIIT/TR/2016/54

Abstract: Recent literature presents evidence that both linguistic (phonemic) and non-linguistic (speaker identity, emotional content) information resides on a lower-dimensional manifold richly embedded inside higher-dimensional spectral features such as MFCC and PLP. Linguistic or phonetic units of speech can be broken down into a legal inventory of articulatory gestures shared across several phonemes based on their manner of articulation. We intend to discover a subspace that is rich in gestural information of speech and captures the invariance of similar gestures. In this paper, we investigate the unsupervised techniques best suited for learning such a subspace. The main contribution of the paper is an approach to learn a gesture-rich representation of speech automatically from data in a completely unsupervised manner. This study compares the representations obtained through a convolutional autoencoder (ConvAE) with those from standard unsupervised dimensionality reduction techniques such as manifold learning and Principal Component Analysis (PCA), using the task of phoneme classification. Manifold learning techniques such as Locally Linear Embedding (LLE), Isomap, and Laplacian Eigenmaps are evaluated in this study. The representations that best separate different gestures are suitable for discovering subword units under low- or zero-resource speech conditions. Further, we evaluate the representations using the Zero Resource Speech Challenge's ABX discriminability measure.
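The comparison described above — PCA and manifold learning applied to spectral frames, scored by phoneme classification — can be sketched as follows. This is an illustrative sketch, not the authors' code: synthetic Gaussian "frames" stand in for MFCC features, and the dimensionalities, neighbour counts, and classifier are assumptions chosen for demonstration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# 300 synthetic "frames" of 39-dim features (MFCC stand-in),
# three pseudo-phoneme classes with shifted means
X = np.concatenate([rng.normal(loc=c, scale=1.0, size=(100, 39))
                    for c in (0.0, 1.2, 2.4)])
y = np.repeat([0, 1, 2], 100)

# Unsupervised embeddings to compare (10 target dims is an arbitrary choice)
embeddings = {
    "PCA": PCA(n_components=10).fit_transform(X),
    "Isomap": Isomap(n_neighbors=10, n_components=10).fit_transform(X),
    "LLE": LocallyLinearEmbedding(n_neighbors=10,
                                  n_components=10).fit_transform(X),
}

# Phoneme classification accuracy as the comparison metric
for name, Z in embeddings.items():
    acc = cross_val_score(KNeighborsClassifier(5), Z, y, cv=3).mean()
    print(f"{name}: {acc:.2f}")
```

On real speech, the frames would come from MFCC extraction over labeled utterances, and the embedding that best preserves class (gesture) separation under the classifier is the one of interest.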
Results indicate that the representations obtained through ConvAE and Isomap outperform baseline MFCC features on the phoneme classification task as well as on the ABX measure, and induce separation between sounds composed of different sets of gestures. We further cluster the representations using a Dirichlet Process Gaussian Mixture Model (DPGMM) to automatically learn the cluster distribution of the data, and show that these clusters correspond to groups with a similar manner of articulation. The DPGMM distribution is used as a prior to obtain correspondence terms for robust ConvAE training.

Full paper: pdf

Centre for Language Technologies Research Centre