IIIT Hyderabad Publications
Design of a Phonetically Balanced Code-Mixed Hindi-English Read Speech Corpus for Automatic Speech Recognition

Author: Ayushi Pandey
Date: 2018-01-22
Report no: IIIT/TH/2018/5
Advisor: Suryakanth V Gangashetty

Abstract

Code-mixing is a frequently encountered phenomenon in day-to-day natural language communication in multilingual and bilingual communities. The phenomenon is so widespread that it is often considered a distinct, emerging variety of the language. Computational modeling of code-mixing and code-switching has assumed particular relevance with the rise of social media. However, computational studies of both textual and speech processing of code-mixing suffer from a serious disadvantage: the lack of data. In this thesis, we present the development of a Phonetically Balanced read speech corpus of Code-Mixed Hindi-English (the PBCM corpus). A Large Code-Mixed corpus (the LCM corpus) has been extracted from selected sections of two widely read Hindi newspapers. These sections (namely Sports, Technology and Lifestyle) contain frequent English insertions embedded within the matrix of Hindi sentences. Phonetic balance in the corpus has been established by selecting sentences that contain triphones lower in frequency than a predefined threshold. The assumption behind the compulsory inclusion of such rare units is that the high-frequency triphones will inevitably be included [66]. Using this metric, the Pearson's correlation coefficient between the phonetically balanced corpus and a large code-mixed reference corpus was recorded to be 0.996. Statistics on the phone and triphone distributions are presented to graphically display the phonetic likeness between the large reference corpus (the LCM corpus) and the corpus sampled through our method. A pilot PBCM corpus (PBCM-Phase I) has been recorded by four volunteer Hindi speakers.
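The selection criterion described above — keeping sentences that contribute at least one triphone whose corpus-wide frequency falls below a threshold — can be sketched as follows. This is a minimal illustration of the idea, not the thesis's actual implementation; the function names and threshold value are illustrative:

```python
from collections import Counter

def triphones(phones):
    """All consecutive phone triples in a phone sequence."""
    return [tuple(phones[i:i + 3]) for i in range(len(phones) - 2)]

def select_balanced(sentences, threshold):
    """Keep sentences containing at least one rare triphone.

    `sentences` is a list of phone sequences; a triphone counts as
    'rare' if its corpus-wide frequency is below `threshold`.
    """
    counts = Counter(t for s in sentences for t in triphones(s))
    return [s for s in sentences
            if any(counts[t] < threshold for t in triphones(s))]
```

Because every selected sentence also carries its surrounding common triphones, the high-frequency units are covered as a side effect, which is the assumption the corpus design relies on.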
As a second contribution of this thesis, we present a baseline automatic speech recognition (ASR) system for code-mixed read speech in Hindi-English, developed by extrapolating monolingual training resources. Two distinct monolingual acoustic models (Hindi and English) have been implemented to train a neural network and subspace GMM based speech recognition framework. Testing has been performed on the PBCM-Phase I corpus. We examine the combination of a code-mixed trigram language model with each of the acoustic models described above. We contrast the code-mixed language model with a monolingual language model, and see drastic improvements in accuracy. The Hindi monolingual acoustic model, in combination with the code-mixed language model, is our best-performing experimental setup (word error rate: 41.22%).

Full thesis: pdf

Centre for Language Technologies Research Centre
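The word error rate reported in the abstract is the standard ASR evaluation metric: the word-level Levenshtein distance between reference and hypothesis transcripts, divided by the reference length. A minimal sketch (not taken from the thesis):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between ref[:i] and hyp[:j],
    # updated row by row (single-row dynamic programming).
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            substitution = prev_diag + (ref[i - 1] != hyp[j - 1])
            d[j] = min(d[j] + 1,      # deletion
                       d[j - 1] + 1,  # insertion
                       substitution)
            prev_diag = cur
    return d[-1] / len(ref)
```

A WER of 41.22% thus means that roughly two of every five reference words required a substitution, insertion, or deletion to match the system's output.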