IIIT Hyderabad Publications
Salient Features for Multilingual Speech Recognition in Indian Scenario

Author: Hari Vydana
Date: 2019-05-25
Report no: IIIT/TH/2019/102
Advisor: Anil Kumar Vuppala

Abstract

Automatic Speech Recognition (ASR) systems have witnessed significant progress in the past decade. In high-resource scenarios such as English, ASR systems have achieved performance approaching human parity on specific tasks. Speech recognition for Indian languages is far less studied than for high-resource languages like English. India is a developing country with a large emerging market for speech recognition technology, and developing Indian-language ASR systems requires addressing certain challenges that are innate to Indian languages. India is a multilingual society with 23 official languages. To penetrate deep into Indian markets, ASR systems that can operate in multiple languages need to be developed. Collecting data from a multilingual environment is a much more tedious task than acquiring data from a monolingual environment, so Indian-language ASR systems often have to be developed for low-resource scenarios. Beyond this multilingual nature, bilingualism is very prevalent in the Indian population, which leads to frequent code-switching and word borrowing between languages. Operating parallel ASR systems with code-switching capabilities in the Indian scenario is a huge challenge. This motivated us to work towards multilingual ASR systems that can handle code-mixing and word borrowing efficiently. In this thesis, we address various issues related to the development of ASR systems for Indian scenarios. An integrated ASR system is developed using a common phone set, which can efficiently handle multilingual code-mixed speech. Acoustic modeling approaches such as HMM-GMM, HMM-SGMM and RNN-CTC have been studied to find the most suitable acoustic model.
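The common phone-set idea can be illustrated by pooling per-language pronunciation lexicons into a single lexicon over shared phone labels, so one joint acoustic model sees consistent targets across languages. This is a minimal sketch; the words, phone labels, and mapping below are illustrative assumptions, not the thesis's actual lexicons.

```python
# Illustrative (hypothetical) mapping from language-specific phone labels
# to shared common-phone-set labels; the thesis's real mapping differs.
COMMON_PHONE_MAP = {
    "te_aa": "aa", "hi_aa": "aa",   # long /a:/ shared by Telugu and Hindi
    "te_k":  "k",  "hi_k":  "k",
    "te_i":  "i",  "hi_i":  "i",
}

def pool_lexicons(*lexicons):
    """Merge per-language lexicons {word: [lang-specific phones]} into one
    lexicon whose pronunciations use the common phone set."""
    pooled = {}
    for lex in lexicons:
        for word, phones in lex.items():
            pooled[word] = [COMMON_PHONE_MAP[p] for p in phones]
    return pooled

# Toy monolingual lexicons (assumed words, for illustration only).
telugu = {"aakali": ["te_aa", "te_k", "te_aa"]}
hindi  = {"kitaab": ["hi_k", "hi_i", "hi_aa"]}

lexicon = pool_lexicons(telugu, hindi)
# lexicon["kitaab"] -> ["k", "i", "aa"]
```

With all languages expressed over one phone inventory, a single acoustic model can be trained on the pooled data, which is what makes code-mixed utterances tractable without switching between parallel monolingual systems.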
Various acoustic modeling units, such as context-independent phones, context-dependent phones and syllables, are explored to find the most suitable unit for developing joint acoustic models. Residual connections are explored to improve the performance of the joint acoustic models. Studies directed towards supplementing the conventional features with articulatory features are presented for developing multilingual ASR systems. Fricative landmarks are detected, and the detected landmarks are used as features to improve the performance of the multilingual ASR system. Distinctive features of speech are modeled using a statistical approach, and their relevance for improving the performance of a multilingual ASR system is explored. In a low-resource scenario where data is pooled from multiple sources, meta-level information about the speaker is not accessible; a speaker normalization approach to handle such scenarios is explored.

Some major conclusions from the work are:

• Using a common phone set for training joint acoustic models offers an attractive solution for developing ASR systems that handle multilingual and code-mixed scenarios.
• Acoustic models built on context-independent phones have performed better than those built on context-dependent tri-phone units. The RNN-CTC based joint acoustic model has performed better than the HMM-SGMM model.
• Using residual networks to develop the joint acoustic models has stabilized training, and has improved the performance of the acoustic model when the model is sufficiently deep.
• Fricative landmarks, when fused with the input features, have improved ASR performance when the dataset is small; with relatively larger datasets the performance has not improved significantly.
• Using distinctive features predicted by DNNs along with the input features has not significantly improved the performance of the multilingual ASR system.
• When meta-information about the speaker ID is not available, speaker codes derived from a speaker-ID network can be used to improve the performance of a multilingual ASR system.

Full thesis: pdf

Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.