Multilingual Phonetic Features for Indian Language Identification

Author: Tirusha Mandava
Date: 2020-02-06
Report no: IIIT/TH/2020/6
Advisor: Anil Kumar Vuppala

Abstract

Language identification (LID) is the task of identifying the language spoken in a given speech signal. An LID system acts as a front-end module for applications such as multilingual automatic speech recognition, multilingual dialogue systems, and voice service systems. In the Indian scenario, where almost every state has its own language and every language has several dialects, developing an LID system is crucial. In the literature, standard acoustic features such as Mel-frequency cepstral coefficients, shifted delta cepstral coefficients, and prosody (pitch, duration, and intonation) have been explored in the Indian context. These features cannot adequately discriminate between the languages, since most Indian languages share an overlapping set of phonemes. Despite this overlap, phonotactic constraints cause the characteristics of a particular sound unit to differ across languages, which motivates us to explore phonetic features. Most prior work used senone-based deep neural networks to extract phonetic features. We instead use a joint acoustic model (JAM), trained with a long short-term memory connectionist temporal classification (LSTM-CTC) network, to extract phonetic features. The JAM implicitly learns language-specific information without any prior knowledge of the language. Hence, we hypothesize that features extracted from the JAM discriminate the languages effectively. These features are referred to as CTC features. The temporal variations in the CTC features are modeled using an attention-based residual time-delay neural network (RES-TDNN) to predict the language ID. The attention mechanism aggregates frame-level features by selecting prominent frames through a parametrized attention layer.
The performance of the proposed LID framework (CTC features with an attention-based RES-TDNN) was evaluated on the IIITH-ILSC database, which consists of 22 official Indian languages and Indian English. The proposed network outperformed other state-of-the-art methods, achieving an equal error rate of 8.82%.
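
The attention-based aggregation of frame-level features mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `attention_pool`, the parameters `W`, `b`, and `v`, the dimensions, and the random inputs are all assumptions made for illustration, and the additive scoring form used here is one common choice among several.

```python
import numpy as np


def attention_pool(frames, W, b, v):
    """Aggregate frame-level features (T, D) into one utterance-level vector (D,).

    W (D, A), b (A,), and v (A,) are hypothetical attention parameters;
    in a real system they would be learned jointly with the classifier.
    """
    # Score each frame with a small parametrized layer: e_t = v^T tanh(W^T h_t + b)
    scores = np.tanh(frames @ W + b) @ v            # shape (T,)
    # Softmax over time, shifted by the maximum for numerical stability
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # non-negative, sums to 1
    # Weighted sum of frames: frames with higher scores contribute more
    pooled = weights @ frames                       # shape (D,)
    return pooled, weights


# Illustrative usage with random stand-ins for frame-level features
rng = np.random.default_rng(0)
T, D, A = 50, 8, 4                                  # frames, feature dim, attention dim (assumed)
frames = rng.standard_normal((T, D))
W = rng.standard_normal((D, A))
b = np.zeros(A)
v = rng.standard_normal(A)
pooled, weights = attention_pool(frames, W, b, v)
```

The pooled vector has a fixed size regardless of utterance length, so it can be passed to a classifier over language labels. If all scores are equal, the weights become uniform and the result reduces to plain average pooling.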

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
