IIIT Hyderabad Publications
Towards Building Automatic Speech Recognition Systems in the Indian Context using Deep Learning

Author: Sai Ganesh Mirishkar (2018802003)
Date: 2023-11-24
Report no: IIIT/TH/2023/162
Advisor: Anil Kumar Vuppala

Abstract

Automatic Speech Recognition (ASR) systems are increasingly prevalent in daily life, with commercial applications such as Siri, Alexa, and Google Assistant. However, these systems have focused largely on English, leaving a considerable portion of non-English speakers underserved. This is particularly evident in India, a linguistically diverse country where many languages are classified as low-resource for ASR owing to the scarcity of annotated speech data. This thesis aims to bridge that gap by enhancing ASR systems for Indian languages using deep learning methodologies.

India is a land of linguistic diversity: approximately 2,000 languages are spoken across the country, of which 23 are officially recognized, and very few of these have ASR support. Building an ASR system requires thousands of hours of annotated speech data, a large volume of text, and a lexicon spanning the vocabulary of the language. A comprehensive presence in India's diverse markets therefore demands multilingual ASR systems, which must commonly be built in low-resource settings. The linguistic landscape is further complicated by the high prevalence of bilingualism in the Indian population, which leads to frequent code-switching and lexical borrowing between languages; operating ASR systems that can handle code-switching in the Indian context is a considerable challenge. This predicament has driven our research towards constructing a large corpus for one language and leveraging its phonetic space across other language families in both monolingual and multilingual ASR scenarios.

This thesis adopts a crowd-sourcing strategy to collect an extensive speech corpus for Telugu. Using this approach, approximately 2,000 hours of Telugu speech data were collected, capturing regional variation in three modes (read, conversational, and spontaneous) under varied background conditions. This data served as a foundation for developing and evaluating neural network architectures tailored to the characteristics of Indian languages. We also explored self-supervised learning to understand and enhance learned representations, fine-tuning them to suit different language families and data sizes. This approach yielded insights into the shared phonetic space among Indian languages and enabled a multilingual ASR system built on a joint acoustic model. These studies mark a significant stride towards overcoming the challenges of multilingualism in the Indian context, setting a path towards more inclusive and effective ASR systems.

The research findings presented in this thesis not only contribute towards building efficient and accurate ASR systems for low-resource Indian languages but also underscore the power of deep learning approaches in language technology. We hope this work will motivate and aid further research in this direction, promoting linguistic diversity and broadening access to information and communication technologies for speakers of low-resource languages.
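To make the fine-tuning step concrete, the sketch below shows one common way to adapt a multilingual self-supervised speech encoder to a low-resource target language with a CTC objective, in the spirit of the approach summarized above. It is a minimal illustration, not the thesis's actual setup: the checkpoint name (facebook/wav2vec2-xls-r-300m), the toy Telugu character vocabulary, and the dummy one-second utterance are all assumptions made for demonstration.

# Minimal sketch: fine-tuning a self-supervised encoder for low-resource ASR.
# Checkpoint, vocabulary, and data below are illustrative assumptions.
import json
import torch
from transformers import (
    Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor, Wav2Vec2ForCTC,
)

# Hypothetical character-level vocabulary for Telugu; a real system would
# derive this from the transcripts of the collected corpus.
vocab = {"[PAD]": 0, "[UNK]": 1, "|": 2, "అ": 3, "ఆ": 4, "క": 5, "మ": 6}
with open("vocab.json", "w") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]",
    word_delimiter_token="|",
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(
    feature_extractor=feature_extractor, tokenizer=tokenizer,
)

# Load a multilingual self-supervised encoder and attach a fresh CTC head
# sized to the target-language vocabulary.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",  # assumed pretrained checkpoint
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(vocab),
)
model.freeze_feature_encoder()  # keep low-level convolutional features fixed

# One illustrative training step on a dummy 1-second utterance.
audio = torch.randn(16000)  # stand-in for a real 16 kHz recording
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer("అమ", return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
loss = model(inputs.input_values, labels=labels).loss  # CTC loss
loss.backward()
optimizer.step()
print(f"CTC loss: {loss.item():.3f}")

Freezing the convolutional feature encoder and training only the transformer layers and the new CTC head is a widely used recipe when the fine-tuning set is small, since the low-level acoustic features learned during self-supervised pretraining tend to transfer well across languages.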
Full thesis: pdf

Centre for Language Technologies Research Centre