IIIT Hyderabad Publications
Unsupervised spoken content mismatch detection for automatic data validation under Indian context for building HCI systems

Author: Nayan Anand (2021701014)
Date: 2024-06-14
Report no: IIIT/TH/2024/81
Advisor: Chiranjeevi Yarra

Abstract

This thesis explores the critical challenges associated with automatic spoken data validation in the complex multilingual and multicultural context of India, and provides solutions to them. Such validation is crucial for developing efficient human-computer interaction (HCI) systems such as automatic speech recognition (ASR) and text-to-speech synthesis (TTS). The diversity of linguistic backgrounds and the prevalence of non-native speakers create unique challenges in speech communication. These challenges are exacerbated by frequent mismatches between recorded speech and its reference text, referred to as misspoken utterances. To tackle some of these challenges, this work introduces novel unsupervised techniques for detecting spoken content mismatches. The developed methods leverage state-of-the-art self-supervised speech representation models such as Wav2Vec-2.0 and HuBERT, integrating them with Dynamic Time Warping (DTW) and its variants, the phone-level cost-maximised DTW approach (Ph-DTW) and the phone-level cost-maximised weighted DTW approach (Ph-WDTW), along with cross-attention mechanisms. The techniques are developed and tested on specially curated datasets such as IIITH MM2 Speech-Text and Indic TIMIT, which include a wide variety of phonetic and linguistic features reflective of India's language diversity. The proposed methodologies are rigorously evaluated for their effectiveness in improving the accuracy and efficiency of spoken data validation in an unsupervised manner. The results demonstrate significant advances in the automatic detection of mismatches, thereby enhancing the reliability of speech data for training sophisticated HCI systems. By reducing the reliance on labour-intensive manual validation, these approaches contribute significantly to the scalability of speech data processing. Overall, this thesis not only addresses a significant gap in the technological handling of spoken data validation but also lays a foundation for future research and development in speech technology applications within diverse linguistic landscapes. The implications of this work are broad, offering potential improvements in various data-intensive speech applications such as ASR, TTS, and computer-aided language learning (CALL) systems, to name a few. This would be achieved by ensuring readily accessible clean training, testing, and validation sets for developing target models for the aforementioned use cases, thus addressing the scarcity of reliable data to a great extent.

Full thesis: pdf
Centre: Language Technologies Research Centre
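For readers unfamiliar with the kind of pipeline the abstract describes, the minimal sketch below illustrates one way a DTW-based mismatch score over Wav2Vec-2.0 features could be computed. It is not the thesis implementation: the model checkpoint, the use of a reference rendition of the prompt (rather than the reference text itself), the cosine-distance frame cost, and the file names are all illustrative assumptions.

```python
# Minimal sketch (not the thesis method): score a recording against a reference
# rendition of the same prompt by DTW over Wav2Vec-2.0 frame embeddings.
# A high length-normalised alignment cost suggests a spoken content mismatch.
import numpy as np
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model


def wav2vec_features(path, extractor, model):
    """Return frame-level Wav2Vec-2.0 embeddings (T x D) for a mono 16 kHz wav."""
    wav, sr = torchaudio.load(path)
    wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, T, D)
    return hidden.squeeze(0).numpy()


def dtw_cost(a, b):
    """Length-normalised cosine-distance cost along the optimal DTW path."""
    # Pairwise cosine distances between the two feature sequences.
    dist = 1.0 - (a @ b.T) / (
        np.linalg.norm(a, axis=1, keepdims=True) * np.linalg.norm(b, axis=1) + 1e-8
    )
    T, U = dist.shape
    acc = np.full((T + 1, U + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, U + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    return acc[T, U] / (T + U)


if __name__ == "__main__":
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
    # Hypothetical file names: a trusted rendition of the prompt and the recording to validate.
    ref = wav2vec_features("reference_utterance.wav", extractor, model)
    rec = wav2vec_features("recorded_utterance.wav", extractor, model)
    print(f"DTW mismatch score: {dtw_cost(ref, rec):.4f}")
```

The phone-level variants described in the thesis (Ph-DTW, Ph-WDTW) and the cross-attention mechanisms operate on finer, phone-aware costs and learned alignments; the sketch above only conveys the basic idea of scoring content agreement via alignment cost over self-supervised speech representations.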