IIIT Hyderabad Publications
Stress Transfer in Speech to Speech Machine Translation System

Author: Sai Akarsh C 2019111017
Date: 2024-06-26
Report no: IIIT/TH/2024/103
Advisor: Anil Kumar Vuppala

Abstract

The proliferation of online educational resources has revolutionized access to learning globally. Many companies have started investing in online education, because the prospect of receiving a good education without attending a top college is very appealing to the Indian population. However, English is the "lingua franca" of India, and most online and offline information is in English. India is a land of diverse languages, so content in English alone cannot serve the entire population. A practical solution is required to let students from different regions of India use online education to their advantage. Speech-to-Speech Machine Translation (SSMT) holds promise in bridging this gap by translating educational content automatically. This is possible due to recent advances in machine learning and speech processing and, most importantly, the availability of powerful hardware. In SSMT, only the contextual information is retained when speech is converted into text; information about the speaker and any prosodic elements is lost. The generated output speech is therefore monotonous and machine-like, lacking emotion and depth. Speech devoid of prosody makes the converted educational content less engaging and potentially hinders learning. There is a need for a system that can use information available in the source speech, such as prosody and emotion, and generate target-language speech that carries these prosodic cues and the emotional state. Despite some work in the literature, the unique challenges of incorporating prosody, especially stress or emphasis, in Indian language-based SSMT remain largely unexplored.
This thesis addresses the need for an SSMT system capable of transferring prosodic stress cues from Indian English to Hindi, especially in lecture-mode educational content. Stress matters in educational content because it draws the student's attention to a concept when the instructor emphasizes it. To preserve stress in the generated speech, a model that can detect stress in the source language must first be built. Because accurate stress-annotated datasets are very hard to come by, this work curates an open-source dataset of Indian English speech annotated with stressed regions. The dataset consists of lecture-mode educational content by various speakers, collected from an open-source online platform. Several acoustic, spectral and temporal features correlated with stress were extracted from the dataset to train different machine-learning models for predicting stressed regions in Indian English speech. This work also proposes Stress-Net, a deep neural network (DNN) that gives better performance and generalizability.

As most text-to-speech (TTS) systems do not accept stress as a conditioning input, this work proposes modifications to existing TTS architectures such as FastPitch and YourTTS to add stress using the source speech as a reference. Different components of the SSMT pipeline work together to generate stress cues that condition the TTS when generating speech. Stress cues specify which words are stressed and give a quantitative measure of how much each is stressed relative to the other words in the sentence. Stress is introduced from this information through variance modifiers, which allows the TTS to be trained on a generic neutral speech corpus without a TTS-specific stressed speech dataset.
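The idea of stress cues driving variance modifiers can be illustrated with a minimal sketch. It is not the thesis's implementation: the `WordProsody` type, the `relative_stress` and `apply_stress` functions, and the `threshold` and `strength` parameters are all hypothetical names chosen for illustration. The sketch assumes that per-word stress scores have already been mapped from the English source onto the Hindi target words, and that the TTS exposes per-word pitch, energy and duration predictions that can be scaled before synthesis.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WordProsody:
    """Per-word values a TTS variance predictor might output (hypothetical)."""
    word: str
    pitch_hz: float
    energy: float
    duration_s: float


def relative_stress(scores):
    """Express each word's stress score relative to the sentence mean."""
    avg = mean(scores)
    if avg <= 0:
        # No usable stress information: treat every word as neutral.
        return [1.0 for _ in scores]
    return [s / avg for s in scores]


def apply_stress(words, scores, threshold=1.2, strength=0.5):
    """Scale pitch, energy and duration of words whose relative stress
    exceeds `threshold`; all other words are left unchanged.

    `threshold` and `strength` are illustrative assumptions, not values
    taken from the thesis.
    """
    if len(words) != len(scores):
        raise ValueError("need exactly one stress score per word")
    modified = []
    for w, rel in zip(words, relative_stress(scores)):
        gain = 1.0 + strength * (rel - 1.0) if rel > threshold else 1.0
        modified.append(
            WordProsody(w.word, w.pitch_hz * gain, w.energy * gain, w.duration_s * gain)
        )
    return modified


if __name__ == "__main__":
    sentence = [
        WordProsody("yah", 200.0, 1.0, 0.20),
        WordProsody("bahut", 200.0, 1.0, 0.30),
        WordProsody("mahatvapurn", 200.0, 1.0, 0.60),
    ]
    # The third word carries most of the stress in this made-up example.
    for w in apply_stress(sentence, [1.0, 1.0, 4.0]):
        print(w)
```

Because the scaling is applied to predicted prosody values at synthesis time, a sketch like this needs no stressed training data, which matches the motivation given above for using variance modifiers with a neutral speech corpus.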
This work produces a comprehensive Indian English-Hindi SSMT system that translates educational content from Indian English to Hindi while preserving stress to enhance user engagement. It includes the curation of a stress-annotated Indian English dataset, the creation of stress detection models, the integration of these models into the SSMT pipeline, and the modification modules the TTS requires to add an acceptable level of stress. This paves the way for more effective online learning experiences in India's diverse linguistic landscape.

Full thesis: pdf

Centre for Language Technologies Research Centre
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.