IIIT Hyderabad Publications
Learning Varied Feature Representations through Diverse Neural Architectures for Cross-Lingual Voice Conversion

Author: Saisirisha Rallabandi
Date: 2019-04-23
Report no: IIIT/TH/2019/29
Advisor: Suryakanth V Gangashetty

Abstract

How can machines be made to speak more like humans? With emerging technologies and computing abilities, once-prescient ideas for emulating human speech in machines have become achievable in recent years. We specifically discuss the artificial production of speech through Text-to-Speech (TTS) synthesis and Voice Conversion (VC) systems. TTS systems convert the linguistic information present in text into minimal sound units, which are then synthesized into speech using a vocoder. TTS systems are employed in many applications, such as public announcements, voice assistants in mobile phones, screen readers for the visually challenged, and language-learning systems. Incorporating speaker variation facilitates building a variety of synthetic voices, and VC technology enables such variation through artificial imitation of a desired target speaker. We have therefore explored VC frameworks for speaker conversion both with and without a parallel corpus. VC systems find applications mainly in emotion conversion, personalized TTS systems, Normal-to-Lombard speech conversion, and speech-to-singing conversion. In this thesis, we discuss artificial speech generation for expressive speech synthesis and text-dependent voice (speaker) conversion, with our major contributions in Cross-Lingual Voice Conversion (CLVC). TTS synthesis and VC involve the modeling of sequential data and thus rely heavily on contextual information. Accordingly, this thesis discusses various approaches to appropriate feature representation and sequence modeling for building the above-mentioned speech systems.
Our contributions are: a continuous representation of text for prosody modeling, and the integration of signal correlation as a sub-cost for phone-based concatenative speech synthesis. Furthermore, a lightweight approach to CLVC is presented. The key ideas of this thesis are:

• Sentence-level modeling of text provides the supra-segmental information required for predicting phone durations in Text-to-Speech synthesis. Hence, we have incorporated a Recurrent Neural Network based Language Model (RNNLM) for the continuous representation of sentences, in addition to matrix factorization for phones and words. This unique representation of text has been leveraged for expressive speech synthesis.

• Embedding signal-correlation information in the join cost facilitates an optimal selection of units for concatenative speech synthesis. Thus, cross-correlation was included in the join cost to further minimize the total cost when selecting units.

• A text-independent and transcription-free approach was proposed for CLVC. Stacking the source and target speakers' speech enables the projection of language- and speaker-dependent features onto a common space. An autoencoder is utilized as a parallel-corpus generator to mitigate the issues involved in CLVC. A traditional parallel VC is then carried out between the generated and natural speech in the target speaker's voice. Further, a cross-language conversion is conducted for the target language. We conclude that an error-reduction network can also be utilized to address the mapping of unaligned sequences.

Full thesis: pdf

Centre for Language Technologies Research Centre
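As a rough illustration of the join-cost idea above, a cross-correlation sub-cost at a concatenation boundary might be computed along the following lines. This is a minimal sketch: the function name, the window length, and the "1 minus peak correlation" normalization are our assumptions, not the thesis's exact formulation.

```python
import numpy as np

def xcorr_join_cost(left_unit, right_unit, window=80):
    """Hypothetical join sub-cost between two candidate units.

    Computes 1 - (peak normalized cross-correlation) between the tail
    of the left unit and the head of the right unit; lower values
    suggest a smoother concatenation point.
    """
    a = left_unit[-window:].astype(float)
    b = right_unit[:window].astype(float)
    a = a - a.mean()                      # remove DC offset
    b = b - b.mean()
    denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    if denom == 0.0:                      # silent/constant segment
        return 1.0
    corr = np.correlate(a, b, mode="full") / denom
    return 1.0 - float(corr.max())
```

In a unit-selection search, such a term would be added as one weighted sub-cost alongside spectral and prosodic join costs when evaluating candidate unit sequences.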
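To make the feature-stacking idea in the CLVC contribution concrete, here is a hypothetical numpy sketch: source- and target-speaker frames are stacked along the feature axis and projected onto a shared low-dimensional code via SVD, which is the optimal bottleneck for a linear autoencoder. The thesis describes a trained autoencoder; the linear stand-in, the equal frame counts, and all names here are illustrative assumptions.

```python
import numpy as np

def shared_code(source_frames, target_frames, code_dim=8):
    """Project stacked speaker-dependent frames onto a common space.

    source_frames, target_frames: (n_frames, feat_dim) arrays with the
    same number of frames (an illustrative simplification).
    Returns the shared codes Z, the reconstruction X_hat, and the basis W.
    """
    X = np.hstack([source_frames, target_frames])  # stack feature-wise
    X = X - X.mean(axis=0)                         # center each feature
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W = Vt[:code_dim].T      # encoder/decoder basis (tied weights)
    Z = X @ W                # codes in the common projection space
    X_hat = Z @ W.T          # reconstructed stacked frames
    return Z, X_hat, W
```

In the thesis's pipeline, the reconstruction corresponding to the target speaker would then serve as generated "parallel" material for a conventional parallel VC step.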