IIIT Hyderabad Publications
Building Scalable and Unrestricted Text to Speech Systems for Indian Languages

Author: Sai Krishna Rallabandi
Date: 2018-09-11
Report no: IIIT/TH/2018/72
Advisor: Suryakanth V Gangashetty

Abstract

Synthetic or artificial speech has developed steadily over the last decades; in particular, intelligibility has reached a level adequate for most applications, especially for communication-impaired people. This thesis concerns the development of deployable text to speech (TTS) systems for Indian languages. The research described here was conducted with two principal aims: to build stable, scalable and unrestricted synthesis systems for Indian languages, and to design such systems by (a) making observations about the structure of the data and (b) keeping the human in the loop, so that the resulting systems are both acceptable and efficient. To that end, I first present a baseline synthesis system for Indian languages that employs unit selection and concatenation as the algorithm, syllables as the basic unit, and a backoff based on reduced vowel epenthesis to handle missing units. During synthesis, the required utterance, specified as a string of syllables, guides the selection of appropriate units from the stored database. The system uses word-to-phone mapping to handle words from a different language (English), and the selected units are concatenated using a variant of the Waveform Similarity Overlap-Add (WSOLA) method. Two extensions to the system are discussed:

• Inspired by the smoothness of the speech obtained after the backend concatenation algorithm, a method is presented that exploits signal correlation as a join cost within the Viterbi algorithm itself, with the aim of further improving the naturalness of the system. Two variants of signal similarity are examined: one based on cross-correlation and one based on the average magnitude difference function.

• Modeling prosodic characteristics is an important trait of a text to speech system.
Deep generative models, which are the standard today, benefit from a low-dimensional continuous input rather than the array of one-hot phoneme encodings typically used. A method is discussed for obtaining continuous representations of phonemes using a matrix factorization approach, and it is shown that these representations can be used to model prosody by predicting phoneme durations. The system can be retrained in 6 hours on the same languages and should take approximately the same amount of time for any new language.

Full thesis: pdf

Centre for Language Technologies Research Centre
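As a rough illustration of the two signal-similarity measures mentioned as candidate join costs, the sketch below computes a normalised cross-correlation and an average magnitude difference function (AMDF) between two waveform frames. The frame construction, function names, and toy signals are illustrative assumptions, not taken from the thesis.

```python
# Sketch (not the thesis implementation): two frame-similarity measures
# usable as join costs -- normalised cross-correlation and AMDF.
import numpy as np

def cross_correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Normalised cross-correlation at zero lag; near 1.0 means similar shape."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom else 0.0

def amdf(a: np.ndarray, b: np.ndarray) -> float:
    """Average magnitude difference; 0.0 means identical frames."""
    return float(np.abs(a - b).mean())

# Toy frames: a sinusoid and a slightly phase-shifted copy.
t = np.linspace(0, 1, 200, endpoint=False)
frame1 = np.sin(2 * np.pi * 5 * t)
frame2 = np.sin(2 * np.pi * 5 * t + 0.1)

print(cross_correlation(frame1, frame2))  # high (~1) for similar frames
print(amdf(frame1, frame2))               # small for similar frames
```

A unit-selection Viterbi search could plug either measure into its join cost, preferring candidate joins where consecutive units' boundary frames score as most similar.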
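The matrix-factorization idea for continuous phoneme representations can be sketched as follows: build a phoneme co-occurrence matrix and factor it with a truncated SVD, keeping each phoneme's row in the low-rank factor as its embedding. The toy phone set, pronunciations, and rank are illustrative assumptions; the thesis's actual factorization and data may differ.

```python
# Sketch (illustrative, not the thesis setup): low-dimensional phoneme
# embeddings from a truncated SVD of a bigram co-occurrence matrix.
import numpy as np

phones = ["a", "k", "t", "i", "m"]
idx = {p: i for i, p in enumerate(phones)}

# Toy "pronunciations" used only to accumulate bigram counts.
words = [["k", "a", "t"], ["m", "a", "t"], ["t", "i", "m"], ["k", "i", "t"]]

C = np.zeros((len(phones), len(phones)))
for w in words:
    for left, right in zip(w, w[1:]):
        C[idx[left], idx[right]] += 1.0

# Truncated SVD: keep the top-k singular directions as the embedding space.
U, s, Vt = np.linalg.svd(C)
k = 2
embeddings = U[:, :k] * s[:k]  # one k-dimensional vector per phoneme

print({p: np.round(embeddings[idx[p]], 2) for p in phones})
```

These continuous vectors could then replace one-hot phoneme codes as input features to a duration predictor, which is the prosody-modeling use the abstract describes.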