IIIT Hyderabad Publications
Multilingual Text-to-Speech Synthesis using Sequence-to-Sequence Neural Networks

Author: sivanand.a
Date: 2018-06-23
Report no: IIIT/TH/2018/27
Advisors: Kishore S Prahallad, Suryakanth V Gangashetty

Abstract

Text-to-speech (TTS) synthesis is typically carried out in two ways: (1) by concatenating waveform segments of units (often dubbed unit selection synthesis, USS) and (2) by predicting speech parameters from text using statistical models (also called statistical parametric speech synthesis, SPSS). Most commercial TTS systems use the USS approach because it produces highly natural speech. However, USS requires the recorded waveforms to be stored, which demands memory, whereas the statistical approach alleviates this by modeling the speech compactly in parametric form. Moreover, using waveforms directly offers little scope to alter speech characteristics to produce variations in speaker, gender, voice quality, language, and so on. The parameters of a statistical model, on the other hand, can be suitably transformed to produce the desired variations. These advantages (compactness and flexibility) come at the cost of the speech sounding somewhat more robotic than its unit-selection counterpart.

A typical SPSS system has several components, namely text feature extraction, speech parameter extraction, text-speech alignment, a text-feature-to-speech-parameter regression model, and a duration prediction model. Each of these components is independently hand-engineered, making the SPSS system susceptible to errors in any one of them. The loss in naturalness of SPSS output has mainly been attributed to the limitations of the regression model (also dubbed the acoustic model) in capturing the complex mapping from text features to speech parameters, and to the representations used for text and speech data. In addition, the use of a separate alignment model leads to erroneous averaging in acoustic modeling.
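The regression step described above can be illustrated with a minimal numpy sketch of a frame-wise acoustic model: a small feedforward network mapping per-frame linguistic features to speech parameters. All dimensions, names, and the single hidden layer are illustrative assumptions for this sketch, not details taken from the thesis; a real system would stack several layers and train the weights.

```python
import numpy as np

# Hypothetical sizes: 300 linguistic features per frame in, 60 acoustic
# parameters out (e.g. mel-cepstrum + log-F0 + aperiodicity). Illustrative only.
TEXT_DIM, HIDDEN, ACOUSTIC_DIM = 300, 256, 60

rng = np.random.default_rng(0)

# Random (untrained) weights; one hidden layer stands in for a deep network.
W1 = rng.standard_normal((TEXT_DIM, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, ACOUSTIC_DIM)) * 0.01
b2 = np.zeros(ACOUSTIC_DIM)

def acoustic_model(text_feats):
    """Frame-wise regression: text features -> speech parameters."""
    h = np.tanh(text_feats @ W1 + b1)
    return h @ W2 + b2

# 100 frames of dummy frame-level linguistic features.
frames = rng.standard_normal((100, TEXT_DIM))
params = acoustic_model(frames)
print(params.shape)  # (100, 60): one acoustic parameter vector per frame
```

Because each frame is mapped independently here, the model has no memory of neighboring frames; replacing the feedforward layers with recurrent ones (as the thesis does) lets the network carry context across frames.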
In this thesis, we address the issues of acoustic modeling, textual representation, acoustic representation, multilingual multispeaker synthesis, and end-to-end synthesis with implicit text feature extraction and alignment within the SPSS framework. Techniques developed to address these issues include recurrent neural networks for acoustic modeling, contextual representation using hidden-layer output, unit-level acoustic representation using a recurrent auto-encoder, language and speaker codes, and windowing-based attention modeling for end-to-end synthesis using sequence-to-sequence (seq2seq) neural networks. The major conclusions of this thesis work are:

• Recurrent neural networks improve acoustic modeling over deep-neural-network-based SPSS by capturing the required amount of context and generating smoother parameter trajectories.
• Speaker and language codes can be used to build a multilingual, multispeaker synthesizer with a single recurrent neural network. Building a single model for multiple languages and speakers yields a compact model that is also capable of polyglot synthesis.
• A unit-level acoustic representation can be derived using a seq2seq auto-encoder. This representation is also useful in building a unit-level SPSS.
• An end-to-end TTS system can be trained directly from characters to the speech waveform using a seq2seq model with attention. As a result, much of the hand-engineering in an SPSS system can be replaced by a single seq2seq neural network, thereby reducing the effect of independent errors.

Full thesis: pdf

Centre for Language Technologies Research Centre
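The speaker- and language-code conditioning mentioned in the conclusions is commonly realized by appending one-hot identity codes to every input frame, so a single shared network serves all speaker/language pairs. The sketch below is a minimal numpy illustration; the counts and names are invented for this example, not taken from the thesis.

```python
import numpy as np

# Hypothetical inventory: 4 speakers, 3 languages, 300 text features per frame.
N_SPEAKERS, N_LANGS, TEXT_DIM = 4, 3, 300

def add_codes(text_feats, speaker_id, lang_id):
    """Append one-hot speaker and language codes to every input frame,
    so one shared network can condition on the desired identity."""
    n = text_feats.shape[0]
    spk = np.zeros((n, N_SPEAKERS))
    spk[:, speaker_id] = 1.0
    lng = np.zeros((n, N_LANGS))
    lng[:, lang_id] = 1.0
    return np.concatenate([text_feats, spk, lng], axis=1)

x = np.random.default_rng(1).standard_normal((50, TEXT_DIM))
augmented = add_codes(x, speaker_id=2, lang_id=0)
print(augmented.shape)  # (50, 307): 300 text features + 4 + 3 code dims
```

Polyglot synthesis then amounts to pairing a speaker code with a language that speaker never recorded, e.g. `add_codes(x, speaker_id=2, lang_id=1)`.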