IIIT Hyderabad Publications
Data Efficiency in Neural Stylistic Speech Synthesis
Author: Nishant Prateek 201225113
Date: 2023-11-07
Report no: IIIT/TH/2023/156
Advisor: Manish Shrivastava

Abstract
Recent developments in neural text-to-speech synthesis (NTTS) can produce high-quality, human-like speech samples. However, these models require large amounts of studio-quality recordings for training. As a result, training NTTS models to serve multiple speakers and styles is resource intensive, both in terms of time and compute costs. In this thesis, we discuss data-efficiency methods for training an NTTS acoustic model with limited data. We explore several data-efficiency techniques for speaker and style modelling. We discuss multi-speaker training for an NTTS model with limited data for each speaker, and speaker adaptation for fine-tuning a pre-trained multi-speaker NTTS model with a few minutes of training data for the target speaker. Controllability in TTS systems refers to the ability to explicitly control the prosodic variations of the synthesised speech. We explore controllability in NTTS through latent-variable conditioning with variational autoencoders (VAE). These can be used for one-shot style transfer from a reference speech sample. We further discuss improving the posterior flexibility of the VAE latent variable using normalising flows. We adapt the multi-speaker training strategy to generate newscaster-style speech with limited stylistic training data. We first analyse prosodic variations in the neutral style, newscaster style, and mixed expressive corpora. From this, we conclude that the newscaster style is more dynamic than the neutral style, but has a lower dynamic range and less prosodic variation than a mixed expressive corpus containing recordings with different emotions.
The problem of generating newscaster-style speech with limited training data is posed as that of creating a bi-style model in which a one-hot style ID can be modified to generate either neutral or newscaster-style speech. We use only a quarter as much newscaster-style data as neutral-style data. Combining the two styles gives the model enough training data to learn the textual and acoustic alignments, while only a fraction of that amount of stylistic data is required to factorise the two styles. To further improve naturalness, we condition the NTTS acoustic model on contextualised word embeddings (CWE). This gives the model additional syntactic and semantic context on the input text. The proposed bi-style NTTS model conditioned on CWE is shown to improve on naturalness and style-appropriateness for newscaster speech over both neutral NTTS and concatenative systems in MUSHRA evaluations conducted with expert listeners.

Full thesis: pdf
Centre for Security, Theory and Algorithms
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.