Speech recognition based confidence measures for building voices from untranscribed speech

Author: Tejas Godambe
Date: 2016-05-03
Report no: IIIT/TH/2016/14
Advisor: Kishore S Prahllad, V G Suryakanth

Abstract

Today, a large amount of audio data is available on the web in the form of audiobooks, podcasts, video lectures, video blogs, and news bulletins. In addition, we can effortlessly record and store audio data such as read, lecture, or impromptu speech on hand-held devices. These data are rich in prosody and provide a plethora of voices to choose from, and their availability can significantly reduce the overhead of data preparation involved in building general-purpose synthesizers, thus helping to build synthetic voices rapidly. However, several problems arise when using such data directly for speech synthesis: (1) the audio files are generally long, and aligning audio with transcriptions is memory intensive; (2) the available transcriptions are approximate; (3) often no transcriptions are available at all; (4) the audio may contain disfluencies and non-speech noises, since it was not recorded specifically for building synthetic voices; and (5) automatic transcripts, if obtained, are not error free. Earlier work on long audio alignment, which addressed the first and second issues, generally preferred reasonably accurate transcripts and mainly focused on (1) less manual intervention, (2) mispronunciation detection, and (3) segmentation error recovery. In this thesis, we used a public-domain large-vocabulary automatic speech recognition (ASR) system to obtain transcripts, followed by confidence-measure-based data pruning. Together, these address the five issues with found data and also satisfy the three goals above. As a proof of concept, we built English voices from an audiobook (read speech, female voice) downloaded from Librivox and a lecture (spontaneous speech, male voice) downloaded from Coursera, using both reference and hypothesis transcriptions, and evaluated them for intelligibility and naturalness through perceptual listening tests on the Blizzard 2013 corpus.
The results of the subjective intelligibility and naturalness tests show that, using automatic transcripts and confidence-measure-based data pruning, we can build voices of quality comparable to those built using reference transcriptions.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
