Deep Learning based Speech Disfluency Detection

Author: Sparsh Garg
Date: 2022-04-14
Report no: IIIT/TH/2022/26
Advisor: Anil Kumar Vuppala

Abstract

Spontaneous speech is a speech setting in which a speaker talks without preparing in advance. The speaker must think about what to say on the spot, formulate the utterances, and then produce the speech. Such a setting often leads to abrupt breaks or discontinuities in the normal flow of conversation, called disfluencies. Disfluencies can provide information about speaking style, speaker identity, and language fluency, which can be useful for several speech-based applications. For automatic speech recognition (ASR) systems, the presence of disfluencies leads to a higher word error rate, since most ASR systems are developed on non-spontaneous read speech data. Detecting disfluencies in spontaneous speech is therefore an essential task for many applications. Training any machine learning system requires data. In this thesis, we introduce the IIITH-IED dataset for disfluencies in spontaneous lecture-mode speech and use it to develop frame-level automatic disfluency detection systems. Finally, we propose a transfer learning method to detect disfluencies in spontaneous lecture-mode speech using three frame-level automatic disfluency detection systems trained on stuttered speech. Compared to the baseline systems, the proposed method gives an average improvement of 2.25% across all disfluencies and all detection systems.

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
