IIIT Hyderabad Publications
Analysis of mimicry speech based on excitation source information
Author: Gomathi Ramya
Report no: IIIT/TH/2016/3
Abstract: Speech is a major medium of communication among human beings. Every speaker can vary parameters of speech such as loudness, duration, pitch and intonation within the limits of their voice. Voice imitation is a fine art in which a professional imitator develops the ability to mimic other speakers and to convince listeners that they are hearing someone else. It is the flexibility of the speech production mechanism that makes voice imitation possible. In this work, analysis and synthesis of mimicry speech have been carried out. Previous studies on voice imitation have focussed on features at the segmental and suprasegmental levels. The present analysis is carried out at the suprasegmental, segmental and subsegmental levels. The suprasegmental features studied are the instantaneous fundamental frequency (f0) contour and duration. The segmental feature used is linear prediction cepstral coefficients (LPCCs). The subsegmental features are the strength of excitation at the instants of significant excitation and a loudness measure reflecting the sharpness of the impulse-like excitation around epochs. The study focusses on how close the imitation is to the target speech and how far it deviates from the imitator's natural speech. The observations are correlated with perceptual studies. The suprasegmental and subsegmental features show a tendency to move closer to the target features. The segmental features, which represent the vocal tract shape and size, are difficult for the imitator to change. The importance of source and system parameters is studied through synthesis experiments, in which the natural utterance of the professional imitator is transformed into an imitated utterance by varying the excitation source and system parameters. Subjective studies on the synthesised speech show that suprasegmental features play a significant role in imitation.
This analysis is extended to an application in which natural and imitated speech are distinguished using neural network models. The models are built using both excitation source and system features. The models built using excitation source features perform better than those built using system features.
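As an illustration of the segmental feature named in the abstract, the following minimal sketch converts linear prediction coefficients into LPCCs using the standard textbook recursion for an all-pole model. It is not taken from the thesis: the function name `lpc_to_lpcc`, the sign convention H(z) = 1 / (1 - sum_k a_k z^-k), and the omission of the gain term c_0 are all assumptions made for this example.

```python
def lpc_to_lpcc(a, n_ceps):
    """Convert LPC predictor coefficients to cepstral coefficients.

    Assumed convention (not from the thesis): the all-pole model is
    H(z) = 1 / (1 - sum_{k=1..p} a[k-1] * z^-k), and the gain term c_0
    is omitted. Returns [c_1, ..., c_n_ceps].
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] is left unused
    for n in range(1, n_ceps + 1):
        # Direct term a_n exists only while n <= p.
        acc = a[n - 1] if n <= p else 0.0
        # Recursive term: sum over k of (k/n) * c_k * a_{n-k},
        # restricted to indices where a_{n-k} is defined (1 <= n-k <= p).
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


# Sanity check on a first-order model, where c_n = a^n / n in closed form.
lpcc = lpc_to_lpcc([0.5], 3)
```

For a single-pole model the cepstrum has the closed form c_n = a^n / n, which makes a convenient check on the recursion before applying it to coefficients estimated from real speech frames.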
Full thesis: pdf
Centre for Language Technologies Research Centre