IIIT Hyderabad Publications
Representation of speech using features and models of production

Author: Anand Joseph X M (200799012)
Date: 2022-10-31
Report no: IIIT/TH/2022/144
Advisor: Yegnanarayana

Abstract

Digital (discretized and quantized) samples are used to represent speech signals in a computer. Beyond recording and playback, such representations have limited use on their own. Most speech processing applications involve extracting relevant information from the speech signal. For example, speech recognition requires extraction of parameters or features that describe the different sound units uniquely. In speaker recognition, it is necessary to extract speaker-specific features to discriminate one speaker from another. In emotion recognition, the emotion-specific characteristics need to be identified and extracted. In speech compression, the features of speech production need to be captured, so that the speech can be transmitted and stored effectively and efficiently.

This thesis focuses on the extraction and representation of speech production features, not only for understanding the link between the speech signal and the dynamic vocal tract system, but also to examine the possibility of a compact representation for transmission and storage. The first attempt is to produce a spectrographic display that shows different types of segments with good temporal resolution, along with changes in the formant contours reflecting the rapid movement of the articulators of the dynamic vocal tract system. Methods are proposed to extract formants and their bandwidths. Glottal source processing methods are developed to obtain the characteristics of the glottal vibration within each cycle. While these source and system features are useful in understanding the production process, they are not adequate to reproduce all the important characteristics of speech, as evidenced by the poor quality of the formant vocoder developed using the extracted features.
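The thesis's own spectrographic display method is not reproduced in this abstract; as a generic illustration of the temporal-resolution trade-off it refers to, the sketch below computes a conventional short-window (wideband) magnitude spectrogram. The sampling rate, window length, hop size, and synthetic harmonic signal are all illustrative assumptions, not parameters from the thesis.

```python
# Hedged sketch: short-window (wideband) spectrogram of a synthetic
# voiced-like signal. A short analysis window gives good temporal
# resolution at the cost of frequency resolution.
import numpy as np

fs = 8000                              # sampling rate in Hz (assumed)
t = np.arange(0, 0.5, 1.0 / fs)       # 0.5 s of signal
# synthetic "voiced" signal: 120 Hz fundamental plus a few harmonics
x = sum(np.sin(2 * np.pi * 120 * k * t) / k for k in range(1, 6))

win_len = 64                           # short window -> good time resolution
hop = 32
frames = [x[i:i + win_len] * np.hanning(win_len)
          for i in range(0, len(x) - win_len, hop)]
spec = np.abs(np.fft.rfft(frames, axis=1))   # magnitude spectrogram
print(spec.shape)                      # (num_frames, win_len // 2 + 1)
```

A longer window would sharpen the harmonic structure along the frequency axis, but would smear exactly the rapid articulatory transitions that the display described above is designed to preserve.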
For a compact representation of the speech signal for reproduction, the highly successful linear prediction (LP) based source-system model is used. Even an approximate separation of the source and system components helps in effectively compressing the speech information, using artificial neural network (ANN) models for these components. The compressed parameters are also useful for other applications such as speech recognition, speaker recognition and speech enhancement.

Full thesis: pdf

Centre for Language Technologies Research Centre
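The LP-based source-system separation mentioned above can be sketched with the standard autocorrelation method and the Levinson-Durbin recursion: the all-pole coefficients model the vocal tract "system", and the prediction residual approximates the excitation "source". The frame length, predictor order, sampling rate, and test signal below are illustrative assumptions, not the thesis's settings, and this is a minimal textbook sketch rather than the ANN-based compression scheme the abstract describes.

```python
# Hedged sketch: autocorrelation-method LP analysis (Levinson-Durbin).
import numpy as np

def lp_coefficients(frame, order):
    """Return prediction-error filter coefficients a[0..order], with a[0] = 1."""
    # biased autocorrelation sequence r[0..order]
    r = np.array([np.dot(frame[: len(frame) - m], frame[m:])
                  for m in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                     # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                 # updated prediction-error energy
    return a

fs = 8000                                  # sampling rate in Hz (assumed)
t = np.arange(0, 0.032, 1.0 / fs)          # one 32 ms frame
rng = np.random.default_rng(0)
# two "formant-like" sinusoids plus a little noise for numerical stability
x = (np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
     + 0.01 * rng.standard_normal(t.size))

a = lp_coefficients(x, order=10)
residual = np.convolve(a, x)[: len(x)]     # source estimate: A(z) applied to x
print(np.sum(residual ** 2) / np.sum(x ** 2))   # small for a predictable frame
```

The small residual-to-signal energy ratio is the point of the model: a few coefficients plus a low-energy excitation carry most of the frame, which is what makes LP parameters attractive for the compression described above.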
Copyright © 2009 - IIIT Hyderabad. All Rights Reserved.