Extraction of Information from Speech for the Detection and Assessment of Dysarthria

Author: Gurugubelli Krishna
Date: 2021-03-08
Report no: IIIT/TH/2021/17
Advisor: Anil Kumar Vuppala

Abstract

Dysarthria is a disorder resulting from weakness in the neuromuscular execution of motor speech production due to brain tumors, brain injury, stroke, cerebral palsy, or facial paralysis. The abnormalities in resonance, articulation, respiration, phonation, and prosody associated with dysarthria lead to poor speech intelligibility. The symptoms of poor speech quality and reduced intelligibility can be used to identify dysarthria. The degree to which a listener understands a dysarthric individual’s speech is referred to as the speech intelligibility of that speaker. Dysarthric speech detection and intelligibility assessment are important steps in the clinical diagnosis of dysarthria. Subjective intelligibility assessment methods can be influenced by the listener’s familiarity with the patient, by contextual and suprasegmental factors, and by semantic/syntactic features. Moreover, subjective methods are costly and time-consuming. Objective intelligibility assessment methods, on the other hand, are economical, repeatable, and reliable, and can support remote monitoring of patient rehabilitation. Growing evidence suggests that clinicians are becoming more receptive to objective intelligibility assessment systems, in which a trained acoustic model assesses speech intelligibility. An objective assessment method such as an acoustic-to-articulatory representation of dysarthric speech can effectively uncover dysarthria-specific features that are useful for the assessment. Additionally, detecting articulatory changes from speech is useful for the assessment of dysarthria. Hence, the extraction of features from speech signals is considered one of the most important steps in developing clinical tools for the automatic detection and assessment of dysarthria. In the literature, few attempts have been made to detect dysarthria, assess the intelligibility of dysarthric speech, and identify the type of dysarthria.
The main objectives of this thesis are dysarthric speech detection (a binary classification task) and dysarthric speech intelligibility assessment (a four-class classification task). All studies in this thesis use the UA-Speech database. In the first study, the effect of tongue tip movement in subjects with dysarthria was investigated by analyzing the duration of rhotic approximant sounds. The duration of the rhotic approximant was measured using the third formant trajectory estimated from speech. The formant trajectory was estimated from the spectrogram computed using quasi-closed-phase (QCP) analysis, an accurate method for estimating formants. This study showed that the duration of the rhotic approximant is a good indicator of the severity level of dysarthria. However, using this measure for the automatic detection and assessment of dysarthria requires sophisticated landmark detection techniques to measure the rhotic approximant duration. Moreover, the duration feature varied with gender, speaker, and phonetic context. Therefore, the subsequent studies of the thesis focused on other state-of-the-art signal processing methods for the robust detection and assessment of dysarthria. The first study suggested that subtle variations in vocal tract system dynamics need to be captured by the feature representation for an accurate assessment of dysarthria. In this regard, a narrow bandpass filtering based instantaneous spectral representation was investigated to capture these variations. A new feature representation called perceptually enhanced narrow bandpass filtering based cepstral coefficients (PE-NBFCC) was proposed to detect and assess dysarthria. This study showed that the proposed features achieved higher classification accuracy than conventional features in both the dysarthria detection and assessment tasks. To further improve the performance of the detection and assessment systems, phase information in the speech signals was explored.
In this regard, the importance of analytic phase information was examined by proposing narrow bandpass filter bank-based instantaneous frequency cepstral coefficients (NBFB-IFCC). The proposed features performed better than standard features in terms of accuracy and F1-score. Score-level fusion of the proposed features with magnitude spectral features improved the classification accuracy, which indicates that the analytic phase features carry complementary information. Finally, this thesis explored excitation features extracted using epoch-based speech processing. This study proposed a method for epoch extraction, namely zero-phase pitch selective bandpass filtering (ZP-PSBF), and showed the importance of excitation source features for detection and intelligibility assessment. The issues addressed in this thesis are summarized as follows:
• Acoustic analysis of rhotic approximants was carried out, and the relation between the duration of rhotic approximant sounds and dysarthria severity was investigated.
• Based on the idea of the single frequency filtering technique, this thesis proposes the narrow bandpass filter bank (NBFB) to estimate the instantaneous changes of the vocal tract system during speech production.
• This thesis investigated the importance of analytic phase information in the perception of speech intelligibility. Further, the importance of group-delay features and analytic phase features was investigated for the automatic detection and assessment of dysarthria.
• Knowledge of epoch locations in continuous speech was investigated for detecting and assessing dysarthria.
The experimental results showed that the vocal tract system features performed better than the excitation source features in the assessment of dysarthria. Among the different features, the combination of PE-NBFCC and modified group delay features showed the best performance, with 98.25% accuracy in dysarthric speech detection.
In dysarthria assessment, the combination of PE-NBFCC and NBFB-IFCC features showed better performance, with 67.98% accuracy.
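
The score-level fusion mentioned above can be illustrated with a minimal sketch. It assumes two classifiers, one trained on magnitude spectral features and one on analytic phase features, each producing per-class posterior scores for every utterance. The function names, the fusion weight, and the score values below are hypothetical and chosen only for illustration; the abstract does not specify the fusion rule used in the thesis, so a simple weighted sum is assumed here.

```python
import numpy as np


def fuse_scores(scores_a, scores_b, weight=0.5):
    """Weighted-sum fusion of two (n_utterances, n_classes) score matrices.

    `weight` is applied to `scores_a` and `1 - weight` to `scores_b`.
    This is an assumed fusion rule, not necessarily the one used in the thesis.
    """
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    if scores_a.shape != scores_b.shape:
        raise ValueError("score matrices must have the same shape")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * scores_a + (1.0 - weight) * scores_b


def predict(fused_scores):
    """Pick the class with the highest fused score for each utterance."""
    return np.argmax(fused_scores, axis=1)


# Hypothetical posteriors: 3 utterances, 4 intelligibility classes.
magnitude_scores = np.array([[0.40, 0.35, 0.15, 0.10],
                             [0.10, 0.20, 0.30, 0.40],
                             [0.30, 0.30, 0.25, 0.15]])
phase_scores = np.array([[0.20, 0.50, 0.20, 0.10],
                         [0.05, 0.15, 0.30, 0.50],
                         [0.10, 0.45, 0.30, 0.15]])

fused = fuse_scores(magnitude_scores, phase_scores, weight=0.5)
labels = predict(fused)
print(labels)  # [1 3 1]
```

In this toy example the first utterance would be assigned class 0 by the magnitude scores alone, while the fused scores favor class 1, which shows how a second feature stream can change a decision. In practice the weight would be tuned on held-out data.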

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
