IIIT Hyderabad Publications
Acoustic Analysis of Voice Disorders from Clinical Perspective

Author: Purva Barche
Date: 2024-06-10
Report no: IIIT/TH/2024/72
Advisor: Anil Kumar Vuppala

Abstract

Voice disorders are caused by abnormalities in the laryngeal system. The signs and symptoms of a voice disorder may include abnormal pitch (too high, too low, or pitch breaks), reduced loudness, degraded voice quality (breathy, rough, or strained voice), loss of voice, and so on. Instrumental assessment, auditory-perceptual assessment, and objective assessment are the most widely used methods for diagnosing voice disorders. Instrumental assessment often involves laryngoscopes and stroboscopes, but these procedures can be expensive and painful. Auditory-perceptual assessment by Speech-Language Pathologists (SLPs) is considered the gold standard for detecting voice disorders, but the decisions made in such subjective tests vary with the examiner's experience and the type of rating scale used. To address these limitations, objective or automatic assessment methods have been extensively explored in the literature. These approaches extract acoustic features from speech signals, offering reliable, cost-effective, and repeatable assessments, and they have the potential to serve as a pre-diagnostic measure for voice disorder assessment by SLPs. This thesis primarily focuses on objective (automatic) assessment of voice disorders.

The objective methods reported in the literature aim to detect the presence or absence of a voice disorder and to assess its severity. Clinical assessment, however, relies on the underlying etiological diagnosis. This thesis therefore proposes a clinical approach to voice disorder assessment: in addition to detection, it explores an objective method that automatically identifies the cause of a voice disorder from acoustic features extracted from the speech signal. Speech samples are categorized into four etiological categories: structural, neurogenic, functional, and psychogenic. To conduct a comprehensive clinical analysis, a multi-level classification approach is employed, in which four binary classifiers are trained on acoustic features.

Voice disorders are characterized by irregular vocal fold vibration, incomplete glottal closure and opening, and variation in the amplitude of consecutive opening and closing cycles of the vocal folds. Parameters that capture these disturbances well should therefore discriminate disordered voices from healthy samples. According to the source-filter model of speech production, such characteristics are best captured from the excitation source signal. The glottal flow waveform, the zero-frequency filtered (ZFF) signal, and the linear prediction (LP) residual are evidences of the excitation source, and features derived from them were used to capture the characteristics of voice disorders. The first study explores perturbation features (jitter, shimmer, noise-to-harmonic ratio, etc.) and cepstral features derived from excitation source evidence for the detection and identification of voice disorders.
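To make the perturbation measures mentioned above concrete, the sketch below computes local jitter and local shimmer from epoch (glottal closure instant) locations. It is a minimal Python illustration, not the implementation used in the thesis; the inputs (waveform, epoch sample indices for a voiced segment, sampling rate) are assumed to be available.

# Minimal sketch (illustrative, not the thesis implementation) of two
# perturbation measures: local jitter and local shimmer, computed from
# assumed epoch (glottal closure instant) locations of one voiced segment.
import numpy as np

def jitter_shimmer(signal, epochs, fs):
    """Return (local jitter %, local shimmer %) for one voiced segment."""
    epochs = np.asarray(epochs)
    periods = np.diff(epochs) / fs                      # pitch periods (seconds)
    # Local jitter: mean absolute difference between consecutive periods,
    # normalised by the mean period.
    jitter = 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

    # Peak amplitude of each glottal cycle.
    amps = np.array([np.max(np.abs(signal[epochs[i]:epochs[i + 1]]))
                     for i in range(len(epochs) - 1)])
    # Local shimmer: mean absolute difference between consecutive cycle
    # amplitudes, normalised by the mean amplitude.
    shimmer = 100.0 * np.mean(np.abs(np.diff(amps))) / np.mean(amps)
    return jitter, shimmer

In practice, the epoch locations themselves must first be estimated from the speech signal; epoch extraction for disordered speech is examined in the second study.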
In this regard, state-of-the-art speech signal processing techniques such as quasi-closed-phase (QCP) analysis, LP analysis, and the ZFF method were explored in this thesis to capture excitation source information. The study concluded that perturbation parameters capture voice disorder information more effectively. It was also found that excitation source based features can discriminate organic from non-organic voice disorders, as well as structural from neurogenic voice disorders; distinguishing functional from psychogenic voice disorders, however, proved challenging.

The computation of excitation source based features requires the detection of epoch locations from speech, so accurate estimation of epoch locations is important for the automatic detection and identification of voice disorders. The second study therefore compared algorithms for detecting epoch locations in speech associated with voice disorders. Nine state-of-the-art epoch extraction algorithms were considered and evaluated on different categories of voice disorders. Most methods performed well on healthy speech, but their performance degraded on disordered speech, and degraded further for structural and neurogenic disorders than for psychogenic and functional disorders. This degradation is likely due to the rapid changes in fundamental frequency (F0) in subjects with voice disorders compared to healthy subjects. Several of these methods depend on the average value of F0 to compute epochs; if F0 is instead estimated for each region, the identified epoch locations may be more accurate. With this motivation, region-based processing was proposed as a pre-processing step for state-of-the-art epoch extraction methods in voice disorder scenarios. The results showed improved performance, which can be attributed to using local F0 rather than average F0 to estimate the epoch locations. Moreover, the voice disorder detection and identification system was rebuilt using features extracted after applying region-wise processing to the state-of-the-art epoch extraction algorithm. Performance improved over the baseline features, leading to the conclusion that accurate identification of epoch locations plays an important role in voice disorder detection and identification.

These first two studies revealed that features obtained from the excitation source signal can effectively distinguish between various categories of voice disorders. However, their effectiveness relies on precise estimation of the fundamental frequency and accurate epoch locations.
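To illustrate the region-wise processing idea described above, the sketch below applies zero-frequency filtering (ZFF) with a trend-removal window set from a local F0 estimate in each region rather than a single utterance-level average F0. It is a rough illustration under stated assumptions (a crude autocorrelation F0 estimator, fixed 200 ms regions, successive mean removal), not one of the nine algorithms evaluated in the thesis.

# Illustrative sketch of ZFF-based epoch extraction with region-wise
# trend removal (assumptions only; not the thesis implementation).
import numpy as np
from scipy.signal import lfilter

def local_f0(x, fs, fmin=60.0, fmax=400.0):
    """Crude autocorrelation-based F0 estimate for one region (assumption)."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    if hi >= len(r) or np.max(r[lo:hi]) <= 0:
        return 100.0                                    # nominal fallback F0
    return fs / (lo + np.argmax(r[lo:hi]))

def zff_epochs_regionwise(s, fs, region_dur=0.2):
    """Return epoch (GCI) sample indices from a region-wise ZFF signal."""
    s = np.asarray(s, dtype=float)
    x = np.diff(s, prepend=s[0])                        # difference the signal
    # Pass twice through a zero-frequency resonator: y[n] = x[n] + 2y[n-1] - y[n-2].
    y = lfilter([1.0], [1.0, -2.0, 1.0], x)
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)
    # Region-wise trend removal with a window of one local pitch period.
    z = np.zeros_like(y)
    hop = int(region_dur * fs)
    for start in range(0, len(y), hop):
        seg = slice(start, min(start + hop, len(y)))
        seg_y = y[seg].copy()
        win_len = min(max(int(fs / local_f0(s[seg], fs)), 3), len(seg_y))
        win = np.ones(win_len) / win_len
        for _ in range(3):                              # successive mean removal
            seg_y = seg_y - np.convolve(seg_y, win, mode="same")
        z[seg] = seg_y
    # Epochs: negative-to-positive zero crossings of the trend-removed signal.
    return np.where((z[:-1] < 0) & (z[1:] >= 0))[0] + 1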
Detecting the pitch contour is more straightforward in mildly dysphonic voices than in severely affected ones, and the type of signal, gender, and fundamental frequency must be considered carefully when computing these features. Hence, the next study focused on supra-segmental analysis of the speech signal (frame sizes greater than 100 ms) instead of the short-term analysis (frame size of about 20 ms) used in the previous studies. Voice disorders affect pitch, loudness, and voice quality, which are perceived at the supra-segmental level. To capture voice quality, this study explored the effectiveness of long-term average spectrum (LTAS) features computed with auditory filter banks such as the gammatone and constant-Q filter banks, and compared them with LTAS features derived from a critical-band filter bank and a single frequency filtering (SFF) based filter bank. The detection and identification performance improved with the gammatone- and constant-Q-based LTAS features over the baseline features, possibly because these auditory filter banks are designed to mimic the human auditory system. Compared to the previous studies, a significant improvement was observed across all experiments, suggesting that long-term features capture voice disorder information better than features extracted with short-term analysis.

The previous study established the importance of spectro-temporal analysis for voice disorder detection and identification. The Stockwell transform (S-Transform) is a time-frequency analysis method that provides better time-frequency localization than representations such as the short-time Fourier transform (STFT) or the wavelet transform. Therefore, the S-Transform was explored for classifying voice disorders from a clinical perspective. Cepstral features derived from the S-Transform were proposed for building the detection and identification systems, and the S-Transform was also shown to be effective in capturing the acoustic characteristics of various voice qualities. Compared to the baseline features, the proposed features achieved the best classification accuracy for the detection task and also performed better on the identification task. Further, combining the cepstral coefficients derived from the S-Transform with the baseline features improved performance by 8% and 4% for the detection and identification tasks, respectively.

Keywords: Clinical perspective, Voice disorders, Detection and identification of voice disorders, Excitation source features, Region-wise processing, Supra-segmental analysis, Long term average spectrum, Time-frequency analysis, Stockwell-Transform.

Full thesis: pdf

Centre for Language Technologies Research Centre
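As an illustration of the S-Transform based cepstral features described in the abstract, the sketch below computes a discrete Stockwell transform of a short speech frame and derives cepstral coefficients by applying a DCT to the log-magnitude spectrum across frequency and averaging over time. The exact feature recipe (frame length, number of coefficients, pooling) is an assumption for illustration, not the configuration used in the thesis.

# Minimal sketch (assumed recipe, not the thesis configuration) of cepstral
# features derived from the discrete Stockwell transform (S-Transform).
import numpy as np
from scipy.fft import dct

def stockwell(x):
    """Discrete S-Transform; returns a (N//2 + 1, N) complex array
    (frequency index k by time index n)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = np.fft.fft(x)
    m = np.fft.fftfreq(N) * N               # signed DFT frequency indices
    S = np.empty((N // 2 + 1, N), dtype=complex)
    S[0, :] = np.mean(x)                    # zero-frequency row: signal mean
    for k in range(1, N // 2 + 1):
        # Frequency-domain Gaussian window whose width scales with 1/k.
        gauss = np.exp(-2.0 * np.pi ** 2 * m ** 2 / k ** 2)
        S[k, :] = np.fft.ifft(np.roll(X, -k) * gauss)
    return S

def s_transform_cepstra(frame, n_coeff=13):
    """Illustrative S-Transform cepstral coefficients for one long frame."""
    log_mag = np.log(np.abs(stockwell(frame)) + 1e-10)
    cep = dct(log_mag, type=2, axis=0, norm="ortho")[:n_coeff, :]
    return cep.mean(axis=1)                 # one n_coeff-dimensional vector

# Usage (assumed): features for a supra-segmental frame, e.g. 200 ms at 8 kHz.
# feats = s_transform_cepstra(speech[:1600])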