Enhancement of Speech Intelligibility using Spectral and Temporal Manipulations

Author: Nivedita Chennupati
Date: 2020-06-29
Report no: IIIT/TH/2020/65
Advisor: Yegnanarayana Yegna

Abstract

Intelligibility of speech is an important factor for effective speech communication. It refers to how clearly one can perceive the words/message from a speech signal. Environmental factors at the speaker’s end and/or at the listener’s end affect speech intelligibility. If the listener is in a noisy environment, while clean speech is being transmitted from the speaker’s end, one needs to modify the clean speech to improve its intelligibility in the presence of noise. If the speaker is in a noisy environment, one needs to enhance the degraded speech to improve its quality and intelligibility before it is transmitted to the listener. In another scenario, where the background at the speaker’s end is speech from another speaker, one needs to do speaker separation to improve the intelligibility of the desired speaker. In order to improve speech intelligibility in these scenarios, it is necessary to understand the factors responsible for intelligibility of speech. The objective of this thesis is to identify factors responsible for intelligibility of speech, and to modify the speech signal to enhance the intelligibility in some practical scenarios.

A flexible analysis, manipulation and synthesis tool is needed for exploring the factors responsible for intelligibility. A signal processing method based on single frequency filtering (SFF) is used to analyze the speech signals. In the SFF method, the speech signal is shifted in frequency by multiplying it with a complex exponential signal, and is passed through a single frequency filter to get an output at any desired frequency. The SFF output is represented using its magnitude (or envelope) and phase components. The thesis shows that the original signal can be reconstructed with very low error by using the SFF outputs at a sufficient number of equally spaced frequencies.
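The SFF analysis step described above can be illustrated with a minimal sketch. The abstract does not give the filter details, so the following are assumptions taken from a common formulation of SFF rather than from this thesis: the shift moves the frequency of interest to half the sampling frequency, the filter is a single-pole filter with its pole at z = -r, and r = 0.995. The function name `sff` and its parameters are hypothetical.

```python
import numpy as np

def sff(x, fs, freqs, r=0.995):
    """Return SFF envelope and phase, each of shape (len(freqs), len(x)).

    Assumed formulation (not spelled out in the abstract): the signal is
    shifted so that the frequency of interest lands at fs/2, then filtered
    by the single-pole filter H(z) = 1 / (1 + r z^-1), whose pole is at z = -r.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    env = np.empty((len(freqs), len(x)))
    phase = np.empty_like(env)
    for i, f in enumerate(freqs):
        shift = np.pi - 2.0 * np.pi * f / fs      # moves frequency f to fs/2
        xk = x * np.exp(1j * shift * n)           # frequency-shifted signal
        y = np.empty(len(x), dtype=complex)
        prev = 0j
        for t in range(len(x)):                   # y[t] = -r * y[t-1] + xk[t]
            prev = -r * prev + xk[t]
            y[t] = prev
        env[i] = np.abs(y)                        # SFF magnitude (envelope)
        phase[i] = np.angle(y)                    # SFF phase
    return env, phase
```

Calling `sff(x, fs, freqs)` on a sampled signal `x` gives one envelope and one phase contour per requested frequency, at every sample. The explicit Python loop keeps the recursion visible; it is slow for long signals and is meant only as an illustration.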
This thesis also examines the relative significance of the SFF magnitude and the SFF phase towards speech intelligibility, by using various parameter values of the SFF analysis.

Different speaking styles such as Lombard speech (speech articulated in the presence of noise) or clearly articulated speech are analyzed using the SFF analysis to determine the factors contributing to the intelligibility of speech. It is observed that the speech signal is more intelligible when the SFF magnitude has higher dynamic range locally (fine structure) and lower dynamic range globally (gross structure). Based on these observations, an algorithm is proposed to improve the intelligibility of clean speech in the presence of noise. The algorithm comprises four modifications on the SFF magnitude of the clean speech at fine and gross levels, and at temporal and spectral levels. The original SFF phase along with the modified SFF magnitude is used to synthesize speech with improved intelligibility in noise. The thesis also proposes another method using simple manipulations on the SFF envelopes to improve the intelligibility of clean speech in noise.

To improve the intelligibility of speech degraded by noise, the high signal-to-noise ratio (SNR) property of the SFF analysis is exploited. The proposed algorithm involves estimation of a binary mask from the SFF outputs of the degraded speech by using different thresholding techniques. The binary mask assigns a value of one to the speech dominant Time-Frequency (T-F) bins and zero to others. Only the speech dominant T-F bins from the SFF envelope and phase of the noisy signal are used to synthesize the signal. The thesis also studies the significance of ideal binary mask, clean magnitude and clean phase towards intelligibility improvement of degraded speech.

When the interfering background is speech, speaker separation is required to increase the intelligibility of the desired speaker.
The SFF output can be used to extract the impulse-like events corresponding to the glottal closure instants (GCIs) of the speaker. This property along with high spectral resolution of the SFF output is used for developing a speaker separation algorithm. The proposed algorithm uses signals recorded from two microphones. The time delay between the two microphones is computed using the cross-correlation between the impulse-like representations of the signals from the two microphones. The compensation of the delay between the two microphone signals reinforces the impulses corresponding to the desired speaker. The SFF on reinforced impulses highlights the harmonics of the speaker, and hence helps in selecting the frequency regions dominated by the speaker. The SFF phase modification is done by using iterative phase reconstruction to further enhance the speech signals.

The key contributions of the thesis are:
1. Development of synthesis procedure for single frequency filtering (SFF) representation of speech.
2. Examination of significance of SFF magnitude and phase towards speech intelligibility using different parameters involved in SFF analysis.
3. Enhancing intelligibility of clean speech in noise, degraded speech and speech in multi-speaker scenario by proposing various manipulations on SFF outputs.
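The delay-estimation and delay-compensation steps of the two-microphone speaker separation described above can be sketched as follows. This is an illustration only: the thesis derives its impulse-like representations from the SFF output, whereas the sketch accepts any pair of impulse-like sequences, and the function names `estimate_delay`, `shift` and `reinforce` are hypothetical.

```python
import numpy as np

def estimate_delay(a, b):
    """Lag in samples by which `a` trails `b`, read off the cross-correlation peak."""
    xcorr = np.correlate(a, b, mode="full")
    # Index len(b) - 1 of the full cross-correlation corresponds to zero lag.
    return int(np.argmax(xcorr)) - (len(b) - 1)

def shift(x, d):
    """Delay x by d samples (advance it if d < 0), zero-filling the gap."""
    y = np.zeros_like(x)
    if d > 0:
        y[d:] = x[:-d]
    elif d < 0:
        y[:d] = x[-d:]
    else:
        y[:] = x
    return y

def reinforce(mic1, mic2):
    """Align mic2 to mic1 and add them, so impulses with the estimated delay add up."""
    d = estimate_delay(mic2, mic1)
    return mic1 + shift(mic2, -d), d
```

Impulses that share the estimated delay line up and double in amplitude after the addition, while impulses from a source with a different inter-microphone delay do not, which is the reinforcement effect the abstract refers to.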

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
