Robust Estimation of Direction of Arrival and Time-Frequency Masks for Speech Enhancement

Author: Sushmita Thakallapalli
Date: 2022-02-11
Report no: IIIT/TH/2022/20
Advisors: Suryakanth V Gangashetty, Nilesh Madhu

Abstract

Speech-enabled smart devices are now widely used to improve human-machine and human-human interaction. One of the primary tasks in these devices is speech enhancement: the extraction of the desired speech signal from the noisy and reverberant mixture signals recorded by the microphones. The performance of speech-enabled devices relies significantly on the performance of the speech enhancement methods. Among these methods, beamformers and Time-Frequency (TF) mask-based methods are widely used. Beamformers are linear spatial filters that aim to boost the signal coming from a specific direction through an appropriate configuration of the microphone array, and in doing so attenuate interfering signals from other directions. A source TF mask identifies the TF regions where the source is dominant and can be applied to the mixture TF representation to extract the desired source. Several beamformers and a few TF mask estimators estimate the Direction of Arrival (DoA) of the sources, that is, their azimuth and elevation angles, from the microphone data. Several other speech enhancement methods estimate the source TF masks. In all these methods, the DoA and the masks must be estimated accurately for the enhanced signals to be of high quality.

This thesis addresses two problems: robust DoA and TF mask estimation from noisy and reverberant microphone signals, and the enhancement of a TF mask. Specifically, it proposes two robust DoA estimators and a novel TF mask interpolation technique to improve the TF masks. Further, it investigates the efficacy of eigenvalue features for robust TF mask estimation in a resource-constrained, neural network-based speech enhancement task.

The first DoA estimator is a Non-negative Matrix Factorization (NMF) weighted Steered Response Power (SRP) beamformer, abbreviated as SRP-NMF.
Broadband SRP beamformers cannot perform multi-speaker DoA estimation in a single time frame, a drawback that SRP-NMF overcomes by NMF weighting. The weights are obtained by NMF of the mixture spectrogram and correspond to the NMF atoms of the underlying sources. In evaluations on data from public challenges and on data generated from recorded room impulse responses, with various microphone array configurations, the SRP-NMF method outperforms the widely used variants of narrowband and broadband DoA estimators in terms of source detection capability and DoA estimation accuracy.

The second DoA estimator, SFF-PHAT-env, estimates the directional information of the sources by PHAse Transform (PHAT) weighted cross-correlation, across the channels, of the amplitude envelopes at several frequencies. The envelopes are obtained by passing the microphone signals through a narrowband filter called Single Frequency Filtering (SFF). The high signal-to-noise ratio regions in the envelopes, the PHAT weighting, and the multiple pieces of evidence from several frequencies together yield robust DoA estimates. The performance of SFF-PHAT-env is compared with other existing SFF-based methods and with state-of-the-art Generalized Cross Correlation (GCC)-based methods. The tests are conducted on publicly available data collected in real rooms under challenging conditions. The experiments show that the best-performing SFF-based methods are better than or comparable to the best GCC-based estimator in detection metrics such as F-measure and in accuracy metrics based on azimuth error.

Irrespective of whether the TF mask is obtained as an ideal binary mask (IBM) or by any practical method, regions of the target speech are often suppressed, leading to audible artefacts. The influence of these errors can be reduced by estimating the missing data points through some form of interpolation using NMF.
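To make the notion of an ideal binary mask concrete, the following is a minimal sketch, not the thesis implementation. It assumes oracle access to the target and noise power in each TF bin, and the names `ideal_binary_mask`, `apply_mask`, and `threshold_db` are hypothetical. The toy values are chosen so that two bins holding some target energy are zeroed out, which is the kind of suppressed region described above.

```python
import math

def ideal_binary_mask(target_power, noise_power, threshold_db=0.0):
    """Return a 0/1 mask that is 1 where the local SNR (in dB) exceeds threshold_db.

    Inputs are 2-D lists indexed [frequency][frame] holding per-bin power.
    The 0 dB default threshold is an illustrative assumption.
    """
    eps = 1e-12  # avoids log(0) and division by zero
    mask = []
    for t_row, n_row in zip(target_power, noise_power):
        row = []
        for t, n in zip(t_row, n_row):
            snr_db = 10.0 * math.log10((t + eps) / (n + eps))
            row.append(1 if snr_db > threshold_db else 0)
        mask.append(row)
    return mask

def apply_mask(mask, mixture):
    """Multiply the mixture TF representation by the mask, bin by bin."""
    return [[m * x for m, x in zip(m_row, x_row)]
            for m_row, x_row in zip(mask, mixture)]

# Toy 2x2 example (2 frequency bins, 2 frames); values are made up.
target = [[4.0, 0.1], [0.2, 9.0]]
noise = [[1.0, 1.0], [1.0, 1.0]]
mixture = [[5.0, 1.1], [1.2, 10.0]]

mask = ideal_binary_mask(target, noise)
enhanced = apply_mask(mask, mixture)
```

In this toy case the bins where the target is weaker than the noise are set to zero even though they contain target power 0.1 and 0.2. An interpolation step would aim to recover such bins.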
The existing NMF-based interpolation methods are computationally intensive and offer no means to control the degree of interpolation, which results in over-estimation of the missing regions and a noise-vocoded output. The proposed NMF-based interpolation method addresses these drawbacks, and the improvement it can achieve is assessed by applying it to ideal binary mask-based gain functions. The instrumental quality metrics (PESQ and SNR) indicate the added benefit of the missing data interpolation compared to the output of the ideal binary mask.

In resource-constrained devices, the learning model cannot be complex, so the burden of performing the task falls on the input features: the more discriminative the features, the better the performance. Eigenvalues are spectral features that can discriminate coherent sound sources from spatially uncorrelated ones. However, eigenvalues have not been used for neural network-based enhancement. In this thesis, for extracting speech from noise, we explore the efficacy of the instantaneous generalized eigenvalue features for neural network-based TF mask estimation. These features are compared with the widely used spectral features. Tests are conducted in both matched and unmatched noise conditions. Eigenvalue features show better improvements in objective scores that measure the quality and intelligibility of speech signals.
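The claim that eigenvalues separate coherent sources from spatially uncorrelated signals can be illustrated with a small sketch. It is not the instantaneous generalized eigenvalue feature of the thesis, whose exact definition is not given here. It assumes two microphones and a single TF bin observed over a few frames, and it uses ordinary eigenvalues of the 2x2 spatial covariance matrix, which coincide with generalized eigenvalues when the noise covariance is the identity. The helper names and the toy signals are hypothetical.

```python
import cmath
import math

def covariance_2ch(x1, x2):
    """Sample spatial covariance of two complex channels (one TF bin, several frames).

    Returns (a, b, d) for the Hermitian matrix [[a, b], [conj(b), d]].
    """
    n = len(x1)
    a = sum(abs(v) ** 2 for v in x1) / n
    d = sum(abs(v) ** 2 for v in x2) / n
    b = sum(u * v.conjugate() for u, v in zip(x1, x2)) / n
    return a, b, d

def hermitian_2x2_eigenvalues(a, b, d):
    """Closed-form eigenvalues of [[a, b], [conj(b), d]], largest first."""
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + abs(b) ** 2)
    return mean + radius, mean - radius

# Coherent case: channel 2 is channel 1 with a fixed phase shift (assumed value).
s = [1 + 0j, 1j, -1 + 0j, -1j]
shift = cmath.exp(-0.7j)
coherent = hermitian_2x2_eigenvalues(*covariance_2ch(s, [v * shift for v in s]))

# Uncorrelated case: deterministic, mutually orthogonal stand-ins for noise.
n1 = [1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j]
n2 = [1 + 0j, -1 + 0j, 1 + 0j, -1 + 0j]
uncorrelated = hermitian_2x2_eigenvalues(*covariance_2ch(n1, n2))
```

For the coherent signal the energy concentrates in one eigenvalue while the other is close to zero, whereas for the uncorrelated pair the two eigenvalues are equal. That contrast is what makes eigenvalue-based features discriminative.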

Full thesis: pdf

Centre for Language Technologies Research Centre

IIIT Hyderabad Publications
