Sujan Kumar Roy et al.

The performance of speech coding, speech recognition, and speech enhancement largely depends upon the accuracy of the linear prediction coefficients (LPCs) of clean speech and noise in practice. Formulating speech and noise LPC estimation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised technique, typically a deep neural network (DNN), is trained to learn a mapping from noisy speech features to clean speech and noise LPCs. Training targets for DNN-based clean speech and noise LPC estimation fall into four categories: line spectrum frequency (LSF), LPC power spectrum (LPC-PS), power spectrum (PS), and magnitude spectrum (MS). The choice of training target as well as the DNN method can have a significant impact on LPC estimation in practice. Motivated by this, we perform a comprehensive study of the training targets using two state-of-the-art DNN methods: the residual network and temporal convolutional network (ResNet-TCN) and the multi-head attention network (MHANet). This study aims to determine which training target and which DNN method produce more accurate LPCs in practice. We train the ResNet-TCN and MHANet for each training target with a large data set. Experiments on the NOIZEUS corpus demonstrate that the LPC-PS training target with MHANet produces a lower spectral distortion (SD) level in the estimated speech LPCs in real-life noise conditions. We also construct the augmented Kalman filter (AKF) with the estimated speech and noise LPC parameters from each training target using ResNet-TCN and MHANet. Subjective AB listening tests and seven different objective quality and intelligibility evaluation measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR) on the NOIZEUS corpus demonstrate that the AKF constructed with MHANet-LPC-PS driven speech and noise LPC parameters produces enhanced speech with higher quality and intelligibility than competing methods.
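As background for the training targets compared above, the LPCs of a speech frame are classically obtained with the autocorrelation method (Levinson-Durbin recursion), and the LPC-PS target is then the all-pole power spectrum those coefficients define. The following NumPy sketch is illustrative only; the function names and the frame, order, and FFT-size choices are our assumptions, not the paper's implementation:

```python
import numpy as np

def lpc(frame, order):
    """Estimate LPCs [1, a1, ..., ap] and the prediction-error gain
    using the autocorrelation method (Levinson-Durbin recursion)."""
    n = len(frame)
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(frame[:n - k], frame[k:])
                  for k in range(order + 1)]) / n
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction residual
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def lpc_power_spectrum(a, gain, nfft=512):
    """LPC power spectrum (LPC-PS): gain / |A(e^{jw})|^2 on an nfft grid."""
    A = np.fft.rfft(a, nfft)
    return gain / np.abs(A) ** 2
```

On a synthetic AR(2) signal the recursion recovers the generating coefficients closely, which is the sanity check usually applied before using such targets for training.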

Sujan Kumar Roy et al.

Current augmented Kalman filter (AKF)-based speech enhancement algorithms (SEAs) utilise a temporal convolutional network (TCN) to estimate the clean speech and noise linear prediction coefficients (LPCs). However, the multi-head attention network (MHANet) has demonstrated the ability to model the long-term dependencies of noisy speech more efficiently than TCNs. Motivated by this, we investigate the MHANet for LPC estimation. We aim to produce clean speech and noise LPC parameters with the least bias to date, and with this, to produce higher quality and more intelligible enhanced speech than any current Kalman filter (KF) or AKF-based SEA. Here, we investigate MHANet within the DeepLPC framework, a deep learning framework for jointly estimating the clean speech and noise LPC power spectra. DeepLPC is selected as it exhibits significantly less bias than other frameworks, by avoiding the use of whitening filters and post-processing. DeepLPC-MHANet is evaluated on the NOIZEUS corpus using subjective AB listening tests, as well as seven different objective measures (CSIG, CBAK, COVL, PESQ, STOI, SegSNR, and SI-SDR), and is compared to five existing deep learning-based methods. Compared to other deep learning approaches, DeepLPC-MHANet produced clean speech LPC estimates with the least amount of bias. DeepLPC-MHANet-AKF also produced higher objective scores than any of the competing methods (with an improvement of 0.17 for CSIG, 0.15 for CBAK, 0.19 for COVL, 0.24 for PESQ, 3.70% for STOI, 1.03 dB for SegSNR, and 1.04 dB for SI-SDR over the next best method). The enhanced speech produced by DeepLPC-MHANet-AKF was also the most preferred amongst ten listeners. By producing LPC estimates with the least amount of bias to date, DeepLPC-MHANet enables the AKF to produce enhanced speech at a higher quality and intelligibility than any previous method.
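For context, the AKF that these LPC estimates drive can be sketched minimally as follows. The state vector stacks the most recent speech and noise samples, the transition matrix is block-diagonal with companion-form blocks built from the speech and noise LPCs, and the observation is their sum. All names, the toy AR models, and the initialisation below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def companion(a):
    """Companion-form transition matrix for an AR model with
    LPCs a = [1, a1, ..., ap]."""
    p = len(a) - 1
    F = np.zeros((p, p))
    F[0, :] = -a[1:]          # newest sample from the AR recursion
    if p > 1:
        F[1:, :-1] = np.eye(p - 1)  # shift older samples down
    return F

def akf_enhance(y, a_s, var_s, a_v, var_v):
    """Augmented Kalman filter: jointly tracks speech and noise AR
    states and returns the filtered clean-speech estimate."""
    p, q = len(a_s) - 1, len(a_v) - 1
    F = np.zeros((p + q, p + q))
    F[:p, :p] = companion(a_s)
    F[p:, p:] = companion(a_v)
    Q = np.zeros((p + q, p + q))       # process-noise covariance
    Q[0, 0], Q[p, p] = var_s, var_v    # excitations enter newest samples
    H = np.zeros(p + q)
    H[0] = H[p] = 1.0                  # observation: y[n] = s[n] + v[n]
    x = np.zeros(p + q)
    P = np.eye(p + q)
    s_hat = np.zeros(len(y))
    for n, yn in enumerate(y):
        x = F @ x                      # predict
        P = F @ P @ F.T + Q
        S = H @ P @ H                  # innovation variance (scalar)
        K = P @ H / S                  # Kalman gain
        x = x + K * (yn - H @ x)       # update with the new observation
        P = P - np.outer(K, H @ P)
        s_hat[n] = x[0]                # newest speech-state component
    return s_hat
```

With the true speech and noise LPC parameters, this filter reduces the mean-squared error relative to the noisy observation, which is why the accuracy of the DNN-estimated LPCs directly bounds the quality of the enhanced speech.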