Improved Wiener Filter Algorithm for Speech Enhancement

Abstract: Most existing speech enhancement algorithms aim to improve speech quality, and algorithms that effectively improve speech intelligibility are rare. Speech intelligibility has been found to improve listening comfort, and it is closely related to the distortion of the speech signal. Studies assessing the speech distortion introduced by gain functions have shown that one of the main reasons existing algorithms fail to improve speech intelligibility is that they allow amplification distortions of more than 6 dB. These distortions of the enhanced amplitude spectrum should therefore be corrected to improve speech intelligibility. The early research by Loizou et al. obtained its experimental results under ideal conditions, and it cannot be used in practice because no clean speech is available in practice. In this paper, we modify the method proposed by Loizou et al. and use the speech estimated under two hypothetical conditions to verify the improvement in speech intelligibility. The short-time objective intelligibility values confirm the improvement and show that the intelligibility-oriented algorithm can be applied in practice.


Introduction
Humans communicate with the outside world through speech, but the ubiquitous noise of daily life can seriously interfere with voice communication. Practical speech enhancement technology is therefore needed to reduce noise and restore the clean signal as far as possible. Great progress has been made in the development of single-microphone noise-reduction algorithms for hearing applications and speech communication systems [1], but the majority of these algorithms aim to improve speech quality and listening comfort rather than speech intelligibility, which reflects the listener's ability to understand the speech signal accurately. Existing speech enhancement algorithms can improve speech quality to a certain degree, but they do not improve speech intelligibility consistently and substantially.
Most noise-reduction algorithms involve two consecutive stages of processing, as described in Figure 1: estimating the SNR and applying a gain function. This paper focuses on improving speech intelligibility by modifying the gain function. The Wiener filtering algorithm is similar to many algorithms used in hearing aids [2,3], which apply to the spectral envelope in each frequency bin a gain that grows with the estimated SNR. In detail, a spectral bin with a high SNR receives a high gain close to 1, while a bin with a low SNR, which is probably masked by noise, receives a low gain close to 0. The Wiener gain function has also been applied successfully for hearing-impaired listeners (possibly under ideal conditions) [4].
Choosing the frequency-specific gain function is critical for a noise-reduction algorithm. Because the target and masker signals overlap spectrally, the target signal may be over-attenuated in some instances and over-amplified in others. The gain functions of most noise-reduction algorithms can therefore introduce two types of envelope distortion: amplification distortion, which arises when the target signal is over-estimated (e.g., if $a$ is the true value of the target envelope, the estimated envelope is $a + \Delta a$, where $\Delta a$ is some positive increment), and attenuation distortion, which arises when the target signal is under-estimated (e.g., the estimated envelope is $a - \Delta a$). The perceptual effects of these two distortions on speech intelligibility cannot be assumed to be equivalent. In practice, a balance has to be struck between them, but in most cases we do not know how a specific parameter of the noise-reduction algorithm should be adjusted to obtain higher speech intelligibility. Research by Loizou et al. has shown that the two distortions affect speech intelligibility differently: regions with amplification distortion exceeding 6.02 dB seriously damage speech intelligibility, while the other regions have little effect on it [5].
To assess the impact of the speech distortion introduced by the gain function on intelligibility, Loizou et al. conducted a series of experiments and concluded that speech intelligibility is seriously damaged when the amplification distortion exceeds 6 dB. Because those experiments rely on clean speech, which is not available in practice, the algorithm cannot be applied directly in practice. This paper follows the approach of Loizou et al. but uses the enhanced speech amplitude spectrum obtained under two different assumptions, averages the simulated values, and analyzes the results under different background noises and SNRs.
The current study focuses on applying intelligibility-improving speech enhancement algorithms in practice. We first analyze the impact of the two types of speech distortion (amplification distortion and attenuation distortion) introduced by noise-suppressive gain functions. Then, based on the Wiener filter algorithm, we propose a modified algorithm to improve speech intelligibility, and we verify that it improves the intelligibility of the speech amplitude spectrum estimated by two different magnitude-squared spectrum estimators. Finally, we summarize the reasons why existing algorithms do not improve speech intelligibility and the methods for improving it.

Wiener Filtering Algorithm
The Wiener filtering algorithm is an estimator that is optimal in the minimum mean square error sense. Assume that $y(t)$ is noisy speech consisting of clean speech $x(t)$ and additive noise $d(t)$ [6], namely:
$$y(t) = x(t) + d(t) \tag{1}$$
Taking the Fourier transform of both sides gives:
$$Y(k,j) = X(k,j) + D(k,j) \tag{2}$$
where $Y(k,j)$, $X(k,j)$ and $D(k,j)$ are the amplitude spectrum representations of the time-domain signals in the frequency domain at the $j$-th frame and $k$-th frequency bin, respectively. The gain function of the Wiener filter is:
$$H(k,j) = \frac{\xi(k,j)}{1 + \xi(k,j)} \tag{3}$$
and the enhanced speech spectrum is:
$$\hat{X}(k,j) = H(k,j)\,Y(k,j) \tag{4}$$
where $\xi(k,j)$ is the a priori SNR estimate of the $k$-th spectral component, $H(k,j)$ is the gain function of the corresponding spectral component, and $\hat{X}(k,j)$ is the enhanced speech amplitude spectrum.
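The behaviour of the gain can be illustrated with a minimal sketch. It assumes the standard Wiener gain written in terms of a linear a priori SNR, xi / (1 + xi); the function names are hypothetical and chosen only for this illustration.

```python
def db_to_linear(snr_db):
    """Convert an SNR in dB to a linear power ratio."""
    return 10.0 ** (snr_db / 10.0)


def wiener_gain(xi):
    """Assumed standard Wiener gain H = xi / (1 + xi) for a linear a priori SNR xi >= 0."""
    return xi / (1.0 + xi)


def enhance_bin(noisy_magnitude, xi):
    """Enhanced magnitude of one time-frequency bin: gain times noisy magnitude."""
    return wiener_gain(xi) * noisy_magnitude


# A low-SNR bin is almost suppressed, a high-SNR bin is almost unchanged.
for snr_db in (-20, 0, 20):
    gain = wiener_gain(db_to_linear(snr_db))
    print(f"a priori SNR {snr_db:+d} dB -> gain {gain:.3f}")
```

With these three inputs the gains come out close to 0.01, 0.5 and 0.99, which matches the description of a gain near 0 for noise-dominated bins and near 1 for speech-dominated bins.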
The research in [7] shows that speech intelligibility is significantly affected when the gain is overestimated, whereas intelligibility is not affected when the gain function is underestimated in positive-SNR regions. Loizou et al. evaluated the effects of the two speech distortions introduced by gain functions through several sets of experiments [1]: amplification distortion occurs when the enhanced speech amplitude spectrum is over-estimated, and attenuation distortion occurs when it is under-estimated. Their results show that the two distortions affect speech intelligibility differently. Attenuation distortion has less effect on speech intelligibility, and when the amplification distortion is smaller than 6 dB, speech intelligibility in a stationary noise environment is nearly unaffected. One reason that existing speech enhancement algorithms do not improve speech intelligibility is that they allow amplification distortion in excess of 6 dB [1]. In addition, the estimated signal-to-noise ratio is also subject to over-estimation and under-estimation errors. Experimental results show that SNR estimation errors influence the intelligibility of the enhanced speech differently in different regions [6]: overestimating the a priori SNR does more damage to speech intelligibility when the SNR is below -10 dB, while in other regions underestimating the a priori SNR has less effect on speech intelligibility.
According to the relationship between the enhanced magnitude spectrum $\hat{X}(k,j)$ and the clean speech magnitude spectrum $X(k,j)$, the distortion can be divided into three regions.
Region I contains only attenuation distortion:
$$\hat{X}(k,j) \le X(k,j)$$
Region II contains amplification distortion between 0 and 6 dB:
$$X(k,j) < \hat{X}(k,j) \le 2X(k,j)$$
Region III contains amplification distortion in excess of 6 dB [5]:
$$\hat{X}(k,j) > 2X(k,j)$$
Speech intelligibility improves significantly when the enhanced spectrum is restricted to Regions I+II, and it declines when the spectrum is restricted to Region III. Combining Regions I and II gives the I+II constraint:
$$\hat{X}_M(k,j) = \begin{cases} \hat{X}(k,j), & \hat{X}(k,j) \le 2X(k,j) \\ 0, & \text{otherwise} \end{cases}$$
As this formula shows, under ideal conditions the speech amplitude spectrum in Region III is set to zero, and membership of Region III is decided by comparison with the original clean speech.
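The ideal-condition constraint can be sketched in a few lines. This is only an illustration: it assumes the clean magnitude is known and strictly positive, takes the 6 dB boundary as a factor of 2 in magnitude (20 log10 2 is about 6.02 dB), and uses hypothetical function names.

```python
import math


def amplification_db(enhanced, clean):
    """Level of the enhanced magnitude relative to the clean magnitude, in dB."""
    return 20.0 * math.log10(enhanced / clean)


def classify_region(enhanced, clean):
    """Return 'I', 'II' or 'III' for one time-frequency unit."""
    if enhanced <= clean:
        return "I"    # attenuation distortion only
    if enhanced <= 2.0 * clean:
        return "II"   # amplification distortion of at most about 6.02 dB
    return "III"      # amplification distortion above about 6.02 dB


def apply_region_constraint(enhanced_frame, clean_frame):
    """Keep units in Regions I+II and zero the units in Region III."""
    return [e if e <= 2.0 * c else 0.0
            for e, c in zip(enhanced_frame, clean_frame)]


enhanced = [0.5, 1.5, 2.5]
clean = [1.0, 1.0, 1.0]
print([classify_region(e, c) for e, c in zip(enhanced, clean)])
print(apply_region_constraint(enhanced, clean))
```

Because the test compares against the clean magnitude, this version is only usable in simulation, which is the limitation the next section addresses.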

The Improved Algorithm in Region III
In real applications, no original clean speech is available for deciding whether a T-F unit falls in Region III. We therefore make the following improvement, using the estimated a priori SNR and the gain function to determine the range. In Region III, the enhanced magnitude spectrum satisfies:
$$\hat{X}(k,j) > 2X(k,j)$$
Squaring both sides:
$$\hat{X}^2(k,j) > 4X^2(k,j)$$
From this inequality, the gain function for which the amplification distortion exceeds 6 dB is obtained in (17). The region of the magnitude spectrum with amplification distortion greater than 6 dB can thus be determined from the a priori SNR and the gain function in (17). T-F units falling in Region III have amplification distortion in excess of 6 dB, and measures should be taken to eliminate it. The experiments in the same article also show that attenuation distortion has a minimal effect on speech intelligibility.
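One possible route from the Region III inequality to a condition on the gain and the SNR is worked out here as an illustration. It is a sketch under stated assumptions and is not claimed to be the paper's equation (17): it assumes the enhanced spectrum is the gain times the noisy spectrum, that the magnitude-squared spectra add approximately, and that $\xi$ denotes the local ratio $X^2/D^2$.

```latex
% Assumptions introduced for this sketch (not taken from the source):
%   \hat{X} = H Y,   Y^2 \approx X^2 + D^2,   \xi = X^2 / D^2
\begin{align*}
\hat{X}^2 > 4X^2
  &\iff H^2 Y^2 > 4X^2 \\
  &\iff H^2 \left( X^2 + D^2 \right) > 4X^2
      && \text{(using } Y^2 \approx X^2 + D^2 \text{)} \\
  &\iff H^2 \left( \xi + 1 \right) > 4\xi
      && \text{(dividing by } D^2 \text{)} \\
  &\iff H > 2\sqrt{\frac{\xi}{1 + \xi}}
\end{align*}
```

Under these assumptions, a gain that never exceeds 1 can satisfy the last inequality only when $4\xi/(1+\xi) < 1$, that is, when $\xi < 1/3$ (about -4.8 dB). Amplification distortion above 6 dB would then be confined to units whose local SNR is low but whose gain, computed from an over-estimated SNR, is comparatively high.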

Two Hypothetical Conditions to Carry Out Simulation Experiments
Let
$$Y(\omega_k) = X(\omega_k) + D(\omega_k) \tag{18}$$
Expressing equation (18) in polar form [7]:
$$Y_k e^{j\theta_y(k)} = X_k e^{j\theta_x(k)} + D_k e^{j\theta_d(k)} \tag{19}$$
where $\{Y_k, X_k, D_k\}$ denote the magnitudes and $\{\theta_y(k), \theta_x(k), \theta_d(k)\}$ denote the phases at frequency bin $k$ of the noisy speech, clean speech, and noise, respectively. The MMSE estimator of the short-time power spectrum proposed by Wolfe and Godsill [7] is given in (20), where $\xi_k$ and $\gamma_k$ denote the a priori SNR and the a posteriori SNR, respectively. Substituting (20) into equation (17) gives the range of the gain function in which the amplification distortion is greater than 6 dB, and from it the complete gain function. In theory, a gain function restricted to this range helps improve speech intelligibility.
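For orientation, the short-time power spectrum estimator can be sketched in the form in which it is commonly stated for a complex Gaussian model: the estimated clean power is w (1/gamma + w) times the noisy power, with w = xi / (1 + xi). Whether this matches the exact form of equation (20) is an assumption, and the function names are hypothetical.

```python
import math


def mmse_power_gain(xi, gamma):
    """Squared gain G^2 such that the estimated clean power is G^2 * Y^2.

    xi is the a priori SNR and gamma the a posteriori SNR, both linear.
    """
    w = xi / (1.0 + xi)          # Wiener-type weight
    return w * (1.0 / gamma + w)


def mmse_magnitude(noisy_magnitude, xi, gamma):
    """Estimated clean magnitude: square root of the estimated clean power."""
    return math.sqrt(mmse_power_gain(xi, gamma)) * noisy_magnitude


# Example: xi = 1 (0 dB) and gamma = 2 give a squared gain of 0.5.
print(mmse_power_gain(1.0, 2.0))
print(mmse_magnitude(2.0, 1.0, 2.0))
```

The extra 1/gamma term makes this gain slightly larger than the plain Wiener gain at low a posteriori SNR, which is relevant when checking how often a unit ends up in the amplification-distortion range.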
Assuming that $x(n)$ and $d(n)$ are uncorrelated stationary random processes, the power spectrum of the noisy speech $P_y(\omega)$ is the sum of the power spectra of the clean speech and the noise [7,8]:
$$P_y(\omega) = P_x(\omega) + P_d(\omega) \tag{25}$$
This assumption holds in a statistical sense. Approximating the power spectrum by the magnitude-squared spectrum, equation (25) can be rewritten as:
$$Y_k^2 \approx X_k^2 + D_k^2 \tag{26}$$
where $Y_k^2$, $X_k^2$ and $D_k^2$ are the magnitude-squared spectra of the noisy speech, the clean speech and the noise, respectively.
Bayes' rule gives the posterior probability density of the clean speech magnitude-squared spectrum in (27). Computing the mean of this posterior density [9] yields the MMSE estimator in (28). Substituting (28) into equation (17) gives the range of the gain function in which the amplification distortion is greater than 6 dB. We believe that a gain function restricted to this range also helps improve speech intelligibility. The specific experimental steps used to test and verify the speech intelligibility of the modified algorithm are as follows [10]:
1. Take four kinds of speech and four kinds of noise from the MATLAB speech and noise library, and run the experiment at four signal-to-noise ratios;
2. Estimate the speech spectrum with the two MMSE estimators;
3. Calculate the a priori SNR with the "decision-directed" method [11];
4. Obtain the modified gain function according to formula (9);
5. Obtain the enhanced amplitude spectrum according to equation (4);
6. Limit the regions where the speech amplitude spectrum is distorted by more than 6 dB according to formula (17).
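Steps 3 to 5 can be sketched as follows. This is an illustration only: it uses the widely used decision-directed recursion with a smoothing factor of 0.98 and a plain Wiener gain in place of the modified gain of formula (9), and it assumes the noise power per bin is already known. The parameter values, the SNR floor and the function names are all assumptions.

```python
def decision_directed_snr(noisy_power, noise_power, alpha=0.98, xi_min=1e-3):
    """A priori SNR for every frame and bin via the decision-directed rule.

    noisy_power: list of frames, each a list of noisy power values per bin
    noise_power: list with one noise power estimate per bin (assumed known)
    """
    xi_frames = []
    prev_clean_power = [0.0] * len(noise_power)
    for frame in noisy_power:
        xi_frame, clean_power = [], []
        for k, y2 in enumerate(frame):
            gamma = y2 / noise_power[k]                 # a posteriori SNR
            xi = (alpha * prev_clean_power[k] / noise_power[k]
                  + (1.0 - alpha) * max(gamma - 1.0, 0.0))
            xi = max(xi, xi_min)                        # floor to avoid zero gain
            gain = xi / (1.0 + xi)                      # plain Wiener gain
            clean_power.append(gain * gain * y2)        # feeds the next frame
            xi_frame.append(xi)
        prev_clean_power = clean_power
        xi_frames.append(xi_frame)
    return xi_frames


def wiener_enhance(noisy_power, xi_frames):
    """Enhanced magnitudes: Wiener gain times noisy magnitude."""
    return [[(xi / (1.0 + xi)) * y2 ** 0.5 for y2, xi in zip(frame, xi_frame)]
            for frame, xi_frame in zip(noisy_power, xi_frames)]


noisy = [[4.0, 1.0], [4.0, 1.0]]   # two frames, two bins
noise = [1.0, 1.0]
xi = decision_directed_snr(noisy, noise)
print(xi)
print(wiener_enhance(noisy, xi))
```

The recursion starts from a zero clean-power estimate, so the first frame relies only on the instantaneous term; other initializations are common and would change the first few frames.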

Experimental Results and Analysis
To verify the usability of the modified version of the algorithm proposed by Loizou et al., IEEE consonants and sentences are used as the clean speech corpus [11], and the noise signals are White, Pink, Babble and F16 noise from the NOISEX-92 database. The clean speech and noise signals were recorded at a sampling rate of 8 kHz with a quantization precision of 16 bits, and the signals were processed in 20 ms frames with 50% overlap. The modified algorithm is tested with 4 kinds of clean speech signals and 4 kinds of noise signals in 4 SNR environments (0 dB, 5 dB, 10 dB and 15 dB), using two different estimated speech amplitude spectra. The average values of the objective evaluation measures (LSD, STOI and PESQ) are calculated by MATLAB simulation.
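The framing and SNR settings translate into concrete numbers: at 8 kHz, a 20 ms frame holds 160 samples and 50% overlap gives a hop of 80 samples. The sketch below shows one way to frame a signal and to scale noise to a target SNR. It is an illustration with hypothetical helper names, it measures SNR over the whole signal, and it is not the MATLAB procedure used in the experiments.

```python
import math


def frame_signal(signal, sample_rate=8000, frame_ms=20, overlap=0.5):
    """Split a sample list into overlapping frames (160 samples, hop 80 by default)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(frame_len * (1.0 - overlap))
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech, scaled so the overall SNR equals snr_db."""
    clean_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(clean, noise)]


frames = frame_signal(list(range(800)))      # 100 ms of dummy samples
print(len(frames), len(frames[0]))           # number of frames, frame length
for snr_db in (0, 5, 10, 15):                # the four test conditions
    noisy = mix_at_snr([1.0, -1.0] * 80, [0.5, 0.5] * 80, snr_db)
    print(snr_db, round(noisy[0], 3))
```

In a full experiment each frame would also be windowed and transformed to the frequency domain before the gain is applied; those steps are omitted here to keep the sketch dependency-free.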
To quantify the performance of the improved algorithm accurately, three objective evaluation criteria are used to assess the speech amplitude spectrum in the two hypothetical situations [12]. The simulation experiments are carried out under different background noises and signal-to-noise ratios, and the results are compared with the ideal state of Loizou. The objective evaluation criteria [13] are Log-Spectral Distance (LSD), Short-Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ). Table 1 shows the results of the analysis using the Log-Spectral Distance. The LSD indicates the degree of closeness between the enhanced speech and the clean speech: the smaller the LSD value of an algorithm, the closer the enhanced speech is to the original speech [14], and the better the enhancement effect.
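A common way to compute the log-spectral distance is sketched below: for each frame, take the root-mean-square difference of the log magnitude spectra in dB, then average over frames. The paper's exact expression may differ in normalization or in the use of power rather than magnitude spectra, and the small floor value is an assumption added to avoid taking the logarithm of zero.

```python
import math


def log_spectral_distance(clean_frames, enhanced_frames, floor=1e-10):
    """Mean over frames of the RMS log-spectral difference in dB.

    Each argument is a list of frames, and each frame is a list of
    magnitude values per frequency bin.
    """
    per_frame = []
    for clean, enhanced in zip(clean_frames, enhanced_frames):
        squared_diffs = [
            (20.0 * math.log10(max(c, floor) / max(e, floor))) ** 2
            for c, e in zip(clean, enhanced)
        ]
        per_frame.append(math.sqrt(sum(squared_diffs) / len(squared_diffs)))
    return sum(per_frame) / len(per_frame)


clean = [[1.0, 2.0, 4.0], [1.0, 1.0, 1.0]]
print(log_spectral_distance(clean, clean))                    # identical spectra
print(log_spectral_distance([[1.0, 1.0]], [[10.0, 10.0]]))    # 20 dB offset in every bin
```

Identical spectra give a distance of zero, and a uniform 20 dB level offset gives a distance of 20 dB, which is consistent with reading smaller LSD values as closer to the clean speech.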
As Table 1 shows, at 0 dB and 5 dB SNR the LSD value of amplitude spectrum 2 is smaller than that of amplitude spectrum 1, so amplitude spectrum 2 with the modified algorithm is closer to the original speech. At 10 dB and 15 dB SNR the situation is exactly the opposite: the LSD value of amplitude spectrum 1 is smaller than that of amplitude spectrum 2. In F16 noise [15], both amplitude spectra have smaller LSD values than in the other noise types. Under the same background noise, the higher the SNR, the closer the enhanced speech is to the original speech. Table 2 shows the results of the analysis using Short-Time Objective Intelligibility. STOI is an indicator for evaluating the intelligibility of enhanced speech [12]. Its values lie in a small range, generally between 0 and 1, and only a small amount of speech is needed to obtain the required data, which makes it convenient and fast; it has become the preferred objective evaluation measure. The larger the STOI value of an algorithm, the higher the intelligibility of the speech and the better the performance of the improved algorithm.
As Table 2 shows, in all SNR environments the STOI value of amplitude spectrum 1 is larger than that of amplitude spectrum 2, so amplitude spectrum 1 with the modified algorithm is closer to the ideal state. The STOI values differ little across background noises. Table 3 shows the results of the Perceptual Evaluation of Speech Quality. PESQ is an indicator that uses the overall loudness error between the estimated clean speech and the enhanced speech to judge the overall quality of the output speech. PESQ values range from -0.5 to 4.5, and the larger the PESQ value of an algorithm, the better the speech quality.
As Table 3 shows, in all SNR environments the PESQ value of amplitude spectrum 2 is larger than that of amplitude spectrum 1, so amplitude spectrum 2 with the modified algorithm is closer to the ideal state. Under the same background noise, the higher the SNR, the better the quality of the enhanced speech. In White noise, both amplitude spectra have larger PESQ values than in the other noise types.
In most cases, the Wiener filtering noise-reduction algorithm is used to process noise-corrupted speech, and the gain-induced distortion falls into one of three regions: Region I, involving only attenuation distortion; Region II, involving only amplification distortion below 6 dB; and Region III, involving amplification distortion in excess of 6 dB. Large improvements in intelligibility, relative to noise-corrupted speech, are obtained when the noise-suppressed speech contains only attenuation distortion. When attenuation and amplification distortion exist simultaneously, significant intelligibility gains can still be obtained provided the amplification distortion is less than 6 dB.

Conclusion
This paper draws on previous research and explains that existing algorithms do not improve speech intelligibility because they allow amplification distortions in excess of 6 dB. Amplification distortion in excess of 6 dB should therefore be eliminated, or at least properly controlled. Because clean speech is not available in practice, the experiments of Loizou et al. cannot be applied directly to real situations. This paper uses the amplitude spectrum obtained under two hypothetical conditions to carry out simulation experiments and compares the results with the ideal state. The objective evaluation indices show that speech intelligibility is indeed improved and that the improved method can be applied in practice. In recent years, deep neural network algorithms have also been applied to speech enhancement [16], and speech enhancement algorithms aimed at improving speech intelligibility also have good development prospects [17,18]. Identifying and eliminating amplification distortions greater than 6 dB more effectively should bring further gains. In future research, speech intelligibility could be combined with deep neural network algorithms to estimate accurately the a priori SNR and the distortion regions that affect intelligibility [19][20][21], so as to obtain clean speech with better quality and intelligibility.