Enhanced Automatic Speech Recognition System Based on Enhancing Power-Normalized Cepstral Coefficients

Many new consumer applications are based on the use of automatic speech recognition (ASR) systems, such as voice command interfaces, speech-to-text applications, and data entry processes. Although ASR systems have remarkably improved in recent decades, speech recognition performance still significantly degrades in noisy environments. Developing a robust ASR system that can work in real-world noise and other acoustically distorting conditions is an attractive research topic. Many advanced algorithms have been developed in the literature to deal with this problem; most of them are based on modeling the behavior of the human auditory system when perceiving noisy speech. In this research, the power-normalized cepstral coefficient (PNCC) system is modified to increase robustness against different types of environmental noise, where a new technique based on gammatone channel filtering combined with channel bias minimization is used to suppress the noise effects. The TIDIGITS database is utilized to evaluate the performance of the proposed system in comparison to state-of-the-art techniques in the presence of additive white Gaussian noise (AWGN) and seven different types of environmental noise. In this research, each utterance is recognized as one word from a set of only 11 possibilities. The experimental results showed that the proposed method provides significant improvements in recognition accuracy at low signal-to-noise ratios (SNRs). In the case of subway noise at SNR = 5 dB, the proposed method outperforms the mel-frequency cepstral coefficient (MFCC) and relative spectral (RASTA)–perceptual linear predictive (PLP) methods by 55% and 47%, respectively. Moreover, the recognition rate of the proposed method is higher than the gammatone frequency cepstral coefficient (GFCC) and PNCC methods in the case of car noise. It is enhanced by 40% in comparison to the GFCC method at SNR = 0 dB, and it is improved by 20% in comparison to the PNCC method at SNR = −5 dB.


Introduction
Despite the advanced signal processing techniques used nowadays, existing automatic speech recognition (ASR) systems still cannot match the performance of the human auditory system. This has motivated many researchers to develop robust feature extraction techniques based on modeling the physiological behavior of human audition, so that speech can be recognized in the presence of environmental noise and background interference.
In the literature, numerous approaches have been proposed to address these problems. They are mainly categorized into two approaches [1]: the feature-space approach and the model-space approach. The feature-space approach modifies the auditory features to suppress noise; it applies different adaptation techniques to the test features in order to match the training features. In the model-space approach, by contrast, the acoustic model parameters are adjusted to decrease the noise effects. Although the model-space approach achieves higher accuracy than the feature-space approach, it requires more computation time. In both approaches, the hidden Markov model (HMM) [2,3] is mostly used as the statistical machine learning technique. Numerous feature extraction techniques have recently been applied in ASR systems. For instance, the mel-frequency cepstral coefficient (MFCC) [4] and perceptual linear predictive (PLP) [5] techniques are considered the most widely used in speech recognition and speaker identification systems.
In the literature, the PNCC system has several advantages [15,16]. The PNCC system reported in [15] demonstrates effective robustness against different sources of environmental disturbance, such as background additive noise, linear channel distortion, and reverberation, in comparison to the MFCC and PLP methods. In this paper, the PNCC feature extraction method [15,17] was modified to obtain acoustic features that remain robust to noise at low SNRs without degrading performance on clean speech. This paper is organized as follows: Section 2 discusses the proposed system in detail; Section 3 shows the experimental work and results; and Section 4 summarizes the outcomes of the paper and future work.

Proposed Enhanced PNCC Algorithm
In this section, the modifications that were applied to the PNCC system are explained in detail. Figure 1 illustrates a comparison between the block diagram structures of the proposed PNCC method and state-of-the-art methods such as PNCC, GFCC, RASTA-PLP, and MFCC [15]. As shown in Figure 1, the block diagrams are divided into three stages: pre-processing, noise suppression, and final processing.

Preprocessing
Preprocessing is the first stage in the block diagram. In this stage, a high-pass pre-emphasis finite impulse response (FIR) filter H(z) = 1 − 0.97z^(−1) is applied to the input speech waveform in all methods except RASTA-PLP. Then, the pre-emphasized waveform is divided into short overlapping frames with a 25.6 ms frame duration and a 10 ms shift between frames. Each frame is multiplied by a Hamming window. In the next stage, the fast Fourier transform (FFT) Y(k) is applied with 256-point resolution. The power spectral density (PSD) is obtained by computing the magnitude-square of the output frequencies as |Y(k)|^2.
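The preprocessing chain above can be sketched in a few lines of NumPy. An 8 kHz sampling rate is assumed here (consistent with the 100 Hz to 4 kHz filter range used later); the frame length, shift, pre-emphasis coefficient, and FFT size follow the values stated in the text.

```python
import numpy as np

def preprocess(signal, fs=8000, pre_coeff=0.97,
               frame_ms=25.6, shift_ms=10.0, nfft=256):
    """Pre-emphasis, framing, Hamming windowing, and per-frame PSD."""
    # High-pass pre-emphasis FIR filter H(z) = 1 - 0.97 z^-1
    emphasized = np.append(signal[0], signal[1:] - pre_coeff * signal[:-1])

    frame_len = int(round(fs * frame_ms / 1000))   # 25.6 ms -> 205 samples at 8 kHz
    shift = int(round(fs * shift_ms / 1000))       # 10 ms shift between frames
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // shift)

    window = np.hamming(frame_len)
    psd = np.empty((n_frames, nfft // 2 + 1))
    for m in range(n_frames):
        frame = emphasized[m * shift : m * shift + frame_len] * window
        spectrum = np.fft.rfft(frame, n=nfft)      # 256-point FFT
        psd[m] = np.abs(spectrum) ** 2             # |Y(k)|^2 per frame
    return psd
```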
The gammatone filter banks provide more robust features than the mel filter banks and Bark filters used in the MFCC and RASTA-PLP methods, respectively. Therefore, a set of 25 gammatone filters [18,19] were generated from the frequency response of the gammatone kernel basis functions in Equation (1), with center frequencies ranging from 100 Hz to 4 kHz, as shown in Figure 2:

g[n] = a n^(θ−1) e^(−2πbn) cos(2πf_c n + ϕ),   (1)

where a is the amplitude, n is the discrete-time index, θ is the filter's order, b is the filter's bandwidth, f_c is the center frequency, and ϕ is the phase of the carrier. In order to decrease the computational time, the gammatone filter response is set to zero wherever it falls below 0.5 percent of its maximum value. The magnitude of each gammatone filter is then multiplied by the PSD of each frame, and the summation is calculated according to the following equation:

Q[m, l] = Σ_{k=0}^{M−1} |Y[m, k]|^2 G_l(k),   (2)

where m is the frame index, l is the gammatone filter index, M is the spectrum resolution, and G_l(k) is the frequency-domain function of filter bank l.
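A sketch of this filter bank computation follows, under some stated assumptions: the ERB-scale spacing of the center frequencies and the 4th filter order are conventional choices from the gammatone literature [18,19] rather than specifics given in the text, and the kernel support is limited to one FFT window.

```python
import numpy as np

def gammatone_bank(n_filters=25, nfft=256, fs=8000,
                   fmin=100.0, fmax=4000.0, order=4):
    """Squared magnitude response of a gammatone filterbank on FFT bins.

    Uses the kernel a*n^(theta-1)*exp(-2*pi*b*n)*cos(2*pi*fc*n), with the
    bandwidth b tied to the ERB of each center frequency (an assumption).
    """
    # ERB-rate spacing (Glasberg & Moore): 21.4*log10(4.37*f/1000 + 1)
    def hz_to_erb(f): return 21.4 * np.log10(4.37e-3 * f + 1.0)
    def erb_to_hz(e): return (10 ** (e / 21.4) - 1.0) / 4.37e-3
    fc = erb_to_hz(np.linspace(hz_to_erb(fmin), hz_to_erb(fmax), n_filters))

    n = np.arange(1, nfft + 1) / fs                # kernel support = one FFT window
    bank = np.empty((n_filters, nfft // 2 + 1))
    for l, f in enumerate(fc):
        b = 1.019 * 24.7 * (4.37e-3 * f + 1.0)     # ERB bandwidth of the channel
        g = n ** (order - 1) * np.exp(-2 * np.pi * b * n) * np.cos(2 * np.pi * f * n)
        resp = np.abs(np.fft.rfft(g, n=nfft)) ** 2
        resp /= resp.max()                          # normalize peak to 1
        # Drop bins below 0.5% of the peak, as in the text
        # (applied here to the squared response, a simplification).
        resp[resp < 0.005] = 0.0
        bank[l] = resp
    return bank

def channel_power(psd, bank):
    """Q[m, l] = sum over k of |Y[m, k]|^2 * G_l(k) for each frame and channel."""
    return psd @ bank.T
```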

Noise Suppression
Most noise suppression techniques used in speech recognition systems are based on applying different filtering techniques across frequency frames. Increasing the number of analysis frames provides high performance for modeling and eliminating noise effects [6,20-22]. Many algorithms are based on high-pass or band-pass filtering along the frame axis, such as the RASTA-PLP feature extraction technique, which uses RASTA filtering: a band-pass filter designed to remove slowly varying background noise across frames [6]. Since the PSD must by definition be non-negative, one drawback of such high-pass filtering is that it can produce negative values [6]. For the noise suppression technique in the standard PNCC system, the medium-time power is first obtained by calculating a running average over five consecutive frames in each frequency channel; this amounts to low-pass filtering along the frame axis. Secondly, asymmetric noise suppression filtering is applied, combined with temporal masking. This technique is used to remove the impact of slowly varying background noise and to define excitation and non-excitation states. The excitation state represents the filtered speech activity state, while the non-excitation state represents non-voice activity, determined from the estimated floor level. Then, weight smoothing is applied by calculating a running average across channels. Finally, the original time-frequency representation is modulated by the constructed transfer function. In the proposed PNCC method, however, the large-time power is used instead of the medium-time power, and then channel bias minimization is applied, as explained in the following sections.
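The asymmetric noise-floor filtering mentioned above can be illustrated with a minimal sketch. The two forgetting factors and the initialization below are illustrative values, not the tuned constants of the original PNCC system; the point is only the rise-slow/fall-fast behavior that lets the output track the lower envelope (the noise floor) of the per-channel power.

```python
import numpy as np

def asymmetric_lowpass(q, lam_a=0.999, lam_b=0.5):
    """Asymmetric first-order lowpass of a per-channel power sequence q.

    Rises slowly (lam_a close to 1) when the input exceeds the current
    output, and falls quickly (lam_b smaller) when it drops below, so the
    output tracks the slowly varying noise floor rather than speech peaks.
    """
    out = np.empty_like(q)
    out[0] = 0.9 * q[0]                    # initialization is an assumption
    for m in range(1, len(q)):
        if q[m] >= out[m - 1]:
            out[m] = lam_a * out[m - 1] + (1 - lam_a) * q[m]   # slow rise
        else:
            out[m] = lam_b * out[m - 1] + (1 - lam_b) * q[m]   # fast fall
    return out
```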

Large-Time Power Calculation
In the proposed method, the large-time power filter is used as a low-pass averaging filter, similar to the medium-time power filter, but it is applied over a higher number of frames. This filter is applied by computing the running average power P[m, l] over 2M + 1 consecutive frames as in the following equation:

P[m, l] = (1 / (2M + 1)) Σ_{m′=m−M}^{m+M} Q[m′, l],   (3)

where m represents the frame index, l is the gammatone channel index, and Q[m′, l] is the gammatone-filtered power of frame m′ in channel l. As shown in Figure 3a, the power values of the gammatone spectrogram for clean speech do not change rapidly across consecutive frames. On the other hand, the power values of additive environmental and white noise change more frequently. Thus, these noises cause sudden variations in the power values over consecutive frames of each gammatone channel, as shown in Figure 3b,c. These sudden changes vary in each gammatone channel according to the type of noise. Additionally, pre-emphasis boosts the additive noise together with the speech at high frequencies. Therefore, in this stage, the averaging filter over each gammatone channel smooths these sudden changes in the power values. Furthermore, the number of averaging window frames 2M + 1 affects recognition performance. A large averaging window causes a blurring effect that can destroy speech information, which degrades system performance, especially in the case of undistorted speech utterances; it is also computationally demanding. On the other hand, a small averaging window leads to incomplete removal of the sudden power changes associated with some background noise conditions. Therefore, in this paper, the value of the coefficient M is set to 5; it was chosen experimentally by selecting the value that provides high performance at different noise levels.
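A minimal sketch of this running average, applied independently to each gammatone channel. Truncating the window at the utterance boundaries is an assumption, since the text does not state how edge frames are handled.

```python
import numpy as np

def large_time_power(q, M=5):
    """Running average of channel power over 2M + 1 consecutive frames.

    q has shape (n_frames, n_channels); the average is taken per channel
    along the frame axis, with the window truncated at the boundaries.
    """
    n_frames = q.shape[0]
    out = np.empty_like(q, dtype=float)
    for m in range(n_frames):
        lo, hi = max(0, m - M), min(n_frames, m + M + 1)
        out[m] = q[lo:hi].mean(axis=0)   # average over up to 2M+1 frames
    return out
```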

Channel Bias Minimizing
Most of the speech information is concentrated at low and medium frequencies, while the distribution of the noise spectrum depends on the type of noise. After gammatone filtering, the PSD values of each noisy speech frame are smoothed according to each gammatone filter function. The large-time power averaging filter then smooths the power values of each gammatone channel, but differently for each channel. Since the energy of additive noise changes faster than the speech energy across frames, the smoothing spreads the noise energy more than the speech energy along each gammatone channel, and therefore produces a channel bias. This bias depends on the spectral distribution of the noise more than on that of the speech. Since the noise PSD is usually not uniform, specifically in the case of environmental noise, the bias value varies from one gammatone channel to another. In Equation (4), the channel bias effect is minimized by subtracting from the channel values the minimum value within each channel, multiplied by a constant bias factor d, where 0 < d < 1:

P̂[m, l] = P[m, l] − d · min_{m′}(P[m′, l]).   (4)

This constant bias factor depends on the value of the coefficient M. It was also chosen experimentally, and its value is 0.6.
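One plausible reading of this step, as a sketch: the bias of each channel is estimated by the minimum power within that channel over the utterance, and a fraction d of it is subtracted.

```python
import numpy as np

def minimize_channel_bias(p_bar, d=0.6):
    """Subtract d times the per-channel minimum from each channel.

    p_bar is the large-time-averaged power, shape (n_frames, n_channels).
    Since 0 < d < 1 and the power is non-negative, the result stays
    non-negative.
    """
    floor = p_bar.min(axis=0, keepdims=True)   # per-channel minimum over frames
    return p_bar - d * floor
```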

Mean Power Normalization
The human auditory processing system contains an automatic gain control that decreases the effect of amplitude variation in the perceived acoustic wave. In the PNCC system, the power function nonlinearity explained in the next stage makes the produced power values sensitive to absolute power variation, although the impact of this variation is usually small. Mean power normalization is applied to minimize the potential effect of amplitude scaling. The input power is normalized by dividing the perceived power by a running average of the total power. Initially, the estimated mean power µ[m] is calculated from the following equation:

µ[m] = λ_µ µ[m − 1] + ((1 − λ_µ) / L) Σ_{l=0}^{L−1} P[m, l],   (5)

where λ_µ is a forgetting factor whose value is 0.999 and L is the number of gammatone channels. Then the normalized power U[m, l] is computed from the estimated running power µ[m] as per the following equation:

U[m, l] = k P[m, l] / µ[m],   (6)

where k is an arbitrary constant [15,17].
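This normalization can be sketched directly; the initialization of the running mean before the first frame is an assumption, since it is not specified here.

```python
import numpy as np

def mean_power_normalize(p, lam=0.999, k=1.0):
    """Mean power normalization with a forgetting factor.

    Running mean: mu[m] = lam * mu[m-1] + (1 - lam)/L * sum over l of p[m, l];
    normalized power: u[m, l] = k * p[m, l] / mu[m]. The running mean is
    seeded with the mean power of the first frame (an assumption).
    """
    n_frames, L = p.shape
    mu = np.empty(n_frames)
    prev = p[0].mean()                     # seed value (not specified in text)
    for m in range(n_frames):
        prev = lam * prev + (1 - lam) * p[m].sum() / L
        mu[m] = prev
    return k * p / mu[:, None]
```

Because µ[m] scales linearly with the input, multiplying the whole utterance by a constant leaves the normalized output essentially unchanged, which is the point of the stage.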

Power Function Nonlinearity
This stage simulates the compressive relation between the intensity of sound in decibels and the spontaneous firing rate of the auditory nerve [23,24]; it is part of the physiological model of human auditory processing. In the RASTA-PLP system, the power function nonlinearity is simulated as a cubic-root amplitude compression, while in the MFCC and GFCC systems this compressive relation is logarithmic. Based on experiments [15,25] to find a rate-intensity relation that fits the physiological curve of [23], it has been found that a power-law approximation with exponent 1/15 provides a good fit to the physiological data while optimizing recognition accuracy for noisy speech compared to logarithmic and cubic-root curves.
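The power-law compression itself is a one-liner, shown here with a note on how it differs from the logarithm:

```python
import numpy as np

def power_nonlinearity(u, exponent=1.0 / 15.0):
    """Power-law compression u^(1/15) used by PNCC in place of log or
    cubic-root compression."""
    return np.power(u, exponent)

# A 60 dB spread in power (1 to 1e6) maps to an output factor of only
# about 2.5, and, unlike log(u), the curve stays finite as u -> 0.
```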

Final Processing
In the last stage, the discrete cosine transform (DCT) is calculated to obtain the speech cepstral features. In order to decrease the convolutive channel distortion effect, the cepstral mean normalization (CMN) [26] is applied by moving all of the cepstral features to have a zero mean.
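A sketch of this final stage; the orthonormal DCT-II is written out in NumPy so the block is self-contained, and 13 cepstra are kept to match the feature count reported in the experiments.

```python
import numpy as np

def dct2(x):
    """Orthonormal DCT-II along the last axis (equivalent to a standard
    type-2 DCT with 'ortho' normalization)."""
    N = x.shape[-1]
    n = np.arange(N)
    # basis[n, k] = cos(pi * (n + 0.5) * k / N)
    basis = np.cos(np.pi * (n[:, None] + 0.5) * n[None, :] / N)
    c = x @ basis
    c *= np.sqrt(2.0 / N)
    c[..., 0] /= np.sqrt(2.0)
    return c

def cepstral_features(compressed, n_ceps=13):
    """DCT across channels to obtain cepstra, then cepstral mean
    normalization (CMN): subtract the per-coefficient mean over frames."""
    ceps = dct2(compressed)[:, :n_ceps]
    return ceps - ceps.mean(axis=0, keepdims=True)
```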

Experimental Work
In this paper, an excerpt of the TIDIGITS database [27] was utilized to evaluate the performance of the proposed method in comparison to the state-of-the-art techniques. This excerpt consists of 207 speakers, each of whom pronounces 11 digits twice. It was partitioned into two subsets: the training set contained 37 men and 57 women, while the testing set contained 56 men and 57 women. Eight different types of noise were added to the testing datasets at SNRs from −5 dB to 20 dB with a step size of 5 dB. These noises were added using the same random sequence. One of them was additive white Gaussian noise (AWGN), and the rest were AURORA environmental noises.
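Mixing noise into a test utterance at a prescribed SNR can be sketched as follows; picking a random offset into the noise recording is an assumption, since the exact procedure used here is not specified.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db, rng=None):
    """Scale a noise segment so the mixture has the requested SNR in dB.

    The noise segment is drawn from a random offset into a longer noise
    recording (an assumption), then scaled so that
    10*log10(P_speech / P_noise) equals snr_db.
    """
    if rng is None:
        rng = np.random.default_rng()
    start = rng.integers(0, len(noise) - len(speech) + 1)
    seg = noise[start:start + len(speech)].astype(float)
    p_speech = np.mean(speech.astype(float) ** 2)
    p_noise = np.mean(seg ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * seg
```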
Likewise, in the PNCC method, as well as in the MFCC [28], RASTA-PLP [28], and GFCC methods, each utterance was divided into 25.6 ms overlapping frames with 10 ms shifts between frames. Then, each frame was multiplied by a Hamming window. After that, the FFT was applied with 256-point resolution. In the last stage, 13 features were obtained for each technique. After applying CMN, the ∆ and ∆∆ features were calculated; the total number of extracted features was 39.
For all of the feature extraction techniques in this paper, the Gaussian mixture model-hidden Markov model (GMM-HMM) technique was used to generate acoustic models of each word [3]. The Carnegie Mellon University (CMU) Sphinx lexical dictionary was used with three Gaussian mixture components per phoneme. All the feature extraction methods were trained with noise-free utterances and tested with noise-free and noisy utterances. The database structure was a single word; therefore, the language model contained only one word, and in this case the recognition accuracy was equal to the word recognition rate (WRR). Figure 4 presents the recognition accuracy graphically for the proposed PNCC method in comparison to the state-of-the-art PNCC system, as well as the most commonly used baseline systems: GFCC, MFCC, and RASTA-PLP.
The first graph shows the accuracy obtained in the case of AWGN, while the rest of the graphs are for the different types of environmental noise. The performance of these systems for each noise is shown for clean data and at six SNR levels from −5 dB to 20 dB with a step size of 5 dB. As seen in Figure 4, the MFCC and RASTA-PLP systems showed the lowest robustness against noise, and their relative ranking was interchangeable: with some types of noise the performance of MFCC was higher than RASTA-PLP, the reverse was true in other circumstances, and in the remaining cases the ranking varied with the noise level. Replacing the mel triangular filter banks of MFCC processing with the gammatone filter banks of the GFCC system provided measurable improvements in recognition accuracy for all types of degradation; however, for some types of noise, the GFCC performance fell below both of these systems at an SNR of −5 dB. Using the time-frequency noise suppression technique, mean power normalization, and power function nonlinearity with the 1/15 root remarkably improved the recognition accuracy of the PNCC systems in comparison to the previous three systems. The recognition accuracy of the proposed enhanced PNCC outperformed all the systems for all noise types, specifically at low SNRs, whereas at high SNR values the recognition accuracy was almost unchanged.

Results and Discussion
Moreover, the proposed noise robustness technique did not degrade system performance in the case of undistorted speech waveforms; the blurring effect of the large-time power averaging filter was almost unnoticeable. Figure 5 provides an in-depth analysis of the percentage improvement of the proposed method over the other methods at low SNRs, namely −5 dB, 0 dB, and 5 dB. An overall improvement in the percentage recognition rate for all types of noise was obtained over the MFCC, RASTA-PLP, and GFCC methods, and lastly over the PNCC method. In the SNR = −5 dB bar chart, the highest improvement in recognition rate in comparison to the RASTA-PLP and GFCC methods was obtained in the street noise condition.
The proposed method outperformed the RASTA-PLP method by 37.21%, while it was better than the GFCC method by 38.66%. In contrast, the highest improvement in comparison to the MFCC and PNCC methods occurred in the case of car noise: the proposed method improved on the MFCC method by 33.75% and on the PNCC method by 19.51%. In the SNR = 0 dB bar chart, the highest percentage enhancement in recognition rate in comparison to the MFCC and PNCC methods was obtained in the subway noise condition.
The proposed method was better than the MFCC method by 49.75%, while it outperformed the PNCC method by 14.4%. Otherwise, the highest improvement compared to the RASTA-PLP and GFCC methods occurred in the case of car noise: the proposed method improved on the RASTA-PLP method by 46.98% and on the GFCC method by 40.02%. In the SNR = 5 dB bar chart, the highest improvement in comparison to the MFCC, RASTA-PLP, and PNCC methods was obtained in the subway noise condition: the proposed method was better than the MFCC method by 55.72%, the RASTA-PLP method by 47.87%, and the PNCC method by 8.16%. In contrast, the highest improvement in comparison to the GFCC method occurred in the case of exhibition noise, where it improved on the GFCC method by 24.85%.