Signal to Noise Ratio Estimation

The signal-to-noise ratio (SNR) is an important feature in determining the quality of audio data. This is particularly important in speech recognition technology, since it is well known that recognition performance is strongly influenced by the SNR. Unfortunately, in most applications the SNR cannot be easily derived since the noise energy is not known. Further, the question arises as to what is "signal" and what is "noise". For example, would a cough or breath noise be considered part of the "signal" in spontaneous speech? Does it convey information? With these problems in mind, we must define a statistically oriented method which makes a best estimate of the SNR given the a priori knowledge of the speech data. One such method uses a short-term analysis of the speech signal to statistically characterize the signal and the noise.
[Equation: signal-to-noise ratio formula]
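
For reference, the conventional definition of the SNR in decibels is the log-scale ratio of signal energy to noise energy. It is stated here as an assumption about the notation, not as a reproduction of the original equation, and the symbols E_s (signal energy) and E_n (noise energy) are introduced only for illustration:

$$\mathrm{SNR} = 10 \log_{10}\!\left(\frac{E_s}{E_n}\right)\ \text{dB}$$

For example, a signal energy 1000 times the noise energy corresponds to an SNR of 30 dB.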

Figure 1. A speech signal whose SNR is approximately 30 dB.

The challenge is to compute the signal and noise energies without any a priori knowledge about the data in the audio file. Consider a typical speech signal, such as that shown in Figure 1. The method used to estimate the signal's SNR is based on a histogram analysis of the short-term energy. Ideally, we would expect to see two major modalities in the energy histogram, as shown in Figure 2.

These two modalities correspond to the nominal noise energy and the nominal signal plus noise energy, respectively. From the cdf shown in Figure 2, we can define thresholds that select the points on the cdf which we expect to correspond to the signal plus noise energy and to the noise energy. Typically we use thresholds of 80% for the signal plus noise and 20% for the noise (85%/15% and 95%/15% are also popular choices). These values have been derived by experienced speech researchers based on analyses of many types of data.
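
As an illustration of reading energy levels off the cdf, here is a minimal Python sketch. It assumes the short-term energies are already available as an array. The function name `energy_at_cdf_fraction`, the bin count, and the choice to return the upper edge of the first bin at which the cdf reaches the threshold are all assumptions made for this example, not details of the released software.

```python
import numpy as np

def energy_at_cdf_fraction(energies, fraction, n_bins=100):
    """Return the energy level at which the histogram cdf first reaches `fraction`.

    Hypothetical helper: the bin count and edge convention are illustrative choices.
    """
    # Histogram of the short-term energies, normalized to a pdf.
    counts, edges = np.histogram(energies, bins=n_bins)
    pdf = counts / counts.sum()
    # The running sum of the pdf is the cdf evaluated at each bin's upper edge.
    cdf = np.cumsum(pdf)
    # First bin whose cumulative probability is at least `fraction`.
    idx = min(int(np.searchsorted(cdf, fraction)), n_bins - 1)
    return edges[idx + 1]

# Example: a bimodal set of energies (in dB), half near 10 and half near 40.
energies_db = np.concatenate([np.full(50, 10.0), np.full(50, 40.0)])
noise_level = energy_at_cdf_fraction(energies_db, 0.20)
signal_plus_noise_level = energy_at_cdf_fraction(energies_db, 0.80)
```

In this toy example the 20% point falls in the low-energy mode and the 80% point falls in the high-energy mode, which is the behavior the thresholds are meant to capture.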

With this methodology we define the estimated SNR based on the energy levels corresponding to the points in the cdf that satisfy our thresholds:

[Equation: signal-to-noise ratio formula]
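
One plausible reading of this estimate can be sketched in Python as follows. The sketch assumes that the energies are in linear units and that the signal energy is obtained by subtracting the noise-level energy from the signal-plus-noise-level energy; this subtraction is an assumption, since the exact formula is not reproduced here. It uses `np.percentile` to read the empirical cdf directly instead of building a histogram, and the function name `estimate_snr_db` is hypothetical.

```python
import numpy as np

def estimate_snr_db(frame_energies, signal_pct=80.0, noise_pct=20.0):
    """Estimate the SNR in dB from short-term frame energies (linear units).

    Assumed form: SNR = 10 * log10((E_sn - E_n) / E_n), where E_sn and E_n are
    the energies at the signal-plus-noise and noise cdf thresholds.
    """
    energies = np.asarray(frame_energies, dtype=float)
    # Energy levels at the chosen cdf thresholds (percentiles of the data).
    e_signal_plus_noise = np.percentile(energies, signal_pct)
    e_noise = np.percentile(energies, noise_pct)
    # Assumption: signal and noise energies add, so subtract the noise level.
    e_signal = e_signal_plus_noise - e_noise
    if e_noise <= 0.0 or e_signal <= 0.0:
        raise ValueError("energy levels do not permit an SNR estimate")
    return 10.0 * np.log10(e_signal / e_noise)

# Example: half the frames at noise energy 1.0, half at signal plus noise energy 1001.0.
example_energies = [1.0] * 50 + [1001.0] * 50
snr = estimate_snr_db(example_energies)
```

With these example energies the 20% point is 1.0 and the 80% point is 1001.0, so the sketch returns approximately 30 dB.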

Figure 2. An energy probability density function (pdf) and the corresponding cumulative distribution function (cdf).

There is one detail that we have overlooked to this point: how do we get the short-term measurements? This requires that we decide on an optimum window and frame duration to yield consistent and accurate SNR estimates for the given data set. For speech signals, we typically use a 30 msec window duration and a 20 msec frame duration. Pre-emphasis is also applied, and a Hamming window can optionally be used. The pre-emphasis filter is given by

$$y(n) = x(n) - \mu\, x(n-1)$$

where x(n) is the input signal, y(n) is the pre-emphasized signal, and µ is typically around 0.95. The default window used is rectangular, but can optionally be set to a Hamming window. The Hamming window is used to smooth abrupt discontinuities at the frame boundaries.
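
The short-term analysis described above can be sketched in Python as follows. This is a minimal illustration and not the released implementation. It assumes a first-order pre-emphasis filter of the form y(n) = x(n) - µ x(n-1), with the first sample passed through unchanged. It also assumes that each window starts at the beginning of its frame, that frame energy is the sum of squared windowed samples, and that trailing samples which do not fill a window are dropped. The function name `short_term_energy` and its parameters are hypothetical.

```python
import numpy as np

def short_term_energy(x, sample_rate, window_ms=30.0, frame_ms=20.0,
                      mu=0.95, use_hamming=False):
    """Return one energy value per frame for the signal `x`."""
    x = np.asarray(x, dtype=float)
    # Pre-emphasis: y[n] = x[n] - mu * x[n-1]; the first sample is kept as-is.
    y = np.append(x[:1], x[1:] - mu * x[:-1])
    # Window length and frame advance in samples.
    win_len = int(round(sample_rate * window_ms / 1000.0))
    hop = int(round(sample_rate * frame_ms / 1000.0))
    # Rectangular window by default, Hamming window on request.
    window = np.hamming(win_len) if use_hamming else np.ones(win_len)
    # Number of complete windows that fit in the signal.
    n_frames = 1 + (len(y) - win_len) // hop if len(y) >= win_len else 0
    energies = np.empty(n_frames)
    for i in range(n_frames):
        segment = y[i * hop : i * hop + win_len] * window
        energies[i] = np.sum(segment ** 2)
    return energies

# Example: one second of a constant signal sampled at 8 kHz.
frame_energies = short_term_energy(np.ones(8000), sample_rate=8000)
```

At 8 kHz the 30 msec window spans 240 samples and the 20 msec frame advance is 160 samples, so consecutive windows overlap by 80 samples.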

You can download the following from our site:

Software: Signal-to-Noise ratio source code in compressed gzip format. This release includes a sample data file, sample scripts, and installation instructions.
Tutorial: An overview of the theory behind this approach.