The signal-to-noise ratio (SNR) is an important feature in determining the
quality of audio data. This is particularly important in speech recognition
technology since it is well known that recognition performance is strongly
influenced by the SNR. Unfortunately, in most applications the SNR cannot be
easily derived since the noise energy is not known. Further, the question
arises as to what is "signal" and what is "noise". For example, would a cough
or breath noise be considered part of the "signal" in spontaneous speech?
Does it convey information? With these problems in mind, we must define a
statistically oriented method which makes a best estimate of the SNR given
the a priori knowledge of the speech data. One such method uses a
short-term analysis of the speech signal to statistically characterize the
signal and the noise.
Figure 1. A speech signal whose SNR is approximately 30 dB.
The challenge is to compute the signal and noise energies without any
a priori knowledge about the data in the audio file.
Consider a typical speech signal, such as that shown in Figure 1. The
method used for estimation of the signal's SNR is based on a histogram
analysis of energy. Ideally, we would expect to see two major modalities
in the energy histogram as shown in Figure 2.
These two modalities
correspond to the nominal noise energy and nominal signal plus noise
energy, respectively. From the cdf shown in Figure 2, we can define
thresholds which select the percentage of data points which we expect
to correspond to the signal plus noise energy and the noise energy.
Typically we use thresholds of 80% signal and 20% noise (85%/15% and
95%/15% are also popular choices). These values have been derived by
experienced speech researchers based on analyses of many types of data.
With this methodology we define the estimated SNR based on the energy
levels corresponding to the points in the cdf that satisfy our
thresholds:
Figure 2. An energy probability density function (pdf) and the corresponding
cumulative distribution function (cdf).
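
The threshold idea above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the released source code:
the function names are hypothetical, the noise energy is read from the cdf at
20% and the signal-plus-noise energy at 80%, and the SNR is assumed to be
10 log10((E_sn - E_n) / E_n), where the noise level is subtracted from the
signal-plus-noise level to approximate the signal energy alone.

```python
import math


def energy_at_cdf(sorted_energies, fraction):
    """Return the smallest energy whose empirical cdf is >= fraction."""
    n = len(sorted_energies)
    # Index of the first sorted sample at which the cdf reaches `fraction`.
    index = max(0, math.ceil(fraction * n) - 1)
    return sorted_energies[min(index, n - 1)]


def estimate_snr_db(frame_energies, signal_fraction=0.80, noise_fraction=0.20):
    """Estimate the SNR in dB from a list of short-term frame energies.

    Assumption (not taken from the text): the signal energy is approximated
    as (signal-plus-noise energy) minus (noise energy).
    """
    if not frame_energies:
        raise ValueError("at least one frame energy is required")
    ordered = sorted(frame_energies)
    noise = energy_at_cdf(ordered, noise_fraction)
    signal_plus_noise = energy_at_cdf(ordered, signal_fraction)
    signal = signal_plus_noise - noise
    if noise <= 0.0 or signal <= 0.0:
        raise ValueError("energy distribution does not separate signal from noise")
    return 10.0 * math.log10(signal / noise)


# Toy bimodal data: half the frames are noise only, half are speech plus noise.
energies = [1.0] * 50 + [1001.0] * 50
snr_db = estimate_snr_db(energies)
```

Other threshold pairs can be tried by passing different values for
signal_fraction and noise_fraction.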
There is one detail that we have overlooked to this point: how do we obtain
the short-term measurements? This requires that we decide on a window
duration and a frame duration that yield consistent and accurate SNR
estimates for the given data set. For speech signals, we typically use a
30 msec window duration and a 20 msec frame duration, so that successive
windows overlap by 10 msec. Pre-emphasis is applied to the signal, and each
frame can be weighted by a window function. The pre-emphasis filter is
given by

    y(n) = x(n) - µ x(n-1)

where x(n) is the input sample, y(n) is the filtered output, and µ is
typically around 0.95. The default window is rectangular, but a Hamming
window can optionally be selected. The Hamming window tapers the edges of
each frame to smooth the abrupt discontinuities at the frame boundaries.
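
The short-term analysis described above might be sketched as follows. This
is a minimal illustration rather than the released implementation: the
function names are hypothetical, the pre-emphasis filter is assumed to have
the common first-order form y(n) = x(n) - µ x(n-1), the frame duration is
assumed to be the step between successive windows, and the energy of a frame
is assumed to be the mean of its squared (windowed) samples.

```python
import math


def preemphasize(samples, mu=0.95):
    """Apply y(n) = x(n) - mu * x(n-1), treating the sample before the first as 0."""
    output = []
    previous = 0.0
    for x in samples:
        output.append(x - mu * previous)
        previous = x
    return output


def hamming(length):
    """Return the coefficients of a Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_energies(samples, sample_rate, window_sec=0.030, frame_sec=0.020,
                   mu=0.95, use_hamming=False):
    """Return one mean-square energy per analysis window.

    Windows are window_sec long and advance by frame_sec, so the defaults
    (30 msec and 20 msec) give overlapping windows. The window is
    rectangular unless use_hamming is True.
    """
    window_len = int(round(window_sec * sample_rate))
    step = int(round(frame_sec * sample_rate))
    if window_len < 1 or step < 1:
        raise ValueError("window and frame durations are too short for this sample rate")
    emphasized = preemphasize(samples, mu)
    weights = hamming(window_len) if use_hamming else [1.0] * window_len
    energies = []
    # Only complete windows are analyzed; a trailing partial window is dropped.
    for start in range(0, len(emphasized) - window_len + 1, step):
        frame = emphasized[start:start + window_len]
        energies.append(sum((w * x) ** 2 for w, x in zip(weights, frame)) / window_len)
    return energies


# Example: 100 samples at a (hypothetical) 1 kHz rate, pre-emphasis disabled.
example = frame_energies([1.0] * 100, sample_rate=1000, mu=0.0)
```

The resulting list of frame energies is what the histogram and cdf analysis
operates on.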
You can download the following from our site:
- Software: Signal-to-noise ratio source code in compressed gzip format.
  This release includes a sample data file, sample scripts, and
  installation instructions.
- Tutorial: An overview of the theory behind this approach.