3.1.1 Overview: What is a Feature?
Speech recognition begins when a person speaks into a microphone or
telephone. This act of speaking produces a sound pressure wave
which forms an acoustic signal. The microphone or telephone
receives the acoustic signal and converts it to an analog signal
that can be understood by an electronic device. Finally, to store
the analog signal on a computer (a digital device), it must be
converted to a digital signal.
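The final conversion step can be illustrated with a minimal sketch.
It assumes the analog signal is available as a function of time,
samples it at a fixed rate, and rounds each sample to a signed
integer. The function name, the 8 kHz rate, the 16-bit depth, and the
440 Hz test tone are all illustrative choices, not values taken from
this tutorial.

```python
import math

def sample_and_quantize(analog, sample_rate_hz, duration_s, bits=16):
    """Sample a continuous-time function and round each sample to a signed integer."""
    max_level = 2 ** (bits - 1) - 1              # 32767 for 16 bits
    n_samples = int(round(sample_rate_hz * duration_s))
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                   # time of the n-th sample, in seconds
        value = max(-1.0, min(1.0, analog(t)))   # clip to the assumed input range [-1, 1]
        samples.append(int(round(value * max_level)))
    return samples

# Hypothetical "analog" input: a 440 Hz tone at half of full scale.
def tone(t):
    return 0.5 * math.sin(2 * math.pi * 440 * t)

digital = sample_and_quantize(tone, sample_rate_hz=8000, duration_s=0.01)
```

Here `digital` holds 80 integers, one for each of the 80 sampling
instants in 10 ms of signal.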
A speech recognizer is a program running on a computer that tries to
understand or "decode" a digital signal. However, the signal, as
first captured by the microphone or telephone, contains information
in a form that the recognizer cannot yet decode. Only certain
attributes or features of a person's speech are helpful for decoding.
These features allow the recognizer to differentiate among the
phonemes (patterns of vowels and consonants) that are spoken for each
word. They must be numerically measured and stored in a form the
recognizer can process. We call this form a feature vector. The
process of taking these measurements is known as feature extraction.
Other names for feature extraction include front-end processing,
digital signal processing, and signal modeling.
In modern speech recognition systems, feature extraction typically
includes the process of converting the signal to a digital form
(i.e., signal conditioning), measuring some important characteristics
of the signal such as energy or frequency response (i.e., signal
measurement), augmenting these measurements with some
perceptually-meaningful derived measurements (i.e., signal
parameterization), and statistically conditioning these numbers to
form observation vectors. Historically, we have referred to this
entire process as feature extraction. For more details on this
process, see our notes on signal modeling in speech recognition.
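The last two stages can be sketched in a few lines. The example
below assumes that one measurement per frame is already available,
uses a frame-to-frame difference (a "delta") as the derived
measurement, and uses mean and variance normalization as the
statistical conditioning. These are common choices but are our
assumptions here; the function names and the input values are
hypothetical.

```python
import math

def delta(values):
    """Central difference between neighbouring frames (edge values are repeated)."""
    padded = [values[0]] + list(values) + [values[-1]]
    return [(padded[i + 2] - padded[i]) / 2.0 for i in range(len(values))]

def normalize(values):
    """Shift to zero mean and scale to unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0      # fall back to 1.0 for constant input
    return [(v - mean) / std for v in values]

def observation_vectors(measurements):
    """Pair each normalized measurement with its normalized delta."""
    static = normalize(measurements)
    dynamic = normalize(delta(measurements))
    return [list(pair) for pair in zip(static, dynamic)]

frame_energy = [0.1, 0.4, 0.9, 0.7, 0.2]   # made-up per-frame measurements
vectors = observation_vectors(frame_energy)
```

Each entry of `vectors` is a two-element observation vector for one
frame: the conditioned measurement and its conditioned rate of
change.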
Energy is an example of a feature or attribute of speech that is
useful to the recognizer. The figure below plots the change in energy
values for the phonemes of the word "engineer" over a time period of
500 ms.
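One simple way to obtain such an energy contour is to cut the signal
into short frames and average the squared sample values in each
frame. The sketch below assumes this mean-square definition and an
80-sample frame (10 ms at 8 kHz); neither is specified in this
tutorial, and the input is a synthetic quiet-then-loud signal rather
than real speech.

```python
def frame_energies(samples, frame_len, hop):
    """Mean squared amplitude of each frame of the signal."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

# Synthetic signal: 80 quiet samples followed by 80 loud samples.
signal = [0.01] * 80 + [0.5] * 80
energies = frame_energies(signal, frame_len=80, hop=80)
```

The second value in `energies` is much larger than the first, which
is the kind of contrast an energy plot makes visible between loud
and quiet stretches of a word.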
Spectrograms provide another way of viewing the speech signal,
plotting changes in the energy of the signal at specific frequency
values over time. The spectrogram below was computed for each of the
phonemes of the word "engineer." The lighter yellow areas indicate
higher energy in the vocal frequencies. Note the higher values in the
vowel sounds, which are sustained longer than the consonant sounds.
The energy values for the spectrogram are extracted by computing a
Fourier Transform, a mathematical technique that computes the
frequency spectrum of the signal from a short segment of data, or
window. This process is also known as converting from the time domain
to the frequency domain.
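The windowed transform can be sketched as follows. The example
applies a Hamming window to each frame and evaluates a direct
discrete Fourier transform; production systems normally use a fast
Fourier transform instead, and the window type, frame length, hop
size, and 1000 Hz test tone are all assumptions made for
illustration.

```python
import cmath
import math

def power_spectrum(frame):
    """Power at each non-negative frequency bin of one windowed frame (direct DFT)."""
    n = len(frame)
    # A Hamming window tapers the frame edges to reduce spectral leakage.
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def spectrogram(samples, frame_len, hop):
    """One power spectrum per frame: rows are time, columns are frequency."""
    return [power_spectrum(samples[s:s + frame_len])
            for s in range(0, len(samples) - frame_len + 1, hop)]

sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = spectrogram(tone, frame_len=64, hop=32)

# Locate the strongest frequency bin in the first frame and convert it to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Each row of `spec` is one column of a spectrogram image. For this
test tone the strongest bin in every frame corresponds to 1000 Hz.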