
3.1.1 Overview: What is a Feature?

Speech recognition begins when a person speaks into a microphone or telephone. This act of speaking produces a sound pressure wave which forms an acoustic signal. The microphone or telephone receives the acoustic signal and converts it to an analog signal that can be understood by an electronic device. Finally, to store the analog signal on a computer (a digital device), it must be converted to a digital signal.
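The final step, converting an analog signal to a digital one, can be illustrated with a minimal sketch. It assumes the analog signal is modeled as a Python function of time, and the function name `digitize`, the 8 kHz sampling rate, and the 16-bit resolution are illustrative choices rather than values taken from this tutorial.

```python
import math

def digitize(analog, sample_rate_hz, duration_s, bits):
    """Sample a continuous-time function and quantize each sample to a signed integer."""
    max_level = 2 ** (bits - 1) - 1              # e.g. 32767 for 16 bits
    n_samples = int(round(sample_rate_hz * duration_s))
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                   # sampling: evaluate at discrete times
        x = max(-1.0, min(1.0, analog(t)))       # clip to the assumed input range [-1, 1]
        samples.append(round(x * max_level))     # quantization: nearest integer level
    return samples

# Hypothetical "analog" input: a 440 Hz tone at half of full scale.
tone = lambda t: 0.5 * math.sin(2 * math.pi * 440 * t)
digital = digitize(tone, sample_rate_hz=8000, duration_s=0.01, bits=16)
print(len(digital), digital[:5])
```

A real converter does this in hardware, but the two ideas are the same: the signal is measured at regular instants, and each measurement is rounded to one of a finite number of levels.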

A speech recognizer is a program running on a computer that tries to understand or "decode" a digital signal. However, the signal, as first captured by the microphone or telephone, contains information in a form that the recognizer cannot yet decode. Only certain attributes or features of a person's speech are helpful for decoding. These features allow the recognizer to differentiate among the phonemes (patterns of vowels and consonants) that are spoken for each word. They must be numerically measured and stored in a form the recognizer can process. We call this form a feature vector.

The process of taking these measurements is known as feature extraction. Other names for feature extraction include front-end processing, digital signal processing, and signal modeling. In modern speech recognition systems, feature extraction typically includes the process of converting the signal to a digital form (i.e., signal conditioning), measuring some important characteristics of the signal such as energy or frequency response (i.e., signal measurement), augmenting these measurements with some perceptually meaningful derived measurements (i.e., signal parameterization), and statistically conditioning these numbers to form observation vectors. Historically, we have referred to this entire process as feature extraction. For more details on this process, see our notes on signal modeling in speech recognition.
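The last two stages, parameterization and statistical conditioning, can be sketched as follows. This is only an illustration under stated assumptions: it uses a first difference (a "delta") as the derived measurement and zero-mean, unit-variance scaling as the conditioning step, neither of which is specified in this section. The function names and the input numbers are hypothetical.

```python
import math

def add_deltas(measurements):
    """Append a first-difference (delta) term to each frame's measurements."""
    vectors = []
    for i, frame in enumerate(measurements):
        prev = measurements[i - 1] if i > 0 else frame   # first frame: delta is 0
        deltas = [cur - old for cur, old in zip(frame, prev)]
        vectors.append(list(frame) + deltas)
    return vectors

def normalize(vectors):
    """Scale each dimension to zero mean and unit variance over the utterance."""
    n, dim = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    # A constant dimension has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((v[d] - means[d]) ** 2 for v in vectors) / n) or 1.0
            for d in range(dim)]
    return [[(v[d] - means[d]) / stds[d] for d in range(dim)] for v in vectors]

# Hypothetical per-frame log energies for a short utterance (made-up numbers).
log_energy = [[-2.0], [-1.0], [0.5], [1.5], [1.0]]
observations = normalize(add_deltas(log_energy))
print(observations)
```

Each entry of `observations` plays the role of one observation vector: the original measurement plus its derived term, rescaled so that every dimension has a comparable range.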

Energy is an example of a feature or attribute of speech that is useful to the recognizer. The figure below plots the change in energy values for the phonemes of the word, "engineer," over a time period of 500 ms.

[Figure: energy values for the phonemes of the word "engineer" over a time period of 500 ms]
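An energy contour like the one described above can be approximated by cutting the signal into short frames and averaging the squared amplitude in each. The sketch below assumes 25 ms frames taken every 10 ms at an 8 kHz sampling rate and uses a synthetic quiet-then-loud tone in place of real speech. These values and the function name `short_time_energy` are illustrative assumptions.

```python
import math

def short_time_energy(samples, frame_len, hop):
    """Return the mean squared amplitude of each full frame of `samples`."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies

# Synthetic stand-in for speech: 250 ms of a quiet tone, then 250 ms of a loud one.
rate = 8000
quiet = [0.05 * math.sin(2 * math.pi * 200 * n / rate) for n in range(rate // 4)]
loud = [0.80 * math.sin(2 * math.pi * 200 * n / rate) for n in range(rate // 4)]

# 200 samples = 25 ms frames, 80 samples = 10 ms hop.
energy = short_time_energy(quiet + loud, frame_len=200, hop=80)
print(len(energy), energy[0], energy[-1])
```

Plotting `energy` against frame index would show a low plateau followed by a high one, much as sustained vowels stand out from weaker consonants in real speech.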

Spectrograms provide another way of viewing the speech signal: they plot how the energy of the signal at specific frequencies changes over time. The spectrogram below was computed for each of the phonemes of the word "engineer." The lighter yellow areas indicate higher energy in the vocal frequencies. Note the higher values in the vowel sounds, which are sustained longer than the consonant sounds.

[Figure: spectrogram of the phonemes of the word "engineer"]

The energy values for the spectrogram are extracted by computing a Fourier Transform, a mathematical technique that allows one to compute the frequency spectrum of the signal given a small amount of data, or window. This process is also known as converting from the time domain to the frequency domain.
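The conversion from the time domain to the frequency domain can be sketched for a single window as follows. This example makes several assumptions that are not stated in this section: it applies a Hamming window, uses a direct (slow) discrete Fourier transform rather than a fast algorithm, and analyzes a synthetic 1000 Hz tone sampled at 8 kHz. The function name `power_spectrum` is hypothetical.

```python
import cmath
import math

def power_spectrum(frame):
    """Apply a Hamming window, then a direct DFT; return power for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):                      # non-negative frequencies only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        spectrum.append(abs(coeff) ** 2)             # energy at frequency bin k
    return spectrum

rate = 8000
frame = [math.sin(2 * math.pi * 1000 * i / rate) for i in range(256)]  # one 32 ms window
power = power_spectrum(frame)

peak_bin = max(range(len(power)), key=power.__getitem__)
peak_hz = peak_bin * rate / len(frame)               # convert bin index to Hz
print(peak_bin, peak_hz)
```

Repeating this computation for successive windows and stacking the resulting power values side by side gives the time-versus-frequency picture that a spectrogram displays.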
