3.1.2 Overview: Frame-Based Processing

Each feature used by the recognizer must be calculated a user-specified number of times per second, and each calculation is made over some time interval. The first quantity, which we define as the number of seconds between feature vector calculations, is called the frame duration. The second quantity, the number of seconds of data used to calculate a feature, is called the window duration. A typical frame duration in speech recognition is 10 ms, while a typical window duration is 25 ms. This means that every 10 ms a set of features is computed using a window of 25 ms of data centered around the current frame. This process is demonstrated in the accompanying figure.

The measurements are taken over a set of samples. The number of samples used is determined by several factors. One factor is the sampling rate. As an example, for an 8 kHz sampling rate with a frame duration of 10 ms, measurements would be taken over 80 samples to produce one feature vector. Note that this assumes only the frame duration determines the number of samples used in the measurements.
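The arithmetic above can be sketched in a few lines of Python. This is only an illustration and not part of the ISIP tools; the function name `samples_per_interval` is hypothetical.

```python
def samples_per_interval(sample_rate_hz, duration_s):
    """Number of samples spanned by a duration at a given sampling rate.

    Hypothetical helper for illustration; round() guards against
    floating-point error in products such as 8000 * 0.010.
    """
    return round(sample_rate_hz * duration_s)


# 8 kHz sampling rate, 10 ms frame duration.
frame_samples = samples_per_interval(8000, 0.010)

# The same rate with the typical 25 ms window duration.
window_samples = samples_per_interval(8000, 0.025)

print(frame_samples, window_samples)  # prints "80 200"
```

At 8 kHz a 10 ms frame spans 80 samples, while a 25 ms window spans 200 samples.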

In practice, however, to get a smoother representation of the speech data, a window of samples surrounding the frame is incorporated in the measurements. Since the window incorporates samples from surrounding frames, the window size determines the number of samples used to produce a feature vector. The frame duration, however, determines how often we produce a feature vector. Our feature extraction tool, isip_transform_builder, provides an easy-to-use interface that alleviates the need to program such complicated interactions between the data and the analysis window. This chapter is primarily a tutorial on how to use this tool.
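To make the two roles concrete, here is a minimal Python sketch in which the frame length sets how many values are produced and the window length sets how many samples go into each one. It does not show how isip_transform_builder works internally. The function name `frame_energies`, the use of mean energy as a stand-in feature, the zero padding at the signal edges, and the dropping of a trailing partial frame are all assumptions made for illustration.

```python
def frame_energies(signal, frame_len, window_len):
    """Return one mean-energy value per frame.

    Each value is computed over window_len samples centered on the
    frame, so adjacent windows overlap when window_len > frame_len.
    """
    n = len(signal)
    # Assumption: a trailing partial frame is dropped.
    num_frames = n // frame_len
    energies = []
    for i in range(num_frames):
        frame_center = i * frame_len + frame_len // 2
        start = frame_center - window_len // 2
        # Assumption: samples outside the signal are treated as zero.
        window = [signal[j] if 0 <= j < n else 0.0
                  for j in range(start, start + window_len)]
        energies.append(sum(x * x for x in window) / window_len)
    return energies


# 100 ms of a constant signal at 8 kHz is 800 samples.
signal = [1.0] * 800
energies = frame_energies(signal, frame_len=80, window_len=200)
print(len(energies))  # prints "10": one value per 10 ms frame
```

Changing `window_len` alters how much data each value sees but not how many values are produced; changing `frame_len` does the opposite.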

The samples incorporated in the window can be taken from a frame preceding the current frame (left alignment), following the current frame (right alignment), or on either side of the current frame (center alignment). Center alignment is the most commonly used. The accompanying figure illustrates a frame duration of 10 ms with a center-aligned window that includes samples from 5 ms of frame data on either side of the current frame.
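The three alignments can be expressed as sample index ranges. The sketch below is one reading of the description above, not the definition used by the ISIP software: it assumes the window always contains the whole current frame and that the extra samples come from before it (left), after it (right), or are split evenly (center). The function name `window_bounds` is hypothetical.

```python
def window_bounds(frame_index, frame_len, window_len, alignment="center"):
    """Return (start, end) sample indices of the analysis window.

    The range is half-open: samples start .. end - 1 are used.
    Indices may fall outside the signal near its edges; the caller
    decides how to pad.
    """
    extra = window_len - frame_len
    if extra < 0:
        raise ValueError("window must be at least as long as the frame")
    frame_start = frame_index * frame_len
    if alignment == "left":
        # Assumption: extra samples come from preceding data.
        start = frame_start - extra
    elif alignment == "right":
        # Assumption: extra samples come from following data.
        start = frame_start
    elif alignment == "center":
        # Assumption: extra samples are split evenly on both sides.
        start = frame_start - extra // 2
    else:
        raise ValueError("unknown alignment: " + alignment)
    return start, start + window_len


# 8 kHz, 10 ms frames (80 samples), 5 ms (40 samples) on either side.
print(window_bounds(2, 80, 160, "center"))  # prints "(120, 280)"
```

For the third frame (index 2), which covers samples 160 to 239, the center-aligned window extends 40 samples (5 ms at 8 kHz) past each edge of the frame.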

A speech waveform for a set of samples is shown at the bottom of the figure. Each 10 ms frame is labeled between the vertical bars. The window surrounding each frame is highlighted by a box with dashed lines. Note that adjacent windows overlap, so they share data from frame to frame. The window size chosen depends on the mathematical technique used in the calculation. Some common windowing techniques include Hamming, Hanning, and Gaussian. For more details on windowing techniques, see the Window class in the algorithm library of our foundation classes.
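As a rough illustration of what such windowing functions compute, the following Python sketch builds the weights from their standard textbook formulas and applies them to a block of samples. It is independent of the Window class mentioned above, whose interface is not shown here; the function names and the default Gaussian width `sigma` are assumptions made for this example.

```python
import math


def hamming(n):
    """Textbook Hamming weights: 0.54 - 0.46 * cos(2*pi*k / (n - 1))."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1))
            for k in range(n)]


def hanning(n):
    """Textbook Hanning (Hann) weights: 0.5 - 0.5 * cos(2*pi*k / (n - 1))."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1))
            for k in range(n)]


def gaussian(n, sigma=0.4):
    """Gaussian weights; sigma is a fraction of the half-width (assumed default)."""
    if n == 1:
        return [1.0]
    half = (n - 1) / 2
    return [math.exp(-0.5 * ((k - half) / (sigma * half)) ** 2)
            for k in range(n)]


def apply_window(samples, weights):
    """Multiply each sample by the corresponding window weight."""
    if len(samples) != len(weights):
        raise ValueError("samples and weights must have the same length")
    return [s * w for s, w in zip(samples, weights)]


# Taper a 200-sample (25 ms at 8 kHz) block of constant samples.
tapered = apply_window([1.0] * 200, hamming(200))
```

All three functions weight samples near the center of the window most heavily and taper toward the edges, which reduces the effect of the abrupt cut at the window boundaries.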
