The explosive growth of speech recognition technology in the marketplace has resulted from two significant advances over the past forty years: ubiquitous high-speed computing and the evolution of statistical methods in science and engineering. Of course, the impact of computing cannot be overstated. However, algorithmic advances in speech recognition have quietly revolutionized the way we look at many signal processing problems, in addition to transforming voice interfaces from simple one-word commands to fluently spoken speech. In the 1950's and 60's [], approaches to speech recognition were dominated by attempts to blend linguistic knowledge about sound production [] with analog electronics []. These simple systems, often based on analog filter banks [], resulted in impressive demonstrations but were not scalable to large problems. In the 1970's, with the advent of modern computing, many of these techniques were transformed into their digital equivalents, and the field of digital signal processing came of age. Nevertheless, these approaches, still based on primitive template-matching technology, were heavily constrained and met with limited success. In the late 1970's, statistical methods slowly began to emerge, and by the mid-1990's such techniques had become the dominant approach to speech recognition. Statistical methods are popular because of their simplicity: we model variation in the data using well-known statistical models such as Gaussian distributions, and train the systems using standard machine learning [] principles. The motivation for this approach is shown in Figure 1, which summarizes the speech recognition problem as a source-channel modeling problem common to communications theory []. The goal of the speech recognizer is to estimate the message sent by the user by maximizing the probability of a correct choice:

    W^ = argmax_W P(W|A)

Here, W^ represents the most probable word sequence, or the transcription of the incoming speech.
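To make this decision rule concrete, here is a minimal sketch of choosing among a handful of candidate transcriptions by maximizing log P(A|W) + log P(W); since P(A) is the same for every candidate, it drops out of the argmax. The candidate word sequences and all scores below are invented purely for illustration.

```python
# Hypothetical scores for three candidate transcriptions of one utterance:
# log P(A|W) is the acoustic log-likelihood, log P(W) the language-model
# log-probability. In a real recognizer these come from the acoustic and
# language models described in the text.
candidates = {
    "recognize speech": {"log_p_a_given_w": -12.0, "log_p_w": -3.2},
    "wreck a nice beach": {"log_p_a_given_w": -11.5, "log_p_w": -7.9},
    "recognise peach": {"log_p_a_given_w": -13.1, "log_p_w": -6.4},
}

def decode(cands):
    """Return the word sequence W maximizing log P(A|W) + log P(W)."""
    return max(cands, key=lambda w: cands[w]["log_p_a_given_w"] + cands[w]["log_p_w"])

best = decode(candidates)
```

Working in log-probabilities, as here, is standard practice: it turns the products of many small probabilities into sums and avoids numerical underflow.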
The machinery for executing this task is Bayes' formula of probability theory:

              P(A|W) P(W)
    P(W|A) =  -----------
                 P(A)

where A represents the acoustic signal (speech) and W represents the word sequence spoken. A typical implementation of this equation is shown in Figure 2. The signal is converted to a sequence of feature vectors, typically generated 100 times per second using a frequency-domain analysis of the signal. These vectors are then measured against prestored estimates of their values, and a probability that they belong to a particular class of sounds is estimated using a Hidden Markov model (HMM) []. The likelihood of a word sequence is then computed as the product of the probabilities of these individual feature vectors. The representation of this problem using Bayes' equation is of fundamental importance to understanding modern speech recognition theory, since it succinctly decomposes the problem into two sub-problems: acoustic modeling, in which we estimate P(A|W), and language modeling, in which we estimate P(W). Acoustic modeling is the process by which we estimate the probability of a feature vector given the class of sound we believe it belongs to. This is summarized in Figure 3. Statistical models that explain variation in sound patterns are learned, using machine learning algorithms, from large speech databases often consisting of hundreds of hours of recordings. Language modeling is the process by which we estimate the probability that one word occurs given that we have seen the N previous words. This is summarized in Figure 4. Language models are trained on vast amounts of text data that model the types of phrases people are likely to speak in a given application. Though a vocabulary of several thousand words will often cover most of a person's typical phrases, recognition systems must have vocabularies approaching one hundred thousand words to adequately cover realistic tasks such as audio indexing of broadcast news.
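As a small illustration of the language-modeling half of this decomposition, the sketch below estimates bigram probabilities P(w_n | w_{n-1}) by simple counting over a tiny invented corpus. Real systems train on vastly larger text collections and apply smoothing for unseen word pairs, which is omitted here.

```python
from collections import Counter

# A toy training corpus, invented for illustration only.
corpus = "the cat sat on the mat the cat ran".split()

# Maximum-likelihood bigram estimate:
#   P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1})
bigrams = Counter(zip(corpus, corpus[1:]))
# Count a word only when it can start a bigram (i.e., not the last token).
unigrams = Counter(corpus[:-1])

def bigram_prob(prev, word):
    """P(word | prev) under the maximum-likelihood bigram model."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

p = bigram_prob("the", "cat")  # "the" is followed by "cat" in 2 of its 3 uses
```

The same counting scheme extends directly to the trigram models mentioned later; the memory cost grows quickly because the number of observed word triples is far larger than the number of pairs.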
A statistical speech recognition approach has some added benefits that are demonstrated in Figure 5. One significant feature of this approach is that knowledge in the system is organized in a hierarchical fashion. For example, a system might consist of a level that looks for two-word sequences (known as a bigram), a level that understands how to convert a word to phonemes (known as a lexicon), and a level that converts phonemes to sequences of acoustic vectors (referred to as the acoustic models). This last level is typically implemented using three-state Hidden Markov models as shown in Figure 5. At each state, we often use a mixture of Gaussian distributions [] as the statistical model. The next level up, the lexicon, resembles the pronunciation portion of a standard dictionary. Language models come in two types: networks and N-grams. Networks attempt to predict word sequences one word at a time. They are most often implemented as state machines in which each transition outputs a word, and are extremely popular for command-and-control applications, such as navigating a sequence of menus on a desktop computer. N-grams, on the other hand, attempt to enumerate common sequences of words. We typically use trigrams (sequences of three words) in a high-performance system. Such language models can be very powerful but require large amounts of memory to process. Another significant feature of the system is that information detected in the signal is time-aligned against the speech signal as shown in Figure 5. The system is, in essence, implemented as a hierarchy of state machines in which each input vector forces a transition from one state to the next (some systems allow transitions with no input data). Hence, every input vector is assigned a slot in the final hypothesis. When recognition is finished, it is easy to backtrack and determine at what point in the speech signal each event was detected.
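The time-alignment and backtracking idea can be sketched with Viterbi decoding on a toy left-to-right HMM. The three states, transition probabilities, and discrete 'a'/'b' emissions below are all invented for illustration (real systems emit continuous feature vectors modeled by Gaussian mixtures); the point is that every input frame is assigned to a state, and backtracking recovers exactly where each event occurred.

```python
import math

states = [0, 1, 2]
# Left-to-right topology: each state may loop on itself or advance one state.
log_trans = {(0, 0): math.log(0.6), (0, 1): math.log(0.4),
             (1, 1): math.log(0.6), (1, 2): math.log(0.4),
             (2, 2): math.log(1.0)}
# Emission log-probabilities for two discrete observation symbols.
log_emit = [{'a': math.log(0.9), 'b': math.log(0.1)},
            {'a': math.log(0.2), 'b': math.log(0.8)},
            {'a': math.log(0.7), 'b': math.log(0.3)}]

def viterbi(obs):
    """Return the most likely state for each frame (paths start in state 0)."""
    V = [{0: log_emit[0][obs[0]]}]   # best log-score of a path ending in state s
    back = [{}]                      # back-pointers for backtracking
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            best = None
            for p in V[t - 1]:
                if (p, s) in log_trans:
                    score = V[t - 1][p] + log_trans[(p, s)] + log_emit[s][obs[t]]
                    if best is None or score > best[0]:
                        best = (score, p)
            if best is not None:
                V[t][s], back[t][s] = best
    # Backtrack from the best final state to recover the frame-level alignment.
    s = max(V[-1], key=V[-1].get)
    path = [s]
    for t in range(len(obs) - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]

alignment = viterbi(['a', 'b', 'b', 'a'])   # one state label per input frame
```

Because the returned path has one entry per input frame, mapping any recognized event back to its time span in the audio is just a matter of reading off which frames were assigned to it.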
Such localization of information makes it easy to implement advanced applications such as audio indexing, since the audio portions of a signal corresponding to text data matched from a search can be easily located and replayed. A system that combines all of these features can be quite large. It is common for high-performance systems to use several million free parameters and to require hundreds of megabytes of memory to process data. The most computationally intensive part of such a system is the search algorithm used to find the most probable word sequence. This is depicted in Figure 6. A traditional dynamic programming-based approach is used to find an optimal solution quickly. However, because the search space is so large (consider how many words can appear as the first word in a sentence, then consider how many words can follow this first word), a beam search [] approach must be employed in which candidate word sequences that have a low probability are discarded early in the search.
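The pruning idea behind beam search can be sketched as follows: at each step, every surviving partial hypothesis is extended by each candidate word, scored, and only the few best survivors are kept. The vocabulary and per-position log-probabilities below are invented for illustration; a real recognizer scores extensions with its acoustic and language models.

```python
import math

def beam_search(step_scores, beam_width):
    """step_scores: list of dicts mapping word -> log-prob at that position.
    Returns the best surviving (word sequence, log-score) hypothesis."""
    beam = [([], 0.0)]                      # (partial word sequence, log-score)
    for scores in step_scores:
        # Extend every hypothesis in the beam with every candidate word.
        expanded = [(seq + [w], s + lp) for seq, s in beam
                    for w, lp in scores.items()]
        # Prune: keep only the beam_width highest-scoring hypotheses.
        expanded.sort(key=lambda h: h[1], reverse=True)
        beam = expanded[:beam_width]
    return beam[0]

steps = [
    {"the": math.log(0.6), "a": math.log(0.4)},
    {"cat": math.log(0.7), "hat": math.log(0.3)},
    {"sat": math.log(0.8), "ran": math.log(0.2)},
]
best_seq, best_score = beam_search(steps, beam_width=2)
```

Pruning makes the search tractable at the cost of optimality: if the correct hypothesis scores poorly early on, it can be discarded before later evidence would have rescued it, which is why beam widths are tuned to balance speed against accuracy.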