The explosive growth of speech recognition technology in the marketplace has resulted from two significant advances over the past forty years: ubiquitous high-speed computing and the evolution of statistical methods in science and engineering. Of course, the impact of computing cannot be overstated. However, algorithmic advances in speech recognition have quietly revolutionized the way we look at many signal processing problems, in addition to transforming voice interfaces from simple one-word commands to fluently spoken speech. In the 1950's and 60's [], approaches to speech recognition were dominated by attempts to blend linguistic knowledge about sound production [] with analog electronics []. These simple systems, often based on analog filter banks [], resulted in impressive demonstrations but were not scalable to large problems. In the 1970's, with the advent of modern computing, many of these techniques were transformed into their digital equivalents, and the field of digital signal processing came of age. Nevertheless, these approaches, still based on primitive template-matching technology, were heavily constrained and met with limited success. In the late 1970's, statistical methods slowly began to emerge, and by the mid-1990's such techniques had become the dominant approach to speech recognition. Statistical methods are popular because of their simplicity: we model variation in the data using well-known statistical models such as Gaussian distributions, and train the systems using standard machine learning [] principles. The motivation for this approach is shown in Figure 1, which summarizes the speech recognition problem as a source-channel modeling problem common to communications theory []. The goal of the speech recognizer is to estimate the message sent by the user by maximizing the probability of a correct choice:

    W^ = argmax_W P(W|A)

Here, W^ represents the most probable word sequence, or the transcription of the incoming speech.
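To make this decision rule concrete, here is a minimal sketch of choosing among a handful of candidate transcriptions by maximizing log P(A|W) + log P(W); since P(A) is the same for every candidate, it drops out of the argmax. The candidate word sequences and all scores below are invented purely for illustration.

```python
# Hypothetical scores for three candidate transcriptions of one utterance:
# log P(A|W) is the acoustic log-likelihood, log P(W) the language-model
# log-probability. In a real recognizer these come from the acoustic and
# language models described in the text.
candidates = {
    "recognize speech": {"log_p_a_given_w": -12.0, "log_p_w": -3.2},
    "wreck a nice beach": {"log_p_a_given_w": -11.5, "log_p_w": -7.9},
    "recognise peach": {"log_p_a_given_w": -13.1, "log_p_w": -6.4},
}

def decode(cands):
    """Return the word sequence W maximizing log P(A|W) + log P(W)."""
    return max(cands, key=lambda w: cands[w]["log_p_a_given_w"] + cands[w]["log_p_w"])

best = decode(candidates)
```

Working in log-probabilities, as here, is standard practice: it turns the products of many small probabilities into sums and avoids numerical underflow.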
The machinery for executing this task is Bayes' formula of probability theory:

              P(A|W) P(W)
    P(W|A) =  -----------
                 P(A)

where A represents the acoustic signal (speech) and W represents the word sequence spoken. A typical implementation of this equation is shown in Figure 2. The signal is converted to a sequence of feature vectors, typically generated 100 times per second using a frequency-domain analysis of the signal. These vectors are then measured against prestored estimates of their values, and a probability that they belong to a particular class of sounds is estimated using a Hidden Markov model (HMM) []. The likelihood of a word sequence is then computed as the product of the probabilities of these individual feature vectors. The representation of this problem using Bayes' equation is of fundamental importance to understanding modern speech recognition theory, since it succinctly decomposes the problem into two sub-problems: acoustic modeling, in which we estimate P(A|W), and language modeling, in which we estimate P(W). Acoustic modeling is the process by which we estimate the probability of a feature vector given the class of sound we believe it belongs to. This is summarized in Figure 3. Statistical models that explain variation in sound patterns are learned, using machine learning algorithms, from large speech databases often consisting of hundreds of hours of recordings. Language modeling is the process by which we estimate the probability that one word occurs given that we have seen the N previous words. This is summarized in Figure 4. Language models are trained on vast amounts of text data that model the types of phrases people are likely to speak in a given application. Though a vocabulary of several thousand words will often cover most of a person's typical phrases, recognition systems must have vocabularies approaching one hundred thousand words to adequately cover realistic tasks such as audio indexing of broadcast news.
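As a small illustration of the language-modeling half of this decomposition, the sketch below estimates bigram probabilities P(w_n | w_{n-1}) by simple counting over a tiny invented corpus. Real systems train on vastly larger text collections and apply smoothing for unseen word pairs, which is omitted here.

```python
from collections import Counter

# A toy training corpus, invented for illustration only.
corpus = "the cat sat on the mat the cat ran".split()

# Maximum-likelihood bigram estimate:
#   P(w_n | w_{n-1}) = count(w_{n-1}, w_n) / count(w_{n-1})
bigrams = Counter(zip(corpus, corpus[1:]))
# Count a word only when it can start a bigram (i.e., not the last token).
unigrams = Counter(corpus[:-1])

def bigram_prob(prev, word):
    """P(word | prev) under the maximum-likelihood bigram model."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

p = bigram_prob("the", "cat")  # "the" is followed by "cat" in 2 of its 3 uses
```

The same counting scheme extends directly to the trigram models mentioned later; the memory cost grows quickly because the number of observed word triples is far larger than the number of pairs.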
A statistical speech recognition approach has some added benefits that are demonstrated in Figure 5. One significant feature of this approach is that knowledge in the system is organized in a hierarchical fashion. For example, a system might consist of a level that looks for two-word sequences (known as a bigram), a level that understands how to convert a word to phonemes (known as a lexicon), and a level that converts phonemes to sequences of acoustic vectors (referred to as the acoustic models). This last level is typically implemented using three-state Hidden Markov models as shown in Figure 5. At each state, we often use a mixture of Gaussian distributions [] as the statistical model. The next level up, the lexicon, resembles the pronunciation portion of a standard dictionary. Language models come in two types: networks and N-grams. Networks attempt to predict word sequences one word at a time. They are most often implemented as state machines in which each transition outputs a word, and are extremely popular for command-and-control applications, such as navigating a sequence of menus on a desktop computer. N-grams, on the other hand, attempt to enumerate common sequences of words. We typically use trigrams (sequences of three words) in a high-performance system. Such language models can be very powerful but require large amounts of memory to process. Another significant feature of the system is that information detected in the signal is time-aligned against the speech signal as shown in Figure 5. The system is, in essence, implemented as a hierarchy of state machines in which each input vector forces a transition from one state to the next (some systems allow transitions with no input data). Hence, every input vector is assigned a slot in the final hypothesis. When recognition is finished, it is easy to backtrack and determine at what point in the speech signal each event was detected.
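The time-alignment and backtracking idea can be sketched with Viterbi decoding on a toy left-to-right HMM. The three states, transition probabilities, and discrete 'a'/'b' emissions below are all invented for illustration (real systems emit continuous feature vectors modeled by Gaussian mixtures); the point is that every input frame is assigned to a state, and backtracking recovers exactly where each event occurred.

```python
import math

states = [0, 1, 2]
# Left-to-right topology: each state may loop on itself or advance one state.
log_trans = {(0, 0): math.log(0.6), (0, 1): math.log(0.4),
             (1, 1): math.log(0.6), (1, 2): math.log(0.4),
             (2, 2): math.log(1.0)}
# Emission log-probabilities for two discrete observation symbols.
log_emit = [{'a': math.log(0.9), 'b': math.log(0.1)},
            {'a': math.log(0.2), 'b': math.log(0.8)},
            {'a': math.log(0.7), 'b': math.log(0.3)}]

def viterbi(obs):
    """Return the most likely state for each frame (paths start in state 0)."""
    V = [{0: log_emit[0][obs[0]]}]   # best log-score of a path ending in state s
    back = [{}]                      # back-pointers for backtracking
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            best = None
            for p in V[t - 1]:
                if (p, s) in log_trans:
                    score = V[t - 1][p] + log_trans[(p, s)] + log_emit[s][obs[t]]
                    if best is None or score > best[0]:
                        best = (score, p)
            if best is not None:
                V[t][s], back[t][s] = best
    # Backtrack from the best final state to recover the frame-level alignment.
    s = max(V[-1], key=V[-1].get)
    path = [s]
    for t in range(len(obs) - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return path[::-1]

alignment = viterbi(['a', 'b', 'b', 'a'])   # one state label per input frame
```

Because the returned path has one entry per input frame, mapping any recognized event back to its time span in the audio is just a matter of reading off which frames were assigned to it.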
Such localization of information makes it easy to implement advanced applications such as audio indexing, since the audio portions of a signal corresponding to text data matched from a search can be easily located and replayed. A system that combines all of these features can be quite large. It is common for high-performance systems to use several million free parameters and to require hundreds of megabytes of memory to process data. The most computationally intensive part of such a system is the search algorithm used to find the most probable word sequence. This is depicted in Figure 6. A traditional dynamic programming-based approach is used to find an optimal solution quickly. However, because the search space is so large (consider how many words can appear as the first word in a sentence, then consider how many words can follow this first word), a beam search [] approach must be employed in which candidate word sequences that have a low probability are discarded early in the search.
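The pruning idea behind beam search can be sketched as follows: at each step, every surviving partial hypothesis is extended by each candidate word, scored, and only the few best survivors are kept. The vocabulary and per-position log-probabilities below are invented for illustration; a real recognizer scores extensions with its acoustic and language models.

```python
import math

def beam_search(step_scores, beam_width):
    """step_scores: list of dicts mapping word -> log-prob at that position.
    Returns the best surviving (word sequence, log-score) hypothesis."""
    beam = [([], 0.0)]                      # (partial word sequence, log-score)
    for scores in step_scores:
        # Extend every hypothesis in the beam with every candidate word.
        expanded = [(seq + [w], s + lp) for seq, s in beam
                    for w, lp in scores.items()]
        # Prune: keep only the beam_width highest-scoring hypotheses.
        expanded.sort(key=lambda h: h[1], reverse=True)
        beam = expanded[:beam_width]
    return beam[0]

steps = [
    {"the": math.log(0.6), "a": math.log(0.4)},
    {"cat": math.log(0.7), "hat": math.log(0.3)},
    {"sat": math.log(0.8), "ran": math.log(0.2)},
]
best_seq, best_score = beam_search(steps, beam_width=2)
```

Pruning makes the search tractable at the cost of optimality: if the correct hypothesis scores poorly early on, it can be discarded before later evidence would have rescued it, which is why beam widths are tuned to balance speed against accuracy.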