Title: Hybrid HMM/SVM Architectures for Speech Recognition
Authors: A. Ganapathiraju, J. Hamaker and J. Picone

Speech recognition can be viewed as a pattern recognition problem in which we desire each unique sound to be distinguishable from all other sounds. Unfortunately, the measurements we use to classify the signal exhibit extreme amounts of overlap in the feature space. Traditionally, statistical models such as Gaussian mixture models have been used to "represent" the various modalities of a given speech sound. To improve the separability of these representations in the feature space, acoustic units such as context-dependent phones, which use information about the preceding and following sounds, have been employed. Such detailed statistical models often fall prey to overtraining, and provide only modest gains in recognition performance while significantly increasing the complexity and parameter count of a system. For example, even after training on 100 hours of conversational speech data, it is often observed that mixture components model rarely occurring artifacts corresponding to a single speaker or an isolated event. In Hidden Markov Models (HMMs), the most successful classification paradigm for speech recognition to date, parameters are traditionally estimated using a Maximum Likelihood (ML) criterion. Extensions of the HMM paradigm employ discriminative training criteria such as Maximum Mutual Information (MMI) and Minimum Classification Error (MCE). Many of these techniques fall under the general principle of risk minimization; empirical risk minimization is one of the most commonly used optimization procedures in machine learning. A Support Vector Machine (SVM) is a new approach to machine learning that learns to classify discriminatively.
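The empirical risk minimization principle mentioned above can be made concrete with a minimal sketch: the empirical risk of a classifier is simply its average loss over the training sample, and training selects the classifier in some family that minimizes this average. The toy 1-D data, labels, and threshold family below are purely illustrative assumptions, not material from the paper.

```python
# Illustrative sketch of empirical risk minimization (ERM) with 0/1 loss.
# The data and the family of threshold classifiers are hypothetical.

samples = [(0.2, -1), (0.8, -1), (1.5, +1), (2.1, +1), (0.9, +1)]

def empirical_risk(threshold, data):
    # 0/1 loss: count 1 whenever sign(x - threshold) disagrees with the label,
    # then average over the sample.
    errors = sum(1 for x, y in data if (1 if x > threshold else -1) != y)
    return errors / len(data)

# ERM: pick, from a small grid of candidate thresholds, the one that
# minimizes the empirical risk on the training sample.
best = min((t / 10 for t in range(0, 25)),
           key=lambda t: empirical_risk(t, samples))
```

The discriminative criteria cited in the text (MMI, MCE) replace the 0/1 loss with smoother, likelihood-based objectives, but the underlying pattern of minimizing an averaged training-set criterion is the same.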
SVMs are based on the fact that data can be transformed into a very high-dimensional feature space in which simple linear hyperplanes can be constructed for classification. Though this task seems daunting (especially when the dimension of the feature space is in the thousands), the theory of kernels gives an elegant solution to this problem and makes it computationally feasible even for large tasks. What is interesting about this approach is that, in the process of developing these decision surfaces, one gains great insight into the nature of the overlap between classes, and can identify data points that are likely to be outliers rather than points at the edges of the region of discrimination. SVMs are a fundamentally new approach to acoustic modeling, and preliminary experiments have been promising. For example, on a phone classification experiment involving the six most confused phone pairs from the OGI Alphadigits corpus, SVMs provided a significant improvement over HMMs. In this paper, we present results on a recently developed hybrid system involving SVMs and HMMs, and demonstrate that it appears to be a promising approach for acoustic modeling.
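The kernel idea that makes the high-dimensional mapping computationally feasible can be sketched in a few lines. For a homogeneous polynomial kernel of degree 2 on 2-dimensional inputs, the explicit feature map is phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), and the kernel k(x, z) = (x . z)^2 computes the inner product phi(x) . phi(z) without ever forming the mapped vectors. The specific kernel and inputs here are a minimal illustrative assumption, not the kernel used in the paper's experiments.

```python
import math

def phi(x):
    # Explicit degree-2 polynomial feature map for 2-D input.
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, z):
    # Same inner product, computed in the original 2-D space.
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(x), phi(z))   # inner product in the mapped space
implicit = poly_kernel(x, z)     # identical value via the kernel function
```

For an RBF kernel the implied feature space is infinite-dimensional, so the kernel evaluation is not merely cheaper but the only practical option; the SVM optimization and decision function depend on the data solely through such kernel values.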