Title: A Hybrid ASR System Using Support Vector Machines
Authors: Aravind Ganapathiraju, Jonathan Hamaker, Joseph Picone
Support Vector Machines (SVMs) are a class of machine learning
techniques that learn to classify discriminatively. This paradigm has
gained significance in the past few years with the development of
efficient training algorithms. SVMs are based on the fact that data
can be transformed into a very high dimensional feature space in
which the classes can be separated by a simple hyperplane. Though
this task seems daunting (especially when the dimension of the
feature space is several thousand), the theory of kernels provides an
elegant solution to this problem and makes it computationally
feasible even for large tasks. Like neural network techniques, SVMs
are inherently static classifiers. The dynamic nature of the data
must therefore be handled by a hybrid method built on a dynamic model
such as hidden Markov models (HMMs).
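The kernel idea above can be illustrated with a small sketch (not taken from the system described here): the SVM decision function depends on the data only through inner products, so a kernel such as the RBF evaluates those inner products in the implicit high dimensional space without ever constructing it. The support vectors, weights, and bias below are hypothetical values for illustration only.

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # Inner product in an implicit (infinite-dimensional) feature
    # space, computed directly from the original low-dimensional vectors.
    return np.exp(-gamma * np.sum((x - y) ** 2))

def svm_decision(x, support_vectors, weights, bias, gamma=1.0):
    # f(x) = sum_i (alpha_i * y_i) * K(sv_i, x) + b: the separating
    # hyperplane in the kernel-induced space, evaluated without ever
    # forming that space explicitly.
    return sum(w * rbf_kernel(sv, x, gamma)
               for sv, w in zip(support_vectors, weights)) + bias

# Hypothetical trained parameters (alpha_i * y_i folded into weights).
svs = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
weights = [1.0, -1.0]
f = svm_decision(np.array([0.1, 0.0]), svs, weights, bias=0.0)
```

A point near the positive support vector yields a positive decision value, so classification reduces to the sign of f.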
In this research, we have developed a hybrid HMM/SVM system to
recognize continuous speech much like the hybrid connectionist
systems. The motivation for this is based on the fact that HMMs give
us an elegant method to handle the dynamics of speech and SVMs provide
us with powerful classifiers of static data. SVMs are used to generate
posterior probabilities, which are converted to scaled likelihoods
before being processed using the standard dynamic programming approach
used in HMMs. The mapping from SVM distances to posterior
probabilities can be achieved in several ways. We have explored
fitting Gaussians to each of the class-conditional distributions
before computing the posterior. Another approach we have used is to
directly estimate a sigmoid that performs this mapping. We have
integrated the SVMs into
the search engine that comes with the publicly available ISIP ASR
Toolkit.
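As a rough sketch of the sigmoid mapping and the scaled-likelihood conversion (the sigmoid parameters A and B and the class prior below are hypothetical; in practice the sigmoid would be estimated from held-out data):

```python
import numpy as np

def sigmoid_posterior(f, A, B):
    """Map SVM output distances f to posteriors via a sigmoid
    p(class | f) = 1 / (1 + exp(A * f + B))."""
    return 1.0 / (1.0 + np.exp(A * f + B))

def scaled_likelihood(posterior, prior):
    """Convert a posterior to a scaled likelihood by dividing out
    the class prior (Bayes' rule up to the constant p(x))."""
    return posterior / prior

# Hypothetical values: A and B would be fit on held-out data.
f = np.array([-2.0, 0.0, 1.5])        # SVM distances to the hyperplane
post = sigmoid_posterior(f, A=-1.0, B=0.0)
lik = scaled_likelihood(post, prior=0.1)
```

The scaled likelihoods can then be consumed by the HMM dynamic programming search in place of the usual state emission likelihoods.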
A novel approach we are pursuing uses the sufficient statistics
(Fisher scores) generated by the HMMs for classification. This
technique is motivated by the fact that the Fisher scores encode the
evolution or dynamics of the observation sequence while the other
kernels encode the static characteristics of a particular model.
While the choice of kernel is usually empirical, Fisher kernels have
the advantage of being based directly on the sufficient statistics of
the model. For HMMs these statistics are the gradient of the
likelihood with respect to each of the model parameters.
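For a single Gaussian the Fisher score has a closed form; the toy sketch below (not the full HMM case, where these gradients are accumulated over state occupancies) illustrates the gradient of the log-likelihood with respect to the mean and variance:

```python
import numpy as np

def gaussian_fisher_score(x, mu, var):
    """Fisher score for a univariate Gaussian: the gradient of
    log N(x; mu, var) with respect to the parameters (mu, var).
    For an HMM the same idea applies per mixture component,
    weighted by state/component occupancy probabilities."""
    d_mu = (x - mu) / var
    d_var = ((x - mu) ** 2 - var) / (2.0 * var ** 2)
    return np.array([d_mu, d_var])

score = gaussian_fisher_score(x=1.0, mu=0.0, var=1.0)
```

Note that the score is zero in expectation at the true parameters, so its deviation from zero captures how an observation pulls on the model, which is what makes it a useful feature for discrimination.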
Since dimensionality is not a problem when optimizing SVMs, one could
classify based on multiple frames of data, as connectionist systems
do. However, on large vocabulary tasks, training SVMs on frame-level
data could mean several hundred thousand training examples for each
classifier. To avoid dealing with such large training sets, we could
use segment-level data instead. The latter idea has been pursued in
preliminary experiments. Another interesting byproduct of this work
is the ability of the SVMs to identify mislabeled training data.
This is an important feature since it provides us with a nice way
of handling inaccurate training data especially since training data
for the SVM classifiers is generated by an HMM system that could
introduce errors.
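A minimal sketch of the multi-frame idea, assuming standard frame-level acoustic features (the window size and feature dimension below are illustrative, not those of the system described here):

```python
import numpy as np

def stack_frames(features, context=3):
    """Concatenate each frame with `context` frames on either side,
    padding the edges by repeating the boundary frames.
    `features` is (num_frames, dim); the output is
    (num_frames, dim * (2 * context + 1))."""
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(features)]
                      for i in range(2 * context + 1)])

X = np.random.randn(100, 13)          # e.g. 13-dimensional cepstral frames
X_multi = stack_frames(X, context=3)  # 91-dimensional stacked vectors
```

Because the SVM optimization is insensitive to input dimensionality, the stacked vectors can be classified directly, at the cost of many more training examples per classifier.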
The system has been evaluated on the OGI alphadigits corpus and
performs at 8.9% WER, as compared to 11.3% using an 8-mixture
crossword triphone HMM system and 9.8% using an 8-mixture syllable
HMM system. We are presently porting the technology to conversational
speech (SWITCHBOARD). In the next few months, we will have results
based on the Fisher kernel method and will have benchmarked the use
of multi-frame data.