This tutorial is the first in a sequence of three tutorials that
introduce you to the control flow of our hybrid approaches to
applying support vector machines (SVMs) and relevance
vector machines (RVMs) to continuous speech recognition.
Next month, I will explain in detail how to train and evaluate
an SVM system. In the following month, I will provide a similar
tutorial for the RVM system. Both of these systems will be
released with version r00_n12 of our
production system.
You can monitor the progress of this release on our
asr mailing list.
This tutorial provides an overview of the control flow of our
hybrid system for implementing risk minimization-based
approaches such as SVMs and RVMs. The software interface will be
described in detail in next month's tutorial.
The theory behind these implementations is covered in two
dissertations:
- A. Ganapathiraju,
Support Vector Machines for Speech Recognition,
Ph.D. Dissertation, Department of Electrical and Computer
Engineering, Mississippi State University, January 2002.
- J. Hamaker,
Sparse Bayesian Methods for Continuous Speech Recognition,
Ph.D. Dissertation Proposal, Department of Electrical and
Computer Engineering, Mississippi State University, March 2002.
A number of conference publications and seminars are also
available on our
publications web site.
Our hybrid system approach is a two-pass system that essentially
uses an HMM system to generate the N-best lists and alignments,
and an SVM (or RVM) system to rescore these results. Both passes are
described below.
1. HMM System:
Our baseline HMM system is described in detail in our
speech recognition system tutorial.
A summary is shown in the figure to the right.
There are three basic steps, given below, that are required to
build the HMM-based system:
feature extraction,
training, and
N-Best list generation.
Click on the links provided to learn more about these steps.
Below we will focus on how we integrate output from the HMM with
the SVM or RVM system.
2. HMM/SVM Hybrid System:
In this section, we cover how the HMM output is integrated into
the SVM-based hybrid system. The RVM-based hybrid system is
integrated with the HMM output in a similar way. The two basic
steps to build the SVM-based hybrid system are SVM training and
SVM-based N-best list rescoring. Both steps require input from
the HMM system. These requirements are discussed in the two
sections below.
- SVM Training:
Training the support vector models (SVMs) consists of a
sequence of four steps: Segmental Feature Generation,
Segmental Feature Selection, Core SVM Training, and
Posterior Creation. For simplicity, let's assume that we
are training one SVM for each monophone
contained in the phone set of the database under
consideration. The output at the end of this sequence of
four steps is a set of posteriors, one for each
of the monophones. A block diagram overview of the flow
of SVM training is provided in the figure shown below.
i. Segmental Feature Generation on a Symbol (Phone)
Basis:
The first step in SVM Training is to extract the
segmental features from the mel-cepstral (MFCC) training
data using the corresponding segmental information
(symbol-alignments). The segmental information is
generated using the baseline HMM system. A typical
segmental feature vector is extracted from the
mel-cepstral feature vectors that correspond to one of
the symbol alignments. The details of this process are
documented extensively in Aravind Ganapathiraju's
dissertation. The segmental feature vectors
corresponding to all instances of a
specific symbol (phone) in the training data are
concatenated together. Thus, the concatenated segmental
feature vectors represent the in-class data for a
specific symbol. Following this procedure, in-class data
for all the symbols (monophones) is generated. Next,
out-of-class data is generated.
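The grouping of segmental features by symbol can be sketched as
follows. This is only an illustration under stated assumptions,
not the procedure from the dissertation: we assume each alignment
is a (symbol, start_frame, end_frame) triple, that a segment is
split into three regions in assumed 0.3/0.4/0.3 proportions whose
frames are averaged, and that the log of the segment duration is
appended. The names segmental_feature and in_class_data are
hypothetical.

```python
import math
from collections import defaultdict


def segmental_feature(frames, proportions=(0.3, 0.4, 0.3)):
    """Map a variable-length run of MFCC frames to one fixed-length vector.

    The region proportions and the appended log-duration term are
    assumptions made for this sketch.
    """
    n = len(frames)
    dim = len(frames[0])

    # Integer region boundaries from the cumulative proportions.
    bounds = [0]
    cumulative = 0.0
    for p in proportions:
        cumulative += p
        bounds.append(round(cumulative * n))
    bounds[-1] = n

    feature = []
    for i in range(len(proportions)):
        # Guarantee at least one frame per region for very short segments.
        lo = min(bounds[i], n - 1)
        hi = max(bounds[i + 1], lo + 1)
        region = frames[lo:hi]
        feature.extend(sum(f[d] for f in region) / len(region)
                       for d in range(dim))

    feature.append(math.log(n))  # duration in frames, on a log scale
    return feature


def in_class_data(utterances):
    """Collect segmental features per symbol.

    utterances: list of (frames, alignments), where alignments is a
    list of (symbol, start_frame, end_frame) triples from the HMM pass.
    """
    data = defaultdict(list)
    for frames, alignments in utterances:
        for symbol, start, end in alignments:
            data[symbol].append(segmental_feature(frames[start:end]))
    return dict(data)
```

Every segment, whatever its length, maps to a vector of the same
dimension, which is what allows a fixed-input classifier such as
an SVM to be trained on it.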
ii. Segmental Feature Selection:
In order to train an SVM for a symbol, we require both
in-class data and out-of-class data corresponding to
that symbol. In this step, the out-of-class data for
each symbol is selected from the in-class data of
the other symbols.
The training data for a symbol is constructed by
selecting equal amounts of segmental features from
in-class data (data belonging to the same symbol)
and out-of-class data (data belonging to other
symbols) for that specific symbol. Further, half of
the out-of-class data is randomly selected from the
set of phonetically similar symbols (phones), and the
remaining half is randomly selected from the rest of the
symbols (phones). This process is illustrated as
feature selection in the figure shown to the
right.
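The selection rule described above can be sketched as follows.
This is a minimal illustration, assuming the in-class data is a
dictionary from symbol to a list of segmental feature vectors and
that a table of phonetically similar symbols is available. The
function name, the layout of the similarity table, and the
handling of pools that are too small are all assumptions.

```python
import random


def select_training_data(symbol, in_class, similar, seed=0):
    """Build a balanced training set for one symbol.

    in_class: dict mapping symbol -> list of segmental feature vectors
    similar:  dict mapping symbol -> set of phonetically similar symbols
              (a hypothetical lookup table)
    Returns (positives, negatives).
    """
    rng = random.Random(seed)
    positives = list(in_class[symbol])
    near_symbols = set(similar.get(symbol, ())) - {symbol}

    # Pool of vectors from phonetically similar symbols.
    near = [v for s in sorted(near_symbols) for v in in_class.get(s, [])]
    # Pool of vectors from all remaining symbols.
    far = [v for s in sorted(in_class)
           if s != symbol and s not in near_symbols
           for v in in_class[s]]

    # Half of the out-of-class data from each pool; if a pool is too
    # small we simply take what is there (an assumption of this sketch).
    n_near = min(len(positives) // 2, len(near))
    n_far = min(len(positives) - n_near, len(far))
    negatives = rng.sample(near, n_near) + rng.sample(far, n_far)
    return positives, negatives
```

Fixing the seed makes the random selection repeatable, which is
convenient when comparing training runs.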
iii. Core SVM Training:
Next, as shown in the figure to the right, given the
in-class and out-of-class data, the support vector models
corresponding to each symbol (phone) in the
symbol set are trained using a one-vs-all training
scheme. The output of the SVM training is the set of
support vectors corresponding to each of the symbols.
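The one-vs-all scheme can be sketched as follows. To keep the
example self-contained, it substitutes a linear SVM trained by
sub-gradient descent on the hinge loss for the SVM trainer used
in our system, so each model here is a weight vector and a bias
rather than a set of support vectors. The function names, the
learning rate, and the regularization constant are assumptions.

```python
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(positives, negatives, lam=0.01, lr=0.01,
                     epochs=100, seed=0):
    """Train one linear SVM by sub-gradient descent on the hinge loss.

    This is a stand-in for a full SVM trainer, chosen for brevity.
    """
    rng = random.Random(seed)
    data = [(x, 1.0) for x in positives] + [(x, -1.0) for x in negatives]
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            violated = y * (dot(w, x) + b) < 1.0
            # Shrink the weights (regularization term).
            w = [wi * (1.0 - lr * lam) for wi in w]
            if violated:
                # Step toward satisfying the margin for this example.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def train_one_vs_all(training_sets):
    """training_sets: dict symbol -> (in_class, out_of_class) vectors.

    Returns one model per symbol, each separating that symbol's data
    from the data selected from the other symbols.
    """
    return {symbol: train_linear_svm(pos, neg)
            for symbol, (pos, neg) in training_sets.items()}


def svm_distance(model, x):
    """Signed distance-like score of x under one symbol's model."""
    w, b = model
    return dot(w, x) + b
```

The signed score returned by svm_distance plays the role of the
SVM distance that is later converted to a posterior.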
iv. Posterior Creation:
Finally, for each of the SVMs trained in the previous
step, a posterior probability is estimated by fitting
a sigmoid to a set of SVM distances. The set of
distances for each SVM is
computed by classifying a cross-validation set with that
SVM.
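Fitting a sigmoid to SVM distances can be sketched as follows.
This is a minimal illustration, assuming the sigmoid has the
two-parameter form p = 1 / (1 + exp(A*d + B)) and that A and B
are found by plain gradient descent on the cross-entropy over the
cross-validation set. The optimizer, step size, and function
names are assumptions, not a description of our implementation.

```python
import math


def posterior(d, a, b):
    """p(in-class | distance d) = 1 / (1 + exp(a*d + b)), computed stably."""
    z = a * d + b
    if z > 0:
        ez = math.exp(-z)
        return ez / (1.0 + ez)
    return 1.0 / (1.0 + math.exp(z))


def fit_sigmoid(distances, labels, lr=0.01, iterations=5000):
    """Fit the sigmoid parameters (a, b) for one SVM.

    distances: SVM distances of the cross-validation examples
    labels:    1 for in-class examples, 0 for out-of-class examples
    """
    a, b = 0.0, 0.0
    n = len(distances)
    for _ in range(iterations):
        grad_a = grad_b = 0.0
        for d, y in zip(distances, labels):
            p = posterior(d, a, b)
            # Gradient of the cross-entropy with respect to a and b.
            grad_a += (y - p) * d
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b
```

With this parameterization a well-behaved fit has a negative A,
so that larger distances on the in-class side of the decision
surface map to posteriors closer to one.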
- SVM-based N-best List Rescoring:
In this section, we will describe the process of
SVM-based N-best list rescoring. The figure below gives
an overview of the decoding process used in the HMM/SVM
hybrid system. First, the segmental features are
extracted from the mel-cepstral (MFCC) test data using
the segmental information from the baseline HMM
system. In the second step, the N-best list output from
the baseline HMM system is rescored to get the final
hypotheses.
i. Segmental feature generation on an utterance basis:
Instead of generating segmental features on a symbol
basis, as done during the training process, the
segmental features are generated on an utterance
basis. First, the symbol alignments corresponding to
all the utterances in the test database are generated
using the baseline HMM system. Then, the segmental
features for the utterances in the test database are
extracted using the corresponding segmental
information. Note that, unlike SVM training, no
feature selection is required during
decoding. Next, decoding is performed.
ii. N-best rescoring:
In the first pass of decoding, the N-best lists
corresponding to each utterance in the test
database are generated by the baseline HMM
system. These N-best lists are then rescored using
the SVM models (posteriors) in the second pass to get
the output hypotheses. The rescoring process is
documented extensively in Aravind Ganapathiraju's
dissertation.
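The second pass can be sketched as follows. This is only an
illustration under stated assumptions: each hypothesis is
represented as a list of (symbol, segmental_feature) pairs, and a
hypothesis is scored by summing the log posteriors of its
segments. How the SVM scores are combined with the HMM and
language model scores is not covered here, and the function names
are hypothetical.

```python
import math


def rescore_nbest(nbest, posterior_fn):
    """Rank the hypotheses of one utterance by summed log posterior.

    nbest:        list of hypotheses; each is a list of
                  (symbol, segmental_feature) pairs from the HMM pass
    posterior_fn: callable (symbol, feature) -> probability, standing
                  in for that symbol's SVM followed by its sigmoid
    Returns a list of (score, hypothesis), best first.
    """
    scored = []
    for hypothesis in nbest:
        score = sum(
            # Floor the probability so log() never sees zero.
            math.log(max(posterior_fn(symbol, feature), 1e-300))
            for symbol, feature in hypothesis
        )
        scored.append((score, hypothesis))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Usage with a toy posterior table in place of trained models.
toy_table = {("k", 0): 0.9, ("g", 0): 0.2, ("ae", 1): 0.8,
             ("t", 2): 0.9, ("p", 2): 0.4}


def toy_posterior(symbol, feature):
    return toy_table[(symbol, feature[0])]


toy_nbest = [
    [("g", [0]), ("ae", [1]), ("t", [2])],
    [("k", [0]), ("ae", [1]), ("t", [2])],
    [("k", [0]), ("ae", [1]), ("p", [2])],
]
ranked = rescore_nbest(toy_nbest, toy_posterior)
best_symbols = [symbol for symbol, _ in ranked[0][1]]
```

Working in the log domain keeps the score of a long hypothesis
from underflowing, since it replaces a product of many small
probabilities with a sum.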
The overview of the structure and implementation of the
HMM/SVM and HMM/RVM systems presented in this tutorial
will guide the interface to the implementation of these systems
in the ISIP Foundation Classes (IFCs). The next tutorial
will cover the details of this interface and will include
examples of how to train and decode using the SVM/HMM hybrid
system. A similar tutorial on the RVM/HMM system will follow
later.