IFLIR: IMPROVED MOTION DETECTION USING FORWARD-LOOKING INFRARED
Speech recognition can be viewed as a pattern recognition problem
where we desire each unique sound to be distinguishable from all other
sounds. Traditionally, statistical models such as Gaussian mixture
models have been built to "represent" the various units of speech.
However, the lack of knowledge about the true underlying distribution
has forced us to look at alternative techniques that focus on
"discrimination" instead of "representation".
Hidden Markov Models (HMMs) have been the most successful
classification paradigm for speech recognition. Traditionally, the
model parameters are estimated using the Maximum Likelihood (ML)
criterion. In contrast, estimation techniques such as Maximum Mutual
Information (MMI) and Minimum Classification Error (MCE) have been
developed for discriminative estimation of the model parameters. The
effort required to estimate parameters with these discriminative
techniques is significantly greater than for ML estimation. Other
classifiers, such as neural networks, also have parameters that are
estimated discriminatively. However, these systems cannot easily be
used directly to model the dynamic nature of speech, so hybrid systems
that pair them with a dynamic model are used instead.
Support Vector Machines (SVMs) are a new class of machine learning
techniques that learn to classify discriminatively. This paradigm has
gained significance in the past few years with the development of
efficient training algorithms. SVMs are based on the idea that
data can be transformed into a very high dimensional feature space
where it can be classified using a simple linear hyperplane. Though
this task seems daunting (especially when the dimension of the feature
space is a few thousand), the theory of kernels gives an elegant
solution to this problem and makes it computationally feasible even
for large tasks. Like neural networks, SVMs are inherently
static classifiers. The dynamic nature of the data must therefore be
handled using a hybrid method built on a dynamic model such as an HMM.
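The role of kernels described above can be illustrated with a minimal
sketch. It assumes a homogeneous degree-2 polynomial kernel on
two-dimensional inputs, chosen here purely for illustration and not
taken from the proposed system; the function names are hypothetical.
The point is that the kernel value equals an inner product in the
transformed feature space, so the transformation never has to be
computed explicitly.

```python
import math

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel (x . y)^2, computed in the input space."""
    d = sum(a * b for a, b in zip(x, y))
    return d * d

def explicit_map(x):
    """Explicit feature map for 2-D input matching poly2_kernel (illustrative)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# Two arbitrary example points (made-up values).
x = (1.0, 2.0)
y = (3.0, -1.0)

k_implicit = poly2_kernel(x, y)                      # kernel in input space
k_explicit = dot(explicit_map(x), explicit_map(y))   # inner product in feature space
```

Both values agree up to floating-point rounding. For higher-degree or
Gaussian kernels the explicit feature space grows very large or
becomes infinite-dimensional, while the cost of evaluating the kernel
itself stays tied to the input dimension.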
Preliminary experiments on classifying speech data at the frame and
phone level have been very encouraging. In these experiments, SVMs
outperformed most other non-linear classifiers, including neural
networks and Bayes classifiers. Another interesting by-product of this
work is the ability of SVMs to identify mislabeled training data. This
is an important feature since it provides a means of handling
inaccurate training data.
In the proposed research, I will develop a hybrid HMM/SVM system to
recognize conversational speech. HMMs give us an elegant method to
handle the dynamics of speech, and SVMs provide us with powerful
classifiers of static data. SVMs will be used to generate the
probability of the data given the model, which will then be processed
using the dynamic programming approach commonly employed for
HMMs. Another approach that will be pursued is the use of Fisher
kernels, which derive a kernel function from a generative model such
as an HMM.
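The hybrid idea of feeding SVM scores into HMM-style dynamic
programming can be sketched as follows. This is an illustrative toy,
not the proposed system. It assumes that raw SVM distances are mapped
to pseudo-probabilities with a sigmoid (the parameters a and b are
hypothetical), that these values are used directly as per-frame
emission scores, and that decoding uses the standard Viterbi
recursion. All numbers are made up.

```python
import math

def sigmoid_posterior(distance, a=-1.0, b=0.0):
    """Map a raw SVM distance to a pseudo-probability (a, b are assumed values)."""
    return 1.0 / (1.0 + math.exp(a * distance + b))

def viterbi(log_emis, log_trans, log_init):
    """Return the best state path for log_emis[t][s], log_trans[p][s], log_init[s]."""
    n_states = len(log_init)
    score = [log_init[s] + log_emis[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, len(log_emis)):
        new_score, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emis[t][s])
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Toy input: one SVM distance per state for each of four frames (made up).
svm_distances = [[2.0, -2.0], [1.5, -1.0], [-1.0, 1.0], [-2.0, 2.5]]
log_emis = [[math.log(sigmoid_posterior(d)) for d in frame]
            for frame in svm_distances]
log_trans = [[math.log(0.7), math.log(0.3)],
             [math.log(0.3), math.log(0.7)]]
log_init = [math.log(0.5), math.log(0.5)]

best_path = viterbi(log_emis, log_trans, log_init)
```

With these toy values the decoded path stays in state 0 for the first
two frames and switches to state 1 for the last two. A real system
would need to decide how to turn classifier outputs into quantities
comparable to HMM emission likelihoods; the sigmoid mapping here is
only one possible choice.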