Although speech recognition has been a well studed field over the past few
decades, it can be difficult to find tutorials that offer clear and useful
insight into the actual process of running a task. This website aims to provide
resources for new researchers to begin working on speech recognition
experiments.
There are several different open-source speech recognisers available to the
public. ISIP, Carnegie Melon's Sphinx, and Cambridge's HTK are some of the most
popularly used systems. For our intents and purposes these tutorials focus on
using HTK since a fair amount of documentation exists and there are additional
resources available should someone running these tutorials have problems.
As the figure to the right shows, speech recognition can be broken down into a
few key processes. In these tutorials we'll focus on three key steps:
data preparation, acoustic modeling (i.e. training), and decoding (i.e.
obtainin results).
- Data Preparation:
Before doing anything, you must first ensure everything is in the correct
format and that you have all of the proper files for your experiment.
In this phase, you'll accomplish a few things including building a language
model for your task, converting your audio files into features to train
acoustic models, building a lexicon (i.e. reducing your
dictionary to a list of used words), generating lists of monophones or
triphones that exist in your data, and creating transcription files that are
needed to both train acoustic models and also later for decoding your results.
It's worth noting that the acoustic features mentioned above are most often
vectors of Mel Frequency Cepstral Coefficients (MFCC's).
- Acoustic Modeling / Training:
Once we have everything in the right format, we can begin to train our
acoustic models. The most common graphical structure for these are Hidden
Markov Models (HMM's). To train these acoustic models we use a multivariate
Gaussian distribution where each variable is represented by one of the
MFCC features. HTK uses the Baum-Welch algorithm (i.e. similar to the EM
algorithm) to estimate and maximize the means and covariances of the
models. Typically researchers expand these distributions to incorporate
Gaussian mixture models as well.
If you're not comfortable with some of this terminology, Lawrence Rabiner
has written a very useful explanation in A Tutorial on Hidden Markov
Models and Selected Applications in Speech Recognition.
- Decoding:
After we've finished training our HMM acoustic models, we use them to generate
a set of transcriptions for our testing data. We compare the generated results
to the actual transcriptions to determine the word error rate (WER) of the
system.