Continuous Density HMMs

The discrete HMM incorporates a discrete probability density function, captured in the matrix B, to describe the probability of outputting a symbol:

b_j(k) = P(\mathbf{o}_t = v_k \mid q_t = j)

Signal measurements, or feature vectors, are continuous-valued N-dimensional vectors. In order to use our discrete HMM technology, we must vector quantize (VQ) this data: reduce the continuous-valued vectors to discrete values chosen from a set of M codebook vectors. Initially, most HMMs were based on VQ front-ends. Recently, however, the continuous density model has become widely accepted.

Let us assume a parametric model of the observation pdf. The likelihood of generating observation \mathbf{o}_t in state j is defined as:

b_j(\mathbf{o}_t) = f(\mathbf{o}_t \mid \lambda_j)

Note that taking the negative logarithm of b_j(\mathbf{o}_t) will produce a log-likelihood, or a Mahalanobis-like distance. But what form should we choose for f? Let's assume a Gaussian model, of course:

b_j(\mathbf{o}_t) = \frac{1}{(2\pi)^{N/2} |\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{o}_t - \mu_j)^T \Sigma_j^{-1} (\mathbf{o}_t - \mu_j)\right)

Note that this amounts to assigning a mean vector and a covariance matrix to each state, a significant increase in complexity. However, shortcuts such as variance-weighting can help reduce complexity. Also, note that the negative log of the output probability at each state becomes precisely the Mahalanobis distance (principal components) we studied at the beginning of the course.

Mixture Distributions

Of course, the output distribution need not be Gaussian; it may be multimodal, reflecting the fact that several contexts are being encoded into a single state (male/female speakers, allophonic variations of a phoneme, etc.). Much as a VQ approach can model any discrete distribution, we can use a weighted linear combination of Gaussians, or a mixture distribution, to achieve a more complex statistical model. Mathematically, this is expressed as:

b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm} \, N(\mathbf{o}_t; \mu_{jm}, \Sigma_{jm})

In order for this to be a valid pdf, the mixture coefficients must be nonnegative and satisfy the constraint:

\sum_{m=1}^{M} c_{jm} = 1

Note that mixture distributions add significant complexity to the system: M means and covariances at each state.
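As a concrete illustration, the mixture output likelihood can be evaluated in log space, which also avoids underflow when the component densities are tiny. The following NumPy sketch (the function names are our own, not from any toolkit) computes log b_j(o_t) for a single state with an M-component Gaussian mixture, using a log-sum-exp over components:

```python
import numpy as np

def log_gaussian(o, mu, cov):
    """Log of the multivariate Gaussian density N(o; mu, cov)."""
    d = len(mu)
    diff = o - mu
    inv = np.linalg.inv(cov)
    logdet = np.linalg.slogdet(cov)[1]
    # -0.5 * (d*log(2*pi) + log|Sigma| + Mahalanobis distance)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ inv @ diff)

def mixture_log_likelihood(o, weights, means, covs):
    """log b_j(o) for one state: log sum_m c_m N(o; mu_m, Sigma_m)."""
    comps = [np.log(c) + log_gaussian(o, mu, cov)
             for c, mu, cov in zip(weights, means, covs)]
    # log-sum-exp over mixture components for numerical stability
    m = max(comps)
    return m + np.log(sum(np.exp(c - m) for c in comps))
```

Note that the quadratic form inside log_gaussian is exactly the Mahalanobis distance mentioned above; for a single component with weight 1, the mixture reduces to the plain Gaussian log-likelihood.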
Analogous reestimation formulae can be derived by defining the intermediate quantity \gamma_t(j,k), the probability of being in state j at time t with the k-th mixture component accounting for \mathbf{o}_t:

\gamma_t(j,k) = \left[\frac{\alpha_t(j)\,\beta_t(j)}{\sum_{j} \alpha_t(j)\,\beta_t(j)}\right] \left[\frac{c_{jk}\,N(\mathbf{o}_t; \mu_{jk}, \Sigma_{jk})}{\sum_{m=1}^{M} c_{jm}\,N(\mathbf{o}_t; \mu_{jm}, \Sigma_{jm})}\right]

The mixture coefficients can now be reestimated using:

\bar{c}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)}{\sum_{t=1}^{T} \sum_{k=1}^{M} \gamma_t(j,k)}

the mean vectors can be reestimated as:

\bar{\mu}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\,\mathbf{o}_t}{\sum_{t=1}^{T} \gamma_t(j,k)}

the covariance matrices can be reestimated as:

\bar{\Sigma}_{jk} = \frac{\sum_{t=1}^{T} \gamma_t(j,k)\,(\mathbf{o}_t - \bar{\mu}_{jk})(\mathbf{o}_t - \bar{\mu}_{jk})^T}{\sum_{t=1}^{T} \gamma_t(j,k)}

and the transition probabilities and initial probabilities are reestimated as usual.

The Viterbi procedure once again has a simpler interpretation: each observation vector is assigned to the single best mixture component along the best path, and the mixture coefficient is reestimated as the fraction of vectors associated with a given mixture at a given state:

\bar{c}_{jk} = \frac{\text{number of vectors assigned to mixture } k \text{ at state } j}{\text{number of vectors assigned to state } j}

State Duration Probabilities

Recall that the probability of staying in state i for d frames was given by an exponentially decaying distribution:

P(d_i = d) = (a_{ii})^{d-1}(1 - a_{ii})

This model is not necessarily appropriate for speech. There are three approaches in use today:

· Finite-state models (encoded in the acoustic model topology). Note that this model doesn't have skip states; with skip states, it becomes much more complex.
· Discrete state duration models (D parameters per state)
· Parametric state duration models (one to two parameters)

Reestimation equations exist for all three cases. Duration models are often important for larger units, such as words, where duration variations can be significant, but are less important for smaller units, such as context-dependent phones, where duration variations are much better understood and predicted.

Scaling in HMMs

As difficult as it may be to believe, standard HMM calculations exceed the precision of 32-bit floating-point numbers for even the simplest of models. The large number of multiplications of numbers less than one leads to underflow. Hence, we must incorporate some form of scaling. It is possible to scale the forward-backward calculation (see Section 12.2.5) by normalizing the calculations by:

c_t = \frac{1}{\sum_{j=1}^{S} \alpha_t(j)}

at each time step (time-synchronous normalization). However, a simpler and more effective way to scale is to work with log probabilities, which work out nicely in the case of continuous distributions.
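The time-synchronous normalization can be sketched as follows. This is a minimal NumPy implementation, assuming the per-frame observation likelihoods b_j(o_t) have already been evaluated into an array (the array and function names are illustrative, not from any toolkit); the log of the total observation probability is recovered as the sum of the logs of the per-frame normalizers:

```python
import numpy as np

def scaled_forward(A, B, pi):
    """Time-synchronous scaled forward pass.

    A  : (S, S) transition matrix, A[i, j] = a_ij
    B  : (T, S) observation likelihoods, B[t, j] = b_j(o_t)
    pi : (S,)  initial state probabilities
    Returns (alpha_hat, log_prob), where each alpha_hat[t] sums to 1
    and log_prob = log P(O | model).
    """
    T, S = B.shape
    alpha_hat = np.zeros((T, S))
    log_prob = 0.0
    a = pi * B[0]
    for t in range(T):
        if t > 0:
            a = (alpha_hat[t - 1] @ A) * B[t]
        c = a.sum()            # normalizer: sum_j alpha_t(j)
        alpha_hat[t] = a / c   # rescale so the alphas stay O(1)
        log_prob += np.log(c)  # accumulate log-probability
    return alpha_hat, log_prob
```

Because only the running sum of log normalizers grows with T, nothing underflows even for very long utterances, whereas the unscaled alphas would vanish after a few hundred frames.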
Even so, we need to somehow prevent the best-path score from growing with time (increasingly negative in the case of log probs). Fortunately, at each time step we can normalize all candidate best-path scores and proceed with the dynamic programming search. More on this later...

Similarly, it is often desirable to trade off the importance of transition probabilities and observation probabilities. Hence, the likelihood of an output symbol being observed at a state can be written as:

P = (a_{ij})^{\alpha} \, \big(b_j(\mathbf{o}_t)\big)^{\beta}

or, in the log prob space:

\log P = \alpha \log a_{ij} + \beta \log b_j(\mathbf{o}_t)

This result emphasizes the similarities between HMMs and DTW. The weights \alpha and \beta can be used to control the importance of the "language model."

An Overview of the Training Schedule

Note that a priori segmentation of the utterance is not required, and that the recognizer is forced to recognize the utterance during training (via the build grammar operation). This forces the recognizer to learn contextual variations, provided the seed model construction is done "properly."

What about speaker independence? Speaker dependence? Speaker adaptation? Channel adaptation?

(Figure: the training schedule, consisting of the operations Seed Model Construction, Recognize, Backtrace/Update, Replace Parameters, and Build Grammar.)
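The per-frame score normalization and the weighted combination of transition and observation log probabilities can both be folded into a single dynamic-programming step. The sketch below is a hedged illustration, not a production decoder; the weight names alpha and beta are our own, matching their roles as transition and observation scale factors:

```python
import numpy as np

def weighted_viterbi_step(prev_scores, logA, logb_t, alpha=1.0, beta=1.0):
    """One Viterbi DP step with weighted log-prob combination.

    prev_scores : (S,)   best-path log scores at time t-1
    logA        : (S, S) log transition probabilities, logA[i, j] = log a_ij
    logb_t      : (S,)   log observation likelihoods log b_j(o_t)
    alpha, beta : scale factors for transition vs. observation scores
    """
    # candidate for state j: max_i [prev_i + alpha*log a_ij] + beta*log b_j(o_t)
    cand = prev_scores[:, None] + alpha * logA
    best_prev = cand.argmax(axis=0)          # backpointers for traceback
    scores = cand.max(axis=0) + beta * logb_t
    # normalize per frame so scores don't grow increasingly negative
    scores -= scores.max()
    return scores, best_prev
```

Subtracting the per-frame maximum shifts every candidate by the same constant, so the argmax path (and hence the recognition result) is unchanged while the scores stay bounded.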