Training Discrete Observation HMMs
Training refers to the problem of finding new parameter estimates such that the updated model, $\bar{\mathcal{M}}$, after an iteration of training, better represents the training data than the previous model, $\mathcal{M}$. The number of states is usually not varied or reestimated, other than via the modification of the model inventory. The a priori probabilities of a model, $P(\mathcal{M})$, are normally not reestimated either, since these typically come from the language model.

The first algorithm we will discuss is one based on the Forward-Backward algorithm (Baum-Welch Reestimation):

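Also, $u(t)$ denotes a random variable that models the transition taken at time $t$, and $y_j(t)$ a random variable that models the observation being emitted at state $j$ at time $t$. The symbol "·" is used to denote an arbitrary event; for example, if $u_{j|i}$ labels the transition from state $i$ to state $j$, then $u_{\cdot|i}$ denotes any transition out of state $i$.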

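Next, we need to define some intermediate quantities related to particular events at a given state at a given time. Let $\mathbf{y} = y(1), \ldots, y(T)$ denote the observation sequence, $x(t)$ the state occupied at time $t$, and $\mathcal{M}$ the model:

$$\xi(i,j;t) = P(x(t)=i,\, x(t+1)=j \mid \mathbf{y}, \mathcal{M}) = \frac{\alpha(i,t)\, a(j|i)\, b(y(t+1)|j)\, \beta(j,t+1)}{P(\mathbf{y} \mid \mathcal{M})}, \quad t = 1, \ldots, T-1,$$

where the sequences $\alpha$, $\beta$, $a$, and $b$ were defined previously (last lecture): $\alpha(i,t) = P(y(1), \ldots, y(t),\, x(t)=i \mid \mathcal{M})$ is the forward probability, $\beta(j,t) = P(y(t+1), \ldots, y(T) \mid x(t)=j, \mathcal{M})$ is the backward probability, $a(j|i)$ is the probability of a transition from state $i$ to state $j$, and $b(k|j)$ is the probability of observing symbol $k$ at state $j$.
Intuitively, we can think of this as the probability of observing a transition from state $i$ to state $j$ at time $t$ for a particular observation sequence, $\mathbf{y}$ (the utterance in progress), and model $\mathcal{M}$.

We can also make the following definition:

$$\gamma(i;t) = P(x(t)=i,\, x(t+1)=\cdot \mid \mathbf{y}, \mathcal{M}) = \sum_{j} \xi(i,j;t), \quad t = 1, \ldots, T-1.$$

This is the probability of exiting state $i$ at time $t$. Also,

$$\nu(j;t) = P(x(t)=j \mid \mathbf{y}, \mathcal{M}) = \frac{\alpha(j,t)\, \beta(j,t)}{P(\mathbf{y} \mid \mathcal{M})}, \quad t = 1, \ldots, T,$$

which is the probability of being in state $j$ at time $t$. Finally,

$$\delta(j,k;t) = P(x(t)=j,\, y(t)=k \mid \mathbf{y}, \mathcal{M}) = \begin{cases} \nu(j;t), & y(t) = k \\ 0, & \text{otherwise,} \end{cases}$$

which is the probability of observing symbol $k$ at state $j$ at time $t$.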
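
As an illustration of how the forward and backward probabilities yield the transition and state-occupancy probabilities described above, here is a minimal sketch in plain Python. The function name `forward_backward`, the list-of-lists layout (`A[i][j]` for transitions, `B[j][k]` for output probabilities, `pi[i]` for initial state probabilities), and the toy numbers are assumptions made for this example, not part of the lecture. Time is indexed from 0 in the code, and no scaling is applied, so it is only suitable for short sequences.

```python
def forward_backward(A, B, pi, obs):
    """Return (xi, nu, likelihood) for one observation sequence.

    A[i][j] = probability of moving from state i to state j
    B[j][k] = probability of emitting symbol k in state j
    pi[i]   = probability of starting in state i
    obs     = list of symbol indices, one per time step
    """
    N, T = len(A), len(obs)

    # Forward pass: alpha[t][i] = P(obs[0..t], state at t is i | model)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            s = sum(alpha[t - 1][i] * A[i][j] for i in range(N))
            alpha[t][j] = s * B[j][obs[t]]

    # Backward pass: beta[t][i] = P(obs[t+1..T-1] | state at t is i, model)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))

    likelihood = sum(alpha[T - 1])  # P(obs | model)

    # xi[t][i][j]: probability of the transition i -> j at time t
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
            / likelihood
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # nu[t][j]: probability of being in state j at time t
    nu = [[alpha[t][j] * beta[t][j] / likelihood for j in range(N)]
          for t in range(T)]
    return xi, nu, likelihood


# Toy two-state, two-symbol model (illustrative numbers only).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
obs = [0, 1, 1, 0]
xi, nu, likelihood = forward_backward(A, B, pi, obs)
```

Summing `xi[t][i][j]` over `j` gives the probability of exiting state `i` at time `t`, which matches `nu[t][i]` for every time step except the last.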

Note that we make extensive use of the forward and backward probabilities in these computations. This will be key to reducing the complexity of the computations by allowing an iterative computation.
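From these four quantities, we can define four more intermediate quantities by summing over time:

$$\xi(i,j) = \sum_{t=1}^{T-1} \xi(i,j;t), \quad \gamma(i) = \sum_{t=1}^{T-1} \gamma(i;t), \quad \nu(j) = \sum_{t=1}^{T} \nu(j;t), \quad \delta(j,k) = \sum_{t=1}^{T} \delta(j,k;t),$$

where $\xi(i,j;t)$, $\gamma(i;t)$, $\nu(j;t)$, and $\delta(j,k;t)$ are the probabilities at time $t$ of, respectively, a transition from state $i$ to state $j$, an exit from state $i$, occupancy of state $j$, and observation of symbol $k$ at state $j$.

Finally, we can begin relating these quantities to the problem of reestimating the model parameters. Let us define four more random variables:

$n(u_{j|i})$ = the number of transitions from state $i$ to state $j$ in a state sequence,
$n(u_{\cdot|i})$ = the number of transitions out of state $i$,
$n(x = j)$ = the number of times state $j$ is visited,
$n(y_j = k)$ = the number of times symbol $k$ is observed at state $j$.

We can see that:

$$\xi(i,j) = E\{n(u_{j|i}) \mid \mathbf{y}, \mathcal{M}\}, \quad \gamma(i) = E\{n(u_{\cdot|i}) \mid \mathbf{y}, \mathcal{M}\}, \quad \nu(j) = E\{n(x = j) \mid \mathbf{y}, \mathcal{M}\}, \quad \delta(j,k) = E\{n(y_j = k) \mid \mathbf{y}, \mathcal{M}\}.$$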

What we have done up to this point is to develop expressions for the estimates of the underlying components of the model parameters in terms of the state sequences that occur during training.

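But how can this be when the internal structure of the model is hidden? The counts are never observed directly; instead, we use their expected values, computed over all possible state sequences under the current model.
Following this line of reasoning, an estimate of the transition probability is:

$$\bar{a}(j|i) = \frac{\xi(i,j)}{\gamma(i)},$$

the expected number of transitions from state $i$ to state $j$ divided by the expected number of transitions out of state $i$.

Similarly,

$$\bar{b}(k|j) = \frac{\delta(j,k)}{\nu(j)},$$

the expected number of times symbol $k$ is observed at state $j$ divided by the expected number of visits to state $j$.

Finally,

$$\bar{P}(x(1) = i) = \gamma(i;1),$$

the probability of exiting state $i$ at time $t = 1$, which serves as the estimate of the initial state probability.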

This process is often called reestimation by recognition, because we need to recognize the input with the previous models so that we can compute the new model parameters from the set of state sequences used to recognize the data (hence, the need to iterate).
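
The reestimation step can be sketched as follows: given the time-indexed transition and occupancy probabilities obtained with the previous model, sum them over time and take ratios of expected counts. The function name `reestimate`, its argument layout, and the small hand-made probability tables are assumptions for illustration only. The sketch also assumes every state has a nonzero expected number of visits and exits, so no division by zero occurs.

```python
def reestimate(xi, nu, obs, num_symbols):
    """One reestimation step from time-indexed posterior probabilities.

    xi[t][i][j] = probability of the transition i -> j at time t (t = 0..T-2)
    nu[t][j]    = probability of being in state j at time t     (t = 0..T-1)
    obs[t]      = index of the symbol observed at time t
    """
    T, N = len(nu), len(nu[0])

    # Expected number of i -> j transitions, summed over time.
    xi_sum = [[sum(xi[t][i][j] for t in range(T - 1)) for j in range(N)]
              for i in range(N)]
    # Expected number of exits from each state.
    gamma_sum = [sum(xi_sum[i]) for i in range(N)]
    # Expected number of visits to each state.
    nu_sum = [sum(nu[t][j] for t in range(T)) for j in range(N)]

    # delta_sum[j][k]: expected number of times symbol k is seen in state j.
    delta_sum = [[0.0] * num_symbols for _ in range(N)]
    for t in range(T):
        for j in range(N):
            delta_sum[j][obs[t]] += nu[t][j]

    new_A = [[xi_sum[i][j] / gamma_sum[i] for j in range(N)]
             for i in range(N)]
    new_B = [[delta_sum[j][k] / nu_sum[j] for k in range(num_symbols)]
             for j in range(N)]
    new_pi = list(nu[0])  # probability of occupying each state at the first step
    return new_A, new_B, new_pi


# Hand-made, mutually consistent posteriors for T = 3 and two states.
xi = [[[0.5, 0.1], [0.1, 0.3]],
      [[0.4, 0.2], [0.1, 0.3]]]
nu = [[0.6, 0.4], [0.6, 0.4], [0.5, 0.5]]
obs = [0, 0, 1]
new_A, new_B, new_pi = reestimate(xi, nu, obs, num_symbols=2)
```

In a full training loop, `xi` and `nu` would be recomputed with the new parameters before each call, which is the iteration the text refers to.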

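But will it converge? Baum and his colleagues showed that the new model guarantees that:

$$P(\mathbf{y} \mid \bar{\mathcal{M}}) \geq P(\mathbf{y} \mid \mathcal{M}),$$

where $\mathbf{y}$ is the training observation sequence, $\mathcal{M}$ is the previous model, and $\bar{\mathcal{M}}$ is the reestimated model.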
Since this is a highly nonlinear optimization, training can get stuck in a local maximum of the likelihood rather than reaching the global maximum.
We can overcome this by starting training from a different initial point, or by "bootstrapping" models from previous models.

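Analogous procedures exist for the Viterbi algorithm, though they are much simpler and more intuitive (and more DP-like):

$$\bar{a}(j|i) = \frac{n(u_{j|i})}{n(u_{\cdot|i})}$$

and,

$$\bar{b}(k|j) = \frac{n(y_j = k)}{n(x = j)},$$

where the counts are taken along the single best state sequence found by the Viterbi algorithm: $n(u_{j|i})$ is the number of transitions from state $i$ to state $j$, $n(u_{\cdot|i})$ is the number of transitions out of state $i$, $n(y_j = k)$ is the number of times symbol $k$ is observed at state $j$, and $n(x = j)$ is the number of visits to state $j$.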
These have been shown to give comparable performance to the forward-backward algorithm at significantly reduced computation. This approach also generalizes to alternate formulations of the topology of the acoustic model (or language model) drawn from formal language theory. (In fact, we can even eliminate the first-order Markovian assumption.)
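
A minimal sketch of the Viterbi-style procedure, under the same assumed list-of-lists model layout (`A`, `B`, `pi`, all hypothetical names): first find the single best state sequence, then replace the expected counts with plain counts along that path. The helper names `viterbi_path` and `count_reestimate` and the toy numbers are illustrative, not taken from the lecture.

```python
def viterbi_path(A, B, pi, obs):
    """Most likely state sequence for obs under the model (A, B, pi)."""
    N, T = len(A), len(obs)
    score = [[0.0] * N for _ in range(T)]  # best path probability ending in j at t
    back = [[0] * N for _ in range(T)]     # best predecessor of state j at time t
    for j in range(N):
        score[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):
        for j in range(N):
            best_i = max(range(N), key=lambda i: score[t - 1][i] * A[i][j])
            back[t][j] = best_i
            score[t][j] = score[t - 1][best_i] * A[best_i][j] * B[j][obs[t]]
    # Trace back from the best final state.
    state = max(range(N), key=lambda j: score[T - 1][j])
    path = [state]
    for t in range(T - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    path.reverse()
    return path


def count_reestimate(path, obs, N, num_symbols):
    """Relative-frequency estimates from a single state sequence."""
    trans = [[0] * N for _ in range(N)]
    emit = [[0] * num_symbols for _ in range(N)]
    for t, (state, symbol) in enumerate(zip(path, obs)):
        emit[state][symbol] += 1
        if t + 1 < len(path):
            trans[state][path[t + 1]] += 1
    # A state that is never exited (or never visited) keeps an all-zero row.
    new_A = [[c / max(sum(row), 1) for c in row] for row in trans]
    new_B = [[c / max(sum(row), 1) for c in row] for row in emit]
    return new_A, new_B


# Toy two-state, two-symbol model (illustrative numbers only).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
obs = [0, 0, 1, 1]
path = viterbi_path(A, B, pi, obs)
new_A, new_B = count_reestimate(path, obs, N=2, num_symbols=2)
```

Because only one path contributes, each update is a simple counting pass. With so little data, several estimates are exactly zero, which is why a single short sequence is not enough in practice.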

Further, the above algorithms are easily applied to many problems associated with language modeling: estimating transition probabilities and word probabilities, efficient parsing, and learning hidden structure.

But what if a transition is never observed in the training database?
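
To make the problem concrete: with count-based estimates, a transition that never occurs in training receives a probability of exactly zero, and reestimation cannot raise it again. The sketch below shows this, along with one simple remedy, adding a small constant to every count before normalizing. That remedy is an assumption for illustration; the lecture does not state which method it uses. The function name, the `floor` parameter, and the counts are hypothetical.

```python
def relative_frequencies(counts, floor=0.0):
    """Turn one row of transition counts into probabilities.

    floor is a hypothetical smoothing constant added to every count
    before normalizing; floor=0.0 gives the plain count-based estimate.
    """
    total = sum(counts) + floor * len(counts)
    return [(c + floor) / total for c in counts]


# Made-up counts out of state 0; the transition 0 -> 2 was never observed.
counts_from_state_0 = [8, 2, 0]
plain = relative_frequencies(counts_from_state_0)
floored = relative_frequencies(counts_from_state_0, floor=0.5)
```

Here `plain` assigns zero probability to the unseen transition, while `floored` assigns it a small nonzero probability at the cost of slightly lowering the others.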


[Figure: optimization surface with a global maximum and a local maximum labeled.]