In MLE, a training sample updates only the model of its correct class.
In MMIE, a training sample updates all models, correct and incorrect,
so every model changes even with a single training sample.
The stronger the prior information on the class assignment, the
greater its effect on the MMIE estimate.
If the assumed model family is correct (it contains the true underlying
distribution), MMIE and MLE should converge to the same result. In
practice, however, the MMIE estimate must yield a lower likelihood for
the training data under the true class assignment than the MLE estimate,
since MLE by definition maximizes exactly that likelihood.
MMIE and MLE are consistent estimators, but MMIE has greater variance.
MMIE tries not only to increase the likelihood of the correct class,
but also to decrease the likelihood of the incorrect classes, as the
criterion sketched below makes explicit.
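To make this explicit, here is one standard way to write the two
criteria (the notation is mine, not from these notes): X_r is the r-th
training observation, c_r its correct class, p_theta(X | c) the
class-conditional model, and P(c) the class prior.

    F_MLE(theta)  = sum_r log p_theta(X_r | c_r)

    F_MMIE(theta) = sum_r log [ p_theta(X_r | c_r) P(c_r) / sum_c p_theta(X_r | c) P(c) ]

Raising F_MMIE means raising the numerator (the likelihood of the
correct class) while lowering the denominator, which contains the
likelihoods of all the competing classes weighted by their priors; the
denominator is also where the class priors enter the estimator.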
MMIE is computationally expensive. Why?
How do we estimate the probability of the class assignment for
the incorrect classes?
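The notes leave these two questions open, so here is a toy sketch (my
own illustration in Python, not the CU/HTK lattice-based training; the
function name and toy Gaussian class models are hypothetical) of a
single MMIE gradient step for K one-dimensional Gaussian class models
with fixed variance. It shows why every model is touched by every
training sample and why the criterion is costly: the denominator needs
the likelihood of the observation under all competing classes, which
large-vocabulary systems approximate with word lattices.

import numpy as np

def mmie_gradient_step(means, x, correct, priors, var=1.0, lr=0.1):
    """One MMIE gradient step on the class means for a single observation x."""
    # Likelihood of x under every class model: this is the expensive part,
    # since the denominator of the MMIE criterion needs all of them.
    lik = np.exp(-0.5 * (x - means) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    post = priors * lik / np.sum(priors * lik)  # posterior P(c | x): the denominator term

    # Gradient of log p(x | c_correct) + log P(c_correct) - log sum_c P(c) p(x | c)
    # with respect to each mean: (1[c == correct] - post_c) * (x - mean_c) / var.
    target = np.zeros_like(means)
    target[correct] = 1.0                        # the numerator involves only the correct model
    grad = (target - post) * (x - means) / var   # the denominator involves every model
    return means + lr * grad

# Example: three classes, uniform priors, one training sample belonging to class 0.
means = np.array([0.0, 1.0, 2.0])
priors = np.ones(3) / 3.0
print(mmie_gradient_step(means, x=0.4, correct=0, priors=priors))

Running the example moves the mean of class 0 toward x and pushes the
other two means away from it; an MLE step on the same sample would
update only the mean of class 0, and the priors would play no part.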
Experimental results (Table 3): CU/HTK word error rates (%WER) on
eval97sub and eval98 using h5train00sub training:

MMIE Iteration   eval97sub   eval98
0 (MLE)          46.0        46.5
1                43.8        45.0
2                43.7        44.6
3                44.1        44.7
The results in Table 3 show that again
the peak improvement comes after two
iterations, but there is an even larger reduction in
WER: 2.3% absolute on eval97sub and 1.9% absolute on eval98.
The word error rate for the 1-best hypothesis from
the original bigram word lattices measured on
10% of the training data was 27.4%. The MMIE models obtained after
two iterations on the same portion of training data gave an error rate
of 21.2%, so again MMIE provided a very sizeable reduction in
training set error.