In MLE, a training sample updates only the model of its correct class.
In MMIE, a training sample updates all models, correct and incorrect,
so every model changes even with a single training sample.
The stronger the prior information on the class assignment, the
greater its effect on the MMIE estimate.
If the assumed model family is correct (it contains the true underlying
distribution), MMIE and MLE should converge to the same result. In
practice, however, the MMIE estimate must yield a lower likelihood for
the training data under the true class assignment than the MLE estimate,
since MLE by definition maximizes exactly that likelihood.
MMIE and MLE are consistent estimators, but MMIE has greater variance.
MMIE tries not only to increase the likelihood of the correct class,
but also to decrease the likelihood of the incorrect classes, as the
criterion sketched below makes explicit.
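To make this explicit, here is one standard way to write the two
criteria (the notation is mine, not from these notes): X_r is the r-th
training observation, c_r its correct class, p_theta(X | c) the
class-conditional model, and P(c) the class prior.

    F_MLE(theta)  = sum_r log p_theta(X_r | c_r)

    F_MMIE(theta) = sum_r log [ p_theta(X_r | c_r) P(c_r) / sum_c p_theta(X_r | c) P(c) ]

Raising F_MMIE means raising the numerator (the likelihood of the
correct class) while lowering the denominator, which contains the
likelihoods of all the competing classes weighted by their priors; the
denominator is also where the class priors enter the estimator.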
MMIE is computationally expensive. Why?
How do we estimate the probability of the class assignment for
the incorrect classes?
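The notes leave these two questions open, so here is a toy sketch (my
own illustration in Python, not the CU/HTK lattice-based training; the
function name and toy Gaussian class models are hypothetical) of a
single MMIE gradient step for K one-dimensional Gaussian class models
with fixed variance. It shows why every model is touched by every
training sample and why the criterion is costly: the denominator needs
the likelihood of the observation under all competing classes, which
large-vocabulary systems approximate with word lattices.

import numpy as np

def mmie_gradient_step(means, x, correct, priors, var=1.0, lr=0.1):
    """One MMIE gradient step on the class means for a single observation x."""
    # Likelihood of x under every class model: this is the expensive part,
    # since the denominator of the MMIE criterion needs all of them.
    lik = np.exp(-0.5 * (x - means) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    post = priors * lik / np.sum(priors * lik)  # posterior P(c | x): the denominator term

    # Gradient of log p(x | c_correct) + log P(c_correct) - log sum_c P(c) p(x | c)
    # with respect to each mean: (1[c == correct] - post_c) * (x - mean_c) / var.
    target = np.zeros_like(means)
    target[correct] = 1.0                        # the numerator involves only the correct model
    grad = (target - post) * (x - means) / var   # the denominator involves every model
    return means + lr * grad

# Example: three classes, uniform priors, one training sample belonging to class 0.
means = np.array([0.0, 1.0, 2.0])
priors = np.ones(3) / 3.0
print(mmie_gradient_step(means, x=0.4, correct=0, priors=priors))

Running the example moves the mean of class 0 toward x and pushes the
other two means away from it; an MLE step on the same sample would
update only the mean of class 0, and the priors would play no part.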
Experimental results (Table 3): CU/HTK word error rates (%WER) on
eval97sub and eval98 using h5train00sub training:

MMIE Iteration   eval97sub   eval98
0 (MLE)          46.0        46.5
1                43.8        45.0
2                43.7        44.6
3                44.1        44.7
The results in Table 3 show that again
the peak improvement comes after two
iterations, but there is an even larger reduction in
WER: 2.3% absolute on eval97sub and 1.9% absolute on eval98.
The word error rate for the 1-best hypothesis from
the original bigram word lattices measured on
10% of the training data was 27.4%. The MMIE models obtained after
two iterations on the same portion of training data gave an error rate
of 21.2%, so again MMIE provided a very sizeable reduction in
training set error.