- issues
- which unit do we apply MCE to -- sentences, words, or phones?
- zero-one loss function inappropriate for high WER
- search space reduction needed -- N-best paradigm
- word/string level loss
- define discriminant function for target class (a word in
this case)
- use the misclassification measure and loss defined previously
- use N-best paradigm and accumulate statistics for each class
- MCE will push the models that form the misrecognized word
away from the models that form the correct word
- computations
- discriminant function related to the probability of the
state sequence
- loss related to discriminant function
- update the model parameters using the chain rule and GPD (generalized probabilistic descent)
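
The computation steps above can be sketched in a few lines. This is a minimal illustration, not the exact formulation used earlier in these notes. It assumes the commonly used form of MCE: the discriminant `g_k` is the log score of hypothesis `k` in an N-best list, the misclassification measure is `d = -g_correct + (1/eta) * log(mean_j exp(eta * g_j))` over the competing hypotheses, and the loss is a sigmoid of `d`. The names `eta`, `gamma`, `theta`, `epsilon` and all function names are hypothetical. For simplicity the gradient is taken with respect to the scores themselves; in a real system the chain rule continues from each score into the HMM parameters (means, variances, weights).

```python
import math


def misclassification_measure(scores, correct, eta=1.0):
    """d = -g_correct + (1/eta) * log(mean over competitors of exp(eta * g_j))."""
    g_correct = scores[correct]
    competitors = [g for i, g in enumerate(scores) if i != correct]
    m = max(competitors)  # shift by the max for numerical stability
    mean_exp = sum(math.exp(eta * (g - m)) for g in competitors) / len(competitors)
    return -g_correct + m + math.log(mean_exp) / eta


def sigmoid_loss(d, gamma=1.0, theta=0.0):
    """Smooth replacement for the zero-one loss: 1 / (1 + exp(-gamma * d + theta))."""
    z = gamma * d - theta
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow when z is very negative
    return e / (1.0 + e)


def loss_gradient(scores, correct, eta=1.0, gamma=1.0, theta=0.0):
    """Chain rule: dloss/dg_k = (dloss/dd) * (dd/dg_k) for every N-best entry."""
    d = misclassification_measure(scores, correct, eta)
    loss = sigmoid_loss(d, gamma, theta)
    dl_dd = gamma * loss * (1.0 - loss)  # derivative of the sigmoid

    competitors = [i for i in range(len(scores)) if i != correct]
    m = max(scores[i] for i in competitors)
    weights = {i: math.exp(eta * (scores[i] - m)) for i in competitors}
    total = sum(weights.values())

    grad = [0.0] * len(scores)
    grad[correct] = -dl_dd  # dd/dg_correct = -1
    for i in competitors:
        grad[i] = dl_dd * weights[i] / total  # dd/dg_i = softmax weight
    return loss, grad


def gpd_step(params, grad, epsilon=0.1):
    """One GPD-style update: move the parameters against the loss gradient."""
    return [p - epsilon * g for p, g in zip(params, grad)]


# Toy N-best list of log scores; entry 0 is the correct word.
scores = [-10.0, -9.0, -12.0]
loss, grad = loss_gradient(scores, correct=0)
updated = gpd_step(scores, grad)
```

In this toy run the gradient is negative for the correct entry and positive for the competitors, so the update raises the correct score and lowers the competing ones, with the closest competitor moved the most. That is the "push away" behaviour described above.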