issues
- which unit do we apply MCE to -- sentences, words or phones
- zero-one loss function is inappropriate at high WER (a sentence with one wrong word counts the same as one with many)
- search space reduction needed -- N-best paradigm
word/string level loss
- define discriminant function for target class (a word in this case)
- define the misclassification measure and loss as defined previously
- use N-best paradigm and accumulate statistics for each class
- MCE pushes the models that make up the misrecognized word away from the models that make up the correct word
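
A minimal sketch of the word-level measure and loss over an N-best list. The earlier definition is not repeated in these notes, so this assumes the common MCE form: d = -g_correct + (1/eta) log[(1/N) sum_j exp(eta g_j)] with a sigmoid loss. The names eta, gamma, theta and the example scores are illustrative assumptions.

```python
import math

def misclassification_measure(g_correct, g_competitors, eta=1.0):
    """d = -g_correct + (1/eta) * log(mean_j exp(eta * g_j)), eta > 0.

    g_correct: discriminant score (e.g. log-likelihood) of the correct word.
    g_competitors: scores of the competing words taken from the N-best list.
    """
    m = max(g_competitors)
    # Subtract the max before exponentiating for numerical stability.
    mean_exp = sum(math.exp(eta * (g - m)) for g in g_competitors) / len(g_competitors)
    return -g_correct + m + math.log(mean_exp) / eta

def sigmoid_loss(d, gamma=1.0, theta=0.0):
    """Smooth replacement for the zero-one loss: 1 / (1 + exp(-gamma*d + theta))."""
    z = gamma * d - theta
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Assumed toy scores: the correct word beats both N-best competitors.
d = misclassification_measure(-10.0, [-12.0, -15.0])
loss = sigmoid_loss(d)
```

A negative d means the correct word scored best, so the loss falls below 0.5; a positive d marks a misrecognition and pushes the loss toward 1.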
computations
- discriminant function related to the probability of the state sequence
- loss related to discriminant function
- update the model parameters using the chain rule and GPD (generalized probabilistic descent)
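
The chain of dependencies above (parameters -> discriminant -> misclassification measure -> loss) can be sketched on a toy model. This is not an HMM: it assumes one scalar, unit-variance Gaussian mean per class with g_j = -0.5 (x - mu_j)^2, and the same assumed measure and sigmoid loss as the common MCE form. The function name gpd_step and the step size eps are hypothetical.

```python
import math

def gpd_step(x, means, correct, eps=0.1, gamma=1.0, eta=1.0):
    """One GPD update of the class means for a single observation x.

    Chain rule: dl/dmu_j = (dl/dd) * (dd/dg_j) * (dg_j/dmu_j).
    """
    # Toy discriminant (assumption): g_j = -0.5 * (x - mu_j)^2
    g = {k: -0.5 * (x - mu) ** 2 for k, mu in means.items()}
    rivals = [k for k in means if k != correct]

    # Misclassification measure over the competing classes.
    m = max(g[k] for k in rivals)
    expw = {k: math.exp(eta * (g[k] - m)) for k in rivals}
    total = sum(expw.values())
    d = -g[correct] + m + math.log(total / len(rivals)) / eta

    # Sigmoid loss and its derivative with respect to d.
    z = gamma * d
    loss = 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))
    dl_dd = gamma * loss * (1.0 - loss)

    new_means = dict(means)
    # Correct class: dd/dg = -1 and dg/dmu = (x - mu), so the mean moves toward x.
    new_means[correct] = means[correct] - eps * dl_dd * (-1.0) * (x - means[correct])
    # Competitors: dd/dg_j is a softmax weight, so the mean moves away from x.
    for k in rivals:
        dd_dg = expw[k] / total
        new_means[k] = means[k] - eps * dl_dd * dd_dg * (x - means[k])
    return new_means, loss

means = {"a": 0.0, "b": 1.0}
new_means, loss = gpd_step(0.4, means, correct="a")
```

In this toy case the mean of the correct class "a" moves toward the observation and the competitor "b" moves away from it, which is the push-apart behaviour described for the word-level loss.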