  • issues
    • which unit do we apply MCE to -- sentences, words or phones?
    • zero-one loss function inappropriate for high WER
    • search space reduction needed -- N-best paradigm
  • word/string level loss
    • define discriminant function for target class (a word in this case)
    • define the misclassification measure and loss as defined previously
    • use N-best paradigm and accumulate statistics for each class
    • MCE will push the models that form part of the misrecognized word away from the models that form part of the correct word
  • computations
    • discriminant function related to the probability of the state sequence
    • loss related to discriminant function
    • update the model parameters using the chain rule and GPD (generalized probabilistic descent)
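
The computations above can be sketched in a few lines of Python. This is only an illustrative sketch, not the formulation from the earlier slides: it assumes the commonly used MCE form, with misclassification measure d = -g_correct + (1/eta) log(mean over competitors of exp(eta * g_j)) and a sigmoid loss with slope gamma. The "models" are hypothetical one-dimensional, unit-variance Gaussian means standing in for HMM parameters, and all names (log_score, gpd_step, the word labels) are made up for the example.

```python
import math

def misclassification_measure(g_correct, g_competitors, eta=1.0):
    """d = -g_correct + (1/eta) * log(mean_j exp(eta * g_j)); d > 0 means an error."""
    m = max(g_competitors)  # factor out the max for numerical stability
    s = sum(math.exp(eta * (g - m)) for g in g_competitors) / len(g_competitors)
    return -g_correct + m + math.log(s) / eta

def sigmoid_loss(d, gamma=1.0):
    """Smooth, differentiable stand-in for the zero-one loss."""
    return 1.0 / (1.0 + math.exp(-gamma * d))

def competitor_weights(g_competitors, eta=1.0):
    """dd/dg_j for each competitor: a softmax over eta * g_j."""
    m = max(g_competitors)
    e = [math.exp(eta * (g - m)) for g in g_competitors]
    z = sum(e)
    return [v / z for v in e]

def log_score(x, mu):
    """Toy discriminant (assumption): Gaussian log-likelihood up to a constant."""
    return -0.5 * (x - mu) ** 2

def gpd_step(x, means, correct, eps=0.1, eta=1.0, gamma=1.0):
    """One GPD update of all class means for a single training token x."""
    rivals = [k for k in means if k != correct]  # plays the role of the N-best list
    g_c = log_score(x, means[correct])
    g_r = [log_score(x, means[k]) for k in rivals]
    d = misclassification_measure(g_c, g_r, eta)
    loss = sigmoid_loss(d, gamma)
    dl_dd = gamma * loss * (1.0 - loss)  # derivative of the sigmoid loss
    new_means = dict(means)
    # Chain rule: dl/dmu = dl/dd * dd/dg * dg/dmu, with dg/dmu = (x - mu).
    # For the correct class dd/dg = -1, so its mean moves toward x.
    new_means[correct] = means[correct] + eps * dl_dd * (x - means[correct])
    # For each competitor dd/dg_j is its softmax weight, so its mean moves away from x.
    for k, w in zip(rivals, competitor_weights(g_r, eta)):
        new_means[k] = means[k] - eps * dl_dd * w * (x - means[k])
    return new_means, loss

# Hypothetical three-word vocabulary; the token x belongs to "yes".
means = {"yes": 0.0, "no": 1.0, "maybe": 2.0}
updated, loss_before = gpd_step(0.8, means, "yes")
_, loss_after = gpd_step(0.8, updated, "yes")
print(loss_before, loss_after)
```

In this toy setting the update pulls the correct word's mean toward the token and pushes the competitors away, with the strongest competitor receiving the largest push. In a real recognizer the same chain rule would be applied through the state-sequence likelihood to the HMM means, variances and mixture weights.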