WHY IS SMOOTHING SO IMPORTANT?
- A key problem in N-gram modeling is the inherent data sparseness.
- For example, in several million words of English text, more than
50% of the trigrams occur only once, and 80% of the trigrams occur
fewer than five times (see also the SWB data).
- Higher-order N-gram models tend to be domain- or application-specific.
Smoothing provides a way of building language models that generalize better.
- If an N-gram is never observed in the training data, can it occur
in the evaluation data set?
- Solution: Smoothing is the process of flattening the probability
distribution implied by a language model so that every reasonable
word sequence receives some probability. This typically involves
broadening the distribution by redistributing probability mass from
high-probability regions to zero-probability regions, as in the
sketch below.
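A minimal sketch in Python of one of the simplest smoothing schemes, add-one
(Laplace) smoothing; the toy corpus, the vocabulary, and the function name
add_one_prob are illustrative assumptions rather than anything specified on
this slide:

from collections import Counter

def add_one_prob(counts: Counter, vocab_size: int, word: str) -> float:
    """Add-one (Laplace) smoothed unigram probability.

    Every word in the vocabulary, seen or unseen in training, receives a
    nonzero probability; mass is taken from frequent words and spread
    over the zero-count ones.
    """
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

# Toy training data; "unseen" never occurs in it.
train = "the cat sat on the mat".split()
vocab = set(train) | {"unseen"}
counts = Counter(train)

for w in ["the", "cat", "unseen"]:
    mle = counts[w] / sum(counts.values())            # zero for "unseen"
    smoothed = add_one_prob(counts, len(vocab), w)    # nonzero for every word
    print(f"{w:>7}: MLE={mle:.3f}  add-one={smoothed:.3f}")

The same idea carries over to bigrams and trigrams: the smoothed estimate
pulls a little mass away from observed N-grams so that N-grams never seen in
training still receive a nonzero probability at evaluation time.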