WHY IS SMOOTHING SO IMPORTANT?
- A key problem in N-gram modeling is the inherent data sparseness.
- For example, in several million words of English text, more than
50% of the trigrams occur only once, and 80% of the trigrams occur
fewer than five times (see also the SWB data).
- Higher-order N-gram models tend to be domain- or application-specific.
Smoothing provides a way of building language models that generalize better.
- If an N-gram is never observed in the training data, can it occur
in the evaluation data set?
- Solution: Smoothing is the process of flattening the probability
distribution implied by a language model so that every reasonable
word sequence receives some probability. This typically involves
broadening the distribution by redistributing probability mass from
high-probability regions to zero-probability regions, as in the
sketch below.
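A minimal sketch in Python of one of the simplest smoothing schemes, add-one
(Laplace) smoothing; the toy corpus, the vocabulary, and the function name
add_one_prob are illustrative assumptions rather than anything specified on
this slide:

from collections import Counter

def add_one_prob(counts: Counter, vocab_size: int, word: str) -> float:
    """Add-one (Laplace) smoothed unigram probability.

    Every word in the vocabulary, seen or unseen in training, receives a
    nonzero probability; mass is taken from frequent words and spread
    over the zero-count ones.
    """
    total = sum(counts.values())
    return (counts[word] + 1) / (total + vocab_size)

# Toy training data; "unseen" never occurs in it.
train = "the cat sat on the mat".split()
vocab = set(train) | {"unseen"}
counts = Counter(train)

for w in ["the", "cat", "unseen"]:
    mle = counts[w] / sum(counts.values())            # zero for "unseen"
    smoothed = add_one_prob(counts, len(vocab), w)    # nonzero for every word
    print(f"{w:>7}: MLE={mle:.3f}  add-one={smoothed:.3f}")

The same idea carries over to bigrams and trigrams: the smoothed estimate
pulls a little mass away from observed N-grams so that N-grams never seen in
training still receive a nonzero probability at evaluation time.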