6.3.3 N-Gram Modeling: The N-Gram Language Model File

Open the TIDigits n-gram language model file and look it over. Since the file is in text format, it can easily be read and hand edited. After the main file header, you'll see a sub-header called "\data\", and underneath it the total number of n-grams defined in the file:
     \data\
     ngram 1=13
     ngram 2=141
     ngram 3=1465
You'll see three totals: the number of unigrams, the number of bigrams, and the number of trigrams. This is a trigram language model file, so why does it also contain bigrams and unigrams? Look at the number of unigrams. This is the total number of words in the vocabulary, including !SENT_DELIM. If every possible trigram were listed, there would be 13*13*13, or 2,197, of them; the file, however, contains only 1,465. When a trigram encountered in an utterance is not included in the trigram language model file, a "back-off" occurs: instead of using a trigram probability, the recognizer uses the probability of a bigram. Similarly, the file contains 141 bigrams rather than the 13*13, or 169, that are possible, so when a bigram encountered in test data is not included in the language model file, a second back-off occurs to the list of unigrams.
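
Reading these totals programmatically is straightforward. The following minimal Python sketch is not part of the ISIP software; it assumes only the header format shown above:

     import re

     def read_ngram_counts(lm_path):
         """Read the n-gram totals from the \\data\\ header of an
         ARPA-style language model file."""
         counts = {}
         with open(lm_path) as f:
             for line in f:
                 line = line.strip()
                 # Header entries look like "ngram 1=13".
                 m = re.match(r"ngram (\d+)=(\d+)", line)
                 if m:
                     counts[int(m.group(1))] = int(m.group(2))
                 elif counts:
                     break  # first non-matching line after the counts ends the header
         return counts

     # For the TIDigits file this returns {1: 13, 2: 141, 3: 1465} --
     # far fewer trigrams than the 13*13*13 = 2,197 combinations possible.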

The rest of the file contains definitions of the unigrams, bigrams, and trigrams.
     \1-grams:
     -0.6322352      !SENT_DELIM
     -99     !SENT_DELIM     -1.953935
     -1.154977       EIGHT   -99
     -1.163263       FIVE    -99
     -1.15191        FOUR    -99
     -1.163263       NINE    -99
     ......
     \2-grams:
     -1.04211        !SENT_DELIM EIGHT       0.5087579
     -1.04211        !SENT_DELIM FIVE        0.5005441
     -1.04211        !SENT_DELIM FOUR        0.5117977
     -1.043239       !SENT_DELIM NINE        0.5016403
     -1.040983       !SENT_DELIM OH  0.5050113
     -1.043239       !SENT_DELIM ONE 0.5002577
     ......
     \3-grams:
     -0.5573978      !SENT_DELIM EIGHT !SENT_DELIM
     -1.256368       !SENT_DELIM EIGHT EIGHT
     -1.151632       !SENT_DELIM EIGHT FIVE
     -1.151632       !SENT_DELIM EIGHT FOUR
     -1.29776        !SENT_DELIM EIGHT NINE
     -1.151632       !SENT_DELIM EIGHT OH
     ......
Notice that each trigram has only one number associated with it, the number to its left, which represents the logarithm of the trigram's probability. The bigrams and unigrams, however, have two numbers assigned to them. The number on the left, as with the trigrams, is the n-gram's log probability. The number on the right is the back-off weight, which is applied when a back-off occurs and a lower-order n-gram must be used.
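
The probabilities are stored as logarithms (base 10 in the standard ARPA-style format), which is why the values are negative; a value of -99 is the conventional stand-in for a log probability of effectively zero. The sketch below is a hypothetical illustration, not the recognizer's actual code, of how a trigram probability can be computed with back-off. It assumes the n-grams have already been loaded into dictionaries keyed by word tuples:

     def log_prob(w1, w2, w3, trigrams, bigrams, unigrams):
         """Log10 probability of w3 following (w1, w2), backing off
         from trigram to bigram to unigram as needed.

         trigrams maps (w1, w2, w3) -> logprob
         bigrams  maps (w1, w2)     -> (logprob, backoff_weight)
         unigrams maps  w           -> (logprob, backoff_weight)
         """
         if (w1, w2, w3) in trigrams:
             return trigrams[(w1, w2, w3)]
         # First back-off: apply the weight stored with the (w1, w2)
         # bigram, then fall back to the (w2, w3) bigram probability.
         backoff = bigrams[(w1, w2)][1] if (w1, w2) in bigrams else 0.0
         if (w2, w3) in bigrams:
             return backoff + bigrams[(w2, w3)][0]
         # Second back-off: apply the weight stored with the w2
         # unigram, then fall back to the unigram probability of w3
         # (assumed to be in the vocabulary).
         backoff += unigrams[w2][1] if w2 in unigrams else 0.0
         return backoff + unigrams[w3][0]

Because the probabilities are stored as logarithms, multiplying by the back-off weights becomes simple addition.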

In the TIDigits examples we've seen thus far, the word sequence probabilities remain virtually constant, since sequences of digits have no real language structure. In a large vocabulary recognition system (LVRS), the language structure becomes much more complex. Within several million words of English text, more than 50% of trigrams occur only once, and 80% of trigrams occur fewer than five times. This sparseness of data causes a problem in n-gram modeling. Hence, smoothing is sometimes necessary to generate broader language models. Smoothing is the process of flattening the probability distribution implied by a language model so that all reasonable word sequences can occur with some probability.
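
To make the flattening idea concrete, here is a sketch of add-k (Laplace) smoothing, one of the simplest schemes. It is shown only as an illustration of smoothing in general, not as the method used to build the TIDigits file, and the count dictionaries are assumed to have been gathered from training text:

     def smoothed_bigram_prob(w1, w2, bigram_counts, unigram_counts,
                              vocab_size, k=1.0):
         """Add-k smoothed estimate of P(w2 | w1).

         Every possible bigram receives a pseudo-count of k, so
         sequences never seen in training keep a small non-zero
         probability instead of being ruled out entirely.
         """
         count_12 = bigram_counts.get((w1, w2), 0)
         count_1 = unigram_counts.get(w1, 0)
         return (count_12 + k) / (count_1 + k * vocab_size)

Larger values of k flatten the distribution more aggressively, shifting more probability mass toward unseen sequences.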
