6.3.2 N-gram Modeling: Training an N-gram Language Model

We use N-gram models trained with the SRI (Stanford Research Institute) toolkit. In this section, you'll learn how to train an N-gram language model. The SRI toolkit must be downloaded and installed in order to run the following examples. The tools can be downloaded from the SRI Website. Follow the installation instructions given here. To train the N-gram language model files, we'll use the tool ngram-count. First, go to the directory:
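Assembled from the parameters described below, the ngram-count invocation takes roughly the following form. This is a sketch, not the exact command from the original page; it assumes ngram-count is on your PATH and that you run it from the directory containing the transcription file.

```shell
# Train a trigram (-order 3) language model from the plain-text
# transcriptions and write it to tidigits_word_ngram.lm.
ngram-count -text tidigits_trans_word_text.text \
            -order 3 \
            -lm tidigits_word_ngram.lm
```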
Running the tool produces diagnostic output such as:

    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: discount coeff 1 is out of range: 1.41549
    warning: discount coeff 6 is out of range: 1.36563
    BOW denominator for context "NINE <s>" is zero; scaling probabilities to sum to 1
    BOW denominator for context "THREE <s>" is zero; scaling probabilities to sum to 1
    BOW denominator for context "SIX <s>" is zero; scaling probabilities to sum to 1

The -text parameter specifies the file containing the plain-text transcriptions. For this example, the transcription file is tidigits_trans_word_text.text. The -order parameter tells the tool what order of N-gram model to create; in this case, we're creating a trigram language model. The final parameter, -lm, specifies the name of the trained language model file to output, which in this case is tidigits_word_ngram.lm.

The N-gram models generated by this tool must be modified to follow the ISIP format. Two modifications need to be made to the original file generated by the SRI toolkit. First, add an Sof header at the beginning of the file. The following example shows the file before and after the addition of the header.
Next, replace the tags "<s>" and "</s>" in the original generated file with !SENT_DELIM. Look at the example below.
Save the modified file as tidigits_word_ngram.sof. Take a look at the N-gram model file before and after its two modifications. The file is now usable in ISIP's production system.
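The two modifications can be scripted. The sketch below uses sed; note that the toy ARPA snippet and the exact text of the Sof header line are assumptions for illustration only, since the original examples are not reproduced here. Substitute your real model file and the header your ISIP version expects.

```shell
# Toy stand-in for the SRILM output; the real file would be the full
# ARPA-format model produced by ngram-count.
cat > tidigits_word_ngram.lm <<'EOF'
\1-grams:
-0.5 <s>
-0.7 ONE </s>
EOF

# 1) Prepend an Sof header (the header text below is a placeholder),
# 2) replace the SRILM sentence-boundary tags with !SENT_DELIM.
{
  echo '@ Sof v1.0 @'
  sed -e 's/<s>/!SENT_DELIM/g' -e 's|</s>|!SENT_DELIM|g' tidigits_word_ngram.lm
} > tidigits_word_ngram.sof
```

After this runs, tidigits_word_ngram.sof begins with the header line and contains no "<s>" or "</s>" tags.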