6.3.2 N-gram Modeling: Training an N-gram Language Model

We use N-gram models trained with the SRI (Stanford Research Institute) toolkit. In this section, you'll learn how to train an N-gram language model. The SRI toolkit must be downloaded and installed in order to run the following examples. The tools can be downloaded from the SRI Website. Follow the installation instructions given here. To train the N-gram language model files, we'll use the tool ngram-count. First, go to the directory:
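Assembled from the parameters described below, the ngram-count invocation takes roughly the following form. This is a sketch, not the exact command from the original page; it assumes ngram-count is on your PATH and that you run it from the directory containing the transcription file.

```shell
# Train a trigram (-order 3) language model from the plain-text
# transcriptions and write it to tidigits_word_ngram.lm.
ngram-count -text tidigits_trans_word_text.text \
            -order 3 \
            -lm tidigits_word_ngram.lm
```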
Running the tool produces diagnostic output such as:

    warning: no singleton counts
    GT discounting disabled
    warning: no singleton counts
    GT discounting disabled
    warning: discount coeff 1 is out of range: 1.41549
    warning: discount coeff 6 is out of range: 1.36563
    BOW denominator for context "NINE <s>" is zero; scaling probabilities to sum to 1
    BOW denominator for context "THREE <s>" is zero; scaling probabilities to sum to 1
    BOW denominator for context "SIX <s>" is zero; scaling probabilities to sum to 1

The -text parameter specifies the file containing the plain-text transcriptions. For this example, the transcription file is tidigits_trans_word_text.text. The -order parameter tells the tool what order of N-gram model to create; in this case, we're creating a trigram language model. The final parameter, -lm, specifies the name of the trained language model file to output, which in this case is tidigits_word_ngram.lm.

The N-gram models generated by this tool must be modified to follow the ISIP format. Two modifications need to be made to the original file generated by the SRI toolkit. First, add an Sof header at the beginning of the file. The following example shows the file before and after the addition of the header.
Next, replace the tags "<s>" and "</s>" in the original generated file with !SENT_DELIM. Look at the example below.
Save the modified file as tidigits_word_ngram.sof. Take a look at the N-gram model file before and after its two modifications. The file is now usable in ISIP's production system.
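The two modifications can be scripted. The sketch below uses sed; note that the toy ARPA snippet and the exact text of the Sof header line are assumptions for illustration only, since the original examples are not reproduced here. Substitute your real model file and the header your ISIP version expects.

```shell
# Toy stand-in for the SRILM output; the real file would be the full
# ARPA-format model produced by ngram-count.
cat > tidigits_word_ngram.lm <<'EOF'
\1-grams:
-0.5 <s>
-0.7 ONE </s>
EOF

# 1) Prepend an Sof header (the header text below is a placeholder),
# 2) replace the SRILM sentence-boundary tags with !SENT_DELIM.
{
  echo '@ Sof v1.0 @'
  sed -e 's/<s>/!SENT_DELIM/g' -e 's|</s>|!SENT_DELIM|g' tidigits_word_ngram.lm
} > tidigits_word_ngram.sof
```

After this runs, tidigits_word_ngram.sof begins with the header line and contains no "<s>" or "</s>" tags.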