HTK Tutorials

TI Digits Short: Dictionary Preparation & Phone Lists

In this section we make sure that the dictionary we will use is in the correct format. We also need to generate a lexicon (the list of words that occur in the data, i.e. a subset of the dictionary) and a pronunciation dictionary (i.e. the words in the lexicon broken down into their constituent phones).

Building the dictionary differs somewhat from the official tutorial in the HTK Book. The official tutorial refers to the "beep" dictionary but does not explain how to prepare it. Here we use the dictionary "sw-ms98-lexicon-original.text" from ISIP's Switchboard task as the source (included in the HTK_tutorialTemplate). We cannot proceed directly with this dictionary, however, because it must first be converted to a more suitable format. A simple Perl script does this.

Procedure

  1. Convert the dictionary to the proper format and store it as sw_dic:

    • Go to the directory isip/exp/htk_tutorial/data_preparation/dictionary
    • From the command line type:
      convert_dic.perl sw-ms98-lexicon-original.text sw_dic

  2. Next, we produce the lexicon, wlist, with a Perl script, prompts2wlist (prompts2wlist_uc is the same program but produces a wlist containing only uppercase words). The words are collected from the transcriptions of the testing and training data. For TI Digits, the testing and training vocabularies are identical (zero, one, two, three, etc.), so either the test transcriptions or the training transcriptions will do. In any other experiment, concatenate the training and testing transcriptions so that wlist captures every word that occurs:

    • From the same directory (i.e. dictionary) type:
      prompts2wlist_uc ../trans/trans_list_test.text wlist

    • Open wlist and add "SENT-START" and "SENT-END" to the bottom of the list. These entries model the beginning and end of each utterance.
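
The word-list step above can be illustrated with a short sketch. It is written in Python rather than Perl and is not the prompts2wlist_uc script itself, whose source is not shown here. It assumes each transcription file holds one utterance per line with whitespace-separated words. The function name build_wlist and its parameters (including skip_first_field, for files whose first column is an utterance label) are hypothetical.

```python
def build_wlist(transcription_paths, uppercase=True, skip_first_field=False):
    """Return the sorted unique words found in the given transcription files,
    followed by the SENT-START and SENT-END markers.

    Assumption: one utterance per line, words separated by whitespace.
    """
    words = set()
    for path in transcription_paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if skip_first_field:
                    # Hypothetical option: drop a leading utterance label
                    tokens = tokens[1:]
                for token in tokens:
                    words.add(token.upper() if uppercase else token)

    # Avoid duplicating the markers if they already appear in the data
    words.discard("SENT-START")
    words.discard("SENT-END")

    # Markers go at the bottom of the list, as in the step above
    return sorted(words) + ["SENT-START", "SENT-END"]


def write_wlist(words, output_path):
    """Write one word per line, the layout assumed here for wlist."""
    with open(output_path, "w", encoding="utf-8") as handle:
        for word in words:
            handle.write(word + "\n")
```

Passing both the training and the testing transcription files in transcription_paths has the same effect as concatenating them first, which is what experiments other than TI Digits would need.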

  3. Next, we reduce the dictionary to only the words used in the data, together with their constituent monophones (dict). From these words we then produce the list of monophones (monophones1) that occur in the data: