TI Digits Short: Dictionary Preparation & Phone Lists
In this section we make sure that the dictionary we will use is in the correct format. We
also need to generate a lexicon (the list of words that occur in the data, i.e. a subset of
the dictionary) and a pronunciation dictionary (i.e. the words in the lexicon broken down
into their constituent phones).
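To make the relationship between the two files concrete, here is a minimal Python sketch. It is not the Perl scripts or the HTK tool used in the procedure, only an illustration of the idea. It assumes that each transcription line is plain whitespace-separated words and that every word has a single pronunciation. The function names and the toy data are hypothetical.

```python
def build_wlist(transcription_lines):
    """Return the sorted, de-duplicated, uppercase words found in the lines."""
    words = set()
    for line in transcription_lines:
        for token in line.split():
            words.add(token.upper())
    return sorted(words)


def restrict_dictionary(full_dict, wlist):
    """Keep only the entries for words in wlist and collect the phones they use.

    full_dict maps each word to a list of phone symbols.
    Returns (dictionary_subset, sorted_phone_list).
    """
    subset = {}
    phones = set()
    for word in wlist:
        if word in full_dict:
            subset[word] = full_dict[word]
            phones.update(full_dict[word])
    return subset, sorted(phones)


# Toy data: hypothetical transcriptions and pronunciations, for illustration only.
transcriptions = ["one two", "two three one"]
toy_dictionary = {
    "ONE": ["w", "ah", "n"],
    "TWO": ["t", "uw"],
    "THREE": ["th", "r", "iy"],
    "FOUR": ["f", "ao", "r"],
}

wlist = build_wlist(transcriptions)
subset, phones = restrict_dictionary(toy_dictionary, wlist)
print(wlist)           # words that occur in the data
print(sorted(subset))  # dictionary entries kept for those words
print(phones)          # phones needed to pronounce them
```

In this toy case "FOUR" is in the full dictionary but never appears in the transcriptions, so it is left out of both the word list and the reduced dictionary.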
Building the dictionary is somewhat different from the official tutorial in the HTK book.
The official tutorial refers to the "beep" dictionary but does not explain how to prepare it.
Here we have used the dictionary "sw-ms98-lexicon-original.text" from ISIP's Switch-Board task
as the source (included in the HTK_tutorialTemplate). We cannot use this dictionary directly,
though, because it must first be converted to a more suitable format. We use a simple Perl
script to do this.
Procedure
- Convert the dictionary to the proper format and store it as sw_dic:
- Go to the directory isip/exp/htk_tutorial/data_preparation/dictionary
- From the command line type: convert_dic.perl sw-ms98-lexicon-original.text sw_dic
- Now we produce the lexicon, wlist, with a Perl script, prompts2wlist (prompts2wlist_uc
is the same program but produces a wlist consisting only of uppercase words). We collect
the words from the transcriptions of the testing and training data. In the case of TI Digits,
the words in testing and training are all the same (zero, one, two, three, etc.), so we can
use either the test transcriptions or the training transcriptions. In any other experiment you
should concatenate the training and testing transcriptions to capture all possible
words in wlist:
- From the same directory (i.e. dictionary) type:
prompts2wlist_uc ../trans/trans_list_test.text wlist
- Open wlist and add "SENT-START" and "SENT-END" to the bottom of the list. We use these
to model the beginnings and ends of each utterance.
- Next, we reduce the dictionary to the words that are used in the data, together with their
constituent monophones (dict). From these words we then produce a list of the monophones
(monophones1) that are found in the data:
- From the same directory (i.e. dictionary) type:
HDMan -m -w wlist -n ../../train/monophones1 -l dlog dict sw_dic names
***You might encounter an error saying that several words in sw_dic are not in order. Simply
delete those words from sw_dic and repeat the above command. Once done, you should have dict in
the current directory and monophones1 in the train directory.***
- From monophones1 we now create monophones0 (the same list of monophones without "sp").
To do this, find monophones1 in the train directory, make a copy of it, and name the copy
monophones0. Then open monophones0 and delete the line with "sp".
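
The last step above can also be scripted instead of edited by hand. The following is a minimal Python sketch, assuming the phone list has one phone per line. The function name and the sample phone list are hypothetical, and the demonstration runs in a temporary directory rather than in the train directory.

```python
import tempfile
from pathlib import Path


def write_monophones0(src, dst):
    """Copy the phone list at src to dst, skipping blank lines and the 'sp' entry."""
    lines = Path(src).read_text().splitlines()
    kept = [line.strip() for line in lines if line.strip() and line.strip() != "sp"]
    Path(dst).write_text("\n".join(kept) + "\n")
    return kept


# Demonstration in a throwaway directory with a hypothetical phone list.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "monophones1"
    dst = Path(tmp) / "monophones0"
    src.write_text("w\nah\nn\nsp\nt\nuw\n")
    monophones0 = write_monophones0(src, dst)
    written = dst.read_text()

print(monophones0)
```

The comparison is against the whole line, so a phone whose name merely contains the letters "sp" would not be removed by mistake.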