
4.2.4 Network Decoding: Recognition Using Word-Internal Context-Dependent Phones

The previous section introduced phone-based speech recognition and described context-independent phones. This section introduces context-dependent phones and explains how they differ from context-independent phone models.

The experiment below decodes a single utterance using context-dependent phones. Go to the directory $ISIP_TUTORIAL/sections/s04/s04_02_p04/.

cd $ISIP_TUTORIAL/sections/s04/s04_02_p04/

and run the following command:

isip_recognize -parameter_file params_decode.sof -list $ISIP_TUTORIA./databases/lists/identifiers_test.sof -verbose ALL

This will produce the following output:

Command: isip_recognize -parameter_file params_decode.sof -list /ftp/pu./projects/speech/software/tutorials/production/
fundamentals/current/example./databases/lists/identifiers_test.sof -verbose ALL
Version: 1.23 (not released) 2003/05/21 23:10:45
  
  loading audio database: $ISIP_TUTORIA./databases/db/tidigits_audio_db_test.sof
  
  *** no symbol graph database file was specified ***
  
  *** no transcription database file was specified ***
  
  loading front-end: $ISIP_TUTORIAL/recipes/frontend.sof
  
  loading language model: $ISIP_TUTORIAL/models/winternal_phone_models/compare/lm_winternal_jsgf_8mix.sof
  
  loading statistical model pool: $ISIP_TUTORIAL/models/winternal_phone_models/compare/smp_winternal_8mix.sof
  
  *** no configuration file was specified ***
  
  opening the output file: $ISIP_TUTORIAL/sections/s04/s04_02_p04/results.out
  
  processing file 1 (ah_111a): $ISIP_TUTORIA./databases/sof_8k/test/ah_111a.sof
    
    hyp:    ONE ONE ONE 
    score:  -9122.6484375   frames: 138
  
  processing file 2 (ah_1a): $ISIP_TUTORIA./databases/sof_8k/test/ah_1a.sof
    
    hyp:    ONE 
    score:  -5187.28173828125   frames: 79

    .....

Notice that the context_mode parameter in the parameter file is set to SYMBOL_INTERNAL, which tells the recognizer that the models in use are word-internal models. The remaining parameters are the same as in the previous recognition experiments.

Unlike a context-independent phone, each context-dependent phone takes its surrounding context into account. In other words, the system models how each phone is affected by its neighboring phones. Consider the phone 'iy'. In a context-dependent recognition system, this phone takes the form <left context - iy - right context>. This form is called a triphone since it has three parts. The phone 'iy' might have any other phone as its left or right context. Consequently, the number of possible phones grows tremendously. In English, there are about 46 context-independent phones. For a context-dependent system, the number of possible triphones becomes 46 x 45 x 45 = 93,150. Fortunately, this is an upper bound. In practice, training a recognition system eliminates most of these triphones.
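The arithmetic behind this upper bound can be checked with a short sketch. It assumes, as the figures above do, that each of the 46 phones can take any of the other 45 phones as its left context and as its right context. The function name is hypothetical and is not part of the ISIP software.

```python
# Upper bound on the triphone inventory, using the tutorial's figures.
NUM_PHONES = 46  # approximate number of context-independent English phones


def triphone_upper_bound(num_phones: int) -> int:
    """Count center phones times the other phones allowed on each side."""
    other_phones = num_phones - 1  # assumption: a context differs from the center
    return num_phones * other_phones * other_phones


upper_bound = triphone_upper_bound(NUM_PHONES)
print(upper_bound)  # 93150
```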

The increased number of phones makes the phone level of the language model more complex. The picture to the right illustrates the language model for a context-dependent recognition system. This language model contains only one word, zero, with two pronunciations. The first and last context-dependent phones consist of only two parts because a word-internal context-dependent system does not cross word boundaries when determining context. This type of phone is called a diphone. In the next section, we will see a type of recognition where the context before and after the word is considered.
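The word-internal expansion described above can be sketched in a few lines. This is only an illustration, not the recognizer's implementation: the function name is hypothetical, the "left-center+right" naming is a common convention assumed here rather than taken from the ISIP model files, and the phone sequence "z iy r ow" is an assumed pronunciation of "zero".

```python
def word_internal_units(phones):
    """Expand one word's phone sequence into word-internal context-dependent units.

    Context never crosses the word boundary, so the first and last phones
    receive only one context (diphones) and interior phones receive two
    (triphones).
    """
    units = []
    for i, center in enumerate(phones):
        unit = center
        if i > 0:  # a left context exists inside the word
            unit = phones[i - 1] + "-" + unit
        if i < len(phones) - 1:  # a right context exists inside the word
            unit = unit + "+" + phones[i + 1]
        units.append(unit)
    return units


# Assumed pronunciation of "zero", for illustration only.
zero_units = word_internal_units(["z", "iy", "r", "ow"])
print(zero_units)  # ['z+iy', 'z-iy+r', 'iy-r+ow', 'r-ow']
```

The first and last units come out as diphones and the two interior units as triphones, matching the structure described for the language model above.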
