Conversational speech exhibits far more variability than words spoken in isolation: background noise, alternative pronunciations, and unseen words are all common. The important point here is that in conversational speech, people generally use pronunciations that deviate from their dictionary citation forms. Most state-of-the-art ASR systems do not model such alternative pronunciations; instead, they make a hard decision by assuming a single pronunciation for each word. In this tutorial, we introduce a new feature of the production system, known as network training, that gives users the ability to train pronunciation models directly. In fact, any level in a user-defined hierarchy of grammars, including language models, can be trained.

The network trainer implemented in the production system extends the concept of Baum-Welch (or Viterbi) training to any level of a user-defined hierarchy of networks, or grammars. Each level in this hierarchy is represented as a finite state machine using the Java Speech Grammar Format (JSGF). Within this hierarchical network framework, each instance in the network is recursively expanded into a sub-network of instances. The system can train this entire network, or selectively train any level in it.
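
To make this recursive expansion concrete, here is a simplified conceptual sketch, not the literal contents of any single grammar file in the production system (the rule names and zero weights are illustrative), showing how a symbol at one level resolves to a rule defined at the level below:
      // word level (lexicon): the word ONE expands into phone symbols
      public <ONE> = /0/ w /0/ ah /0/ n;

      // phone level (acoustic model): the phone w expands into HMM states
      public <w> = /0/ <S1> /0/ <S2> /0/ <S3>;

      // state level: each state rule is a self-looping HMM state
      <S1> = /0/ S1+;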

Now, let us consider a simple example consisting of a three-level system. These three levels correspond to: (1) word-level: the utterance transcription, (2) phone-level: a lexicon that expands the words into phones, and (3) state-level: acoustic models (HMMs) that expand the phone models into state sequences. The utterance in our example contains two words: "ONE TWO." In supervised training, we force the recognizer to align a sequence of acoustic models to this transcription. The transcription is converted to a highly constrained grammar that will form the highest level in the hierarchy of grammars used by the production system; a sketch of such a grammar is shown below.
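
For instance, a minimal sketch of the constrained grammar for this transcription, using the same JSGF conventions as the grammars shown later in this tutorial (the grammar and rule names here are illustrative assumptions, not the exact output of the system), might look like:
      #JSGF V1.0

      // Define the grammar name
      //
      grammar network.grammar.utterance;

      // Define the rules: the words must occur in the transcribed order
      //
      public <utterance> = <ISIP_JSGF_1_0_START> /0/ <ONE> /0/ <TWO> /0/ <ISIP_JSGF_1_0_TERM>;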

Next, consider a single-word lexicon with two pronunciations for the word "ONE":

    ONE w ah n
    ONE hh w ah n
Both these pronunciations will be allowed during network training since we do not know which one actually occurred in the data. Network training will determine the probability that each one generated the data, and update the accumulated counts maintained during the training pass.
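
As a rough sketch in standard Baum-Welch terms (this is not the production system's internal notation): if gamma_i denotes the posterior probability that pronunciation i of "ONE" generated the observed acoustic data in a given utterance, then each training pass accumulates a count for every pronunciation and re-estimates the pronunciation probabilities from those counts:

\[
c_i \leftarrow c_i + \gamma_i ,
\qquad
\hat{P}(\text{pronunciation } i \mid \text{ONE}) = \frac{c_i}{c_1 + c_2} .
\]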

Let's begin by constructing the acoustic models using the JSGF format (state-level). Here is a typical three-state HMM model for the phone hh:
      #JSGF V1.0
    
      // Define the grammar name
      //
      grammar network.grammar.hh;
    
      // Define the rules
      //
      public <hh> = <ISIP_JSGF_1_0_START> /0/ <S1> /0/
      <S2> /0/ <S3> /0/ <ISIP_JSGF_1_0_TERM>;
    
      <S1> = /0/ S1+;
      <S2> = /0/ S2+;
      <S3> = /0/ S3+;
Next, we need to define the lexicon (phone-level). This is done by processing the data above (pronunciations for the word "ONE") through a model creation utility called isip_model_creator. The resulting network is as follows:
    grammar = {
      #JSGF V1.0
    
      // Define the grammar name
      //
      grammar network.grammar.ONE;
      
      // Define the rules
      //
      public <ONE> = <ISIP_JSGF_1_0_START> ( ( /0/ w /0/ ah /0/ n ) | ( /0/ hh /0/ w /0/ ah /0/ n ) ) /0/ <ISIP_JSGF_1_0_TERM>;
    };
The third level is the word level, which essentially defines the language model. This grammar is generated dynamically during training, and allows the user to insert optional silences between words or at the ends of files to account for arbitrary amounts of silence. The network created in this process looks as follows:

[Figure: Example Network, showing the word-level grammar for "ONE TWO" with optional silences]

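In place of the figure, here is a rough sketch of how the word-level grammar from earlier in this tutorial might look once optional silences are added (the silence symbol sil and the use of square brackets for optional items are assumptions for illustration; the actual grammar produced by the production system may differ):
      #JSGF V1.0

      // Define the grammar name
      //
      grammar network.grammar.utterance;

      // Define the rules: optional silences may surround the transcribed words
      //
      public <utterance> = <ISIP_JSGF_1_0_START> [ /0/ sil ] /0/ <ONE>
      [ /0/ sil ] /0/ <TWO> [ /0/ sil ] /0/ <ISIP_JSGF_1_0_TERM>;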
These models are then trained using the main recognition system utility, isip_recognize. This utility includes an HMM trainer that implements both the Baum-Welch algorithm (preferred in this case) and the Viterbi algorithm. For a more extensive tutorial describing how to use the recognizer, see the production system tutorials.