final paper on
Applications of Decision Trees in Phonetic Recognition
submitted to fulfill the requirements for
ECE 8993: Fundamentals of Speech Recognition

submitted to:
Dr. Joseph Picone
Department of Electrical and Computer Engineering
Mississippi State University
413 Simrall, Hardy Rd.
Mississippi State, Mississippi 39762

May 9, 1998

submitted by:
Julie Ngan
MS in Computer Engineering Candidate
Department of Electrical and Computer Engineering
Mississippi State University
429 Simrall, Hardy Rd.
Mississippi State, Mississippi 39762
Tel: 601-325-8335, Fax: 601-325-3149
Email: ngan@isip.msstate.edu

Figure 1. The word chip represented using different contexts: monophones, biphones, and triphones.
Figure 2. A decision tree for ih [1].

ABSTRACT

In general speech, the pronunciation of a word or phone depends heavily on its context. The uneven distribution of phones and contexts causes problems for unseen contexts when the parameters of the acoustic model are estimated. Improving the robustness of a speech recognition system therefore requires the generation of phonetic equivalence classes and the use of parameter sharing or smoothing in the model. In this paper, we first address the problem of context modeling and unseen data. To address it, we describe a decision-tree-based top-down clustering procedure that uses linguistic knowledge together with training data to determine the acoustic similarity of phonetic contexts. An in-depth discussion of the construction of the decision tree and of how it is used in phonetic clustering is also presented.

1. Introduction

The power of a speech recognition system depends largely on the number of different words it can recognize and the complexity of its grammar. However, as the number of permitted variations increases, the perplexity and confusability of the task rise. Maintaining a high level of accuracy requires improvements to both the acoustic models, which provide information about which words best match the speech signal, and the language model, which provides prior knowledge of likely word sequences.

An important part of speech recognition is decoding the features extracted from a speech signal, through these models, into meaningful words. The models against which unknown utterances are compared must be representative of general speech and robust enough to capture natural variation between speakers. In general speech, the pronunciation of a word or phone depends heavily on its context. The uneven distribution of phones and contexts causes problems for unseen contexts when the parameters of the acoustic model are estimated. Further, simple pattern matching techniques are not suited to the recognition of continuous speech because segment boundaries are difficult to detect. Improving the robustness of a speech recognition system therefore requires the generation of phonetic equivalence classes and the use of parameter sharing or smoothing in the model. The use of phonemes in these models is popular since they can easily be adapted to speaker variation.

2. Context dependent phonetic models

One way in which systematic variability can be captured is through the use of context dependent phonetic models. If the positions of the articulators can be inferred from the identity of the phone, then co-articulatory effects can be predicted from the identities of the surrounding phones [1]. Hence, models specific to both a phone and the phonetic context in which it occurs capture these effects and give more accurate performance. These context dependent phonetic models can be labelled using monophone, biphone, and/or triphone contexts. For example, the word chip can be represented using different phonetic contexts as illustrated in Figure 1.
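For concreteness, the fragment below is a minimal sketch of how a phone sequence is expanded into monophone, left-biphone, and triphone labels in the l-p+r style also used in Section 3 (e.g. m-ih+t). The phone symbols and the handling of word boundaries are assumptions of this example rather than details taken from [1].

# Minimal sketch: expanding a phone string into context-dependent labels.
# The l-p+r notation (left context, phone, right context) is assumed;
# word-boundary contexts are simply dropped here for brevity.

def monophones(phones):
    return list(phones)

def biphones(phones):
    # Left-context biphones: l-p (the first phone has no left context).
    return [p if i == 0 else f"{phones[i-1]}-{p}" for i, p in enumerate(phones)]

def triphones(phones):
    labels = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else None
        right = phones[i + 1] if i < len(phones) - 1 else None
        if left and right:
            labels.append(f"{left}-{p}+{right}")
        elif right:
            labels.append(f"{p}+{right}")
        elif left:
            labels.append(f"{left}-{p}")
        else:
            labels.append(p)
    return labels

# The word "chip" as the phone sequence ch ih p:
print(monophones(["ch", "ih", "p"]))  # ['ch', 'ih', 'p']
print(biphones(["ch", "ih", "p"]))    # ['ch', 'ch-ih', 'ih-p']
print(triphones(["ch", "ih", "p"]))   # ['ch+ih', 'ch-ih+p', 'ih-p']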
The use of triphones and higher-order contexts presents problems because the number of possible models grows exponentially with the width of the context; for a set of N phones there are N^3 possible triphones. Moreover, in any realistic training set some triphones will occur very rarely, and some never. This is to be expected since, in general English, most words consist of a mixture of vowels and consonants; seldom do we find an all-vowel or all-consonant triphone sequence. For example, despite the relatively large size of the Wall Street Journal database, only 22,804 of a possible 95,221 triphones appear in the SI284 training data, and only 14,545 appear at least ten times [1].

There are several methods to approach this problem. One method is to model all triphones present in the training set separately, and then to apply smoothing techniques to overcome the problem of sparse training data. One way this can be accomplished [2] is to interpolate between the less and more specific models, with the interpolation weights chosen using deleted interpolation [3]. Another method is to back off: when there is insufficient data to train a given model, a less specific model for which there is enough data is used instead. For example, a biphone model could substitute for a triphone that had only a few examples in the training data; if there were also few occurrences of that biphone, a monophone model could be used instead [1].

In order to increase the robustness of the system, we can also explicitly share models or parts of models between different contexts [4]. This is possible since the acoustic realizations of a phone occurring in different contexts are often very similar. This method also ensures that all the system parameters are well trained while maintaining the models' context dependency. Clustering at the phoneme and state levels defines a reduced set of models to be trained and reduces the number of parameters in the system. Further, the training material can be used more efficiently by training rarely seen states together with more robust ones. Essentially, state tying [5] allows triphones that are acoustically similar, but not necessarily often seen, to be grouped together. The consequence is that more triphones can be modeled, which leads to higher coverage of the test set lexicon [6].

3. Decision trees for clustering

Statistical decision trees have recently emerged as a versatile, data-driven classification tool for complex, non-linearly separable data. They have previously been used for the statistical modeling of natural language [7] and for the generation of spelling-to-sound rules. For the problem of sparse data, the accuracy of a recognition system will be severely compromised unless better estimates can be found for the parameters of the unseen models. The use of back-off models might improve the performance of the system to an extent; however, many of the models may then have been backed off to less specific biphone or monophone models, sacrificing contextual detail.
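The fragment below is a minimal sketch of such count-based back-off: the most specific unit with sufficient training examples is selected. The threshold, counts, and model identifiers are hypothetical illustrations, not values taken from [1] or [2].

# Illustrative back-off: prefer a triphone model, fall back to a biphone,
# then to a monophone, whenever the more specific unit was seen too rarely
# in the training data. Counts, threshold, and model tables are assumed.

MIN_COUNT = 10  # hypothetical minimum number of training examples

def select_model(triphone, biphone, monophone, counts, models):
    """Return the most specific model that is adequately trained."""
    for unit in (triphone, biphone, monophone):
        if counts.get(unit, 0) >= MIN_COUNT and unit in models:
            return models[unit]
    return models[monophone]  # monophone models are assumed to always exist

# Hypothetical usage: m-ih+t was seen only 3 times, so we back off to m-ih.
counts = {"m-ih+t": 3, "m-ih": 57, "ih": 4212}
models = {"m-ih+t": "hmm_0412", "m-ih": "hmm_0087", "ih": "hmm_0012"}
print(select_model("m-ih+t", "m-ih", "ih", counts, models))  # -> hmm_0087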
The use of a top-down clustering procedure based on decision trees avoids the problem of unseen models by using linguistic knowledge together with the training data to decide which contexts (including the unseen ones) are acoustically similar [1]. A decision tree for each phoneme selects which of a set of models is used in each context. A tree state is a particular state of an HMM for a given phone for which different phonetic contexts have been clustered by means of a decision tree [8]. Because of limited data, and a need for parsimony, the contexts are combined into equivalence classes, and a model is made for each class.

Thus, the model is chosen by traversing the tree, starting from the root node and selecting the next node depending upon the answer to a simple question about the current context. For binary decision trees, these questions are normally answered with a simple yes/no concerning membership of particular sets of phones. For example, using the decision tree illustrated in Figure 2, the root question is answered by checking whether the immediately preceding phone (the left context) is a nasal (n, en, m, em, or ng). If the actual context was m-ih+t, the next question to be asked would concern whether the following phone was a liquid (l, el, r, w, y, or hh). Since t is not a member of this set and the answer no leads to a terminal node, the model labelled C would be used in this context [1].

4. Decision tree construction

To construct a binary decision tree, a collection of data consisting of all the samples for a particular phone is used. The decision tree is constructed in a locally optimal fashion, starting from a single root node representing all contexts. As each node is created, an optimal question is chosen to split its data into two subsets, which are in turn split into smaller subsets. The optimal question is based on a goodness-of-split evaluation function, which is derived from a probabilistic measure related to the homogeneity of a set of label strings. Some of the more common splitting criteria are information gain, gain ratio, Bayesian splitting, and maximum entropy. Splitting is applied recursively until the number of samples at a node falls below a threshold, or the goodness of the best split falls below a threshold. The result is a binary tree in which each terminal node represents one equivalence class of contexts [9]. Instead of using stopping criteria to restrict the growth of the decision tree, it is also possible to allow the tree to grow fully; the resulting tree is then pruned back to provide a less specific tree.

During the construction of the decision tree, there are a number of goals [1]:

1. Each leaf must have a minimum number of examples to ensure that the parameters of the final models can be estimated accurately.

2. Only a finite set of questions can be used to divide each node. This constrains the way each node may be divided but allows the incorporation of the expert knowledge needed to predict contextual similarity when little or no data is available to determine which contexts are acoustically similar.

3. Hidden Markov models should be able to accurately capture the variability of the terminal nodes.

The construction of a decision tree can be outlined as follows (a schematic implementation is sketched below):

1. Start with all samples at the root node.

2. While there are untested nodes, do:

2.1 Select some untested node n.

2.2 Evaluate the goodness-of-split function for all possible questions at this node.

2.3 If a stopping criterion is met, declare this node a terminal node; otherwise,

2.4 Associate the question with the highest value of the goodness-of-split function with this node, and make two new successor nodes. All samples that answer the question positively are transferred to the left successor, and all other samples are transferred to the right successor.

So far, only simple questions, each applied to one element of the context at a time, have been assumed. However, it is possible to construct complex questions which deal with several context elements at once [9].
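The following sketch follows the outline above: at each node the best available question is chosen greedily, and splitting stops when a criterion is met. It is schematic only; the goodness-of-split function, the question set, and the numeric thresholds are placeholders rather than the choices made in [1].

# Schematic top-down tree construction following the outline above.
# A question is a predicate over a context sample (e.g. "is the left
# context a nasal?"); goodness() stands in for whatever goodness-of-split
# measure is chosen (information gain, gain ratio, etc.).

MIN_SAMPLES = 50      # assumed stopping thresholds
MIN_GOODNESS = 0.01

class Node:
    def __init__(self, samples):
        self.samples = samples            # context samples reaching this node
        self.question = None              # splitting question, if any
        self.yes = self.no = None         # left (yes) and right (no) successors

def grow_tree(samples, questions, goodness):
    root = Node(samples)
    untested = [root]                                        # step 1
    while untested:                                          # step 2
        node = untested.pop()                                # step 2.1
        best = max(questions,
                   key=lambda q: goodness(node.samples, q))  # step 2.2
        if (len(node.samples) < MIN_SAMPLES or
                goodness(node.samples, best) < MIN_GOODNESS):
            continue                                         # step 2.3: terminal node
        node.question = best                                 # step 2.4
        node.yes = Node([s for s in node.samples if best(s)])
        node.no = Node([s for s in node.samples if not best(s)])
        untested += [node.yes, node.no]
    return root

In practice the goodness measure is typically tied to how well the acoustic data at a node are modeled before and after the split, so that the tree groups contexts that are genuinely similar acoustically rather than merely similar in label.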
5. Using decision trees during recognition

The terminal nodes of a tree for a phone correspond to the different allophones of that phone, and a hidden Markov model is constructed for each. During recognition, a sequence of phones is constructed by concatenating the phonetic base forms of the words. For each phone in this sequence, the appropriate decision tree is used and the path in the tree corresponding to the context provided by the phone sequence is traced. This leads to a terminal node, and the Markov model associated with that node is used. Therefore, by concatenating the Markov models for each phone, a Markov model for the entire word sequence is obtained [9].

6. Conclusions

Acoustic models used in continuous speech recognition systems should account for variations in pronunciation arising from contextual effects. Decision trees have proven to be a robust and effective method for tying states according to contextual information. Further, unlike bottom-up smoothing approaches, decision trees allow generalization to unseen contexts, which provides better coverage when training data are sparse.

References

[1] J. Odell, The Use of Context in Large Vocabulary Speech Recognition, Ph.D. thesis, University of Cambridge, Cambridge, UK, 1995.

[2] K. Lee and H. Hon, "Speaker-Independent Phone Recognition Using Hidden Markov Models," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 11, pp. 1641-1648, 1989.

[3] F. Jelinek and R. Mercer, "Interpolated Estimation of Markov Source Parameters from Sparse Data," Pattern Recognition in Practice, pp. 381-397, North-Holland, 1980.

[4] S. Young, "The General Use of Tying in Phoneme-Based HMM Speech Recognisers," Proceedings of ICASSP '92, pp. 569-572, San Francisco, CA, 1992.

[5] S. Young and P. Woodland, "The Use of State-Tying in Continuous Speech Recognition," Proceedings of EUROSPEECH '93, vol. 3, pp. 2203-2206, Berlin, Germany, September 1993.

[6] C. Dugast, P. Beyerlein, and R. Haeb-Umbach, "Application of Clustering Techniques to Mixture Density Modelling for Continuous-Speech Recognition," Proceedings of ICASSP '95, pp. 524-527, Detroit, MI, May 1995.

[7] L. Bahl, P. Brown, P. de Souza, and R. Mercer, "A Tree-Based Language Model for Natural Language Speech Recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 37, no. 7, pp. 1001-1008, July 1989.

[8] L. Bahl, P. Brown, P. de Souza, R. Mercer, and M. Picheny, "Acoustic Markov Models Used in the Tangora Speech Recognition System," Proceedings of ICASSP '88, pp. 497-500, New York, NY, April 1988.

[9] L. Bahl, P. de Souza, P. Gopalakrishnan, D. Nahamoo, and M. Picheny, "Decision Trees for Phonological Rules in Continuous Speech," IEEE Transactions on Acoustics, Speech and Signal Processing, pp. 185-188, July 1989.