homework solutions for: Homework #7: Phone Number Recognition submitted to: Dr. Joseph Picone ECE 8993 Fundamentals of Speech Recognition June 3, 1998 submitted by: Jonathan Hamaker Institute for Signal and Information Processing Department of Electrical and Computer Engineering Mississippi State University Box 9571, 216 Simrall, Hardy Rd. Mississippi State, Mississippi 39762 Tel: 601-325-8335, Fax: 601-325-3149 Email: hamaker@isip.msstate.edu Figure 1. Data flow for telephone recognition experiments Figure 2. A simple connected digit grammar for telephone number recognition Figure 3. Algorithm for validation of telephone number hypothesis Table 1: Results of test cases for telephone number recognition system. Those marked with a "-" are those which did not produce a valid phone number. Spoken String Speech Rate Hypothesized String 8310 fast 6012 8310 medium - 8310 slow 6001038856 8335 fast - 8335 medium 6533856 8335 slow - 9338310 fast - 9338310 medium 6013833956 9338310 slow - 3258335 fast 6023302 3258335 medium 6033352356 3258335 slow - 2059338310 fast 2013339922 2059338310 medium - 2059338310 slow - 6013258335 fast - 6013258335 medium - 6013258335 slow - Introduction In previous assignments we have tackled topics dealing with front-ends, HMM modeling, and classification. In this assignment we put a good number of those concepts together to experiment with a full speech recognition system (absent the training process). We integrate an audio input system, a feature extraction system, a decoder, and a grammar-specific post-processing system for the recognition of phone numbers. Problem Statement Using the ISIP recognizer [1], build a system that recognizes spoken telephone numbers. The system must accommodate 4, 7, and 10 digit strings. The system must use as many constraints about telephone numbers as possible. For acoustic models, use the ISIP context-dependent phone models currently packaged as part of the ISIP recognizer demo. The system must also have its own language model and an interface to an audio system. Methodology The flowchart for the phone number system demo is shown in Figure 1. We use the DAT machine audio interface (narecord) and a signal detection system for recording the input data. The signal detector is optimized for a certain type of speech and is prone to failure so the user is also limited to 20 seconds for input of the telephone number. This provides more than enough time for the speaker to input a ten digit number. The raw data is converted to MFCC format files using the cparam and cview programs. These MFCC files are created using a 10 msec frame and 25 msec window. In all, there are 12 mel-scaled cepstral coefficients and log energy plus delta features, and delta-delta features for each frame of data. These steps constitute what is commonly referred to as the front-end of a speech recognition system. Next is the actual recognition portion of the process. In this phase, we use the ISIP recognizer [1] to decode the utterance and a post processor to determine if the string of numbers spoken is a valid phone number. For this task, we use a set of crossword triphone models trained on the OGI Alphadigit corpus. We use a simple bigram grammar for the decoding process as shown in Figure 2 as this is what the current version of the decoder is limited to. With a finished version of the decoder we would have used a compiled grammar which had the rules for telephone numbers compiled into it. Since this type of grammar decoding is not available to us at this time, we use a utility to post-process the decoder output to determine if it is a valid phone number. The algorithm for determining a successful phone number is shown in Figure 3. Results The results for this experiment are abysmal. The sentence error rate is 100% and the word error rate is well over 100% using standard NIST scoring. The insertion rate for words is the key issue causing the poor results. For a sentence, "EIGHT THREE ONE ZERO", spoken at a normal pace, the recognizer hypothesized "SIX ZERO OH ONE OH THREE EIGHT EIGHT FIVE SIX". There are a few reasons that could obviously cause this problem. The first is a mismatch in speaking and channel conditions. The models were trained on telephone quality data while the test samples were taken over a high quality audio system. The second, and perhaps most important reason is that this version of the decoder did not allow a word insertion penalty to be applied. A similar type of error was seen when running alphadigit experiments with the same model set. One had to set the word insertion penalty to around -100 to get reasonable results. Otherwise, the recognizer would hypothesize as many as three times as many words as was actually spoken. A last possible source of the errors is a mismatch between the MFCC files used to train the models and the MFCC files generated. We have found that the delta features we generated are suspect. A listing of the results for all test cases are shown in Table 1. References [1] N. Deshmukh, A. Ganapathiraju, J. Hamaker, and J. Picone, "A Public Domain Decoder for Large Vocabulary Conversational Speech Recognition," submitted to the IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, Arizona, USA, May 1999. # check lengths of numbers # if ((length(hypothesis) != 4) && (length(hypothesis) != 7) && (length(hypothesis) != 10)) { output error exit } # loop over phone number possibilities # if (length(hyp) == 4) { if ((hypothesis[0] eq "NINE") || (hypothesis[0] eq "ZERO") || (hypothesis[0] eq "OH")) { output error exit } else { output hypothesis } } elsif (length(hypothesis) == 7) { if ((hypothesis[0] eq "ONE") || (hypothesis[0] eq "ZERO") || (hypothesis[0] eq "OH")) { output error exit } else { output hypothesis } } elsif (length(hypothesis) == 10) { if ((hypothesis[0] eq "ONE") || (hypothesis[0] eq "ZERO") || (hypothesis[0] eq "OH")) { output error exit } else { output hypothesis } }