EE 8993: Speech Recognition
Homework Assignment #7
ISIP Recognizer
July 30, 1998

submitted to:
Dr. Joseph Picone
Department of Electrical and Computer Engineering
413 Simrall, Hardy Rd.
Mississippi State University
Box 9571
MS State, MS 39762

submitted by:
Julie Ngan
Department of Electrical and Computer Engineering
Mississippi State University
Box 9571
Mississippi State, Mississippi 39762
Tel: 601-325-8335
Fax: 601-325-3149
email: ngan@isip.msstate.edu

I. Problem Definition

Using the ISIP recognizer, we are to build a system that recognizes spoken telephone numbers. This system must accommodate 4-, 7-, and 10-digit strings. You must use as many constraints about telephone numbers as you can. For acoustic models, use the ISIP context-dependent phone models currently packaged as part of the demo. You will need to build your own language model, and to find a way to interface audio to the system.

II. Overview

Speech-to-text conversion is an essential part of speech recognition research. A speech-to-text system requires an acoustic model, a linguistic model, and a decoder. The acoustic model is used by the recognizer to convert input speech signals to a sequence of features describing the signal. The acoustic model is then

III. Experiment Description

In this experiment, the ISIP recognizer is used to build a system that recognizes spoken telephone numbers. To begin, three raw audio files are recorded on the DAT machine using the command narecord -s 8000 . The three raw audio files contain spoken telephone numbers of 4, 7, and 10 digits. Once the files are recorded, they are converted to wav format. The number of frames can then be determined by running wc -l on the wav file. The wav file is passed through the feature extraction programs to generate a feature file for the decoder, using the commands cparam -m -w 25 -p 12 -d -g -e -H NIST and cview -h -n 39 > . This generates a 39-dimensional feature vector for each frame of the wav file. The feature file is then parsed into the format required by the decoder to produce the input file.

Before the decoder can be run on the input file, a grammar file and a lexicon file must be generated. The grammar file consists of a list of possible bigrams and their associated probabilities. Since we are processing only telephone numbers, the grammar file consists only of bigrams of each digit followed by a silence and of a silence followed by each digit. The lexicon file lists all the words accepted by the decoder together with their phonetic pronunciations. The decoder can then be executed using the command:

    nice -19 /ftp/pub/resources/courses/ece_8993_speech/homework/1998/utilities/decoder/trace_projection/bin/i386_SunOS_5.6/trace_projector -p data/input_files/params.text -n 5 -c 3 -g 2 -demo

Once the program starts, the number of frames in the file is keyed in at the command line, and the decoded results are written to the output file specified in the parameter file.

IV. Experimental Results

The three audio files we recorded contain the digit strings "7134," "324-7134," and "601-324-7134." Table 1 shows the recorded digits and the output from the recognizer. The recognizer does not recognize the recorded digits accurately: none of the three outputs matches the recorded string, and each contains more digits than were spoken (5, 12, and 11 recognized digits against 4, 7, and 10 recorded). This can be explained by the fact that the ISIP decoder is tailored to recognize telephone conversation data. Audio recorded on a DAT machine has a different signal-to-noise ratio and different features, so it cannot be recognized as if it were telephone data.

Table 1. The results from the ISIP recognizer.

file                      number of frames   recorded digits   recognized digits
four_digit/digit_4.raw    462                7134              52292
seven_digit/digit_7.raw   629                3247134           922225229999
ten_digit/digit_10.raw    718                6013247134        52992292799
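
The mismatch in Table 1 can be quantified with a digit-level edit distance between the recorded and recognized strings. The sketch below was not part of the original experiment. The function names edit_distance and digit_error_rate are illustrative, and defining the error rate as edit distance divided by the number of recorded digits is an assumption, chosen because it is the usual way word error rate is defined.

```python
# Minimal sketch (not part of the original experiment): score the rows of
# Table 1 with a digit-level Levenshtein distance.

def edit_distance(ref: str, hyp: str) -> int:
    """Minimum number of substitutions, deletions, and insertions
    needed to turn the reference string into the hypothesis string."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # hypothesis prefix is empty: delete all i reference digits
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference digit
                curr[j - 1] + 1,         # insert a hypothesis digit
                prev[j - 1] + (r != h),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


def digit_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by the number of recorded digits (assumed metric)."""
    return edit_distance(ref, hyp) / len(ref)


# (file, recorded digits, recognized digits), copied from Table 1.
TABLE_1 = [
    ("four_digit/digit_4.raw", "7134", "52292"),
    ("seven_digit/digit_7.raw", "3247134", "922225229999"),
    ("ten_digit/digit_10.raw", "6013247134", "52992292799"),
]

if __name__ == "__main__":
    for name, ref, hyp in TABLE_1:
        errors = edit_distance(ref, hyp)
        print(f"{name}: {errors} errors, {digit_error_rate(ref, hyp):.0%} digit error rate")
```

For the three rows of Table 1 this gives edit distances of 5, 11, and 10. Each distance is at least as large as the number of recorded digits, so the digit error rate is 100% or higher in every case.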