Homework Solutions for:

Homework #8: Language Modeling

Submitted to:

Dr. Joseph Picone
ECE_8993: Fundamentals of Speech Recognition
September 24, 1998

Submitted by:

Janna Shaffer
Department of Electrical and Computer Engineering
Mississippi State University
Box 9571, 216 Simrall, Hardy Rd.
Mississippi State, Mississippi 39762
Email: janna@cs.msstate.edu

Results

The following model was built with a vocabulary of the 1000 most frequent words, using the entire data set for both training and evaluation.

evallm : perplexity -text ../hamaker/data/all_data.text
Computing perplexity of the language model with respect to the text ../hamaker/data/all_data.text
Perplexity = 69.12, Entropy = 6.11 bits
Computation based on 17455054 words.
Number of 3-grams hit = 17455053  (100.00%)
Number of 2-grams hit = 1  (0.00%)
Number of 1-grams hit = 0  (0.00%)
8256284 OOVs (32.11%) and 0 context cues were removed from the calculation.

Each of the following models was built on a different kept portion (80% of all the data) and then evaluated on the held-out 20% corresponding to its kept portion.

evallm : perplexity -text data/20_percent_00.text
Computing perplexity of the language model with respect to the text data/20_percent_00.text
Perplexity = 94.05, Entropy = 6.56 bits
Computation based on 3480540 words.
Number of 3-grams hit = 3116271  (89.53%)
Number of 2-grams hit = 333716  (9.59%)
Number of 1-grams hit = 30553  (0.88%)
1657462 OOVs (32.26%) and 0 context cues were removed from the calculation.

evallm : perplexity -text data/xaa
Computing perplexity of the language model with respect to the text data/xaa
Perplexity = 94.87, Entropy = 6.57 bits
Computation based on 4363160 words.
Number of 3-grams hit = 3875987  (88.83%)
Number of 2-grams hit = 444714  (10.19%)
Number of 1-grams hit = 42459  (0.97%)
2060282 OOVs (32.07%) and 0 context cues were removed from the calculation.

evallm : perplexity -text data/xac
Computing perplexity of the language model with respect to the text data/xac
Perplexity = 67.56, Entropy = 6.08 bits
Computation based on 4366169 words.
Number of 3-grams hit = 4366168  (100.00%)
Number of 2-grams hit = 0  (0.00%)
Number of 1-grams hit = 1  (0.00%)
2066872 OOVs (32.13%) and 0 context cues were removed from the calculation.

evallm : perplexity -text data/xab
Computing perplexity of the language model with respect to the text data/xab
Perplexity = 68.06, Entropy = 6.09 bits
Computation based on 4374071 words.
Number of 3-grams hit = 4374070  (100.00%)
Number of 2-grams hit = 1  (0.00%)
Number of 1-grams hit = 0  (0.00%)
2060120 OOVs (32.02%) and 0 context cues were removed from the calculation.

APPENDIX

Histograms for unigrams, bigrams, and trigrams for the entire data set.
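As a sanity check on the evallm figures above, note that perplexity and entropy are related by PP = 2^H, and that the reported OOV percentage is taken over all tokens seen, i.e. the words kept in the computation plus the OOVs removed. The following is a minimal Python sketch verifying both relations against the reported numbers; the list name and layout are mine, but every number is copied from the output above.

    import math

    # Reported evallm results: (name, perplexity, entropy in bits,
    # words used in the computation, OOVs removed).
    results = [
        ("all_data.text",      69.12, 6.11, 17455054, 8256284),
        ("20_percent_00.text", 94.05, 6.56,  3480540, 1657462),
        ("xaa",                94.87, 6.57,  4363160, 2060282),
        ("xac",                67.56, 6.08,  4366169, 2066872),
        ("xab",                68.06, 6.09,  4374071, 2060120),
    ]

    for name, pp, h, words, oovs in results:
        # Perplexity is two raised to the per-word entropy: PP = 2^H.
        pp_from_h = 2.0 ** h
        # OOV rate = OOVs removed / (words kept + OOVs removed).
        oov_rate = 100.0 * oovs / (words + oovs)
        print(f"{name}: 2^{h} = {pp_from_h:.2f} (reported {pp}), "
              f"OOV = {oov_rate:.2f}%")

Within the precision of the printed entropies, 2^H reproduces each reported perplexity (e.g. 2^6.11 = 69.07 versus the reported 69.12), and the recomputed OOV rates match the reported percentages (e.g. 8256284 / 25711338 = 32.1%).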