Good-Turing Smoothing of Language Models

Final paper submitted to fulfill the requirements for
ECE 8993: Fundamentals of Speech Recognition

Submitted to:
Dr. Joseph Picone
Department of Electrical and Computer Engineering
413 Simrall, Hardy Rd.
Mississippi State University
Mississippi State, Mississippi 39762

December 5, 1998

Submitted by:
Janna Shaffer
MS in Electrical and Computer Engineering Candidate
Department of Electrical and Computer Engineering
Mississippi State University
429 Simrall, Hardy Rd.
Mississippi State, Mississippi 39762
Email: janna@cs.msstate.edu

ABSTRACT

Language models are used extensively in state-of-the-art speech recognition systems. A language model helps determine the probability of a given word sequence. Determining these probabilities exactly would require knowledge of the entire problem space, which in speech recognition is an unreasonable, if not impossible, task. Inevitably, the speech recognizer will run across a word that it has never seen before. Words that are not in the speech recognizer's vocabulary present a very real problem for constructing language models: if a word is not known to the model, the probability of that word occurring will be zero. Therefore, a method to guard against zero probabilities must be used. Good-Turing smoothing is one such technique. It provides a way to estimate the probability of an unseen event and is especially useful in constructing n-gram language models. This paper focuses on the use of Good-Turing smoothing in language models while providing a general framework from which to build this knowledge.

1. Introduction

In 1941, A. M. Turing, in a discussion with I. J. Good, suggested a formula to estimate the probabilities of species in a mixed population of various species [1][2]. The resulting mathematical formulae proposed by Good [1] laid the foundation for what is now referred to as Good-Turing smoothing. These formulae, along with the motivation behind them, are explained in detail later in this paper.

1.1. The Basic Idea

Using a simple example put forth by Sampson [3], the Good-Turing estimate, and in effect Good-Turing smoothing, can be explained in a rather straightforward manner. Suppose you are sitting in your garden one day and decide to count and log the different species of birds that come into your garden. You have a particularly boring day and stay out in your garden for nearly five hours to record the first thousand birds you see. In total, you see 30 species of birds, including 212 sparrows, 109 robins, and 58 blackbirds. Now you want to know the probability that the next bird seen will be a blackbird. You would say 58/1000, i.e. 0.058, right? Well, this would be wrong. What happens if the next bird into your garden is a nightingale? You did not see a nightingale in your previous five hours in the garden. If the estimate of 0.058 holds for blackbirds, then by the same reasoning, the probability of the next bird being a nightingale would be 0.00, i.e. nonexistent. This is an obvious underestimate for the nightingales, and thus an overestimate for the blackbirds.

1.2. The Problem in Relation to Speech

The bird-in-the-garden example can be mapped back into the speech domain. Many basic ingredients of speech and linguistics research, such as words and word sequences, are essentially infinite in number [4]. In other words, given any amount of speech or text, there are bound to be new words that are not represented in the given sample. Therefore, if the probability of one of these new words is to be estimated, it will have a value of zero. This is exactly the case we are trying to avoid: Good-Turing smoothing does not allow these zero probabilities.
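To make the zero-probability problem concrete, the following short Python sketch computes the relative-frequency (maximum-likelihood) estimates from the garden example. Only the three species named above are entered explicitly; treating the remaining 27 species only through the total of 1000 birds, and the function name itself, are simplifications made here for illustration.

from collections import Counter

# Counts from the garden example: only the three species named in the text
# are listed; the remaining 27 species account for the other 621 birds.
counts = Counter({"sparrow": 212, "robin": 109, "blackbird": 58})
N = 1000  # total number of birds observed

def relative_frequency(species):
    """Maximum-likelihood estimate of the probability that the next bird
    belongs to the given species; an unseen species gets a count of zero."""
    return counts[species] / N

print(relative_frequency("blackbird"))    # 0.058
print(relative_frequency("nightingale"))  # 0.0 -- the unseen species is ruled out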
2. Good-Turing Smoothing

Now that the need for smoothing has been established, we will investigate how Good-Turing smoothing actually works.

2.1. Mathematical Motivation

The following mathematical formulae represent the reasoning that I. J. Good followed in his 1953 paper on population frequencies [1]. This is the mathematical foundation from which Good-Turing smoothing arose. Suppose we have an infinite population of various species and we randomly draw a sample of size N from this population. In this sample, let n_r be the number of distinct species that are each represented exactly r times. Equation (1) relates these counts-of-counts to the size of the random sample:

    N = \sum_{r=1}^{\infty} r \, n_r                                   (1)

From this random sample, we want to estimate the unknown probabilities. We will call p_r the population frequency of a species that is represented r times in the sample, and E(n_r) will denote the expected value of n_r. The expected value of this frequency, also known as the Good-Turing estimate, is shown in equation (2); it is usually expressed through the adjusted count r* of equation (3), which yields the probability estimate of equation (4):

    E(p_r) = \frac{(r+1) \, E(n_{r+1})}{N \, E(n_r)}                   (2)

    r^* = (r+1) \, \frac{S(n_{r+1})}{S(n_r)}                           (3)

    \hat{p}_r = \frac{r^*}{N}                                          (4)

The values S(n_r) and S(n_{r+1}) are smoothed values of the raw counts-of-counts. There are many different methods for smoothing these values, and Good does not put forth a specific method that must be used.

2.2. Specific Algorithm: Simple Good-Turing Smoothing

Simple Good-Turing (SGT) smoothing was proposed by Gale and Sampson [4] as a way to alleviate the often difficult task of smoothing in regions of vastly different accuracies, which the Good-Turing method otherwise requires. It uses a straight-line smooth in conjunction with a rule for switching away from the Turing estimates, which are more accurate at low frequencies. SGT takes advantage of the prior expectations that r* < r and that the ratio r*/r increases toward one as r increases. The first expectation reflects the fact that we are cutting the probability of the observed cases to make room for the probability of the unseen cases. The second reflects the fact that we want to take less and less probability away as r increases, since the higher r is, the better it is measured [4]. These properties will certainly hold for large values of r, but they may not be as accurate for small values; at small r, simply using the Turing estimate is more accurate than a simple smooth.

Gale and Sampson proposed the simplest smooth: a line. They used the downward-sloping log-log line log n_r = a + b log r, which will satisfy the priors on r* as long as its slope b is less than -1. This smooth is called the Linear Good-Turing (LGT) estimate [4] and is described by equation (5):

    r^*_{LGT} = r \left( 1 + \frac{1}{r} \right)^{b+1}                 (5)

The rule for choosing between the Turing estimate and the LGT estimate is as follows: use the Turing estimate as long as it differs from the LGT estimate by more than 1.65 times its standard deviation; once the two estimates no longer differ significantly, switch to the LGT estimate and use it for all larger r. The variance of the Turing estimate is approximated by equation (6):

    Var(r^*_{Turing}) \approx (r+1)^2 \, \frac{n_{r+1}}{n_r^2} \left( 1 + \frac{n_{r+1}}{n_r} \right)    (6)

3. Implementation in Speech

Good-Turing smoothing is used in speech recognition to smooth the language models generated from a particular data set. As discussed earlier, almost no data set, especially in large-vocabulary situations, can represent every word that will be seen by the application. By smoothing the language model with Good-Turing smoothing, there will be no zero probabilities, and thus no loss in word accuracy due to zero probabilities.
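To show how the estimates of Sections 2.1 and 2.2 can be applied to the counts behind a language model, the following Python sketch computes Simple Good-Turing probabilities for a table of word or n-gram counts. It is an illustrative simplification rather than the implementation of [4]: the line of equation (5) is fit directly to the raw counts-of-counts (the full SGT recipe first averages them), the n_1/N estimate of the total unseen probability mass follows Gale and Sampson [4], and all function and variable names are chosen here for illustration.

import math
from collections import Counter

def simple_good_turing(counts):
    """Simplified Simple Good-Turing smoothing of a frequency table.

    counts maps each observed item (a word or an n-gram) to its raw count r.
    Returns (probs, p_unseen): smoothed probabilities for the observed items
    and the total probability mass reserved for unseen items.
    """
    N = sum(counts.values())           # sample size, equation (1)
    n = Counter(counts.values())       # counts-of-counts n_r
    rs = sorted(n)
    if len(rs) < 2:
        raise ValueError("need at least two distinct count values to fit the line")

    # Fit the downward-sloping log-log line log(n_r) = a + b*log(r).
    xs = [math.log(r) for r in rs]
    ys = [math.log(n[r]) for r in rs]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    if b >= -1:
        print("warning: slope b >= -1; the LGT estimate may violate the priors")

    def r_lgt(r):
        # Linear Good-Turing adjusted count, equation (5)
        return r * (1.0 + 1.0 / r) ** (b + 1.0)

    # Adjusted counts r*: Turing estimate at low r, then LGT once the two
    # estimates stop differing by more than 1.65 standard deviations.
    r_star, switched = {}, False
    for r in rs:
        lgt = r_lgt(r)
        if not switched and n[r + 1] > 0:
            turing = (r + 1) * n[r + 1] / n[r]                                   # equation (3)
            var = (r + 1) ** 2 * (n[r + 1] / n[r] ** 2) * (1 + n[r + 1] / n[r])  # equation (6)
            if abs(turing - lgt) > 1.65 * math.sqrt(var):
                r_star[r] = turing
                continue
        switched = True
        r_star[r] = lgt

    # Total probability of unseen items, following Gale and Sampson [4]: n_1 / N.
    p_unseen = n[1] / N

    # Renormalize the seen items so that all probabilities sum to one.
    total = sum(n[r] * r_star[r] for r in rs)
    probs = {w: (1.0 - p_unseen) * r_star[r] / total for w, r in counts.items()}
    return probs, p_unseen

if __name__ == "__main__":
    # Toy unigram counts; in practice these would be n-gram counts from a corpus.
    words = "the cat sat on the mat and the dog sat on the log".split()
    probs, p0 = simple_good_turing(Counter(words))
    print("mass reserved for unseen words:", p0)
    print("smoothed probability of 'the':", probs["the"])

The returned p_unseen value is the probability mass that the smoothed language model can assign to events never seen in the training data, which is exactly what prevents the zero probabilities described above.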
4. Other Smoothing Techniques

Good-Turing smoothing is not the only technique available for smoothing language models. We will briefly discuss two different strategies for language model smoothing; these, however, are not the only alternatives to Good-Turing smoothing.

4.1. Add-One Smoothing

Add-one smoothing is a crude, often unsuccessful method for smoothing language models, but it is worth mentioning because it serves as a good theoretical basis for newer smoothing techniques [7]. The idea behind the add-one technique is to add one to every entry in the sparse matrix of n-gram counts. In doing this, the total count of observed tokens must be increased to account for the added items. The add-one technique is valuable because it introduces a simple way of looking at smoothing: smoothing can be thought of as a type of discounting. In other words, some non-zero probabilities are lowered to make up for the probabilities that would otherwise be zero.

4.2. Witten-Bell Discounting

Witten-Bell discounting is much more successful than add-one smoothing, yet remains fairly simple [8]. It is slightly less complex than Good-Turing smoothing. Witten-Bell discounting views zero probabilities in a different way from either the add-one or Good-Turing techniques: it uses the information gained from n-grams observed only once to estimate the count of n-grams that have never been seen.

5. Summary

Good-Turing smoothing of language models allows an estimate to be made for both seen and unseen objects. One such method presented in this paper is the Linear Good-Turing estimate; the theory and mathematics involved in its derivation were provided. In addition, two other techniques for handling zero probabilities were briefly reviewed: add-one smoothing and Witten-Bell discounting.

References

[1] I. J. Good, "The Population Frequencies of Species and the Estimation of Population Parameters," Biometrika, vol. 40, no. 3-4, pp. 237-264, December 1953.
[2] A. Nadas, "On Turing's Formula for Word Probabilities," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 6, pp. 1414-1416, December 1985.
[3] G. Sampson, "Good-Turing Frequency Estimation," http://www.cogs.susx.ac.uk/users/geoffs/RGoodTur.html.
[4] W. Gale, "Good-Turing Smoothing Without Tears," AT&T Bell Laboratories Technical Report, Murray Hill, NJ, 1994.
[5] R. Rosenfeld, "Statistical Language Modeling and N-grams," January 1997.
[6] T. Kemp and A. Jusek, "Modelling Unknown Words in Spontaneous Speech," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 530-533, 1996.
[7] W. Gale and K. Church, "What is Wrong with Adding One?" in N. Oostdijk and P. de Haan (Eds.), Corpus-Based Research into Language, Rodopi, Amsterdam, 1994.
[8] I. H. Witten and T. C. Bell, "The Zero-Frequency Problem: Estimating the Probabilities of Novel Events in Adaptive Text Compression," IEEE Transactions on Information Theory, vol. 37, no. 4, pp. 1085-1094, 1991.