final paper on

The EM Algorithm and its Applications in Speech Recognition

submitted to fulfill the requirements for

ECE 8993: Fundamentals of Speech Recognition

submitted to:

Dr. Joseph Picone
Department of Electrical and Computer Engineering
413 Simrall, Hardy Rd.
Mississippi State University
Mississippi State, Mississippi 39762

August 17, 1998

submitted by:

Audrey Le
MS in Computer Engineering Candidate
Department of Electrical and Computer Engineering
Mississippi State University
429 Simrall, Hardy Rd.
Mississippi State, Mississippi 39762
Tel: 601-325-8335, Fax: 601-325-3149
Email: le@isip.msstate.edu

ABSTRACT

The purpose of this paper is to provide a simple but complete explanation of the Expectation-Maximization (EM) algorithm, in the hope that beginning students in speech recognition can comprehend a mathematically rigorous concept. We present the EM algorithm and give an overview of its applications in speech recognition. In the first section we describe an estimation problem that arises in the maximum-likelihood (ML) technique and introduce how EM can be used to overcome it. We then present the algorithm and discuss its properties as well as its strengths and weaknesses. We conclude the paper with various applications of the algorithm in speech recognition.

1. INTRODUCTION

Maximum likelihood has become a predominant method of estimation in many areas due to its asymptotically optimal properties. However, for many problems no analytical solution of the likelihood equation can be found, and we must resort to numerical optimization techniques. Many direct numerical optimization techniques are available, such as the Newton-Raphson method or gradient methods. However, using these methods to maximize the likelihood function is expensive in computation and memory. In addition, these methods are difficult to implement and present various problems, such as instability and slow or failed convergence.

In 1977, Dempster et al. introduced an alternative to direct numerical optimization. This alternative is called the Expectation-Maximization algorithm, often known as the EM algorithm. The EM algorithm is an iterative technique for maximum-likelihood estimation in incomplete data problems [1]. Since its introduction, the method has been used successfully in a wide variety of applications. In particular, it has been extremely successful in the speech processing area, with applications such as parameter estimation of Gaussian mixture densities [2] and hidden Markov models [3].

2. THE EM ALGORITHM

2.1. The ML Estimation Problem

Let $L(\theta)$ be the likelihood function of the parameters $\theta$. The maximum-likelihood problem is to find the estimate $\hat{\theta}$ that maximizes $L(\theta)$:

    $\hat{\theta} = \arg\max_{\theta} L(\theta)$    (1)

For many problems it is not possible to find a closed-form expression for $\hat{\theta}$, and we must try an iterative approach.

2.2. The EM Algorithm

One such iterative approach is the EM algorithm. The algorithm attempts to recover the complete data from incomplete data by iteratively estimating the complete-data likelihood function and finding the set of parameters that maximizes it.

Let $\mathcal{X}$ and $\mathcal{Y}$ be two sample spaces related by a many-to-one transformation, and let $T$ be this many-to-one transformation. Let $x$ be the samples in $\mathcal{X}$ and $y$ be the samples in $\mathcal{Y}$, where $x$ is hidden and $y$ is observable. Then $x$ and $y$ are related by the equation below:

    $y = T(x)$    (2)

If $f(x \mid \theta)$ is the distribution of $x$ characterized by $\theta$, then the distribution of $y$, $g(y \mid \theta)$, is also characterized by $\theta$. This is expressed mathematically in the equation below:

    $g(y \mid \theta) = \int_{T(x) = y} f(x \mid \theta)\, dx$    (3)

The log-likelihood of $\theta$ given the observed (incomplete) data is then:

    $L(\theta) = \log g(y \mid \theta)$    (4)

Were the complete data available, estimating the parameters would be easy. Most of the time, however, only partial data are available, and estimation of the parameters becomes complex. In addition, since only incomplete data are available in practice, direct optimization methods cannot be used to compute the complete-data likelihood. A reasonable way around this problem is to estimate the complete-data likelihood from the incomplete data, and then use that estimate to maximize over the parameters [4]. Since estimating the complete data requires $\theta$, and $\theta$ is what we are trying to estimate, it seems the problem cannot be solved. [1] suggests we resort to an iterative method: first, estimate the complete-data likelihood given some initial value of $\theta$; second, maximize this function over $\theta$; last, iterate the previous two steps until we can do no better. The best estimate of the complete-data log-likelihood given the observed data $y$ and a current value of the parameters $\theta^{(p)}$ is the conditional expectation below:

    $Q(\theta, \theta^{(p)}) = E[\log f(x \mid \theta) \mid y, \theta^{(p)}]$    (5)

From the equation above, the expectation step, known as the E-step, and the maximization step, known as the M-step, of the algorithm are given below, respectively.

E-step: compute

    $Q(\theta, \theta^{(p)}) = E[\log f(x \mid \theta) \mid y, \theta^{(p)}]$    (6)

M-step: choose

    $\theta^{(p+1)} = \arg\max_{\theta} Q(\theta, \theta^{(p)})$    (7)

where $p$ indicates the iteration. A derivation of this algorithm can be found in [1].
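To make the E- and M-steps concrete, the following minimal sketch applies them to a toy incomplete-data problem: several sessions of coin tosses in which the identity of the coin used in each session is hidden. The two-coin setup, the data, and all variable names are illustrative assumptions introduced here, not material from the development above.

    import numpy as np

    # Toy incomplete-data problem: each row of `flips` is the number of heads
    # in one session of n tosses; which of two biased coins was used in each
    # session (the complete data) is hidden.
    flips = np.array([5, 9, 8, 4, 7])   # heads observed per session
    n = 10                              # tosses per session

    theta = np.array([0.6, 0.5])        # initial guess for each coin's P(heads)
    for p in range(50):
        # E-step: posterior probability that each session used coin 0 or 1,
        # given the current parameter estimate theta^(p)
        like = np.array([theta[k] ** flips * (1 - theta[k]) ** (n - flips)
                         for k in range(2)])      # shape (2, sessions)
        resp = like / like.sum(axis=0)            # responsibilities
        # M-step: re-estimate each coin's bias from the expected counts
        theta = (resp @ flips) / (resp.sum(axis=1) * n)

    print(theta)   # estimated head probabilities of the two hidden coins

Each pass through the loop performs exactly one E-step (equation 6, here reduced to the sufficient statistics of Q) and one M-step (equation 7), and the observed-data likelihood never decreases from one iteration to the next.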
2.3. Properties of EM

2.3.1. Monotonic Increase of the Likelihood

Each iteration of the EM algorithm is guaranteed not to decrease the incomplete-data likelihood: $L(\theta^{(p+1)}) \geq L(\theta^{(p)})$. This follows from the fact that any choice of $\theta$ that increases $Q(\theta, \theta^{(p)})$ also increases $L(\theta)$ [1], so the sequence of likelihood values is monotonically nondecreasing.

2.3.2. Convergence

Since the likelihood increases monotonically at every iteration and is bounded above for well-posed problems, the sequence of likelihood values converges. However, the algorithm is only guaranteed to converge to a stationary point of the likelihood, typically a local maximum that depends on the initial estimate $\theta^{(0)}$, and the rate of convergence is linear, which can be slow when the proportion of missing information is large.

2.3.3. Computational Efficiency

Each EM iteration is simple to implement and computationally cheap: unlike Newton-Raphson, it requires no gradients or Hessians, and for many models of interest the M-step has a closed-form solution. The price paid for this simplicity is that many iterations may be required, so the overall cost depends on how quickly the algorithm converges for the problem at hand.

3. APPLICATIONS IN ASR

3.1. Gaussian Mixture Densities

3.1.1. Problem Definition

A Gaussian mixture density models the observations as samples drawn from $K$ Gaussian components with weights $w_k$, means $\mu_k$, and covariances $\Sigma_k$. The complete data consist of each observation together with the label of the component that generated it; since only the observations are available and the labels are hidden, ML estimation of the mixture parameters is exactly the incomplete-data problem of Section 2.

3.1.2. Solution

Applying the E- and M-steps to the mixture model yields closed-form re-estimation formulas [2]. In the E-step, the responsibility of component $k$ for observation $x_i$ is

    $\gamma_i(k) = \dfrac{w_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j \, \mathcal{N}(x_i; \mu_j, \Sigma_j)}$

In the M-step, the parameters are re-estimated from these responsibilities:

    $w_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_i(k)$, $\quad \mu_k = \dfrac{\sum_i \gamma_i(k)\, x_i}{\sum_i \gamma_i(k)}$, $\quad \Sigma_k = \dfrac{\sum_i \gamma_i(k)\,(x_i - \mu_k)(x_i - \mu_k)^T}{\sum_i \gamma_i(k)}$

Each pass over the data increases the likelihood of the mixture, and the updates are iterated to convergence. A small code sketch of these updates is given below.
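The following is a minimal one-dimensional sketch of the re-estimation formulas above, using scalar variances for brevity. The function names, the random initialization, and the synthetic data are assumptions made for illustration; this is not code from the paper or from [2].

    import numpy as np

    def gaussian(x, mu, var):
        # Univariate Gaussian density evaluated at every sample in x.
        return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

    def em_gmm(x, K, iters=100):
        # EM re-estimation for a 1-D, K-component Gaussian mixture.
        rng = np.random.default_rng(0)
        w = np.full(K, 1.0 / K)               # mixture weights
        mu = rng.choice(x, K, replace=False)  # means seeded from the data
        var = np.full(K, np.var(x))           # start at the global variance
        for _ in range(iters):
            # E-step: responsibility gamma[k, i] of component k for sample i
            dens = np.array([w[k] * gaussian(x, mu[k], var[k])
                             for k in range(K)])
            gamma = dens / dens.sum(axis=0)
            # M-step: closed-form updates from Section 3.1.2
            Nk = gamma.sum(axis=1)
            w = Nk / len(x)
            mu = (gamma @ x) / Nk
            var = ((x[None, :] - mu[:, None]) ** 2 * gamma).sum(axis=1) / Nk
        return w, mu, var

    # Example: recover two components from synthetic data.
    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 0.5, 200)])
    w, mu, var = em_gmm(x, K=2)

In practice the means are often seeded by a clustering pass rather than random selection, and a small floor is applied to the variances to keep the likelihood bounded; both refinements are omitted here to keep the sketch short.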
3.2. HMMs and the Baum-Welch Algorithm

3.2.1. Problem Definition

A hidden Markov model is characterized by an initial-state distribution $\pi$, a state-transition matrix $A$, and state-conditional observation distributions $B$. Training requires estimating $\theta = (\pi, A, B)$ from observation sequences, but the state sequence that produced each observation sequence is hidden. The complete data are the joint state and observation sequences; only the observations are available, so ML estimation of the HMM parameters is again an incomplete-data problem.

3.2.2. Solution

The Baum-Welch algorithm, the standard training procedure for HMMs, is an instance of the EM algorithm [3, 5]. The E-step is carried out by the forward-backward procedure, which computes the posterior state occupancies $\gamma_t(i)$ and transition posteriors $\xi_t(i,j)$ under the current parameters. The M-step re-estimates $\pi$, $A$, and $B$ from these expected counts in closed form. Each Baum-Welch iteration therefore inherits the EM guarantee that the likelihood of the training observations does not decrease. A sketch of the procedure is given in Appendix A.

4. SUMMARY

We have presented the EM algorithm, an iterative technique for maximum-likelihood estimation from incomplete data. At each iteration, the E-step computes the conditional expectation of the complete-data log-likelihood given the observations and the current parameters, and the M-step maximizes this expectation over the parameters. The algorithm increases the likelihood monotonically and is simple to implement, though it converges only to a local maximum and may converge slowly. In speech recognition, EM underlies two fundamental estimation problems: training Gaussian mixture densities [2] and training hidden Markov models via the Baum-Welch algorithm [3].

5. ACKNOWLEDGEMENTS

References

[1] A. Dempster, N. Laird, and D. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.

[2] R. Redner and H. Walker, "Mixture Densities, Maximum Likelihood and the EM Algorithm," SIAM Review, vol. 26, no. 2, 1984.

[3] L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.

[4] Z. Ghahramani and M. Jordan, "Learning from Incomplete Data," Technical Report, AI Lab Memo No. 1509, CBCL Paper No. 108, MIT AI Lab, Cambridge, MA, USA, August 1995.

[5] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, New Jersey, USA, 1993.

Appendix A
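The sketch below illustrates one possible implementation of the Baum-Welch re-estimation described in Section 3.2, for a discrete-output HMM trained on a single observation sequence with scaled forward-backward recursions. The random initialization, variable names, and single-sequence setup are illustrative assumptions, not a definitive implementation.

    import numpy as np

    def baum_welch(obs, N, M, iters=20):
        # Baum-Welch (EM) for a single discrete observation sequence.
        # obs: integer symbols in [0, M); N hidden states; M output symbols.
        obs = np.asarray(obs)
        T = len(obs)
        rng = np.random.default_rng(0)
        pi = np.full(N, 1.0 / N)           # initial-state probabilities
        A = rng.dirichlet(np.ones(N), N)   # A[i, j] = P(state j | state i)
        B = rng.dirichlet(np.ones(M), N)   # B[i, k] = P(symbol k | state i)
        for _ in range(iters):
            # E-step: scaled forward pass (scale factors c avoid underflow)
            alpha = np.zeros((T, N))
            c = np.zeros(T)
            alpha[0] = pi * B[:, obs[0]]
            c[0] = alpha[0].sum()
            alpha[0] /= c[0]
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
                c[t] = alpha[t].sum()
                alpha[t] /= c[t]
            # E-step: scaled backward pass using the same scale factors
            beta = np.zeros((T, N))
            beta[-1] = 1.0
            for t in range(T - 2, -1, -1):
                beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
            # Posterior state occupancies gamma_t(i) and transition
            # posteriors xi_t(i, j), accumulated over time
            gamma = alpha * beta
            gamma /= gamma.sum(axis=1, keepdims=True)
            xi = np.zeros((N, N))
            for t in range(T - 1):
                x = alpha[t][:, None] * A * \
                    (B[:, obs[t + 1]] * beta[t + 1])[None, :]
                xi += x / x.sum()
            # M-step: closed-form re-estimation of pi, A, and B
            pi = gamma[0]
            A = xi / gamma[:-1].sum(axis=0)[:, None]
            for k in range(M):
                B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
        return pi, A, B

    # Example: fit a 2-state, 3-symbol HMM to a toy sequence.
    pi, A, B = baum_welch([0, 1, 2, 2, 1, 0, 0, 2, 1, 1], N=2, M=3)

A production trainer would accumulate the expected counts over many training sequences before re-estimating, but the E- and M-steps are the same as in this single-sequence sketch.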