final paper on

The EM Algorithm and its Applications in Speech Recognition

submitted to fulfill the requirements for

ECE 8993: Fundamentals of Speech Recognition

submitted to:

Dr. Joseph Picone
Department of Electrical and Computer Engineering
413 Simrall, Hardy Rd.
Mississippi State University
Mississippi State, Mississippi 39762

August 17, 1998

submitted by:

Audrey Le
MS in Computer Engineering Candidate
Department of Electrical and Computer Engineering
Mississippi State University
429 Simrall, Hardy Rd.
Mississippi State, Mississippi 39762
Tel: 601-325-8335, Fax: 601-325-3149
Email: le@isip.msstate.edu

ABSTRACT

The purpose of this paper is to provide a simple but complete explanation of the Expectation-Maximization (EM) algorithm, in the hope that beginning students in speech recognition can comprehend a mathematically rigorous concept. We present the EM algorithm and give an overview of its applications in speech recognition. In the first section we describe an estimation problem that arises in the maximum-likelihood (ML) technique and introduce how EM can be used to overcome it. We then present the algorithm and discuss its properties as well as its strengths and weaknesses. We conclude the paper with various applications of the algorithm in speech recognition.

1. INTRODUCTION

Maximum likelihood has become a predominant method of estimation in many areas due to its asymptotically optimal properties. However, for many problems no analytical solution of the likelihood equation can be found, and we must resort to numerical optimization techniques. Many direct numerical optimization techniques are available, such as the Newton-Raphson method or gradient methods. However, using these methods to maximize the likelihood function is expensive in computation and memory. In addition, these methods are difficult to implement and present various problems, such as instability and slow or failed convergence.

In 1977, Dempster et al. introduced an alternative to direct numerical optimization. This alternative is called the Expectation-Maximization algorithm, often known as the EM algorithm. The EM algorithm is an iterative technique for maximum-likelihood estimation in incomplete data problems [1]. Since its introduction, the method has been used successfully in a wide variety of applications. In particular, it has been extremely successful in the speech processing area, with applications such as parameter estimation of Gaussian mixture densities [2] and hidden Markov models [3].

2. THE EM ALGORITHM

2.1. The ML Estimation Problem

Let $L(\theta)$ be the likelihood function of the parameters $\theta$. The maximum-likelihood problem is to find the estimate $\hat{\theta}$ that maximizes $L(\theta)$:

    $\hat{\theta} = \arg\max_{\theta} L(\theta)$    (1)

For many problems it is not possible to find a closed-form expression for $\hat{\theta}$, and we must try an iterative approach.

2.2. The EM Algorithm

One such iterative approach is the EM algorithm. The algorithm attempts to recover the complete data from incomplete data by iteratively estimating the complete-data likelihood function and finding the set of parameters that maximizes it.

Let $\mathcal{X}$ and $\mathcal{Y}$ be two sample spaces related by a many-to-one transformation, and let $T$ be this many-to-one transformation. Let $x$ be the samples in $\mathcal{X}$ and $y$ be the samples in $\mathcal{Y}$, where $x$ is hidden and $y$ is observable. Then $x$ and $y$ are related by the equation below:

    $y = T(x)$    (2)

If $f(x \mid \theta)$ is the distribution of $x$ characterized by $\theta$, then the distribution of $y$, $g(y \mid \theta)$, is also characterized by $\theta$. This is expressed mathematically in the equation below:

    $g(y \mid \theta) = \int_{T(x) = y} f(x \mid \theta)\, dx$    (3)

The log-likelihood of $\theta$ given the observed (incomplete) data is then:

    $L(\theta) = \log g(y \mid \theta)$    (4)

Were the complete data available, estimating the parameters would be easy. Most of the time, however, only partial data are available, and estimation of the parameters becomes complex. In addition, since only incomplete data are available in practice, direct optimization methods cannot be used to compute the complete-data likelihood. A reasonable way around this problem is to estimate the complete-data likelihood from the incomplete data, and then use that estimate to maximize over the parameters [4]. Since estimating the complete data requires $\theta$, and $\theta$ is what we are trying to estimate, it seems the problem cannot be solved. [1] suggests we resort to an iterative method: first, estimate the complete-data likelihood given some initial value of $\theta$; second, maximize this function over $\theta$; last, iterate the previous two steps until we can do no better. The best estimate of the complete-data log-likelihood given the observed data $y$ and a current value of the parameters $\theta^{(p)}$ is the conditional expectation below:

    $Q(\theta, \theta^{(p)}) = E[\log f(x \mid \theta) \mid y, \theta^{(p)}]$    (5)

From the equation above, the expectation step, known as the E-step, and the maximization step, known as the M-step, of the algorithm are given below, respectively.

E-step: compute

    $Q(\theta, \theta^{(p)}) = E[\log f(x \mid \theta) \mid y, \theta^{(p)}]$    (6)

M-step: choose

    $\theta^{(p+1)} = \arg\max_{\theta} Q(\theta, \theta^{(p)})$    (7)

where $p$ indicates the iteration. A derivation of this algorithm can be found in [1].
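To make the E- and M-steps concrete, the following minimal sketch applies them to a toy incomplete-data problem: several sessions of coin tosses in which the identity of the coin used in each session is hidden. The two-coin setup, the data, and all variable names are illustrative assumptions introduced here, not material from the development above.

    import numpy as np

    # Toy incomplete-data problem: each row of `flips` is the number of heads
    # in one session of n tosses; which of two biased coins was used in each
    # session (the complete data) is hidden.
    flips = np.array([5, 9, 8, 4, 7])   # heads observed per session
    n = 10                              # tosses per session

    theta = np.array([0.6, 0.5])        # initial guess for each coin's P(heads)
    for p in range(50):
        # E-step: posterior probability that each session used coin 0 or 1,
        # given the current parameter estimate theta^(p)
        like = np.array([theta[k] ** flips * (1 - theta[k]) ** (n - flips)
                         for k in range(2)])      # shape (2, sessions)
        resp = like / like.sum(axis=0)            # responsibilities
        # M-step: re-estimate each coin's bias from the expected counts
        theta = (resp @ flips) / (resp.sum(axis=1) * n)

    print(theta)   # estimated head probabilities of the two hidden coins

Each pass through the loop performs exactly one E-step (equation 6, here reduced to the sufficient statistics of Q) and one M-step (equation 7), and the observed-data likelihood never decreases from one iteration to the next.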
2.3. Properties of EM

2.3.1. Monotonic Increase of the Likelihood

Each iteration of the EM algorithm is guaranteed not to decrease the incomplete-data likelihood: $L(\theta^{(p+1)}) \geq L(\theta^{(p)})$. This follows from the fact that any choice of $\theta$ that increases $Q(\theta, \theta^{(p)})$ also increases $L(\theta)$ [1], so the sequence of likelihood values is monotonically nondecreasing.

2.3.2. Convergence

Since the likelihood increases monotonically at every iteration and is bounded above for well-posed problems, the sequence of likelihood values converges. However, the algorithm is only guaranteed to converge to a stationary point of the likelihood, typically a local maximum that depends on the initial estimate $\theta^{(0)}$, and the rate of convergence is linear, which can be slow when the proportion of missing information is large.

2.3.3. Computational Efficiency

Each EM iteration is simple to implement and computationally cheap: unlike Newton-Raphson, it requires no gradients or Hessians, and for many models of interest the M-step has a closed-form solution. The price paid for this simplicity is that many iterations may be required, so the overall cost depends on how quickly the algorithm converges for the problem at hand.

3. APPLICATIONS IN ASR

3.1. Gaussian Mixture Densities

3.1.1. Problem Definition

A Gaussian mixture density models the observations as samples drawn from $K$ Gaussian components with weights $w_k$, means $\mu_k$, and covariances $\Sigma_k$. The complete data consist of each observation together with the label of the component that generated it; since only the observations are available and the labels are hidden, ML estimation of the mixture parameters is exactly the incomplete-data problem of Section 2.

3.1.2. Solution

Applying the E- and M-steps to the mixture model yields closed-form re-estimation formulas [2]. In the E-step, the responsibility of component $k$ for observation $x_i$ is

    $\gamma_i(k) = \dfrac{w_k \, \mathcal{N}(x_i; \mu_k, \Sigma_k)}{\sum_{j=1}^{K} w_j \, \mathcal{N}(x_i; \mu_j, \Sigma_j)}$

In the M-step, the parameters are re-estimated from these responsibilities:

    $w_k = \frac{1}{N} \sum_{i=1}^{N} \gamma_i(k)$, $\quad \mu_k = \dfrac{\sum_i \gamma_i(k)\, x_i}{\sum_i \gamma_i(k)}$, $\quad \Sigma_k = \dfrac{\sum_i \gamma_i(k)\,(x_i - \mu_k)(x_i - \mu_k)^T}{\sum_i \gamma_i(k)}$

Each pass over the data increases the likelihood of the mixture, and the updates are iterated to convergence. A small code sketch of these updates is given below.
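The following is a minimal one-dimensional sketch of the re-estimation formulas above, using scalar variances for brevity. The function names, the random initialization, and the synthetic data are assumptions made for illustration; this is not code from the paper or from [2].

    import numpy as np

    def gaussian(x, mu, var):
        # Univariate Gaussian density evaluated at every sample in x.
        return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

    def em_gmm(x, K, iters=100):
        # EM re-estimation for a 1-D, K-component Gaussian mixture.
        rng = np.random.default_rng(0)
        w = np.full(K, 1.0 / K)               # mixture weights
        mu = rng.choice(x, K, replace=False)  # means seeded from the data
        var = np.full(K, np.var(x))           # start at the global variance
        for _ in range(iters):
            # E-step: responsibility gamma[k, i] of component k for sample i
            dens = np.array([w[k] * gaussian(x, mu[k], var[k])
                             for k in range(K)])
            gamma = dens / dens.sum(axis=0)
            # M-step: closed-form updates from Section 3.1.2
            Nk = gamma.sum(axis=1)
            w = Nk / len(x)
            mu = (gamma @ x) / Nk
            var = ((x[None, :] - mu[:, None]) ** 2 * gamma).sum(axis=1) / Nk
        return w, mu, var

    # Example: recover two components from synthetic data.
    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(4.0, 0.5, 200)])
    w, mu, var = em_gmm(x, K=2)

In practice the means are often seeded by a clustering pass rather than random selection, and a small floor is applied to the variances to keep the likelihood bounded; both refinements are omitted here to keep the sketch short.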
3.2. HMMs and the Baum-Welch Algorithm

3.2.1. Problem Definition

A hidden Markov model is characterized by an initial-state distribution $\pi$, a state-transition matrix $A$, and state-conditional observation distributions $B$. Training requires estimating $\theta = (\pi, A, B)$ from observation sequences, but the state sequence that produced each observation sequence is hidden. The complete data are the joint state and observation sequences; only the observations are available, so ML estimation of the HMM parameters is again an incomplete-data problem.

3.2.2. Solution

The Baum-Welch algorithm, the standard training procedure for HMMs, is an instance of the EM algorithm [3, 5]. The E-step is carried out by the forward-backward procedure, which computes the posterior state occupancies $\gamma_t(i)$ and transition posteriors $\xi_t(i,j)$ under the current parameters. The M-step re-estimates $\pi$, $A$, and $B$ from these expected counts in closed form. Each Baum-Welch iteration therefore inherits the EM guarantee that the likelihood of the training observations does not decrease. A sketch of the procedure is given in Appendix A.

4. SUMMARY

We have presented the EM algorithm, an iterative technique for maximum-likelihood estimation from incomplete data. At each iteration, the E-step computes the conditional expectation of the complete-data log-likelihood given the observations and the current parameters, and the M-step maximizes this expectation over the parameters. The algorithm increases the likelihood monotonically and is simple to implement, though it converges only to a local maximum and may converge slowly. In speech recognition, EM underlies two fundamental estimation problems: training Gaussian mixture densities [2] and training hidden Markov models via the Baum-Welch algorithm [3].

5. ACKNOWLEDGEMENTS

References

[1] A. Dempster, N. Laird, and D. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38, 1977.

[2] R. Redner and H. Walker, "Mixture Densities, Maximum Likelihood and the EM Algorithm," SIAM Review, vol. 26, no. 2, 1984.

[3] L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.

[4] Z. Ghahramani and M. Jordan, "Learning from Incomplete Data," Technical Report, AI Lab Memo No. 1509, CBCL Paper No. 108, MIT AI Lab, Cambridge, MA, USA, August 1995.

[5] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, New Jersey, USA, 1993.

Appendix A
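The sketch below illustrates one possible implementation of the Baum-Welch re-estimation described in Section 3.2, for a discrete-output HMM trained on a single observation sequence with scaled forward-backward recursions. The random initialization, variable names, and single-sequence setup are illustrative assumptions, not a definitive implementation.

    import numpy as np

    def baum_welch(obs, N, M, iters=20):
        # Baum-Welch (EM) for a single discrete observation sequence.
        # obs: integer symbols in [0, M); N hidden states; M output symbols.
        obs = np.asarray(obs)
        T = len(obs)
        rng = np.random.default_rng(0)
        pi = np.full(N, 1.0 / N)           # initial-state probabilities
        A = rng.dirichlet(np.ones(N), N)   # A[i, j] = P(state j | state i)
        B = rng.dirichlet(np.ones(M), N)   # B[i, k] = P(symbol k | state i)
        for _ in range(iters):
            # E-step: scaled forward pass (scale factors c avoid underflow)
            alpha = np.zeros((T, N))
            c = np.zeros(T)
            alpha[0] = pi * B[:, obs[0]]
            c[0] = alpha[0].sum()
            alpha[0] /= c[0]
            for t in range(1, T):
                alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
                c[t] = alpha[t].sum()
                alpha[t] /= c[t]
            # E-step: scaled backward pass using the same scale factors
            beta = np.zeros((T, N))
            beta[-1] = 1.0
            for t in range(T - 2, -1, -1):
                beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
            # Posterior state occupancies gamma_t(i) and transition
            # posteriors xi_t(i, j), accumulated over time
            gamma = alpha * beta
            gamma /= gamma.sum(axis=1, keepdims=True)
            xi = np.zeros((N, N))
            for t in range(T - 1):
                x = alpha[t][:, None] * A * \
                    (B[:, obs[t + 1]] * beta[t + 1])[None, :]
                xi += x / x.sum()
            # M-step: closed-form re-estimation of pi, A, and B
            pi = gamma[0]
            A = xi / gamma[:-1].sum(axis=0)[:, None]
            for k in range(M):
                B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
        return pi, A, B

    # Example: fit a 2-state, 3-symbol HMM to a toy sequence.
    pi, A, B = baum_welch([0, 1, 2, 2, 1, 0, 0, 2, 1, 1], N=2, M=3)

A production trainer would accumulate the expected counts over many training sequences before re-estimating, but the E- and M-steps are the same as in this single-sequence sketch.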