In this talk we will present the maximum entropy framework from the
ground up: we start with the motivation for using this model and
build our way up to applying the framework to solve a real problem
(estimating the probabilities of a bigram language model). The
probability distributions that result from this method are
exponential in form, containing one factor per constraint that we
place on the data. The ease with which new knowledge can be added to
the modeling paradigm is one of the most compelling reasons to use
maximum entropy models. However, maximum entropy comes with its own
set of problems: the iterative procedure used to estimate the
parameters of the model, generalized iterative scaling, is typically
very expensive. These issues will also be discussed.
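As a concrete illustration of the exponential form and of generalized
iterative scaling (GIS), the following is a minimal sketch on toy
data. The events, feature assignments, and counts are all made up for
illustration; they are not from the talk itself.

```python
import math

# Hypothetical "bigram" events with binary features. feats[x] lists the
# active feature indices for event x; counts are made-up observed counts.
feats = {"the cat": [0, 1], "the dog": [0], "a cat": [1], "a dog": []}
counts = {"the cat": 5, "the dog": 3, "a cat": 1, "a dog": 1}
n_feats, N = 2, sum(counts.values())

# GIS requires every event to activate the same total number of
# features, so pad each event with a standard "slack" feature up to C.
C = max(len(f) for f in feats.values())
slack = n_feats
for x in feats:
    feats[x] = feats[x] + [slack] * (C - len(feats[x]))

# Empirical feature expectations: the constraints the model must match.
emp = [0.0] * (n_feats + 1)
for x, c in counts.items():
    for i in feats[x]:
        emp[i] += c / N

lam = [0.0] * (n_feats + 1)  # one weight (exponential factor) per constraint

def model_probs():
    # Exponential form: p(x) is proportional to exp(sum of active weights).
    w = {x: math.exp(sum(lam[i] for i in fs)) for x, fs in feats.items()}
    Z = sum(w.values())
    return {x: wx / Z for x, wx in w.items()}

def expectations(p):
    # Feature expectations under the current model distribution p.
    e = [0.0] * (n_feats + 1)
    for x, px in p.items():
        for i in feats[x]:
            e[i] += px
    return e

for _ in range(200):
    mod = expectations(model_probs())
    # GIS update: lambda_i += (1/C) * log(empirical_i / model_i)
    lam = [l + math.log(e / m) / C for l, e, m in zip(lam, emp, mod)]

p = model_probs()
```

After the loop, the model's feature expectations match the empirical
ones, which is the fixed point GIS seeks; the cost the abstract
alludes to comes from repeating the expectation computation over the
full event space at every iteration.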
Additional items of interest: