The principle of maximum entropy argues that the best stochastic model
for the data is the one that maximizes entropy over the set of
distributions that are consistent with the evidence (the
constraints). This does not mean that we always fall back on the
unconstrained maximum entropy distribution, such as the uniform
distribution in the discrete case or the Gaussian in the continuous
case (when only the mean and variance are fixed); the known
constraints about the data must be factored in. It is this refusal to
assume anything beyond the constraints that differentiates maximum
entropy models from other approaches such as plain maximum
likelihood. Maximum entropy models have been used extensively in a
wide range of areas, including language models and grammatical parsers.
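Stated as an optimization problem (in our notation, not the talk's), for a
discrete distribution p and constraint functions f_i whose empirical
expectations come from the data, the principle reads:

```latex
\max_{p}\; H(p) = -\sum_{x} p(x)\,\log p(x)
\quad\text{subject to}\quad
\sum_{x} p(x)\,f_i(x) = \tilde{E}[f_i]\;\;(i=1,\dots,k),
\qquad \sum_{x} p(x) = 1 .
```

With no constraints beyond normalization, the solution is simply the
uniform distribution, which is why the uniform (and, when only the mean
and variance are fixed, the Gaussian) appears above as the default answer.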
In this talk we will present the maximum entropy framework from the
ground up -- starting with the motivation for using this model and
building our way up to applying the framework to a real problem
(estimating the probabilities of a bigram language model). The
probability distributions that result from this method are exponential
in form, with one factor per constraint that we place on the data. The
ease with which new knowledge can be added to the modeling paradigm is
one of the most compelling reasons to use maximum entropy models.
However, maximum entropy comes with its own bag of problems: the
iterative procedure (generalized iterative scaling) used to estimate
the parameters of the model is typically very expensive. These issues
will also be discussed.
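To make the exponential form and the parameter estimation concrete, here is
a minimal sketch of generalized iterative scaling for a toy bigram-style
model. The data, outcome space, and feature functions are illustrative
assumptions made for this sketch, not material from the talk itself.

```python
# Minimal sketch of generalized iterative scaling (GIS) for a maximum
# entropy model over a small discrete outcome space (toy bigram data).
import math

# Toy bigram data: (previous word, current word) pairs.
data = [("the", "cat"), ("the", "dog"), ("a", "cat"), ("the", "cat")]

# Outcome space: every (prev, curr) pair the model assigns probability to.
outcomes = [(p, c) for p in ("the", "a") for c in ("cat", "dog")]

# Binary feature functions; each one encodes a constraint on the model.
features = [
    lambda x: 1.0 if x[0] == "the" else 0.0,  # previous word is "the"
    lambda x: 1.0 if x[1] == "cat" else 0.0,  # current word is "cat"
]

# GIS requires the feature values to sum to a constant C for every outcome,
# so a slack feature tops each outcome up to C.
C = max(sum(f(x) for f in features) for x in outcomes)
features.append(lambda x: C - sum(f(x) for f in features[:2]))

# Empirical expectation of each feature under the data.
emp = [sum(f(x) for x in data) / len(data) for f in features]

lam = [0.0] * len(features)  # one parameter (one exponential factor) per constraint

def model_probs(lam):
    # Exponential form: p(x) proportional to exp(sum_i lam_i * f_i(x)).
    scores = [math.exp(sum(l * f(x) for l, f in zip(lam, features))) for x in outcomes]
    z = sum(scores)
    return [s / z for s in scores]

for _ in range(100):  # GIS iterations
    probs = model_probs(lam)
    # Expectation of each feature under the current model.
    mod = [sum(p * f(x) for p, x in zip(probs, outcomes)) for f in features]
    # Multiplicative GIS update, written additively in log space.
    lam = [l + math.log(e / m) / C for l, e, m in zip(lam, emp, mod)]

for x, p in zip(outcomes, model_probs(lam)):
    print(x, round(p, 3))
```

The cost flagged above shows up here: every iteration recomputes the model
expectations by summing over the entire outcome space, which for a real
bigram model is the vocabulary squared.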
Additional items of interest: