Maximum Likelihood Classification

Consider the problem of assigning a measurement $x$ to one of two classes, $\omega_1$ or $\omega_2$. What is the best criterion for making a decision? Ideally, we would select the class for which the conditional probability is highest:

choose $\omega_i$ such that $P(\omega_i \mid x)$ is maximum.

However, we can't estimate this probability directly from the training data. Hence, we consider the class-conditional probability $p(x \mid \omega_i)$, which we can estimate. By definition,

$P(\omega_i \mid x)\, p(x) = p(x \mid \omega_i)\, P(\omega_i)$

from which we have

$P(\omega_i \mid x) = \dfrac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)}$

Clearly, the choice of $\omega_i$ that maximizes the right side also maximizes the left side, since $p(x)$ does not depend on the class. Therefore, if the class probabilities $P(\omega_i)$ are equal, we choose the class that maximizes $p(x \mid \omega_i)$.

A quantity related to the probability of an event, used to make a decision about the occurrence of that event, is often called a likelihood measure. A decision rule that maximizes a likelihood is called a maximum likelihood decision.

In a case where the number of outcomes is not finite, we can use an analogous continuous distribution. It is common to assume a multivariate Gaussian distribution:

$p(x \mid \omega_i) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\!\left[-\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right]$

We can elect to maximize the log rather than the likelihood itself (we refer to this as the log likelihood). This gives the decision rule:

choose $\omega_i$ such that $(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \ln|\Sigma_i|$ is minimum.

(Note that the maximization became a minimization.) We can define a distance measure based on this as:

$d_i(x) = (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \ln|\Sigma_i|$

Note that the distance is conditioned on each class mean and covariance. This is why "generic" distance comparisons are a joke. If the covariances are the same across all classes, the $\ln|\Sigma_i|$ term is common to every class and this expression simplifies to:

$d_i(x) = (x - \mu_i)^T \Sigma^{-1} (x - \mu_i)$

This is frequently called the Mahalanobis distance. But this is nothing more than a weighted Euclidean distance. (A short code sketch of this decision rule appears at the end of this section.)

This result has a relatively simple geometric interpretation for the case of a single random variable with classes of equal variances. The decision rule reduces to setting a threshold halfway between the two means: choose $\omega_1$ if $x < \frac{\mu_1 + \mu_2}{2}$, and choose $\omega_2$ otherwise (assuming $\mu_1 < \mu_2$). If the variances are not equal, the threshold shifts towards the distribution with the smaller variance.

What is an example of an application where the classes are not equiprobable?

Probabilistic Distance Measures

How do we compare two probability distributions to measure their overlap? Probabilistic distance measures take the form:

$J = \int_x g\big[\{p(x \mid \omega_i)\}, \{P(\omega_i)\}\big]\, dx$

where
1. $J$ is nonnegative;
2. $J$ attains a maximum when all classes are disjoint;
3. $J = 0$ when all classes have identical distributions (complete overlap).

Two important examples of such measures are:

(1) Bhattacharyya distance:
$B = -\ln \int_x \sqrt{p(x \mid \omega_1)\, p(x \mid \omega_2)}\, dx$

(2) Divergence:
$D = \int_x \big[p(x \mid \omega_1) - p(x \mid \omega_2)\big] \ln \dfrac{p(x \mid \omega_1)}{p(x \mid \omega_2)}\, dx$

Both reduce to a Mahalanobis-like distance for the case of Gaussian vectors with equal class covariances. Such metrics will be important when we attempt to cluster feature vectors and acoustic models.

Probabilistic Dependence Measures

A probabilistic dependence measure indicates how strongly a feature is associated with its class assignment. When features are independent of their class assignment, the class-conditional pdfs are identical to the mixture pdf:

$p(x \mid \omega_i) = p(x) = \sum_{j=1}^{K} P(\omega_j)\, p(x \mid \omega_j)$

When there is a strong dependence, the conditional distribution should be significantly different from the mixture. Such measures take the form of a probabilistic distance between each class-conditional pdf and the mixture pdf, averaged over all classes:

$J = \sum_{i=1}^{K} P(\omega_i) \int_x g\big[p(x \mid \omega_i), p(x)\big]\, dx$

An example of such a measure is the average mutual information:

$I(\Omega; X) = \sum_{i=1}^{K} P(\omega_i) \int_x p(x \mid \omega_i) \log_2 \dfrac{p(x \mid \omega_i)}{p(x)}\, dx$

The discrete version of this is:

$I(\Omega; X) = \sum_{i=1}^{K} \sum_{j} P(\omega_i, x_j) \log_2 \dfrac{P(\omega_i, x_j)}{P(\omega_i)\, P(x_j)}$

Mutual information is closely related to entropy, as we shall see shortly.

Such distance measures can be used to cluster data and generate vector quantization codebooks. A simple and intuitive algorithm is the K-means algorithm:

Initialization: Choose K centroids.
Recursion:
1. Assign each vector to its nearest centroid.
2. Recompute each centroid as the average of all vectors assigned to it.
3. Check the overall distortion. Return to step 1 if the distortion criterion is not met.
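
To make the recursion concrete, here is a minimal sketch of the K-means procedure in Python/NumPy. It assumes Euclidean distance and a squared-error distortion; the function name kmeans, the tolerance tol, and the random selection of training vectors as initial centroids are illustrative choices, not prescribed by the algorithm above.

```python
import numpy as np

def kmeans(vectors, k, max_iter=100, tol=1e-4, seed=0):
    """Cluster the rows of 'vectors' (an N x d array) into k clusters."""
    vectors = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialization: choose K centroids (here, K distinct training vectors).
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    prev_distortion = np.inf
    for _ in range(max_iter):
        # Step 1: assign each vector to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the average of the vectors assigned to it.
        for j in range(k):
            members = vectors[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
        # Step 3: check the overall squared-error distortion; stop when the
        # improvement falls below the tolerance, otherwise return to step 1.
        distortion = np.sum((vectors - centroids[labels]) ** 2)
        if prev_distortion - distortion < tol:
            break
        prev_distortion = distortion
    return centroids, labels
```

Each pass performs the assign/recompute/check cycle listed above; the returned centroids can serve directly as a vector quantization codebook.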
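
Returning to the maximum likelihood decision rule from the first part of this section, the following sketch evaluates the Gaussian log likelihood for each class (equivalently, the negative of the class-conditional distance $d_i(x)$) and picks the best class. It assumes the class means and covariances have already been estimated from training data; the function names gaussian_log_likelihood and classify are hypothetical.

```python
import numpy as np

def gaussian_log_likelihood(x, mean, cov):
    """Log of the multivariate Gaussian density p(x | class), dropping the
    constant -(d/2) ln(2*pi) term that is common to all classes."""
    diff = np.asarray(x, dtype=float) - mean
    inv = np.linalg.inv(cov)
    # Quadratic (Mahalanobis-style) term plus the log-determinant penalty.
    return -0.5 * (diff @ inv @ diff + np.log(np.linalg.det(cov)))

def classify(x, means, covs, log_priors=None):
    """Pick the class that maximizes log p(x | class) (+ log P(class) if priors
    are supplied). With equal priors this is the maximum likelihood decision."""
    scores = [gaussian_log_likelihood(x, m, c) for m, c in zip(means, covs)]
    if log_priors is not None:
        scores = [s + lp for s, lp in zip(scores, log_priors)]
    return int(np.argmax(scores))
```

With equal priors, classify reduces to choosing the class with the smallest value of $(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) + \ln|\Sigma_i|$, matching the minimization derived above; with equal covariances it reduces to a nearest-class comparison under the Mahalanobis distance.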