What Is Information? (When not to bet at a casino...)

Consider two distributions of discrete random variables: x, whose classes are all equally likely, and y, whose class c = 3 is five times more likely than the other classes. Which variable is more unpredictable?

Now consider sampling random numbers from a random number generator whose statistics are not known. The more numbers we draw, the more we discover about the underlying distribution. Assuming the underlying distribution is one of the two above, how much more information do we receive with each new number we draw?

The answer lies in the shape of the distributions. For the random variable x, each class is equally likely, so each new number we draw provides the maximum amount of information: on average it will come from a different class, and we discover a new class with nearly every draw. For y, on the other hand, chances are that c = 3 will occur five times more often than the other classes, so each new sample does not provide as much information.

We can define the information associated with each class, or outcome, as:

$I(x_i) = -\log_2 P(x_i)$

Since $0 \le P(x_i) \le 1$, information is a non-negative quantity. A base-2 logarithm is used so that discrete outcomes can be measured in bits. For the distributions above, every outcome of x carries the same information, while for y the frequent class c = 3 carries less information than the rare classes. Huh??? Does this make sense? (It should: rare outcomes are the surprising, and therefore informative, ones.)

What is Entropy?

Entropy is the expected (average) information across all outcomes:

$H(x) = E[I(x_i)] = -\sum_i P(x_i) \log_2 P(x_i)$

Entropy computed using base-2 logarithms is also measured in bits, since it is an average of information. For example, a uniform distribution over M classes has entropy $H(x) = \log_2 M$ bits.

We can generalize this to a joint outcome of N random vectors from the same distribution, which we refer to as the joint entropy:

$H(\mathbf{x}_1, \ldots, \mathbf{x}_N) = -\sum P(\mathbf{x}_1, \ldots, \mathbf{x}_N) \log_2 P(\mathbf{x}_1, \ldots, \mathbf{x}_N)$

If the random vectors are statistically independent:

$H(\mathbf{x}_1, \ldots, \mathbf{x}_N) = \sum_{i=1}^{N} H(\mathbf{x}_i)$

If the random vectors are independent and identically distributed:

$H(\mathbf{x}_1, \ldots, \mathbf{x}_N) = N \, H(\mathbf{x})$

We can also define conditional entropy as:

$H(y \mid x) = -\sum_i \sum_j P(x_i, y_j) \log_2 P(y_j \mid x_i)$

For continuous distributions, we can define an analogous quantity for entropy:

$H(x) = -\int p(x) \log_2 p(x) \, dx \quad \text{(bits)}$

For a fixed variance $\sigma^2$, a zero-mean Gaussian random variable has maximum entropy ($H(x) = \tfrac{1}{2}\log_2(2\pi e \sigma^2)$). Why?

Mutual Information

On average, the pairing of two random vectors produces less information than the events taken individually. Stated formally:

$H(x, y) \le H(x) + H(y)$

The information shared between a pair of events is called the mutual information, and is defined as:

$I(x_i; y_j) = I(x_i) + I(y_j) - I(x_i, y_j) = \log_2 \dfrac{P(x_i, y_j)}{P(x_i) P(y_j)}$

where $I(x_i, y_j) = -\log_2 P(x_i, y_j)$ is the information of the joint event. From this definition, we note:

$I(x_i; y_j) = I(y_j; x_i)$

This emphasizes the idea that this is information shared between these two random variables.

We can define the average mutual information as the expectation of the mutual information:

$I(x; y) = E[I(x_i; y_j)] = \sum_i \sum_j P(x_i, y_j) \log_2 \dfrac{P(x_i, y_j)}{P(x_i) P(y_j)}$

Note that:

$I(x; y) = H(x) - H(x \mid y) = H(y) - H(y \mid x) \ge 0$

Also note that if x and y are independent, then $P(x_i, y_j) = P(x_i) P(y_j)$ and there is no mutual information between them. Note that to compute the mutual information between two random variables, we need their joint probability density function.

Entropy in Pattern Recognition

Generalized entropy measures are used to assess the effectiveness of a set of features at pattern classification. The conditional entropy of the class $\omega$ given the feature vector $\mathbf{x}$ is one such measure:

$H(\omega \mid \mathbf{x}) = -\sum_i \sum_j P(\omega_i, \mathbf{x}_j) \log_2 P(\omega_i \mid \mathbf{x}_j)$

This is sometimes referred to as the equivocation. We want this measure to be small, meaning that the feature vector greatly reduces the uncertainty about the class identity.

Another way to assess the usefulness of a feature is the average mutual information between the class and the feature:

$I(\omega; \mathbf{x}) = H(\omega) - H(\omega \mid \mathbf{x})$

If this measure is large, the feature contains significant information about the class outcome.
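The following is a minimal numerical sketch of these definitions in Python/NumPy. The four-class probabilities for x and y are hypothetical values consistent with the description above (y's class c = 3 is five times more likely than each of the others), and the joint pmf between a class variable and a discrete feature is made up purely for illustration.

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete pmf; zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical four-class versions of the two example distributions:
# x is uniform; y's class c = 3 is five times more likely than each other class.
p_x = np.array([0.25, 0.25, 0.25, 0.25])
p_y = np.array([0.125, 0.125, 0.625, 0.125])
print(entropy(p_x))   # 2.0 bits: every draw is maximally informative
print(entropy(p_y))   # about 1.55 bits: draws are more predictable, hence less informative

def equivocation(p_wx):
    """Conditional entropy H(w|x) = H(w, x) - H(x): class uncertainty that remains
    after observing the feature (rows of p_wx are classes, columns are feature values)."""
    p_wx = np.asarray(p_wx, dtype=float)
    return entropy(p_wx.ravel()) - entropy(p_wx.sum(axis=0))

def avg_mutual_information(p_wx):
    """Average mutual information I(w; x) = H(w) - H(w|x), in bits."""
    p_wx = np.asarray(p_wx, dtype=float)
    return entropy(p_wx.sum(axis=1)) - equivocation(p_wx)

# Made-up joint pmf P(class, feature) for two classes and a binary feature.
p_wx = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
print(equivocation(p_wx))            # about 0.47 bits of class uncertainty remain (out of H(w) = 1 bit)
print(avg_mutual_information(p_wx))  # about 0.53 of that 1 bit is resolved by the feature
```

Computing these measures from entropies of the joint and marginal pmfs, rather than summing log-ratios term by term, sidesteps division by zero when some joint probabilities are zero.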
How Does Entropy Relate To DSP?

Consider a window of a signal:

- What does the sampled z-transform assume about the signal outside the window?
- What does the DFT assume about the signal outside the window?
- How do these influence the resulting spectrum that is computed?
- What other assumptions could we make about the signal outside the window? How many valid signals are there?
- How about finding the spectrum that corresponds to the signal that matches the measured signal within the window, and has maximum entropy? What does this imply about the signal outside the window?

This is known as the principle of maximum entropy spectral estimation. Later we will see how this relates to minimizing the mean-square error.
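As a forward-looking sketch, the code below leans on the standard result that, among all spectra consistent with the first p + 1 autocorrelation lags measured inside the window, the maximum-entropy spectrum is an all-pole (autoregressive) spectrum; outside the window the signal is effectively extrapolated by a linear predictor rather than assumed zero or periodic. The Levinson-Durbin recursion used here is also the minimum mean-square-error solution for the predictor coefficients, which is the connection hinted at above. The function names, model order, and test signal are illustrative, not part of the original notes.

```python
import numpy as np

def autocorrelation(x, p):
    """Biased autocorrelation estimates r[0..p] from the windowed signal."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    return np.array([np.dot(x[:N - k], x[k:]) / N for k in range(p + 1)])

def levinson_durbin(r):
    """Solve the Yule-Walker equations for the order-p linear predictor.
    Returns a = [1, a1, ..., ap] and the final prediction-error power;
    this is the minimum mean-square-error predictor for these lags."""
    p = len(r) - 1
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def max_entropy_spectrum(x, p, n_freq=512):
    """All-pole spectrum P(w) = err / |A(e^{jw})|^2: the maximum-entropy spectrum
    consistent with the first p+1 autocorrelation lags of the windowed signal."""
    r = autocorrelation(x, p)
    a, err = levinson_durbin(r)
    w = np.linspace(0.0, np.pi, n_freq)
    A = np.exp(-1j * np.outer(w, np.arange(p + 1))) @ a
    return w, err / np.abs(A) ** 2

# Hypothetical test signal: two sinusoids in noise seen through a 64-sample window.
rng = np.random.default_rng(0)
n = np.arange(64)
x = np.cos(0.3 * np.pi * n) + 0.5 * np.cos(0.44 * np.pi * n) + 0.1 * rng.standard_normal(64)
w, S = max_entropy_spectrum(x, p=10)
print(w[np.argmax(S)] / np.pi)   # dominant peak should land near 0.3 (units of pi rad/sample)
```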