Distance Measures

What is the distance between point a and point b?

The N-dimensional real Cartesian space, denoted $\Re^N$, is the collection of all N-dimensional vectors with real elements. A metric, or distance measure, is a real-valued function $d(a,b)$ with three properties. For all $a, b, c \in \Re^N$:
1. $d(a,b) \ge 0$, and $d(a,b) = 0$ if and only if $a = b$
2. $d(a,b) = d(b,a)$
3. $d(a,c) \le d(a,b) + d(b,c)$

The Minkowski metric of order $s$, or the $l_s$ metric, between $a$ and $b$ is:

$d_s(a,b) = \left( \sum_{k=1}^{N} |a_k - b_k|^s \right)^{1/s}$

(the $l_s$ norm of the difference vector). Important cases are:
1. $s = 1$, the $l_1$ or city block metric (sum of absolute values): $d_1(a,b) = \sum_{k=1}^{N} |a_k - b_k|$
2. $s = 2$, the $l_2$ or Euclidean metric (mean-squared error): $d_2(a,b) = \left( \sum_{k=1}^{N} |a_k - b_k|^2 \right)^{1/2}$
3. $s = \infty$, the $l_\infty$ or Chebyshev metric: $d_\infty(a,b) = \max_k |a_k - b_k|$

We can similarly define a weighted Euclidean distance metric:

$d_W(a,b) = \sqrt{x^T W x}$

where $x \equiv a - b$, $x$ is an $N \times 1$ column vector, and $W$ is an $N \times N$ weighting matrix.

Why are Euclidean distances so popular? One reason is efficient computation. Suppose we are given a set of $M$ reference vectors $\{r_m\}$ and a measurement $x$, and we want to find the nearest neighbor:

$NN = \arg\min_m d_2(r_m, x)$

This can be simplified as follows:

$d_2^2(r_m, x) = \sum_{k=1}^{N} (r_{mk} - x_k)^2 = \|r_m\|^2 - 2\, r_m \cdot x + \|x\|^2$

We note the minimum of a square root is the same as the minimum of a square (both are monotonically increasing functions):

$\min_m d_2(r_m, x) \iff \min_m d_2^2(r_m, x)$

Therefore, since $\|x\|^2$ is the same for every reference vector,

$\arg\min_m d_2(r_m, x) = \arg\max_m \left( r_m \cdot x - \tfrac{1}{2}\|r_m\|^2 \right)$

Thus, a Euclidean distance is virtually equivalent to a dot product (which can be computed very quickly on a vector processor). In fact, if all reference vectors have the same magnitude, $\|r_m\|^2$ can be ignored (normalized codebook). A code sketch of this search appears below, after the "Noise-Reduction" section.

Prewhitening of Features

Consider the problem of comparing features of different scales. Suppose we represent the same pair of points in two coordinate systems related by a linear transformation, $y = Ax$, and compute the distance between them in each system: the magnitude of the distance changes. Though the rank ordering of distances under such linear transformations won't change, the cumulative effects of such changes in distances can be damaging in pattern recognition. Why?

We can simplify the distance calculation in the transformed space:

$d_2^2(Aa, Ab) = (Aa - Ab)^T (Aa - Ab) = (a - b)^T A^T A (a - b)$

This is just a weighted Euclidean distance with $W = A^T A$.

Suppose all dimensions of the vector are not equal in importance. For example, suppose one dimension has virtually no variation, while another is very reliable. Suppose two dimensions are statistically correlated. What is a statistically optimal transformation?

Consider a decomposition of the covariance matrix $C$ (which is symmetric):

$C = \Phi \Lambda \Phi^T$

where $\Phi$ denotes a matrix whose columns are the eigenvectors of $C$ and $\Lambda$ denotes a diagonal matrix whose elements are the corresponding eigenvalues of $C$. Consider the transformation:

$y = \Lambda^{-1/2} \Phi^T x$

The covariance of $y$, $C_y$, is easily shown to be an identity matrix (prove this!). We can also show that:

$d_2^2(y_1, y_2) = (x_1 - x_2)^T \Phi \Lambda^{-1} \Phi^T (x_1 - x_2)$

Again, just a weighted Euclidean distance.

· If the covariance matrix of the transformed vector is a diagonal matrix, the transformation is said to be an orthogonal transform.
· If the covariance matrix is an identity matrix, the transform is said to be an orthonormal transform.
· A common approximation to this procedure is to assume the dimensions of $x$ are uncorrelated but of unequal variances, and to approximate $C$ by a diagonal matrix of variances, $\mathrm{diag}(\sigma_1^2, \ldots, \sigma_N^2)$. Why? This is known as variance-weighting.

"Noise-Reduction"

The prewhitening transform, $V = \Lambda^{-1/2} \Phi^T$, is normally created as a matrix in which the eigenvalues are ordered from largest to smallest:

$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_N$

In this case, a new feature vector can be formed by truncating the transformation matrix to the first $L < N$ rows. This is essentially discarding the least important features. A measure of the amount of discriminatory power contained in a feature, or a set of features, can be defined as follows:

$\%\,\mathrm{var}(L) = \frac{\sum_{k=1}^{L} \lambda_k}{\sum_{k=1}^{N} \lambda_k}$

This is the percent of the variance accounted for by the first $L$ features. Similarly, the coefficients of the eigenvectors tell us which dimensions of the input feature vector contribute most heavily to a dimension of the output feature vector. A sketch of this transform, including truncation, appears below.
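As a concrete illustration of the dot-product form of the nearest-neighbor search derived above, here is a minimal NumPy sketch. The function name nearest_neighbor_dot and the random test codebook are illustrative choices, not part of the original notes.

    import numpy as np

    def nearest_neighbor_dot(x, refs):
        """Nearest reference vector to x under the Euclidean metric,
        computed via the dot-product form derived above.
        refs: (M, N) array of reference vectors r_m; x: (N,) measurement."""
        # ||r_m - x||^2 = ||r_m||^2 - 2 r_m.x + ||x||^2; the ||x||^2 term is
        # common to all m, so minimizing the distance is equivalent to
        # maximizing (r_m.x - 0.5*||r_m||^2).
        energies = 0.5 * np.sum(refs ** 2, axis=1)   # 0.5*||r_m||^2, precomputable
        scores = refs @ x - energies                 # one matrix-vector product
        return int(np.argmax(scores))

    # Quick check against a brute-force Euclidean search:
    rng = np.random.default_rng(0)
    refs = rng.normal(size=(16, 4))
    x = rng.normal(size=4)
    assert nearest_neighbor_dot(x, refs) == int(np.argmin(np.linalg.norm(refs - x, axis=1)))

For a normalized codebook, the energies term is constant and can be dropped, leaving a single matrix-vector product per search.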
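The prewhitening transform, the percent-of-variance measure, and the truncation step can likewise be sketched in a few lines of NumPy. This is only a sketch under the assumption that the covariance matrix is estimated from the data itself; whitening_transform and num_keep are illustrative names.

    import numpy as np

    def whitening_transform(C, num_keep=None):
        """Build the prewhitening transform V = Lambda^{-1/2} Phi^T from a
        covariance matrix C, with eigenvalues sorted largest to smallest.
        Optionally keep only the first num_keep rows (noise reduction)."""
        eigvals, eigvecs = np.linalg.eigh(C)            # ascending order for symmetric C
        order = np.argsort(eigvals)[::-1]               # re-sort: largest eigenvalue first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        V = np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
        pct_var = np.cumsum(eigvals) / np.sum(eigvals)  # variance accounted for by first L features
        if num_keep is not None:
            V = V[:num_keep, :]                         # discard the least important features
        return V, pct_var

    # Example: whiten data with correlated, unequal-variance dimensions.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.0, 0.0],
                                               [0.5, 1.0, 0.0],
                                               [0.0, 0.0, 0.1]])
    C = np.cov(X, rowvar=False)
    V, pct_var = whitening_transform(C)
    Y = (X - X.mean(axis=0)) @ V.T
    print(np.round(np.cov(Y, rowvar=False), 2))         # close to the identity matrix
    print(np.round(pct_var, 3))                         # cumulative % variance; first feature dominates here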
The eigenvector coefficients are useful in determining the "meaning" of a particular feature (for example, the first decorrelated feature in a speech recognition system is often correlated with the overall spectral slope; this is sometimes an indication of the type of microphone).

Computational Procedures

Computing a "noise-free" covariance matrix is often difficult. One might attempt to do something simple, such as:

$\mu = \frac{1}{M} \sum_{m=1}^{M} x_m$ and $C = \frac{1}{M} \sum_{m=1}^{M} (x_m - \mu)(x_m - \mu)^T$

On paper, this appears reasonable. However, the complete set of feature vectors often contains both valid data (speech signals) and noise (nonspeech signals). Hence, we will often compute the covariance matrix across a subset of the data, such as the particular acoustic event (a phoneme or word) we are interested in. Second, the covariance matrix is often ill-conditioned. Stabilization procedures are used in which the elements of the covariance matrix are limited by some minimum value (a noise floor or minimum SNR) so that the covariance matrix is better conditioned; a sketch of such a floor is given at the end of this section.

But how do we compute eigenvalues and eigenvectors on a computer? This is one of the hardest things to do numerically. Why? Suggestion: use a canned routine (see Numerical Recipes in C). The definitive source is EISPACK (originally implemented in Fortran, now available in C).

A simple method for symmetric matrices is known as the Jacobi transformation. In this method, a sequence of transformations is applied, each setting one off-diagonal element to zero; the product of the successive transformations is the eigenvector matrix. A sketch of this method is also given at the end of this section.

Another method, known as the QR decomposition, factors the covariance matrix into a series of transformations:

$C = QR$

where $Q$ is orthogonal and $R$ is upper triangular. This is based on a transformation known as the Householder transform, which reduces the columns of a matrix below the diagonal to zero.
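To make the covariance computation and the stabilization step concrete, here is a minimal NumPy sketch. The name stabilized_covariance and the floor_ratio parameter are hypothetical, and flooring only the diagonal elements is just one simple form of stabilization.

    import numpy as np

    def stabilized_covariance(X, floor_ratio=1e-3):
        """Sample mean and covariance of the feature vectors in X (one row
        per vector), with a simple variance floor so that the covariance
        matrix is better conditioned."""
        mu = X.mean(axis=0)
        D = X - mu
        C = (D.T @ D) / X.shape[0]                    # C = (1/M) sum (x_m - mu)(x_m - mu)^T
        floor = floor_ratio * np.max(np.diag(C))      # noise floor tied to the largest variance
        np.fill_diagonal(C, np.maximum(np.diag(C), floor))
        return mu, C

    # Typically X would hold only the frames of the acoustic event of interest
    # (a phoneme or word), not the entire recording, which also contains
    # nonspeech frames.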
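For completeness, a bare-bones cyclic Jacobi eigensolver for symmetric matrices is sketched below. It is only meant to show how a sequence of plane rotations, each zeroing one off-diagonal element, accumulates into the eigenvector matrix; in practice a canned routine such as numpy.linalg.eigh (or EISPACK/LAPACK) should be used. Names and tolerances are illustrative.

    import numpy as np

    def jacobi_eig(C, tol=1e-12, max_sweeps=50):
        """Eigenvalues and eigenvectors of a symmetric matrix via Jacobi
        rotations. Each rotation zeroes one off-diagonal element; the
        product of the rotations is the eigenvector matrix."""
        A = np.array(C, dtype=float)
        n = A.shape[0]
        V = np.eye(n)                                 # accumulates the rotations
        for _ in range(max_sweeps):
            off = np.max(np.abs(A - np.diag(np.diag(A))))
            if off < tol:                             # off-diagonal part is numerically zero
                break
            for p in range(n - 1):
                for q in range(p + 1, n):
                    if abs(A[p, q]) < tol:
                        continue
                    # Rotation angle chosen so that the (p, q) element of
                    # J^T A J becomes zero.
                    theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
                    c, s = np.cos(theta), np.sin(theta)
                    J = np.eye(n)
                    J[p, p] = c; J[q, q] = c
                    J[p, q] = s; J[q, p] = -s
                    A = J.T @ A @ J                   # similarity transform
                    V = V @ J
        return np.diag(A), V                          # eigenvalues, eigenvectors (columns)

    # Sanity check against a canned routine:
    C = np.array([[4.25, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 0.01]])
    w, V = jacobi_eig(C)
    print(np.allclose(np.sort(w), np.linalg.eigvalsh(C)))   # True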