Homework for Principal Components Analysis
EE 8993: Fundamentals of Speech Recognition
March 11, 1999

submitted to:
Dr. Joseph Picone
Department of Electrical and Computer Engineering
413 Simrall, Hardy Rd.
Mississippi State University
Box 9571
MS State, MS 39762

submitted by:
Janna Shaffer

I. Original Data Set Calculations

This homework assignment applies principal component analysis (PCA) to a classification problem. The data sets in Figure 1 were hand drawn using a MATLAB graphical user interface. The sample means of the two sets, mean 1 and mean 2, were computed, and the test set consists of the four points x1, x2, x3, and x4.

Figure 1. Data sets and test set used for PCA

After the data and test sets were defined, the Euclidean distance between each test point and each data set mean was calculated, and set membership was assigned to the nearer mean. Three of the four test points are equidistant from the two means, so their membership is ambiguous; the fourth test point clearly belongs to data set 2. These distances are shown in Table 1.

          Distance from mean 1    Distance from mean 2
    x1    3.1623                  3.1623
    x2    2.8284                  2.8284
    x3    2.9155                  2.9155
    x4    3.5355                  2.1213

Table 1: Euclidean Distances Between the Test Points and the Data Set Means

II. Whitening Transformation

To find a better classification of the test data, a linear transformation, the whitening transform, was applied to the data sets and the test set. It shifts the data into another space in the hope of making the variance structure of the data more useful for classification. The transform is shown in Equation (1):

    y = Λ^(-1/2) Φ^T x                                            (1)

where Λ is the diagonal matrix of eigenvalues of the data set's covariance matrix and Φ is the matrix whose columns are the corresponding eigenvectors. The characteristics of data sets 1 and 2 and their transformations are shown in Figures 2 and 3, respectively.

Figure 2. Characteristics of data set 1
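The nearest-mean classification of Section I can be sketched as follows. This is a minimal illustration with made-up 2-D points, since the original hand-drawn data are not available; the helper names are my own.

```python
import math

def euclidean(p, q):
    # Euclidean distance between two 2-D points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    # Componentwise sample mean of a list of points.
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

# Hypothetical stand-in data (the original sets were hand drawn in MATLAB).
set1 = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
set2 = [(4.0, 4.0), (5.0, 5.0), (4.0, 5.0), (5.0, 4.0)]
mu1, mu2 = mean(set1), mean(set2)

def classify(x):
    # Assign x to the set with the nearer mean; 0 marks the
    # equidistant (ambiguous-membership) case seen in Table 1.
    d1, d2 = euclidean(x, mu1), euclidean(x, mu2)
    if d1 == d2:
        return 0
    return 1 if d1 < d2 else 2

print(classify((0.0, 0.5)))  # nearer mu1, so prints 1
```

A point on the perpendicular bisector of the two means, such as (2.5, 2.5) here, is equidistant and ambiguous, mirroring the behavior of x1 through x3 in Table 1.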
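The whitening transform of Equation (1) can be sketched in pure Python for 2-D data, with a hand-rolled eigendecomposition of the 2x2 covariance matrix standing in for MATLAB's eig. The data and all function names are illustrative assumptions, not the original sets.

```python
import math

def covariance(points):
    # Biased (1/n) sample covariance of 2-D points: returns (sxx, sxy, syy).
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return sxx, sxy, syy

def normalize(v):
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n)

def eig2x2(a, b, c):
    # Eigen-decomposition of the symmetric matrix [[a, b], [b, c]].
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    l1, l2 = tr / 2.0 + disc, tr / 2.0 - disc
    if abs(b) > 1e-12:
        v1, v2 = normalize((l1 - c, b)), normalize((l2 - c, b))
    else:
        # Already diagonal; order the axis vectors to match l1 >= l2.
        v1, v2 = ((1.0, 0.0), (0.0, 1.0)) if a >= c else ((0.0, 1.0), (1.0, 0.0))
    return (l1, l2), (v1, v2)

def whiten(points):
    # y = Lambda^(-1/2) Phi^T x; assumes a full-rank covariance matrix.
    a, b, c = covariance(points)
    (l1, l2), (v1, v2) = eig2x2(a, b, c)
    s1, s2 = 1.0 / math.sqrt(l1), 1.0 / math.sqrt(l2)
    return [(s1 * (v1[0] * x + v1[1] * y), s2 * (v2[0] * x + v2[1] * y))
            for x, y in points]

# Illustrative correlated points (stand-ins for the hand-drawn data).
pts = [(0.0, 0.0), (2.0, 1.0), (1.0, 2.0), (3.0, 3.0)]
white = whiten(pts)
print(covariance(white))  # covariance of the whitened data is ~identity
```

The final print confirms the property the assignment relies on: after Φ^T diagonalizes the covariance and Λ^(-1/2) rescales the principal components, the whitened data have an identity covariance matrix.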
Figure 3. Characteristics of data set 2

The matrix Φ^T rotates the data sets so that their covariance matrices become diagonal. The factor Λ^(-1/2) is added to the transform to rescale the principal components so that the covariance matrices of the data sets become identity matrices. The whitening transform is not an orthonormal transformation because [1]

    (Λ^(-1/2) Φ^T)(Λ^(-1/2) Φ^T)^T = Λ^(-1) ≠ I                   (2)

Because the transform is not orthonormal, Euclidean distances are not preserved [1]. The Euclidean distances for the transformed data are shown in Table 2; the set membership of three of the four test points has changed. The plots of the transformed data sets are shown in Figure 4.

Figure 4. Data set 1 and data set 2 in the transform space

          Distance from mean 1    Distance from mean 2
    x1    3.8518                  4.6664
    x2    3.8858                  3.5975
    x3    3.9424                  3.6835
    x4    4.8572                  2.6981

Table 2: Euclidean Distances Between the Test Points and the Transformed Data Set Means

The untransformed data sets and test set are plotted in Figure 5, with lines marking the classification boundaries in the original space and in the transform space. From this plot the set membership of each test point is clear. The decision boundary in the original space is linear, while the boundary induced by the transform space is parabolic. The parabolic shape arises because each data set has its own whitening transform: comparing distances in the two transformed spaces amounts to comparing two different quadratic forms of x, and the locus where they are equal is a quadratic (here parabolic) curve.

Figure 5. Decision region for the non-transform and transform spaces

III. REFERENCES

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, San Diego, 1990.