Homework for Principal Components Analysis
EE 8993: Fundamentals of Speech Recognition
March 11, 1999

submitted to:
Dr. Joseph Picone
Department of Electrical and Computer Engineering
413 Simrall, Hardy Rd.
Mississippi State University
Box 9571
MS State, MS 39762

submitted by:
Janna Shaffer

I. Original Data Set Calculations

This homework assignment applies principal component analysis (PCA) to a classification problem. The data sets in Figure 1 were hand drawn using a MATLAB graphical user interface. The sample means of the two sets, mean 1 and mean 2, were computed, and the test set consists of the four points x1, x2, x3, and x4.

Figure 1. Data sets and test set used for PCA

After the data and test sets were defined, the Euclidean distance between each test point and each data set mean was calculated, and set membership was assigned to the nearer mean. Three of the four test points are equidistant from the two means, so their membership is ambiguous; the fourth test point clearly belongs to data set 2. These distances are shown in Table 1.

          Distance from mean 1    Distance from mean 2
    x1    3.1623                  3.1623
    x2    2.8284                  2.8284
    x3    2.9155                  2.9155
    x4    3.5355                  2.1213

Table 1: Euclidean Distances Between the Test Points and the Data Set Means

II. Whitening Transformation

To find a better classification of the test data, a linear transformation, the whitening transform, was applied to the data sets and the test set. It shifts the data into another space in the hope of making the variance structure of the data more useful for classification. The transform is shown in Equation (1):

    y = Λ^(-1/2) Φ^T x                                            (1)

where Λ is the diagonal matrix of eigenvalues of the data set's covariance matrix and Φ is the matrix whose columns are the corresponding eigenvectors. The characteristics of data sets 1 and 2 and their transformations are shown in Figures 2 and 3, respectively.

Figure 2. Characteristics of data set 1
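The nearest-mean classification of Section I can be sketched as follows. This is a minimal illustration with made-up 2-D points, since the original hand-drawn data are not available; the helper names are my own.

```python
import math

def euclidean(p, q):
    # Euclidean distance between two 2-D points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(points):
    # Componentwise sample mean of a list of points.
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

# Hypothetical stand-in data (the original sets were hand drawn in MATLAB).
set1 = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
set2 = [(4.0, 4.0), (5.0, 5.0), (4.0, 5.0), (5.0, 4.0)]
mu1, mu2 = mean(set1), mean(set2)

def classify(x):
    # Assign x to the set with the nearer mean; 0 marks the
    # equidistant (ambiguous-membership) case seen in Table 1.
    d1, d2 = euclidean(x, mu1), euclidean(x, mu2)
    if d1 == d2:
        return 0
    return 1 if d1 < d2 else 2

print(classify((0.0, 0.5)))  # nearer mu1, so prints 1
```

A point on the perpendicular bisector of the two means, such as (2.5, 2.5) here, is equidistant and ambiguous, mirroring the behavior of x1 through x3 in Table 1.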
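The whitening transform of Equation (1) can be sketched in pure Python for 2-D data, with a hand-rolled eigendecomposition of the 2x2 covariance matrix standing in for MATLAB's eig. The data and all function names are illustrative assumptions, not the original sets.

```python
import math

def covariance(points):
    # Biased (1/n) sample covariance of 2-D points: returns (sxx, sxy, syy).
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return sxx, sxy, syy

def normalize(v):
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n)

def eig2x2(a, b, c):
    # Eigen-decomposition of the symmetric matrix [[a, b], [b, c]].
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(max(tr * tr / 4.0 - det, 0.0))
    l1, l2 = tr / 2.0 + disc, tr / 2.0 - disc
    if abs(b) > 1e-12:
        v1, v2 = normalize((l1 - c, b)), normalize((l2 - c, b))
    else:
        # Already diagonal; order the axis vectors to match l1 >= l2.
        v1, v2 = ((1.0, 0.0), (0.0, 1.0)) if a >= c else ((0.0, 1.0), (1.0, 0.0))
    return (l1, l2), (v1, v2)

def whiten(points):
    # y = Lambda^(-1/2) Phi^T x; assumes a full-rank covariance matrix.
    a, b, c = covariance(points)
    (l1, l2), (v1, v2) = eig2x2(a, b, c)
    s1, s2 = 1.0 / math.sqrt(l1), 1.0 / math.sqrt(l2)
    return [(s1 * (v1[0] * x + v1[1] * y), s2 * (v2[0] * x + v2[1] * y))
            for x, y in points]

# Illustrative correlated points (stand-ins for the hand-drawn data).
pts = [(0.0, 0.0), (2.0, 1.0), (1.0, 2.0), (3.0, 3.0)]
white = whiten(pts)
print(covariance(white))  # covariance of the whitened data is ~identity
```

The final print confirms the property the assignment relies on: after Φ^T diagonalizes the covariance and Λ^(-1/2) rescales the principal components, the whitened data have an identity covariance matrix.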
Figure 3. Characteristics of data set 2

The matrix Φ^T rotates the data sets so that their covariance matrices become diagonal. The factor Λ^(-1/2) is added to the transform to rescale the principal components so that the covariance matrices of the data sets become identity matrices. The whitening transform is not an orthonormal transformation because [1]

    (Λ^(-1/2) Φ^T)(Λ^(-1/2) Φ^T)^T = Λ^(-1) ≠ I                   (2)

Because the transform is not orthonormal, Euclidean distances are not preserved [1]. The Euclidean distances for the transformed data are shown in Table 2; the set membership of three of the four test points has changed. The plots of the transformed data sets are shown in Figure 4.

Figure 4. Data set 1 and data set 2 in the transform space

          Distance from mean 1    Distance from mean 2
    x1    3.8518                  4.6664
    x2    3.8858                  3.5975
    x3    3.9424                  3.6835
    x4    4.8572                  2.6981

Table 2: Euclidean Distances Between the Test Points and the Transformed Data Set Means

The untransformed data sets and test set are plotted in Figure 5, with lines marking the classification boundaries in the original space and in the transform space. From this plot the set membership of each test point is clear. The decision boundary in the original space is linear, while the boundary induced by the transform space is parabolic. The parabolic shape arises because each data set has its own whitening transform: comparing distances in the two transformed spaces amounts to comparing two different quadratic forms of x, and the locus where they are equal is a quadratic (here parabolic) curve.

Figure 5. Decision region for the non-transform and transform spaces

III. REFERENCES

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed., Academic Press, San Diego, 1990.