EE 8993: Speech Recognition
Homework Assignment #2
Principal Component Analysis
January 23, 1998

submitted to:
Dr. Joseph Picone
Department of Electrical and Computer Engineering
413 Simrall, Hardy Rd.
Mississippi State University
Box 9571
MS State, MS 39762

submitted by:
Julie Ngan
Department of Electrical and Computer Engineering
Mississippi State University
Box 9571
Mississippi State, Mississippi 39762
Tel: 601-325-8335
Fax: 601-325-3149
email: ngan@isip.msstate.edu

I. Problem Definition

Define two data sets whose distributions approximate an ellipse and a pear shape. The elliptical distribution (Set 1) should stretch from lower left to upper right at approximately a 45-degree angle and should be longer in that direction than it is wide. Set 2 should look like a pear with its stem pointing in the specified direction. Set 1 and Set 2 should have the specified approximate means, and each set contains 100 points. Define a test set of four sample points, x1 through x4. Each sample point is first classified as a member of one of the two sets using minimum Euclidean distance to the set means. The two data sets are then analyzed with Principal Component Analysis to find the decision regions, and the sample points are classified again using the results.

II. Data Set Generation

The two data sets are generated using the point operation function in xmgr. Points are drawn randomly to obtain the required shapes and are then shifted vertically or horizontally to obtain the required means. A plot of the two data sets is shown in Figure 1.

III. Euclidean Distance and Direct Data Classification

The Euclidean distance between each sample vector and each of the two data set means is calculated from Equation (1):

    d(x, u_i) = sqrt[ (x - u_i)^T (x - u_i) ],   i = 1, 2        (1)

where x is a sample vector and u_i is the mean of data set i. Each sample point is classified as a member of set 1 or set 2 according to the minimum Euclidean distance. The results are shown in Table 1.

    Sample point   Distance to set 1 mean   Distance to set 2 mean   Classification
    x1             3.1623                   3.1623                   set 2
    x2             2.8284                   2.8284                   set 2
    x3             2.9155                   2.9155                   set 2
    x4             3.5355                   2.1213                   set 2

    Table 1: Results of classifying the sample points.

IV. Direct Decision Region

Because the two means are placed symmetrically about a straight line, every point on that line (the perpendicular bisector of the segment joining the means) is equidistant from the two means. As a result, the decision region is nothing more than that line: points on its left-hand side belong to set 1, whereas points on its right-hand side belong to set 2. Figure 2 shows a plot of the two data sets together with the decision line.

V. Prewhitening Transformation

The Euclidean distance agrees with our physical notion of distance. However, the different dimensions of the vectors are not orthonormal in our data sets, i.e., they are not equally important. Using the Euclidean distance alone will therefore yield undesirable results unless a linear operation is applied that transforms the vector representations into ones based on orthonormal vectors [1]. This is done by decomposing the covariance matrix C into its eigenvectors and eigenvalues:

    C = Φ Λ Φ^T        (2)

where Φ denotes the matrix of eigenvectors of C and Λ denotes the diagonal matrix whose elements are the eigenvalues of C. The transformation matrix can then be written as

    T = Λ^(-1/2) Φ^T        (3)

The values calculated for the two data sets, including the covariance matrix of each set, are shown in Table 2.

    Table 2: Values calculated for the two data sets.

VI. Transformed Data Sets

The two data sets are transformed into their new spaces using the respective transformation matrices. The results are plotted in Figure 3. The means of the two transformed data sets are then determined in the new spaces.
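The computation in Sections V and VI can be summarized with a short numerical sketch. The code below is a minimal illustration rather than the procedure actually used for this assignment: the arrays set1 and set2 are hypothetical stand-ins for the xmgr-generated data (100 points each, stored as 100 x 2 arrays), and NumPy's eigendecomposition is used to build the whitening matrix of Equation (3) and to classify a sample point by the minimum Euclidean distance of Equation (1) in each transformed space.

    import numpy as np

    def whitening_matrix(data):
        # Equation (2): decompose the covariance matrix, C = Phi Lambda Phi^T
        C = np.cov(data, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(C)
        # Equation (3): prewhitening transform, T = Lambda^(-1/2) Phi^T
        return np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

    def classify(x, data_sets):
        # For each set, map both the sample point and the set mean into that
        # set's whitened space and measure the Euclidean distance, Equation (1).
        distances = []
        for data in data_sets:
            T = whitening_matrix(data)
            distances.append(np.linalg.norm(T @ x - T @ data.mean(axis=0)))
        return distances, 1 + int(np.argmin(distances))   # label: 1 or 2

    # Hypothetical usage; set1 and set2 stand in for the xmgr-generated sets.
    # set1 = np.loadtxt("set1.dat")   # shape (100, 2)
    # set2 = np.loadtxt("set2.dat")   # shape (100, 2)
    # dists, label = classify(np.array([0.0, 0.0]), [set1, set2])

Transforming the sample point with the same matrix as the data ensures that the distances are measured in the whitened coordinates, where the covariance of each set is the identity.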
The four sample points are transformed into the two new spaces; the transformed values are listed in Table 3. The Euclidean distances between the transformed means and the transformed sample points are then calculated, and the sample points are reclassified as shown in Table 4.

    Sample point   Transformed space 1    Transformed space 2
    x1             (-0.0032, -1.2805)     (0.5699, 3.8781)
    x2             (0, 0)                 (0, 0)
    x3             (0.0016, 0.6403)       (-0.2849, -1.9391)
    x4             (2.1791, -0.0005)      (-1.3089, 0.4221)

    Table 3: The transformed sample points.

    Sample point   Distance to transformed set 1 mean   Distance to transformed set 2 mean   Classification
    x1             8.8072                               6.2047                               set 2
    x2             8.7165                               5.5012                               set 2
    x3             8.7415                               6.1375                               set 2
    x4             10.8957                              4.1259                               set 2

    Table 4: Euclidean distances from the transformed sample points to the transformed means.

VII. New Decision Region

The new decision region is found by locating the points whose Euclidean distances to the two transformed means, measured in the respective transformed spaces, are equal. These points are plotted along with the untransformed data sets, and the decision boundary is shown to be a parabola in Figure 4.

VIII. Demonstration of Orthonormality

The orthonormality of the transformed data sets is demonstrated in two ways. First, the covariance matrices of the transformed data are computed and shown to be 2 x 2 identity matrices. Second, the locus of points at a constant Euclidean distance c from each transformed mean,

    (y - u_y)^T (y - u_y) = c^2        (4)

is plotted for each transformed set and produces a circle, as shown in Figure 5 and Figure 6.

IX. References

[1] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals, Macmillan Publishing Company, 1993.

Figure 1: The two generated data sets.
Figure 2: The two data sets with the direct decision region.
Figure 3: The data sets in the transformed spaces (set 1 on the left, set 2 on the right).
Figure 4: The new decision region.
Figure 5: Orthonormalized data set 1.
Figure 6: Orthonormalized data set 2.
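As a supplementary sketch of the orthonormality check in Section VIII, the fragment below whitens a single synthetic data set and verifies that the covariance of the transformed points is the 2 x 2 identity matrix. The mean and covariance used for the placeholder data are arbitrary and are not the statistics of the actual xmgr-generated sets.

    import numpy as np

    rng = np.random.default_rng(0)

    # Placeholder data standing in for one of the generated sets:
    # 100 correlated two-dimensional points (mean and covariance are arbitrary).
    data = rng.multivariate_normal(mean=[2.0, 2.0],
                                   cov=[[2.0, 1.5], [1.5, 2.0]],
                                   size=100)

    # Equations (2) and (3): eigendecomposition and whitening matrix.
    C = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(C)
    T = np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T

    # The covariance of the whitened data should be the 2 x 2 identity matrix,
    # so contours of constant Euclidean distance from the whitened mean
    # (Equation (4)) are circles, as in Figures 5 and 6.
    whitened = data @ T.T
    print(np.round(np.cov(whitened, rowvar=False), 6))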