MAST90138 Week 5 Lab

Problems:

The iris data contain various measurements (sepal length, sepal width, petal length and petal width) of 50 flowers from each of 3 species of iris flowers. Type help(iris) to learn about the format of these data.

1. Load the iris data in R (they are already in R). Solution:

data(iris)

2. Do a PC analysis of these data using only the numerical variables, this time using the prcomp command. Using the output of this function, store the eigenvectors of the covariance matrix S in a matrix and the eigenvalues in a vector. Also store the Yik’s in a matrix Y, again using the output of prcomp.

Solution:

PCX=prcomp(iris[,1:4],retx=TRUE)

vec=PCX$rotation

lambda=PCX$sdev^2

Y=PCX$x
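As a sanity check on the three objects stored above, the sketch below compares the prcomp output with a direct eigendecomposition of the sample covariance matrix. The names X, eig and Yc are introduced here for illustration only and are not part of the lab. Eigenvectors are only defined up to sign, so the comparison of the eigenvectors is made on absolute values.

```r
# Numerical variables as a matrix
X <- as.matrix(iris[, 1:4])
PCX <- prcomp(X, retx = TRUE)

# Direct eigendecomposition of the sample covariance matrix S
eig <- eigen(cov(X))

# Eigenvalues: squared standard deviations returned by prcomp
same_values <- all.equal(unname(PCX$sdev^2), eig$values)

# Eigenvectors: equal up to the sign of each column
max_diff <- max(abs(abs(unname(PCX$rotation)) - abs(eig$vectors)))

# Scores: centred data multiplied by the eigenvector matrix
Yc <- scale(X, center = TRUE, scale = FALSE) %*% PCX$rotation
same_scores <- all.equal(unname(Yc), unname(PCX$x))

print(same_values)
print(max_diff)
print(same_scores)
```

Note that prcomp centres the data by default but does not rescale them, which is why the comparison is with cov(X) and not with cor(X).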

3. Draw a screeplot for these data and recall the ψj’s (the cumulative proportion of variance explained by the first j components) from last week. How many components does this suggest you should keep?

Solution:

screeplot(PCX)

cumsum(lambda)/sum(lambda)

The first two PCs together explain about 98% of the variability of the data, and the screeplot confirms a quick and sharp decrease of the λk’s. This suggests that the first two PCs capture a large fraction of the variability of the original data and that with just these two we may be able to uncover interesting features of the data.
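The reasoning above can be sketched numerically as follows. The 0.95 cut-off and the names psi and q are assumptions made for this illustration; the lab itself does not fix a threshold, and the screeplot should still be inspected by eye.

```r
PCX <- prcomp(iris[, 1:4], retx = TRUE)
lambda <- PCX$sdev^2

# psi_j: proportion of the total variance explained by the first j PCs
psi <- cumsum(lambda) / sum(lambda)
print(round(psi, 4))

# Smallest number of PCs whose cumulative proportion reaches the
# (assumed, illustrative) threshold of 0.95
q <- which(psi >= 0.95)[1]
print(q)

# summary() reports the same cumulative proportions in its last row
print(summary(PCX))
```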

4. What is the weight of each original variable in the linear combination used to create PC1 and PC2? Which variables are the most correlated with each PC (describe PC by PC and support your answer by some calculations)?

Solution:

vec

                     PC1         PC2         PC3        PC4
Sepal.Length  0.36138659 -0.65658877  0.58202985  0.3154872
Sepal.Width  -0.08452251 -0.73016143 -0.59791083 -0.3197231
Petal.Length  0.85667061  0.17337266 -0.07623608 -0.4798390
Petal.Width   0.35828920  0.07548102 -0.54583143  0.7536574


PC1 puts weights 0.3613866, -0.08452251, 0.85667061, 0.35828920 on, respectively, the sepal length, the sepal width, the petal length and the petal width. PC2 puts weights -0.6565888, -0.73016143, 0.17337266, 0.07548102 on, respectively, the sepal length, the sepal width, the petal length and the petal width. PC3 puts weights 0.5820299, -0.59791083, -0.07623608, -0.54583143 on, respectively, the sepal length, the sepal width, the petal length and the petal width. PC4 puts weights 0.3154872, -0.31972310, -0.47983899, 0.75365743 on, respectively, the sepal length, the sepal width, the petal length and the petal width.

PC1 puts the most weight on the petal length and also some weight on the sepal length and the petal width; all three contribute positively to PC1. PC2 puts the most weight on the sepal length and the sepal width, both of which have a negative effect on PC2. PC3 puts most of its weight on all variables but the petal length, and PC4 puts most of its weight on the petal width.
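The weights alone do not say which variables are most correlated with each PC, because the variables have different variances. One way to carry out the supporting calculation is sketched below. It assumes the standard identity that the correlation between variable j and PC k equals the weight vec[j,k] times sqrt(lambda[k]) divided by the standard deviation of variable j. The names X, s, R_direct and R_formula are introduced here for illustration only.

```r
X <- as.matrix(iris[, 1:4])
PCX <- prcomp(X, retx = TRUE)
vec <- PCX$rotation
lambda <- PCX$sdev^2

# Direct calculation: correlation of each original variable with each PC
R_direct <- cor(X, PCX$x)

# Same quantity from the weights: vec[j, k] * sqrt(lambda[k]) / s[j]
s <- apply(X, 2, sd)
R_formula <- sweep(vec, 2, sqrt(lambda), "*") / s

# Rows are the original variables, columns are the PCs
print(round(R_direct, 4))
print(all.equal(R_direct, R_formula))
```

In this layout the rows are variables and the columns are PCs, so the variable most correlated with a given PC is the one with the largest absolute entry in that column.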

5. The correlation graph showing the correlation between each of the original variables and the first two PCs is given below. We also provide a table with the values of the correlations between each original variable and each PC. Use this graph and this table to provide more insight into the results of the analysis.

Table 1: Correlations between original variables and the principal components:

      Sepal length    Sepal width   Petal length    Petal width
PC1      0.8974018     -0.3987485      0.9978739      0.9665475
PC2     -0.3906044     -0.8252287      0.0483806      0.0487816
PC3     0.19656672    -0.38363030    -0.01207737    -0.20026170
PC4     0.05882002    -0.11324764    -0.04196487     0.15264831

[Figure: correlations between the Xj’s and PC1 and PC2, with one arrow each for Petal.Width, Petal.Length, Sepal.Length and Sepal.Width; horizontal axis PC1, vertical axis PC2.]

Solution:

[Figure: scatterplot of the first two PCs, with PC1 on the horizontal axis and the points marked by species (setosa, versicolor, virginica).]

All four variables are close to the circle of radius 1, which indicates that they are strongly correlated with the first two PCs. We also know that together the first two PCs explain a large fraction of the variability of the data, so the direction of the arrows can be used in conjunction with the scatterplot of the first two PCs to learn the effect of the original variables on the individuals. In particular, it appears that the setosa tend to be very different from the versicolor and the virginica: they tend to have a larger sepal width than these two. The versicolor and the virginica tend to have larger values of petal width and length and of sepal length. The virginica tend to have larger values of petal width and length than the versicolor.
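The statement that all four variables lie close to the unit circle can be checked by computing each variable's distance from the origin in the correlation graph, and both plots can be redrawn along the lines of the sketch below. The plotting choices (colours 2, 3, 4, plotting symbol, legend position) and the names R12, radius and theta are assumptions made for this illustration and need not match the original figures.

```r
X <- as.matrix(iris[, 1:4])
PCX <- prcomp(X, retx = TRUE)
Y <- PCX$x

# Correlations of the four variables with PC1 and PC2
R12 <- cor(X, Y[, 1:2])

# Distance of each variable from the origin in the correlation graph;
# a value near 1 means the variable is well represented by PC1 and PC2
radius <- sqrt(rowSums(R12^2))
print(round(radius, 3))

# Correlation graph: unit circle plus one arrow per variable
theta <- seq(0, 2 * pi, length.out = 200)
plot(cos(theta), sin(theta), type = "l", asp = 1,
     xlab = "PC1", ylab = "PC2",
     main = "correlations between the Xj's and PC1 and PC2")
arrows(0, 0, R12[, 1], R12[, 2], length = 0.1)
text(R12[, 1], R12[, 2], labels = rownames(R12), pos = 3)

# Scatterplot of the first two PCs, coloured by species
plot(Y[, 1], Y[, 2], col = c(2, 3, 4)[iris$Species], pch = 8,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(iris$Species),
       col = c(2, 3, 4), pch = 8)
```

The sign of a PC is arbitrary, so depending on the R version the picture may appear mirrored along one axis; the interpretation is unchanged as long as the arrows and the scatterplot come from the same fit.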

Going back to the original data using pairs(X,col=c(2,3,4)[class]), where X holds the four numerical variables and class the species labels, we can see that this is indeed the case.
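A self-contained version of this check is sketched below. It uses iris[,1:4] and iris$Species in place of X and class, which is an assumption about how those two objects were defined, and adds the species means as a numerical complement to the plot. The name means is introduced here for illustration only.

```r
# Mean of each measurement within each species
means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(means)

# Scatterplot matrix of the original variables, coloured by species
pairs(iris[, 1:4], col = c(2, 3, 4)[iris$Species])
```

The table of means gives a quick numerical counterpart to the comparisons between species read off the PC plots.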
