0:00

Principal Components Analysis and the Singular Value

Decomposition are really important techniques, both in

the exploratory data analysis phase and also

in the more formal modeling phase.

The techniques can be used very easily in both stages.

But I'm going to talk a little bit about

how they're used in the exploratory phase, and I want to

talk a little bit about what

goes on underneath, and what their underlying basis is.

0:24

So suppose we have some matrix data

here. I just generated some random normal data

using this code right here, and you see

that the matrix that I plotted here using the

image function is not particularly interesting. It looks pretty

noisy and there's no real pattern, as we would

expect. Now, I could just

run a hierarchical cluster analysis on this data set.

I can do a hierarchical cluster

analysis on both the rows

and the columns of the matrix.

And I can do this using the heatmap function very easily in R.

So when I run the heatmap function, you can see that the cluster analysis

is done, and I get the dendrograms printed on both the columns and the rows.

But again, there's no real interesting pattern that emerges, and

that's because there's no real interesting pattern underlying the data.

1:10

So that's fine.

But now, what if we add a pattern to the data set?

So let's try to add something, and

I do it with this code here, with this for loop.

So I loop through all the rows, and for each row I

flip a coin, and if it turns out to be a one I

add a pattern so that five of the columns have a

mean of zero and the other five have a mean of three.

So I just add a little shift here across the columns.
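The lecture's own code is in R and is not shown in the transcript. As a rough sketch of the step just described, here is one way it might look in plain Python. The 40 by 10 shape matches the matrix discussed in the lecture, but the function name `add_column_shift`, the seeds, and the choice to shift the right-hand half of the columns are assumptions made for illustration.

```python
import random

def add_column_shift(matrix, shift=3.0, seed=678910):
    """Flip a fair coin for each row; when it comes up one, add `shift`
    to the right-hand half of that row's columns (modifies `matrix` in place).
    Returns the indices of the rows that were shifted."""
    rng = random.Random(seed)  # seed is an arbitrary, hypothetical choice
    n_cols = len(matrix[0])
    half = n_cols // 2
    shifted_rows = []
    for i, row in enumerate(matrix):
        if rng.random() < 0.5:  # the coin flip
            for j in range(half, n_cols):
                row[j] += shift
            shifted_rows.append(i)
    return shifted_rows

# 40 rows by 10 columns of standard normal noise.
noise_rng = random.Random(12345)
data = [[noise_rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(40)]
original = [row[:] for row in data]  # keep a copy to compare against

shifted_rows = add_column_shift(data)
print(len(shifted_rows), "of", len(data), "rows received the shift")
```

Rows that lose the coin flip stay pure noise, so the left five columns keep a mean near zero everywhere, while the right five columns have a mean near three only in the shifted rows.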

So now if I plot the data, you see that the

five columns on the right hand side are a little

bit more yellow, which means that they have a higher value,

and the five columns on the left hand side

are a little bit more red, which means they have a lower value.

That's because some of the rows have a mean of three in the right hand side.

2:04

So now, if I run a hierarchical cluster analysis on

the data, you can see that

2:12

the two sets of columns are easily separated out.

So you can see that the dendrogram on the top

of the matrix, which is the one on the columns,

clearly splits into two clusters:

there's five on the left, and there's five on the right.

On the rows, it's not so obvious, because

there's no real pattern that goes along the rows.

And so they kind of get reorganized into

a random pattern, and that's the picture that emerges from the heat map.

2:39

Now, we can take a closer look at the patterns in the rows

and columns by looking at the

marginal means of the rows and columns.

So for example I can look at either the ten different column

means or I can plot the 40 row means in this matrix.

So here with this code, that's exactly what I've done.

On the left hand plot I've got the original matrix data.

3:10

And I've plotted in the middle plot here the mean for each of the rows.

If you look, on the y-axis I've got the row number, which goes from

one to 40, so that is roughly parallel with the image on the left.

And on the x-axis I've got the mean of that row.

So for example, you can see that for row ten

the mean is roughly minus 0.25 or something like that.

And then for row 30 the mean is roughly 1.5.

3:37

And so we see that there's a clear shift in the mean as you go across the rows.

Similarly, if you go across the columns, you can see, across the

ten columns, there's a clear shift in the mean of each column.

So the first five columns have roughly a mean of zero or close

to it, and then the next five columns have

roughly a mean of two, because there's a shift there.

So using the plots in the middle and on the

right you can see a clear pattern in the rows and the columns there.
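To make the marginal means concrete, here is a small Python sketch, not the lecture's R code. The 4 by 4 matrix `toy` is a made-up, noise-free stand-in for the 40 by 10 data: three of its rows carry a shift of three in the right-hand columns and one does not. The helper names `row_means` and `col_means` are hypothetical.

```python
from statistics import mean

def row_means(matrix):
    """Mean of each row (one value per observation)."""
    return [mean(row) for row in matrix]

def col_means(matrix):
    """Mean of each column (one value per variable)."""
    return [mean(col) for col in zip(*matrix)]

# Made-up toy data: rows 0, 2 and 3 are shifted, row 1 is not.
toy = [
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 3.0],
    [0.0, 0.0, 3.0, 3.0],
]

print(row_means(toy))  # [1.5, 0.0, 1.5, 1.5]
print(col_means(toy))  # [0.0, 0.0, 2.25, 2.25]
```

The shifted rows have a row mean halfway between zero and three, and the shifted columns have a column mean equal to three times the fraction of rows that were shifted, which is why the column means in the lecture's plot sit below three.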

4:10

So, some related problems. Cluster analysis is

useful for identifying these types of patterns, but we

can maybe take a slightly more formal approach

that takes advantage of the matrix structure of the data.

And so there are two kinds of

problems you might want to look at. The first is that

you have a lot of variables and

you want to create a new set of variables that

are uncorrelated and explain as much variance as possible.

So the idea is that we have a lot of different variables.

Suppose we have hundreds or maybe thousands

or tens of thousands of variables in

our data set and the idea is

that they're not all independent measurements of something.

Right?

So a lot of them will be related to each other.

They will be correlated with each other.

So for example, you'll have two measurements

that are like height and weight, and

those will obviously be related to each

other, so they're not all independent

factors.

And so the idea is that we want to create a set of variables that is smaller

than the original set of variables that we

have and that are all uncorrelated with each other,

so that they represent different types of variation in your data set.

And at the same time, we want this reduced set of variables to explain

as much of the variability in your data set as possible.

5:25

So another related problem is that if you put all the variables together

in one matrix, like the matrix that we showed in the image,

you want to find the best matrix that's created with

fewer variables but still explains the original data.

The more technical term is that you want to find

a lower rank matrix that somehow

explains the original data reasonably well.

So the first goal here is a statistical one, and

it's a common problem that's solved by

the method of principal components analysis.

And the second goal here is more of a data compression problem,

where you want to find a

smaller representation of the original data.

And one way to think about that problem is with the singular value decomposition.

6:10

So the singular value decomposition can be written in matrix

terms as follows. If you have a matrix X, we can think of each

column in this matrix as a variable or a measurement, and each row of

the matrix as an observation, so you might

have many, many observations in a given matrix.

So for example, the rows of your matrix

might represent individual people, and each column would represent

a measurement on those people.

So for example, the first column might be the

height, and the second column might be the weight.

6:41

So then the idea is that if you have a

matrix X that's formatted in this way, then the singular value

decomposition, or the SVD, is a matrix decomposition

which decomposes the original matrix into three separate matrices.

One is called U, one is called D, and the other is called V.

6:59

And so the columns of U are orthogonal,

so they're kind of independent of each other.

They're called the left singular vectors, and the columns of

V are also orthogonal and they're called the right singular vectors.

And then D is a diagonal matrix which contains the singular values.

So that's the basic idea of the singular value decomposition.

We'll talk about these components a little bit later on.
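In matrix notation the decomposition is X = U D V^T. The following is a minimal sketch in Python with NumPy, rather than the R used in the lecture, that checks the properties just described on a made-up 40 by 10 matrix of random normal data. The seed and variable names are arbitrary assumptions. The last step builds a rank-one matrix from the first singular value and vectors, as one illustration of the lower rank idea mentioned earlier.

```python
import numpy as np

rng = np.random.default_rng(0)       # arbitrary seed, for reproducibility only
X = rng.normal(size=(40, 10))        # rows are observations, columns are variables

# full_matrices=False gives the "thin" SVD: U is 40x10, d holds 10 singular
# values (largest first), and Vt is the 10x10 transpose of V.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
D = np.diag(d)
V = Vt.T

# Columns of U and of V are orthonormal, and U D V^T reproduces X.
u_orthonormal = np.allclose(U.T @ U, np.eye(10))
v_orthonormal = np.allclose(V.T @ V, np.eye(10))
reconstructs = np.allclose(U @ D @ V.T, X)
print(u_orthonormal, v_orthonormal, reconstructs)

# A rank-one approximation of X built from the largest singular value only.
X_rank1 = d[0] * np.outer(U[:, 0], V[:, 0])
print(np.linalg.matrix_rank(X_rank1))
```

Note that `np.linalg.svd` returns the singular values as a vector and returns V already transposed, so the diagonal matrix D and the matrix V have to be formed explicitly if you want them in the form used in the lecture.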

Principal components analysis, usually known as PCA,

is a related technique that uses the singular value decomposition.

And the basic idea is that if you were to take

the original data matrix, subtract the column mean

from each column,

divide by the column standard deviation, and

then run an SVD on that normalized matrix,

the principal components would be equal to the right singular vectors
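As a hedged check of that relationship, here is a Python and NumPy sketch, not the lecture's R code. The data are synthetic: a made-up "height" column, a "weight" column correlated with it, and an unrelated third column, all invented for illustration. The sketch standardizes the columns, runs an SVD, and compares the right singular vectors with the eigenvectors of the correlation matrix, which is one standard way of defining the principal component directions.

```python
import numpy as np

rng = np.random.default_rng(1)                        # arbitrary seed
n = 200
height = rng.normal(170.0, 10.0, size=n)              # synthetic values
weight = 0.9 * height + rng.normal(0.0, 8.0, size=n)  # correlated with height
other = rng.normal(0.0, 1.0, size=n)                  # unrelated variable
X = np.column_stack([height, weight, other])

# Subtract each column mean and divide by each column standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# SVD of the standardized matrix; the columns of V are the right singular vectors.
U, d, Vt = np.linalg.svd(Z, full_matrices=False)
V = Vt.T

# Eigen-decomposition of the correlation matrix, sorted largest eigenvalue first.
corr = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each singular vector matches an eigenvector up to sign, so the absolute
# value of their dot product should be one.
alignment = np.abs(np.sum(V * eigvecs, axis=0))
print(np.allclose(alignment, 1.0))
print(np.allclose(d**2 / (n - 1), eigvals))
```

The sign of each vector is arbitrary in both computations, which is why the comparison uses absolute values rather than checking the vectors for exact equality.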