
Hello and welcome to this lesson that introduces Dimension Reduction.

Dimension reduction is a very important technique for reducing

the number of features that you're going to use when building a machine learning model.

Unlike feature selection however,

dimension reduction can actually create

new features that are generated from the original set of features.

In addition, techniques such as principal component analysis also give you a measure of the importance of each of these new features.

This importance measure is called the explained variance.

So, you can use this explained variance to determine how many of these new features you actually need to keep in order to capture the majority of the signal that you're trying to model with a machine learning algorithm.

So, in this particular lesson,

you need to be able to explain how the PCA algorithm operates.

You should be able to explain the relationship between individual PCA components and their explained variance, and you should be able to apply the PCA algorithm by using the scikit-learn library.
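As a minimal sketch of what that looks like in scikit-learn (the data here are random and the variable names are my own, just for illustration):

```python
# Minimal sketch: fit PCA with scikit-learn and inspect the explained variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 4))          # 100 samples, 4 original features

pca = PCA(n_components=2)              # keep the two strongest new features
X_new = pca.fit_transform(X)           # project the data onto them

print(X_new.shape)                     # (100, 2)
print(pca.explained_variance_ratio_)   # fraction of variance per component
```

The `explained_variance_ratio_` attribute is exactly the importance measure just described.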

The activities for this particular lesson include two readings.

The first one is a visual interactive website and I'll show you that in just a second.

The second is a notebook by Jake VanderPlas that shows you

how PCA can be applied to data by using the scikit-learn library.

And then lastly, is our own notebook on introduction to dimension reduction.

So, let me just jump straight into this interactive website.

There are a couple of different examples they provide to help you understand the effect of PCA in creating new dimensions.

First is a 2D visualization.

So, we have our original data and then we have our PCA components overlaid.

And then over here on the right, we actually have the same data points in these principal component dimensions.

There's also a visualization down here of how those data points are distributed along the original features and then along the new PCA features. And so you can see that along the second principal component, they're all compact.

That's the idea that you get.

Of the new components, the first one has most of the variance, and in this case the second one has very little variance. So, the first new feature actually encodes most of the information.

Then there's an example in 3D.

And then there's another example in

much higher dimensional space that actually has more information in it.

So I'm only going to talk real briefly here about the 2D example.

You can play with the 3D example.

This is interactive. So we can move data points around.

And as we do, you notice that the principal components change,

the distribution of data points over here changes as well, and you see that the way the variance is spread across these two components changes too.

So, if we move these points around a lot,

you can see that now there's not really a good signal.

This is sort of a circle, so we can't expect the data to be well represented in the new feature space.

On the other hand, if I move these data points down so that they become a line again, you can see what happens to the new principal components shown up here.

You can see the data points are almost completely in line with the first principal component, and the second principal component again has basically zero variance.

So, this shows you how PCA generates

new features that ideally capture as

much of the variance in the first few components as possible.

In this case, we only had two,

so that first new component had most of the variance.

The second thing to look at is this notebook.

It talks a lot about principal component analysis, how it works, and how you might use it to improve the results of the models you employ.

And so this is a nice notebook that does this.

PCA is used for dimension reduction because, when we have the explained variance, we can say: look, we don't need to keep all the features; we can really just keep a couple and still have most of the signal.

Of course, most of the content is in the notebook for this week,

which is our dimension reduction notebook.

Dimension reduction, and PCA in particular, is an unsupervised technique.

We don't use labels to generate this.

We learn the distribution from the data and compute the new components, or features, automatically.

In this particular notebook,

you're going to look at the idea of principal component analysis.

You're going to see how we actually create features from a data set, in this case the iris data set, and how we can apply that to machine learning.

So, we're going to see how a signal could be captured from the original features and

how well we can capture it from

a reduced set of features via principal component analysis.

Then, we'll look at a similar idea to principal component analysis, called factor analysis.

And then we're going to move on to PCA applied to the handwritten digit data set, and you're going to see how we can capture a given fraction of the explained variance,

how that will impact the reconstruction of the original data set and other things,

such as the covariance matrix.

And lastly, we'll have a quick demonstration of

some other dimension reduction techniques, just to sort of show you how you can use them and how they compare to PCA in terms of capturing the signal and reconstructing the data.

So, first what do we do?

We look at the PCA algorithm.

This is what this does.

We have a data set here.

You can see that it's kind of a long elliptical shape.

The idea is that there's clearly an important component or feature along this diagonal, and then perpendicular to that (PCA works with an orthogonal basis) a direction that has less signal.

And so we could imagine rotating

this coordinate frame to capture that and that's what we do.

We apply PCA to the data that we just randomly generated, and you can see that now the data is distributed

primarily along this new primary component

and there's less spread along the secondary component.
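A small sketch of that demonstration, assuming random data generated with NumPy (the exact distribution used in the notebook may differ):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
# An elongated elliptical cloud: most of the spread lies along one diagonal.
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [1.0, 1.0]])

pca = PCA(n_components=2).fit(X)
# The first new component captures most of the variance; the second, very little.
print(pca.explained_variance_ratio_)
```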

We then go in and apply this to the iris data set. We look at it visually.

You can see that, when we look at the original feature space, there is this combination of petal length and petal width where there's actually a very compact shape, similar to what we just saw with the ellipse from our random data.

So, we can actually create new features from that data set and when we do that,

you can see that we generate four new features.

The first one has 92.5 percent of the signal.

That's quite a bit. This next one has 5.3.

So, the first two components capture almost 98 percent of the entire signal.

That means those two features are probably all we need when we actually do machine learning.

The other two are very small, nearly noise.

We can then apply this to machine learning.

We can use SVM on our original data set and get an accuracy and a confusion matrix,

and then we can do PCA on that data,

and then do the exact same machine learning and see what we get,

and it turns out that our results are consistent.

Up here, our confusion matrix had just two errors, down here with versicolor.

And if we scroll down here,

you see you get the exact same results.

So, that showed you that we could reduce the data.

We actually cut the amount of data we were analyzing in half

from four features to two features and yet we got the same result.
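That comparison can be sketched as follows; this is my own condensed version, not the notebook's exact code:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: SVM on all four original features.
acc_full = SVC().fit(X_tr, y_tr).score(X_te, y_te)

# Fit PCA on the training data only, keep two components, and repeat.
pca = PCA(n_components=2).fit(X_tr)
acc_pca = SVC().fit(pca.transform(X_tr), y_tr).score(pca.transform(X_te), y_te)

print(acc_full, acc_pca)  # the two accuracies are typically very close
```

Note that PCA is fit on the training split only, so no information leaks from the test set.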

That means we're going to be running our analysis faster and perhaps getting more precise results, because we're not being affected by noise or small-variance features as we were with the original data.

Next, we introduce factor analysis.

It's a slightly different technique for computing the coefficients; they no longer need to be orthogonal, as they are with principal component analysis.

And when we do this again, it shows us that there's two important features,

just like we saw with the original iris data set.
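In scikit-learn, factor analysis follows the same fit/transform pattern as PCA; here is a brief sketch on the iris data (my own illustration, not the notebook's code):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

X, _ = load_iris(return_X_y=True)

fa = FactorAnalysis(n_components=2, random_state=0)
X_fa = fa.fit_transform(X)             # project onto two latent factors

print(fa.components_.shape)            # (2, 4): two factors over four features
```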

Next, we move into the digit data set.

We look at some images,

we then apply PCA to them and we can compute a mean image from all of our pixels.

We can actually look at the fraction of explained variance.

This is an interesting plot because it shows how many features we need to retain in order to capture a given total amount of the variance in our signal.

So, you can see that with just 40 features,

remember we started with 64,

and with just 40 features we have 99 percent of the original signal.

And with around 20,

we have 90 percent of the original signal.

So, you can see that this data set can actually be compacted into

a much smaller amount of data and still retain most of the signal.
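One way to reproduce the numbers behind that plot, assuming the standard scikit-learn digits data (the thresholds and variable names here are mine):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)    # 1797 images, 64 pixel features
pca = PCA().fit(X)                     # keep all 64 components

cumulative = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative variance reaches each threshold:
n90 = int(np.searchsorted(cumulative, 0.90)) + 1
n99 = int(np.searchsorted(cumulative, 0.99)) + 1
print(n90, n99)                        # roughly 20 and 40 components
```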

You can actually visualize the generated components, the new dimensions if you will, and they're shown going along here.

You could imagine, when you look at the original data, how this particular component is capturing part of the zero, maybe a little bit of the two, part of the four, five, six, eight, and nine.

You can see this one kind of looks like a zero.

This one maybe the two maybe part of the

eight and you could sort of understand what the algorithm is doing.

But the important thing here is that we

don't actually need to know what the algorithm is doing.

It's generating these components mathematically and we're simply looking at them.

Notice that as we get closer to 20 here, though, the fluctuations become more random. You see less and less structure, which is telling us, in part, that these components are capturing less and less signal and more noise.

And as we get to 40 and beyond, you can see that these carry very little information.

Remember, with 40 features we had already retained 99 percent of the signal, which tells us that the remaining features contain very little useful information.

We can also use PCA to recover our original data.

So what we do here, is we perform PCA on

the data by using different numbers of retained components.

So, one component, two, et cetera, up to 10, 20, and 40.

And then we plot the reconstructed data by using just those numbers of components.

So, if we only have one component,

which doesn't have most of the signal and we try to reconstruct

these original images shown here on the top row,

you could see it doesn't do a very good job.

This three doesn't really look like a three,

the four maybe a little bit and not until we get down here to about five and even 10,

do we start to see that, yeah,

this kind of looks like a zero,

one, two, three, et cetera.

When we get out to 40,

definitely you can see that we've captured most of the signal.

In other words, if you compare this row, or even this row with 20 components, to the very first row, you can see that the reconstruction is fairly faithful.
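These reconstructions can be produced with PCA's `inverse_transform`; a compact sketch (variable names are my own):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

errors = []
for k in (1, 5, 10, 20, 40):
    pca = PCA(n_components=k).fit(X)
    X_rec = pca.inverse_transform(pca.transform(X))   # back to 64 pixels
    errors.append(float(np.mean((X - X_rec) ** 2)))   # reconstruction error

print(errors)  # the error shrinks as more components are retained
```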

The other thing you can look at is a covariance matrix,

which relates the different pixels to each other.

So of course the diagonal is highlighted,

but you can see that different pixels are tied to other pixels.

That's an interesting way to sort of try to understand

how the PCA components are actually calculated.

And then lastly, we actually look at some different techniques for performing dimension reduction.

Here's PCA, with some of the top 10 components.

Here's another technique called non-negative matrix factorization.

There is fast independent component analysis.

There's mini-batch PCA, there's mini-batch dictionary learning, and then there's also factor analysis.

So, these are just different ways to show how that all works.
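Two of these alternatives sketched with scikit-learn, fit on the digits data (the parameter choices here are mine, not the notebook's):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, FastICA

X, _ = load_digits(return_X_y=True)    # pixel values are non-negative

# Non-negative matrix factorization: components and weights are all >= 0.
nmf = NMF(n_components=10, init="nndsvda", max_iter=500).fit(X)

# Fast independent component analysis: seeks statistically independent sources.
ica = FastICA(n_components=10, random_state=0, max_iter=500).fit(X)

print(nmf.components_.shape, ica.components_.shape)  # (10, 64) each
```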

And then we can use those techniques,

some of them, to reconstruct the original data set.

And again, they all do a fairly similar job.

In part, that's simply because the data we're analyzing can be recovered well in this manner.

With that, I'm going to go ahead and stop.

Hopefully, I've given you a good introduction

to dimension reduction and the importance of PCA.

This is a very fundamental technique that we're often going to want to

use before we apply a subsequent algorithm.

If you have any questions, be sure to ask me in the course forums and good luck.