
All right. In the last video,

we looked at the t-SNE,

our first non-linear unsupervised learning method.

Now, it's time to introduce

another powerful non-linear unsupervised learning approach that became

extremely popular in recent years among

practitioners and researchers of machine learning.

Similar to how we found that both linear regression

and logistic regression are special cases

of a general class of feed-forward neural networks,

it turns out that the beautiful construction of the PCA is nothing

but a very special limited case of a larger class of autoencoder models.

In this video, our objective will be to understand how a simple autoencoder works,

and how it can be used for dimension reduction.

To this end, let's come back to our general diagram of unsupervised learning process.

This diagram of unsupervised learning data flow,

that we already saw illustrates

the very same autoencoder that we want to look at more carefully now.

It has two main blocks,

an encoder and a decoder.

The encoder takes the input vector X and transforms it into a compressed signal Z.

The decoder takes the compressed signal Z and reconstructs the inputs from it.

If we call these reconstructed inputs X hat, then

the objective of the autoencoder is to find a pair of encoder

and decoder such that the reconstructed input X hat

would be as close as possible to the original input X.

As we do not want the autoencoder to simply memorize the inputs,

we can make it very difficult for it by making the dimension of

the internal representation less or maybe even much less than the dimension of the input.

Such an autoencoder is called undercomplete.

There is also an overcomplete version of the autoencoder that works in tandem

with regularization, but we will not go into these details here.

Now, after decoding the compressed signal Z, we get the reconstructed input X hat.

So, for example, if the input X is an N-dimensional vector,

then the mean squared error between X and X hat

can be taken as a loss metric for such unsupervised learning with an autoencoder.
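
As a minimal sketch of this loss, assuming the input and its reconstruction are held in NumPy arrays (the function name `mse_loss` and the sample numbers are made up for illustration):

```python
import numpy as np

def mse_loss(x, x_hat):
    """Mean squared error between an input and its reconstruction,
    averaged over all entries."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return float(np.mean((x - x_hat) ** 2))

# Illustrative numbers only: a 4-dimensional input and an imperfect reconstruction.
x = np.array([1.0, 2.0, 3.0, 4.0])
x_hat = np.array([1.0, 2.0, 2.0, 6.0])

loss = mse_loss(x, x_hat)     # (0 + 0 + 1 + 4) / 4 = 1.25
perfect = mse_loss(x, x)      # a perfect reconstruction gives zero loss
```

The same function works unchanged when X is a whole data matrix, because the mean is taken over every entry.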

Now, let's recall that we have also had

the words encoder and decoder in our previous videos,

in an apparently different context,

when we discussed the PCA.

In the PCA approach,

we had the linear encoder Z equals X times V that transformed

an N by P data matrix

X into an N by K matrix of encoded signals Z.

Then a decoded signal is obtained by multiplying Z by V-transpose.
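
The linear encoder and decoder described here can be sketched in NumPy. This is an illustration under assumptions, not the course's own code: the data are random toy numbers, and V is taken to be the top K eigenvectors of the sample covariance matrix of the centered data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: N = 200 observations of P = 5 correlated features.
n, p, k = 200, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)              # PCA works with centered columns

# Columns of V are the top-K eigenvectors of the sample covariance matrix.
cov = X.T @ X / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder from largest to smallest
V = eigvecs[:, order[:k]]                # P x K

Z = X @ V            # encoder: N x P data -> N x K encoded signal
X_hat = Z @ V.T      # decoder: N x K encoded signal -> N x P reconstruction

reconstruction_mse = float(np.mean((X - X_hat) ** 2))
```

Because the columns of V are orthonormal, decoding is just multiplication by the transpose, and the reconstruction error comes entirely from the P minus K directions that were dropped.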

Now, the question is how

the linear encoder and decoder that we had in the PCA are related to

the encoder and decoder that

appeared as blocks in our unsupervised learning diagram for the autoencoder?

And the answer is that they are

special linear implementations of

the more generic encoder and decoder components of a general autoencoder.

And the same holds for the PCA itself.

The PCA is a special linear autoencoder.

At this point, you could object that the dear professor

did not mention any optimization when talking about the PCA,

and yet now he claims that the PCA is a special linear autoencoder,

whose training includes a minimization of some loss function.

How is that possible? Well, it turns out that there

are two mathematically equivalent formulations of the PCA.

One, is what we presented above,

where we viewed the PCA as a projection onto

an orthogonal basis that keeps most of the total variance of the data.

The second interpretation of the PCA is due to

Pearson and was actually presented by him in 1901,

well ahead of the first interpretation, which was suggested only in 1933.

Pearson's formulation of the PCA is via a minimization of

the MSE between the original inputs X and their reconstruction X hat.

For this particular case, the MSE

is referred to as a quadratic distortion measure.

If you want to follow the mathematical details of how minimization of

quadratic distortion measure is equivalent to the variance maximization view of the PCA,

you can consult chapter 12.1.2 in Bishop's book.
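
For orientation, the key identity behind this equivalence can be written in one line. The notation is an assumption chosen to match the encoder above: X is the centered N by P data matrix, V is a P by K matrix with orthonormal columns, and S is the sample covariance matrix.

```latex
\frac{1}{N}\,\bigl\lVert X - X V V^{\top} \bigr\rVert_F^{2}
  \;=\; \operatorname{tr}(S) \;-\; \operatorname{tr}\!\bigl(V^{\top} S V\bigr),
\qquad S = \frac{1}{N}\, X^{\top} X, \qquad V^{\top} V = I_K .
```

The first term on the right is the total variance and does not depend on V, so minimizing the reconstruction error on the left is the same as maximizing the retained variance in the second term. The left-hand side differs from the MSE averaged over all entries only by the constant factor 1/P.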

But here instead of going into the math of this,

I want to talk to you about an implementation of these ideas.

Let's follow our good tradition of seeing artificial neurons

or, even better, layers of such neurons everywhere in machine learning,

like we did with regression and classification tasks last week.

Again, all we have to do to make an encoder

is to replace the final neuron with a few neurons that

will store a low dimensional internal representation of the data.

Now, we can have non-linear encoders with different numbers of hidden layers.

But what happens in a special case when all activation functions here are linear?

Well, in this case as a composition of

linear transforms is just another linear transform,

a multi-layer linear encoder would be equivalent to a single layer linear encoder,

which is precisely the linear encoder that we had in the PCA.
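
The collapse of stacked linear layers into one can be checked numerically. This is a small sketch with made-up random weights and no bias terms; the layer sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: 6 inputs -> 4 hidden units -> 2 codes, linear activations only.
X = rng.normal(size=(10, 6))
W1 = rng.normal(size=(6, 4))
W2 = rng.normal(size=(4, 2))

Z_two_layers = (X @ W1) @ W2     # pass through two linear layers in sequence

W_single = W1 @ W2               # one 6 x 2 matrix that does the same job
Z_one_layer = X @ W_single

same_output = bool(np.allclose(Z_two_layers, Z_one_layer))
```

The single matrix `W_single` has the same form Z equals X times W as the PCA encoder, which is why extra linear layers add nothing new.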

But if we use non-linear activations to capture properties of complex data,

then the depth of the network matters.

Sometimes, a few layers are needed to build a good compact representation.

Now, we can unravel the whole cascade from

the inputs to the internal representation in a reverse order.

This produces a neural decoder.

If we now pipeline the encoder and decoder,

this gives the autoencoder.
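
To make the pipeline concrete, here is a minimal NumPy sketch of an undercomplete autoencoder trained by plain gradient descent on the MSE loss. Everything specific in it is an assumption made for illustration: the toy data, a single tanh layer in the encoder, a single linear layer in the decoder, the learning rate, and the number of steps. In practice you would more likely use a deep learning library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 256 samples of 4 features driven by 2 hidden factors.
n, p, k = 256, 4, 2
latent = rng.normal(size=(n, k))
X = latent @ rng.normal(size=(k, p)) + 0.05 * rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize each column

# Parameters of a one-layer encoder and a one-layer decoder.
W_enc = 0.5 * rng.normal(size=(p, k))
b_enc = np.zeros(k)
W_dec = 0.5 * rng.normal(size=(k, p))
b_dec = np.zeros(p)

def forward(X, W_enc, b_enc, W_dec, b_dec):
    Z = np.tanh(X @ W_enc + b_enc)   # encoder: P inputs -> K codes
    X_hat = Z @ W_dec + b_dec        # decoder: K codes -> P reconstructed inputs
    return Z, X_hat

def mse(X, X_hat):
    return float(np.mean((X - X_hat) ** 2))

_, X_hat = forward(X, W_enc, b_enc, W_dec, b_dec)
initial_loss = mse(X, X_hat)

learning_rate = 0.05
for _ in range(2000):
    Z, X_hat = forward(X, W_enc, b_enc, W_dec, b_dec)
    # Backpropagate the MSE loss by hand.
    d_X_hat = 2.0 * (X_hat - X) / X.size
    d_W_dec = Z.T @ d_X_hat
    d_b_dec = d_X_hat.sum(axis=0)
    d_pre = (d_X_hat @ W_dec.T) * (1.0 - Z ** 2)   # tanh'(a) = 1 - tanh(a)^2
    d_W_enc = X.T @ d_pre
    d_b_enc = d_pre.sum(axis=0)
    # Gradient descent step on all parameters.
    W_enc = W_enc - learning_rate * d_W_enc
    b_enc = b_enc - learning_rate * d_b_enc
    W_dec = W_dec - learning_rate * d_W_dec
    b_dec = b_dec - learning_rate * d_b_dec

Z, X_hat = forward(X, W_enc, b_enc, W_dec, b_dec)
final_loss = mse(X, X_hat)
```

After training, the two columns of `Z` are the compressed signal for each sample, and `final_loss` can be compared with `initial_loss` to see how much the reconstruction improved.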

If we have just two neurons in the innermost layer,

their activations can now be used for visualization of your data.

This graph shows you an example of an autoencoder with two layers in

both the encoder and decoder with equal numbers of neurons per layer.

In practice, autoencoders are often built with

a gradual decrease in the number of neurons per layer,

as we move from the outer layers to inner ones.

Another popular way to represent autoencoders is to

flip the whole graph by 90 degrees and remove individual neurons from the picture.

In this graph, I show a vertically oriented autoencoder with two neurons in

the innermost layer and a gradual decrease in the number of neurons in the inner layers.

It also shows the trick of tied weights in the middle layers of the encoder and decoder.

This reduces the number of parameters in

the problem and often helps to get better generalization.
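
The saving from tied weights is easy to count. The sketch below assumes a hypothetical autoencoder whose encoder layers have sizes 10, 6, and 2, whose decoder mirrors them, and where every decoder layer reuses the transpose of the matching encoder matrix. Bias terms are left out of the count.

```python
def count_weights(encoder_sizes, tied):
    """Count weight-matrix entries in a mirrored autoencoder (biases excluded).

    encoder_sizes lists layer widths from the input to the innermost layer.
    With tied weights, each decoder layer reuses the transpose of an encoder
    matrix, so only the encoder matrices are free parameters.
    """
    encoder_weights = sum(
        n_in * n_out
        for n_in, n_out in zip(encoder_sizes[:-1], encoder_sizes[1:])
    )
    return encoder_weights if tied else 2 * encoder_weights

sizes = [10, 6, 2]                            # hypothetical layer widths
untied = count_weights(sizes, tied=False)     # 2 * (10*6 + 6*2) = 144
tied = count_weights(sizes, tied=True)        # 10*6 + 6*2 = 72
```

With all layers tied, the number of weights is halved. If only the middle layers are tied, the saving is correspondingly smaller.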

Now, let's recap what we learned in this lesson.

We talked about the PCA and about how it can be used as

a data transformation and dimension reduction tool.

Then we identified some potential issues with the PCA

as a dimension reduction tool, such as

its heavy reliance on a linear transform of data,

its discarding of less important dimensions, and

the crowding problem that is inherent in dimension reduction with the PCA.

Then we introduced the t-SNE and saw how it can

be used for better visualization of your data.

We also discussed why, in its basic form,

the t-SNE is not a good dimension reduction method.

Then we made our first introduction to a wide class of autoencoder models

that typically do a much better job in proper tasks of dimension reduction.

And this will be it for dimension reduction methods

in this first course of the specialization.

In your homework for this week,

you will work with the PCA,

t-SNE, and the autoencoder and compare

how these different algorithms perform on the stock return data.

As the next topic of this week's videos,

let's now talk about another class of

unsupervised learning algorithms, namely clustering methods.