Hello, and welcome back.

This week, you'll learn about optimization algorithms

that will enable you to train your neural network much faster.

You've heard me say before that applying machine learning is a highly empirical process,

a highly iterative process,

in which you just have to train a lot of models to find one that works really well.

So, it really helps to be able to train models quickly.

One thing that makes it more difficult is that

deep learning tends to work best in the regime of big data.

We are able to train neural networks on a huge data

set, and training on a large data set is just slow.

So, what you find is that having fast optimization algorithms,

having good optimization algorithms can really

improve the efficiency of you and your team.

So, let's get started by talking about mini-batch gradient descent.

You've learned previously that vectorization allows

you to efficiently compute on all m examples,

and that allows you to process your whole training set without an explicit for loop.

That's why we would take our training examples and stack them

into this huge matrix, capital X:

x^(1), x^(2), x^(3), and so on, up to x^(m) for m training examples.

And similarly for Y: this is y^(1), y^(2),

y^(3), and so on up to y^(m).

So, the dimension of X was n_x by m, and Y was 1 by m. Vectorization allows

you to process all m examples relatively

quickly, but if m is very large then it can still be slow.

For example, what if m was 5 million or 50 million or even bigger?
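To make the stacked matrices concrete, here is a minimal NumPy sketch. The sizes (n_x = 4, m = 6) and the random data are assumptions chosen only for illustration; the point is just that each training example becomes one column of X and each label becomes one entry in the row Y.

```python
import numpy as np

# Hypothetical sizes for illustration; in the lecture m is in the millions.
n_x, m = 4, 6

rng = np.random.default_rng(0)
# One column vector per training example: x^(1), ..., x^(m).
examples = [rng.standard_normal((n_x, 1)) for _ in range(m)]
# One scalar label per example: y^(1), ..., y^(m) (made-up values).
labels = [float(i % 2) for i in range(m)]

# Stack the examples as columns of X and the labels as a single row Y.
X = np.hstack(examples)               # shape (n_x, m)
Y = np.array(labels).reshape(1, m)    # shape (1, m)

print(X.shape, Y.shape)
```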

With the implementation of gradient descent on your whole training set,

what you have to do is,

you have to process your entire training set

before you take one little step of gradient descent.

And then you have to process your entire training set of

five million training samples again before

you take another little step of gradient descent.

So, it turns out that you can get a faster algorithm if you let gradient descent

start to make some progress even before you finish processing your entire,

your giant training set of 5 million examples.

In particular, here's what you can do.

Let's say that you split up your training set into smaller,

little baby training sets and these baby training sets are called mini-batches.

And let's say each of your baby training sets has just 1,000 examples.

So, you take x^(1) through x^(1000) and you call that your first little baby training set,

also called a mini-batch.

And then you take the next 1,000 examples,

x^(1001) through x^(2000), and then the next 1,000 examples after that, and so on.

I'm going to introduce a new notation. I'm going to call

this first one X superscript with curly braces 1, written X^{1},

and I am going to call this next one

X superscript with curly braces 2, written X^{2}.

Now, if you have 5 million training samples total

and each of these little mini-batches has a thousand examples,

that means you have 5,000 of these, because 5,000 times 1,000 equals 5 million.

Altogether you would have 5,000 of these mini-batches.
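The count can be checked with a couple of lines of Python. The variable names are just illustrative, and the rounding-up division is an assumption that goes slightly beyond the example here: it would also cover a training set whose size is not an exact multiple of the mini-batch size, in which case the last mini-batch would simply be smaller.

```python
m = 5_000_000        # total number of training examples
batch_size = 1_000   # examples per mini-batch

# Ceiling division: the number of mini-batches needed to cover all m examples.
num_mini_batches = (m + batch_size - 1) // batch_size

print(num_mini_batches)
```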

So it ends with X superscript curly braces

5,000, that is X^{5000}, and then similarly you do the same thing for Y.

You would also split up your training data for Y accordingly.

So, y^(1) through y^(1000) is called Y^{1}, then y^(1001) through y^(2000)

is called Y^{2}, and so on until you have Y^{5000}.

Now, mini-batch number t is going to be comprised of X^{t}

and Y^{t}, and

that is a thousand training examples with the corresponding input-output pairs.
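Here is one way the split could be sketched in NumPy. The helper name make_mini_batches and the small sizes are hypothetical, chosen so the example runs quickly; the sketch slices consecutive columns in order, exactly as described above, and does nothing else to the data.

```python
import numpy as np

def make_mini_batches(X, Y, batch_size):
    """Split X (n_x, m) and Y (1, m) column-wise into a list of (X_t, Y_t) pairs."""
    m = X.shape[1]
    mini_batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size
        # Columns start..end-1 hold the same examples in both X and Y.
        mini_batches.append((X[:, start:end], Y[:, start:end]))
    return mini_batches

# Hypothetical small data set: 10,000 examples instead of 5 million.
n_x, m, batch_size = 3, 10_000, 1_000
X = np.arange(n_x * m, dtype=float).reshape(n_x, m)
Y = np.zeros((1, m))

mini_batches = make_mini_batches(X, Y, batch_size)

# X^{2}, Y^{2} in the lecture's 1-based notation sit at Python index 1.
X_2, Y_2 = mini_batches[1]
print(len(mini_batches), X_2.shape, Y_2.shape)
```

Note that the list is 0-indexed, so mini-batch number t in the notation above is mini_batches[t - 1].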

Before moving on, just to make sure my notation is clear,

we have previously used superscript round brackets i to index into the training set, so x^(i)

is the i-th training example.

We use superscript square brackets

l to index into the different layers of the neural network.

So, z^[l] is the z value

for layer l of the neural network, and here we are introducing

the curly brackets t to index into different mini-batches.

So, you have X^{t} and Y^{t}, and to check your understanding of these,

what are the dimensions of X^{t} and Y^{t}?

Well, X is n_x by m. So,

if X^{1} holds the x values for a thousand examples,

then its dimension should be n_x by 1,000, and X^{2} should also be n_x by 1,000, and so on.

So, all of the X^{t} should have dimension n_x by 1,000, and

all of the Y^{t} should have dimension 1 by 1,000.
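If you want to confirm these dimensions numerically, a quick check along the following lines should do it. The sizes are assumptions (n_x = 5 and only 3,000 examples, so that it runs instantly), and np.split is used here on the assumption that m divides evenly by the mini-batch size.

```python
import numpy as np

# Hypothetical sizes: 3 mini-batches of 1,000 examples each.
n_x, m, batch_size = 5, 3_000, 1_000
X = np.zeros((n_x, m))
Y = np.zeros((1, m))

num_mini_batches = m // batch_size
# Split along the column (example) axis into equally sized pieces.
X_batches = np.split(X, num_mini_batches, axis=1)
Y_batches = np.split(Y, num_mini_batches, axis=1)

for t, (X_t, Y_t) in enumerate(zip(X_batches, Y_batches), start=1):
    assert X_t.shape == (n_x, batch_size)   # n_x by 1,000
    assert Y_t.shape == (1, batch_size)     # 1 by 1,000
    print(f"X^{{{t}}}: {X_t.shape}, Y^{{{t}}}: {Y_t.shape}")
```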