0:00

Neural networks are one of the most powerful learning algorithms that

we have today.

In this and in the next few videos, I'd like to start talking about a learning

algorithm for fitting the parameters of a neural network given a training set.

As with the discussion of most of our learning algorithms, we're going to

begin by talking about the cost function for fitting the parameters of the network.

0:38

I'm going to use upper case L to denote the total number of

layers in this network.

So for the network shown on the left we would have capital L equals 4.

I'm going to use s subscript l to denote the number of units,

that is, the number of neurons,

not counting the bias unit, in layer l of the network.

So for example, we would have s one, which is the input layer,

equal to three units, and s two in my example is five units.

And the output layer S four,

which is also equal to S L because capital L is equal to four.

The output layer in my example has four units.
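To make the notation concrete, here is a minimal sketch in Python. The list `layer_sizes` and the helper `s` are hypothetical names, and the sizes 3, 5, 5, 4 are an assumption about the example network (the transcript only states s two and s four clearly).

```python
# Layer sizes for the four-layer example network, bias units not counted.
# The values 3, 5, 5, 4 are assumed for illustration.
layer_sizes = [3, 5, 5, 4]

L = len(layer_sizes)  # total number of layers, capital L


def s(l):
    """Return s_l, the number of units in layer l (1-indexed, as in the lecture)."""
    return layer_sizes[l - 1]


print(L)     # number of layers
print(s(L))  # number of units in the output layer
```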

1:17

We're going to consider two types of classification problems.

The first is Binary classification, where the labels y are either 0 or 1.

In this case, we will have 1 output unit. This neural network

on top has 4 output units, but if we had binary classification we would have only

one output unit that computes h(x).

And the output of the neural network, h(x), is going to be a real number.

And in this case the number of output units,

s L, where L is again the index of the final layer,

because that's the number of layers we have in the network,

is going to be equal to 1.

In this case to simplify notation later, I'm also going to set K=1 so

you can think of K as also denoting the number of units in the output layer.

The second type of classification problem we'll consider will

be multi-class classification problem where we may have K distinct classes.

So our earlier example had this representation for y if we have 4 classes,

and in this case we will have capital K output units, and

our hypothesis will output vectors that are K-dimensional.

And the number of output units will be equal to K.

And usually we would have K greater than or equal to 3 in this case, because

if we had two classes, then we don't need to use the one-versus-all method.

We use the one-versus-all method only if we have K greater than or equal to

3 classes, so with only two classes we need only one output unit.
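As a rough illustration of the K-dimensional label representation, the sketch below turns integer class labels into indicator vectors. The function name `one_hot` is hypothetical, and labels are assumed to be numbered from 0 to K-1 here purely for Python indexing convenience.

```python
import numpy as np


def one_hot(labels, K):
    """Map integer class labels in {0, ..., K-1} to K-dimensional 0/1 vectors."""
    labels = np.asarray(labels)
    Y = np.zeros((labels.size, K))
    # Set a single 1 in each row, at the column given by that row's label.
    Y[np.arange(labels.size), labels] = 1.0
    return Y


# Four training labels for a problem with K = 4 classes.
Y = one_hot([0, 2, 3, 1], K=4)
print(Y)
```

Each row of `Y` is one training label y, with exactly one entry equal to 1.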

Now let's define the cost function for our neural network.

3:03

The cost function we use for the neural network is going to be a generalization

of the one that we use for logistic regression.

For logistic regression we minimized the cost function J(theta), which

was minus 1/m of this cost function, plus this extra

regularization term here, where this was a sum from j=1 through n,

because we did not regularize the bias term theta0.
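Written out, the regularized logistic regression cost described here usually takes the following form. The factor lambda over 2m on the regularization term is the common convention and is assumed here, since it is not spoken in the transcript.

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)})
          + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]
          + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
```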

3:31

For a neural network, our cost function is going to be a generalization of this.

where instead of having basically just one logistic regression output unit,

we may instead have K of them.

So here's our cost function.

Our neural network now outputs vectors in R^K, where K might be equal to 1 if we have

a binary classification problem.

I'm going to use this notation h(x) subscript i to denote the ith output.

That is, h(x) is a K-dimensional vector and so this subscript i just

selects out the ith element of the vector that is output by my neural network.

4:08

My cost function J(theta) is now going to be the following.

It is -1 over m of a sum of a similar term to what we have for

logistic regression, except that we have the sum from k equals 1 through K.

This summation is basically a sum over my K output

units. So if I have four output units,

that is if the final layer of my neural network has four output units,

then this is a sum from k equals one through four of

basically the logistic regression algorithm's cost function but

summing that cost function over each of my four output units in turn.

And so you notice in particular that this applies to y k and h k,

because we're basically taking the kth output unit and comparing that to the value

of y k, which is the element of one of those vectors saying which class it should be.

And finally, the second term here is the regularization term,

similar to what we had for the logistic regression.
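Putting the two terms together, the cost function being described can be written as below. The index ranges and the lambda over 2m factor follow the usual convention for this formulation and are assumptions here, since the transcript does not spell them out.

```latex
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log \left(h_\Theta(x^{(i)})\right)_k
          + (1 - y_k^{(i)}) \log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right]
          + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2
```

With K = 1 the inner sum has a single term and the first part reduces to the logistic regression cost.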

This summation term looks really complicated, but all it's doing is

summing over these terms theta j i l for all values of i, j and l,

except that we don't sum over the terms corresponding to these bias values,

like we had for logistic regression.

Concretely, we don't sum over the terms corresponding to where i is equal to 0.

That is because when we're computing the activation of a neuron,

we have terms like these:

theta i 0,

plus theta i 1 times x1, plus and so on,

where I guess I could put in a 2 there if this is the first hidden layer.

And so the values with a zero there,

that corresponds to something that multiplies into an x0 or an a0.

And so this is kind of like a bias unit, and by analogy to what we were doing for

logistic regression, we won't sum over those terms in our regularization term

because we don't want to regularize them and shrink their values to zero.
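A small sketch of what this looks like with a weight matrix, assuming NumPy and a layout where column 0 of the matrix holds the bias weights. The matrix `Theta`, its shape, and the input `x` are made-up values for illustration only.

```python
import numpy as np

# Hypothetical weight matrix mapping a 3-unit layer to a 5-unit layer.
# Shape is (5, 3 + 1): column 0 holds the bias weights Theta[i, 0].
rng = np.random.default_rng(0)
Theta = rng.normal(size=(5, 4))

x = np.array([0.5, -1.0, 2.0])
a = np.concatenate(([1.0], x))  # prepend x0 = 1, the bias unit

# z[i] = Theta[i, 0] * 1 + Theta[i, 1] * x1 + Theta[i, 2] * x2 + ...
z = Theta @ a

# Sum of squared weights for regularization, skipping the bias column.
reg_sum = np.sum(Theta[:, 1:] ** 2)
print(z, reg_sum)
```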

But this is just one possible convention, and even if you were to sum over i

equals 0 up to Sl, it would work about the same and doesn't make a big difference.

But maybe this convention of not regularizing the bias term

is just slightly more common.
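The whole cost function can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `nn_cost` is hypothetical, the network outputs `H` are passed in directly rather than computed by forward propagation, column 0 of each weight matrix is assumed to hold the bias weights, and the lambda over 2m scaling is the usual convention rather than something stated in the transcript.

```python
import numpy as np


def nn_cost(H, Y, Thetas, lam):
    """Regularized cost for a neural network classifier.

    H      : (m, K) array of network outputs, each entry strictly in (0, 1)
    Y      : (m, K) array of 0/1 labels
    Thetas : list of weight matrices, column 0 of each holding bias weights
    lam    : regularization strength (lambda)
    """
    m = Y.shape[0]
    # Sum the logistic regression cost over all m examples and K output units.
    data_term = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Sum squared weights over every layer, skipping the bias column.
    reg_term = lam / (2 * m) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
    return data_term + reg_term


# Toy check with made-up numbers: m = 2 examples, K = 2 output units.
H = np.array([[0.9, 0.1], [0.2, 0.8]])
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
Thetas = [np.array([[5.0, 1.0, -1.0]]),
          np.array([[7.0, 2.0], [3.0, -1.0]])]
cost = nn_cost(H, Y, Thetas, lam=1.0)
print(cost)
```

Changing only the entries in column 0 of any matrix in `Thetas` leaves the regularization term unchanged, which is the convention of not regularizing the bias terms.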