
One way to learn the parameters of the neural network so that it gives you a good encoding for your pictures of faces is to define and apply gradient descent on the triplet loss function. Let's see what that means.

To apply the triplet loss, you need to compare pairs of images. For example, to learn the parameters of the neural network, you have to look at several pictures at the same time. Given this pair of images, you want their encodings to be similar, because these are the same person. Whereas given this pair of images, you want their encodings to be quite different, because these are different persons.

In the terminology of the triplet loss, what you're going to do is always look at one anchor image, and you want the distance between the anchor and the positive image, meaning an image of the same person, to be small. Whereas when the anchor is compared to a negative example, you want their distance to be much further apart. So this is what gives rise to the term triplet loss, which is that you'll always be looking at three images at a time: an anchor image, a positive image, and a negative image. I'm going to abbreviate anchor, positive, and negative as A, P, and N.

To formalize this, what you want is for the parameters of your neural network, of your encodings, to have the following property: you want the squared norm of the difference between the encoding of the anchor and the encoding of the positive example, ||f(A) - f(P)||^2, to be small, and in particular, you want it to be less than or equal to the squared norm of the difference between the encoding of the anchor and the encoding of the negative, ||f(A) - f(N)||^2. Of course, the term on the left is d(A, P) and the term on the right is d(A, N), and you can think of d as a distance function, which is why we named it with the letter d. Now, if we move the term from the right side of this inequality to the left side, what you end up with is ||f(A) - f(P)||^2 - ||f(A) - f(N)||^2, and you want this to be less than or equal to zero.
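To make the rearranged inequality concrete, here is a minimal NumPy sketch; the four-dimensional encodings are toy values of my own invention for illustration (real systems use much higher-dimensional encodings):

```python
import numpy as np

def triplet_gap(f_a, f_p, f_n):
    """Return ||f(A)-f(P)||^2 - ||f(A)-f(N)||^2; we want this <= 0."""
    d_ap = np.sum((f_a - f_p) ** 2)  # d(A, P)
    d_an = np.sum((f_a - f_n) ** 2)  # d(A, N)
    return d_ap - d_an

# Toy encodings, purely illustrative.
f_a = np.array([0.1, 0.2, 0.3, 0.4])  # anchor
f_p = np.array([0.1, 0.2, 0.3, 0.5])  # positive: close to the anchor
f_n = np.array([0.9, 0.1, 0.7, 0.2])  # negative: far from the anchor
print(triplet_gap(f_a, f_p, f_n))  # negative here, so the condition holds
```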

But now, we're going to make a slight change to this expression, because one trivial way to make sure it is satisfied is to just learn everything to be zero. If f always outputs zero, then this is zero minus zero, which is zero. So by saying f of any image equals a vector of all zeros, you can almost trivially satisfy this inequality. Another way for the neural network to give a trivial output is if the encoding for every image were identical to the encoding of every other image, in which case you again get zero minus zero. So to make sure that the neural network doesn't just output zero for all the encodings, and doesn't set all the encodings equal to each other, what we're going to do is modify this objective to say that this doesn't just need to be less than or equal to zero, it needs to be quite a bit smaller than zero. In particular, if we say this needs to be less than negative alpha, where alpha is another hyperparameter, then this prevents the neural network from outputting the trivial solutions.

And by convention, usually, we write plus alpha on the left instead of negative alpha on the right: ||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha <= 0. Alpha is also called a margin, which is terminology you'd be familiar with if you've seen the literature on support vector machines, but don't worry about it if you haven't. And we can also modify the equation on top by adding this margin parameter.

To give an example, let's say the margin is set to 0.2. If in this example d(A, P) is equal to 0.5, then you won't be satisfied if d(A, N) is just a little bit bigger, say 0.51. Even though 0.51 is bigger than 0.5, you're saying that's not good enough; we want d(A, N) to be much bigger than d(A, P), and in particular you want it to be at least 0.7 or higher. Alternatively, to achieve this margin, this gap of at least 0.2, you could either push d(A, N) up or push d(A, P) down, so that there is at least this gap of the hyperparameter alpha, 0.2, between the distance from the anchor to the positive versus the distance from the anchor to the negative. So that's what having a margin parameter here does: it pushes the anchor-positive pair and the anchor-negative pair further apart from each other.
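As a quick sanity check on these numbers, here is a tiny sketch of the margin condition (the helper name is my own, not from the lecture; the negative distance 0.75 is chosen to sit comfortably above the 0.7 threshold):

```python
def satisfies_margin(d_ap, d_an, alpha=0.2):
    """True when d(A,P) - d(A,N) + alpha <= 0,
    i.e. d(A,N) exceeds d(A,P) by at least the margin alpha."""
    return d_ap - d_an + alpha <= 0

print(satisfies_margin(0.5, 0.51))  # False: 0.51 beats 0.5, but not by 0.2
print(satisfies_margin(0.5, 0.75))  # True: the gap is well above the margin
```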

So, let's take this equation we have here at the bottom and, on the next slide, formalize it and define the triplet loss function. The triplet loss function is defined on triples of images. So given three images A, P, and N, the anchor, positive, and negative examples, the positive example is of the same person as the anchor, but the negative is of a different person than the anchor. We're going to define the loss as follows. The loss on this example, which is really defined on a triplet of images, starts with what we had on the previous slide: ||f(A) - f(P)||^2 - ||f(A) - f(N)||^2, and then plus alpha, the margin parameter.

And what you want is for this to be less than or equal to zero. So, to define the loss function, let's take the max between this and zero. The effect of taking the max here is that so long as this term is less than or equal to zero, the loss is zero, because the max of something less than or equal to zero and zero is just zero. So, so long as you achieve the objective of making this thing I've underlined in green less than or equal to zero, the loss on this example is equal to zero. But if, on the other hand, this term is greater than zero, then the max ends up selecting this thing I've underlined in green, and so you would have a positive loss. So by trying to minimize this, you have the effect of trying to push this term to be zero, or less than or equal to zero; and so long as it is zero or less, the neural network doesn't care how much further negative it is. So this is how you define the loss on a single triplet, and the overall cost function for your neural network can be a sum over a training set of these individual losses on different triplets.
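The per-triplet loss and the overall cost can be sketched in a few lines of NumPy; this is a bare-bones illustration under the definitions above, not a production implementation, with alpha defaulting to the 0.2 used in the earlier example:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """L(A,P,N) = max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    return max(d_ap - d_an + alpha, 0.0)

def cost(triplets, alpha=0.2):
    """J: sum of the individual losses over all training triplets."""
    return sum(triplet_loss(a, p, n, alpha) for a, p, n in triplets)

# An easy triplet contributes zero loss; a hard one contributes a positive loss.
easy = (np.zeros(2), np.array([0.1, 0.0]), np.array([1.0, 0.0]))
hard = (np.zeros(2), np.array([0.1, 0.0]), np.array([0.2, 0.0]))
print(cost([easy, hard]))
```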

So, if you have a training set of, say, 10,000 pictures with 1,000 different persons, what you'd have to do is take your 10,000 pictures and use them to select triplets like this, and then train your learning algorithm using gradient descent on this type of cost function, which is really defined on triplets of images drawn from your training set. Notice that in order to define this dataset of triplets, you do need some pairs of A and P, pairs of pictures of the same person. So for the purpose of training your system, you do need a dataset where you have multiple pictures of the same person. That's why in this example I said 10,000 pictures of 1,000 different persons, so maybe you have 10 pictures on average of each of your 1,000 persons to make up your entire dataset. If you had just one picture of each person, then you can't actually train this system. But of course, after having trained the system, you can then apply it to your one-shot learning problem, where for your face recognition system, maybe you have only a single picture of someone you might be trying to recognize. But for your training set, you do need to make sure you have multiple images of the same person, at least for some people in your training set, so that you can have pairs of anchor and positive images.
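One way to see the "multiple pictures per person" requirement in code: a hypothetical helper (the names here are my own invention, not from the lecture) that groups a dataset by person, then samples A and P from the same person and N from a different one. Anchors can only come from people with at least two pictures, which is exactly why one picture per person is not enough:

```python
import random
from collections import defaultdict

def sample_triplets(dataset, num_triplets, seed=0):
    """dataset: list of (person_id, image) pairs.
    Returns (A, P, N) triplets where A and P share a person and N does not."""
    rng = random.Random(seed)
    by_person = defaultdict(list)
    for person, image in dataset:
        by_person[person].append(image)
    # Only people with >= 2 pictures can supply an anchor-positive pair.
    eligible = [p for p, imgs in by_person.items() if len(imgs) >= 2]
    triplets = []
    for _ in range(num_triplets):
        person = rng.choice(eligible)
        a, p = rng.sample(by_person[person], 2)       # same person, two pictures
        other = rng.choice([q for q in by_person if q != person])
        n = rng.choice(by_person[other])              # different person
        triplets.append((a, p, n))
    return triplets
```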

Now, how do you actually choose these triplets to form your training set? One of the problems is that if you choose A, P, and N at random from your training set, subject to A and P being of the same person and A and N being of different persons, then this constraint is very easy to satisfy. Because given two randomly chosen pictures of people, chances are the encodings of A and N are much more different than those of A and P. I hope you still recognize this notation: d(A, P) is what we had written on the last few slides, the squared norm distance between the encodings from the previous slide. But if A and N are two randomly chosen different persons, then there is a very high chance that d(A, N) will be bigger, by more than the margin alpha, than the term on the left, and so the neural network won't learn much from it.

So to construct a training set, what you want to do is choose triplets A, P, and N that are hard to train on. In particular, while you want this constraint to be satisfied for all triplets, a triplet that is hard is one where you choose values for A, P, and N so that d(A, P) is actually quite close to d(A, N). In that case, the learning algorithm has to try extra hard to take the term on the right and push it up, or take the term on the left and push it down, so that there is at least a margin of alpha between the left side and the right side. And the effect of choosing these triplets is that it increases the computational efficiency of your learning algorithm. If you choose your triplets randomly, then too many triplets will be really easy, and so gradient descent won't do anything, because your neural network will just get them right pretty much all the time. It's only by using hard triplets that the gradient descent procedure has to do some work to push these quantities further away from those quantities.
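A crude sketch of the idea follows. Real triplet mining, as described in the paper mentioned next, is done online within each mini-batch; this simplified offline version just ranks candidate triplets by how small the gap d(A, N) - d(A, P) is, so the hardest ones come first:

```python
import numpy as np

def hardest_first(candidates, k):
    """Keep the k candidate (f_a, f_p, f_n) triplets with the smallest
    gap d(A,N) - d(A,P), i.e. the ones the network finds hardest."""
    def gap(triplet):
        f_a, f_p, f_n = triplet
        return np.sum((f_a - f_n) ** 2) - np.sum((f_a - f_p) ** 2)
    return sorted(candidates, key=gap)[:k]

# The hard triplet (negative barely further than the positive) is kept;
# the easy one (negative far away) is dropped.
easy = (np.zeros(2), np.array([0.1, 0.0]), np.array([2.0, 0.0]))
hard = (np.zeros(2), np.array([0.1, 0.0]), np.array([0.2, 0.0]))
print(hardest_first([easy, hard], 1))
```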

And if you're interested, the details are presented in this paper by Florian Schroff, Dmitry Kalenichenko, and James Philbin, where they have a system called FaceNet, which is where a lot of the ideas I'm presenting in this video come from. By the way, here's a fun fact about how algorithms are often named in the deep learning world: if you work in a certain domain, call it blank, you often end up with a system called blank-net or deep-blank. So, we've been talking about face recognition; this paper is called FaceNet, and in the last video you just saw DeepFace. This idea of blank-net or deep-blank is a very popular way of naming algorithms in the deep learning world. You should feel free to take a look at that paper if you want to learn some of these other details for speeding up your algorithm by choosing the most useful triplets to train on; it is a nice paper.

So, just to wrap up: to train on the triplet loss, you need to take your training set and map it to a lot of triplets. Here is a triplet with an anchor and a positive, both of the same person, and a negative of a different person. Here's another one where the anchor and positive are of the same person, but the anchor and negative are of different persons, and so on. And what you do, having defined this training set of anchor, positive, and negative triplets, is use gradient descent to try to minimize the cost function J we defined on an earlier slide. That will have the effect of backpropagating to all of the parameters of the neural network, in order to learn an encoding so that d of two images will be small when the two images are of the same person, and large when they are of different persons.

So that's it for the triplet loss and how you can use it to train a neural network to output a good encoding for face recognition. Now, it turns out that today's face recognition systems, especially the large-scale commercial face recognition systems, are trained on very large datasets. Datasets north of a million images are not uncommon; some companies are using north of 10 million images, and some companies have north of 100 million images with which to try to train these systems. So these are very large datasets even by modern standards, and these datasets are not easy to acquire.

Fortunately, some of these companies have trained these large networks and posted the parameters online. So rather than trying to train one of these networks from scratch, this is one domain where, because of the sheer data volumes, it might often be useful for you to download someone else's pretrained model rather than do everything from scratch yourself. But even if you do download someone else's pretrained model, I think it's still useful to know how these algorithms were trained, in case you need to apply these ideas from scratch yourself for some application.

So that's it for the triplet loss. In the next video, I want to show you some other variations on Siamese networks and how to train these systems. Let's go on to the next video.
