[MUSIC] Hi, in this video we will talk about MLP implementation details: how to do the computations efficiently, even though backpropagation is already efficient. First, let's notice that a dense layer can be replaced with a matrix multiplication. Here we have a row x, which holds our input features. Then we have two neurons, z1 and z2, whose weights sit in the first and second columns of the matrix W. Let me remind you how matrix multiplication actually works. To get the element z1, you take the first row of matrix x (which is its only row) and compute its dot product with the first column of W. To get z2, you take the same row of x and compute its dot product with the second column of W. You can pause the video and verify that these are exactly our neurons, with the coefficients stored in the first and second columns of W. So a dense layer can be rewritten as a matrix multiplication: we take the input row x, multiply it by the matrix W, and get our outputs. Why do we want matrix multiplications? Because they can be computed very efficiently, on the CPU using a BLAS library and on the GPU using cuBLAS. One more thing: matrix multiplication, even through the numpy library, is much faster than doing the same work with Python loops, so you should always use matrix multiplication when you can. Now let's see how the backward pass for a dense layer works. The forward pass is easy: we apply the matrix multiplication and we have our outputs. The backward pass is a little trickier, because here we need the derivative of the loss function with respect to every coefficient in the matrix W. Let's say that our loss function is some scalar value computed on top of our predictions z1 and z2. First, we need to come up with a notation for this derivative.
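The forward pass described above can be sketched in a few lines of numpy. The shapes and values here are just toy assumptions for illustration: one sample with three features going into a layer with two neurons.

```python
import numpy as np

# Hypothetical toy values: one sample, three features, two neurons (z1, z2).
x = np.array([[1.0, 2.0, 3.0]])        # shape (1, 3): a single input row
W = np.array([[0.1, 0.4],
              [0.2, 0.5],
              [0.3, 0.6]])             # shape (3, 2): one column per neuron

z = x @ W                              # dense layer forward pass, shape (1, 2)
# z[0, 0] is the dot product of the row x with the first column of W,
# z[0, 1] is its dot product with the second column.
```

Here `z[0, 0] = 1*0.1 + 2*0.2 + 3*0.3 = 1.4`, exactly the first neuron computed by hand.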
Let's say that the derivative of a scalar loss with respect to a matrix is a matrix of scalar derivatives. The element with index (1, 1), for example, is dL/dW_11, and so forth. Why is this notation convenient? Because SGD simply subtracts a scaled gradient from the weights of the previous iteration — that is how we make a gradient descent step, right? If you keep the coefficients in a matrix and the gradient step in a matrix of the same shape, the update becomes a matrix subtraction, which can be done efficiently in numpy without any Python for loops. So it's pretty convenient to have the gradient as a matrix. One more thing: you can apply the chain rule, and you can write the chain rule in matrix form as well. Let's drill down into that. Let's try to calculate one element of the matrix dL/dW, say the derivative dL/dW_ij. To do that, we apply the chain rule, because our loss is a function of z1 and z2, so we need to go through them. And you can see that the only path to W_ij goes through z_j; you do not go through the other neurons, because they don't use this weight. So effectively we get: dL/dW_ij = dL/dz_j * x_i, which follows directly from how z_j is computed. Now you can see that this can be squeezed into matrix notation. We introduce the gradient vector of our loss with respect to our outputs, dL/dz. It is called a gradient vector because each of its elements is the derivative of the loss with respect to one coordinate. Using this vector, you can rewrite dL/dW in matrix notation: dL/dW is x transpose multiplied by dL/dz. You can pause the video and verify that that is true.
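The single-sample rule dL/dW = x^T * dL/dz is easy to check in numpy. The values below are assumed toy numbers; dL/dz stands in for whatever gradient the loss produces on the outputs.

```python
import numpy as np

# Assumed toy values: one input row and a made-up output gradient dL/dz.
x = np.array([[1.0, 2.0, 3.0]])     # shape (1, 3)
dLdz = np.array([[0.5, -1.0]])      # shape (1, 2): gradient of loss w.r.t. z

dLdW = x.T @ dLdz                   # shape (3, 2), same shape as W
# Element (i, j) equals x_i * dL/dz_j, matching the chain-rule derivation.
```

For instance, the element (3, 2) of the result is x_3 * dL/dz_2 = 3 * (-1.0) = -3.0.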
So we have seen that you can do the backward pass for an MLP efficiently with matrix multiplication. This is cool because you can do it in numpy and strip away the Python loops, and you can also run that matrix multiplication efficiently on a GPU. Now, let's see what happens when you have not one row x but many of them — for example two, because we usually run stochastic gradient descent on mini-batches, right? So we have several samples; let's check that our matrix multiplication paradigm still works here. You can verify that to get the first neuron for the second sample (the second row of matrix X), you still do an ordinary matrix multiplication: you take the second row of X, take its dot product with the first column of W, and that is the first neuron for the second sample. It just works. Now let's move to the backward pass, where the problem is a little more difficult. To apply SGD with mini-batches, we need the loss over all the elements of our batch, right? So for the SGD step we have a loss over the batch; let's denote it L_b. We need to calculate the derivative of that loss with respect to our parameters W, and to do that we sum the per-sample gradients — we just apply the rule that the derivative of a sum is the sum of the derivatives. Now let's look at one summand in that sum: the derivative of the loss for one sample with respect to W_ij. We already know how to compute it, because we did exactly that two slides ago. Using that known rule for each of the two samples, you get dL_b/dW_ij = dL_1/dz_1j * x_1i + dL_2/dz_2j * x_2i. And you can see that what we got is really similar to a dot product, right?
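The mini-batch forward pass is the same matrix multiplication, just with more rows in X. A small sketch with assumed toy values, two samples stacked as rows:

```python
import numpy as np

# Assumed toy mini-batch: two samples, three features, two neurons.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])     # shape (2, 3): one row per sample
W = np.array([[0.1, 0.4],
              [0.2, 0.5],
              [0.3, 0.6]])          # shape (3, 2): one column per neuron

Z = X @ W                           # shape (2, 2): one output row per sample
# Z[1, 0] is the first neuron for the second sample:
# the dot product of row 2 of X with column 1 of W.
```

Here `Z[1, 0] = 4*0.1 + 5*0.2 + 6*0.3 = 3.2`, which is exactly the hand check from the text.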
And matrix multiplication is all about dot products. So maybe we can come up with matrices that will give us this result as a matrix multiplication — and indeed we can. You can see that dL_b/dW can be computed in matrix notation by taking X transpose and multiplying it by dL/dZ. Here dL/dZ is a known thing — the derivative of a scalar with respect to a matrix, which we compute element-wise — and X transpose is simple as well: you just replace rows with columns, and here you are. Now, let's check that this rule actually works. Let's check it for W_32: if you take the third row of X transpose and take its dot product with the second column of dL/dZ, that yields exactly the formula we came up with. You can pause the video and check that it is correct — so it works. Unfortunately, you should also calculate the derivative of the loss function with respect to X, and this is where it gets a little tricky. Let's apply the chain rule; the approach is standard, so let's apply it element-wise. Take, for example, object i and try to calculate the derivative of the loss on that object with respect to some feature j of that object. How? Apply the chain rule: to get to X, we need to go through Z, right? Because the chain rule is just a path in the computation graph. Writing that out, notice that dz_ik/dx_ij is actually w_jk, thanks to the fact that z is just a linear combination of the feature values in X. Then let's replace W with W transpose, swapping the indices of that element. And then you can see that to calculate the gradient of the batch loss with respect to X, you take dL/dZ and multiply it by W transpose. Why does that work?
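Both matrix-form gradients, dL_b/dW = X^T dL/dZ and dL_b/dX = dL/dZ W^T, can be verified numerically. The sketch below assumes a deliberately simple loss, L(Z) = sum(Z * C) for a fixed matrix C, so that dL/dZ = C is known exactly, and then checks dL/dW against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 3))          # assumed random batch: 2 samples, 3 features
W = rng.normal(size=(3, 2))          # dense layer weights
C = rng.normal(size=(2, 2))          # fixed matrix defining the toy loss

def loss(W_):
    return np.sum((X @ W_) * C)      # simple scalar loss with dL/dZ = C

dLdZ = C
dLdW = X.T @ dLdZ                    # matrix-form gradient from the derivation
dLdX = dLdZ @ W.T                    # gradient w.r.t. the inputs

# Numerical check of dL/dW via central finite differences.
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

assert np.allclose(dLdW, num, atol=1e-4)   # the matrix formula matches
```

The same finite-difference loop over the entries of X would confirm the dL/dZ W^T formula as well.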
Because to get dL_b/dX, you need to take a sum of the derivatives of the per-sample losses with respect to the matrix X, one term for every instance in our batch, right? And you can also notice that each instance contributes only one non-zero row to its term, because each instance depends only on its own features. So effectively, when we sum those terms, we're not really summing anything: we just take the rows and stack them into a matrix. That's why this matrix notation really works. So this is pretty cool: you can apply the backward pass and forward pass for mini-batches, or just for one instance, pretty efficiently. You can do it with matrix multiplication, and you can do it with numpy. Let's summarize what we have come up with and see how to implement it in numpy. The forward pass for a dense layer is done pretty easily: it takes all the inputs — the features and the weights — and computes their matrix product, because that is how the forward pass works. The backward pass is pretty easy as well. Let me remind you that in the backward-pass interface, we are also given the incoming gradient, and that is where it becomes really nice, because we need that incoming gradient to calculate dL/dX and dL/dW efficiently. We simply apply the formulas we derived: take dL/dZ and multiply it either by W transpose or by X transpose to get the derivatives with respect to X or W. So this is implemented pretty efficiently with numpy as well. And notice one more reason why we pass dL/dZ through the backward-pass interface: otherwise, we would have to calculate something like dZ/dX, and that is scary, because Z and X are both matrices, and it's not clear how to represent the derivative of a matrix with respect to another matrix — it's a no-go.
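The summary above maps directly onto a tiny numpy layer. This is a minimal sketch under the interface described in the video — forward takes the inputs, backward takes the incoming gradient dL/dZ and returns dL/dX — with hypothetical names (`Dense`, `dLdW`) chosen here for illustration:

```python
import numpy as np

class Dense:
    """Minimal dense layer sketch: forward is X @ W, backward uses
    the incoming gradient dL/dZ to form dL/dW and dL/dX."""

    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_in, n_out) * 0.01  # small random init

    def forward(self, X):
        self.X = X                   # cache the input for the backward pass
        return X @ self.W            # Z = X W

    def backward(self, dLdZ):
        self.dLdW = self.X.T @ dLdZ  # gradient w.r.t. the weights
        return dLdZ @ self.W.T       # gradient w.r.t. the inputs, sent upstream

layer = Dense(3, 2)
X = np.ones((4, 3))                  # assumed batch of 4 samples
Z = layer.forward(X)                 # shape (4, 2)
dX = layer.backward(np.ones_like(Z)) # shape (4, 3), same as X
```

Both gradients come out of two matrix multiplications and nothing else — no Python loops over samples or weights.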
So we should have an incoming gradient, and thanks to the incoming gradient, we can do all of this efficiently with matrix multiplication. To summarize: you can do the forward pass for a dense layer with a matrix multiplication, and you can do the backward pass with matrix multiplications as well. This is pretty cool, and this is where the GPU comes into play, because on GPUs you can crunch matrices pretty fast. What's more, it's easy to code with numpy, and as a matter of fact, we have an honors assignment for those of you who want to implement this in numpy. In the next video, we will take a quick look at other matrix derivatives. They're scary, but you should know about them. [SOUND]