In this lecture, we're going to start trying to implement this more complex model, the latent factor model, in Python. To begin with, we'll just do so using the simple bias-only version of the model, in other words, the model which takes just an offset, a user bias, and an item bias in order to make a prediction. We'll also show how we can use a library function for gradient descent in order to make this a little bit easier. This is going to be a fairly challenging lecture showing how we can convert mathematical expressions for our derivatives into Python code, which it turns out is a fairly difficult thing to get right. Okay. So this is the model we're going to try and implement using gradient descent. We have this bias-only model including an offset Alpha, a user bias term, and an item bias term, as well as a regularizer for those biases, and we're going to try and write down the gradient equations, convert those to code, and then use a gradient descent library to optimize the model. We saw previously that trying to get all of the details of gradient descent exactly right can be very difficult, so I'm going to use a library function to help us a little bit here. Okay. So starting with some basic stuff. What we'd like to do first is just define the few variables that we're going to try and optimize. Some of the code for this is given in previous lectures; I'm not covering here all of the basic code to read in the data and compute things like the global mean of all of the ratings. Okay. The important parts of this code, then, are the variables that we're going to try and optimize: that's Alpha; the user bias for each user, Beta_u, which we'll store in this dictionary of user biases; and Beta_i, the bias for each item, which we'll store in this dictionary of item biases. Those are all our variables.
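A minimal sketch of that setup, assuming a small hypothetical dataset of (user, item, rating) triples; the lecture loads a real dataset using code from previous lectures:

```python
# Toy dataset of (user, item, rating) triples; in the lecture this is
# read from a real ratings dataset in earlier code
dataset = [('u1', 'i1', 4.0), ('u1', 'i2', 2.0), ('u2', 'i1', 5.0),
           ('u2', 'i3', 3.0), ('u3', 'i2', 1.0), ('u3', 'i3', 4.0)]

# Global mean of all of the ratings
ratingMean = sum(r for _, _, r in dataset) / len(dataset)

# The variables we're going to optimize: the offset alpha, plus one
# bias per user and one per item, stored in dictionaries
alpha = ratingMean                        # initialized at the mean
betaU = {u: 0.0 for u, _, _ in dataset}   # user biases, start at zero
betaI = {i: 0.0 for _, i, _ in dataset}   # item biases, start at zero
```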
We're going to set Alpha to the rating mean, but that's just an initial value to help our gradient descent algorithm converge a little bit more quickly. Starting it at a good initial value might help us find a good solution later on, but all of these values are going to change and update as we actually run gradient descent. Okay. The next thing we'd like to do is, given some estimates of those values, that is, given some value of Alpha, an item bias for each item, and a user bias for each user, how can we actually make a prediction? So there's a prediction function which takes a user and an item and, knowing all the values of our offset, user bias, and item bias terms, just makes a prediction by adding those three together: the offset term, plus the user bias for that user, plus the item bias for that item. Okay. So this is the first complex part of our code. This is an unpack function. This is something we need for the specific gradient descent library we're going to use, which I'll describe in a little bit. What that library expects is a single vector of all of our parameters. So we take our parameters, which are Alpha, a user bias for each user, and an item bias for each item, and convert them to a single vector. This function takes a vector of all of those parameters and converts it back to Alpha, Beta_u, and Beta_i. It takes some vector Theta, and I'm going to say, well, Alpha is going to be position zero in that vector Theta. The user biases are going to be the next set of positions, one for each of our users. The item biases are going to be the next positions, starting from position one plus the number of users, onwards to the rest of that vector. Okay. So we've taken some vector Theta, which is something the library function is going to give us, and we've extracted it into Alpha, a bias term for each user, and a bias term for each item, based on our choice of where we stored each parameter in that vector.
So again, this is something difficult to wrap your head around, but it's something that's often required by library functions for gradient descent, which just assume you have a single vector describing all of your parameters. Okay. The next function, which is also required by our gradient descent library, is a function to implement the cost. The gradient descent library is given some current estimate of all the parameters Alpha, Beta_u, and Beta_i, and you have to tell it what the cost, or the score, or the goodness of that solution is. So this function is going to take the current estimate Theta, and it's going to take the labels, and we're going to tell it how good our solution is. It's also going to take this value Lambda, which is our regularization strength. The first thing we do is unpack that vector Theta: we extract the values of Alpha, Beta_u, and Beta_i. The next thing we do is use those values to make predictions for the data points in our dataset, or our training set if you'd like to be more careful about it. Then we compute our mean squared error, again using a simple function we showed how to implement before, and we print out that mean squared error just for debugging. Having computed the mean squared error, which is the left-hand side of the equation, we add on the regularization components: Lambda times each user bias squared, and Lambda times each item bias squared, and we return that total cost. So that is exactly computing the value of the objective in the equation I've shown. Okay. So this is the really difficult part, which is the code for our derivative. I showed in a previous lecture how to compute some of those derivatives, or in other words, how to write down a mathematical expression for the derivative.
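The cost function described above can be sketched like this, again on a toy dataset (hypothetical data; the real code passes the labels and Lambda in as extra arguments, which is why they appear in the signature):

```python
dataset = [('u1', 'i1', 4.0), ('u1', 'i2', 2.0), ('u2', 'i1', 5.0)]
users, items = ['u1', 'u2'], ['i1', 'i2']
nUsers, nItems = len(users), len(items)

def unpack(theta):
    alpha = theta[0]
    betaU = dict(zip(users, theta[1:nUsers+1]))
    betaI = dict(zip(items, theta[nUsers+1:]))
    return alpha, betaU, betaI

def cost(theta, labels, lamb):
    alpha, betaU, betaI = unpack(theta)
    # Predictions for every (user, item) pair in the training set
    preds = [alpha + betaU[u] + betaI[i] for u, i, _ in dataset]
    mse = sum((p - r)**2
              for p, (_, _, r) in zip(preds, dataset)) / len(dataset)
    # Regularizer: lambda times each bias squared
    reg = lamb * (sum(b**2 for b in betaU.values()) +
                  sum(b**2 for b in betaI.values()))
    return mse + reg
```

With all biases at zero, the regularizer vanishes and the cost is just the MSE of predicting Alpha everywhere, regardless of Lambda.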
Really the most difficult part, though, is to say, "How do we convert that mathematical expression into a block of code, and what's more, a block of code that's actually efficient?" Okay. So the first thing we do, again, is unpack our vector Theta. We now have our current estimates of Alpha, Beta_u for each user, and Beta_i for each item, and we're going to build some new data structures that will store our derivatives. We have a derivative for each user bias, a derivative for each item bias, as well as a derivative for our offset term Alpha, so we need one value for each of those derivatives: all the users, all the items, plus the offset term. You could think about taking each user one at a time and trying to write down the derivative expression you see in the equation. That would work just fine, but it's rather inefficient. What's more efficient is to iterate through the entire dataset just once and update each of the relevant derivatives as we go. If we look at a particular point in the dataset, (u, i), well, that's going to update the derivative of Beta_u, it's going to update the derivative of Beta_i, and it's also going to update the derivative of Alpha. So we iterate through the dataset, and each time we see a user u and an item i, we update the derivative of Alpha, the derivative of Beta_u, and the derivative of Beta_i, following the expression I've given. Okay. After that, we have two more for loops, which encode the derivatives of our regularizers for each user and for each item. Finally, once we've computed the derivatives for Alpha, Beta_u, and Beta_i, what the library function wants is a vector of derivatives, so we convert those values back to a vector. We take the derivative of Alpha, the derivative of the user bias for all users, and the derivative of the item biases for all items, and return that vector of derivatives. So that's the last part of this equation here.
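The single-pass strategy above can be sketched like this, on the same toy setup: one loop over the dataset accumulates the derivatives of Alpha and of each user and item bias, and two more for loops add the derivatives of the regularizers.

```python
dataset = [('u1', 'i1', 4.0), ('u1', 'i2', 2.0), ('u2', 'i1', 5.0)]
users, items = ['u1', 'u2'], ['i1', 'i2']
nUsers, nItems = len(users), len(items)

def unpack(theta):
    alpha = theta[0]
    betaU = dict(zip(users, theta[1:nUsers+1]))
    betaI = dict(zip(items, theta[nUsers+1:]))
    return alpha, betaU, betaI

def derivative(theta, labels, lamb):
    alpha, betaU, betaI = unpack(theta)
    N = len(dataset)
    dalpha = 0.0
    dBetaU = {u: 0.0 for u in users}   # one derivative per user bias
    dBetaI = {i: 0.0 for i in items}   # one derivative per item bias
    # One pass over the dataset: each point (u, i) updates the
    # derivatives of alpha, Beta_u, and Beta_i
    for u, i, r in dataset:
        diff = alpha + betaU[u] + betaI[i] - r   # prediction error
        dalpha += 2 / N * diff
        dBetaU[u] += 2 / N * diff
        dBetaI[i] += 2 / N * diff
    # Two more loops for the derivatives of the regularizers
    for u in users:
        dBetaU[u] += 2 * lamb * betaU[u]
    for i in items:
        dBetaI[i] += 2 * lamb * betaI[i]
    # Pack the derivatives back into a single vector
    return [dalpha] + [dBetaU[u] for u in users] \
                    + [dBetaI[i] for i in items]
```

A quick sanity check: when Alpha is the mean rating and all biases are zero, the residuals sum to zero, so the derivative with respect to Alpha should be zero.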
Again, this function is fairly difficult to follow, and I'm just giving you the rough idea here. It's probably something you want to spend some more time looking at on your own to really understand the relationship between all of the expressions in this function and the original equation I gave for the derivative. Okay. Now, before we actually test this thing, let's first just compute the accuracy of a trivial baseline, which is to always predict the mean rating. Okay, and there's the value. Okay. Before we can actually experiment with and test the solution, we're going to need to perform gradient descent using this library function. I'm going to use a gradient descent library called L-BFGS, which is available as part of SciPy, and it has a fairly simple interface which works as follows. We run this function, which is called scipy.optimize.fmin_l_bfgs_b, and it takes the following components. First of all, there's the cost function we implemented, which says, given some estimates of our parameters Alpha, Beta_u, and Beta_i, what is the total cost? That's the MSE plus the regularizer. The next thing it takes is the initial values for all of our parameters: that's going to be the rating mean for the offset, and then a bunch of zeros, so the bias terms for each user and each item are initialized to zero. Next is the function we implemented to compute the derivative, which is the previous function I just described. The last thing is any other arguments we'd like to be passed to those functions. As you saw previously, those functions also need to see what the labels are and what our regularization constant Lambda is, so we pass those two values to each of those functions every time they're called. That's a very general-purpose interface to gradient descent: we have a cost function, we have a derivative, we have some initialization values, and we have some additional arguments that we can pass in.
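Putting the pieces together, a self-contained sketch of the whole pipeline might look like this (toy data again; the lecture's cost function also prints the MSE for debugging, which is omitted here):

```python
import numpy
from scipy.optimize import fmin_l_bfgs_b

dataset = [('u1', 'i1', 4.0), ('u1', 'i2', 2.0), ('u2', 'i1', 5.0),
           ('u2', 'i3', 3.0), ('u3', 'i2', 1.0), ('u3', 'i3', 4.0)]
users = sorted({u for u, _, _ in dataset})
items = sorted({i for _, i, _ in dataset})
nUsers, nItems = len(users), len(items)
ratingMean = sum(r for _, _, r in dataset) / len(dataset)
labels = [r for _, _, r in dataset]

def unpack(theta):
    alpha = theta[0]
    betaU = dict(zip(users, theta[1:nUsers+1]))
    betaI = dict(zip(items, theta[nUsers+1:]))
    return alpha, betaU, betaI

def MSE(theta):
    alpha, betaU, betaI = unpack(theta)
    return sum((alpha + betaU[u] + betaI[i] - r)**2
               for u, i, r in dataset) / len(dataset)

def cost(theta, labels, lamb):
    _, betaU, betaI = unpack(theta)
    return MSE(theta) + lamb * (sum(b**2 for b in betaU.values()) +
                                sum(b**2 for b in betaI.values()))

def derivative(theta, labels, lamb):
    alpha, betaU, betaI = unpack(theta)
    N = len(dataset)
    dalpha = 0.0
    dBetaU = {u: 0.0 for u in users}
    dBetaI = {i: 0.0 for i in items}
    for u, i, r in dataset:
        diff = alpha + betaU[u] + betaI[i] - r
        dalpha += 2 / N * diff
        dBetaU[u] += 2 / N * diff
        dBetaI[i] += 2 / N * diff
    for u in users:
        dBetaU[u] += 2 * lamb * betaU[u]
    for i in items:
        dBetaI[i] += 2 * lamb * betaI[i]
    return numpy.array([dalpha] + [dBetaU[u] for u in users]
                                + [dBetaI[i] for i in items])

# Initialize alpha at the rating mean and all biases at zero, then
# hand the cost, initial vector, derivative, and extra arguments
# (labels and lambda) to the optimizer
theta0 = numpy.array([ratingMean] + [0.0] * (nUsers + nItems))
theta, finalCost, info = fmin_l_bfgs_b(cost, theta0, derivative,
                                       args=(labels, 0.001))
```

Since the regularizer is non-negative and the initial biases are zero, the optimized parameters can never have a higher training MSE than the mean-rating baseline we computed above.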
So you can really implement any type of gradient descent using this library fairly straightforwardly. One tricky part is that it's going to expect the parameters, and the derivatives we compute, to be stored in this contiguous vector, which we have to pack and unpack using the code I showed. Of course, the hard part is that we actually have to implement those cost and derivative functions, which are fairly difficult to get right. This is just one library for gradient descent; there are many others. You can look into this one a little bit more by following the reference I've given here. Okay. So let's actually try running this code. It's going to run for several iterations, and it's going to print out the mean squared error during each iteration, because we wrote the code inside the cost function to print the MSE at each step. There are various convergence criteria that this algorithm uses to determine when gradient descent has found a good enough solution. You could experiment with those convergence criteria if you wanted to try and get a slightly better solution than the one here, but it's going to run for just a few iterations and then return a vector of parameter values. Okay. So again, this is a general-purpose gradient descent algorithm. It simply requires that we provide this cost function, which is really our f of x, and the derivative, which is really just f prime of x, or the derivative of f of x, and you can relatively straightforwardly apply this to other gradient descent problems. We could also have used TensorFlow for the same task. So far, it's already gotten fairly complicated, and I've only shown the simple model, which is Alpha plus a bias term for each user plus a bias term for each item. In the next lecture, we'll try and implement the full-blown version of this model, the latent factor model, but really the ideas will be exactly the same.
We'll just need to compute a slightly more complicated cost function and a slightly more complicated derivative function. Okay. So in this lecture, we just implemented the simple bias-only version of our latent factor model. On your own, and this is going to be fairly difficult, if you'd like to try this, and if you've got some experience with TensorFlow from the previous parts of this course, you might try re-implementing this model using the TensorFlow library rather than the simple gradient descent library that I used here.