Hi again. During the previous lessons,

you have learnt about the simple yet pretty powerful factorization method

based on the single value decomposition.

However, as you remember,

you had to find the way to deal with the missing values of the utility metrics.

You have considered several different strategies with the missing values imputation and

work through a few concrete examples with zero-based on average-based techniques.

Despite some flexibility of the method,

it depends on your decision what a good replacement for the unknown values is,

but you can actually avoid making this decision.

Let's go back to the general matrix factorization.

You have already seen this picture during the past lessons.

Recall that SVD gives us

an optimal solution to the quadratic minimization task with complete matrices.

You can equivalently rewrite it for the case of a general matrix factorization approach.

This formulation assumes that all the elements,

A, I, J, are known,

which is a significant simplification as you have substituted

probably more than 99% of all the data points.

Now, what you can do here is to learn your model,

not from the entire matrix but only from the observed data.

This way, you don't have to make any assumptions about the missing elements.

The only concern is whether there is

enough data for your model to learn any viable relations.

As the number of users and items is typically very large,

there are almost no chances that only by observing a tiny fraction of the data,

the model will uncover the true hidden relationships within it.

In fact, while the model is likely to fit the training data well,

its predictions on the unobserved data can be off by a significant margin.

This problem is known as overfeeding.

In order to minimize this problem,

you regularize your model by an additional penalty term with

the sum of squared Euclidean norms of the model's parameters.

Lambda here is a regularization coefficient

that is typically determined by cross-validation.

I have also introduced a factor of one-half for convenience.

Regularization will help to avoid uncontrollable growth of the parameters and,

therefore, find a more reliable solution.

So, how do we minimize this loss function?

The more straightforward approach is to apply a technique called gradient descent,

also known as a badge gradient method.

Defining the meaning of your loss function L,

you'll start by a random guess,

selecting some initial point.

Then, you find the direction of the steepest descent at that point,

which is opposite as the direction of the gradient of your loss function and make a step.

The step size is controlled by an additional variable called learning rate.

In the simplest case,

it is just a constant.

Repeat steps until convergence. Very simple.

Note that each iteration step or epoch,

you can update latent features of users and items separately.

In order to fully update matrices P and Q,

the algorithm has to run over the entire dataset,

and literally update each row PI and QJ of these matrices.

This task can be quite computationally demanding due to

the way the corresponding partial derivatives of the loss function are computed.

At large scale, this may lead to a slow convergence and high memory load.

The better alternative would be to use a stochastic variant of this method,

a stochastic gradient descent or SGD,

which is more computationally efficient at

the cost of a less straightforward convergence.

The idea of this method is the following.

Instead of making updates with respect to all the observations, at each step,

make a smaller update by considering

only a single interaction between the randomly selected user item pair.

Mathematically speaking, in the update formula,

you substitute the full gradient with its stochastic approximation.

In this case, the direction of the update step

will not perfectly align with the true gradient direction.

However, it is still hoped to approximately point towards the minimum.

The algorithm simply makes

more frequent but less expensive updates of the parameters at every iteration.

It is very easy to implement.

After you have initialized the factor matrices,

P and Q, at each epoch,

you sequentially compute approximate updates to the rows of

the matrices by iterating over the observations in a shuffled order.

Most notably, to update all the rows, PI and QJ,

the algorithm sweeps through the entire dataset in a single pass.

Optimization procedure terminates when either the prediction error

teases to decrease or a maximum number of epochs is reached.

The algorithm scales linearly with respect to the number of

observations and in contrast to badge gradient method.

It works with a small amount of memory.

Stochastic gradient descent can be also used in

the online settings when you need to quickly respond new users or new items.

For example, in case of a new user,

the previous algorithm can be simplified.

At each epoch, you only need to iterate over

the ratings of a new user to get an updated latent feature vector.

The complexity of such an update is

proportional to the number of ratings provided by a new user.

This is basically the gradient-based version of the folding-in.

Actually, you can also apply here a conventional folding-in approach.

Firstly, recall how the folding-in formula looks like in the SVD case,

how concise it is.

Following the same ideas as previously,

let's write the approximate definition of the row update.

This time, the matrix Q does not have

any special features like the alter gonality of columns.

And you have to take special care of the inverses in the folding-in procedure.

In fact, what you have here is simply the least squares problem,

and its solution is known.

The size of the square matrix Q transposed by Q,

is defined by the number of latent features,

which is typically small.

An overall complexity is then increased only by

the relatively small cubic term related to matrix inverse.

From the numerical perspective,

it is also a good idea to add an additional regularization within the inverse.

As the matrix Q is arbitrary,

Q transposed by Q can be close to degenerate,

and the regularization will help to improve

the numerical stability of the computation of its inverse.

Remember this formula. In the upcoming lessons,

you will see another optimization approach named alternating least squares,

which uses the same technique to drive the entire optimization procedure.

To sum up the lesson, you are now familiar with the basic matrix factorization technique.

You know what a gradient descent is and how this is computed.

You can explain the difference between

the gradient descent approach and its stochastic alternative.

Finally, you know how to implement

simple SGD algorithm and perform incremental updates for online.