0:00

[MUSIC]

In this video, we'll see different tricks that are very useful for training Gaussian processes.

The first one is what you should do when you have noisy observations.

As you remember, the mean goes exactly through the data points, and the variance (the diagonal of the covariance) is zero at the data points.

If you fit a Gaussian process to a noisy signal like this, you get this quickly changing function.

But actually you can see that there is something like a parabola here, plus some noise component.

So let's modify our model so that it has some notion of the noise in the data.

The simplest way to do this is to add independent Gaussian noise to all random variables.

So we define a new random variable f hat: it equals the original random process f(x) plus some new independent Gaussian noise.

This means that we sample the noise independently at each point x of our space R^d.

We'll say that the mean of the noise is 0 and its variance is s squared.

In this case, the mean of the new process is still 0: it is the sum of two means, the mean of f(x) and the mean of the noise, and those sum up to 0.

The covariance changes in the following way: the new covariance is the old covariance K(x_i - x_j), plus s squared times an indicator that the points x_i and x_j are the same.

This happens because there is no covariance between the noise samples at different positions.

If we fit the model using this kernel, we get the following result.

As you can see, we no longer have zero variance at the data points, and the mean function has become a bit smoother.
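As a concrete sketch of the model above (not the exact code from the video; the function names and the toy parabola data are my own illustration), here is a minimal NumPy implementation of GP regression with an RBF kernel plus the s squared noise term, which enters only on the diagonal of the training covariance:

```python
import numpy as np

def rbf_kernel(a, b, sigma2=1.0, length=1.0):
    """Squared-exponential kernel: sigma2 * exp(-(x - x')^2 / (2 * length^2))."""
    d = a[:, None] - b[None, :]
    return sigma2 * np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_train, y_train, x_test, sigma2=1.0, length=1.0, s2=0.1):
    """Posterior mean and pointwise variance of a GP with noise variance s2."""
    # The indicator term: s2 is added only where x_i == x_j, i.e. on the diagonal.
    K = rbf_kernel(x_train, x_train, sigma2, length) + s2 * np.eye(len(x_train))
    K_star = rbf_kernel(x_test, x_train, sigma2, length)
    K_ss = rbf_kernel(x_test, x_test, sigma2, length)
    mean = K_star @ np.linalg.solve(K, y_train)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return mean, np.diag(cov)

# Noisy parabola, like the signal in the video.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2 + 0.3 * np.random.default_rng(0).standard_normal(5)
mean, var = gp_predict(x, y, x, s2=0.3)
# With s2 > 0 the posterior variance at the training points is no longer zero.
```

Setting `s2=0` recovers the noiseless model, where the mean interpolates the data exactly and the variance vanishes at the training points.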

However, this still isn't the best we can do.

We can change the parameters of the kernel a bit and find the optimal values for them in this particular case.

If, for example, we set the length scale to 0.01, the covariance drops to 0 really quickly as we move away from the points, and the prediction looks like this: complete garbage.

If we take the length scale to be 10, it is too high: the prediction changes really slowly, the mean is almost 0 everywhere, and the variance stays close to the variance of the prior process.

So here we select l to be 2, somewhere in the middle, and we get a process like this.

It still has some drawbacks: as you can see, around the positions -3 and 3 the process starts to revert its prediction toward 0.

So maybe we could tune some other parameters a bit, like sigma squared or s squared, and fit the Gaussian process even better.

It turns out that we can do this automatically.

We have three parameters in all: the Gaussian kernel's sigma squared, the length scale l, and the noise variance s squared.

We're going to tune them by maximizing the likelihood: we take our data points f(x_1), f(x_2), ..., f(x_n) and maximize the probability of observing this data given the parameters.

Since everything is Gaussian for a Gaussian process, this probability is also Gaussian, with mean 0 and covariance matrix C, as we have seen in the previous video.
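Written out, this is the log density of a zero-mean multivariate Gaussian evaluated at the vector of observations; in standard notation (consistent with the noisy kernel defined earlier in the video):

```latex
\log p(\mathbf{f} \mid \sigma^2, \ell, s^2)
  = -\tfrac{1}{2}\,\mathbf{f}^{\top} C^{-1} \mathbf{f}
    \;-\; \tfrac{1}{2}\log\det C
    \;-\; \tfrac{n}{2}\log 2\pi,
\qquad
C_{ij} = K(x_i - x_j) + s^2\,[i = j].
```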

If you write down carefully what this probability is equal to, you will see that you can optimize it using simple gradient ascent.

Using this, you will be able to automatically find the optimal values for the variance sigma squared, the noise variance s squared, and the length scale parameter l.
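As an illustration of this automatic tuning (assuming scikit-learn, which maximizes exactly this log marginal likelihood via gradient-based optimization; the kernel composition and toy data below are my own choice), the three parameters map onto `ConstantKernel` (sigma squared), `RBF` (l), and `WhiteKernel` (s squared):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 30)[:, None]
y = X.ravel() ** 2 + 0.8 * rng.standard_normal(30)  # noisy parabola

# sigma^2 * RBF(l) + s^2: the three parameters from the video.
# .fit() maximizes the log marginal likelihood over all of them.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5, random_state=0)
gp.fit(X, y)

print(gp.kernel_)  # the optimized sigma^2, l, and s^2
mean, std = gp.predict(X, return_std=True)
```

After fitting, `gp.kernel_` holds the optimized hyperparameters, so no manual search over the length scale is needed.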

So if you run this procedure, you will get something like this.

We estimated l to be 2, which is actually the true value; earlier we spent some time finding it by hand, but here it was selected automatically.

We were also able to estimate that the variance of the process should be 46.4 and the variance of the noise should be 0.7.

As you can see, the prediction on the boundaries also became a bit better: the process no longer reverts its direction so quickly.

Let's see how fitting this process works for different data.

In this first case, I tried to fit the Gaussian process to pure noise, and it estimated the s squared parameter, the noise variance, to be 0.79.

It really believes that all the data I gave it is just noise.

If I fit a Gaussian process to data that I sampled without noise, it quickly recognizes this and sets the noise variance parameter to almost 0.

In this case it was about 5 times 10 to the power of -17, which is really close to 0.

If, however, the data has some signal but also some noise, it automatically finds a noise variance somewhere in between 0 and the larger values; in this case it estimated it to be 0.13.

All right, now let's see how Gaussian processes can be applied to classification problems.

Previously we saw how they can be used for regression; classification is a bit harder.

We have two possible labels, +1 and -1.

We can use a latent process f(x), which shows something like how sure we are in predicting one label or the other.

If we somehow fit the latent process f(x), we can make predictions by passing it through a sigmoid function.

So the probability of the label y given f is simply 1 over 1 + exponent of -y f, which is the sigmoid function of the product y times f.
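The likelihood above is a one-liner; a minimal sketch (the function name is my own):

```python
import math

def label_probability(y, f):
    """p(y | f) = 1 / (1 + exp(-y * f)), the sigmoid of y * f, for y in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(-y * f))

# A large positive latent value makes the label +1 very likely...
p_pos = label_probability(+1, 3.0)
# ...and the probabilities of the two labels always sum to one.
p_neg = label_probability(-1, 3.0)
```

Note that at f = 0 the model assigns probability 0.5 to each label, which is exactly the uncertain boundary behaviour discussed at the end of the video.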

To train this model, you first have to estimate the latent process: the probability of the latent process at some arbitrary points, given the labels that we already know.

So y(x_1), for example, could be +1, y(x_2) could be -1; these are just binary labels.

Once we have estimated the latent process, we can use it to compute the predictions.

We do this by marginalizing the joint probability of the labels and the latent process.

This is just a simple integral: the probability of the label given the latent process, times the probability of the latent process, integrated over all possible latent processes.

The math here is a bit complex, so I'll skip it for now; let's just see how the prediction works.
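This integral has no closed form with a sigmoid likelihood, but it is easy to approximate by Monte Carlo: sample the latent value at a test point from its Gaussian posterior and average the sigmoid. A sketch, assuming we already have the posterior mean `mu` and variance `var` of f at that point (the function name is my own):

```python
import numpy as np

def predict_label_probability(mu, var, y=+1, n_samples=10_000, seed=0):
    """Approximate p(y | data) = integral of sigmoid(y * f) * N(f | mu, var) df."""
    rng = np.random.default_rng(seed)
    f = rng.normal(mu, np.sqrt(var), size=n_samples)  # samples of the latent value
    return np.mean(1.0 / (1.0 + np.exp(-y * f)))

# Confident region: latent posterior well above zero.
p_sure = predict_label_probability(mu=3.0, var=0.1)
# Boundary region: latent posterior centered at zero with high variance.
p_unsure = predict_label_probability(mu=0.0, var=4.0)
```

The second case reproduces the behaviour described below: where the latent posterior straddles zero, the predicted probability stays near 0.5.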

The first step, as I said, is estimation of the latent process.

In this case I have the data points marked as crosses here; some have the value +1, some have the value -1.

If we fit the latent process, it looks like this.

As you can see, as we go to the area where all points have the label +1, the values of the latent process are positive, and for the negative examples the latent process is negative.

And here are our predictions: I just took the latent process and passed it through the sigmoid.

As you can see, the prediction is almost 1 in the positions where there are many positive points nearby, and the same happens for the negative examples.

At the points where the targets change from +1 to -1, the variance is high and the prediction won't be so certain.

So, for example, somewhere between -1 and -2, the value of the prediction is around 0.5: the model is almost completely unsure about the prediction.
