So, this week we're finally going to apply

all this multivariate calculus to help us fit functions to data. This is really handy.

It allows us to make sense of our data,

to start doing statistical testing,

and to start to really apply all of this stuff we've been doing into the real world.

If you have some big bundle of data,

the first thing you need to do is to clean it up.

And there are other courses on this.

But, the first thing you need to do is clean it up to get it into

shape to start doing the maths with.

That means figuring out what the sensible thing to do is with

things like partially-empty data entries,

zeros, those sorts of things,

to figure out if you could do any dimensionality reduction

by binning or grouping the data together,

eliminating duplicates, garbage data,

and all those sorts of things.

Often, the easiest thing to do is to tag all of your data so that

you can reorder them and put them back in when you've changed the order,

then sort them, and look for funny values,

and then figure out what to do with those,

to get rid of them altogether or to replace them with something sensible.

And then, once you've cleaned it up,

you can start graphing it,

take averages, look at standard deviations,

and all those sorts of things.

Of course in doing this, it really helps if you're interested in the data.

You need to treat it kind of like a person that you want to

figure out and understand until it becomes an old friend.

Over time, as you collect more data,

you want to get to the point where,

if it starts to change, you'll actually notice those changes.

So, you really want to get intimate and friendly with your data.

Once you've graphed it in a sensible way,

you'll often get a plot like this guy here,

which is a simple x-y plot of some data.

This data seems to plot like a straight line.

If you know something physically about the processes involved in generating the data,

or if you have some hypothesis as to how the variables are related,

then you can try and fit that model to the data.

Alternatively, you can just try fitting something sensible based on how it looks like

a straight line to this data here, for example.

Now, I could model my straight line y here

as a function of the observations x_i and a vector a of the fitting parameters.

In the case of a straight line, y = mx + c,

then the parameters in the vector a would be

the gradient m and the intercept c of the straight line.

So here, I've plotted the optimal straight line for this data.

It happens to have a gradient m of 215 gigapascals (GPa),

and an intercept c of 0.3 GPa. I could also find the mean of x,

x-bar and the mean of y, y-bar,

which are the geometric center of mass of that dataset.

Now, in order to find the optimal value of m and c,

let's first define a residual r,

which we define as the difference between

the data items y_i and their predicted locations on the line,

which would be mx_i + c. So,

r_i is y_i - mx_i - c. Then,

I can take a measure of the overall quality of

the fit being a quantity I'll call chi-squared,

which is the sum of the squares of the residuals r.

I do this so that I penalise both data that are above and data that are below the line.

I don't want the pluses and minuses to net off against each other.

Also, I really want to heavily

penalize data items that are a long way away from the line.

And then, I'm going to try and find the best chi-squared possible, the one that's lowest.

I'm doing a minimisation.
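As a sketch of those two definitions (plain Python, with made-up data points for illustration), the residuals and chi-squared for a candidate m and c might look like this:

```python
# Sketch: residuals and chi-squared for a candidate straight-line fit.
# The data values here are made up for illustration.

def chi_squared(m, c, xs, ys):
    """Sum of squared residuals r_i = y_i - (m*x_i + c)."""
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.1, 1.9, 3.2]   # roughly y = x

# A line close to the data gives a small chi-squared...
close = chi_squared(1.0, 0.0, xs, ys)
# ...and a line far from the data gives a much larger one.
poor = chi_squared(3.0, -1.0, xs, ys)
```

Because the residuals are squared, points above and below the line both add to chi-squared, and distant points are penalised heavily.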

So now, it's worth plotting out what chi-squared is going to

look like for lots of different possible values of m and c,

which is what I've done on the contour plot here.

In the middle, at a gradient of about 215 and an intercept near zero,

I find my minimum,

and the further I get away from those, the worse it is.

I note that the chi-squared contours are slanted;

there seems to be some kind of tradeoff.

The bigger c gets, the lower the optimum value of the gradient m, and vice versa.

And that makes sense if I look at the original plot:

if I made the line steeper on the original fit,

then in order for it to fit well,

the intercept is going to have to get smaller.

Actually, I'm pivoting about the center of mass.

And also, this trough here in

the chi-squared value is really quite shallow.

So, this is actually going to be quite a tricky problem for steepest descent algorithms.

It's going to be easy to get down the sides,

but it's going to be difficult to get along

the bottom of the valley to find the actual minimum.

But nevertheless, it looks like it's going to be quite an okay problem to solve.

It has one easy to spot minimum,

and therefore we can find it.

And note that to do this with any precision

simply by doing lots of computations like I've

done here, for different m's and c's,

plotting it all out and finding the minimum on this graph,

I'd have to do a lot of maths.

In MATLAB, this contour plot took about 200,000 computations to make.

So even for a simple problem like this,

we really do want to find an algorithm that's

gonna let us get there a bit more efficiently.
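That brute-force approach can be sketched as a grid search (the data and grid spacings below are hypothetical, just to show the scale of the computation):

```python
# Sketch: brute-force search for the minimum chi-squared over a grid
# of candidate gradients m and intercepts c. With a fine grid this is
# exactly the kind of expensive computation described above.

def chi_squared(m, c, xs, ys):
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.5, 2.4, 4.6, 6.5, 8.4]   # roughly y = 2x + 0.5

best = None
for i in range(401):                 # m from 0 to 4 in steps of 0.01
    m = i * 0.01
    for j in range(201):             # c from -1 to 1 in steps of 0.01
        c = -1.0 + j * 0.01
        chi2 = chi_squared(m, c, xs, ys)
        if best is None or chi2 < best[0]:
            best = (chi2, m, c)

chi2_min, m_best, c_best = best      # 401 * 201 = 80,601 evaluations
```

Even this toy grid needs tens of thousands of chi-squared evaluations, which is why a smarter algorithm is worth having.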

Now, the minimum is going to be found when the gradient of chi-squared is zero.

So, if we just take the gradient of chi-squared with

respect to the fitting parameters and set it to zero,

that's going to be our solution.
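Written out for the straight-line case, setting the gradient of chi-squared to zero gives a two-row vector equation (these are the "first row" and "second row" the derivation works through):

```latex
\nabla \chi^2 =
\begin{bmatrix}
\dfrac{\partial \chi^2}{\partial m} \\[6pt]
\dfrac{\partial \chi^2}{\partial c}
\end{bmatrix}
=
\begin{bmatrix}
-2 \sum_i x_i \,(y_i - m x_i - c) \\[6pt]
-2 \sum_i (y_i - m x_i - c)
\end{bmatrix}
= \mathbf{0}
```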

Now, the neat thing is that in this particular case,

we can actually solve this problem explicitly.

So that's what we're going to do in this video.

But then, we'll go on to see how to do it by gradient descent.

And then, we'll find a better algorithm,

and then we'll explore how to do this sort of

problem where it's not so easy to do explicitly.

So, if we differentiate the first row with respect to m,

then the first thing to worry about is all the sums over the data items i.

But actually, it turns out that we don't need to worry about these sums

because we're not differentiating the x_i or the y_i themselves.

So, they'll just sit benignly in the sum.

If we differentiate mx_1 + mx_2 + mx_3 with respect to m,

we'll just get x_1 + x_2 + x_3.

So we don't have to worry about those sums.

And then, it's easy, right?

We differentiate a square:

the power drops by 1 and we multiply by 2,

and then the differential of the inside of the bracket

with respect to m is minus x_i.

We can then take that minus 2 outside of the sum all together in fact.

For the second row,

it's easier because the differential of the inside of

the bracket with respect to c is just minus 1.

So, we just bring the two down from the power along with

the minus sign, and so it all looks quite easy.

Keeping on looking at the second row,

then the sum over c is just c times the number of data items, which we can take out of the sum altogether,

and then we've got the sum of the y_i's and the sum of m times the x_i's.

And if we divide that through by the number of data items,

we get our result that c is going to be y-bar minus m times x-bar,

y-bar and x-bar being the average.

We can carry on in that way and generate an answer for m,

which I'm just going to leave here.

I don't think there's any point in showing all the maths to you blow-by-blow.

It's a bit trickier to see.
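As a sketch, the closed-form solution (c = y-bar minus m times x-bar from the second-row derivation above, and the standard least-squares expression for m, which is the result left on screen) can be computed directly:

```python
# Sketch: the explicit least-squares solution for a straight-line fit.
# m uses the standard closed-form expression; c = y_bar - m * x_bar
# follows from setting the second row of the gradient to zero.

def fit_line(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    c = y_bar - m * x_bar
    return m, c

# Points lying exactly on y = 2x + 1 should be recovered exactly.
m, c = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```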

I'm not going to go through actually the maths and the derivation,

but you can also find estimates for the uncertainties in c and m which I've put up here,

which I'll call sigma c and sigma m. And it's very important actually when you're doing

a fit to get an idea of

the uncertainties in those fitting parameters and to quote those in your fits.

I'm going to leave those here in case you need to use them.
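The uncertainty formulas themselves are left on screen rather than spoken; the sketch below uses the standard textbook expressions for a straight-line fit, estimating the per-point variance as chi-squared divided by (n - 2), which may differ in detail from the ones shown:

```python
import math

def fit_line_with_uncertainties(xs, ys):
    """Straight-line fit plus standard textbook uncertainty estimates.

    Assumes sigma^2 = chi^2 / (n - 2) as the per-point variance
    estimate; sigma_m and sigma_c are the usual least-squares results.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    c = y_bar - m * x_bar
    chi2 = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    sigma2 = chi2 / (n - 2)                     # estimated variance per point
    sigma_m = math.sqrt(sigma2 / sxx)
    sigma_c = math.sqrt(sigma2 * (1.0 / n + x_bar ** 2 / sxx))
    return m, c, sigma_m, sigma_c
```

For data lying exactly on a line, chi-squared is zero and both uncertainties come out as zero, as you'd hope.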

So, coming back to our fitted data,

we can plot it out again here.

Now, the amazing thing here is just how

accurate this sort of fitting really is; it's really cool.

We've got quite noisy data and a gradient of 215,

but the uncertainty in the gradient is only about nine,

around five percent. It's really amazing.

Now, you should always plot your fit and visually compare just as a sanity check.

We can see why this is a good idea here.

This is Anscombe's famous quartet.

You can find the graph on Wikipedia.

All four of these data sets have the same chi-squared, means,

best-fit lines, and uncertainties,

but very different data.

In two of the cases,

fitting a straight line is probably just the wrong thing to do.

In the bottom left, if you remove the outlying data point,

both the gradient and the intercept are different.

It's only the top left where the fit is actually doing the right thing altogether.

And there is another subtlety actually,

if we go back and look at c, the intercept,

we can see that the intercept depends on the gradient m,

which is what we saw earlier when we looked at the plot of chi-squared.

Now, there's a way to recast the problem,

which is to look at deviations of x from the center of mass x-bar instead,

fitting y = m(x - x-bar) + b with a constant term b.

Then b is the location of the center of mass in y, y-bar.

And then, that constant term b

doesn't depend on the gradient anymore, and neither,

therefore, does its uncertainty include a term from the uncertainty in m. In fact,

if I plot out the contour plot for chi-squared,

when I do that I find that it isn't slanted.

It's a nice sort of circular looking thing.

So I've removed the interaction between m and the constant term.

So it's a mathematically much more reasonable, well-posed problem.
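A sketch of that recentred fit, y = m(x - x-bar) + b: the constant b decouples from the gradient and comes out as simply y-bar.

```python
# Sketch: the recentred straight-line fit y = m * (x - x_bar) + b.
# Minimising chi-squared in b gives b = y_bar exactly, independently
# of the gradient m, which removes the m-c interaction seen earlier.

def fit_line_recentred(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    m = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
        / sum((x - x_bar) ** 2 for x in xs)
    b = y_bar            # the constant term is just the centre of mass in y
    return m, b

m, b = fit_line_recentred([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 6.0, 7.0])
```

This is the algebraic reason the recentred chi-squared contours come out circular rather than slanted.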

So, that's the essence of regression of how to fit a line to some data.

And this is a really useful life skill,

whatever your professional job.

What we'll do in the next couple of videos is look at

how to do this in more complicated cases

with more complicated functions and how to extend the idea of regression to those cases.

The main thing that we've defined here, and that it's important to

remember, is this goodness-of-fit estimator chi-squared:

the sum of the squares of the deviations of the fit from the data.

And chi-squared is going to be really useful to us going forward.