In the previous video, we understood what regression analysis is

and how it helps us deal with metric data, or basically interval data.

We're now going to dig deeper into regression analysis and

consider the different kinds of things we can do with it.

So, just to recap.

Under linear regression, we try to measure the effect of your independent

variables, remember the xs, on your dependent variable, which is the y, and

then you have certain error terms, which are the es.
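
For reference, the regression equation being recapped here can be written as

y = a + b_1 x_1 + b_2 x_2 + ... + e

where a is the intercept, the bs are the coefficients on the xs, and e is the error term.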

The first idea is to estimate the unknown or

unobserved terms, a and b, which are the intercept and the coefficients.

So, what method do you use in order to estimate these a and b terms?

Usually, the simplest method is called ordinary least squares, or OLS.

So, the idea behind ordinary least squares is to find the values of a and

b for which the sum of the squared error terms is least.

So the key question is: how do you minimize the errors overall,

in order to capture the real effect of the xs on the ys?
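
As a minimal sketch of this idea, here is ordinary least squares on simulated data in Python (the variable names and the true values of a and b are made up for illustration):

```python
import numpy as np

# Simulate data from a known model: y = 2.0 + 0.5 * x + e
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)        # independent variable
e = rng.normal(0, 1, 100)          # error terms
y = 2.0 + 0.5 * x + e

# Closed-form OLS estimates for simple regression:
# b = cov(x, y) / var(x),  a = mean(y) - b * mean(x)
b_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a_hat = y.mean() - b_hat * x.mean()

print(a_hat, b_hat)  # should come out close to 2.0 and 0.5
```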

Now, once you have these a and b terms estimated,

how do you go and dig deeper into these coefficients?

So the first concept is how you assess the fit of the model.

That is, how effectively have you captured the causal effect of the xs on the ys?

So this fit depends on a few things.

The most important one is called the R squared, which is the proportion

of the variance of y that is explained by the regression,

that is, by the independent variables, the xs.
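
To make that definition concrete, here is a minimal sketch that computes R squared directly as one minus the residual sum of squares over the total sum of squares (the data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 3.0 + 1.5 * x + rng.normal(size=50)

# Fit simple OLS, then compare unexplained variation to total variation
b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a = y.mean() - b * x.mean()
y_hat = a + b * x

ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation in y
print(1 - ss_res / ss_tot)             # R squared
```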

R squared takes a value between 0 and 1, and

it never decreases, even when you add an irrelevant independent variable

to the set of variables you already have.

Of course, the more independent variables you have,

the better you expect the model to fit.

Unfortunately, there might be situations of overfitting

the data when you have some irrelevant independent variables.

It's also important to look at the predictive accuracy of the model,

and, in terms of your R squared, whether you can actually

predict better with a larger set of independent variables.
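
One common way to see this, sketched below with the statsmodels library (assuming it is available; adjusted R squared, a variant that penalizes extra variables, is not mentioned in the video), is to compare R squared with adjusted R squared when a pure-noise variable is added:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)               # irrelevant variable
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

m1 = sm.OLS(y, sm.add_constant(x1)).fit()
m2 = sm.OLS(y, sm.add_constant(np.column_stack([x1, noise]))).fit()

print(m1.rsquared, m2.rsquared)          # R squared never decreases
print(m1.rsquared_adj, m2.rsquared_adj)  # adjusted R squared can drop
```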

So how do you do predictions?

Once you know the as and bs, you can predict y for any set of values of the xs.

How do you do that?

So basically, let us again think about the regression equation: y as

a function of all the independent variables, the xs, plus the error term.

Basically, what prediction does is tell you how changing the values of one or

two of the xs affects your y.
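
As a minimal sketch, once a_hat and b_hat are estimated (the values below are illustrative, as in the OLS sketch earlier), prediction is just plugging new x values into the equation:

```python
# Estimated intercept and coefficient (illustrative values)
a_hat, b_hat = 2.0, 0.5

# Predict y for a few new values of x
new_x = [1.0, 5.0, 9.0]
predictions = [a_hat + b_hat * x for x in new_x]
print(predictions)  # [2.5, 4.5, 6.5]
```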

Why do you do predictions?

Mainly because you want to know how your model does out of sample.

That is, how well does your model predict observations that are not

in your sample, that is, outside your estimation sample?

Usually, what we do in the case of out-of-sample predictions

is use a hold-out sample, and compare the predicted values of the dependent

variable with the actual values.
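
Here is a minimal sketch of that hold-out procedure on simulated data (the split sizes and the error metric are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 150)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 150)

train, hold = slice(0, 100), slice(100, 150)  # estimation vs hold-out sample

# Estimate a and b on the estimation sample only
b = np.cov(x[train], y[train], ddof=1)[0, 1] / np.var(x[train], ddof=1)
a = y[train].mean() - b * x[train].mean()

# Compare predicted and actual values on the hold-out sample
pred = a + b * x[hold]
rmse = np.sqrt(np.mean((y[hold] - pred) ** 2))
print(rmse)
```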

You can also use prediction for what-if analysis.

So, for example,

what will sales, the y, be when we set the price, an x, to a certain level?
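
For instance, a what-if question might look like the following sketch (the fitted coefficients for a sales-on-price regression are made up here):

```python
# Hypothetical fitted model: sales = a_hat + b_hat * price
a_hat, b_hat = 100.0, -2.5

price = 12.0                  # the "what if" price level
print(a_hat + b_hat * price)  # predicted sales: 70.0
```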

The third use of prediction analysis is to understand optimization.

That is, for example, what price or price range of the xs

is expected to give you a certain level of sales or a certain level of profits?
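
A minimal sketch of that kind of optimization, assuming the same hypothetical demand model and a made-up unit cost, is to search a grid of prices for the one that maximizes predicted profit:

```python
import numpy as np

# Hypothetical fitted demand model and cost (illustrative values)
a_hat, b_hat, unit_cost = 100.0, -2.5, 4.0

prices = np.linspace(5, 35, 301)       # candidate price grid
sales = a_hat + b_hat * prices         # predicted sales at each price
profit = (prices - unit_cost) * sales  # predicted profit at each price

print(prices[np.argmax(profit)])       # profit-maximizing price (22.0)
```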

Another important use of regression is hypothesis testing.

So, remember, for categorical data, we already talked about how to do hypothesis testing.

Now, for metric data,

we want to know what to do in order to test your hypotheses.

So again, let's look back at the regression equation,

that is, y as a function of your xs, along with your a and

b, which are your intercept and your coefficients.

Now, hypothesis testing is going to measure whether

there's a statistically significant effect of an independent variable,

which is your x, on your dependent variable, which is your y.

For example, let's think about x as price and y as your sales.

In this case,

your null hypothesis is again going to be one where you do not expect any effect.

That is, your coefficient, say your b, is equal to 0.

You usually use a t statistic, that is, a statistic from a t distribution

with certain degrees of freedom,

in order to carry out this hypothesis test.

So basically, your degrees of freedom measure the number of observations

you have relative to the number of independent terms you estimate.

You also have to measure the standard error of this coefficient.
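
As a reference, these pieces fit together as follows: the t statistic is the estimated coefficient divided by its standard error,

t = b_hat / SE(b_hat)

with n - k - 1 degrees of freedom for n observations, k independent variables, and an intercept.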

Finally, as we talked about earlier for categorical variables,

we have to evaluate the p value based on this statistic.

If the p value is really small, that is, less than your level of significance,

remember the alpha we discussed, then you reject the null, and so

you conclude that there is evidence of a causal effect of the x variable on your y,

and that the effect is significant.
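
Here is a minimal sketch of that whole testing procedure with the statsmodels library (assuming it is available; the price-and-sales data are simulated for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
price = rng.uniform(5, 15, 100)
sales = 100.0 - 2.5 * price + rng.normal(0, 3, 100)

# Fit sales on price (with an intercept)
model = sm.OLS(sales, sm.add_constant(price)).fit()
print(model.tvalues)  # t statistic for each coefficient
print(model.pvalues)  # corresponding p values

alpha = 0.05          # the chosen level of significance
if model.pvalues[1] < alpha:
    print("Reject the null: price has a significant effect on sales")
```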

There are certain issues, however, with linear regression

that you have to be careful about when running this kind of model.

First one is multicollinearity.

What is multicollinearity?

When independent variables are strongly correlated with one another, then including

all these independent variables together can lead to unstable and unreliable estimates.

So, that's why you have to be careful that your independent variables

are not multicollinear.

So, in this case, you may have to drop independent variables

that are very strongly correlated with one another, as in the sketch below.
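
One common way to check for this, sketched below with the statsmodels library (assuming it is available; x1 and x2 are simulated to be nearly identical), is to compute variance inflation factors:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(0, 0.1, 100)  # nearly a copy of x1
X = sm.add_constant(np.column_stack([x1, x2]))

# VIFs well above ~10 are a common red flag for multicollinearity
for i in range(1, X.shape[1]):
    print(variance_inflation_factor(X, i))
```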