0:15

In this session, you're going to look at regression analysis, which is a common technique used in many different contexts.

But here we're going to talk about it from the point of view of cause-and-effect analysis when you're looking at process improvement, and we're going to look at it from the point of view of an example later on in this session.

But first what is regression all about?

The idea of regression is to generate an equation

that describes a relationship between a y and many xs.

And you can have simple linear regression, which has one x, and you can have multiple linear regression, which has multiple x's: multiple independent variables having an effect on a dependent variable.

Regression is mainly used when you have continuous independent variables and continuous dependent variables. It can be used for other types of data as well. So it can be used for a discrete kind of dependent variable, and it can be used for discrete kinds of independent variables, but those involve different techniques.

Based on what we'll be learning in this session, you could use it to work with discrete independent variables: you can get regression to treat discrete independent variables as part of the regression equation. We're not going to look at that in this particular session, but there are ways in which you can do that as well.

So just know that it's not restricted to the kind of example that we're going to

look at in this case.

So let's take a look at what we

would see when we are interpreting the results of a regression model.

So there are four main things that you would look at when

you get the results from regression.

So when you do get the output from regression,

first thing you would look at is, is the model significant?

Is the p-value for the model significant?

And if you remember, and we'll emphasize it some more later on, but

when we say p-value we're saying is the significance value

of the overall regression less than the alpha value?

If p is less than alpha, reject the null hypothesis, and we'll look at the p-value for the F-statistic.

What you will see later on,

is that regression also has within its results an ANOVA table.

It's an ANOVA table that describes its results, and

that's where you'll be looking at the p-value for the f-statistic.

Very similar to what you might remember from analysis of variance.

Very similar to that ANOVA table, because this will also be an ANOVA table.

So that's the first thing you look at,

overall model significance based on F-statistic and p-value.

That will tell you whether the model is statistically significant or not.

Then you move on to the R-squared, and that's talking about something we can

think about more from a practical perspective and

we'll see what the R-squared means.

That would be the next thing, and that's what it's also called: goodness of fit. And the next thing we would look at is the independent variable coefficients.

So if you have multiple independent variables, although the overall model may be significant, what you might find is that the individual independent variable coefficients are not always significant.

There might be multiple independent variable coefficients, and you wanna see whether each one of them is significant. For that, we also have the same rule of p-values: looking at p-values and comparing them to the alpha value that you have.

Just to talk about that briefly, it uses a t distribution.

Overall model significance uses an f distribution, the intercepts and

the coefficients are looked at based on a t distribution.

And finally, you want to look at the t-statistic for each of the coefficients and its p-value, and you wanna see how significant they are.

So going back to the same idea of whether each of

the coefficients is having a significant impact on the dependent variable.

4:20

So what is a p-value? Just to revisit, just to rehash, what you may have already seen earlier.

If we get a p-value that's less than alpha, we reject the null hypothesis.

So kind of a cheesy statement here: if the p-value is low, then the null must go.

It's a good way to remember this whole idea of how we use p-values.

If the p-value is low the null hypothesis must go.

You reject the null hypothesis.

How is the p-value computed?

We use the f-distribution for the overall regression equation.

Sometimes it's computed using the z-distribution,

sometimes it's computed using the t-observed or the t-distribution, so

many different ways of computing the p-value.

It comes back to the same rule though.
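That decision rule can be written out as a tiny sketch; the 0.05 default for alpha is just the conventional choice, not something fixed by the rule itself:

```python
# "If the p-value is low, the null must go."
def reject_null(p_value, alpha=0.05):
    # Reject the null hypothesis when p < alpha.
    return p_value < alpha

print(reject_null(0.001))  # → True  (significant: reject the null)
print(reject_null(0.32))   # → False (retain the null)
```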

What is R squared?

When we're talking about regression we refer to something called the R squared.

It's a very practical way of thinking about how good the regression equation is in a practical sense.

So you come up with regression results, you come up with an equation,

you come up with a statistically significant model.

The next thing you look at is, is the R squared high?

What do we mean by high?

R squared values go from 0 to 1, and what they're basically telling you is what

percentage of the dependent variable can be explained by the independent variable?

What percentage of the variation in the dependent variable can be explained by

the independent variables that you have in the equation?

So that's what we're looking at.

What percentage of the variation in Y can be explained by

the variation in all the different Xs.

So you have the total explained variation, and you divide that by the total variation; that gives you your R squared.

So it's a ratio of your total variation that's explained divided by the total

variation, and that's why it goes between 0 and 1, the highest value being 1.

More formally stated, it is the proportion of variation in the dependent variable that can be explained by the independent variables.

So that's your r squared.
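As a sketch of that ratio in code, using made-up observed and fitted values; the 1 - SS_residual / SS_total form used here is equivalent to explained-over-total for a least-squares fit that includes an intercept:

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_residual / SS_total."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((v - mean_y) ** 2 for v in y)          # total variation
    ss_res = sum((v - p) ** 2 for v, p in zip(y, y_hat))  # unexplained
    return 1 - ss_res / ss_tot

# Hypothetical observed values and fitted (predicted) values:
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.2, 1.9, 3.1, 3.9, 5.0]
print(round(r_squared(y, y_hat), 3))  # → 0.993
```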

There's something else you'll see in the output for any multiple linear regression model. In fact, you'll see it even for a simple linear regression model, but it's not as important, not as significant there. When you have multiple independent variables that you're throwing into an equation, the R squared adjusted value is something that you want to focus on.

Why do you wanna focus on that?

Because for that, you need to know what will happen when you add more variables, more independent variables, to a regression equation.

When you add more x variables to a y equals function of x equation,

the R squared value, the proportion of variation explained by that equation,

is always, always, always going to go up, if ever so slightly.

So what are the implications of this?

Even if you throw in nonsense independent variables into an equation,

the R squared value will always go up even if it is ever so slightly.

So what is that telling us?

Well, that's telling us that if we simply keep on putting in independent variables, even if they don't have any relationship, we can get a higher R squared value.

Well, the R squared adjusted gives you a truer sense of whether adding that variable is giving you significant bang for the buck, if we can call it that.

So the R squared adjusted value, adjusts for

the number of predictors that you have in that model.

And what you'll find from any kind of output that you see from a regression,

you'll find the adjusted R squared value is always adjusted down from the R

squared value.

So in a practical sense, what it's saying is that it's accounting for the number of predictors that you have, so it's adjusting down from the raw R squared value that you have.

How is this helpful?

It helps you make a decision about whether you should be adding more x variables into this equation. Other than looking at the significance of those x variables, you also want to look at whether the R squared adjusted value is going up, because the R squared adjusted value can actually go down when you add more independent variables into the equation.

What is useful is to keep an eye on the R squared adjusted, compare it to the R

squared, and then also keep an eye when you're going from one model to the other,

whether the R squared adjusted is going down when you're adding more variables.
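The standard adjustment formula makes this concrete. In this sketch, the R squared values and the sample size n = 36 are hypothetical numbers, chosen to show adjusted R squared going down when a third predictor adds almost nothing:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n = sample size and k = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Adding a 3rd predictor nudges R^2 up from 0.870 to 0.871,
# but the adjusted value goes DOWN, flagging a poor trade-off:
print(round(adjusted_r2(0.870, 36, 2), 4))  # → 0.8621
print(round(adjusted_r2(0.871, 36, 3), 4))  # → 0.8589
```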

All right. So

we've talked about overall model significance.

We've talked about the R squared value and then the next thing we want to look at

is the individual coefficients in the equation, and

we call these the betas, the b-coefficients, so you might have b1, b2,

b3 and what are we talking about there?

We're talking about the null hypothesis for each one of these coefficients, asking: is this independent variable significantly explaining the dependent variable? For each one, the null hypothesis says that b1 is 0; in mathematical terms, we're saying the slope is 0, so it has no impact on the y value that we had.

And the alternative hypothesis is that it's significantly different from zero, so that including it in the model would capture a statistically significant effect.

So how would you test the statistical significance of the coefficients, the beta coefficients? You would test it, or rather regression tests it, based on the t distribution.

And the t distribution here is based on degrees of freedom.

And what are the degrees of freedom that we're talking about here?

You may remember degrees of freedom.

We talked about them when we were talking about ANOVA.

Here we're talking about degrees of freedom for the t-distribution being n - k - 1: n being the sample size, as always in statistics; k being the number of independent variables in this case; and we subtract one more for the intercept.

So n - k - 1 degrees of freedom. Again, you don't necessarily have to know these technicalities when you're interpreting results from software, but if you wanted to know the critical value for being able to reject or retain the null hypothesis for a particular coefficient, you could get it based on your alpha value as well as your degrees of freedom, by going to the t table or getting it from Excel.
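For instance, using the numbers from the spreadsheet example coming later in this session (36 months of data, two independent variables), the degrees of freedom for each coefficient's t-test would be:

```python
# Degrees of freedom for the coefficient t-tests: n - k - 1.
n = 36  # sample size (months of data)
k = 2   # number of independent variables
df = n - k - 1
print(df)  # → 33
```

The critical t value for a given alpha and df would then come from a t table or from Excel.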

All right, next let's take a look at what you get

from a regression equation in terms of an equation.

So what you get is a regression equation which gives you the direction and the size; those are the two things it gives you. It gives you a slope, and the slope would be plus or minus, and there would be a size associated with that slope: a b value associated with that coefficient. So it gives you the size.

And then as we saw on the previous slide,

you also get a statistical significance value of it.

So you get a p value associated with each coefficient, which will tell you

whether that coefficient is helpful in terms of interpreting what you're getting,

in terms of predicting what you're getting as the y.

So is each of x1, x2, and x3 a significant predictor of y?

The sign of each coefficient is going to be important, and the way you would interpret the coefficient is that it represents the mean change in the response for one unit of change in the predictor.

The mean change in y for one unit of change in that x value.

That's how you would interpret this.

That's what would be the practical


interpretation of that particular independent variable.

And then, finally, you will have an equation that you can use.

So you'll get an equation that says y equals a particular intercept,

so just to make up an example here.

It could be y = 4 + 0.2a + 0.3b + 0.6c.

What is it telling you?

Now you have an equation.

So if you had the values for a, b, and c, you can plug them in and

you can get a value for the y.
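For instance, with made-up inputs a = b = c = 10 plugged into that made-up equation:

```python
# Plugging values into the made-up example equation
# y = 4 + 0.2a + 0.3b + 0.6c.
def predict_y(a, b, c):
    return 4 + 0.2 * a + 0.3 * b + 0.6 * c

# a = b = c = 10 are arbitrary inputs, just to show the idea:
print(round(predict_y(10, 10, 10), 2))  # 4 + 2 + 3 + 6 → 15.0
```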

So let's take an example in order to use this idea of interpreting an equation.

So let's say that you have a potato chip company that is analyzing

factors that affect the percentage of broken potato chips in their bags.

And what are they looking at in terms of variables? The dependent variable is broken potato chips.

The independent variable is the percentage of potato relative to other ingredients.

That's one independent variable.

And the other independent variable is the cooking temperature.

So, does cooking temperature and the percentage of

potato in the potato chips affect the number of broken potato chips in the bag?

The percentage of broken potato chips in a bag.

That's what we're looking at in this particular example.

And let's say you did the analysis based on some data and

you don't have the data right here, but you do have the results of the analysis.

So what you have right here are the results of the analysis. Let's say the equation was significant. Again, you don't have the F value, you don't have the ANOVA table, but let's say that was significant and you've gone past that.

You said, well, the first thing we looked at was the F-statistic for the whole equation.

The model is significant.

We looked at the R squared.

The R squared is 67.2%.

That's telling you that 67.2% of

the variation in broken chips is explained by these two variables.

Are each of these two variables significant?

So let's take a look at that.

We see the p-values for each of these variables. Percentage potato: the p-value is 0.001. Cooking temperature: the p-value is 0.02.

So what is this telling us?

Let's say that we were using an alpha value of 0.05.

These two P values are less than 0.05.

So each of the coefficients has a p-value that's less than .05: it's .001 and .02.

So we can say that each of these two significantly affects broken potato chips.

The intercept or the constant in this case,

what is shown as a constant in this particular table has a P value of .32,

and we don't really interpret the constant or

the intercept in the case of a regression equation, so we leave it as it is.

We are going to use that as part of the equation.

It's going to be a part of the equation when we want to do any kind of calculation of broken potato chips; however, we do not worry about its p-value being greater than .05.

So keeping that in mind, what are we seeing here?

We're seeing a regression equation.

We can come up with a regression equation that says: percentage of broken potato chips in a packet = 4.231 - 0.044 times percentage potato + 0.023 times cooking temperature, and that's what we have in terms of our equation.

That's the equation you have over here.

Now, what I'd like you to do is take this equation as an exercise.

And first of all, think about how you would interpret it,

which we've already done on the previous slide.

So, that should be easy to think about in terms of this specific example: how does potato content affect broken chips? How does cooking temperature affect broken chips?

And second, what I'd like you to do is for a particular setting of 50% potato and

a cooking temperature of 175 degrees Celsius,

come up with an expected value of broken potato chips.

So, calculate based on this equation, the expected value of broken potato chips.

So, go ahead and do the calculations for it and then we'll come back and

take a look at the solution.

16:48

So what you would have seen here, in terms of applying everything that we learned about the equation to this particular example, is you would have said, well, we can interpret the equation as: for each 1% increase in the amount of potato, the percentage of broken chips is going to decrease, because we had a negative sign for that particular slope. It's going to decrease by 0.044%.

And on the other hand, with an increase in temperature for

each 1 degree Celsius increase in cooking temperature,

the percentage of broken potato chips is expected to increase by 0.023%.

Now, you have to be cautious about the fact that you have to treat this as an equation that you can use only within the range of the data that you collected in order to come up with it.

So, what I'm saying here is that you cannot go beyond the range of the data on the basis of which you came up with this equation.

Because, how do you come up with this equation in the first place?

Or how did somebody come up with this equation in the first place?

They modeled what they found based on data that they collected from the process.

So, what I'm saying is, suppose the temperature range in the data was going from 100 to 300 degrees only.

You can't use this equation to try and predict something that would happen at

any temperature less than 100, or any temperature greater than 300.

You don't want to go beyond the range.

Similarly for amount of potato,

you don't wanna go beyond the range where you have actually collected the data.

So, finally, let's complete the question that we had and see how you would answer it.

The last thing is, we were asked to predict the percentage of broken chips

based on settings of 50% potato and a cooking temperature of 175 degrees.

So, if you go through the calculation, you get an expected value of 6.056% broken chips. So, that's your interpretation, in terms of what you find here.
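The calculation can be sketched as follows; the equation is the one from the slide, and 50% potato and 175 degrees Celsius are the settings given in the exercise:

```python
# The broken-chips equation from the slide:
# broken% = 4.231 - 0.044 * (percentage potato) + 0.023 * (cooking temp C)
def broken_pct(potato_pct, temp_c):
    return 4.231 - 0.044 * potato_pct + 0.023 * temp_c

# Exercise settings: 50% potato, 175 degrees Celsius.
print(round(broken_pct(50, 175), 3))  # 4.231 - 2.2 + 4.025 → 6.056
```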

All right.

Now, let's take an example here that we are going to solve using Excel.

So, let's take an example of data that you have in the form of an Excel spreadsheet.

But you're also going to see the data on the slides first, and use this to

practice the use of multiple linear regression to do some kind of analysis.

So, this is a company that manufactures various types of sparkling lights.

The manager is interested in getting a better understanding of overhead costs.

So, we have the data given to us.

The data that she has tracked is on total overhead costs for the past 36 months.

So, we have 36 months worth of data.

This is going to be the y variable.

The y variable is going to be their total overhead costs.

To help explain these, she has also collected data on two variables that,

obviously, she believes are related to the amount of work done at the factory or

the overhead cost.

So, these variables are machine hours,

the number of machine hours used during the month.

And production runs, the number of separate production runs during the month.

So, these two variables represent how much work is being done in the factory.

And what she believes is, these should have an impact or these

do have an impact on the overhead costs that are being incurred at that factory.

All right, so we have an explanation of what each of these measures are.

So, let's go ahead and take this data and pull that into Excel.

And use it to come up with an analysis of what we find in terms of first,

is there a relationship between these three things,

between this one y variable and the two x variables?

The two x variables being machine hours and production runs.

And second, we want to see what is that relationship if there's a relationship.

So, let's move to Excel and do these calculations, do this analysis first.

20:58

So, here you see the data for the fireworks company problem.

You have the Excel spreadsheet that should be available to you.

And the data has three variables, three columns of variables.

Overhead costs is going to be our dependent variable and

machine hours and production runs are going to be our independent variables.

So, let's go ahead and do the analysis here.

We go to the Data tab. We get to the add-in, which is Data Analysis.

And there, Excel has it alphabetically.

So, let's go ahead and find regression, and here's regression.

We hit OK.

And then it's asking for the input range.

So, our y variable in this case is the overhead. So, we simply go to overhead and we highlight the whole column there. And then we go to the x range. Here, depending on how many variables you have, you're gonna have multiple columns.

In this case we are going to have two columns, because we have two x variables.

Now, this also should give you an indication of how you would have to set up your Excel spreadsheet so that you can use it for regression.

So, you want your x variables to be in consecutive columns,

right next to each other.

It doesn't give you the ability to skip and go to other columns.

So, you need to put all the x variables in consecutive columns, which we already have here. But in case you don't, you'll need to rearrange them.

You also need to pay attention to checking Labels; the first row has the labels, so we wanna make sure that Excel knows that.

And you don't need to change the confidence level, it's set up at 95%.

We're simply dealing with p values as the way to interpret the results,

although confidence intervals would give you exactly the same results.

So, a 95% confidence interval is simply looking at an alpha value of 5%.

But we don't need to worry about that at this point.

For the output, we are still going to ask for it to be in a new worksheet ply.

And we're not gonna worry about the residuals and things like that.

Although, these would be things that you could check for, whether assumptions of regression are being violated and things like that.

We're not getting into those kinds of advanced information in this course.

So, let's just leave that unchecked and we hit OK, and Excel should give us results.

So, here you have the results for the regression with the regression statistics,

the ANOVA table and the coefficients.
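We can't rerun the spreadsheet here, but the kind of least-squares fit Excel performs can be sketched in pure Python via the normal equations. The tiny dataset below is a hypothetical stand-in, generated from a known equation so we can check that the fit recovers its coefficients:

```python
# Pure-Python multiple linear regression via the normal equations
# (X^T X) b = X^T y, solved with Gaussian elimination.
def fit_ols(rows, y):
    """Fit y on the predictor rows; an intercept column of 1s
    is prepended automatically. Returns [b0, b1, b2, ...]."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Build the normal-equation system A b = c.
    A = [[sum(X[i][j] * X[i][k] for i in range(len(X))) for k in range(p)]
         for j in range(p)]
    c = [sum(X[i][j] * y[i] for i in range(len(X))) for j in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            c[r] -= f * c[col]
    # Back substitution.
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (c[r] - sum(A[r][k] * b[k] for k in range(r + 1, p))) / A[r][r]
    return b

# Hypothetical data generated from y = 5 + 2*x1 + 3*x2 (no noise),
# so the fit should recover those coefficients almost exactly.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [5 + 2 * a + 3 * b for a, b in zip(x1, x2)]
coefs = fit_ols(list(zip(x1, x2)), y)
print([round(v, 6) for v in coefs])  # → [5.0, 2.0, 3.0]
```

Excel's Data Analysis add-in does this same fitting, plus the ANOVA table and p-values, for you.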

23:42

So here's the data that you had and

you used this to do the analysis of y as a function of these two x's.

And here are the results that we got.

So, the results that we got here are broken up into different slides.

You have the first slide, which is giving you the model significance and

some of the information about the overall model.

So what do we see here?

We see that the F statistic in the ANOVA table is 107.03,

which is a very high F value.

And we also see that the significance value is really, really small.

So it's 3.75 times 10 to the power of -15, so many, many zeros until you get to a number here.

So the model is significant.

Next, we move on to see the R squared.

The R squared that we're getting over here is 0.87.

So you look at the R squared value of 0.87, and that's telling you 87% of the variation in the overhead cost is being explained by the two x variables that we had, which are the production runs and the machine hours.

So those are the two x variables that we had, and we looked at their effect on overhead costs, so 87% of the variation is being explained. What you can also see from here is that the adjusted R squared is only slightly lower in terms of being adjusted for the number of variables: it's at 0.86. So that's an indication that we might find that both of these independent variables might be significant in affecting the y variable.

26:03

So here, you have the equation, the intercept, and the p-values for the intercept and for the two coefficients. You have the intercept, which has a p-value of 0.55; we said earlier that we're going to use the intercept, but we're not going to interpret its p-value, given that it's not significant.

We're going to interpret the other two p-values, which are telling us that both of these x variables, the machine hours and the production runs, are significantly affecting the overhead costs, and we can get an equation from this based on what we see over here.

So the equation that we can get is that overhead costs are going to

be equal to 3996.68, plus 43.54 times machine hours,

plus 883.62 times production runs.

That's the equation that we can come up with based on this model.
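To use that equation for prediction, here's a sketch; the 1,500 machine hours and 30 production runs are hypothetical inputs chosen for illustration, not values from the data:

```python
# Overhead-cost equation from the regression output:
# overhead = 3996.68 + 43.54 * machine_hours + 883.62 * production_runs
def predict_overhead(machine_hours, production_runs):
    return 3996.68 + 43.54 * machine_hours + 883.62 * production_runs

# Hypothetical month: 1,500 machine hours, 30 production runs.
print(round(predict_overhead(1500, 30), 2))  # → 95815.28
```

As with the potato chip example, predictions should stay within the range of the data used to fit the model.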

So here you see how you would interpret the result that you would get from

the data that we used in order to do this calculation.

So finally, in closing, we looked at a very simple type of regression model: the most basic type, the most common type of regression model, which we can use in most situations.

So what are the other kinds that are out there?

So as we were talking about earlier, we can take discrete variables and throw them in as independent variables; we would have to do some kind of coding of those discrete variables before we throw them in.

We could also do some kind of interaction analysis, similar to what you would do in a two-way analysis of variance. You could do an interaction analysis based on multiplicative terms.

You can model situations where the effect of x on y depends on the values of a different x: the effect of x1 on y depends on the values of x2.

You can capture that in the analysis by creating a multiplicative interaction term and adding it into the regression equation.

For nonlinear effects, you can also add squared terms and cubed terms; that is a quick way of looking for nonlinear effects.

If the square term is significant,

if the cubed term is significant, what does that mean?

You simply take an X value and you square it.

So in our case, if we wanted to look at nonlinear effects of the number of production runs, we would have created another column that says production runs squared, and we would have added that in if we had a hypothesis that there's going to be a nonlinear effect of production runs on overhead costs.
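Building those extra columns is mechanical; in this sketch the machine-hours and production-runs values are hypothetical stand-ins for the fireworks data:

```python
# Augmenting the predictor table with an interaction term (x1 * x2)
# and a squared term (x2 ** 2) before refitting the regression.
machine_hours = [1400, 1500, 1600]
production_runs = [25, 30, 35]

rows = [
    (mh, pr, mh * pr, pr ** 2)  # x1, x2, interaction, squared
    for mh, pr in zip(machine_hours, production_runs)
]
print(rows[0])  # → (1400, 25, 35000, 625)
```

In Excel, this just means adding the computed columns next to the existing x columns before running the regression again.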

There are other regression models that you can use out there in terms of