Hello, and welcome to the lesson on introduction to logistic regression.

Logistic regression, despite its name,

is actually applied to classification tasks.

Basically, this technique works by performing linear regression

on continuous features to make a prediction on a discrete feature.

So many of the same things that we've discussed with

linear regression will apply to logistic regression.

The difference being that this is for classification tasks.

Specifically at the end of this lesson,

I want you to understand what the basic concepts underneath logistic regression are,

to be able to explain the benefits of using

logistic regression and why you may or may not

want to use it for a particular classification task.

And I want you to be able to apply

logistic regression by using Python and the scikit-learn library.

Now, all of the content for this lesson is

contained in the introduction to logistic regression notebook.

Traditionally, when you do regression,

you're solving for a continuous predictive value.

Logistic regression, however, is for a classification task,

where you're predicting a discrete value.

So we're going to follow along with the ideas of linear regression,

but introduce them for logistic regression.

And in this case, we're going to have to turn a prediction into

a probability of success or failure, which is bounded by the range zero to one.

To do this, we're going to have to have a transformation.

The most popular transformation uses the logit function,

though you can also use the probit function.

This whole task of using the logit function is known as logistic regression.

And that's because the inverse of the logit function,

which is what we actually use,

is called the logistic function.

Now in this notebook,

we're going to introduce logistic regression and demonstrate how it can be used for

binary prediction and we're also going to show some of

the other tasks that are important in logistic regression,

including things such as marginal effects and odds ratios.

So, first we will start with our standard setup code,

before moving into the formalism of what actually is going on with logistic regression.

Imagine we have a situation where

we can have a binary outcome that would be a success or failure.

For instance, flipping a coin:

heads is a success, tails is a failure.

We can define the odds of success as the probability of

success, P, over the probability of failure, 1 minus P. Now,

if we take the idea of linear regression,

but now I want to map it into the range zero to one,

we can do that by using the logit function.

So let's talk about the logit function.

The logit function simply takes the logarithm of

these odds. It takes this probability,

P, in the range zero to one,

and gives us a continuous value out.

If we invert it, we now can turn a continuous value,

Alpha, into that probability.
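
A minimal sketch of the logit function and its inverse, the logistic function, assuming NumPy. The function names `logit` and `logistic` and the sample probabilities are illustrative and are not taken from the notebook.

```python
import numpy as np

def logit(p):
    """Log of the odds p / (1 - p); maps (0, 1) onto the whole real line."""
    return np.log(p / (1.0 - p))

def logistic(alpha):
    """Inverse of the logit; maps any real alpha into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-alpha))

p = np.array([0.1, 0.5, 0.9])   # sample probabilities
alpha = logit(p)                # continuous values on the real line
recovered = logistic(alpha)     # back to the original probabilities
print(alpha, recovered)
```

Even odds (P equal to 0.5) map to an Alpha of zero, and applying the logistic function to the logit returns the original probability.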

There's also the probit function that we can use, short for probability unit, and that

can also be used for this kind of binary prediction, which is

known as probit regression.

So, let me show you an example of the logistic function.

Here's our Alpha value. It's continuous.

It goes from minus infinity to infinity and it maps into a space of zero to one.

This dashed line shows you a threshold.

As we move that up or down,

we determine failure below, success above.

That threshold can actually be a tunable hyperparameter.
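
A small sketch of how such a threshold turns logistic probabilities into a success or failure label, assuming NumPy. The helper `classify`, the grid of Alpha values, and the two threshold values are illustrative choices, not the notebook's.

```python
import numpy as np

alpha = np.linspace(-6, 6, 13)            # -6, -5, ..., 6
prob = 1.0 / (1.0 + np.exp(-alpha))       # logistic function

def classify(prob, threshold=0.5):
    """Label 1 (success) at or above the threshold, 0 (failure) below it."""
    return (prob >= threshold).astype(int)

default_labels = classify(prob)                 # threshold at 0.5
strict_labels = classify(prob, threshold=0.9)   # threshold moved up
print(default_labels)
print(strict_labels)
```

Moving the threshold up labels fewer points as a success, which is why it can be tuned for the task at hand.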

Now, in order to figure out the optimal model parameters,

we need to do the minimization of the cost function,

just like with linear regression.

So if we had a linear model like this,

we could actually solve it by putting in our Y,

our predicted value, and getting our logistic function out,

which is now going to map into our probability.

So we have to have a cost function.

The typical way we solve this is with gradient descent,

or specifically, stochastic gradient descent;

which introduces a randomized component to gradient descent.

Gradient descent is a simple concept,

it simply computes the derivative, or finds the slope of

the tangent line to the cost function at a particular point.

So let me show you a plot.

Here we've got an example.

The code just makes this plot.

Say this blue curve is our cost function and we want to find the minimum.

To do that, we start somewhere,

in this case X equals two,

and we compute the gradient.

That's this green dashed line.

And, it tells us that in order to reduce the cost we have to move to the left.

How far we move to the left,

we don't know. So we move a little bit.

What we do then is compute the gradient again.

And we say, "Which direction do we have to move?"

In this case, we keep moving to the left until we've reached the minimum.
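
The steps just described can be sketched in a few lines of plain Python. The cost function here is assumed to be x squared, purely for illustration (the transcript does not say what the plotted curve is), and the learning rate and step count are arbitrary choices.

```python
def cost(x):
    """Assumed cost function for illustration: a simple parabola."""
    return x ** 2

def gradient(x):
    """Derivative of the assumed cost function."""
    return 2 * x

x = 2.0               # starting point, as in the plot
learning_rate = 0.1   # how far to move on each step
path = [x]
for _ in range(100):
    # Move against the gradient: a positive slope means move left.
    x = x - learning_rate * gradient(x)
    path.append(x)

print(path[:3], x)
```

Each step moves a little to the left of the previous one, and the steps shrink as the slope flattens near the minimum.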

If this cost function was a more complex surface,

so for instance if it was multidimensional,

you might have multiple local minima, and we want to find the global minimum.

This process can be more complex.

And yet, this lies at the heart of many machine learning algorithms.

We have some sort of cost function that we need to figure out the minimum for,

to determine which is the best model for our data.

The standard thing we do is gradient descent or some variant of that technique,

but there are other algorithms as well, and you'll see that

as you become more proficient at machine learning.

So, before jumping into logistic regression,

I want to introduce logistic modeling where we can

model a data set with a logistic function.

The data we're going to use for this is from the Challenger O-ring disaster.

The idea is you have temperature

and you know whether an O-ring failed at a certain temperature.

Before I do that, I want to mention these two code cells.

These are important code cells that check for our data.

First we're going to define the file name for our data locally.

And that's the first code cell,

the second code cell says,

does that data file exist?

If it does, we do nothing.

If it doesn't, we actually use the wget command to download it from a remote archive.

So in this case, you can see the file already exists, so we didn't download it.

But if it didn't, running this notebook

would pull that data down.
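
The same check-then-download pattern can be sketched in pure Python. This is not the notebook's code: the notebook uses the wget command, while this sketch swaps in `urllib.request.urlretrieve` from the standard library, and the function name `fetch_if_missing` is hypothetical.

```python
import os
import urllib.request

def fetch_if_missing(local_path, url):
    """Download url to local_path only when the file is not already there.

    Returns True if a download happened, False if the file already existed.
    """
    if os.path.exists(local_path):
        print(f"File already exists: {local_path}")
        return False
    urllib.request.urlretrieve(url, local_path)
    print(f"Downloaded {url} to {local_path}")
    return True
```

A call would pass the local file name from the first code cell and the address of the remote archive, neither of which is given in this transcript.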

We're going to use that repeatedly in this course as we introduce new data

sets into the analysis with machine learning.

So we process this data,

we sample a few rows,

you can see there are a few features.

How many thermal distresses or failures do we have?

What's the temperature? What's the pressure?

What's the order?

We're going to focus on the temperature and whether there was a thermal distress.

So looking at this,

the first thing we can see is there's one problem:

we want this to map to either zero or one.

One means there was a distress,

zero means there wasn't.

So we have to take care of that.

The way we're going to do it is to simply change that value

to one, and we talk about this in the notebook here.

So we do that, we now can apply our modeling,

we can model this data and we get a result out.

We can then make a plot of this data.

So here are the actual measured values,

whether there was a failure or was not a failure,

and you could see at high temperatures,

there tends not to be a failure,

at low temperatures there is, and here's our model.

We can then use that to make predictions,

and in this case it shows that as the temperature gets very low,

we have a 100 percent chance of failure.

This actually is the demonstration.

The Challenger launch was at 36 degrees,

so the engineers should have expected a failure.

Of course, hindsight's perfect and so

we have to be always cognizant of that looking back.

Now we're going to introduce logistic regression.

We talk about some of the important hyperparameters,

then we introduce logistic regression.

In this case, we're using the C parameter, setting it to a large value.

The reason we do this is because we don't want any kind of regularization.

Regularization is designed to prevent overfitting, and we'll talk about it in a future module.

But I'm just going to introduce this value here so it doesn't do any regularization.

This is strict logistic regression, no regularization.
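
A hedged sketch of what this looks like in scikit-learn, where `C` is the inverse of the regularization strength, so a very large `C` effectively switches regularization off. The value `1e6`, the `max_iter` setting, and the synthetic temperature data are assumptions for illustration; they are not the notebook's values or the real O-ring data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: colder temperatures fail more often.
temperature = rng.uniform(50, 85, size=200).reshape(-1, 1)
p_fail = 1.0 / (1.0 + np.exp(0.3 * (temperature.ravel() - 65)))
failed = (rng.uniform(size=200) < p_fail).astype(int)

# Very large C: effectively no regularization (assumed value).
model = LogisticRegression(C=1e6, max_iter=1000)
model.fit(temperature, failed)

prob_cold = model.predict_proba([[40.0]])[0, 1]   # P(failure) at 40 degrees
prob_warm = model.predict_proba([[80.0]])[0, 1]   # P(failure) at 80 degrees
print(prob_cold, prob_warm)
```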

In the rest of the notebook here,

we start talking about the train-test split.

We split our data,

and we want to make sure we follow a stratified split.

That means that when we split it into training and testing,

we maintain the class proportions so that we don't

get imbalanced data sets, which would give us a model that doesn't predict very well.
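
A short sketch of a stratified split with scikit-learn's `train_test_split`, using its `stratify` argument. The toy arrays, the 25 percent test size, and the random seed are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)   # imbalanced labels: 20% positives

# stratify=y keeps the class proportions the same in both pieces.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

print(y_train.mean(), y_test.mean())
```

Both the training and testing labels keep the original 20 percent share of the positive class.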

Here's our fit to that Challenger O-ring data,

and then we can make predictions based on the temperature.

We can also make a classification report just like we did with

linear regression and we can make a confusion matrix,

just like we did with linear regression.

We could also change the data set if we want.

First, I want to talk a little bit about performance metrics.

This is effectively our confusion matrix that we saw before.

We can actually compute things such as the true negatives,

false positives, true positives, false negatives,

and then we can actually look at these to compute

different performance metrics and this table

here shows how to use those values to create those.

So we can also talk about type one errors and type two errors,

these are very important concepts.

Typically, they're used in hypothesis testing,

which we'll talk about in a future course,

but we can also calculate many of these quantities and so we do this.

Here are the precision, accuracy, recall, and F1 score,

and then we can compute the same things but using

the built-in functions from scikit-learn,

and you can see they give the same values.
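
As a sketch of that comparison, the metrics can be computed by hand from the four confusion matrix counts and checked against scikit-learn's functions. The label arrays below are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # a false positive is a type one error
recall = tp / (tp + fn)      # a false negative is a type two error
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```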

We could also do more complex fitting.

Here we can compute the coefficients of

our fit and then predict our data and then make a plot.

So this is the same data you saw before.

Here's that logistic model, and now here's

our logistic regression itself.

And notice how we go from a probability of

zero straight up to a probability of one

with our model's predictions on our test data.

I introduce the SGD classifier here simply because it can implement

logistic regression with stochastic gradient descent, which makes it easy to do.

We don't have to use that C parameter, we simply have to say,

use the log loss,

and that makes it logistic regression.

And we can compute the exact same thing with that.

Then the notebook switches over to the tips data set.

In this case, rather than making a prediction for

total bill like we did in the linear regression notebook,

we can actually try to make a prediction on one of the categorical features.

In this case, we're going to say, "Can we make a prediction on whether

somebody is a smoker based on the total bill,

the tip and the size of the party? "

And so we go through this notebook,

we make the same things we've been doing before,

we split our data into test and train.

We take our model,

we fit our model, we make predictions from our model and you can see the results.

We're going to apply other classification algorithms to

the same problem in future data sets and see how this changes.

We can get our confusion matrix,

you can see there's a big change.

This doesn't look as good as previous confusion matrices,

but that might be okay because sometimes you're more

worried about a specific kind of performance, such as minimizing

false negatives or minimizing false positives, and thus you'll

accept certain types of errors in order to get the results you want.

We could also look at other features.

In this case we're going to use categorical features in addition to total bill,

tip, and size to see if that improves the results.

And so we can go through, and you can see that including those does improve the results,

though not necessarily for this particular value here.

We'll see that in the confusion matrix,

this was 20, remember,

but now it's gone down to 13, so it's a little better,

but we might be able to do better if we did some other technique as well.

So the next thing I wanted to get into

was showing you how to do this with a formula-based approach.

We first need to get our data frame and then we can

use the statsmodels API to do logistic regression.

So here we're saying our label, whether somebody

is a smoker or not, is related to total bill,

tip, and size, and we can compute our function and get the results out.

And so you can see the parameters that multiply

these particular features, the errors on those, and the confidence intervals.
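
A minimal sketch of the formula-based fit with the statsmodels formula interface. The data frame below is synthetic and only borrows the column names mentioned here (`smoker`, `total_bill`, `tip`, `size`); it is not the tips data set, and the coefficients used to generate it are arbitrary.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "tip": rng.uniform(1, 10, n),
    "size": rng.integers(1, 7, n),
})
# Synthetic label, for illustration only: larger bills raise P(smoker = 1).
p = 1.0 / (1.0 + np.exp(-(0.08 * df["total_bill"] - 2.0)))
df["smoker"] = (rng.uniform(size=n) < p).astype(int)

# Label on the left of the tilde, features on the right.
result = smf.logit("smoker ~ total_bill + tip + size", data=df).fit(disp=False)

print(result.params)      # fitted coefficients
print(result.bse)         # standard errors
print(result.conf_int())  # confidence intervals
```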

We can look at other things like confusion matrices if we want.

Lastly, I wanted to go down and look at

the last two concepts to introduce in this notebook. The first is marginal effects,

which describe the

relationships between the different features and the prediction.

And so we can compute those very easily with

statsmodels, and that's with this particular code cell, whose output shows

the relationship between the different features and their predictive power.

And then I wanted to show the odds ratio,

which is important when you're looking at whether

a feature contributes or not in a particular way to the prediction.

And so I wanted to show this as well, and again,

we can get this output very easily with the statsmodels interface.
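
A hedged sketch of both quantities with statsmodels, again on synthetic data rather than the tips data set. Marginal effects come from the fitted result's `get_margeff` method, which by default reports effects averaged over the sample, and odds ratios are taken here as the exponential of the fitted coefficients.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "tip": rng.uniform(1, 10, n),
})
# Synthetic label, for illustration only.
p = 1.0 / (1.0 + np.exp(-(0.08 * df["total_bill"] - 2.0)))
df["smoker"] = (rng.uniform(size=n) < p).astype(int)

result = smf.logit("smoker ~ total_bill + tip", data=df).fit(disp=False)

# Change in predicted probability per unit change in each feature.
margins = result.get_margeff()
print(margins.summary())

# An odds ratio above 1 means the feature raises the odds of the label.
odds_ratios = np.exp(result.params)
print(odds_ratios)
```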

So I've gone through a lot here in this particular notebook.

It will take you some time to go through this but hopefully you'll get a good feel for

both the classification challenge in general and

the use of logistic regression for classification tasks.

If you have any questions let us know, and good luck.