0:04
Let's talk about model types.
So, I want you to imagine we're a bank,
we're giving out personal loans to customers.
And the boss has come along and told us to use this machine learning thing.
Now, before we can go any further,
we need to know what our question is.
This will tell us what kind of machine learning model we need to build.
Now, the question is dictated both by the business needs,
by the questions that the business has and wants answered,
and by what kind of training data we have available.
So, I'm going to talk about four different types of models.
We have Regression, which is predicting a number. Then
two types of Classification: Binomial means there's two choices,
Multinomial means there's more than two choices.
And the final one on the list is Unsupervised learning.
1:10
So, a regression means predicting a number.
With our bank loans it might be,
we want the model to tell us how much we can safely loan a customer.
The "How much" is the business question.
What we need in the data is a single column that tells us,
perhaps, how much we have previously lent to customers.
The other columns in that data will be features,
information about the customer,
maybe their age, their income, their job,
how long they've been in the job,
if they own property, that kind of thing.
And so maybe a human expert has filled in how much we lent them previously.
With the binomial, the answer column has to be a yes-no,
true or false, black-white.
And in this case,
the data might be the same features telling us about our customer,
but did we give them a loan in the end or not?
Or alternatively, it might be,
did the loan go bad or not?
Multinomial is when it's not a simple yes-no question.
But it's also not a number that we're trying to predict.
We've already looked at a multinomial classification,
the iris data set because there's
three different types of iris we were trying to distinguish between.
For bank loans, we might use a multinomial classification if we want to decide
which loan product is the best fit for this particular customer.
Alternatively, we might want to distinguish shades of gray,
so we might have three categories: Yes,
give this person a loan.
No, definitely don't give them a loan.
Or not sure, let's bring them in for an interview.
So remember, to do any of these three supervised learning models,
your training data has to have an answer column.
It has to be a single column. And the way H2O
works is you don't tell it what type of model you want to build.
You tell it which column is your answer column, and that's referred to as Y, the letter Y.
So, if your answer column is numeric,
H2O will build a regression model.
If it's an enum column, also called a factor or a categorical column,
it will do a classification.
And if there's only two options in that column,
it will make a binomial; otherwise it's making a multinomial.
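To make that rule concrete, here is a rough sketch in plain Python. This is not H2O's code or its API; `choose_model_type` is a made-up helper that just mimics the decision described above, assuming the answer column arrives as a simple list of values.

```python
def choose_model_type(answer_column):
    """Hypothetical helper, not part of H2O: mimic the rule described above."""
    # A numeric answer column (ints or floats, but not booleans) means regression.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in answer_column
    )
    if is_numeric:
        return "regression"
    # Anything else is treated as categories (an enum / factor column).
    levels = set(answer_column)
    if len(levels) == 2:
        return "binomial"
    return "multinomial"


print(choose_model_type([5000, 12000, 7500.0]))        # regression
print(choose_model_type(["yes", "no", "yes"]))         # binomial
print(choose_model_type(["yes", "no", "interview"]))   # multinomial
```

One thing this sketch makes visible: a yes-no answer stored as 0 and 1 looks numeric, so under this rule it would be treated as a regression unless the column is first converted to a categorical one.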
But the last one was unsupervised.
Now, unsupervised is when you don't have that answer column.
We're going to take a proper look at this in week five.
But for a bank loan,
one example might be where our training data is
just a list of customers we lend money to,
or just a list of good loans.
What we can do is learn a model to try and understand those customers.
Then our new customer, we compare,
and H2O can tell us: is this customer similar, or are they different, are they anomalous?
And if they are anomalous, perhaps we don't give them the loan;
if they're similar to our current good customers, we give them a loan.
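Here is a minimal sketch of that "similar or different" idea. It is only an illustration under some loud assumptions: the customer data is invented, there are just two features, and the scoring (largest z-score against the good customers) is a stand-in chosen for simplicity, not the algorithm H2O uses.

```python
from statistics import mean, stdev


def fit_profile(customers):
    """Learn each feature's mean and standard deviation from known-good customers."""
    columns = list(zip(*customers))
    return [(mean(c), stdev(c)) for c in columns]


def anomaly_score(profile, customer):
    """Largest absolute z-score across features: how far from typical this customer is."""
    return max(abs(x - m) / s for x, (m, s) in zip(customer, profile))


# Made-up data: (age, income in thousands) for customers whose loans went well.
good_customers = [(35, 52), (42, 61), (29, 48), (51, 70), (38, 55)]
profile = fit_profile(good_customers)

typical = anomaly_score(profile, (40, 58))    # looks like our existing customers
unusual = anomaly_score(profile, (19, 250))   # very different age and income

print(typical < unusual)  # True
```

Notice there is no answer column anywhere: the model only ever sees the good customers, and a new applicant is judged by how far they sit from that group.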
Now, you might also hear binomial classification referred to as logistic regression.
For our purposes they are the same thing: predicting a yes-no or a true or false,
a one or zero.
When you're using logistic regression or binomial classification,
you have a couple more metrics available to you.
Metrics to understand how good a model you've built.
It's best understood by looking at this chart,
it's called an ROC Curve.
Along the x axis,
we have the false positive rate and on the y axis,
we have the true positive rate.
And the metric most commonly used is called AUC, area under the curve.
If we draw a curve,
a straight line maybe from the bottom left to the top right,
this is a bad model.
It's a model that learned nothing basically.
Our ideal is a curve which is pushed more towards the top left.
So, this is an AUC of about 0.8.
It might not be smooth in real life, and that's something to bear in mind about AUC: it is
squeezing two pieces of information into a single number.
Now, a single number is really nice for comparing,
but don't forget there's actually two numbers behind it.
Anyway, the ideal AUC,
the perfect model, is where it pushes right into
the top left corner and you're left with two straight lines, one up and one across.
This is an AUC of 1.0.
The diagonal line is an AUC of 0.5.
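If you want to check those numbers yourself, here is a small sketch that builds the ROC points and measures the area under them with the trapezoid rule. The labels and scores are invented, and `roc_points` and `auc` are names made up for this example, not functions from H2O.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Sweep the threshold from the highest score down to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve, adding up one trapezoid per pair of neighbouring points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [1, 1, 0, 0]             # 1 = positive class, 0 = negative class
perfect = [0.9, 0.8, 0.2, 0.1]    # every positive scored above every negative
decent = [0.9, 0.4, 0.6, 0.1]     # one negative outranks one positive
clueless = [0.5, 0.5, 0.5, 0.5]   # same score for everyone, learned nothing

print(auc(roc_points(labels, perfect)))    # 1.0
print(auc(roc_points(labels, decent)))     # 0.75
print(auc(roc_points(labels, clueless)))   # 0.5
```

The perfect scores trace the two straight lines through the top left corner, and the model that gives everyone the same score traces the diagonal.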