We looked at the tips data set and said that we could use either the tip amount or the sex of the customer as the label. In option one, we treat the tip amount as the label and want to predict it, given the other features in the data set. Let's assume you are using only one feature, the total bill amount, to predict the tip. Because tip is a continuous number, this is a regression problem. In regression problems, the goal is to use mathematical functions of different combinations of features to predict the continuous value of our label. This is shown by the line: for a given total bill amount, multiplying by the slope of the line gives a continuous value for the tip amount. Perhaps the average tip rate is 18% of the total bill. Then the slope of the line would be 0.18, and by multiplying the bill amount by 0.18, we get the predicted tip. This linear regression with only one feature generalizes to additional features. In that case, we have a multi-dimensional problem, but the concept is the same: the value of each feature for each example is multiplied by the gradient of a hyperplane, which is just the generalization of a line, to get a continuous value for the label. In regression problems, we want to minimize the error between our predicted continuous value and the label's continuous value, usually using mean squared error.

In option two, we treat sex as our label and predict the sex of the customer using the tip and total bill. Of course, as you can see from the data, this is a bad idea: the data for men and women is not really separable, and we would get a terrible model if we did this. But trying to do this helps illustrate what happens when the thing you want to predict is categorical rather than continuous. The values the sex column takes, at least in this data set, are discrete: male or female.
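The one-feature regression above can be sketched in a few lines of Python. The bill and tip values here are made up for illustration (they are not from the real tips data set), and the slope is fit with the closed-form least-squares formula for a line through the origin:

```python
# Hypothetical bills and tips (made up for illustration; not the real data set).
total_bill = [10.0, 20.0, 30.0, 40.0, 50.0]
tip = [1.8, 3.6, 5.4, 7.2, 9.0]  # exactly 18% of each bill, for clarity

# One-feature linear regression through the origin: the slope that
# minimizes sum((tip - slope * bill)^2) has a closed form.
slope = sum(b * t for b, t in zip(total_bill, tip)) / sum(b * b for b in total_bill)

def predict_tip(bill):
    """Predicted tip = slope * total bill."""
    return slope * bill

print(round(slope, 2))              # 0.18
print(round(predict_tip(25.0), 2))  # 4.5
```

With more features, the same idea becomes a dot product of the feature vector with a weight vector, which is the hyperplane generalization mentioned above.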
Because sex is categorical and we are using the sex column of the data set as our label, the problem is a classification problem. In classification problems, instead of trying to predict a continuous variable, we are trying to create a decision boundary that separates the different classes. In this case, there are two classes of sex: female and male. A linear decision boundary forms a line, or a hyperplane in higher dimensions, with each class on either side. For example, we might say that if the tip amount is greater than 0.18 times the total bill amount, then we predict that the person making the payment was male. This is shown by the red line, but it doesn't work very well for this data set: men seem to have higher variability, while women tend to tip in a narrower band. A nonlinear decision boundary, shown by the yellow ellipse in the graph, does better. How do we know the red decision boundary is bad and the yellow decision boundary is better? In classification problems, we want to minimize the error, or misclassification, between our predicted class and the label's class. This is usually done using cross-entropy. Even if we are predicting the tip amount, perhaps we don't need to know the exact tip amount. Instead, we might only want to determine whether the tip will be high, average, or low. We could define a high tip as greater than 25% of the bill, an average tip as between 15% and 25%, and a low tip as below 15%. In other words, we could discretize the tip amount, and now predicting the tip amount, or more appropriately the tip class, becomes a classification problem. In general, a raw continuous feature can be discretized into a categorical feature. Later in this specialization, we will talk about the reverse process: a categorical feature can be embedded into a continuous space. It really depends on the exact problem you're trying to solve and what works best. Machine learning is all about experimentation.
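The discretization described above is easy to sketch in Python. The 15% and 25% thresholds come from the text, while the function name and the class labels returned are just illustrative choices:

```python
def tip_class(tip, total_bill):
    """Discretize a continuous tip into a categorical label.

    Thresholds follow the example in the text:
    > 25% of the bill -> "high", 15%-25% -> "average", < 15% -> "low".
    """
    rate = tip / total_bill
    if rate > 0.25:
        return "high"
    elif rate >= 0.15:
        return "average"
    else:
        return "low"

print(tip_class(6.0, 20.0))  # 30% of the bill -> "high"
print(tip_class(3.6, 20.0))  # 18% of the bill -> "average"
print(tip_class(2.0, 20.0))  # 10% of the bill -> "low"
```

Once the label is discretized like this, the prediction target is a class, and the problem is trained and evaluated as classification rather than regression.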
Both of these problem types, regression and classification, can be thought of as prediction problems, in contrast to unsupervised problems, which are like description problems. Now, where does all this data come from? The tips data set is what we call structured data, consisting of rows and columns. A very common source of structured data for machine learning is your data warehouse. Unstructured data are things like pictures, audio, or video. Here I'm showing you a natality data set, a public data set of medical information. It is a public data set in BigQuery, and you will use it later in the specialization. But for now, assume that this data set is in your data warehouse. Let's say we want to predict the gestation weeks of the baby; in other words, we want to predict when the baby is going to be born. You can write a SQL SELECT statement in BigQuery to create an ML data set. We choose the input features of the model, things like the mother's age and the weight gain in pounds, and the label, gestation weeks. Because gestation weeks is a continuous number, this is a regression problem. Making predictions from structured data is very commonplace, and that is what we focus on in the first part of this specialization. Of course, this medical data set can be used to predict other things too. Perhaps we want to predict baby weight using the other attributes as our features. Baby weight can be an indicator of health. When a baby is predicted to have a low birth weight, the hospital will usually have equipment such as an incubator handy, so it can be important to be able to predict a baby's weight. The label here would be baby weight, a continuous variable stored as a floating point number, which would make this a regression problem. Is this data set a good candidate for linear regression and/or linear classification? The correct answer is both. Let's investigate why. Let's step back and look at the data set with both classes mixed.
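What the SQL SELECT does, choosing input features and a label out of the warehouse table, can be mimicked with a small pure-Python sketch. The column names and values below are stand-ins for illustration, not the actual BigQuery natality schema:

```python
# A tiny stand-in for the natality table (columns and values are
# illustrative; not the real BigQuery schema).
natality = [
    {"mother_age": 28, "weight_gain_pounds": 30.0, "gestation_weeks": 39.0},
    {"mother_age": 34, "weight_gain_pounds": 25.0, "gestation_weeks": 38.0},
    {"mother_age": 22, "weight_gain_pounds": 40.0, "gestation_weeks": 40.0},
]

# The SQL SELECT in the text boils down to picking input features and a label,
# roughly: SELECT mother_age, weight_gain_pounds, gestation_weeks FROM natality
FEATURES = ["mother_age", "weight_gain_pounds"]
LABEL = "gestation_weeks"  # continuous -> a regression problem

examples = [([row[f] for f in FEATURES], row[LABEL]) for row in natality]
print(examples[0])  # ([28, 30.0], 39.0)
```

Swapping the label to baby weight, with gestation weeks moved into the feature list, would be the same one-line change described in the text.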
Without the different colors and shapes to aid us, the data appears to be one noisy line with a negative slope and positive intercept. Since it appears quite linear, this will most likely be a good candidate for linear regression, where what we are trying to predict is the value of y. Adding the different colors and shapes back in, it is much more evident that this data set is actually two linear series with some Gaussian noise added. The lines have slightly different slopes and different intercepts, and the noise has different standard deviations. I've plotted the lines here to show you that this is most definitely a linear data set by design, albeit a little noisy. This would be a good candidate for linear regression. Despite there being two distinct linear series, let's first look at the result of a one-dimensional linear regression predicting y from x, to start building an intuition. Then we'll see if we can do better. The green line here is the fitted linear equation from linear regression. Notice that it is far away from each individual class distribution, because class B pulls the line away from class A and vice versa. It ends up approximately bisecting the space between the two distributions. This makes sense, since with regression we optimize a loss of mean squared error, so with an equal pull from each class, the regression should have the lowest mean squared error in between the two classes, approximately equidistant from their means. Since each class is a different linear series with different slopes and intercepts, we would actually get much better accuracy by performing a linear regression for each class, which should fit very closely to each of the lines plotted here. Even better, instead of performing a one-dimensional linear regression predicting the value of y from one feature, x, we could perform a two-dimensional linear regression predicting y from two features: x and the class of the point.
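We can check this "equal pull" intuition numerically. The slopes, intercepts, and noise levels below are invented to mimic the plot, not taken from the actual figure; fitting one least-squares line to the mixed data lands its intercept roughly midway between the two true intercepts:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)

# Two synthetic linear series with Gaussian noise (parameters are invented
# for illustration; they are not read off the lecture's plot).
y_a = -1.0 * x + 8.0 + rng.normal(0.0, 0.3, x.size)  # class A
y_b = -1.2 * x + 4.0 + rng.normal(0.0, 0.5, x.size)  # class B

# Fit ONE least-squares line to the mixed data (1D linear regression).
X = np.column_stack([np.concatenate([x, x]), np.ones(2 * x.size)])
y = np.concatenate([y_a, y_b])
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)

# Equal pull from each class: the fitted intercept lands roughly midway
# between the two true intercepts (8.0 and 4.0).
print(abs(intercept - 6.0) < 1.0)  # True
```

Fitting a separate regression per class, or adding the class as a second feature as described next, removes this compromise.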
The class feature could be a one if the point belongs to class A and a zero if the point belongs to class B. Instead of a line, the regression would form a 2D hyperplane. Let's see how that would look. Here are the results of the 2D linear regression. To predict our label y, we used two features: x and class. As you can see, a 2D hyperplane has been formed between the two sets of data, which are now separated by the class dimension. I've also included the true lines for both class A and class B, as well as the 1D linear regression's line of best fit. The plane doesn't completely contain any of the lines, due to the noise of the data tilting the two slopes of the plane; otherwise, with no noise, all three lines would lie perfectly on the plane. Also, we have kind of already answered the other portion of the quiz question about linear classification, because the linear regression line does a really great job of separating the classes. So this is a very good candidate for linear classification as well. But would it produce a decision boundary exactly on the 1D linear regression's line of best fit? Let's find out. Plotted in yellow is the output of a one-dimensional linear classifier: logistic regression. Notice that it is very close to linear regression's green line, but not exactly on it. Why could this be? Remember, I mentioned that regression models usually use mean squared error as their loss function, whereas classification models tend to use cross-entropy. So what is the difference between the two? Without going into too much detail just yet, there is a quadratic penalty for mean squared error, so it is essentially trying to minimize the Euclidean distance between the actual label and the predicted label. On the other hand, with classification's cross-entropy, the penalty is almost linear when the predicted probability is close to the actual label, but as it gets farther away, it grows exponentially as the prediction approaches the opposite class of the label.
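The difference between the two penalties can be made concrete with a small sketch. For a single example with label 1, squared error grows quadratically and saturates near 1, while binary cross-entropy keeps growing without bound as the predicted probability approaches the wrong class:

```python
import math

def squared_error(label, pred):
    # Quadratic penalty used in regression (single example).
    return (label - pred) ** 2

def cross_entropy(label, p):
    # Binary cross-entropy for one example with predicted probability p.
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

label = 1.0
for p in [0.9, 0.5, 0.1, 0.01]:
    print(p, round(squared_error(label, p), 3), round(cross_entropy(label, p), 3))
# Squared error is capped near 1 as p -> 0, but cross-entropy blows up:
# at p = 0.01 the squared error is ~0.98 while cross-entropy is ~4.6.
```

This unbounded penalty is why a few badly misclassified points can bend the logistic regression boundary in a way that the mean-squared-error line is indifferent to.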
Therefore, if you look closely at the plot, the most likely reason the classification decision boundary has a slightly more negative slope is so that some of those noisy red points, red being the noisier distribution, fall on the other side of the decision boundary and lose their high error contribution. Since they are so close to the line, their error contribution would be small for linear regression, because not only is the error quadratic, but there is no preference for being on one side of the line or the other in regression, as long as the distance stays as small as possible. So as you can see, this data set is a great fit for both linear regression and linear classification, unlike the tips data set, which was only acceptable for linear regression and much better suited to a non-linear classification.