In this video, we discuss classification trees where the target variable is binary. We use classification trees to discuss two key ideas for building trees: recursive partitioning and pruning. In recursive partitioning, we repeatedly split the records into two parts so as to achieve maximum homogeneity within the new parts. In pruning, we simplify the tree by cutting out less important branches to avoid overfitting.

Let's first briefly discuss the steps for recursive partitioning. First, we pick one of the predictor variables, x. Next, we pick a value s that divides the training data into two portions. After this step, we measure how pure the resulting partition is. We try different predictor variables and split values to maximize purity, and then we repeat the process.

The split procedure we just described applies to continuous variables. For continuous variables, we can order the records according to the value of the chosen variable and split them at the midpoint between any two consecutive values. Consider a data set with four values: 3.3, 1.3, 2.0, and 3.7. After we sort the data from smallest to largest, we obtain (1.3, 2.0, 3.3, 3.7). The possible break points are the midpoints: 1.65, 2.65, and 3.5.

For categorical data, the splitting procedure is slightly different. Categorical data with multiple categories can be split into any pair of subsets. Note that this is not supported in some software implementations, including XLMiner. Instead, we first convert the categorical variable into multiple binary variables and then perform the split on the binary variables.

We mentioned earlier that the goal of recursive partitioning is to find pure partitions. How do we measure purity, then? There are many alternative measures, including the Gini index and entropy. Let me briefly discuss the Gini index as an example. The Gini index for a set of data is 1 minus the sum of p_k squared, where p_k is the proportion of cases that belong to class k.
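The midpoint procedure for continuous variables can be sketched in a few lines. This is a minimal illustration of the sorting step described above; the helper name `candidate_splits` is my own, not from any particular library.

```python
def candidate_splits(values):
    """Midpoints between consecutive sorted values of a continuous predictor.

    Each midpoint is a possible break point for a tree split.
    """
    s = sorted(values)
    return [(a + b) / 2 for a, b in zip(s, s[1:])]

# The four-value example from the text: sorted order is (1.3, 2.0, 3.3, 3.7).
midpoints = [round(m, 2) for m in candidate_splits([3.3, 1.3, 2.0, 3.7])]
print(midpoints)  # [1.65, 2.65, 3.5]
```

A data set with n distinct values yields n - 1 candidate break points, which is why the search over splits stays manageable even for continuous predictors.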
For binary classification, the number of classes is 2. When all cases belong to the same class, p_k is 1 for one class and 0 for all other classes. It follows that the Gini index is 0. It can be verified that the Gini index takes its maximum value when all classes are equally represented; for binary classification, this means that p_k = 0.5 for both classes.

For the appointment data, lag takes integer values: 0, 1, 2, etc. Therefore, the possible breaks are 0.5, 1.5, 2.5, etc. Let's consider a split at the break 20.5. For lag less than 20.5, 384 cases canceled and 3,356 cases arrived. The proportion of cancellations is 0.1027, and the proportion of arrivals is 0.8973. The Gini index is 0.1843. Note that this index measures the purity of the data where lag is less than 20.5. Similarly, the Gini index for the data with lag greater than 20.5 is 0.4509. The overall Gini index is a weighted average across the two sets, where the weights are the relative sizes of the two sets. We can verify that the overall weighted Gini index is 0.3173.

Our goal is to find a split where the overall Gini index is as low as possible. Specifically, we obtain the overall impurity measure for each split and compare this measure across all possible splits in all variables. We choose the split that reduces impurity the most. The chosen split points then become nodes on the tree.

In end nodes, or leaf nodes, we would like to make a prediction and label records accordingly. There are two alternative ways to make the prediction. The first one is majority voting, where we make a binary prediction. For example, when there are 6 successes out of 9, we predict success. We can also choose to make a probability prediction. For example, when we observe 6 successes out of 9, we predict a 2/3 probability of success.

Overfitting is a serious issue in building trees. In fact, the natural end of the process is 100% purity in each leaf node. This is the case, for example, if each end node contains only one observation, which clearly overfits the data.
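The Gini calculations above can be checked with a short sketch. The function names are my own; the only numbers taken from the text are the 384 cancellations and 3,356 arrivals in the lag < 20.5 branch (the counts for the lag > 20.5 branch are not given, so the 0.4509 and 0.3173 figures are not reproduced here).

```python
def gini(counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def weighted_gini(partitions):
    """Overall impurity: Gini of each partition, weighted by relative size."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# A pure node (all cases in one class) has Gini index 0.
print(gini([10, 0]))                # 0.0
# The binary maximum occurs at a 50/50 split: p_k = 0.5 for both classes.
print(gini([5, 5]))                 # 0.5
# lag < 20.5 branch: 384 cancellations, 3,356 arrivals.
print(round(gini([384, 3356]), 4))  # 0.1843
# Two pure partitions give an overall weighted Gini of 0.
print(weighted_gini([[10, 0], [0, 10]]))  # 0.0
```

Recursive partitioning evaluates `weighted_gini` for every candidate split on every variable and keeps the split with the lowest value.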
As a result, the predictive accuracy for new data will not be good. Here is part of the classification tree for the appointment data. Note that the tree is not fully presented here because it has too many levels. With that many branches, the predictive performance of this tree is pretty bad.

This leads to the second key idea of building trees, which is pruning. The key question is: what is the right-sized tree? If the tree is too small, there is not enough flexibility. If the tree has only one node, then it makes the same prediction for every observation, which is clearly undesirable. If the tree is too large, it overfits the training set. As a result, the predictive accuracy on new data will not be good.

There are several alternative methods for pruning. A popular one is called CART, and it's based on the idea of cross-validation. That is, we divide the data into training and validation sets, and pick the tree with the highest predictive accuracy on the validation set. There are also alternative methods, such as CHAID and C4.5.
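The idea of picking the tree size by validation accuracy can be illustrated with a toy sketch. This is not CART's actual pruning algorithm, and the data are made up: a fully grown tree (mimicked here by a nearest-training-record lookup, since such a tree has one leaf per record) is compared against a one-split "stump" on a held-out validation set.

```python
# Made-up training data: label is 1 when x > 4.5, except two noisy
# records at x = 3 and x = 7.
train = [(0, 0), (1, 0), (2, 0), (3, 1), (4, 0),
         (5, 1), (6, 1), (7, 0), (8, 1), (9, 1)]

def full_tree(x):
    """Fully grown tree: one pure leaf per training record (nearest match)."""
    return min(train, key=lambda r: abs(r[0] - x))[1]

def stump(x):
    """Pruned tree: a single split at 4.5, majority label in each branch."""
    return 1 if x > 4.5 else 0

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# Clean validation data at x values not seen in training.
valid = [(0.2, 0), (1.2, 0), (2.2, 0), (3.2, 0), (4.2, 0),
         (5.2, 1), (6.2, 1), (7.2, 1), (8.2, 1)]

print(accuracy(full_tree, train), accuracy(stump, train))  # 1.0 0.8
# On validation data the pruned stump wins: the full tree's perfect
# training fit came from memorizing the two noisy records.
print(accuracy(full_tree, valid), accuracy(stump, valid))
```

The full tree is perfect on the training set but loses on the validation set, which is exactly why pruning by validation accuracy favors the smaller tree.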