[MUSIC] So we talked about resampling, we talked about the bootstrap, and we talked about cross-validation, where you split your training set into different subsets in order to validate your result. So one idea that might occur to you is: look, we have all these different subsets of our training set lying around. Why don't we train a different classifier on each one and combine their results? Instead of using resampling to validate one particular classifier, why not just make a whole bunch of different classifiers? And so the question you need to answer is: does that actually make sense? Does it work to take a bunch of not-very-good classifiers and combine them? Will the combination actually be better? The answer is not trivial to work out, but it was worked out, and the answer is yes. So you can average the results of different models, and you get a couple of benefits. One is that the strength goes up: the classification performance gets better, and you can drive it up by design. But perhaps more importantly, you get more resilience to noise, so you reduce the chance of overfitting by having a bunch of different classifiers all working in tandem.

So why wouldn't you do this? Well, there are two reasons. One is that it's time consuming. Obviously, training a bunch of different classifiers is more expensive than training just one. But also, the behavior of your model becomes difficult to explain. When you have a whole bunch of different models all voting on the answer, it's difficult to get any intuition for what logic is actually being applied. Remember we said that just as one rule was easier to understand than a big set of rules, one decision tree was easy to understand; a big set of decision trees is perhaps not. But that's not going to stop us from doing it. So the idea here is the wisdom-of-the-crowds model, and it works when you're talking about these digital artifacts as well, these generated models.
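To make the "combine weak classifiers" claim concrete, here is a minimal simulation sketch. It assumes an idealized setup not stated in the lecture: each of 25 hypothetical classifiers is independently correct 60% of the time, and the ensemble takes a majority vote. Real classifiers make correlated errors, so this is a best-case illustration, not a guarantee.

```python
import random

random.seed(0)

N_CLASSIFIERS = 25   # hypothetical ensemble size (assumed for illustration)
P_CORRECT = 0.6      # each weak classifier alone is right 60% of the time
N_TRIALS = 10_000    # simulated test examples

def weak_vote():
    """One classifier's vote on one example: 1 if correct, 0 if wrong."""
    return 1 if random.random() < P_CORRECT else 0

single_correct = 0
ensemble_correct = 0
for _ in range(N_TRIALS):
    votes = [weak_vote() for _ in range(N_CLASSIFIERS)]
    single_correct += votes[0]  # track the first classifier on its own
    # Majority vote: the ensemble is correct when most members are.
    ensemble_correct += 1 if sum(votes) > N_CLASSIFIERS // 2 else 0

single_acc = single_correct / N_TRIALS      # roughly 0.6
ensemble_acc = ensemble_correct / N_TRIALS  # noticeably higher
print(single_acc, ensemble_acc)
```

Under the independence assumption, the majority vote lands around 85% accuracy even though no individual classifier exceeds 60%, which is the "strength goes up by design" effect described above.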
And so how do we actually do this? Well, one technique is called bagging, which is a portmanteau of bootstrap aggregating. The idea here is to draw your N bootstrap samples, retrain the model separately on each sample just as we described, and then average the results. If you're talking about a regression problem, which we haven't gone into detail on yet, then you literally just average the results. If it's classification, where you're picking a discrete class label, then you do a majority vote. And this works really, really well for overfit models. It resists overfitting because it decreases variance without changing the bias: without moving the answer around, it just shrinks everything toward that answer. If you remember, we put up that plot where high variance tended to mean overfitting; it means that you are specialized to the training data, that you're very sensitive to the training data. But bagging doesn't help much with models that have a high bias, with wrong models, and remember that high-bias, low-variance models are more or less insensitive to the training data. All that you're doing with bagging is taking different permutations of the training data, so if you're not getting a good signal out of the training data anyway, this isn't going to help much.

Another technique is boosting, exemplified by the algorithm AdaBoost. Here, instead of selecting data points uniformly at random with the bootstrap, you want to favor the misclassified points on each step. So take a bootstrap sample, retrain a model, inspect the mistakes the classifier made, and recompute the weights for the next round, giving extra weight to the data points that your model got wrong. [MUSIC]
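The bagging recipe above can be sketched end to end. Everything here is invented for illustration: a toy 1-D dataset with a couple of noisy labels, and a "decision stump" (a single threshold split) as the base classifier. The bagging part follows the lecture exactly: draw bootstrap samples, train one model per sample, and predict by majority vote.

```python
import random

random.seed(1)

# Toy 1-D dataset (hypothetical): label is 1 when x > 5,
# with deliberately noisy labels at x=3 and x=8.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

def train_stump(xs, ys):
    """Fit a one-split 'decision stump': the threshold with fewest errors."""
    best = None
    for t in xs:
        errors = sum(1 for x_, y_ in zip(xs, ys)
                     if (1 if x_ > t else 0) != y_)
        if best is None or errors < best[1]:
            best = (t, errors)
    return best[0]

def bagged_predict(x, thresholds):
    """Majority vote over the stumps trained on each bootstrap sample."""
    votes = sum(1 if x > t else 0 for t in thresholds)
    return 1 if votes * 2 > len(thresholds) else 0

# Bagging: N bootstrap samples (len(X) points drawn with replacement),
# one stump trained on each.
N = 50
thresholds = []
for _ in range(N):
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    thresholds.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))

left = bagged_predict(2, thresholds)    # well left of the true boundary
right = bagged_predict(9, thresholds)   # well right of the true boundary
print(left, right)
```

Individual stumps wobble from sample to sample because of the noisy labels, but the vote averages that wobble away: that's the variance reduction without a bias change.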
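The reweighting loop of AdaBoost can also be sketched. This follows the standard AdaBoost recipe (weighted error, a per-round weight alpha, multiplicative reweighting of mistakes) rather than anything specified in the lecture, and the toy dataset and stump learner are again made up for illustration. Labels are in {-1, +1} as is conventional for AdaBoost.

```python
import math

# Toy 1-D data (hypothetical); x=4 and x=5 carry noisy labels.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [-1, -1, -1, -1, 1, -1, 1, 1, 1, 1]

def weighted_stump(w):
    """Pick the threshold and direction minimizing the *weighted* error."""
    best = None
    for t in X:
        for sign in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if sign * (1 if xi > t else -1) != yi)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(rounds=5):
    n = len(X)
    w = [1.0 / n] * n              # round 1: all points weighted equally
    learners = []
    for _ in range(rounds):
        err, t, sign = weighted_stump(w)
        err = max(err, 1e-10)      # guard against a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        learners.append((alpha, t, sign))
        # Reweight: misclassified points get *more* weight next round,
        # correctly classified points get less.
        w = [wi * math.exp(-alpha * yi * sign * (1 if xi > t else -1))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return learners

def predict(x, learners):
    """Final classifier: weighted vote of all the rounds' stumps."""
    score = sum(a * s * (1 if x > t else -1) for a, t, s in learners)
    return 1 if score >= 0 else -1

learners = adaboost()
preds = [predict(x, learners) for x in X]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
print(accuracy)
```

Note that this sketch reweights the loss directly instead of redrawing a weighted bootstrap sample each round; the two are interchangeable ways of "favoring the misclassified points" that the lecture describes.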