[MUSIC] We've talked about cross-validation, where you break your dataset up into pieces in order to validate your model, and in general about splitting things into training sets and test sets, which is crucially important for making any sense of your results whatsoever. A related notion here is resampling your dataset, or maybe resampling is the more general notion. Another use of resampling is the bootstrap, which is a very, very general and very, very powerful technique coming out of the statistics community in 1979.

So here's how it works. Given a dataset of size N, you draw N samples from that same dataset with replacement to create a new dataset. Then repeat this a whole bunch of times. Now what you have is a bunch of different sample datasets drawn from the same population. You can compute whatever statistic you're interested in on each one of these. Or you can train a model, which is the context we're talking about, and you can interpret each of these as a sort of individual experiment. And this is exactly what the frequentist perspective calls for. Right? If you remember the definition of a confidence interval, it says that under repeated experiments, 95% of the time the confidence interval will contain the true mean, or the true value of whatever statistic you're interested in. The bootstrap gives you a direct simulation of that interpretation of the statistics.

So this is a fantastic use of computational resources. It means you don't have to have a precise, analytic model of the world and what's going on; you can just rerun experiments. In 1999, this seemed like, wow, this is going to be very computationally intensive, it's not very feasible. Now, for most of these datasets, it's actually trivial. So this is a very, very powerful technique. All right, so here you see an example. This is probably pretty clear, but I just wanna make sure you see how it works.
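The procedure just described can be sketched in a few lines of Python. This is a minimal illustration using the standard library; the function name and the choice of the mean as the statistic are just for demonstration.

```python
import random
import statistics

def bootstrap_samples(data, num_samples):
    """Draw num_samples bootstrap datasets, each the same size as `data`,
    by sampling from `data` with replacement."""
    n = len(data)
    return [[random.choice(data) for _ in range(n)] for _ in range(num_samples)]

data = [1, 2, 3, 4, 5, 6]
samples = bootstrap_samples(data, 1000)

# Compute the statistic of interest (here, just the mean) on each resample.
means = [statistics.mean(s) for s in samples]

# Averaging the per-resample means gives an estimate of the true mean,
# and the spread of `means` tells you about its sampling variability.
print(statistics.mean(means))
```

Each inner list plays the role of one "rerun experiment": same size as the original data, drawn from it with replacement, so duplicates and omissions are expected.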
Imagine that the original dataset you're given is just 1, 2, 3, 4, 5, 6. Then you're gonna draw multiple samples from this simple dataset with replacement. So you might get 4, 3, 4, 2, 1, 6, and you'll notice 4 came twice. Or 2, 3, 6, 1, 3, 5, and 3 came twice. And so on. And now for each one of these samples, you can compute the statistic you're interested in, say just the mean. And what you'll get is an unbiased estimate of the true mean. Okay. So great. So that's a pretty trivial case.

If you have a more complicated dataset like this, and you want to compute a slightly more complicated statistic, such as fitting a regression line, well, same thing. These are all your points, and you can draw bootstrap samples from them. So this is 100 points; I think I said 1000 here, but it's just 100 points. Draw 100 samples from this dataset with replacement, and then refit a regression line to them. And I've done that here ten times or so. You see that the line does wiggle, because some of these points are missing. For example, this outlier might have been missing in some of the samples, some of the blue star samples, and therefore didn't influence the line, so it shimmies around. But it stays sort of confined to the data, because you're still drawing from the same sample dataset.

By reasoning about how much that line wiggles, you can compute things like confidence intervals over these statistics. So if you wanted a 90% confidence interval for the slope of this regression line, just do this a bunch of times. Take the 5th percentile. Take the 95th percentile. And you've got it. Okay. So this is really, really general, really, really powerful, and associated closely with the statistics community, as opposed to the algorithmic-modeling kind of machine learning community. But we're gonna see how to use it for decision trees. [MUSIC]
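The slope-confidence-interval recipe above can be sketched as follows. The data here is synthetic, standing in for the 100-point scatter plot on the slide (the true slope of 2.0 and the noise level are made-up values for illustration), and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the scatter plot: 100 points near a line y = 2x + 1.
n = 100
x = rng.uniform(0, 10, n)
y = 2.0 * x + 1.0 + rng.normal(0.0, 2.0, n)

# Refit the regression line on many bootstrap resamples of the points.
slopes = []
for _ in range(1000):
    idx = rng.integers(0, n, n)              # n point indices, with replacement
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    slopes.append(slope)

# 90% confidence interval for the slope: 5th and 95th percentiles.
lo, hi = np.percentile(slopes, [5, 95])
print(f"90% CI for slope: [{lo:.3f}, {hi:.3f}]")
```

Note that resampling happens at the level of (x, y) points, not of x and y separately, so each refit sees a coherent subset of the original observations, which is exactly why the line wiggles but stays confined to the data.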