[MUSIC] So let's see what this looks like a little more precisely. For each step t, we've got a vector of weights representing the probability of selecting example i in the sample. That's this D: D at t, D of i, is the probability of selecting example i in the sample, and these are initialized to a uniform distribution. Then x sub i and y sub i are your record and your label. I've been saying data point, example, and record more or less interchangeably; I know that's not ideal, but bear with me.

So h sub t is the classifier trained at step t using the sample drawn according to the weights D. Now, you compute the error, this epsilon, by adding up the weights of everything that was misclassified. That's what this incantation down here says: wherever the model h sub t applied to x sub i is not equal to y sub i, that is, wherever the classifier tells you something different than the actual class label, include that example in the set, and add up all the D's for those.

Then compute what's called the odds: given a probability p, the odds are p over 1 minus p. So compute the odds of misclassifying, epsilon over 1 minus epsilon, and use that to reweight. As long as epsilon is less than one half, this is a value between zero and one, so multiplying by it gives a lower weight to D of i, and you only apply it to the examples that were correctly classified. All the other weights you just leave alone. In an attempt at clarity I didn't include the entire expression on the slide, which may not have been a good idea, so keep in mind this update only applies to the correctly classified examples.

There's actually one other step that I also removed in an attempt at clarity, which is that you need to renormalize. What you've done is take a set of weights that were normalized to probabilities, they added up to 1, and adjusted some of them down, so now they don't add up to one anymore. That's okay, because you still know the relative weights, so you just renormalize. It's a pretty simple step, but I find that when you throw a "divided by Z" into the formula it gets confusing, so I wanted to at least explain it this way. That gives you your new weights for the next round, and then you repeat the process.

Now, a question I want to ask about this, given our appreciation for big data and for scale-out solutions, is about the difference between bagging and boosting. But before I get to that, I should say what's good about boosting. Boosting has a bunch of nice properties and it's a very, very successful algorithm. One thing that's especially cool is that it's a meta-algorithm: it works on anything. It doesn't say anything about what the base classifier has to be. It could be a decision tree, it could be any other method you might come up with. All it does is affect the data you use to train the classifier at each step, and that makes it very general and very powerful. Because of this weighting, it really does zero in on the mistakes that are made, giving more and more weight to the errors, so you can imagine intuitively that the strength of the classifier is going to get better and better over time.
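To make the update concrete, here is a minimal sketch in Python and NumPy of one round of the weighting scheme described above. The train_weak_learner function is a hypothetical placeholder standing in for whatever base classifier you like, which is the meta-algorithm point: it just has to fit on a sample and return something with a predict method.

```python
import numpy as np

def boosting_round(X, y, D, train_weak_learner, rng):
    """One round of the weighting scheme described above (AdaBoost.M1-style).

    X, y               : training examples and their labels
    D                  : current weight vector, a probability distribution over examples
    train_weak_learner : hypothetical function that fits a classifier on a sample
                         and returns an object with a .predict(X) method
    rng                : numpy random Generator used to draw the sample
    """
    n = len(y)

    # Draw the training sample according to the weights D.
    idx = rng.choice(n, size=n, replace=True, p=D)
    h_t = train_weak_learner(X[idx], y[idx])

    # Epsilon: add up the weights of everything the classifier got wrong.
    wrong = h_t.predict(X) != y
    eps = D[wrong].sum()

    # The "odds" of misclassifying; between 0 and 1 as long as eps < 0.5.
    # (A real implementation would handle eps == 0 or eps >= 0.5 specially.)
    beta = eps / (1.0 - eps)

    # Shrink the weights of the *correctly* classified examples,
    # and leave the misclassified ones alone...
    D_next = np.where(wrong, D, D * beta)

    # ...then renormalize so the weights are a probability distribution again.
    D_next = D_next / D_next.sum()

    return h_t, eps, D_next
```

This is only a sketch under those assumptions, but it shows the three pieces the lecture walks through: the weighted error, the odds, and the renormalization that the slide leaves out.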
So that's all well and good, and it works, but it has a disadvantage with respect to really big data relative to a simpler method like bagging: it is inherently sequential. We have this loop over each step t, and the weights at step t + 1 depend on the weights at step t. With bagging, we can just go wild in parallel and do all of it at the same time. So that's something to keep in mind that you don't see quite as often in discussions of these methods, because they tend to focus on the mathematics and on predictive performance. But in a data science context, we need to keep in mind that we're going to be working on very, very large data sets at times, and a slightly worse-performing but easy-to-parallelize method can win. [MUSIC]
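To illustrate that last point, here is a minimal sketch, assuming the same hypothetical train_weak_learner as above, of why bagging parallelizes so easily: every bootstrap sample is independent, so each fit can run at the same time, whereas the boosting loop has to wait for round t before it can start round t + 1.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def fit_bagged_ensemble(X, y, train_weak_learner, n_models=10, seed=0):
    """Bagging sketch: each model is trained on its own bootstrap sample,
    and the samples don't depend on each other, so all n_models fits can
    be farmed out in parallel. Boosting can't do this, because the weights
    for round t + 1 come out of round t."""
    rng = np.random.default_rng(seed)
    n = len(y)
    bootstrap_indices = [rng.choice(n, size=n, replace=True) for _ in range(n_models)]

    # Each fit is independent, so submit them all at once.
    # (In a real script you'd guard this with `if __name__ == "__main__":`.)
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(train_weak_learner, X[idx], y[idx])
                   for idx in bootstrap_indices]
        return [f.result() for f in futures]
```

Again, this is just a sketch of the structural difference, not a production implementation, but it is the reason the scale-out argument favors bagging.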