So we have taken a swift look at the search strategy. Training a classifier on our set of features is usually not a big deal. However, let's look at the additional constraints imposed on the classifier so that it can be used for such an analysis. There are two considerations we have to keep in mind. The first one is called uniformity: correlation between the classifier prediction and the mass can produce false peaks that spoil event counting. As you see in the picture, if you have the blue distribution and then apply a classifier that is correlated with the mass, you might get the green distribution, which is roughly the same in the sideband regions, but shows a peak at the center, where you expect the mass of the mother particle. Sometimes this peak is due purely to the features, or to internal properties of the classifier, with no physical motivation behind it. The second restriction we have to keep in mind is that in the training dataset the signal is entirely represented by simulated events, while the background is real data. So when we train a signal-versus-background classifier, we don't know whether it has learned something specific to signal versus background, or to Monte Carlo versus real data. Those aspects can also introduce a bias into our counting. So let's take a look at uniformity. At the top right you have the joint distribution of two variables, the prediction and the mass. At the bottom left you have numbers that represent the distribution of events across mass bins for predictions passing a global threshold on the true positive rate, for example 0.1, 0.3, 0.5, and so on. In the case of a non-uniform distribution of predictions versus mass, the joint distribution looks like what is displayed at the top right, and you can see the effect even more distinctly in the levels picture.
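This sculpting effect is easy to reproduce. Below is a minimal numpy sketch (not from the lecture; all numbers are invented for illustration) in which the background mass spectrum is perfectly flat, but a hypothetical classifier score that happens to be correlated with the mass carves a fake peak out of it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Flat background mass spectrum (arbitrary units), no physical peak anywhere.
mass = rng.uniform(0.0, 10.0, 100_000)

# Hypothetical classifier score that is accidentally correlated with mass:
# it tends to be high near mass = 5, where we expect the mother particle.
score = np.exp(-0.5 * ((mass - 5.0) / 1.5) ** 2) + rng.normal(0.0, 0.3, mass.size)

# Apply a global threshold on the score and histogram the surviving events.
selected = mass[score > 0.8]
hist, edges = np.histogram(selected, bins=20, range=(0.0, 10.0))

# The selected spectrum now peaks near mass = 5 although the input was flat:
# a false peak produced purely by the classifier, with no physics behind it.
peak_center = 0.5 * (edges[hist.argmax()] + edges[hist.argmax() + 1])
```

The surviving events pile up exactly where the analysis would look for the signal, which is why this kind of correlation is so dangerous for event counting.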
So basically, you apply a classifier threshold that gives a specific global true positive rate, say 0.1, which is depicted in blue, and then you look at the actual fraction of the dataset selected by this threshold in each range along the mass axis. When you plot those colorful numbers, you see peaks, and those peaks correspond to non-uniformity. How can you measure this non-uniformity? You introduce a mass window and compare the distribution of predictions in this window with the global distribution over your whole dataset, which is depicted in the figure by the blue points. You can plot the marginal distributions, as on the right side of this picture: the blue and yellow histograms. Since those are pretty much aligned with each other, the distribution of predictions in the mass window is more or less the same as in the whole dataset. If there were some correlation between the predictions and the mass, those marginal distributions would differ. As a measure of non-uniformity, we can introduce the Cramér-von Mises (CvM) test, an integral characteristic that integrates the difference between the CDFs of two distributions: the predictions in the mass window and the predictions in the whole dataset. We compute this formula over all possible mass windows, or over some chosen set of them. A uniformity check that relies on this measure can work in the following manner. We want to confirm that the predictions and the mass can be considered independent variables, so we take as the null hypothesis that mass and predictions are independent. Then we generate the distribution of the CvM value under the null hypothesis by repeating the following steps many times: first we generate random predictions, then we compute the CvM value. Finally, we choose a p-value and compute the corresponding critical value of the Cramér-von Mises statistic.
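The window-versus-global CDF comparison can be sketched in a few lines of numpy. This is a simplified illustration of the idea, not the exact statistic from the slides: the choice of equal-population windows and the unweighted average over them are arbitrary choices of mine.

```python
import numpy as np

def cvm_flatness(mass, preds, n_windows=5):
    """Average, over mass windows, of the mean squared difference between
    the local CDF of the predictions (inside the window) and the global CDF,
    both evaluated at every prediction value in the dataset."""
    sorted_preds = np.sort(preds)
    global_cdf = np.arange(1, len(preds) + 1) / len(preds)
    # Equal-population mass windows (one arbitrary choice of window set).
    edges = np.quantile(mass, np.linspace(0.0, 1.0, n_windows + 1))
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_win = np.sort(preds[(mass >= lo) & (mass <= hi)])
        local_cdf = np.searchsorted(in_win, sorted_preds, side="right") / len(in_win)
        total += np.mean((local_cdf - global_cdf) ** 2)
    return total / n_windows

rng = np.random.default_rng(1)
m = rng.uniform(0.0, 10.0, 20_000)
p_indep = rng.uniform(0.0, 1.0, m.size)            # independent of mass
p_corr = m / 10.0 + rng.normal(0.0, 0.05, m.size)  # strongly mass-correlated
```

For independent predictions this value stays near zero, while a mass-correlated classifier inflates it; the null distribution described above can be built by recomputing the measure with freshly drawn random predictions.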
So the basic approach that is used to deal with correlation between classifier predictions and the mass is pretty simple: we remove from the training sample the features that introduce such correlation. We know that the momentum, or, if we have several particles, the momenta of those particles, can give us a hint about the mass, so we can omit those features. It is simple and it works. But removing those features can lessen the power of the classifier. So the question is: can we modify the machine learning algorithm so that it still uses all the features, but provides predictions that are uniform along the mass axis, with regard to background or with regard to signal? Let's recap the gradient boosting algorithm. It is an approach that greedily builds an ensemble of estimators. Here each estimator is just a tree, and we have an ensemble of those trees, weighted by some coefficients. This approach minimizes a given loss function, and the loss functions can differ: for example, Mean Squared Error, the one you have in the middle, or AdaLoss, which is used for AdaBoost, or LogLoss, which is used in many different classifiers. Each term in the ensemble approximates the residual between the true value of the function we want to predict and the prediction of all preceding terms. So one approach is called uniform boosting, or uBoostBDT; BDT stands for Boosted Decision Trees. Suppose we aim to get a false positive rate in each region equal to a certain constant. At first, we fix the target efficiency, for example 30%, and find the corresponding threshold for the classifier. Then we train a tree with its decision function, d(x), and increase the weights of misclassified events. And then we additionally increase the weights of background events in the regions with a high false positive rate, according to the formula on this slide.
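The extra, uniformity-specific reweighting step can be sketched as follows. This is only a schematic illustration of the idea: the exponential form and the `beta` parameter are my placeholders, and the exact uBoost update formula is the one on the slide.

```python
import numpy as np

def uboost_background_reweight(weights, is_bkg, mass, scores, threshold,
                               target_fpr=0.3, beta=1.0, n_bins=10):
    """Schematic uBoost-style step: after the usual AdaBoost reweighting of
    misclassified events, additionally boost background events in mass bins
    where the local false positive rate exceeds the target one."""
    edges = np.quantile(mass, np.linspace(0.0, 1.0, n_bins + 1))
    edges[-1] = np.nextafter(edges[-1], np.inf)  # keep the maximum-mass event
    new_w = weights.copy()
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (mass >= lo) & (mass < hi) & is_bkg
        if not sel.any():
            continue
        # Fraction of background weight in this bin passing the threshold.
        local_fpr = np.average(scores[sel] > threshold, weights=weights[sel])
        # Boost where the bin is too "leaky", de-emphasize where it is not.
        new_w[sel] *= np.exp(beta * (local_fpr - target_fpr))
    return new_w

# Toy demo: flat background whose score happens to be high near mass = 5.
rng = np.random.default_rng(2)
mass = rng.uniform(0.0, 10.0, 10_000)
scores = np.exp(-0.5 * (mass - 5.0) ** 2)
w1 = uboost_background_reweight(np.ones(mass.size),
                                np.ones(mass.size, dtype=bool),
                                mass, scores, threshold=0.5)
```

In the demo, background weights grow in the central mass bins (where too much background passes the cut) and shrink in the sidebands, so the next tree concentrates on fixing exactly the bins that spoil uniformity.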
So following this procedure, we achieve the target false positive rate in this region, and we can repeat it for all regions. It works, but it is a little bit computationally hungry. A somewhat simpler approach is based on gradient boosting: why don't we minimize CvM with gradient descent? The problem is that we can't compute the gradient of this formula directly. But we can make an approximation of CvM, according to the formula on the slide, and for this approximation we actually can compute the derivative. We then modify the loss function accordingly: we add this flatness loss (FL), with some coefficient, to the loss function of AdaBoost. And this actually works pretty nicely. So you see three examples. On the left part of the slide you see AdaBoost and those efficiency levels, and you see that the sidebands are inclined pretty heavily. Disregard the middle points, where the levels drop to zero: that happens for non-physical reasons. Instead, consider the angles at which the sideband levels approach the central region. uBoost is a bit better, and the winner is uGB+FL: it is faster, and it also allows a tradeoff between quality and uniformity by tuning the parameter alpha. Another approach that you can take into account for this problem is based on adversarial neural networks. The first one is adversarial decorrelation, introduced in a paper by Chase Shimmin and his colleagues. The idea is that you build a classifier, and on top of this classifier you introduce another algorithm, an adversary network, which measures how well it can reconstruct the dependency between the prediction of the original classifier and the mass, or any other feature that you want to decorrelate. To make it work, they also added a gradient scaling layer to increase the gradient passing through the adversary part, and signal events are given zero weight in the adversary loss. So it basically works.
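The key trick behind the differentiable approximation is that the difference between the local (mass-bin) and global CDFs, evaluated at each event's prediction, gives a usable per-event gradient. Here is a rough numpy sketch of such a flatness-loss gradient; it is similar in spirit to flatness losses implemented in libraries like hep_ml, but the exact formula on the slide may differ.

```python
import numpy as np

def flatness_gradient(preds, mass, n_bins=10):
    """Per-event gradient sketch of a flatness penalty: for each event, the
    difference between the CDF of predictions inside its mass bin and the
    global CDF, both evaluated at that event's prediction. Driving this to
    zero makes the prediction distribution uniform along the mass axis."""
    n = len(preds)
    order = np.argsort(preds)
    global_cdf = np.empty(n)
    global_cdf[order] = np.arange(1, n + 1) / n
    grad = np.zeros(n)
    edges = np.quantile(mass, np.linspace(0.0, 1.0, n_bins + 1))
    edges[-1] = np.nextafter(edges[-1], np.inf)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (mass >= lo) & (mass < hi)
        local = preds[sel]
        local_cdf = np.searchsorted(np.sort(local), local, side="right") / len(local)
        grad[sel] = 2.0 * (local_cdf - global_cdf[sel])
    return grad

rng = np.random.default_rng(3)
m = rng.uniform(0.0, 10.0, 5_000)
g_flat = flatness_gradient(rng.uniform(0.0, 1.0, m.size), m)     # uncorrelated
g_corr = flatness_gradient(m / 10.0 + rng.normal(0.0, 0.05, m.size), m)
```

In the boosting loop, each new tree is fitted to the negative gradient of the combined objective, classification loss plus alpha times this flatness term; the gradient is near zero for mass-independent predictions and large for mass-correlated ones.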
So you see an example from their presentation: the red distribution shows the predictions of the original, traditionally trained neural network, and the black one shows the distribution of predictions of the network trained adversarially according to the previous scheme. And the quality of the second network is roughly the same as the quality of the original one. On this slide you have a slightly more generic scheme, of which the previous one was just an example. In the adversarial part of the network, they reconstruct the parameters of the distribution that we want to avoid, and if we can see this distribution in the output of the adversarial part, we punish the original network.
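This generic scheme can be summarized as a minimax objective. The formulation below is the standard adversarial-decorrelation setup; the notation and the trade-off coefficient $\lambda$ are mine, not taken from the slides.

```latex
% f: the original classifier; r: the adversary that tries to infer the mass
% (or the parameters of its distribution) from the classifier output f(x).
% The adversary minimizes its own loss L_adv; the classifier minimizes its
% classification loss while being punished whenever the adversary succeeds.
\min_{f}\;\max_{r}\;\Big[\, L_{\text{clf}}(f) \;-\; \lambda\, L_{\text{adv}}(f, r) \,\Big]
```

Larger $\lambda$ pushes the classifier harder toward mass-independent predictions at some cost in raw classification quality, which is the same quality-versus-uniformity tradeoff we saw for uGB+FL.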