Now that you've seen how the Gaussian or the normal distribution works for a single number, we're ready to build our anomaly detection algorithm. Let's dive in. You have a training set x1 through xm, where here each example x has end features. So, each example x is a vector within numbers. In the case of the airplane engine example, we had two features corresponding to the heat and the vibrations. And so, each of these excise would be a two dimensional vector and n would be equal to 2. But for many practical applications n can be much larger and you might do this with dozens or even hundreds of features. Given this training set, what we would like to do is to carry out density estimation and all that means is, we will build a model or estimate the probability for p(x). What's the probability of any given feature vector? And our model for p(x) is going to be as follows, x is a feature vector with values x1, x2 and so on down to xn. And I'm going to model p(x) as the probability of x1, times the probability of x2, times the probably of x3 times the probability of xn, for the n th features in the feature vectors. If you've taken an advanced class in probably in statistics before, you may recognize that this equation corresponds to assuming that the features x1, x2 and so on up to xm are statistically independent. But it turns out this algorithm often works fine even that the features are not actually statistically independent. But if you don't understand what I just said, don't worry about it. Understanding physical independence is not needed to fully complete discuss and also, be able to very effectively use anomaly detection algorithm. Now, to fill in this equation a little bit more, we are saying that the probability of all the features of this vector features x, is the product of p(x) 1 and p(x2) and so on up through p(xn). And in order to model the probability of x1, say the heat feature in this example we're going to have two parameters, new 1 and sigma 1 or sigma squared is 1. And what that means is we're going to estimate, the mean of the feature x1 and also the variance of feature x1 and that will be new 1 and sigma 1. To model p(x2) x2 is a totally different feature measuring the vibrations of the airplane engine. We're going to have two different parameters, which I'm going to write as mu 2, sigma 2 squared. And it turns out this will correspond to the mean or the average of the vibration feature and the variance of the vibration feature and so on. If you have additional features mu 3 sigma 3 squared up through mu n and sigma n squared. In case you're wondering why we multiply probabilities, maybe here's 1 example that could build intuition. Suppose for an aircraft engine there's a 1/10 chance that it is really hot, unusually hot and maybe there is a 1 in 20 chance that it vibrates really hard. Then, what is the chance that it runs really hot and vibrates really hard. We're saying that the chance of that is 1/10 times 1/20 which is 1/200. So it's really unlikely to get an engine that both fronts really hot and vibrates really hard. Is the product of these two probabilities or the chance of both of these things happening we're saying is the product of both of these probabilities. A somewhat more compact way to write this equation up here, is to say that this is equal to, the product from j =1 through n of p(xj). Would parameters mu j and sigma squared j. And this symbol here is a lot like the summation symbol except that whereas the summation symbol corresponds to addition, this symbol here corresponds to multiplying these terms over here for j =1 through n. So let's put it all together to see how you can build in the nominee detection system. The first step is to choose features xi that you think might be indicative of anomalous examples. Having come up with the features you want to use, you would then fit the parameters mu 1 through mu n and sigma square 1 through sigma squared n, for the n features in your data set. As you might guess, the parameter mu j will be just the average of xj of the feature j of all the examples in your training set. And sigma square j will be the average of the square difference between the feature and the value mu j, that you just computed up here on top to say. And by the way, if you have a vectorized implementation, you can also compute mu as the average of the training examples as follows, we're here, x and mu are both vectors. And so this would be the vectorized way of computing mu 1 through mu and all at the same time. And by estimating these parameters on your unlabeled training set, you've now computed all the parameters of your model. Finally, when you are given a new example, x test or I'm just going to write a new example as x here, what you would do is compute p(x) and see if it's large or small. So p(x) as you saw on the last slide is the product from j = 1 through n of the probability of the individual features. So p(x) j with parameters mu j and single square j. And if you substitute in, the formula for this probability, you end up with this expression 1 over root 2 pi sigma j of e to this expression over here. And so xj are the features, this is a j feature of your new example, mu j and sigma j are numbers or parameters you have computed in the previous step. And if you compute out this formula, you get some number for p(x). And the final step is to see a p(x) is less than epsilon. And if it is then you flag that it is an anomaly. 1 intuition behind what this album is doing is that it will tend to flag an example as anomalous if 1 or more of the features are either very large or very small relative to what it has seen in the training set. So for each of the features x j, you're fitting a Gaussian distribution like this. And so if even 1 of the features of the new example was way out here, say, then P f xJ would be very small. And if just 1 of the terms in this product is very small, then this overall product, when you multiply together will tend to be very small and does p(x) will be small. And what anomaly detection is doing into the algorithm is a systematic way of quantifying whether or not this new example x has any features that are unusually large or unusually small. Now, let's take a look at what all this actually means on 1 example, Here's the data set with features x 1 and x 2. And you notice that the features x 1 take on a much larger range of values than the features x 2. If you were to compute the meme of the Features x 1, you end up with five, which is why you want is equal to 1. And it turns out that for this data said, if you compute sigma 1, it will be equal to about 2. And if you were to compute mu to the average of the features on next to the average is three and similarly is variance or standard deviation is much smaller, which is why Sigma 2 is equal to 1. So that corresponds to this Gaussian distribution for x 1 and this Gaussian distribution for x 2. If you were to actually multiply p(x) 1 and p(x) 2, then you end up with this three D surface plot for p(x) where any point, the height of this is the product of p(x) 1 times p(x) 2. For the corresponding values of x 1 and x 2. And this signifies that values where p(x) is higher or more likely. So, values near the middle kind of here are more likely. Whereas values far out here, values out here are much less likely. I have much lower chance. Now, let me pick to test examples, the first 1 here, I'm going to write this x test 1 and the second 1 down here as x test 2. And let's see which of these 2 examples the album will flag as anomalous. I'm going to pick The Parameter ε to be equal to 0.02. And if you were to compute p(x) test 1, it turns out to be about 0.4 and this is much bigger than epsilon. And so the album will say this looks okay, doesn't look like an anomaly. Whereas in contrast, if you were to compute p(x) for this point down here corresponding to x 1 equals about eight and x 2 equals about 0.5. Kind of down here, then p(x) test 2 is 0.21. So this is much smaller than epsilon. And so the album will flag this as a likely anomaly. So, pretty much as you might hope it decides that x test 1 looks pretty normal. Whereas excess to which is much further away than anything you see in the training set looks like it could be an anomaly. So you've seen the process of how to build an anomaly detection system. But how do you choose the parameter epsilon? And how do you know if your anomaly detection system is working well in the next video, let's dive a little bit more deeply into the process of developing and evaluating the performance of an anomaly detection system. Let's go on to the next video