So in this lesson, we are going to discuss different clustering methods. There are many clustering methods, because this is a very popular and very important field. As we just said, when you have a lot of data, you generally want to apply some clustering to make sense of the data and find subgroups in it.

One family is the partitioning methods. A partitioning method constructs various partitions and then evaluates them according to some criterion, for example minimizing the sum of squared errors. A typical method is k-means, which works by finding groups of similar objects. The density-based approach relies on a different principle: connectivity and density functions. A typical method is DBSCAN, which builds clusters based on how dense the data is. You can picture this on a map: cities stand out because the density of houses is higher there. And the hierarchical approach creates a hierarchical decomposition of the set of data objects using some criterion. This is a very important method for biology; a typical method is DIANA.

So we are going to start with the first family of methods, partitioning, and we are going to look at partitioning with the k-means algorithm. This algorithm proceeds in several steps. It starts with a fixed number of clusters, so we need to tell the algorithm how many clusters we want; in this case, we are going to say two. The first step is to create two clusters. The first time, this is going to be almost random and arbitrary: we arbitrarily assign a center object to each cluster (the red points). Here, for example, we are in a two-dimensional space, and we decide to put one point, called the centroid, to the left and one to the right. I could also have chosen top and bottom; again, this is arbitrary, and these are not points that exist in the data. The next step is to assign each point to the cluster centered on one of the two objects.
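The assignment step just described can be sketched as follows. This is a minimal illustration, not any particular library's implementation; the points and the two starting centroid positions are made up:

```python
import math

def assign_to_clusters(points, centroids):
    """Assign each point the index of its nearest centroid."""
    labels = []
    for p in points:
        # Distance from this point to every centroid.
        distances = [math.dist(p, c) for c in centroids]
        # The point joins the cluster of the closest centroid.
        labels.append(distances.index(min(distances)))
    return labels

# Two arbitrary starting centroids, one to the left and one to the right,
# as in the lecture example (these positions are illustrative).
centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [(1, 1), (2, 0), (9, 1), (8, 2)]
print(assign_to_clusters(points, centroids))  # → [0, 0, 1, 1]
```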
So, for example, here one center was in this position and the other was over there. I calculate the distance between each point and each center. If, for example, this blue point is closer to this center, then I assign it to this cluster; if this other point is closer to the other center, I assign it to that cluster. So at first, this is pretty much going to separate the points between those that are more on the left and those that are more on the right, and you can see here the shape of the clusters that are being formed.

Then I continue these two steps iteratively until there is no change in the shape of the clusters. What happens now is that I recalculate each center as what is called the centroid of the cluster. It is like the middle of the cluster: I average all the points together, and that gives me the new center. For this black cluster here, I calculate the centroid; for the other cluster, I calculate the centroid as well. So this one moves from here to here, and that one moves from here to here, and now they are much more centered in the clusters.

Then I repeat step 2: I calculate all the distances between the points and the new centroids. And there is one particular point where, if you look at the distances, it is now closer to the centroid on the right than to the one on the left, and so it swaps clusters. This one becomes navy blue, because it now belongs to the cluster here. And again, moving back to step one, I recalculate the centroids and check whether there is any movement.
So, to restate: step 1 is to calculate the centroids, and step 2 is to assign each point to the closest centroid. And now, if I assign the points to the closest centroids again, there is no change: they are all in the right cluster, and the process can stop.

The advantage of this method is that it is quite efficient, and there have been improved methods such as Partitioning Around Medoids (PAM), which, instead of starting from arbitrary points outside the data set, starts from points in the data set. The drawback, as you saw, is that the number of clusters is fixed, and very often we do not know how many clusters we have in the data. Of course, we can make hypotheses and try things out: I can try two clusters, then three, then four, and so on. But there are also methods to find this number automatically. Another important drawback is that the clusters have a convex shape, meaning they are, in a sense, close to circular rather than jagged. And this method is sensitive to outliers.

So how can we find the optimal number of clusters automatically? One method is the Ward algorithm, where k is determined by the Cubic Clustering Criterion (CCC), which goes from increasing to decreasing. We see here, for example, that the Cubic Clustering Criterion increases for one, two, three, and four clusters, and then decreases at five. So we say the optimal number of clusters here is four, because this is the point just before the criterion starts decreasing. This is used, for example, in SAS Enterprise Miner, which applies this criterion to automatically find the number of clusters k for the k-means algorithm. Thank you for your attention.
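The selection rule described above, pick the k just before the criterion turns from increasing to decreasing, can be sketched as follows. Computing the CCC itself is left to tools such as SAS; the values below are hypothetical, chosen only to mirror the lecture's example:

```python
def optimal_k(criterion_by_k):
    """Return the k just before the criterion first decreases;
    if it never decreases, return the largest k tried."""
    ks = sorted(criterion_by_k)
    for prev, nxt in zip(ks, ks[1:]):
        if criterion_by_k[nxt] < criterion_by_k[prev]:
            return prev
    return ks[-1]

# Hypothetical CCC values: the criterion rises through k = 4,
# then drops at k = 5, so the rule picks k = 4.
ccc = {1: 0.9, 2: 1.8, 3: 2.4, 4: 3.1, 5: 2.2}
print(optimal_k(ccc))  # → 4
```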