As we mentioned in the previous video, cluster analysis can be viewed as an optimization problem. Excel includes an optimization tool called Solver. We'll not get into the details of how Solver works because we will cover optimization later in this course. Also, the approach of using Solver for the cluster analysis is only practical for datasets that are relatively small. So we just want to show that it is possible to use Excel to approach cluster analysis from the point of view of an optimization problem. To make it more interesting we're going to show how to use Excel for cluster analysis using an example. As we mention before, companies are often interested in segmenting their customers to better target a specific product offerings to meet specific customer needs. Let’s assume that an online retailer has collected data on three contents of her customers. This data is in the Excel file Market Segmentation- Excel. You might want to pause the video to locate the file and open it so you can follow along or just continue to watch the video and play with the model later, it's up to you. Let us start in the original Data tab to examine what data was collected for each of the three candy customers. You can see that for each customer, there are two demographic variables, income level and number of dependence as well as two buying behavior variables, number of purchases made last year, and the average value of each purchase. Let's assume that the online retailers wants to know if the data supports the hypothesis that there are three segments in her market. So we're going to use closer analysis to help her answer this question. The first step in our analysis is to normalize the data. Recall that normalization is a transformation of each variable in the dataset. In our table, the variables are in the columns label Income, Average Purchase, Last year purchases and the Number of Dependents. The normalization consists of subtracting the mean and dividing by the standard deviation. Click on the normalized Data tab to see the normalized values. The normalized values have a mean of 0 and a standard deviation of 1. Now, click on the Clusters tab to look at the model. The model is based on the idea of choosing three customers to represent the centroids of the three clusters. Cells H5, H6, and H7 have the three customers that haven't chosen a centroids. The arbitrary solution that is shown on the spreadsheet consists of customers 10, 20, and 30 as the centroids of the three clusters. The center values in the table correspond to the normalized values for the selected customer. For example, if we change 10 to 15, we see that the central values change. In this other table, the model uses Euclidean distances to assign the customers to the closest cluster. For example, we see that the first six customers are assigned to cluster 2. Customer 7 is assigned to cluster 3 and customer 10 is assigned to cluster 1. The model also calculates the total squared distance, which is the sum of all the distances in column E. Now we're ready to optimize, which is nothing more than searching for the set of three customers that are the best centroids. Clearly, the best set is the one that minimizes the total sum of a squares. We don't want to do this by manually changing the values because believe it or not, there are more than 4.4 million ways of choosing three customers from a set of 300. So we're going to use an Excel tool called Solver. To access this tool, we got to the Analyze Group in the Data tab. If Solver is not there, you need to first load it. If you Google adding solver to Excel, you will find instructions on how to add Solver to the different versions of Excel for Windows and for the Mac. You can pause this video and come back here once you have loaded Solver. Okay, if your Solver is ready to go, then click on it and you will see that the model is all ready there. This is a simple model for the set of objective, SJ9 that is the objective function is the total sum of squares. We want to minimize this value so Min is chosen. The changing cells are the values that the solver can manipulate to minimize your objective function. In this case, there are only three cells, H5 to H7, which correspond to the three customers that are going to be selected as centroids. And then, there are three constraints related to the changing cells. The values in the changing cells must be integer. They must be greater than or equal to 1, and they must be less than or equal to 300. For reasons that go beyond the scope of this course for this particular problem, we must choose the Evolutionary Solver. You can now click on the Solve button. The solution process starts, and search progress is shown on the status bar at the bottom of the spreadsheet. The solution time depends on the computer speed. So if it is taking too long, you can always press this K key to stop the search. The Evolutionary Solver is a so called Metaheuristic. What this means is that it cannot guarantee that when it stops, it has found the very best solution. Also called the optimal solution. When the Solver stops, it simply reports the best solution that it could find. It is states that the Solver can not improve the current solution. We're going to press the Escape key and assert the solver solution. Since we may end up with different solutions, we're going to change the solution that we have to one that I found, and that I would like to discuss with you. The solution has customer 103, as the first Centroid, customer 112 as the second, and customer 170 as the third. Let's enter those values in the corresponding cells and examine the normalized centroid. The normalized values are very helpful in interpreting each cluster because they are centered at 0. So negative values are below average, and positive values are above average. Let's interpret the first cluster. The one that has customer 103 as the centroid. The normalized values for this group are all negative. The group is about one standard deviation below the average in all attributes. Perhaps these are students given that their income is below average, they don't buy very frequently, and they don't spend a lot of money when they do buy. Also they don't seem to have dependents. Moving to the second cluster, represented by customer 112 we see that all the normalized values except for dependent at at least one standard deviation above the average. This group could be one of middle aged professionals with above average income and average number of dependents. They make frequent online purchases and they spend above average in the items they buy. Finally, the third cluster represented by customer 170 has an average income and also makes purchases with a value that is close to the average. However, the number of dependents that the members of this group has is almost one standard deviation above average. This group could very well represent families. Therefore, a reasonable conclusion in this analysis is that the data seems to indicate that the market for this online retailer consists of three segments the students, families, and professionals. We have shown how cluster analysis can be approached as an optimization problem. We have mold the problem in a way that the search for the best set of clusters is equivalent to searching for the best set of customers to represent each cluster. In our model a customer that represent a cluster becomes the centroid of the cluster. To illustrate the process we use a market segmentation example. This is the same example that we will use in the next video to illustrate how to do cluster analysis with Excel minor.