Now in practice there are only a few situations where you can define your groups visually, because you very often have more than two dimensions of variables to consider for your clustering. So we need to rely on a more rigorous approach to create groups and identify patterns within the data. An approach often used in practice is the hierarchical clustering technique. The recital at the end of this module will give you an opportunity to learn how to perform hierarchical clustering in R yourself, in particular using the hclust function. Note that you can replicate all the examples presented in this video thanks to the scripts and datasets that are provided with this MOOC and discussed during the recitals. So let's apply hierarchical clustering to the previous example. I said that what's important is to find the right balance between treating similar cases similarly, and different cases specifically. From a clustering perspective, this is exactly the same as saying that we want to maximize the similarity within the clusters, and maximize the dissimilarity between clusters. In statistics, similarity is very often measured with the distance between observations. We may, for instance, take the Euclidean distance, that is, the distance you probably know from geometry. What the hclust function helps us do is, for a given number of groups, to identify good clusters in that sense. Let's go back to the SKU example. As you can test yourself during the recital, we can have three groups that maximize the within-cluster proximity and maximize the inter-cluster distance. And what's neat is that we'll find exactly the same groups as we did before, but now with a more rigorous approach. We now have a tool that is statistically more robust but also more flexible. First, in practice the question of "how many groups should we have?" is very often crucial, and we can now select the number of groups as a parameter of the clustering.
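To make this concrete, here is a minimal sketch in R of the steps just described: compute Euclidean distances, build the hierarchy with hclust, then ask for three groups. The data frame `skus` and its columns `dim_1` and `dim_2` are made up for illustration and are not the dataset provided with the course.

```r
# Hypothetical two-dimensional SKU data (illustrative values only).
skus <- data.frame(
  dim_1 = c(1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 9.0, 9.2, 8.9),
  dim_2 = c(1.0, 0.8, 1.1, 5.0, 5.2, 4.9, 1.0, 1.1, 0.9)
)

# Pairwise Euclidean distances between observations.
d <- dist(skus, method = "euclidean")

# Build the full hierarchy of merges (complete linkage is hclust's default).
hc <- hclust(d, method = "complete")

# Cut the hierarchy to obtain the requested number of groups.
groups <- cutree(hc, k = 3)

# Number of SKUs in each group.
print(table(groups))
```

Note that in base R the number of groups is passed to cutree (the `k` argument), which cuts the tree that hclust has built. Trying another number of groups only requires calling cutree again with a different `k`.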
Second, we can also include as many dimensions in the analysis as we want. Not only 2 or 3, but even 10 or 20 if needed. With 20 dimensions, finding groups visually would actually be impossible. By the way, about the number of groups to define, some people may claim that you should use technical criteria such as the dendrogram to decide how many clusters you should have in practice. In contrast, in this training, we claim that you should mostly rely on common sense and business criteria to decide on the number of clusters, while ensuring that the clusters you pick are still statistically relevant, obviously! In the supply chain example, we decided to keep only three groups because the splits along those axes are easy to communicate. It's a common sense criterion. Later, you may be constrained by other factors such as the capacity of your workforce or the current organization of your company. We'll discuss such examples later. But at the end of the day, you need to understand that business analytics aims at finding the right balance between business relevance, technical accuracy, and common sense. Thanks to the hclust function, which provides a hierarchical clustering tool, we can now approach more complex examples where we have more than two dimensions to study. In practice, you will very often have to select only some of the variables available to make groups, because not all the data is usable. I'll skip the discussions on data sampling, technical variable selection, and missing data, and I'll focus on the core of our approach: can we select variables that are more relevant from an actionable business perspective? For that, let's imagine that you're an HR manager at a big consulting company and that you're concerned about the number of employees leaving the firm. As an HR manager, you want to retain your best employees within the company, but you cannot follow up with each one of them too often, as that would be very time consuming.
Instead, you may rely on an HR analytics solution to understand the most frequent situations that explain why an employee decides to leave. Now let's imagine that this company has collected a bunch of data during exit interviews. Such data would include the satisfaction of the employee with the company, the last project evaluation, the number of projects completed within the last 12 months, the average number of hours worked per month, the time spent in the company, and whether he or she had a baby within the last year. We first need to make the variables comparable. For instance, we cannot really compare the age of an employee with her satisfaction level. What we can typically do in practice is to standardize the variables. To do this, we first subtract from each value of a variable the average of that variable. This is called mean-centering. Then we divide by the standard deviation. In doing so, we make the distances along the different variables comparable. We explain in the recital how to perform this standardization in practice with the scale function. Now we just have to apply the hierarchical clustering. I won't go through the technical details because I've already provided all the intuitions needed, and we explain during the recital how to implement it in R. We'll discuss later how we can decide on the number of segments in a more systematic way. But let's imagine that we decide to have 4 segments. We run hclust and, after some transformations, we obtain the following output. How can we interpret the results? We need to understand how the segments are different. We could do an ANOVA, an analysis of variance, to be systematic. But let's just try to visualize the most noticeable differences with colors. The first segment didn't do many projects on average and was underutilized. It's also a segment where employees have been in the company for a shorter time than the average. Let's call them the "low performance segment".
The second segment has a low level of satisfaction. It has good evaluations, worked on a lot of projects, has a good utilization rate, and has been in the company for a long time. We see that these are the employees who work the most, but they are also less satisfied. The two could be related, so let's call those employees "the burned out" ones. The third segment is very similar to the second segment, with good evaluations and high utilization. But in contrast, they are very satisfied. Let's call them "the High Potentials". And finally, the last segment doesn't have any distinctive characteristics. Let's call them the "Misc" segment. First, we need to say something about the newborn variable. On the one hand, we could consider this variable as exogenous, which means that we can measure it but, unless you're in a country that limits the number of children a family can have, we cannot do anything about it. So this variable will never really be actionable. From this perspective, the variable should not have been included in the data set in the first place. On the other hand, the firm may consider developing parental benefits, such as paternity or maternity leave, or on-site childcare. So in practice, we see that the question of which variable should be taken into account in the analysis is actually related to its actionability and its relevance. It's a business discussion and not a technical one. Second, we should note that the level of satisfaction is a consequence of everything else. We cannot really act on it directly. So it has to be seen as a consequence and not as the driver of managerial impact. And we should focus on what is actionable in practice. The number of projects done or the utilization, for instance, can be acted upon directly by the management. We can staff a given person on more or fewer projects if we see that the situation needs to be changed. The conclusions are then straightforward.
For the low performance segment, whether they leave the company themselves or are fired, it may not be a priority to try to retain them. In contrast, we should really do something quickly for the "burned out" ones! The manager should have anticipated the situation and proactively helped those employees to take a step back. Now, for the High Potentials, the situation would be more complex. Those employees have probably been hired by the client. It's probable that we couldn't make them a better offer. We can always try to give them a raise, a promotion, or better projects, but it's always difficult to retain employees who are happy but want to leave nonetheless. Finally, we very often observe a "misc" segment, whatever the situation or the sector. We cannot always explain everything. And some departures may be exogenous, that is, explained by events that are out of our control. We may want to collect additional data to try to understand this specific case better, but since it's a relatively small group, it's not really a priority either. Notice that in all the cases, we start from the business issue: a sub-optimal supply chain design or employees leaving the company. And it's the business issue that drives the type of variables we are interested in. The same goes for how we design our analysis and draw our conclusions. You have to understand that there is no such thing as an "optimal clustering or segmentation approach" in practice. What matters is that you use the right clustering approach for the business problem at hand, and that your conclusions are actionable.
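The HR workflow described above (standardize the variables, cluster, then compare segments) can be sketched in R as follows. This is only an illustration under stated assumptions: the data frame `hr`, its column names, and all its values are invented here and do not come from the course dataset, so the resulting segments should not be read as the ones discussed in the video.

```r
# Hypothetical exit-interview data (made-up names and values).
hr <- data.frame(
  satisfaction    = c(0.45, 0.50, 0.40, 0.10, 0.12, 0.09,
                      0.90, 0.85, 0.92, 0.60, 0.65, 0.55),
  last_evaluation = c(0.50, 0.45, 0.55, 0.90, 0.85, 0.88,
                      0.92, 0.90, 0.95, 0.70, 0.65, 0.72),
  projects        = c(2, 2, 3, 6, 7, 6, 5, 5, 4, 4, 3, 4),
  monthly_hours   = c(140, 135, 150, 280, 290, 275,
                      240, 235, 245, 200, 190, 205),
  tenure_years    = c(2, 3, 2, 5, 4, 5, 5, 6, 5, 3, 4, 3),
  newborn         = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1)
)

# Standardize: subtract each column's mean, then divide by its standard deviation.
hr_std <- scale(hr)

# Hierarchical clustering on Euclidean distances between standardized rows.
hc <- hclust(dist(hr_std))

# Cut the hierarchy into 4 segments and attach the labels to the raw data.
hr$segment <- cutree(hc, k = 4)

# Segment profiles: the average of every variable within each segment.
profiles <- aggregate(. ~ segment, data = hr, FUN = mean)
print(profiles)

# Optional, more systematic check for one variable: a one-way ANOVA.
print(summary(aov(projects ~ factor(segment), data = hr)))
```

Here `profiles` has one row per segment. Comparing each row with the overall averages of the variables is a plain-number version of the color-coded comparison used to name the segments.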