In this video, we talk about how to deal with outliers in data exploration. For a single variable, an outlier is an observation far away from the other observations. Obviously, "far away" is a relative term, and there is no consensus definition of an outlier. Perhaps the most commonly adopted definition is based on the distance between each data point and the mean. Usually, any observation more than three standard deviations above or below the mean is considered an outlier. But there are also other definitions based on statistical tests, nearest neighbors, and interquartile ranges. Software packages sometimes follow different conventions when dealing with outliers, so it is helpful to understand the specific implementation in the software tool you are using.

For a pair of variables, an outlier is a point that falls outside of the overall pattern in the relationship, which can be examined using a scatter plot. Consider the scatter plot shown here. The red point in the lower right corner is an outlier in the graph. It is interesting to point out that this point is not an outlier when we examine either the x variable or the y variable alone, because it is in the middle of the range for both variables. Nonetheless, it is clear that this point requires some attention, as it deviates dramatically from the other points on the graph.

Two important questions should be asked when outliers are spotted. First, is the outlier a mistake or a legitimate point? Sometimes outliers are simply caused by data recording errors. In other cases, outliers are legitimate observations. The second question is whether the outlier is part of the population of interest. Depending on the answer to this question, we can decide whether the outlier should be included in our analysis.

What should we do about outliers? There is no single solution here. If an outlier is the result of a data recording error, we should correct the error if possible.
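As a brief aside on the detection rules described earlier, here is a minimal sketch of two of them for a single variable: the three-standard-deviation rule and an interquartile-range rule. The function names, the sample data, and the multiplier `k=1.5` are illustrative assumptions, not something taken from the video. The multiplier of 1.5 is a common convention, and other tools may use different cutoffs or different quartile methods.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) > threshold * sd]

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR (k=1.5 is an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default method is "exclusive"
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical data: twenty ordinary observations plus one extreme value.
data = [10, 11, 12, 13, 12, 11, 10, 12, 13, 11] * 2 + [95]

print("three-sigma rule:", zscore_outliers(data))
print("IQR rule:", iqr_outliers(data))
```

Note that the three-standard-deviation rule can struggle in small samples, because the outlier itself inflates both the mean and the standard deviation. The quartile calculation also depends on the method chosen (`statistics.quantiles` accepts `method="exclusive"` or `method="inclusive"`), which is one example of software following different conventions.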
If an outlier is outside of the population of interest, we should simply remove it from further analysis. One should be cautious when removing outliers, as removing them can sometimes dramatically change the results of subsequent analysis.

Outliers are not always outside of the population of interest. Sometimes they are actually the main focus of our analysis. As an example, it can be argued that all successful startups are outliers, as the success rate of startups is very low. However, in many analyses we are interested only in successful startups.

Data transformation can sometimes eliminate outliers as well. A data point might be an outlier on a regular linear scale, but it may no longer be an outlier if we apply a logarithmic transformation. It is also possible to treat outliers as missing data. For example, this is appropriate when we have data recording errors but do not know how to correct them.

Some analytical tools are more robust when dealing with outliers than others. For summary statistics, it is well known that the median is more resistant to outliers than the mean. Some robust tools deal well with outliers because they are designed not to be influenced by extreme observations. Many commonly used analytical tools have robust counterparts. We can also use cross-validation, which we will discuss in a separate video, to decide whether outliers should be included in the analysis or not.
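To make two of the points above concrete, the following sketch compares how much the mean and the median move when one extreme value is added, and then shows a value that is flagged on a linear scale but not after a logarithmic transformation. All data here are made up for illustration, and the helper `iqr_outliers` with its `k=1.5` cutoff is an assumed convention rather than a rule stated in the video.

```python
import math
import statistics

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR (assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# 1. Mean versus median: add one extreme observation to a small sample.
clean = [20, 22, 25, 27, 30]
with_outlier = clean + [1000]

mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)
print("shift in mean:", mean_shift)
print("shift in median:", median_shift)

# 2. Log transformation: strongly right-skewed hypothetical data.
skewed = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
logged = [math.log2(v) for v in skewed]

print("flagged on linear scale:", iqr_outliers(skewed))
print("flagged on log scale:", iqr_outliers(logged))
```

In this made-up sample, the mean moves by a large amount while the median barely changes, and the largest value stops being flagged once the data are on a log scale. Whether a transformation is appropriate still depends on the analysis at hand.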