In this video, I would like to move on to an important topic in exploratory data analysis: dealing with missing values. Data sets often come with missing values, where no data is available for a particular row or column. In many cases, we cannot simply throw away records with missing values, because we need enough data for a meaningful analysis. We often keep them and fill in the values with some smart estimates, in which case it is important to minimize biases or distortions. Furthermore, missing values may themselves be informative. That is, the fact that a data point is missing can have high predictive power. Let's look at a concrete example. The table on the slide shows daily sales data for a newspaper at a newsstand. The first column lists the dates, and the second column shows the number of copies sold each day. Curiously, there are no sales recorded on March 27th. This missing value can cause difficulty when the modeling approach requires a data value for each column and row. For example, this is the case for linear regression. Although it is straightforward to remove all records with missing values, doing so can lead to a significant loss of data. The issue is especially severe for data sets with many columns. Even if only a small fraction of values is missing in each column, there will be many rows with at least one missing value. As a result, too many rows will be removed in the process. The graph on this slide is a line plot of the sales data, where the x-axis is the date and the y-axis is the number of copies sold. The plot shows significant variation in daily sales, with the highest value of 50 on March 28th. There is no sales record on March 27th. We have a missing value on that day, which is filled in with a value of 0 on the graph. The sales data appear to follow a pattern. The value is high initially, gradually decreases over the next few days, and then increases again, and the pattern then repeats.
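To put a rough number on the earlier point about removing rows, here is a minimal sketch. It assumes, purely for illustration, that every column is missing independently at the same rate; the function name and the 2% rate are hypothetical and not part of the lecture.

```python
def complete_row_fraction(missing_rate: float, n_columns: int) -> float:
    """Expected fraction of rows that have no missing value in any column.

    Assumption: each column is missing independently at the same rate.
    """
    return (1.0 - missing_rate) ** n_columns


# With 2% of values missing per column, see how many rows survive
# when every row with at least one missing value is removed.
for n_columns in (1, 10, 50):
    kept = complete_row_fraction(0.02, n_columns)
    print(f"{n_columns:>2} columns: about {kept:.0%} of rows are complete")
```

Under these assumptions, a data set with 50 columns keeps only about 36% of its rows, even though each individual column is 98% complete.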
This slide shows the same graph with weekday information, which turns out to be quite useful. It is apparent from this graph that sales follow a weekly pattern. Sales are highest on Mondays, then gradually decrease over the next few days, but tick up over the weekend. Note that the missing value on March 27th is quite disruptive to this pattern. Without a properly imputed value, the visual impression can be misleading. Suppose that we would like to use this data for sales forecasting, a common business task. Will our forecast be brought down by the zero sales value on March 27th? This again shows the importance of dealing with missing values as a data preprocessing step in predictive modeling. What happened on March 27th? There are many possible causes of missing values. Sometimes, a missing value is simply a result of negligence in data recording: we may forget to record a value. In this particular situation, however, after some investigation, we discover that March 27th was Easter Sunday, and the newsstand was closed. Although we have a perfect understanding of what caused the missing value on March 27th, we still need to decide what to do about it. If we include zero sales in our data set, it will certainly distort our sales forecast. There are many possible ways to deal with missing values, and here we discuss a few of them. The first one is to simply remove the data. As we mentioned earlier, however, this is not always feasible, as we may throw away too much data. The second approach is to impute, or guess, a value. We can fill in the missing value with zero, with average sales, or with a smart guess from some interpolation. For example, we can use sales on the same day last year to fill in the value. In general, we can use observations from similar data points to intelligently guess the value. Finally, we can also make "missing" its own category. Such an approach is typically more appropriate for categorical data.
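The options just listed can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the values 50 and 40 echo the example, but every other sales number, the year 2016, the `payment_method` column, and all function names are made up here, so the results will not match the figures on the slide.

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical daily sales; None marks the missing value.
# Only 50 (March 28th) and 40 (April 3rd) echo the example; the rest is made up.
sales = {
    date(2016, 3, 26): 38,
    date(2016, 3, 27): None,  # missing: the newsstand was closed
    date(2016, 3, 28): 50,
    date(2016, 3, 29): 42,
    date(2016, 3, 30): 35,
    date(2016, 3, 31): 30,
    date(2016, 4, 1): 28,
    date(2016, 4, 2): 37,
    date(2016, 4, 3): 40,
}


def drop_missing(sales):
    """Option 1: remove records with a missing value."""
    return {day: value for day, value in sales.items() if value is not None}


def same_weekday_estimate(sales, day, max_weeks=4):
    """Option 2 (one variant): borrow the nearest observed value that falls
    on the same weekday, looking a whole number of weeks before or after."""
    for weeks in range(1, max_weeks + 1):
        for offset in (-weeks, weeks):
            value = sales.get(day + timedelta(weeks=offset))
            if value is not None:
                return value
    return None  # no similar day observed within the search window


easter = date(2016, 3, 27)
observed = drop_missing(sales)

# Option 2 (another variant): impute with the average of all observed days.
mean_estimate = mean(observed.values())
weekday_estimate = same_weekday_estimate(sales, easter)
print(f"mean estimate: {mean_estimate}, same-weekday estimate: {weekday_estimate}")

# Option 3: for categorical data, make "missing" its own category.
payment_method = ["cash", None, "card", "cash"]  # hypothetical column
with_category = ["missing" if p is None else p for p in payment_method]
```

The mean estimate draws on every observed day, while the same-weekday estimate rests on a single similar day, so the two can disagree.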
Let's return to our little example and try to impute a value for March 27th. Filling in the value with zero, as is already done, is probably not a good approach. Had the newsstand not been closed for Easter Sunday, it is quite unlikely that sales would have been zero. What about filling in the value with the average sales across all dates? If we do that, we get a value of 37.23, which seems to be a reasonable value to use. Yet another approach is to use some other form of interpolation. Since our data demonstrates a weekly pattern, we can use sales on the last Sunday as an estimate, in which case we will fill in the value 40. It can be argued that either of the two values, 37.23 and 40, is a reasonable estimate for the sales on March 27th. The first one, 37.23, uses more data to come up with the estimate, and therefore may be more reliable. The second one, 40, takes advantage of the weekly pattern in the data. However, it uses only one data point to estimate the value. If the sales on Sunday, April 3rd were exceptionally high or low for reasons we do not know, the same bias will certainly be carried over to this estimate. Missing values should be contrasted with censored values, which are partially observed values and therefore are not accurate. Censored values also need to be carefully treated in exploratory data analysis. Going back to our little example, the sales on Monday, March 28th are 50, which is also the largest sales number in our data set. After talking to the store manager, we discovered that only 50 copies of the paper were available for sale on that day. This data point is a censored value. The sales on this day are not fully observed because we ran out of inventory. A sales record of 50 suggests that sales could have been at least 50. However, we do not know exactly what they would have been. Now let me briefly recap our discussion. We talked about the importance of dealing with missing values. There is no single accepted solution.
It is often helpful to consider the problem context and dig deeper in order to understand the causes of missing values and plan your remedies. It is important to minimize biases or distortions when dealing with missing values. We often cannot simply throw away records with missing values, because we need enough data for a meaningful analysis. The pattern of missing values can sometimes carry important information and be highly predictive. Many software packages provide a wide range of options for handling missing values. It is important to understand and choose the right options, since how missing values are handled can have a dramatic impact on the modeling outcome.
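One way to keep the information carried by missing and censored values is to record explicit flags next to the filled-in data. The sketch below is only an illustration: the record layout, the `stock` field, the fill value, and the function name are all assumptions introduced here, not something shown in the lecture.

```python
# Hypothetical records: copies sold and copies available for sale that day.
# None marks a missing sales value; the stock figures are assumed to be known.
records = [
    {"sold": 38, "stock": 60},
    {"sold": None, "stock": 60},  # no sales recorded
    {"sold": 50, "stock": 50},    # sold out
]


def add_flags(record, fill_value):
    """Fill a missing sales value and keep track of why the value is imperfect."""
    sold = record["sold"]
    was_missing = sold is None
    # A sale that reaches the available stock is only a lower bound on demand.
    is_censored = (not was_missing) and sold >= record["stock"]
    return {
        "sold": fill_value if was_missing else sold,
        "was_missing": was_missing,
        "is_censored": is_censored,
    }


# The fill value of 40 is an arbitrary choice for this illustration.
flagged = [add_flags(record, fill_value=40) for record in records]
for row in flagged:
    print(row)
```

A downstream model can then use `was_missing` as a feature in its own right, and treat rows with `is_censored` set as lower bounds instead of exact observations.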