In this video, we discuss adding and removing variables. Oftentimes, it is possible to find other relevant information to add to the data set. In sales forecasting, it is often not sufficient to use internal data; it is common to collect comparison information and macroeconomic information as well. At other times, a data set may contain duplicate information or columns that are not informative. As an example, sales amounts are often kept in multiple currencies in IT systems. However, only the sales amount in one currency should be used in analysis; otherwise, the redundant columns can cause numerical difficulties.

Additional variables can come from several sources. First, data sets are often compiled from online and public data sources, and it is often possible to collect additional information from those sources. Second, it is also possible, and sometimes necessary, to create additional variables from the data, including dummy variables, transformed variables, and interaction terms. We will briefly talk about dummy variables now; transformed variables and interaction terms will be discussed later in the course.

It is common to create dummy variables from a single variable that takes a small number of fixed values. Each of the values is called a category, and the variable is called a categorical variable. From a categorical variable with m categories, where m is an integer, we can create m - 1 dummy variables. Each dummy variable takes the value 0 or 1, where 1 usually indicates yes and 0 indicates no. Let's look at an example. The table on the left shows three property types: single family home, townhouse, and condo. So there are three categories, and m equals 3. We can create two dummy variables, D1 and D2, where D1 indicates whether a property is a single family home and D2 indicates whether it is a townhouse. When both D1 and D2 take the value 0, the property is a condo, which is neither a single family home nor a townhouse.
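The m - 1 dummy encoding described above can be sketched in pandas; the column name and data here are hypothetical stand-ins for the table in the video:

```python
import pandas as pd

# Hypothetical data with the three property types from the example
df = pd.DataFrame({"property_type": ["Single Family Home", "Townhouse", "Condo"]})

# drop_first=True keeps m - 1 = 2 dummy columns; the dropped category
# ("Condo", first alphabetically) becomes the baseline encoded as all zeros
dummies = pd.get_dummies(df["property_type"], drop_first=True, dtype=int)
print(dummies)
```

With this encoding, the row for the condo has 0 in both dummy columns, matching the example in the video.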
This procedure of creating dummy variables can lead to a large number of dummy variables when a categorical variable has many categories, which can be problematic when observations for some categories are sparse. We can systematically reduce the number of categories by combining categories that are close to each other. Going back to the same example, in the original data set there are many categories other than single family home, townhouse, and condo. Depending on the purpose of our analysis, we may choose to combine some categories. For example, condo, co-op, multi-family, and mobile home can all be combined into an "Others" category. In this way, we end up using only two dummy variables, instead of the five we would need if we did not combine categories. I'd like to caution here that how categories should be combined is not always straightforward, and it should be chosen carefully. It is common to combine sparse categories together, but it may also depend on the problem context. Many software packages have built-in functions for this purpose.

As we discussed before, we may want to remove variables from our data set. The first motivation is to simplify the model and the analysis. Some variables may not contain much useful information and therefore do not justify the added complexity of including them in the analysis. There are also cases where some variables contain the same information, and we should keep only one of them. By removing variables and including a smaller set of variables, we end up with more parsimonious models, which also tend to be easier to interpret. Note that some methods do not work well when the distribution of a variable is degenerate, meaning that the variable takes only a single value. Variables with degenerate distributions have to be removed for better performance. A variable with a degenerate distribution is also called a zero variance variable in exploratory data analysis.
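Combining sparse categories into an "Others" bucket can be sketched as follows; the category counts and the set of categories to keep are hypothetical choices for illustration:

```python
import pandas as pd

# Hypothetical property-type column with several sparse categories
types = pd.Series(["Single Family Home"] * 6 + ["Townhouse"] * 4
                  + ["Condo", "Co-op", "Multi-Family", "Mobile Home"])

# Map every category outside the frequent ones into "Others"
keep = {"Single Family Home", "Townhouse"}
combined = types.where(types.isin(keep), "Others")

print(combined.value_counts())
# Dummy-encoding the combined column now needs only 3 - 1 = 2 dummies
```

In practice, the `keep` set would be chosen from category frequencies and problem context rather than hard-coded.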
For example, if everyone in a high school class is born in the same year, then birth year is a zero variance variable and is not a relevant variable in analysis; including it may cause numerical difficulties. It is more common to have near-zero variance variables, where only a very small portion of the data takes a different value. Going back to the same example, if all students in the class are born in the same year except one student, then birth year is a variable with near-zero variance. Keeping zero variance or near-zero variance variables can cause collinearity issues in regression analysis, where coefficient estimates can change erratically when there are small changes to the data. Such behavior is clearly undesirable.
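A simple near-zero variance check can be sketched as follows; the 0.95 threshold on the most-frequent-value ratio is a hypothetical choice (built-in functions such as caret's nearZeroVar in R use similar frequency-based heuristics):

```python
import pandas as pd

# Hypothetical birth-year column: all students share a year except one
birth_year = pd.Series([2005] * 29 + [2006])

# Fraction of observations taking the most common value
top_ratio = birth_year.value_counts(normalize=True).iloc[0]

# Flag the variable as near-zero variance when one value dominates
is_near_zero_var = top_ratio >= 0.95
print(top_ratio, is_near_zero_var)
```

Here 29 of 30 students share the same birth year, so the ratio is about 0.97 and the variable would be flagged for removal.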