In this video, we discuss two related topics in data exploration: data cleanup and data transformation. Here are some common questions to ask when you clean up data. Do character variables have valid values? Are numeric variables within range? Are there missing values? Are there duplicated values? Are values unique for variables that should be unique, for example, ID variables? Are the dates valid? Do we need to combine multiple data files? It is important to check these questions and handle them appropriately in the data cleanup process. While some of these issues are straightforward to handle, others can be highly non-trivial. As an example, handling missing values usually requires a good understanding of the problem context and can take many alternative approaches. We discuss it in a separate video. There are many commercial and open-source tools for data cleanup. Two very popular open-source tools are OpenRefine and Data Wrangler. These tools can greatly help with many common data exploration tasks. I encourage you to explore their websites to learn more. Next, we move on to data transformation, which usually means applying a mathematical function to each data value. Perhaps the most common data transformation is centering and scaling of a single variable. For those of you with a statistics background, you can think of it as calculating the z-score of each observed value. That is, each data value is first reduced by the mean and then divided by the standard deviation. One immediate benefit of centering and scaling is that it makes numerical procedures easier to work with and more stable, because it puts multiple variables in a dataset on a common scale. Centering and scaling is often required or recommended for some modeling tools, such as clustering, principal component analysis, and neural networks. The main drawback is that the data becomes harder to interpret.
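As a quick sketch of centering and scaling in Python (the data values here are made up purely for illustration):

```python
import numpy as np

# A small, hypothetical numeric variable
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Center by the mean, then scale by the standard deviation (the z-score)
z = (x - x.mean()) / x.std()

# After the transformation, the variable has mean 0 and standard deviation 1
print(z.mean(), z.std())
```

Note that NumPy's `std` uses the population standard deviation by default; for the sample standard deviation you would pass `ddof=1`, which changes the scale slightly but not the idea.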
After centering and scaling, each data value measures the number of standard deviations between that data point and the mean. There are many other data transformations. Some of them can be expressed using common mathematical functions such as logarithm, square, square root, and inverse. Except for the first one, all transformations mentioned here are power transformations, because they raise the original data value to a power. Different transformations are appropriate for different problem contexts. Sometimes we have to experiment to figure out the right one to use. Each of these transformations can be easily done in Excel and other data analysis tools. There are also ways to automatically determine the appropriate data transformation from the data itself. An example is the Box-Cox transformation, which is defined using a parameter called lambda. Different lambda values give different transformations. When lambda is 0, the transformation is simply a logarithm transformation. When lambda is not 0, the transformed value is the original value raised to the power lambda, offset by 1, and divided by lambda. For example, when lambda = 2, it is effectively a square transformation. Note here that offsetting by 1 and dividing by lambda does not change the shape of the distribution of the variable. Similarly, when lambda = 1/2, it is a square root transformation, and when lambda = -1, it is the inverse transformation. Lambda can be estimated from the data, and can take many different values. In this way, the Box-Cox transformation covers all the transformations discussed before. The transformations we have discussed so far work on a single variable. This should be contrasted with data reduction, which reduces the data by generating a smaller set of variables. The purpose of data reduction is to use a smaller set of variables to capture most of the information in the original variables.
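The Box-Cox transformation, including the estimation of lambda from the data, is available in SciPy as `scipy.stats.boxcox`. A short sketch (the input data is generated, not real, and Box-Cox requires strictly positive values):

```python
import numpy as np
from scipy import stats

# Hypothetical positive, right-skewed data for illustration
rng = np.random.default_rng(0)
y = rng.lognormal(mean=0.0, sigma=1.0, size=500)

# With no lambda supplied, boxcox estimates lambda from the data
# by maximum likelihood
y_transformed, lam = stats.boxcox(y)
print("estimated lambda:", lam)

# The definition for lambda != 0: (y**lambda - 1) / lambda
manual = (y**lam - 1) / lam
print("matches definition:", np.allclose(y_transformed, manual))
```

Because the sample here is lognormal, the estimated lambda should come out near 0, which corresponds to the logarithm transformation.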
A widely used data reduction technique is principal component analysis, which seeks weighted averages of the variables that capture most of the information, as measured by variance, in the data. The newly generated weighted averages are known as principal components. The principal components are uncorrelated, and ideally a few of them are able to capture most of the variance in the data. An important note here is that we need to scale the variables before applying principal component analysis, so that the principal components are not dominated by variables that are much larger in scale. Principal component analysis will be discussed in more detail in the third course of this specialization.
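A sketch of the scale-then-reduce workflow in Python, using scikit-learn's `StandardScaler` and `PCA` on made-up data (the dataset below is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset: 200 observations of 4 variables,
# three of which are strongly correlated with each other
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 1))
X = np.hstack(
    [base + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)]
    + [rng.normal(size=(200, 1))]
)

# Scale first, so no variable dominates simply because of its units
X_scaled = StandardScaler().fit_transform(X)

# The transformed columns are the principal components
pca = PCA()
scores = pca.fit_transform(X_scaled)

# Fraction of the total variance captured by each component;
# with three correlated variables, the first component dominates
print(pca.explained_variance_ratio_)
```

Checking `np.cov(scores.T)` confirms the components are uncorrelated: the off-diagonal covariances are zero up to numerical precision.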