In this video we discuss two related topics in data exploration: data cleanup and transformation.

Here are some common questions to ask when you clean up data. Do character variables have valid values? Are numeric variables within range? Are there missing values? Are there duplicate values? Are values unique for variables that should be unique, for example ID variables? Are the dates valid? Do we need to combine multiple data files? It is important to check these questions and handle them appropriately in the data cleanup process. While it is straightforward to handle some of these issues, dealing with others can be highly nontrivial. As an example, handling missing values usually requires a good understanding of the problem context and can take many alternative approaches. We'll discuss it in a separate video.

There are many commercial and open source tools for data cleanup. Two very popular open source tools are OpenRefine and Data Wrangler. These tools can greatly help with many common data exploration tasks. I encourage you to explore their websites to learn more.

Last, we move on to data transformation, which usually means that we apply a mathematical function to each data value. Perhaps the most common data transformation is centering and scaling of a single variable. For those of you with a statistics background, you can think of it as calculating the z-score of each observed value: each data value is first reduced by the mean and then divided by the standard deviation. One immediate benefit of centering and scaling is that it makes numerical procedures easier to work with and more stable, because it ensures that multiple variables in a data set are on a common scale. Centering and scaling is often required or recommended for modern tools such as clustering, principal component analysis, and neural networks. The main drawback is that the data becomes harder to interpret.
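As a quick illustration, here is a minimal sketch of centering and scaling in Python with NumPy. The income figures are made up for illustration; they are not from any data set used in this course.

```python
import numpy as np

# Hypothetical data: annual incomes in dollars, on a scale far
# larger than a variable like age would be.
incomes = np.array([42_000.0, 55_000.0, 61_000.0, 48_000.0, 75_000.0])

# z-score: subtract the mean, then divide by the standard deviation.
z = (incomes - incomes.mean()) / incomes.std()

# The transformed values have mean 0 and standard deviation 1, and
# each value reads as "number of standard deviations from the mean".
```

Applying the same recipe to every numeric variable puts them all on a common, unitless scale.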
The data value after centering and scaling reflects the number of standard deviations between each point and the mean, and is unitless.

There are many other data transformations. Some of them can be expressed using common mathematical functions such as the logarithm, square, square root, and inverse. Except for the first one, all the transformations mentioned here are power transformations, because they raise the original data value to a power. Different transformations are appropriate for different problem contexts, and sometimes we have to experiment to figure out the right one to use. Each of these transformations can easily be done in Excel and other data analysis tools.

There are also ways to automatically determine an appropriate data transformation from the data. An example is the Box-Cox transformation, which is defined using a parameter called lambda. Different lambda values give different transformations. When lambda is zero, the transformation is simply a logarithm transformation. When lambda is not zero, the data value is raised to the power lambda, offset by 1, and divided by lambda. For example, when lambda equals 2, it is effectively a square transformation; note here that offsetting by 1 and dividing by lambda does not change the shape of the variable's distribution. Similarly, when lambda equals one-half, it is a square root transformation, and when lambda equals -1, it is an inverse transformation. Lambda can be estimated from the data itself and can take many different values, so this one transformation covers all the transformations discussed before.

The transformations we have discussed so far work on a single variable. This should be contrasted with data reduction, which reduces the data by generating a smaller set of variables. The purpose of data reduction is to use a smaller set of variables to capture most of the information in the original variables.
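The Box-Cox formula above can be sketched directly in Python with NumPy. This is an illustrative implementation of the definition, not a replacement for a library routine; in practice, a function such as scipy.stats.boxcox can also estimate lambda from the data.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transformation: log(x) when lambda is 0,
    otherwise (x**lambda - 1) / lambda. Requires positive x."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log(x)
    return (x**lam - 1) / lam

x = np.array([1.0, 2.0, 4.0, 8.0])

# lambda = 0 reduces to the logarithm transformation.
log_version = box_cox(x, 0)

# lambda = 2 is effectively a square transformation
# (offsetting by 1 and dividing by 2 only shifts and rescales).
square_version = box_cox(x, 2)

# lambda = 0.5 is effectively a square root transformation.
sqrt_version = box_cox(x, 0.5)
```

Because lambda is continuous, intermediate values give transformations between these familiar special cases.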
A widely used data reduction technique is principal component analysis, which seeks to find weighted averages of the variables that capture most of the information, as measured by variance, in the data. The newly generated weighted averages are known as principal components. The principal components are uncorrelated, and ideally a few of them are able to capture most of the variance in the data. An important note here is that we need to scale the variables first before applying principal component analysis, so that the principal components are not dominated by variables that are much larger in scale. Principal component analysis will be discussed in more detail in the third course of this specialization.
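The scale-first-then-reduce workflow can be sketched as follows in Python with NumPy, using the singular value decomposition to compute the components (the data here is randomly generated for illustration; details of the method are covered later in the specialization).

```python
import numpy as np

# Toy data: 100 observations of 3 variables on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])

# Step 1: center and scale each variable, so that no variable
# dominates the components merely because of its larger scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: principal components via the SVD of the scaled data.
# Rows of Vt hold the weights; each score column is a weighted
# average of the (scaled) original variables.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T

# Fraction of total variance captured by each component,
# sorted from largest to smallest.
explained = S**2 / (S**2).sum()
```

Keeping only the first few columns of `scores` gives the smaller set of variables that captures most of the variance in the original data.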