In the previous video, you learned that you'll need lots of data for ML models to make predictions. The accuracy of those predictions, however, depends on large volumes of data that are free of bugs. Let me use our software analogy to explain what I mean by bugs. In traditional software development, a bug is a mistake in the code that causes unexpected or undesired behavior. In ML, even though there can be bugs in the implementation of an algorithm, bugs in the data are far more common.

Consider this example. A few years ago, some Googlers wanted to use ML to help diagnose diabetic retinopathy, which is the fastest-growing cause of blindness, potentially affecting more than 415 million diabetic patients worldwide. Working closely with doctors in the US and India, they created an ML model that would diagnose diabetic retinopathy almost as well as ophthalmologists. They trained the model using labeled images of the backs of eyes. Because humans were involved in labeling the images, the labeling process was not completely objective. The data might have included incorrect labels or even human bias, which would then propagate into the ML model itself.

So how would you ensure that you have optimal data quality when training an ML model? The best data has three qualities: it has coverage, it's clean (or consistent), and it's complete. I'll explain each one.

Data coverage refers to the scope of a problem domain and all the possible scenarios it can account for. In other words, all possible input and output data. Let's go back to our manufacturing use case and assume that the car parts are divided into red and blue. If red and blue make up all of the possible scenarios, but you only train your model with red parts, then the model may not be able to detect defects in blue car parts when it's presented with new data. So more data and broader coverage produce a more accurate ML model.

The second quality of good data is its cleanliness. This is sometimes called data consistency.
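To make the coverage idea concrete, here's a minimal sketch in plain Python (not part of the course itself); the function name `coverage_report` and the red/blue data are hypothetical. It counts how many training examples exist for each expected category and flags any category with none:

```python
from collections import Counter

def coverage_report(labels, expected_categories):
    """Count training examples per category and list any expected
    category that has no examples at all (a coverage gap)."""
    counts = Counter(labels)
    missing = [c for c in expected_categories if counts[c] == 0]
    return counts, missing

# Hypothetical training set: only red parts were collected.
counts, missing = coverage_report(["red", "red", "red"], ["red", "blue"])
print(counts)   # Counter({'red': 3})
print(missing)  # ['blue']: the model has never seen a blue part
```

A check like this, run before training, surfaces the "only red parts" problem before the model ever sees new data.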
Data is considered dirty or inconsistent if it includes or excludes anything that might prevent an ML model from making accurate predictions. This is a lot like the errors or bugs we talked about earlier. One example of inconsistency is the data format. If time is an important data point for your ML use case, then you'll need to ensure that the timestamp format is the same across all of your data, both the historical data and any new data you're collecting.

If we use the manufacturing scenario again, inconsistencies in data can appear in several ways. For instance, if your images have shadows in them, the model won't know whether the shadows are part of an object. If you want to predict on images that are supposed to have shadows, that's okay; otherwise, your data is dirty. I mentioned incorrect labels earlier, which are another form of dirty data. In this scenario, you might have parts that were labeled as fractured but were actually discarded because they were the wrong size. Remember the retinal images? There were several images that the doctors couldn't agree on and therefore couldn't label. That was probably because the images themselves had issues: perhaps an image was blurry, or maybe there was a speck of dirt on the lens when the picture was taken. Again, wrong labels or missing labels on these images cause data cleanliness issues that affect the accuracy of the ML model.

There are lots of examples of human error that cause dirty data as well. In sales, for instance, maybe a person entered purchase data incorrectly into a data storage system. Errors in automated services can create dirty data too. In retail stores, for example, maybe a transaction was recorded incorrectly every time the register ran out of paper. The more incorrect or dirty data you have, the more correct and clean data you'll need as a counterbalance, so that the ML model learns the correct outcome.
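The timestamp-consistency point above can be sketched in a few lines of Python. This is an illustrative example, not course material; the formats in `KNOWN_FORMATS` are hypothetical, and any record that matches none of them is treated as dirty:

```python
from datetime import datetime

# Hypothetical timestamp formats seen across data sources.
KNOWN_FORMATS = ["%Y-%m-%d %H:%M", "%d/%m/%Y %H:%M"]

def normalize_timestamp(raw):
    """Try each known format and return one ISO 8601 string, so all
    records share a single consistent representation; None = dirty."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw, fmt).isoformat()
        except ValueError:
            continue
    return None

print(normalize_timestamp("2024-01-05 14:30"))  # 2024-01-05T14:30:00
print(normalize_timestamp("05/01/2024 14:30"))  # same moment, normalized
print(normalize_timestamp("last Tuesday"))      # None: dirty record
```

Normalizing both historical and newly collected data through one function like this is a simple way to keep the format consistent going forward.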
Another quality of good data is completeness. This refers to the availability of sufficient data about the world to replace human knowledge. Think of this as the various statuses, categories, or themes that help complete a user's profile, such as address, gender, or height. The impact of incomplete data is that it can limit the performance of an ML model. Data can be incomplete when there's a lack of better data, when there are mistaken expectations about how ML works and what it's capable of, and when program design and implementation are poorly executed.

Let's go back to our manufacturing example. Imagine that one of the major sources of defects is overheating, but you're not collecting temperature data. That's an example of incomplete data. Even if you start collecting temperature data now, you might not have the historical data that maps to past examples of good and fractured parts. Another form of incomplete data is a lack of cases for all possible scenarios the data is intended to cover. In the same manufacturing example, your goal is to match the labels, meaning "good condition" and "fractured," with every part. If an axle is one of the items you're evaluating for defects, you'll need examples of axles in good condition and of fractured axles. If you don't have that data, your data is incomplete.

Remember, data is the only tunnel through which your model views the world. Anything the model can't see, it assumes doesn't exist. For example, if a model was given an image that only showed what's on the left, it might think the road was open and traffic-free. But if I show you the full image, you'll see that the road is actually closed. The good news is that most of these problems can be solved simply by getting more data. But you have to be purposeful when collecting that data. Do you need to improve coverage, improve cleanliness or consistency, or improve completeness?

Let's use a sample scenario to practice what you've learned.
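The "examples for every scenario" form of completeness can also be checked programmatically. Here's a minimal sketch in plain Python, not part of the course; the function `missing_combinations` and the inspection records are hypothetical. It lists every part-label pair that has no example in the data:

```python
from itertools import product

def missing_combinations(records, parts, labels):
    """Return every (part, label) pair with no example in the data,
    i.e. a scenario the model would never have seen."""
    seen = {(r["part"], r["label"]) for r in records}
    return [pair for pair in product(parts, labels) if pair not in seen]

# Hypothetical inspection data: no examples of fractured axles.
data = [
    {"part": "axle", "label": "good"},
    {"part": "gear", "label": "good"},
    {"part": "gear", "label": "fractured"},
]
print(missing_combinations(data, ["axle", "gear"], ["good", "fractured"]))
# [('axle', 'fractured')]
```

An empty result means every part appears with every label at least once; anything else points to data you still need to collect.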
Suppose you lead a multi-region retail chain and want to use machine learning to predict how many employees you'll need at each branch in a given period, such as between October and mid-January, around the holiday shopping season. Here's what I want you to answer: to accurately predict the number of employees you'll need per branch, what data would you need? How would you collect it? How might you improve its coverage, cleanliness, or completeness?

Here are a few ideas you might have thought of. In this scenario, some of the data points you'll need are: store size, average number of customers in a given time period, number of different departments, number of employees per department, number of self-checkout stations, average number of returns at the store, and average wait time at customer service. If you're the branch owner, you probably already have much of this data, such as the number of stores, employees per department, and self-checkout stations. For the remaining data, you may need to work with a data analyst to combine existing data or to set up a system to collect the data you need, for example, the average number of customers in a given time period, average returns, or average wait times.

When it comes to data quality, you'll want to confirm that the data from each branch is from the same time period. This is consistency. You'll want to make sure there are no empty fields. This is completeness. And you'll want to make sure you've accounted for all possible inputs and outputs for a complete picture of the prediction you want to make. This is coverage.

As you start working on ML projects, be sure to review the data you have and check for any issues. Pay special attention to coverage, cleanliness, and completeness. Remember, data is central to ML. You'll need to account for as many possibilities as you can when preparing your data to train an ML model. In the next video, I'll explain more about making predictions and repeat decisions with ML.