Part of the challenge and opportunity of clinical data science is developing methods to repurpose data that were collected for entirely different reasons to improve the health and care of other patients. In the traditional research model, we had to take our study question, design and get approval for a study to answer that specific question, and then go through the laborious process of setting up the study and recruiting patients. On the bright side, in that model we retained complete control over what data were collected, when, and by what method. As a result, our analyses could be more straightforward and our datasets relatively clean, with few missing values. Now, though, there is a treasure trove of data that we can use to answer questions as they come to us. The limitation is that these data are messy and shaped by many factors outside of our control. They are not carefully curated and may be missing the data you need to answer a particular question. Understanding why the data you want are not available is critical to doing high-quality analysis with these data. In other videos, we're going to talk about the most common reasons for missing data in electronic health records.

But first, I want to highlight one general limitation of using EHRs for research at all. As you've seen in Identifying Patient Populations and in Predictive Modeling and Transforming Clinical Practice, most of the algorithms we develop are evaluated against manual review of the EHR to determine the truth about a patient's status. But really, this is the recorded truth about the patient: simply whether, and how, the medical record contains the data of interest. That is different from the absolute biologic and physiologic truth about what is happening with the patient's health.

For example, a patient may be hypertensive and experience no symptoms. Even if they start to develop symptoms like headaches, the patient may or may not choose to talk to their doctor about them. Even if they tell their doctor about the headaches, the provider may or may not really hear that the headaches are an issue for the patient; if the patient comes in with a lot of other health problems, for example, the headaches may get missed or seem less important than everything else. Even if the doctor does consider the patient's headache, they may or may not choose to document it in their clinical note or use it as the primary diagnosis for the visit. In other words, the absolute truth gets filtered through many different layers before it becomes the recorded truth. Most importantly, our algorithmic approaches and implementations create yet another layer of filtering to generate our detected truth: in other words, whom the algorithm decides is a case or a control. As you move forward in the field, it's important to be aware of how far from the truth you are and to think through the impact of that distance on your analyses.

Now, you may be thinking: if it's this complex and our algorithms are so far from the truth, why do we even do this? Well, let's go through an example. This example comes from a paper by Dr. Richard Tannen at the University of Pennsylvania, where his group tried to use electronic health records to replicate the Nurses' Health Study hormone replacement therapy findings.
Even starting with an initial group of over 900,000 women in the right age range for the analyses, after all the inclusion, exclusion, and study protocol steps, fewer than 7,000 cases and 18,000 controls could be analyzed. Put simply, even for traditional clinical research, we need large numbers of patients to find statistically meaningful outcomes. Using EHRs, especially in large hospital or research networks, is one way to have enough people to detect important effects that might otherwise be undiscoverable.
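To make that attrition concrete, here is a minimal sketch in Python (pandas) of how sequential inclusion and exclusion criteria whittle an EHR-derived cohort down to the analyzable group. It is not taken from the Tannen paper; the column names, criteria, and thresholds are hypothetical, purely to illustrate the pattern.

```python
# Hypothetical illustration of cohort attrition: each inclusion/exclusion
# step is applied in sequence and the remaining count is reported.
# Column names and criteria are assumptions, not the actual study protocol.
import pandas as pd

def apply_attrition(cohort: pd.DataFrame) -> pd.DataFrame:
    steps = [
        ("Age 50-79 at index date",
         lambda df: df["age_at_index"].between(50, 79)),
        ("At least 1 year of records before index",
         lambda df: df["lookback_years"] >= 1),
        ("Hormone therapy exposure status recorded",
         lambda df: df["hrt_exposure"].notna()),
        ("No prior cardiovascular event",
         lambda df: ~df["prior_cv_event"]),
    ]
    remaining = cohort
    print(f"Starting cohort: {len(remaining):,} women")
    for label, criterion in steps:
        # Keep only the patients who satisfy this step's criterion
        remaining = remaining[criterion(remaining)]
        print(f"After '{label}': {len(remaining):,} remain")
    return remaining
```

Each step can eliminate a large fraction of patients, which is why a starting pool of hundreds of thousands of records can still leave only a few thousand analyzable cases and controls by the end.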