In this video, I introduce the task of extraction, transformation, and loading, or ETL, which is necessary to convert clinical data from a local data model into a format that conforms to a common data model, one that can support large-scale, multi-institutional, real-world data analytics.

ETL developers often refer to the source and target data models. The source data model is a database or data set in its native or local format. The target data model is a database or data set in the final format created by the ETL process. As mentioned, ETL stands for extraction, transformation, and loading. The extraction phase focuses on getting data out of the source database. The transformation phase focuses on the changes needed to convert data elements from the local structures to the common data model structures. The loading phase focuses on getting data into the target database. In this module, we focus only on the transformation step. Some approaches are called ELT, where data are loaded into the target database without any transformation and then transformed into the common data model format. The principles for transformation are the same no matter which sequence, ETL or ELT, is used.

This is the graphical view of the ETL process, which includes additional steps such as validating that the extraction process correctly identified and moved all the data elements into the extraction, and, at the end of the ETL pipeline, applying data quality analytics to the transformed data to ensure that the ETL processes did not corrupt or drop data.

The image in the previous slide implied that ETL is a linear process. In actual practice, however, ETL is highly iterative, as each step in the transformation may reveal insights that alter the previous steps. While the intent is to discover transformation issues as early as possible, it is not uncommon for subtle data issues to become apparent relatively late in the process. Over time, as the transformed data are used, additional unusual situations, often called edge cases, may cause the team to go back to the very beginning and alter the ETL specifications, which in turn can impact all of the subsequent steps in the pipeline.

ETL requires a significant commitment of technical resources. This table, from an open access paper by Tone [INAUDIBLE], a colleague of mine at the University of Colorado, groups 23 ETL challenges into six themes that stretch across all aspects of extraction, transformation, and loading. The decision to transform local data into a common data model requires a long-term, enterprise-level commitment by the participating institution.

ETL processes sit between the source data model and the target data model, illustrated here by a simple arrow connecting the source to the target. One significant technical challenge is the need for ETL developers to have a strong understanding of both the source data model that exists in their institution and the target data model that will be used for real-world analytics. Given the complexity embedded in each table, and in each field within each table, the amount of knowledge needed to do accurate ETL work can be daunting even for an experienced data analyst. Detailed documentation of each data model, along with detailed transformation rules or conventions, can greatly lower this significant technical barrier.
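To make the three phases concrete, here is a minimal sketch in Python. The local table layout, field names, and code values are hypothetical stand-ins, not taken from any particular system, and a real pipeline would query and populate databases rather than in-memory lists:

```python
# Minimal, illustrative ETL sketch. The source "pat_master"-style layout and
# the OMOP-flavored "person" target are hypothetical examples of the pattern.
from dataclasses import dataclass

@dataclass
class SourcePatient:          # source data model: local, native format
    pat_id: str
    birth_dt: str             # e.g. "1957-03-14"
    sex_cd: str               # local codes: "M", "F", "U"

@dataclass
class TargetPerson:           # target data model: common data model format
    person_id: str
    year_of_birth: int
    gender_concept: str       # standardized vocabulary term

# A transformation convention, documented as data rather than buried in code.
GENDER_MAP = {"M": "MALE", "F": "FEMALE", "U": "UNKNOWN"}

def extract() -> list[SourcePatient]:
    # Extraction phase: in practice a database query; here, fixed rows.
    return [SourcePatient("p1", "1957-03-14", "F"),
            SourcePatient("p2", "1983-11-02", "M")]

def transform(rows: list[SourcePatient]) -> list[TargetPerson]:
    # Transformation phase: convert local structures to common model structures.
    return [TargetPerson(r.pat_id, int(r.birth_dt[:4]),
                         GENDER_MAP.get(r.sex_cd, "UNKNOWN"))
            for r in rows]

def load(rows: list[TargetPerson]) -> None:
    # Loading phase: in practice an insert into the target database.
    for r in rows:
        print(r)

source_rows = extract()
target_rows = transform(source_rows)
assert len(target_rows) == len(source_rows)  # basic data quality check:
load(target_rows)                            # no rows dropped in transform
```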
An issue that is often underappreciated is the difficulty of identifying all the ways in which a data element may be represented in the source data model. This example shows 14 different ways in which the concept of height was represented in an actual electronic medical record. The list is sorted by the number of times each representation appears in the source database, shown in the rightmost column. Notice the long tail of low-frequency representations, which may be older elements no longer in use. With a very long tail, the institution may need to decide not to invest in transforming very low-frequency variables that represent only a very small percentage of values.

This is a second example of a long-tail variable. In this case, we are showing the various ways in which patient weight is captured. Again, a very long tail is seen in the rightmost column.

Here you can see the different ways in which blood pressure is measured. Unlike the previous examples, however, the issue here is that not all of the blood pressure measurements in this list would be considered identical for analytic purposes, even though the labels contain the words blood pressure or BP. In this setting, it would be important for a clinical person knowledgeable about blood pressures to work with the programmer to determine which of these variables can be considered identical and which types of pressures need to be kept as separate variables.

This video has focused on a traditional model for ETL, where data are physically removed from the source system, transformed in some staging area, and then loaded into the target database. However, newer methods are emerging that may alter how future common data models are populated. Message-based approaches use very specific messaging formats to ensure that the content of a message is identical irrespective of the source system. Transformation of local data into the message standard is done by the local system. Transformation of the data from the message standard into the common data model can then be done once and applied to all systems that implement the same message standard. Similarly, direct connections using APIs, or application programming interfaces, enable a similar approach: local transformation into the API standard, followed by a single universal transformation from the API standard into the common data model format. We anticipate that both messaging and API-based approaches will grow as more specific health data standards are developed.
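Returning to the long-tail examples above, the sketch below shows how a frequency threshold might drive the decision of which representations to map. The labels, counts, and threshold are invented for illustration:

```python
# Illustrative consolidation of a long-tail variable. The labels and counts
# are invented; real mappings come from profiling the source database and
# are reviewed by the ETL team.
height_labels = {            # source label -> row count in source database
    "HEIGHT": 1_250_000,
    "Height (cm)": 310_000,
    "HT": 42_000,
    "Pt Height - Standing": 1_800,
    "height_old_form": 27,   # long tail: likely a retired form field
}

MIN_ROWS = 1_000             # convention: skip representations below this

mapped, skipped = [], []
for label, n in height_labels.items():
    (mapped if n >= MIN_ROWS else skipped).append(label)

print("map to standard height concept:", mapped)
print("document as unmapped (too rare):", skipped)
```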
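For the blood pressure example, the decision is semantic rather than frequency-based, so the mapping itself has to be curated with clinical input. A minimal sketch, with invented labels, might look like this:

```python
# Sketch of a clinician-curated mapping for blood pressure labels. Unlike the
# height example, some labels denote the same measurement and can be merged,
# while others are related but distinct and must stay separate.
BP_MAP = {
    "BP Systolic": "systolic_bp",
    "Systolic Blood Pressure": "systolic_bp",    # same measurement: merge
    "BP Diastolic": "diastolic_bp",
    "Diastolic Blood Pressure": "diastolic_bp",  # same measurement: merge
    "Mean Arterial BP": "mean_arterial_pressure",  # distinct: keep separate
}

def harmonize(label: str) -> str | None:
    # Returns the harmonized variable, or None if the label needs review.
    return BP_MAP.get(label)

print(harmonize("Systolic Blood Pressure"))  # -> systolic_bp
print(harmonize("BP - cuff, sitting"))       # -> None: route to a clinician
```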
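Finally, the "transform once" idea behind message- and API-based approaches can be sketched as a single function written against the standard rather than against any one source system. The message shape below loosely imitates a FHIR Observation, but the field names are simplified assumptions, not the actual standard:

```python
# Sketch of a single universal transformation from a standard message format
# into a common data model row. Any source system that can emit the standard
# gets this transformation for free; only the local-to-standard step varies.
standard_message = {
    "resourceType": "Observation",
    "subject": "patient/123",
    "code": "8302-2",        # a LOINC-style code for body height
    "value": 172.0,
    "unit": "cm",
}

def message_to_cdm(msg: dict) -> dict:
    # Written once against the message standard, not against any source system.
    return {
        "person_id": msg["subject"].split("/")[-1],
        "concept_code": msg["code"],
        "value_as_number": msg["value"],
        "unit": msg["unit"],
    }

print(message_to_cdm(standard_message))
```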