Now that we've looked at the components in the UI, we will discuss the process of building a data pipeline. A pipeline is represented visually as a series of stages arranged in a graph. These graphs are called DAGs, or directed acyclic graphs, because they flow in one direction and cannot feed back into themselves. Acyclic simply means the data never travels in a circle. Each stage is a node, and as you can see here, nodes can be of different types. You may start with a node that pulls data from Google Cloud Storage, then passes it on to a node that parses a CSV. The next node takes multiple nodes as input and joins them together before passing the joined data to two separate data sink nodes. As you saw in our previous example, you can have multiple nodes fan out from a single parent node. This is useful because you may want to kick off another data processing workstream that should not be blocked by any processing on a separate series of nodes. Naturally, you can also combine data from two or more nodes into a single output sink, as you saw before.

In Cloud Data Fusion, the Studio is your user interface where you author and create these new pipelines. I'll highlight some of the major features that you will explore later in your lab. First, the area where you create these nodes and chain them together in your pipeline is your canvas. If you have many nodes in a pipeline, the canvas can get visually cluttered, so use the mini map to help navigate around a huge pipeline quickly. You can interact with the canvas and add objects by using the canvas control panel. When you're ready to save and run the entire pipeline, you can do so with the pipeline actions toolbar at the top. Don't forget to give your pipeline a name and description, and to make use of the many preexisting templates and plugins so that you don't have to write your pipeline from scratch. Here, you see we've used a batch data pipeline template, which gives us the three nodes you see here to move data from a GCS file, process it in Wrangler, and output it to BigQuery.

You should make use of preview mode before you deploy and run your pipeline in production to ensure everything will run properly. While a pipeline is in preview, you can click on each node and see any sample data or errors that you will need to correct before deploying. After deployment, you can monitor the health of your pipeline and collect key summary stats of each execution. Here, we are ingesting data from Twitter and Google Cloud Platform, and parsing each tweet before loading it into a variety of data sinks. If you have multiple pipelines, I recommend you make liberal use of the tags feature to help you quickly find and organize each pipeline for your organization. As you see here, you can view the start time, the duration of the pipeline run, and the overall summary across runs for each pipeline. Again, you can quickly see the data throughput at each node in the pipeline simply by interacting with the node. One last thing you'll notice is the compute profile used in the cloud. Currently, at the time of recording, Cloud Data Fusion supports running on Cloud Dataproc, but Cloud Dataflow support is on the roadmap. Remember, clicking on a node gives you detail on the inputs, outputs, and errors for that given node. Here, we are integrating with the Cloud Speech-to-Text API to process audio files into searchable text.
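If you later want to start a deployed pipeline from a script rather than from the pipeline actions toolbar, one option is to call the REST API that your Cloud Data Fusion instance exposes. The sketch below is a minimal illustration, assuming a batch pipeline deployed as the standard DataPipelineWorkflow program; the instance endpoint, namespace, and pipeline name are placeholders you would replace with your own, and you should confirm the exact path against the CDAP API reference for your instance.

```python
import subprocess

import requests

# Placeholder values for illustration only; substitute your own instance details.
CDAP_ENDPOINT = "https://INSTANCE-PROJECT-dot-REGION.datafusion.googleusercontent.com/api"
NAMESPACE = "default"
PIPELINE_NAME = "gcs-to-bq-batch"  # hypothetical deployed pipeline name

# Get an OAuth access token for the caller (assumes the gcloud CLI is installed
# and authenticated with an account that can reach the Data Fusion instance).
token = subprocess.run(
    ["gcloud", "auth", "print-access-token"],
    capture_output=True, text=True, check=True,
).stdout.strip()

# Start the deployed batch pipeline through the instance's CDAP-style lifecycle endpoint.
start_url = (
    f"{CDAP_ENDPOINT}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE_NAME}"
    "/workflows/DataPipelineWorkflow/start"
)
response = requests.post(start_url, headers={"Authorization": f"Bearer {token}"})
response.raise_for_status()
print(f"Start request for {PIPELINE_NAME} returned HTTP {response.status_code}")
```

The same pattern works for checking on runs after the fact, which is useful if you want to fold pipeline kicks and status checks into an existing orchestration script instead of clicking through the UI each time.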
You can track the individual health of each node and get useful metrics like records out per second, average processing time, and max processing time, which can alert you to any anomalies in your pipeline. You can set your pipelines to run automatically at certain intervals. If your pipeline normally takes a long time to process the entire dataset, you can also specify a maximum number of concurrent runs to help avoid processing data unnecessarily. Keep in mind that Cloud Data Fusion is designed for batch data pipelines. We will dive into streaming data pipelines in future modules.

One of the big features of Cloud Data Fusion is the ability to track the lineage of a given field value. Let's take this example of a campaign field for a DoubleClick dataset and track every transform operation that happened before and after this field. Here, you can see the lineage of operations that are applied to the campaign field between the campaign dataset and the DoubleClick dataset. Note the time this field was last changed by a pipeline run and each of the input fields and descriptions that interacted with the field as part of processing it between datasets. Imagine the use cases: if you have inherited a set of analytical reports and you want to walk back upstream through all of the logic that went into a certain field, well, now you can.
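To make the idea of field-level lineage concrete, here is a small, self-contained Python sketch that is not tied to the Data Fusion API at all: it models lineage as a list of transform operations, each recording which input fields produced an output field, and then walks upstream from a field to print every operation that contributed to it. The field and operation names are invented purely for illustration.

```python
from collections import defaultdict

# Toy lineage records: each operation notes the field it produced and the
# input fields it read. These names are made up for illustration only.
operations = [
    {"name": "read",   "output": "raw_line",    "inputs": []},
    {"name": "parse",  "output": "campaign",    "inputs": ["raw_line"]},
    {"name": "rename", "output": "campaign_id", "inputs": ["campaign"]},
]

# Index operations by the field they produce so we can walk upstream quickly.
produced_by = defaultdict(list)
for op in operations:
    produced_by[op["output"]].append(op)

def walk_upstream(field, depth=0):
    """Print every operation and input field that contributed to `field`."""
    for op in produced_by.get(field, []):
        sources = ", ".join(op["inputs"]) or "source"
        print("  " * depth + f"{field} <- {op['name']}({sources})")
        for parent in op["inputs"]:
            walk_upstream(parent, depth + 1)

# Walking back from campaign_id prints the full chain: rename <- parse <- read.
walk_upstream("campaign_id")
```

Running it prints the chain from campaign_id back through rename, parse, and read, which is the same kind of upstream walk the lineage view performs for you automatically across pipelines and datasets.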