Now that we have built our machine learning model, let's operationalize it. We are at the final stage of operationalizing the model. This actually consists of a few steps: creating our data processing pipeline, training our model on ML Engine, which will subsequently allow us to easily serve the model to end users via a REST API, and finally deploying an app on App Engine that will allow end users to neatly consume our predictions. Right now, however, we'll be focusing on creating our two datasets for training and evaluation using the logic that we developed earlier in this course: repeatable splitting with hashing and the modulo operator.

Let's talk about how to productionize ML pipelines elastically with Cloud Dataflow. As you will see, the key benefits of using Cloud Dataflow are that it allows us to process and transform large amounts of data in parallel, and that it supports both streaming and batch jobs. Thus, you can use the same logic for both training your model and serving it. There are two key terms here: Apache Beam and Cloud Dataflow. First, there's Apache Beam, which is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and runners for executing them on distributed processing backends. Then there's Dataflow, which executes the code you write using the Apache Beam API.

Why would you want to use Cloud Dataflow? One way to think about feature preprocessing, or really any data transformation, is to think in terms of pipelines. Here, when I say pipeline, I mean a sequence of steps that changes data from one format to another. So, suppose you have some data in a data warehouse like BigQuery. You can use BigQuery as an input to your pipeline, apply a sequence of steps to transform the data, maybe introduce some new features as part of the transformation, and finally save the result to an output like Google Cloud Storage. Now, Google Cloud Dataflow is a platform that allows you to run these kinds of data processing pipelines. Dataflow can run pipelines written in the Python and Java programming languages. Dataflow sets itself apart as a platform for data transformations because it is a serverless, fully managed offering from Google that allows you to execute data processing pipelines at scale. As a developer, you don't have to worry about managing the size of the cluster that runs your pipeline. Dataflow changes the amount of compute resources, the number of servers that will run your pipeline, elastically, all depending on the amount of data that your pipeline needs to process.

One thing that makes writing Apache Beam pipelines easier is that the code written for Beam is similar to how people think about data processing pipelines. Take a look at the pipeline in the center of the slide. This sample Python code analyzes the number of words in lines of text in documents. As an input to the pipeline, you may want to read text files from Google Cloud Storage. Then you transform the data, figuring out the number of words in each line of text. This kind of transformation can be automatically scaled by Dataflow to run in parallel. Next, in your pipeline, you can group lines by the number of words using grouping and other aggregation operations. You can also filter out values, for example, to ignore lines with fewer than ten words. Once all the transformation, grouping, and filtering operations are done, the pipeline writes the results to Google Cloud Storage.
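Here is a minimal sketch of what such a word-count pipeline might look like in the Apache Beam Python SDK. This is not the exact code from the slide; the bucket paths and step names are placeholders for illustration.

```python
import apache_beam as beam

p = beam.Pipeline()  # with no options, this defaults to the local DirectRunner

(p
 | 'ReadLines'   >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')  # read text from GCS
 | 'CountWords'  >> beam.Map(lambda line: len(line.split()))            # words per line
 | 'LongLines'   >> beam.Filter(lambda n: n >= 10)                      # ignore lines with fewer than ten words
 | 'PairWithOne' >> beam.Map(lambda n: (n, 1))                          # key each line by its word count
 | 'GroupAndSum' >> beam.CombinePerKey(sum)                             # how many lines have each word count
 | 'Format'      >> beam.Map(lambda kv: '{} words: {} lines'.format(*kv))
 | 'Write'       >> beam.io.WriteToText('gs://my-bucket/output/line_stats'))

p.run()  # nothing executes until run() is called
```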
Notice that this implementation separates the pipeline definition from the pipeline execution. All of the steps that you see before the call to the p.run method are just defining what the pipeline should do. The pipeline actually gets executed only when you call the run method.

One of the coolest things about Apache Beam is that it supports both batch and streaming processing using the same pipeline code. In fact, the library's name, Beam, comes from a contraction of batch and stream. So, why should you care? Well, it means that regardless of whether your data is coming from a batch data source like Google Cloud Storage or from a streaming data source like Pub/Sub, you can reuse the same pipeline logic. You can also output data to both batch and streaming destinations, and you can easily change these data sources in the pipeline without having to change the logic of your pipeline implementation. Here's how. Notice in the code on the screen that the read and write operations are done using beam.io methods. These methods use different connectors; for example, the Pub/Sub connector can read the content of messages that are streamed into the pipeline. Other connectors can read raw text from Google Cloud Storage or a file system. The Apache Beam API has a variety of connectors to help you use services on Google Cloud like BigQuery. Also, since Apache Beam is an open-source project, companies can implement their own connectors.

Here is boilerplate code that you can use to execute a Beam pipeline that reads from BigQuery, transforms the data a bit, and writes to Google Cloud Storage as a CSV. So, you'll see we're importing the Apache Beam library. Then we define a transform, which is a user-defined Python function that takes a row, applies a transformation to it, in this case multiplying column a by column b to get c, and then returns the result as a CSV line via the yield statement. Once we've defined our user-defined function, we create our Beam pipeline. You'll notice that it takes arguments; these specify different parameters of our pipeline, such as what our runner will be. In this case, we're using the Dataflow runner. Below that, we define the SQL query that reads from our source table. Then we apply three operations, separated by the pipe delimiter: reading our data from BigQuery, applying our Python transformation to it, and writing the results out as CSV files on Google Cloud Storage. Finally, once we've built our pipeline, we can execute it with p.run.

Once you have created your pipeline, it is easy to execute it. Simply running main() runs the pipeline locally. Similar to ML Engine, it's best to prototype locally on a subset of the data, then run your job on the cloud over the entire dataset. To run on the cloud, you need to specify the cloud parameters. In this case, we need to specify the project where we're running our job, the job name, and the staging location and temp location, which are Cloud Storage buckets for saving any temporary files. Finally, we need to specify the runner, which is the pipeline runner that will parse your program and construct your pipeline. Since we're running on the cloud, we must use the Dataflow runner. Now, we are ready to process our baby weight data, which is still located in a BigQuery table. For both the training and evaluation datasets, let's read our data from BigQuery.
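As a hedged sketch of the boilerplate just described (not the exact slide code), the pieces might fit together as follows; the project, bucket, table, and column names a, b, and c are placeholder assumptions.

```python
import apache_beam as beam

def transform(rowdict):
    # Multiply column a by column b to get c, then yield the row as one CSV line.
    result = dict(rowdict)
    result['c'] = rowdict['a'] * rowdict['b']
    yield ','.join([str(result[col]) for col in ['a', 'b', 'c']])

if __name__ == '__main__':
    # Cloud parameters: project, job name, staging/temp buckets, and the runner.
    options = {
        'project': 'my-project',
        'job_name': 'bq-to-csv',
        'staging_location': 'gs://my-bucket/staging/',
        'temp_location': 'gs://my-bucket/tmp/',
        'runner': 'DataflowRunner'   # swap in 'DirectRunner' to prototype locally
    }
    opts = beam.pipeline.PipelineOptions(flags=[], **options)
    p = beam.Pipeline(options=opts)

    query = 'SELECT a, b FROM `my-project.my_dataset.my_table`'

    # Three pipe-separated operations: read from BigQuery, transform, write CSVs to GCS.
    (p
     | 'read'      >> beam.io.ReadFromBigQuery(query=query, use_standard_sql=True)
     | 'transform' >> beam.FlatMap(transform)
     | 'write'     >> beam.io.WriteToText('gs://my-bucket/output/data',
                                          file_name_suffix='.csv'))

    p.run()  # the pipeline executes only when run() is called
```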
For example, the train read operation reads the training data from BigQuery, then we perform some preprocessing that yields the results in CSV format, that's the train CSV operation, and then dump the results out to CSV files in the operation titled train out. We will also be able to monitor our job in real time using the Dataflow dashboard, by going to console.cloud.google.com and searching for Dataflow. This dashboard will tell us the number of workers, the status of our job, and other useful metrics.
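A minimal sketch of how those train and eval branches might be wired up is shown below. It assumes the public natality table and a hash-and-modulo split on the year column; the columns kept, the bucket path, and the project are illustrative placeholders, and the Dataflow parameters from the previous sketch would be added to run on the cloud.

```python
import apache_beam as beam

def to_csv(rowdict):
    # Yield a handful of illustrative columns as one CSV line.
    columns = ['weight_pounds', 'is_male', 'mother_age', 'plurality', 'gestation_weeks']
    yield ','.join([str(rowdict[col]) for col in columns])

base_query = """
SELECT weight_pounds, is_male, mother_age, plurality, gestation_weeks
FROM `bigquery-public-data.samples.natality`
WHERE year > 2000
"""
# Repeatable split: hash a column, take it modulo 4, send ~3/4 of buckets to train.
hashexpr = 'MOD(ABS(FARM_FINGERPRINT(CAST(year AS STRING))), 4)'

opts = beam.pipeline.PipelineOptions(
    flags=[], project='my-project', temp_location='gs://my-bucket/tmp/')
p = beam.Pipeline(options=opts)

for step in ['train', 'eval']:
    selquery = base_query + (' AND {} < 3'.format(hashexpr) if step == 'train'
                             else ' AND {} = 3'.format(hashexpr))
    (p
     | '{}_read'.format(step) >> beam.io.ReadFromBigQuery(query=selquery,
                                                          use_standard_sql=True)
     | '{}_csv'.format(step)  >> beam.FlatMap(to_csv)
     | '{}_out'.format(step)  >> beam.io.WriteToText(
           'gs://my-bucket/babyweight/{}'.format(step), file_name_suffix='.csv'))

p.run()
```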