Hi, in this class we are going to conclude our journey regarding the optimization of accelerated FPGA-based applications with SDAccel. In this class, we will discuss the optimizations related to the host system, which is responsible for transferring data to and from the FPGA board, as well as for sending the commands that start the execution of a kernel. Before starting, it is useful to quickly recall the general execution flow of an SDAccel application. To this purpose, let's quickly review these steps for an application targeting an Amazon EC2 f1.2xlarge instance.

Overall, the lifecycle of an SDAccel application follows these steps. First, the host loads a complete bitstream onto the FPGA device in order to configure both the static region and the reconfigurable region that hosts the accelerated kernel function. Second, the host copies the input data from the host DDR memory to the on-board DDR memory via PCI Express. Third, the host configures the kernel parameters and starts the kernel execution via the AXI4-Lite control interface. Fourth, the kernel performs the computation, reading and writing data from and to the on-board DDR memory via the AXI master interfaces that are connected to the memory controller. Fifth, the kernel completes the execution and notifies the host. Sixth, the host copies the output data generated by the kernel from the on-board DDR memory back to the host DDR memory and continues processing as needed.

As we can see, the actual kernel execution is only one out of six different steps, each requiring its own time to complete. Host optimization therefore means minimizing the execution time of the other five steps or, better, trying to hide their execution time while the kernel is running.

Step one, the FPGA configuration, can take a few seconds to complete. When our application uses a single FPGA bitstream, there is not much we can do to hide the time taken by this step. However, if multiple bitstreams are required over time, we can hide the time taken to load a bitstream by performing useful computation on the host side in parallel. Indeed, it is important to recall that ours is a heterogeneous system: it consists of both a host CPU and the actual FPGA accelerator. Hence, why not exploit the host as well to perform useful computation? In general, the idea is to preemptively load the FPGA configuration before it is actually needed.

Now, let's consider steps two (host-to-device memory data transfer), four (kernel execution), and six (device-to-host memory data transfer). For every new block of data that we want to process, all three steps must take place. This is especially true when we deal with very large datasets that we need to process in blocks in order to fit them into the device memory. A very simple implementation that iteratively executes the kernel on subsequent blocks of the dataset is shown in the figure. As you can see, the FPGA is idle while we are performing the data transfers to and from the device. The read and write operations are named from the host perspective: here, write means that the host sends one block of input data, while read means that the host gets back the results for the corresponding input block.

A better approach is to use a double-buffering technique: instead of waiting for the kernel to complete, we can already start transferring the next chunk of data to the accelerator. Then, once the kernel completes, we can promptly restart it on the new block of data without the need to wait for the data transfer to occur. In this fashion, we effectively execute the kernel continuously, without interruption, provided that the kernel execution time is higher than the time taken to move the data to and from the FPGA device. This type of implementation can be achieved by exploiting OpenCL events to synchronize the different memory transfers and kernel executions.
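As an illustration, here is a minimal sketch of such an event-based double-buffering scheme using the plain OpenCL C API. All names (`queue`, `krnl`, the two buffer sets) and the fixed block size are assumptions made for the example, not code from this class; the command queue is assumed to be created with the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property, so that only the declared event dependencies, and not the enqueue order, serialize the operations.

```cpp
#include <CL/cl.h>

// Process num_blocks blocks of block_size bytes with double buffering:
// while buffer set b runs on the FPGA, the transfers of the other set overlap.
void process_blocks(cl_command_queue queue, cl_kernel krnl,
                    cl_mem in_buf[2], cl_mem out_buf[2],
                    const char *src, char *dst,
                    size_t block_size, int num_blocks) {
    cl_event write_done[2]  = {nullptr, nullptr};
    cl_event kernel_done[2] = {nullptr, nullptr};
    cl_event read_done[2]   = {nullptr, nullptr};

    for (int i = 0; i < num_blocks; i++) {
        int b = i % 2;  // ping-pong between the two buffer sets

        // Before reusing buffer set b, wait until its previous results
        // have been read back, then recycle the associated events.
        if (read_done[b]) {
            clWaitForEvents(1, &read_done[b]);
            clReleaseEvent(write_done[b]);
            clReleaseEvent(kernel_done[b]);
            clReleaseEvent(read_done[b]);
        }

        // Step 2: non-blocking host-to-device transfer of the next block.
        clEnqueueWriteBuffer(queue, in_buf[b], CL_FALSE, 0, block_size,
                             src + i * block_size,
                             0, nullptr, &write_done[b]);

        // Steps 3-4: configure the kernel and launch it as soon as its
        // input transfer has completed.
        clSetKernelArg(krnl, 0, sizeof(cl_mem), &in_buf[b]);
        clSetKernelArg(krnl, 1, sizeof(cl_mem), &out_buf[b]);
        clEnqueueTask(queue, krnl, 1, &write_done[b], &kernel_done[b]);

        // Step 6: non-blocking device-to-host transfer of the results,
        // dependent only on this block's kernel execution.
        clEnqueueReadBuffer(queue, out_buf[b], CL_FALSE, 0, block_size,
                            dst + i * block_size,
                            1, &kernel_done[b], &read_done[b]);
    }
    clFinish(queue);  // drain everything still in flight
    // (releasing the events of the last two blocks is omitted for brevity)
}
```

With a default in-order queue the same code would still be correct, but every command would serialize and no overlap would occur; the out-of-order queue (or, equivalently, separate in-order queues for writes, kernel launches, and reads) is what actually lets the data transfers hide behind the kernel execution.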
Finally, let's now focus on steps three, four, and five, which concern the kernel configuration, the kernel execution, and the notification of completion. Here, it is important to know that the kernel configuration is not instantaneous: depending on the OpenCL runtime library within the SDAccel environment, the time may vary from 30 to 60 microseconds according to the number of kernel arguments. Hence, if the kernel execution takes a time in the order of tens or hundreds of microseconds, the time taken for configuring the kernel is not negligible. In order to reduce such overhead, it is important to minimize the number of kernel calls. As an example, assume that you have built a kernel that applies a filter to a small image. If we need to process a batch of images, instead of calling the kernel on each image, we can create a kernel that receives a parameter telling it how many images to process from the DDR memory, so that it can compute the entire batch in a single execution.
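A minimal sketch of this batching idea follows; the kernel name `filter`, the image size, and the back-to-back layout of the images in DDR are assumptions made for illustration.

```cpp
// Device side (HLS C++ kernel sketch): the extra num_images argument lets a
// single invocation process a whole batch of images stored back to back in
// the on-board DDR memory.
#define IMG_PIXELS (128 * 128)  // assumed image size

extern "C" void filter(const unsigned char *in, unsigned char *out,
                       int num_images) {
    for (int img = 0; img < num_images; img++) {
        // Placeholder for the actual filter: here we simply copy the image.
        for (int p = 0; p < IMG_PIXELS; p++)
            out[img * IMG_PIXELS + p] = in[img * IMG_PIXELS + p];
    }
}
```

On the host side, the kernel is then configured and launched once per batch rather than once per image:

```cpp
#include <CL/cl.h>

// One configuration and one launch for the whole batch: the 30-60 us
// kernel-setup overhead is paid once instead of num_images times.
void run_batch(cl_command_queue queue, cl_kernel krnl,
               cl_mem in_buf, cl_mem out_buf, int num_images) {
    clSetKernelArg(krnl, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(krnl, 1, sizeof(cl_mem), &out_buf);
    clSetKernelArg(krnl, 2, sizeof(int), &num_images);
    clEnqueueTask(queue, krnl, 0, nullptr, nullptr);  // single kernel call
    clFinish(queue);
}
```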