Hi, and welcome to this class, where we will see how to optimize the implementation of our kernel in order to make efficient use of the resources available on our target FPGA. In particular, we will discuss the loop unrolling optimization. Let’s come back to the vector sum example that was introduced in the interface optimization classes. The code version that you see here already has some interface optimizations applied to it: in particular, the code exploits burst data transfers, and we rely on local memories to read the operands and to store the results of our floating-point additions. The core of our kernel resides in the loop labeled «sum_loop». Here we iterate over the N elements of our vectors and perform the additions one by one.

Looking again at our synthesis reports, we can see that each iteration of «sum_loop» takes 10 cycles. Since the loop needs 1024 iterations, the overall latency for computing the loop is 10240 cycles. Note that the number of loop iterations is referred to as the trip count within the Vivado HLS performance reports. To understand why we need 10 cycles for each iteration, we can look at the analysis report. Here, we can see that 2 cycles are needed to load the operands from arrays local_a and local_b in parallel, 7 cycles are required to perform the floating-point addition, and, finally, one cycle is needed to store the result back into array local_res.

Is there any way to reduce the overall latency of the loop and achieve higher performance? Well, luckily, the answer is yes! We will now look into two different optimization directives, namely loop unrolling and loop pipelining. If we take a closer look at our original code, we can clearly see that all the iterations of the loop are independent from each other: indeed, each addition operates on different elements of the input arrays and stores its result into a different element of the output array.
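As a reference, the kernel described above might look like the following sketch. The array and loop names (local_a, local_b, local_res, sum_loop) follow the lecture, but the exact function signature and the interface pragmas are assumptions, since the full listing is only shown on the slides:

```cpp
#define N 1024

// Hypothetical sketch of the baseline vector sum kernel. Interface pragmas
// (e.g. #pragma HLS INTERFACE m_axi ...) for burst transfers are omitted;
// a plain C++ compiler simply ignores HLS pragmas anyway.
void vector_sum(const float *a, const float *b, float *res) {
    float local_a[N], local_b[N], local_res[N];

    // Burst-read the operands into local (on-chip) buffers.
    read_a: for (int i = 0; i < N; i++) local_a[i] = a[i];
    read_b: for (int i = 0; i < N; i++) local_b[i] = b[i];

    // Core computation: one floating-point addition per iteration.
    sum_loop: for (int i = 0; i < N; i++) {
        local_res[i] = local_a[i] + local_b[i];
    }

    // Burst-write the results back to memory.
    write_res: for (int i = 0; i < N; i++) res[i] = local_res[i];
}
```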
Hence, would it be possible to perform multiple additions in parallel on different elements? The answer is again yes, and the way to achieve it is by unrolling the loop. Loop unrolling means merging consecutive loop iterations so that the number of iterations of the loop decreases while the loop body performs extra computation. This technique exposes additional instruction-level parallelism that Vivado HLS can exploit when implementing the final hardware design. In this example, we have manually unrolled our sum_loop by a factor of 2. As you can see, the variable i increments with a step of 2, effectively reducing the number of loop iterations from 1024 to 512. On the other hand, each loop iteration performs two additions instead of one.

The same optimization can also be expressed in a much more convenient way by using the HLS UNROLL pragma. The pragma must be placed directly within the loop that we wish to unroll, and it lets us specify the factor by which we want to unroll the loop. Notice that the unrolling factor can be any number from 2 up to the number of iterations of the loop. If the factor parameter is not specified, Vivado HLS will try to completely unroll the loop. However, this can be done only if the number of iterations is constant and does not depend on dynamic values computed within the function.

Alright, let’s now see the effect of our optimization! If we run Vivado HLS and look at the synthesis report, we can see that the latency of sum_loop has halved! The reduction comes from the fact that the loop now iterates 512 times, while still completing each iteration in 10 cycles, as in the previous case. To understand how Vivado HLS achieves this, we can look at the analysis report. Here we can clearly see that Vivado HLS was able to schedule the two floating-point additions, as well as the corresponding load and store operations, completely in parallel!
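The two versions just discussed can be sketched as follows. The loop body and names follow the lecture; the standalone function wrappers are added here only so that each variant is self-contained:

```cpp
#define N 1024

// Manually unrolled by a factor of 2: i steps by 2 and each iteration
// performs two additions, so the trip count drops from 1024 to 512.
void vector_sum_unrolled(const float local_a[N], const float local_b[N],
                         float local_res[N]) {
    sum_loop: for (int i = 0; i < N; i += 2) {
        local_res[i]     = local_a[i]     + local_b[i];
        local_res[i + 1] = local_a[i + 1] + local_b[i + 1];
    }
}

// The same optimization expressed with the pragma; here Vivado HLS performs
// the unrolling itself. A plain C++ compiler ignores the pragma, so both
// functions compute exactly the same result.
void vector_sum_pragma(const float local_a[N], const float local_b[N],
                       float local_res[N]) {
    sum_loop: for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=2
        local_res[i] = local_a[i] + local_b[i];
    }
}
```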
Nevertheless, this optimization comes at a cost. In order to perform the two floating-point additions fully in parallel, we need two floating-point adders in our hardware design, which increases the overall resource consumption of our kernel. Indeed, if we look at the resource estimation report, we can actually see the two floating-point adder instances and their corresponding resource consumption. In our design we are far from using all the available FPGA resources, but in more complex designs it is very important to consider the impact on resource consumption when applying optimizations to our kernel. In this example, unrolling by a factor of 2 provided a straight 2x reduction in the latency of the loop at the cost of 2x extra resources for its implementation. In some cases, however, it might not be possible to achieve such an ideal latency improvement. When performing loop optimizations, there are two potential issues that need to be considered: first, constraints on the number of available memory ports and hardware resources; second, loop-carried dependencies. I know you are interested in knowing more; don't worry! More information will be provided in the following lesson.
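In the ideal case, this latency/resource trade-off can be captured with simple arithmetic: the loop latency is roughly the trip count divided by the unroll factor, times the per-iteration latency, while the number of floating-point adders grows with the factor. The following helper is only an illustration of that back-of-the-envelope model, using the numbers from the lecture (1024 iterations, 10 cycles each); it assumes no memory-port limits and no loop-carried dependencies:

```cpp
// Idealized latency model for an unrolled loop: it ignores memory-port
// constraints and loop-carried dependencies, so it is an upper bound on
// the achievable speedup, not a prediction of the actual report.
int model_cycles(int trip_count, int iter_latency, int factor) {
    return (trip_count / factor) * iter_latency;
}

// model_cycles(1024, 10, 1) -> 10240 (original loop, as in the first report)
// model_cycles(1024, 10, 2) ->  5120 (unrolled by 2, as in the second report)
```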