In the previous class, we did quite a good job of optimizing our code. But at the end of the day, we asked ourselves whether there was a simpler way to achieve the same results without rewriting such a large amount of code. We know that the answer is yes, because Vivado HLS provides the array partitioning pragma. So, starting from our original code, where we applied loop unrolling and loop pipelining to the vector sum loop, we simply add the following array partitioning pragmas. Here, we instruct Vivado HLS to perform cyclic partitioning on the local arrays (local_A, local_B, and the result array) with a factor of four. As you can see, we do not need to change any code: Vivado HLS automatically takes care of splitting the arrays into smaller partitions and creates all the logic needed to translate a generic access to the original array into a corresponding access within one of the created partitions. The Vivado HLS array partitioning pragma takes several parameters: the name of the variable to partition; the partitioning type, which can be cyclic, block, or complete; the partitioning factor, which is meaningful only for cyclic and block partitioning; and the dimension of the array to partition, since, as we will see, multidimensional arrays can also be partitioned. We have already discussed in detail how cyclic partitioning works. Let's also look at the other two partitioning schemes offered by Vivado HLS, namely block and complete partitioning. Let's start with block partitioning. The idea of block partitioning is to split the array into F contiguous chunks, where F is the partitioning factor. Here, we are considering an array A with 12 elements that is block partitioned into four partitions. Each partition contains three contiguous elements from the original array. In particular, we can see that the first partition contains elements 0, 1, and 2, the second partition contains elements 3, 4, and 5, and so on.
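To make the discussion concrete, here is a minimal sketch of how the pragmas might look in a vector-sum kernel like the one above. The function and array names (vadd, local_A, local_B, local_res) and the array size are illustrative, not taken from the original code; the `#pragma HLS` lines are ignored by a standard C++ compiler, so the block still runs as plain software.

```cpp
#include <cassert>

#define N 64

// Illustrative vector-sum kernel: cyclic partitioning with factor 4 gives the
// unrolled loop body four memory ports per array, so four additions can be
// issued in the same cycle without rewriting any loop code.
void vadd(const float in_A[N], const float in_B[N], float out[N]) {
    float local_A[N], local_B[N], local_res[N];
#pragma HLS ARRAY_PARTITION variable=local_A cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=local_B cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=local_res cyclic factor=4 dim=1

    // Copy inputs into the local buffers that we partition.
    for (int i = 0; i < N; i++) {
        local_A[i] = in_A[i];
        local_B[i] = in_B[i];
    }

sum_loop:
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=4
        local_res[i] = local_A[i] + local_B[i];
    }

    // Write the results back.
    for (int i = 0; i < N; i++) out[i] = local_res[i];
}
```

Note that the partitioning factor matches the unroll factor, so each of the four unrolled loop bodies reads from a different partition.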
This type of partitioning is useful if we need to access in parallel elements of the array coming from different partitions, such as an element in the first half and one in the second half of the array. Nevertheless, block partitioning is less frequently used than cyclic partitioning. Indeed, in many cases, when we perform partial unrolling, we require parallel accesses to contiguous elements of the array, not to elements in different sections of the array. Finally, when we apply complete partitioning, we assign each element of the array to its own memory. Since having a memory with a single element would be quite a waste, Vivado HLS maps single elements to registers, using flip-flop resources instead of BRAMs. As you can see from the image, a complete partitioning of an array A of 12 elements is equivalent to a block or cyclic partitioning with a factor of 12. So why don't we always use complete partitioning to ensure that every element of the array can be accessed in parallel? Well, there are two main reasons. First, large partitioning factors, and complete partitioning in particular, tend to use a large amount of resources. Indeed, every partition requires at least a dedicated BRAM or register, so larger partitioning factors increase the BRAM or flip-flop resource requirements. Second, every BRAM is able to store up to 18K bits of data. If a partition contains only a handful of single-precision floating-point values, say 256 bits, we are wasting more than 98 percent of the BRAM capacity. Recall also that, when we partition an array, we need extra logic to determine from which partition we need to access the data. When the partitioning is well calibrated to the actual memory accesses that we need to perform in parallel, such logic tends to be simple, as in our previous example, in which, for an unrolling factor of four, we used a corresponding cyclic array partitioning with a factor of four.
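The difference between block and cyclic partitioning can be captured in two lines of index arithmetic. The following is a software model of the element-to-partition mapping; the formulas are our own illustration of the schemes described above, not an excerpt of what Vivado HLS generates internally.

```cpp
#include <cassert>

// For an n-element array split into f partitions:
//   block:  chunk = n / f; partition = i / chunk; offset within it = i % chunk
//   cyclic: partition = i % f; offset within it = i / f  (round-robin dealing)
struct Loc { int partition; int offset; };

Loc block_loc(int i, int n, int f) {
    int chunk = n / f;            // contiguous elements per partition
    return { i / chunk, i % chunk };
}

Loc cyclic_loc(int i, int f) {
    return { i % f, i / f };      // consecutive elements land in different partitions
}
```

For the 12-element array A with factor 4 from the example, `block_loc` sends elements 0, 1, 2 to partition 0 and elements 3, 4, 5 to partition 1, while `cyclic_loc` sends consecutive elements to consecutive partitions, which is why cyclic partitioning pairs naturally with partial unrolling over contiguous indices.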
With these settings, Vivado HLS can generate a clean hardware implementation in which, at every iteration of the loop, we access the four partitions in parallel at addresses that are very simple to compute. All right, we are almost done. We have seen how we can apply cyclic, block, and complete partitioning to an array with a single dimension. However, Vivado HLS also allows partitioning of multidimensional arrays. The concepts discussed so far are still valid and can be applied directly to the multidimensional case. To understand how multidimensional partitioning works, let's consider a few examples on a simple two-dimensional array A, or matrix. Our matrix has six rows (dimension 1) and four columns (dimension 2). If we apply cyclic partitioning on the first dimension, the row dimension, with a factor of two, we are basically storing the odd rows in one partition and the even rows in another partition. As you can see, within the HLS array partitioning directive we use the dim parameter to specify which dimension to partition. Now, if we further partition our matrix on the second dimension, the column dimension, with block partitioning with a factor of two, we obtain the result in the picture. Basically, we get four partitions: the odd rows in the first two columns, the even rows in the first two columns, the odd rows in the second two columns, and the even rows in the second two columns. As a last example, consider the case in which we completely partition the second dimension and do not partition the first. In this case, we obtain four partitions, one for each column. Notice that each of these partitions contains six elements, hence they are still implemented using BRAM resources and not registers. As a rule of thumb, when optimizing a kernel function, we suggest proceeding as follows. Try to apply loop pipelining to the innermost loops and evaluate the benefits of the optimization.
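As a sketch of the multidimensional case, here is how the two pragmas from the 6x4 matrix example might be combined on a local buffer. The kernel itself (an element-wise matrix addition) and its names are illustrative; only the pragma syntax, with the dim parameter selecting the dimension, is the point.

```cpp
#include <cassert>

#define ROWS 6
#define COLS 4

// Illustrative kernel on a 6x4 matrix buffer. Cyclic partitioning with
// factor 2 on dim 1 separates even-index rows from odd-index rows; block
// partitioning with factor 2 on dim 2 further splits each group into the
// first two and last two columns, for four partitions in total.
void madd(const float A[ROWS][COLS], const float B[ROWS][COLS],
          float C[ROWS][COLS]) {
    float local[ROWS][COLS];
#pragma HLS ARRAY_PARTITION variable=local cyclic factor=2 dim=1
#pragma HLS ARRAY_PARTITION variable=local block factor=2 dim=2

    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
#pragma HLS PIPELINE II=1
            local[r][c] = A[r][c] + B[r][c];
        }

    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            C[r][c] = local[r][c];
}
```

Using `complete` with `dim=2` instead would produce one six-element partition per column, as in the last example above.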
If you cannot achieve an initiation interval of one cycle due to parallel memory accesses, analyze the code of the loop body and see what kind of partitioning would allow the accesses to be performed in parallel at every iteration of the loop. Try to improve performance further by unrolling the loop: first with partial unrolling, and then consider applying full unrolling and pipelining the outer loops. After each modification, check again whether there is any issue due to limited memory ports that might be solved using array partitioning. In these classes, we have not covered the full set of optimizations that Vivado HLS offers, such as dataflow optimization. Nevertheless, we have covered some of the most important ones. After having mastered loop unrolling, loop pipelining, and array partitioning, we suggest improving your knowledge of the available optimizations by consulting the SDx Pragma Reference Guide online.
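The suggested flow can be summarized on a hypothetical saxpy-style kernel (names and sizes are our own, for illustration): step 1, pipeline the innermost loop; step 2, if the schedule report shows II > 1 because of memory port conflicts, partition the arrays accordingly; step 3, partially unroll by the same factor to exploit the extra ports.

```cpp
#include <cassert>

#define N 32

// Hypothetical end state of the suggested optimization flow.
void saxpy(float a, const float x[N], const float y[N], float out[N]) {
    float local_x[N], local_y[N];
#pragma HLS ARRAY_PARTITION variable=local_x cyclic factor=4 dim=1  // step 2
#pragma HLS ARRAY_PARTITION variable=local_y cyclic factor=4 dim=1  // step 2

    for (int i = 0; i < N; i++) { local_x[i] = x[i]; local_y[i] = y[i]; }

compute:
    for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1    // step 1: pipeline the innermost loop
#pragma HLS UNROLL factor=4  // step 3: partial unroll, matching the factor
        out[i] = a * local_x[i] + local_y[i];
    }
}
```

After each of the three steps, re-run synthesis and check the report: if the achieved II grows again, the memory ports are the first thing to suspect.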