We are starting a discussion of memory traffic optimization, and first I would like to explain why memory traffic is so important for performance. Earlier in the course we talked about vectorization. If you see that your loops are vectorized, this is a part of the success. Later, we will learn that you can push the performance of your vector calculations even further by making sure that your data containers and loop order give unit-stride access, by doing data alignment, possibly container padding, and by eliminating peel loops and multiversioning. But ultimately, you will need to start worrying about data reuse in caches if you want vectorization to matter. And that is because vector arithmetic in modern computers is cheap, and it is memory access that is expensive. So, if you don't optimize cache usage to minimize memory access, then vectorization will not matter: you will be bottlenecked by memory access. This is a shift required in your thinking compared to programming processors of 20 or 30 years ago: it is cheap to do mathematics, it is expensive to go to memory. So, let's attach a number to that claim. How cheap are floating-point operations? For an Intel Xeon Phi processor 7250, you have 68 cores working in parallel. They are clocked at 1.2 GHz when you do vector math, slightly lower than the base clock, but this is close enough. Every instruction processes eight double precision numbers at once in 512-bit registers, and if you are doing the fused multiply-add operation, then each instruction is actually two operations, an addition and a multiplication. And, because you have two vector processing units, you can actually issue two such instructions per cycle. The product of these numbers is the theoretical peak performance of 2.6 TFLOP/s. So, this processor has the ability to perform 2.6 trillion floating-point operations every second, which, by the way, is close to the theoretical performance of the fastest supercomputer on the planet 20 years ago.
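The peak-performance arithmetic above can be written out as a short sketch (Python here just for the arithmetic; the ~1.2 GHz vector clock is the figure from the text):

```python
# Back-of-the-envelope peak FLOP/s for an Intel Xeon Phi processor 7250.
cores = 68                 # cores working in parallel
vector_clock = 1.2e9       # cycles per second under AVX-512 load (from the text)
doubles_per_vector = 8     # 512-bit register / 64-bit double
flops_per_fma = 2          # fused multiply-add = one multiply + one add
vpus_per_core = 2          # two vector processing units per core

peak_flops = cores * vector_clock * doubles_per_vector * flops_per_fma * vpus_per_core
print(f"{peak_flops / 1e12:.1f} TFLOP/s")  # → 2.6 TFLOP/s
```

Multiplying these five factors together is all there is to the 2.6 TFLOP/s claim; change any one of them (say, no FMA in your code) and the peak drops proportionally.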
How much data is there in 2.6 trillion floating-point numbers? Well, multiplied by 8 bytes, you get around 21 terabytes. So, your CPU wants to consume 21 terabytes of data every second. If this data is coming from the on-package memory, MCDRAM, then the bandwidth that you can expect is at best half a terabyte per second. If you take the ratio of these two numbers, you will find that the CPU wants to consume around 50 times more data per second than the memory can deliver. This break point is a convenient number that can tell you whether your application is compute-bound, limited by the arithmetic throughput of your processor, or bandwidth-bound, limited by the memory bandwidth. If you do more than 50 floating-point operations for every memory access, you are compute-bound; fewer than 50, bandwidth-bound. You can refine this analysis a little more and draw a roofline diagram. The roofline diagram predicts the maximum performance that you can achieve for a particular arithmetic intensity. If the arithmetic intensity is infinite, you only do compute, you never go to memory, then of course you are limited by the theoretical peak performance of your processor. If your arithmetic intensity is one, that is, for every memory access you do one operation, then you will be bottlenecked by that half a terabyte per second that I showed on the previous slide. If you connect this asymptote to the single data point using a unit-slope line, you will get something that looks like a roof. And the value of this diagram is its predictive power. If you can estimate, with paper and pencil or by measurement, the arithmetic intensity of your algorithm, then there are two possibilities. Maybe your arithmetic intensity is low and you are hitting the sloped part of the roof; then you know that you are bandwidth-bound, and you can improve your performance using techniques that increase arithmetic intensity, so that you move up the roofline diagram, and this will be your performance gain.
At the same time, if your arithmetic intensity is high enough and you are hitting the flat part of the roof, then it does not make sense to use techniques for improving the arithmetic intensity, because you are now compute-bound. And of course, the exact shape of the roofline diagram depends on the type of operations that you are assuming: fused multiply-adds have a higher ceiling, while exponentials and reciprocals have a lower ceiling. It also depends on the type of memory that you are dealing with. You can draw the roofline for the high-bandwidth on-package memory or for the on-platform memory, and depending on which one you draw, the break point will be different. You can also draw roofline models for caches and apply your analysis to, say, cache misses as opposed to going to memory.
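To see how the break point moves with the memory tier, here is a small sketch; the ~90 GB/s figure for the on-platform DDR4 memory is an assumed ballpark for illustration, not a number from the lecture:

```python
# Break points for different memory tiers of the same processor.
PEAK = 2.6e12                 # FLOP/s, theoretical peak
tiers = {
    "MCDRAM": 0.5e12,         # bytes/s, high-bandwidth on-package memory (from the text)
    "DDR4": 0.09e12,          # bytes/s, on-platform memory (assumed ballpark)
}

for name, bw in tiers.items():
    # Above this intensity you are compute-bound; below it, bandwidth-bound.
    print(f"{name}: break point at {PEAK / bw:.1f} FLOP/byte")
```

The slower the memory tier, the further right the break point moves, so a kernel that is compute-bound out of MCDRAM can be firmly bandwidth-bound out of DDR4.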