Week two is dedicated to the kind of parallelism that exists inside every core of Intel processors: vector instruction support. Vector instructions are one implementation of SIMD parallelism. SIMD stands for Single Instruction Multiple Data. As the name suggests, it is the ability of processors to apply a single arithmetic operation, or a single stream of instructions, to multiple data elements at once.

For example, suppose you have the task of adding together two arrays element by element, and the cost of the addition operation is around one cycle; let's pretend that this is the throughput, one instruction per cycle. Then adding together 100 elements will require 100 cycles on a scalar processor. In contrast, if you have a vector processor, then thanks to parallelism in hardware, you have the ability to load not 1 but 2, 4, 8, or 16 elements into a vector register, and call only one addition operation that applies to all of these elements at once. The idea is that the throughput and the latency of this vector instruction can be the same as the throughput and latency of the corresponding scalar instruction; you will just call fewer of them. So, with 4-element vectors, the same 100 elements will be processed in 25 cycles. This is what vectorization is: the ability of your software to issue a single instruction on multiple data elements at once and finish the arithmetic processing faster.

If you are not yet aware of vectorization, you may have a very good excuse for that, and this chart explains why. It shows the years of introduction of the supported instruction sets in Intel architecture, and it also shows the width of the vector registers in these instruction sets. As you can see, the first appearance of vectorization in Intel architecture was in the late 1990s with MMX instructions. MMX stands for Multimedia Extensions.
And that is all these instructions were good for: processing multimedia. They had 64-bit registers, and they supported operations only on integers, which is not very important for scientific computing. If you did scientific computing, things became interesting in the early 2000s with the introduction of SSE, Streaming SIMD Extensions, and its various versions. For example, with SSE2 we got floating-point math, and the vector width increased to 128 bits.

So let's do the math. If you have a vector register that is 128 bits long, how many single precision floating-point numbers can you fit in there? At 32 bits per number, 4. So 4 is the potential speedup of your application thanks to vectorization. But this is in single precision. A lot of scientific computing is done in double precision, and there the predicted speedup is around 2. A speedup of 2 may not be enough of an incentive for many people to modernize their applications to take advantage of vectorization. So you may have a very good excuse for not knowing what vectorization is, simply because it was not important until recently.

Things changed with the introduction of AVX instructions in Intel's Sandy Bridge architecture, because the vector width increased to 256 bits. AVX stands for Advanced Vector Extensions, and these instructions also support single precision and double precision floating-point math. Now the potential speedup in single precision is 8, and in double precision, 4. But with Xeon Phi, vectorization is very difficult to ignore. That is because with 512-bit vectors, the potential speedup is 16 in single precision and 8 in double precision. As one of Intel's engineers put it, if you're not using vectorization, you may be paying for 16 times the processor that you're actually using. It is good to have vectorization in your code. So now let's see what it takes to teach your software to use vector instructions.