Let us now look at a very simple example of a stencil computation,

while in practice there are many more complicated examples like this.

So the problem is as follows.

Let's say we want to solve this recurrence.

With some boundary conditions where you have,

let's say, x_sub_0 is 0,

x_sub_1 is 1 and we want to do this

for zero less than i less than n.

So i going from one to n minus one.

So we're really trying to figure out in this case an iterative averaging example

how can we compute x_i such that it's the average of its neighboring elements.

And this is in a one-dimensional array.

As I mentioned earlier,

we have stencil computations that work on much larger arrays,

multi-dimensional, that essentially follow this pattern.

So you can think about it and maybe figure

out that the solution is quite easy to work out in practice.

If we had n equal to 10,

then for i going from 0 to 10 you'd

have zero and one at the boundary and the intermediate values would be 0.1,

0.2 and so on to 0.8, 0.9.

And if you take a look, you can quickly convince yourself that each element is

indeed the average of its two neighboring elements.

Now the question is: instead of solving this in our head,

how could we write a computer program to come up with this answer?

Well, the usual approach that's taken is to have two arrays.

And this is in an algorithm that's referred to as Jacobi's relaxation.

Let's say you have an old value of the array x and you have a new value of the array x.

And then every iteration,

you're going to compute new x,

the ith element, as the average of the i minus one and the i plus 1 elements of old x.

Notice the way I've drawn it.

We can see that all these elements of new x can be computed in parallel with each other.

And when we are done, we are going to swap the pointers so that what was

earlier new x becomes old x, and what was

earlier old x can now accumulate the values for the next phase.

And the idea is if we keep repeating this computation,

we will eventually converge to the answer that we figured out analytically.

The reason the computation is important is in real world problems,

we do not have an analytical solution and we need this computation to obtain the answer.

So how would we write code to do this computation?

Well, we have an outer iteration loop,

which we can consider kind of like

a time step loop and that'll go from one to some number of times steps.

And then for each time step,

we are going to do one of these parallel steps so we can have

a for-all for i that goes from one to n

minus one and does the update with new

x -- i equal to -- and I'll just write the average of old x,

i minus one and old x, i plus one.

And when we are done with the for-all,

we have to swap the pointers.

So this is fine.

We can go through a large number of steps and each time have a for-all,

but there is an inherent inefficiency in this approach because each for-all has n

minus one tasks and that gets multiplied by the number of time steps.

So we have n steps times n minus one,

a quadratic number of tasks being created.

Now it turns out this is exactly the scenario for which barriers can help

you generate more efficient code and to do that,

what you have to do is - and it's very interesting,

you have to move the for-all loop outside.

You have the for-all loop outside and then within each iteration of the for-all,

within each parallel task i,

you do the sequential loop.

You have the sequential loop and then within the sequential loop,

you can perform this update for i,

which is the new x

sub i equals average of old x,

i minus one, and old x, i plus one.

And now here's the challenge;

you want to swap the pointers and go to

the next time step.

But before going to the next time step,

you have to make sure that all the i's have completed computing

the new x average because we are going to swap the pointers and

start overwriting what was old x, in the next phase.

And if we don't have this coordination,

then we'll end up with some data race condition.

And this, as you can guess,

is perfectly suited to a barrier operation.

So we can insert a barrier over here.

This is also referred to as a "next" operation,

which is an indication that

the for-all loop has to move to its next phase after the barrier.

And with this we ensure that all the i's

which are running in parallel for a given time step

complete the updates, wait on the barrier,

swap pointers, and then go to the next round of updates.

So what we've seen is that barriers can be used not

just to separate two phases like we saw earlier with hello and goodbye;

they can be embedded in sequential computations like for loops and while

loops and help coordinate across tasks.

Now the big advantage of doing this approach is we only

have n minus one tasks created. That's it.

We do have - the number of barrier operations is still n

minus one times n steps barrier operations,

but the number of tasks is much fewer and this is a much better tradeoff in practice

because creating tasks is much more expensive than a barrier operation,

which just involves coordination among tasks that have already been created.

So now you've seen a more interesting example using

barriers and it is indeed reflective of a lot of

real world computations and we see that barriers

can lead to more efficiency and help reduce the number of tasks that you create.