0:00

Having mastered computing the connected components of an undirected graph in

linear time, let's now turn our attention to directed graphs,

which also arise in all kinds of different applications.

Now the good news is we'll be able to get just a blazingly fast primitive for

computing connectivity information for directed graphs.

The bad news, if you want to call it that,

is we'll have to think a little bit harder.

So it won't be as obvious how to do it.

But by the end of this lecture you will know a linear-time algorithm,

really just based on depth-first search, for

computing all of the pieces of a directed graph.

In fact, it's not even so obvious how to define pieces,

how to define connected components in a directed graph.

Certainly not as obvious as it was with undirected graphs.

So, to see what I mean, let's consider the following four-node directed graph.

0:45

So on the one hand, this graph is in some sense in one piece.

If this was an actual physical object, made say of a bunch of strings connected

to each other and we picked it up with our hands,

it wouldn't fall apart into two pieces, it would hang together in one piece.

On the other hand, when you think about moving around this network,

it's not connected in the sense that we might think about.

You cannot get from any one point a, to any other point b.

For example, if you started at the right-most node in this graph,

certainly there's no directed path that will get you to the left-most node.

So what's typically studied and

most useful for directed graphs is what's called strong connectivity.

And a graph is strongly connected if you can get from any one point to any

other point and vice versa.

And the components then informally are the maximal portions of this graph,

the maximal regions, which are internally strongly connected.

So the maximal regions within which you can get from any one

point a to any other point b along a directed path.

For the record, let me go ahead and give you a formal definition,

although this intuition is perfectly valid.

Just regions where you can get from anywhere to anywhere else.

1:55

And as in the undirected case, we're going to give a somewhat slick definition

rather than talking about maximal region satisfying some property.

We're going to talk about the equivalence classes of a particular

equivalence relation.

But really it means the same thing.

This is just sort of the mathematically more mature way of writing

it down.

So the equivalence relation we're going to use is on the nodes of the graph, and

we'll say that a node u is related to a node v if you can get from u to v

via a directed path, and also from v to u via some other directed path.

2:32

So remember what it means to be an equivalence relation, that's reflexive,

that is everybody's related to itself.

But of course there is a path from every node to itself.

It also means that it's symmetric, so if u is related to v, then v is related to u.

Well again, by definition we're saying that the vertices are mutually reachable

from each other.

So that's clearly symmetric by definition.

And then it has to be transitive.

And the way you prove it's transitive is you just paste paths together, and

it just works the way you'd expect it to.
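To make this equivalence relation concrete, here is a minimal sketch in Python (my own illustration, not code from the lecture) that tests whether two nodes are mutually reachable. It assumes the graph is represented as a dict mapping each node to a list of its out-neighbors:

```python
from collections import deque

def reachable(graph, s):
    """Return the set of nodes reachable from s via directed paths (BFS)."""
    seen = {s}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def strongly_related(graph, u, v):
    """The equivalence relation: u ~ v iff each can reach the other."""
    return v in reachable(graph, u) and u in reachable(graph, v)
```

On a small hypothetical graph such as `{'a': ['b'], 'b': ['a', 'c'], 'c': []}`, nodes a and b are related (they lie on a directed cycle), while a and c are not, since c has no path back.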

So let's illustrate this concretely with a somewhat more complicated directed graph.

3:21

So what I hope is pretty clear is that each of these circled regions is indeed

strongly connected.

That is, if you start from one node in one of these circled regions,

you can have a directed path to any other node.

That's because on a directed cycle you can get from any

one starting point to any other place.

And all of these have directed cycles.

And then there's the one strong component that just has the one node,

which is obviously strongly connected.

What I also claim is true is that all of these regions are maximal.

Subject to being strongly connected.

That's why they're the strongly connected components.

That's why they're the equivalence classes of this equivalence relation

we just defined.

So if you take any pair of nodes which lie in two different circles,

you either won't have a path from the first one to the second, or

you won't have a directed path from the second one back to the first one.

In fact, the structure of the strong components in this black graph

exactly mirrors the four-node directed acyclic graph that we started with in red.

So in the same way in the red four-node graph, you can't move from right to left.

Here in this bigger graph, you can't move from any of the circled SCCs to the right

to any of the circled SCCs to the left.

So for example, from the right-most nodes,

there are no directed paths to the left-most nodes.

4:26

So that's a recap of the definition of the strongly connected components.

I've motivated in a separate video some reasons why you might care about computing

strongly connected components.

And in particular,

on extremely large graphs, which motivates the need for blazingly fast subroutines,

for-free primitives that will let you compute connectivity information.

So you'd love to be able to just know the pieces of a directed graph.

Maybe you don't even know why they're going to be useful but

you just compute them because why not?

It's a for-free primitive.

So, that's what I'm going to give you in this lecture.

So the algorithm that we're going to present is based on depth-first search.

And that's going to be the main reason why it's going to be so blazingly fast,

is because depth-first search is blazingly fast.

Now, you might be wondering,

what on earth does graph search have to do with computing components?

They don't seem obviously related.

So let's return to the same directed graph that I showed you on the previous slide.

5:17

And to see why something like depth-first search might conceivably have some use for

computing strong components.

Suppose we called depth-first search starting from this red node as

a starting point.

What would it would explore?

So remember what the guarantee of something like depth-first search, or

breadth-first search for that matter, is.

You find everything that's findable, but naturally nothing else.

So what is findable from this red node?

Where by findable I mean

you can reach it via a directed path emanating from this red node.

Well, there's not much you can do.

So from here you can explore this arc.

And you can explore this arc.

And then you can go backward.

And so, if you do DFS or BFS from this node,

you're going to find precisely the nodes in this triangle.

All of the other arcs involved go the wrong direction and

they won't be traversed by, say, a depth-first search call.

So, why is that interesting?

What's interesting is that if we invoke DFS from this red node, or

any of the three nodes from this triangle, then it's going to discover precisely this

strongly connected component, precisely the three nodes in this circled SCC.

So that seems really cool, seems like maybe we just do a DFS, and

boom we get an SCC.

So maybe if we can do that over and over again we'll get all the SCCs.

So that's a good initial intuition, but something can go wrong.

Suppose that instead of initiating DFS from one of these three nodes on

the triangle, we initiated it from this bottom-most node in green.

6:44

So remember, what is the guarantee of a graph search subroutine like DFS?

It will find everything findable but of course, nothing more.

So what's findable from this green node?

Well, naturally everything in its own SCC, right?

So the four nodes here, it'll certainly find those four nodes.

On the other hand, if we start from this green node, since there are arcs that

go from this bottom-most SCC to the right-most SCC.

Not only will this DFS call find the four nodes in the green node's strong

component, but it will also traverse these blue arcs and

discover the three nodes in the red triangle.

So, if we call DFS from this green node, we'll capture all seven of these.

So the point is, if we call DFS,

it looks like we're going to get a union of possibly multiple SCCs.

In fact, in the worst case,

if we invoke DFS from the leftmost node, what's it going to discover?

It's going to discover the entire graph.

And that didn't give us any insight into the strong component structure at all.

So the takeaway point is: if you call

DFS from just the right place, you'll actually uncover an SCC.

If you call it from the wrong place, it will give you no information at all.

So the magic of the algorithm that we're going to discuss next is that we'll have

this super-slick pre-processing step, which ironically is itself

a depth-first search.

We can in linear time compute exactly where we want to start the subsequent

depth-first searches from, so that each invocation

gets us exactly one strongly connected component and nothing more.

9:37

Now the naive way to implement this would be to literally construct

a new copy of the input graph with all the arcs in the reverse direction, and

then just run depth first search on it.

Of course, the sort of obvious optimization would be

to just run DFS on the original graph, but going across arcs backwards.

So I'll let you think through the details of how you'd do that, but that just works.

You run DFS, and instead of going forward along edges, you go backward along edges,

that simulates depth first search on the reverse graph.
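As a sketch of that optimization (my own illustration, assuming a dict-of-lists graph where every node appears as a key), you can build the in-neighbor lists once in linear time and then run an ordinary DFS over them, which is exactly DFS on the reverse graph without materializing a second graph object:

```python
def in_adjacency(out_adj):
    """One linear pass builds in-neighbor lists; DFS over these lists
    follows arcs backward, i.e. simulates DFS on the reverse graph."""
    in_adj = {u: [] for u in out_adj}
    for u, nbrs in out_adj.items():
        for v in nbrs:
            in_adj[v].append(u)
    return in_adj

def dfs_from(adj, start, explored):
    """Iterative DFS from start; marks and returns everything findable.
    Pass in_adjacency(g) as adj to search g's reversal."""
    stack, found = [start], []
    explored.add(start)
    while stack:
        u = stack.pop()
        found.append(u)
        for w in adj[u]:
            if w not in explored:
                explored.add(w)
                stack.append(w)
    return found
```

Note that building `in_adjacency` costs the same linear work as copying the reversed graph; the point is simply that no second copy of the node set is needed, and the same `explored` set is shared across calls in an outer loop.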

Now I've written here DFS-Loop, and that just means the usual trick to

make sure that you see all of the nodes of the graph even if it's disconnected:

you have an outer loop where you just try each starting point separately.

If you haven't already seen it, then you run DFS from that given node.

I'll be more detailed on the next slide.

10:27

Now at this point you should be thinking that I'm totally crazy, right, so

what are we trying to do?

We're trying to compute these strongly connected components.

We're trying to actually compute real objects, these maximal regions and

all I'm doing is searching the graph.

I do it once forward.

I do it once backward.

I mean it doesn't seem like I'm computing anything.

So here's the catch and it's a very minor catch.

So we're going to get you to do a little bit of bookkeeping, it's going to be very

little overhead so we'll still have a blazingly fast algorithm.

So but with a little bit of bookkeeping, here's what's going to happen.

The second depth-first search,

which searches the graph, will in its search process discover the components

one at a time in a very natural way.

And that will be really obvious when we do an example

which we'll do in just a second.

Now, for the second depth-first search to work in this magical way, where it

just discovers the connected components one at a time, it's really important that

it executes the depth-first searches in a particular order,

that it goes through the nodes of the graph in a particular order.

And that is exactly the job of the first pass.

The depth first search on reverse graph is going to compute an ordering of the nodes

which, when the second depth first search goes through them in that order,

it will just discover the SCCs one at a time.

In linear time.

12:41

So that's the algorithm at a high level.

It's really just two passes of DFS with some bookkeeping, but

this is underspecified.

You really couldn't implement the algorithm just yet.

So what do I owe you?

I owe you exactly what I mean by the DFS-Loop, although we've seen this more or

less in the past.

It's just a loop over all the vertices of the graph, and

if you haven't seen a node yet, you invoke DFS from that starting point.

I need to tell you what finishing times are and how they get computed.

They're just going to be integers 1 to n,

which basically record when depth-first search gets finished with each of the nodes.

And then I need to tell you how you compute these leaders.

So, let me tell you all three of those things on the next slide.

14:46

In our first depth-first search, the labeling is going to be totally arbitrary, so

these are basically just the names of the nodes, or their positions in the node array,

whatever; you just do it in some arbitrary order.

Now the second time we run DFS loop, as indicated on the previous slide,

we're going to use the finishing times as the labeling.

As we'll see, the finishing times are indeed numbers between 1 and n, so

now what we do is we just iterate through the nodes in decreasing order.

16:42

Now once this for loop has completed,

once we've examined every outgoing arc from i, then for each such node j,

either we already saw it in the past, or we've recursively explored from j and

have returned.

At that point, we call ourselves done with node i;

there are no more outgoing arcs to explore.

We think of it being finished, remember t is the global variable that's keeping

track of how many nodes we're done with, so

we increment t because now we're done with i.

And we also remember that i was the t-th vertex with which we finished.

17:14

That is, we set i's finishing time to be t.

Because depth-first search is guaranteed to visit every node exactly once,

and therefore finish with every node exactly once,

consider this global counter t: when the first node is finished it'll have value 1,

then when the next node gets finished it'll have value 2,

then it'll have value 3, and so on.

When the final node gets finished, it'll have value n.

So the finishing times of the nodes are going to be exactly the integers

from 1 to n.

Let's make this much more concrete by going through a careful example.

17:45

In fact, I think it'll be better for everybody if you,

yourself, traced through part of this algorithm on a concrete example.

So let me draw a nine node graph for you.

So to be clear, let's assume that we've already executed step one

of the algorithm, and we've already reversed the graph.

So that is, this blue graph that I've drawn on the slide, this is the reversal.

We've already reversed the arcs.

Moreover the nodes are labeled in some arbitrary way from 1 to 9.

Just assume these are how they show up in the node array for example and

remember in the DFS loop routine you're supposed to process the nodes from

18:20

top to bottom from n down to 1.

So my question for

you then is in the second step of the algorithm when we run DFS-Loop, and

we process the nodes from the highest name 9 in order down to the lowest name 1.

What are the finishing times that we're going to compute as we run DFS-Loop?

Now, it is true that you can get different finishing times depending on the different

choices that the DFS-Loop has to make about which outgoing arc to explore next.

But I've given you four options for what the finishing times of the nodes 1,2,3,

all the way up to 9, respectively, might be.

And only one of these four could conceivably be an output of the finishing

time of DFS loop on this graph, so which one is it?

All right so the answer is the fourth option,

that is the only one of these four sets of finishing times that you

might see being computed by DFS-Loop on this blue graph.

So let's go ahead and trace through DFS-Loop and

see how we might get this set of finishing times.

So remember in the main loop we start from the highest node 9 and

then we descend down to the lowest node 1.

So we start by invoking DFS from the node 9.

19:30

So we mark 9 as explored.

And then there's only one place we can go, we can go to 6.

So we mark 6 as explored.

Now there's two places we can go next,

we can either go to 3 or we can go to 8 and in general DFS could do either one.

Now to generate this fourth set of finishing times

I'm going to need to assume that I go to 3 first okay?

So again, what DFS does, what we're assuming it does, it starts at 9, and

it has to go to 6, it marks those as explored, then it goes to 3.

It does not go to 8 first; it goes to 3 first.

Now, from 3, there's only one outgoing arc which goes to 9, but 9,

we've already marked as explored.

So it's not going to re-explore 9, it's going to skip that arc.

Since that's 3's only outgoing arc, then that for loop completes, and

then 3 is the first node to finish.

So when we finish with 3, we increment t, it started at 0, now it's 1, and

we set the finishing time of 3 to be 1.

Just like we said it was in the example.

So, now we backtrack to 6.

Now we have another outgoing arc from 6 to explore, so now we go to 8.

From 8 we have to go to 2, from 2 we have to go to 5, from 5 we have to go to 8.

8 we've already seen, so then we're going to be done with 5,

because that was its only outgoing arc.

So then we increment t, now it's 2, and

the finishing time of 5 is going to be 2 as promised.

21:28

Now if we were computing those leaders all of these nodes would get the leader 9,

but again the leaders are only relevant for the second pass.

So we're just going to ignore the leaders as we're doing this trace, so

we're just going to keep track of finishing times.

So now, we're not done; all we did is finish the DFS that was invoked from

the node 9, and we found 6 of the nodes total in that depth-first search.

So now we return to the outer for loop and we decrement i.

So it started at 9, we're done with that, now we go down to 8.

We say, have we already seen 8, yes 8's already explored so we skip it.

We go, we decrement i down to 7, we say have we already seen node 7?

No we have not okay?

7 is not yet explored.

So we invoke DFS now from node 7. 7 has two outgoing arcs:

it can either go to 4 or it can go to 9.

Let's say it checks the outgoing arc to 9 first.

Now 9 we already explored.

Granted, that was an earlier part of the for loop, but we remember that.

We're going to keep track of who got explored on previous iterations of the for

loop so we don't bother to re-explore 9, so we skip that.

So now from 7 we have to go to 4, from 4 we have to go to 1,

from 1 we have to go back to 7.

7's already been explored, so we backtrack, and now we're done with 1.

So 1 is the next one we're completed with, and

the finishing time of 1 is going to be 7, as promised.

We backtrack to 4, there's no more outgoing arcs from 4 to explore,

so that's going to be the eighth one to finish.

22:49

As promised, and the last one to finish is poor node 7.

It is last.

So that would be an example of how the DFS-Loop subroutine computes finishing

times on a reversed graph.

So now, let's work through the second pass on the forward version of the graph using

the same example.

Now remember, the point of the first pass is to compute a magical ordering, and

the magical ordering is these finishing times.

So now we're going to throw out the original node names,

and we're going to replace the node names in blue by the finishing times in red.

We're also going to work with the original graph,

which means we have to reverse the arcs back to where they were originally.

So those are the two changes you're going to see when I redraw this graph.

First of all, all the arcs reverse orientation.

Second of all, all of nodes will change names from their original ones

to the finishing times that we just computed.

23:43

So here's our new graph with the new node names and

all of the arcs with their orientation reversed.

And now we run DFS again on this graph.

And again we're going to process the nodes in order from the highest label 9 down to

the lowest label 1.

Moreover, we don't need to compute finishing times in the second pass,

we only need to do that in the first pass.

In the second pass we have to keep track of the leaders, and remember the leader of

a vertex is the vertex from which DFS was called that first discovered that node.
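This leader bookkeeping in the second pass can be sketched like so (my own illustration, same dict-of-lists assumption as before; the recursive DFS carries along the source vertex s from which it was invoked):

```python
def dfs_loop_leaders(graph, nodes_by_decreasing_finish):
    """Second pass of Kosaraju's algorithm: process starting points in
    decreasing order of finishing time; every node first discovered by
    the DFS call initiated at s gets leader s."""
    explored, leader = set(), {}

    def dfs(i, s):
        explored.add(i)
        leader[i] = s   # s is where the current DFS was initiated
        for j in graph.get(i, []):
            if j not in explored:
                dfs(j, s)

    for i in nodes_by_decreasing_finish:
        if i not in explored:
            dfs(i, i)   # i becomes the new source vertex, i.e. the leader
    return leader
```

On the relabeled forward graph traced below, this assigns leader 9 to nodes 7, 8, 9, leader 6 to nodes 1, 5, 6, and leader 4 to nodes 2, 3, 4: one leader per SCC.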

All right, so what's going to happen?

Well, in the outer for loop, again, we start with i equal to 9, and

we invoke DFS from the node 9.

24:22

So, that's going to be the current leader because that's where

the current DFS got initiated.

Now, from 9, there's only one choice.

We have to go to 7.

From 7, there's only one choice, we have to go to 8.

From 8, there's only one choice, we have to go back to 9.

And then, 9's already been seen, so we backtrack.

We go back to 8, we go back to 7, we go back to 9, and that's it.

So, when we invoke DFS for node 9,

the only things that we encounter are, the nodes 7, 8, and 9.

And these are all going to be given the leader vertex 9.

You will notice that this is indeed one

of the strongly connected components of the graph.

We just sort of found it with this invocation of DFS from the node 9.

So, now we go back to the outer for loop.

And we say, okay, let's go to node 8, have we already seen 8?

Yes.

What about 7, have we already seen 7?

Yes.

What about 6?

Have we have already seen 6?

We have not, we have not yet discovered 6, so

we invoke DFS from node 6, and we reset the global source vertex s to 6.

From 6, we can go to 9, we can go to 1.

So, let's say we explore 9 first.

Well, we already saw 9 in an earlier iteration in the for loop, so

we don't explore it again, so, we don't discover 9 now, so we backtrack to 6.

We go to 1, from 1 we have to go to 5, and from 5 we have to go to 6,

and then we start backtracking again.

So, the only new nodes that we encounter when we invoke DFS

from the node 6 are the vertices 6, 1, and 5.

And all of these will have a leader vertex of

6 because that's where we called DFS from when we first discovered these 3 nodes.

And you'll notice, this is another SCC of this directed graph.

So, we invoked DFS again, now from a new node, the node 6.

And what it discovered,

the new nodes it discovered, were exactly an SCC of the graph.

Nothing more, nothing less.

So, now we return to the outer for loop, we go to node 5, have we already seen 5?

Yes.

Have we already seen 4?

No, we haven't seen 4 yet.

So, now we invoke DFS from 4.

Again, we could try to explore 5, but we've seen that before,

we're not going to explore it again.

So from 4 then, we have to go to 2, from 2 we have to go to 3,

from 3 have to go back to 4, and then, after all the backtracking, we're done.

So, the final call to DFS will be from the node 4.

26:40

And, that DFS will newly discover

precisely the nodes 2, 3, and 4.

They will all have the leader vertex 4 because that was where this DFS was

called from.

It's true we'll go back to the for loop and we'll check have we seen 3 yet?

Yes. Have we seen 2 yet?

Yes.

Have we seen 1 yet?

Yes, and then the whole thing completes.

And, what we see is that,

using the finishing times computed in that first depth-first search pass,

somehow the strongly connected components of this graph just showed up and

presented themselves to us, one at a time, on a silver platter.

Every time we invoked DFS, the nodes we newly discovered were

precisely one of the SCCs, nothing more, nothing less.

And, that's really what's going on in this algorithm;

it turns out this is true in general.

The first pass, DFS on the reverse graph, computes finishing times so

that if you then process the nodes in decreasing order of finishing time

in the second pass,

each invocation of DFS will discover exactly one new SCC.

So, they'll just present themselves to you,

one per DFS call in that second pass's for loop.

27:49

This is, of course, merely an example.

You should not just take a single example as proof that this algorithm always works.

I will give you a general argument in the next video.

But hopefully there is at least a plausibility argument.

No longer does this three-step algorithm seem totally insane.

And maybe you could imagine, perhaps it works.

At least there are some principles going on,

Where you first compute the right ordering to process the nodes, and

then the second pass peels off SCCs one at a time like layers from an onion.

One thing that I hope is pretty clear is that this algorithm, correct or

not, is blazingly fast.

28:26

Pretty much all you do is two depth-first searches.

And, since depth-first search, as we've seen in the past, runs in time linear in

the size of the graph, so does Kosaraju's two-pass algorithm.

There are a couple of subtleties, and I encourage you to think about them;

you'll be forced to think about them in the programming project for week four.

So, for example, in the second pass,

how do you process the nodes in decreasing order of finishing time?

You don't want to sort the nodes by their finishing time,

because that would take n log n time.

So, you need to make sure that you remember in the first pass,

that you sort of remember the nodes in a way that you can just do

a linear scan through them in a second pass.

So, there are some details, but, if your intuition is that this is really just

double DFS properly implemented, that's pretty much exactly right.
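Putting the two passes together, a sketch of the whole algorithm might look like the following (my own illustration; `kosaraju` and the variable names are mine, and the graph is again assumed to be a dict mapping every node to its out-neighbor list). In particular, it shows the linear-time trick just described: rather than sorting by finishing time, the first pass appends each node to a list the moment it finishes, and the second pass simply scans that list in reverse.

```python
from collections import defaultdict

def kosaraju(graph):
    """Two-pass SCC sketch: returns a dict mapping each node to its leader,
    so two nodes are in the same SCC iff they share a leader."""
    # Pass 1: DFS on the reverse graph, recording the finish order.
    rev = defaultdict(list)
    for u in graph:
        for v in graph[u]:
            rev[v].append(u)
    explored, order = set(), []

    def dfs1(u):
        explored.add(u)
        for w in rev[u]:
            if w not in explored:
                dfs1(w)
        order.append(u)          # u finishes: append instead of sorting later

    for u in graph:
        if u not in explored:
            dfs1(u)

    # Pass 2: DFS on the original graph, in decreasing finish order.
    explored.clear()
    leader = {}

    def dfs2(u, s):
        explored.add(u)
        leader[u] = s
        for w in graph[u]:
            if w not in explored:
                dfs2(w, s)

    for u in reversed(order):    # linear scan replaces an O(n log n) sort
        if u not in explored:
            dfs2(u, u)
    return leader
```

The recursive DFS is only for readability; on graphs with very long paths you'd rewrite it iteratively (or raise the recursion limit) to stay safe, but either way the total work is linear in the size of the graph.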