The method that we're looking for, for distance based phylogeny construction,

is called the "Neighbor-Joining Algorithm", which was introduced in 1987.

It's a result so fundamental to bioinformatics, in fact,

that it has been cited 30,000 times, and

it's one of the top 20 most cited papers over all scientific fields.

So first, to see how this works,

define the neighbor-joining matrix of a given distance matrix, D,

as the matrix D*, whose (i,j)-th entry is given by the formula shown on the slide.

So, let me give you an example.

Here's the matrix that we encountered before,

where j and k corresponded to the minimum element of the matrix, but j and

k were not neighbors in the tree that fit this matrix.

The total distance of a leaf node is just the sum of distances from this leaf to

all other leaves.

So for example, the total distance of k in this matrix is

21 plus 12 plus 13, or 46.

We will then define D*(i,j) by taking n minus 2 times D(i,j) and

then subtracting the total distance of i, and subtracting the total distance of j.

In this case, n, the number of leaves or

the number of rows in our distance matrix, is equal to 4.

So D*(k,l)is equal to 2 times D(k,l),

which is 13, minus 46, minus 48,

which gives us a value of negative 68 for D*(k,l).

Now, if it's obvious to you that defining this crazy matrix D* was

going to be a productive endeavor, then you are probably far smarter than I am.

Like the Burrows-Wheeler transform, which we encounter in another chapter,

which makes matching multiple patterns to a text a lightning-fast endeavor,

I think this matrix D* is black magic.

Whereas taking a minimum element of the original distance matrix

did not guarantee that we were going to find neighbors,

taking a minimum element of the neighbor-joining matrix, D*,

does find a pair of neighbors.

You can find a proof of this theorem in a detour in the text.

It's not a short proof.

In fact, although D* was introduced in 1987,

it took a decade before a proof of this theorem was found.

Also, the Neighbor-Joining Theorem is part of the reason why I led you down

the wrong path in the first place when we first attempted to

construct a phylogeny from a distance matrix.

Now, if we use D* instead of D, then we will be able to successfully

identify a pair of neighbors from a minimum element of the matrix.

So let's walk through the neighbor-joining algorithm.

First, we construct the matrix D*, and I'll move that up from the previous slide.

So there are two pairs of leaves that correspond to a minimum element of D*.

i and j must be neighbors, and k and l must be neighbors, as well.

Let's work with say, i and j.