One of the most widely used and also one of the most widely abused techniques for

exploratory data analysis is clustering.

So the idea here behind clustering is, can we identify data points that

are close to each other and then cluster those together into groups somehow?

And so, the first thing that you need to know about this is that it is

incredibly popular.

And so for example, this is a very highly cited paper that

discusses how to apply hierarchical clustering to a set of data.

Before we can get into clustering though,

we need to be able to define how close two things are to each other.

And so as a simple example,

let's say we wanted to know the distance between Baltimore and Washington, DC.

So if we wanted to do that we could take the longitude measurements for

Washington DC and Baltimore and denote them by Y.

And the latitude measurements and denote them by X.

Then we could take the distance in longitude

just by taking their difference, and do the same for the distance in latitude.

And now we have two measures of the distance between Baltimore and DC,

if we wanted to combine them together,

one thing we could do is just add those two things up.

But it turns out sometimes this distance will be negative on the latitude and

positive on the longitude and they'll cancel each other out.

So you can square them.

That means that they'll never be negative, but

now they're no longer on the scale that we care about.

And so you can take the square root to get something like the distance between

Washington DC and Baltimore that you might care about.

This is called the Euclidean distance.

And so Euclidean distance can be generalized even if you have lots and

lots and lots of genes.

You can take every single gene, take the difference between the two samples for

that gene, square them, sum them up, and take the square root.

And you'll get the Euclidean distance.

That's a way to measure distance between points.
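
As a minimal sketch of the recipe just described (difference, square, sum, square root), here is one way to write it in Python. The function name and the expression values are made up for illustration; they are not from the lecture.

```python
import math

def euclidean_distance(a, b):
    """Square each per-gene difference, sum them, and take the square root."""
    if len(a) != len(b):
        raise ValueError("both samples need the same number of measurements")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical expression values for four genes in two samples.
sample_1 = [2.0, 5.0, 1.0, 7.0]
sample_2 = [4.0, 1.0, 1.0, 4.0]

print(euclidean_distance(sample_1, sample_2))
```

The same function works for two coordinates (like latitude and longitude) or for thousands of genes, since it only loops over matching positions.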

Another way, especially when you're dealing with binary data,

is to look at something like the Manhattan or taxicab distance.

So the best way to think of this intuitively is imagine that you have two

places in a city.

You have this building here, and this building here, and

you want to get between them, you have to drive along the blocks.

So what's the distance between them?

Well we can measure the total number of blocks that you have to go

in the east west direction and the total number of blocks that you have to go

in the north south direction and that gives you the distance.

So you can do that by taking the difference in their

east west locations and taking the absolute values and

the difference in their north south locations and taking the absolute value.

One interesting thing about this distance metric is that because everything is at a right

angle, as long as you follow a path along the blocks without backtracking, the blue, red,

and yellow paths will all give you the same distance between the two points.
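
A small sketch of the taxicab idea, assuming each building is given as a hypothetical (east-west block, north-south block) pair. It also checks that two different routes along the blocks have the same length, as long as neither one backtracks.

```python
def manhattan_distance(a, b):
    """Sum of the absolute differences in each direction."""
    return sum(abs(x - y) for x, y in zip(a, b))

def path_length(corners):
    """Total blocks travelled along a route given as a list of street corners."""
    return sum(manhattan_distance(p, q) for p, q in zip(corners, corners[1:]))

# Hypothetical building locations on a city grid.
building_a = (1, 2)
building_b = (4, 6)

# Two made-up routes: one goes east then north, the other zigzags.
route_1 = [(1, 2), (4, 2), (4, 6)]
route_2 = [(1, 2), (1, 4), (3, 4), (3, 6), (4, 6)]

print(manhattan_distance(building_a, building_b))
print(path_length(route_1), path_length(route_2))
```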

So now that we have a distance defined,

there's a couple of different ways that we can try to cluster points together.

So the first way is what is called hierarchical clustering.

And the basic idea is you start with the two nearest points, merge them together.

Then find the next two nearest points, merge them together and so forth.

So here's a really simple example.

Here I plotted some points with an X observation and a Y observation and

I want to look and see what are the clusters.

So you can kind of see from looking at the data right off that there's a cluster

down here in this corner, a cluster in this corner and maybe a cluster up here.

So the first thing that you do when doing hierarchical clustering is you find the two

points that are closest together and connect them.

In this case, it's points five and six.

So we draw a line between five and

six representing the distance between those two points.

The next thing that we need to do is find the next nearest distance.

But now it's a little bit tricky because points five and

six have sort of been merged together here.

So there are different ways that you can merge them together,

but one common way is to just take the average.

So you take the average y value and the average x value and

you get a new data point.

So now when I'm measuring the distance between seven and the cluster of five and

six, I measure the distance between seven and this center point.
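
To make the averaging step concrete, here is a minimal sketch. The coordinates for points five, six, and seven are invented for illustration, since the plotted values are not given here.

```python
import math

def centroid(points):
    """Average x value and average y value of a group of points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# Hypothetical coordinates; the real plot's values are not in the transcript.
point_5 = (1.0, 1.0)
point_6 = (1.0, 2.0)
point_7 = (4.0, 5.5)

# The merged cluster {5, 6} is represented by its center point.
center_5_6 = centroid([point_5, point_6])

# Distance from point seven to that center, not to five or six individually.
dist_7_to_cluster = math.hypot(point_7[0] - center_5_6[0],
                               point_7[1] - center_5_6[1])

print(center_5_6, dist_7_to_cluster)
```

Averaging is only one of the possible merging rules mentioned above; other rules would give different distances to the merged cluster.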

So if I do that, it turns out that the next two nearest points are points 10 and

11, which are also very close together and so I draw a connection between them.

And then I continue going along doing this, and

if I find that, say, I want to connect the cluster of points 5 and 6 with the cluster of points 10 and

11, then I would draw a connection between these two groups of points.

So if I do this, I get what's called a Cluster Dendrogram.
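
The merging procedure described above can be sketched as a short loop. This is an illustrative version only, assuming Euclidean distance and the "take the average" merging rule; the point coordinates are made up, and the function names are hypothetical. The list of merges it records, with the distance at each merge, is the information a dendrogram draws.

```python
import math
from itertools import combinations

def centroid(points):
    """Average of each coordinate across a group of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_merges(points):
    """Repeatedly merge the two closest clusters; return the merges in order.

    points maps a label to its coordinates. Each cluster is represented
    by the average of its members when measuring distances.
    """
    clusters = {(label,): [coords] for label, coords in points.items()}
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose center points are closest.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: euclidean(centroid(clusters[pair[0]]),
                                       centroid(clusters[pair[1]])),
        )
        d = euclidean(centroid(clusters[a]), centroid(clusters[b]))
        merges.append((a, b, d))
        # Replace the two clusters with a single merged cluster.
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
    return merges

# Hypothetical coordinates, chosen so 5 and 6 merge first, then 10 and 11.
points = {
    5: (1.0, 1.0),
    6: (1.0, 1.5),
    7: (2.0, 2.0),
    10: (8.0, 8.0),
    11: (8.0, 9.0),
}

merges = hierarchical_merges(points)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 2))
```

This brute-force version recomputes every pairwise distance at each step, which is fine for a handful of points but slow for large data sets.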

And so again, remember there were these three clusters we thought we

saw, and it turns out they appear here in the dendrogram.

And point eight was kind of

an outlier in the plot, and you can see it's kind of an outlier here as well.

So a couple of things to keep in mind about this dendrogram.

One is the distance between two points

is defined by the distance along the line from one to the other.

So you can see that the distance between two and

three is smaller than, say, the distance from two to four.

But it's a little bit hard to read because you have to sort of follow the line

all around to get the distance.

Another reason why that makes it a little bit hard to read is because here we have

this dendrogram that looks like this.

So we have a dendrogram that has three components to it,

you can imagine they are these three clusters.

And one thing that you could do is if you label these one, two, and three, you

could just rotate around the axis and end up with a dendrogram that looks like this.
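
A tiny sketch of why rotation does not change what the dendrogram says. It assumes a dendrogram represented as nested pairs, which is a simplification chosen for this example; the labels one, two, and three stand for the three clusters.

```python
def leaf_order(tree):
    """Left-to-right order of the leaves as they would be drawn."""
    if not isinstance(tree, tuple):
        return [tree]
    left, right = tree
    return leaf_order(left) + leaf_order(right)

def flip(tree):
    """Rotate the top node: swap its two branches."""
    left, right = tree
    return (right, left)

def merged_groups(tree):
    """The set of groups formed by the merges, ignoring drawing order."""
    if not isinstance(tree, tuple):
        return set()
    left, right = tree
    here = frozenset(leaf_order(tree))
    return {here} | merged_groups(left) | merged_groups(right)

tree = ((1, 2), 3)      # clusters 1 and 2 merge first, then 3 joins
rotated = flip(tree)    # same merges, drawn in a different order

print(leaf_order(tree), leaf_order(rotated))
print(merged_groups(tree) == merged_groups(rotated))
```

The drawn order of the leaves changes, but the groups that were merged stay the same, so how close two leaves sit side by side on the page is not a reliable measure of their distance.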