One of the most widely used, and also one of the most widely abused, techniques for
exploratory data analysis is clustering.
So the idea behind clustering is, can we identify data points that
are close to each other and then group those together somehow?
And so, the first thing that you need to know about this is that it is
incredibly popular.
And so for example, this is a very highly cited paper that
discusses how to apply hierarchical clustering to a set of data.
Before we can get into clustering though,
we need to be able to define how close two things are to each other.
And so as a simple example,
let's say we wanted to know the distance between Baltimore and Washington, DC.
So if we wanted to do that, we could take the longitude measurements for
Washington, DC and Baltimore and denote them by Y,
and the latitude measurements and denote them by X.
Then we could get the distance in longitude
just by taking the difference, and the same for the distance in latitude.
And now we have two measures of the distance between Baltimore and DC.
If we wanted to combine them together,
one thing we could do is just add those two things up.
But it turns out sometimes this distance will be negative on the latitude and
positive on the longitude and they'll cancel each other out.
So you can square them.
That means that they'll always be positive distances, but
now they're not necessarily on the scale that we care about.
And so you can take the square root to get something like the distance between
Washington, DC and Baltimore that you might actually care about.
This is called the Euclidean distance.
And so the Euclidean distance generalizes even if you have lots and
lots of genes.
For every single gene, you take the difference between the two samples for
that gene, square it, sum those squared differences up, and take the square root.
And you'll get the Euclidean distance.
That's a way to measure distance between points.
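To make that concrete, here's a minimal sketch in Python, with made-up expression values standing in for the genes:

```python
import numpy as np

# Made-up expression values for the same genes measured in two samples.
sample_a = np.array([2.1, 0.5, 3.3, 1.0])
sample_b = np.array([1.9, 1.5, 2.8, 0.2])

# Euclidean distance: per-gene differences, squared, summed, square-rooted.
d = np.sqrt(np.sum((sample_a - sample_b) ** 2))
print(d)  # equivalently: np.linalg.norm(sample_a - sample_b)
```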
Another way, especially when you're dealing with binary data,
is to look at something like the Manhattan or taxicab distance.
So the best way to think of this intuitively is to imagine that you have two
places in a city, this building here and this building here, and
to get between them you have to drive along the blocks.
So what's the distance between them?
Well, we can measure the total number of blocks that you have to go
in the east-west direction and the total number of blocks that you have to go
in the north-south direction, and that gives you the distance.
So you can compute that by taking the difference in their
east-west locations and taking the absolute value, the difference in their
north-south locations and taking the absolute value, and adding those together.
One interesting thing about this distance metric is that, because everything is at
a right angle, any path you follow along the blocks, the blue, red,
and yellow paths here, will give you the same distance between the two points.
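Here's the same idea as a small sketch, with made-up block coordinates for the two buildings:

```python
import numpy as np

# Made-up block coordinates (east-west, north-south) for two buildings.
building_1 = np.array([1, 8])
building_2 = np.array([6, 2])

# Manhattan distance: absolute difference along each axis, then sum.
d = np.sum(np.abs(building_1 - building_2))
print(d)  # 5 blocks east-west + 6 blocks north-south = 11
```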
So now that we have a distance defined,
there are a couple of different ways that we can try to cluster points together.
So the first way is what's called hierarchical clustering.
And the basic idea is you start with the two nearest points, merge them together.
Then find the next two nearest points, merge them together and so forth.
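Before the worked example, here's a minimal pure-Python sketch of that greedy merge loop, just for intuition. In practice you'd use a library routine; also, the merge rule here, averaging the two merged clusters, is the simplified averaging rule this lecture describes below, not the only option:

```python
import numpy as np

def agglomerate(points):
    """Sketch of agglomerative clustering: repeatedly merge the two
    closest clusters, representing each merged cluster by the average
    of the two things being merged."""
    clusters = [np.asarray(p, dtype=float) for p in points]
    labels = [str(i) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(clusters[i] - clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((labels[i], labels[j], d))
        merged = (clusters[i] + clusters[j]) / 2  # the "average" merge rule
        keep = [k for k in range(len(clusters)) if k not in (i, j)]
        clusters = [clusters[k] for k in keep] + [merged]
        labels = [labels[k] for k in keep] + [f"({labels[i]}+{labels[j]})"]
    return merges  # merge order and heights: the skeleton of a dendrogram
```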
So here's a really simple example.
Here I've plotted some points with an X observation and a Y observation, and
I want to look and see what the clusters are.
So you can kind of see from looking at the data right off that there's a cluster
down here in this corner, a cluster in this corner and maybe a cluster up here.
So the first thing that you do when doing hierarchical clustering is find the two
points that are closest together and connect them.
In this case, it's points five and six.
So we draw a line between five and
six representing the distance between those two points.
The next thing that we need to do is find the next nearest distance.
But now it's a little bit tricky because points five and
six have sort of been merged together here.
So there are different ways that you can merge them together,
but one common way is to just take the average.
So you take the average y value and the average x value and
you get a new data point.
So now when I'm measuring the distance between seven and the cluster of five and
six, I measure the distance between seven and this center point.
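To make that averaging step concrete, here's a small sketch with made-up coordinates for points 5, 6, and 7:

```python
import numpy as np

# Made-up (x, y) coordinates for points 5, 6, and 7 in the example.
p5 = np.array([1.0, 1.2])
p6 = np.array([1.1, 1.4])
p7 = np.array([1.8, 2.5])

# Merge rule: represent the 5-6 cluster by its average (x, y) point.
cluster_56 = (p5 + p6) / 2

# The distance from point 7 to the merged cluster is measured to that center.
print(np.linalg.norm(p7 - cluster_56))
```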
So if I do that, it turns out that the next two nearest points are points 10 and
11, which are also very close together and so I draw a connection between them.
And then I continue going along like this. If at some point the two nearest
clusters are, say, the cluster of points 5 and 6 and the cluster of points 10 and
11, then I draw a connection between those two groups of points.
So if I do this, I get what's called a Cluster Dendrogram.
And so again, remember there were these three clusters we thought we
saw, and it turns out they appear here in the dendrogram.
And point eight, which was kind of
an outlier in the plot, you can see is kind of an outlier here as well.
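If you want to reproduce this kind of picture, here's a sketch using SciPy on simulated three-cluster data. The coordinates are made up, and scipy's method="centroid" is close in spirit to (though not identical with) the merge-by-averaging rule described above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Simulated (x, y) data with three loose clusters, roughly like the example.
rng = np.random.default_rng(1)
pts = np.vstack([
    rng.normal([1.0, 1.0], 0.2, (4, 2)),
    rng.normal([1.0, 3.0], 0.2, (4, 2)),
    rng.normal([3.0, 2.0], 0.2, (4, 2)),
])

# Merge clusters by their centers, then plot the resulting tree.
Z = linkage(pts, method="centroid")
dendrogram(Z)
plt.title("Cluster Dendrogram")
plt.show()
```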
So a couple of things to keep in mind about this dendrogram.
One is that the distance between two points
is defined by the distance along the tree's lines from one to the other.
So you can see that points two and
three are closer together than, say, points two and four.
But it's a little bit hard to read because you have to sort of follow the line
all around to get the distance.
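If you'd rather read those tree distances programmatically than by eye, SciPy exposes them as cophenetic distances. A sketch, again with made-up data (method="average" here is just one common merge rule):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# Six made-up points, just to have a tree to read distances off of.
rng = np.random.default_rng(0)
pts = rng.normal(size=(6, 2))

Z = linkage(pts, method="average")

# The tree distance between two leaves is the height at which they first
# merge into the same cluster (the "cophenetic" distance).
coph = squareform(cophenet(Z))
print(coph[1, 2], coph[1, 3])  # compare, e.g., leaves 1-2 versus 1-3
```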
Another reason it's a little bit hard to read is this: here we have
a dendrogram that has three components to it,
and you can imagine they correspond to these three clusters.
If you label these one, two, and three, you could just rotate the branches
around any merge point and end up with a dendrogram that looks like this.
The clustering is exactly the same, so the left-to-right ordering of
the points doesn't carry any meaning on its own.