0:00

Now, for the sake of variety let's leave neural networks alone for a moment

then come back to

our general diagram of clustering methods that we introduce to our lair.

In the last video we spoke about the K means

algorithm as an example of the so-called flat clustering,

where all points in the cluster are equal

as clusters do not have any further internal structure.

This time let's talk about hierarchical clustering methods,

where results in clusters do have such internal structure via sub-clusters,

sub sub-clusters and so on.

The first thing that should be assert here is

that nested hierarchical structures are very

common among many different complex physical,

biological and social systems.

Even now perception and information encoding

reflex the hierarchical structure of the outside world.

For example, we locate a formulae in the book by a chapter then sector then a paragraph,

or be defined in address of person by a country,

state, city, street and the house number.

All these are examples of nested hierarchical incogent of some data.

But even more interesting is the fact that hierarchical structure of

interactions in complex systems also affects the dynamics of such systems.

This was discussed at length in beautiful paper by Phil Anderson,

the winner of the Nobel Prize in Physics of 1977,

that he published in Science magazine in 1972.

I highly recommend you read the paper entitled,

The more is different,

especially if you are interested in

complex systems in natural sciences which is the main topic of the paper.

But Anderson also touches upon economics and finance at the very end of his paper.

I can't resist the temptation to quote this for you here.

Anderson said, "In closing,

I offered two examples from the economics of what I hope to have said.

Marx said that the quantitative differences become qualitative wants,

but the dialogue in Paris in 1920s some it up even more clearly. "

Fitzgerald, "The rich are different from us."

Hemingway, " Yes, they have more money."

So, after this elegant illustration of a link between

quantitative and qualitative difference in the society,

let's go back to our main theme that is finance.

So, what would be the reasons to consider

hierarchical structures or hierarchical model in confinements?

Well, one reason is that some financial instruments are hierarchical in nature.

For example, there are global indices such as

S&P 500 in the marketplace as well as sector indices or geographic based indices.

These are examples of hierarchical structures in market instruments and market data.

This is also reflected in how asset managers

and risk managers tend to think about portfolio risks.

They often define risk factors at the level

of separate industries and geographical sectors,

and then aggregate risks along both sectors and geographies,

to have a view of the composition of the total portfolio risk along these dimensions.

A way to model such Hierarchical structures of risk factors

in their classical finance is via nested hierarchical factor models,

where factors are observable and can be identified

with market provided values of global and sector indices.

As opposed to this approach,

here we follow a machine learning paradigm,

and try to learn hierarchical structures directly from

data instead of being imposed by some predefined models.

It rules out that hierarchical sector model

with independent sector set different levels of

the hierarchy can be extracted from

a hierarchical cluster followed by such machine or in approach.

One particular way to find

the hierarchical clustering structure is called agglomerative hierarchical cluster.

Here's how it works.

You start with each data point forming it's own cluster,

then you update your clusters by edging to each point its closest neighbor.

After that, you merge similar clusters.

Then you repeat all these steps again and so on until

convergence or until a needed number of clusters is found.

Now the algorithm that I just described is an example of a bottom-up greedy algorithm.

It's called bottom up because it starts with four points

being in their old clusters and gradually builds more general clusters.

It's called greedy because each cluster merger is only guaranteed to be

optimal locally for these given step but not necessarily for the final objective.

Now, it turns out that complexity of

agglomerative cluster in algorithms is in general rather

high of the order of N square times log of N. There are two special cases.

So, for agglomerative clustering that reduced the complexity to N square.

These methods are called single linkage and complete linkage algorithms.

I will explain them shortly.

So, what I described above is a very generic agglomerative algorithm.

To make it more specific,

we need to define the notion of distance between points and distance between clusters.

Let's start with the first one.

The choice of a metric defines the distance between

two points x1 and x2 in their multi-dimensional space.

The most popular choice in

machine learning is the Euclidean metric given in first line here.

And alternatively is to use its square,

hence we get their squared euclidean distance.

Now, instead of suming the squares of

differences of x1 and x2 along different dimensions,

we can take their absolute values.

This will be equivalent to using the so-called Manhattan distance,

named so in reference to the geometry of Manhattan,

with its system of stick with perpendicular streets and avenues.

Yet another alternative is to use the so-called maximum distance,

that would correspond to the maximum of absolute differences across different dimensions.

And finally, another interesting alternative is provided by the Mahalanobis distance,

which is the same as the euclidean distance,

but after a rotation of input vectors by some orthogonal metrics.

Such orthogonal metrics can even be made free parameter and Lorenc from the data.

This is called metric Lorenc and it constitutes another class of supervised Albania.

Now, after we discussed how we define the distance between individual points,

let's discuss how we can define distances between clusters of points.

I remind you that we need the measure of distance between

clusters in order to nourish different clusters in the agglomerative cluster method.

There exist different ways to define

the distance between two clusters or set of points A and B.

One way to do this would be to define the distance between two clusters

say the minimum of four pairwise distances between points in two clusters.

This is schematically shown in this diagram.

If we adopt this rule,

we will get the single linkage clustering,

which as we said before has a quadratic complexity in the number of points.

Another possible choices to take the maximum of

four pairwise distances instead of the minimum.

This corresponds to another specification of

agglomerative clustering called complete linkage clustering.

This method also has quadratic complexities similar to the single linkage method.

Yet another method is to take an average of

four pairwise distances instead of taking a minimum or maximum.

Such algorithm is expected to behave in some ways in between of these two cases.

This definition alludes to the ever ritually disgusting method.

In your homework you will have a chance to try

all three such specifications of

linkage on stock each on date and see the differences that they produce.

Now I want to show you how you can represent results of hierarchical clustering.

This is done using dendrogram.

Dendrogram is a binary tree that represents

the process of merging sub-clusters in hierarchical clusters.

Steps in drawing of dendrogram corresponds to steps in construction hierarchical cluster.

You start with the bottom up with the bottom most leaves which in

this case represent individual items in a data set.

Sub-clusters that are nurtured are connected by lines where the height of

the vertical line measures the amount of dissimilarity between different sub-clusters.

Therefore the y axis on this graph measures

the dissimilarity between different sub-clusters.

Now, how many clusters do we have in

agglomerative clustering really depends on where we draw horizontal line on this diagram.

Such line corresponds to resolution with which we look at our hierarchical clusters.

If we draw such a line exactly at the level of the X-axis,

then we will get all the original data points as separate clusters.

If on the other hand we draw it at the very top of

the Dendrogram we will have only two clusters.

For any intermediate level so dissimilarity between difference sub-clusters,

we may have some intermediate numbers of resulting clusters.

So, the number of clusters would be an output of agglomerative clustering methods.

And this is very different from the K means clustering which

requires to specify the number K of clusters beforehand.

At this point let's check what we learned so far and

then move on with implications of clustering methods to finance..