0:01

Let's talk about the Curse of Dimensionality.

The curse of dimensionality is a collective term.

It has to do with various phenomena that emerge when dealing with multi-dimensional data.

The easiest way to explain the curse of dimensionality is to give an example.

So, let's take a very simple dataset.

It has just one input attribute and a target attribute.

And the purpose in this dataset is to

classify my observations in one of two clauses, red or green.

So, my input attribute is a number of observations here,

I have X_1, X_2 and so on until X_n and in my case n is 20.

I have 20 observations and X is in the interval of 0-20.

Okay and for t,

the target class for each X. I also have a target t_1,

t_2 and so on until

t_n and t is in let's say, red and green.

And I'd like to train a classifier that can give me

the probability of let's say the target being of class g given X.

And a very basic way of doing this is to partition the input space into regions,

compute the conditional probability for

each region and then use these data to make predictions.

So, let me show you what I mean.

I will just put the data here.

I will partition the input space which is the interval from 0 to 20 into four regions.

One here, one here, one here, one here.

I will name them r_1, r_2,

r_3 and r_4 and then I will compute

the conditional probability for each of these regions.

So, I would say, what is the condition of probability

of the class being g, given region one.

And I can do this by just counting

number of observations here and seeing how many of them are green.

So, in this case,

if I look at the data I will see that I have

one observation of class green out of four observations in the region in total.

So, this is a probability of 0.25.

And I do the same for the second region.

So, what's the probability of the class being g in region two,

then I'm looking at these guys here and I have two green points out of three.

So, my probability is about 0.67.

And I do the same for the third region which is r_3 and I

have four out of seven which gives me a probability of about 0.57.

And the last region equals g for region four,

I have three out of six which is 0.5.

Now, if I get a new observation,

if I get some unseen data and let's say this data falls

into region four so the value of X is between 15 and 20,

I can use this data and say that the probability of

this point being of class green is 0.5 or 50 percent.

This is a very simple,

very easy way of computing probabilities and making predictions based on the input space.

The question is, what happens with this approach,

if instead of a single input attribute in my dataset?

Instead of a single input attribute here,

I actually have two input attribute. Something like this.

So, my dataset is no longer X_1,

X_2, X_3 and so on.

It's a vector of tuples so I have X_11, X_12, X_21,

X_22 and so on until X_n1 X_n2.

So, what happens? Well, I can try to do the same thing.

I can plot the data,

partition to space four by four,

which will give me four by four,

four partitions for each dimension will give me 16 regions.

And if I plot data,

you immediately see the problem.

Now that I have 16 regions,

first of all, I have some regions with a single data point.

So, I'm not entirely sure that I can actually trust the predictions from this region,

because obviously I don't have enough data there.

Another problem is that,

I actually have some regions where I don't have any data at all.

Like here, here, here and so on.

And this kind of makes my classifier undefined for these regions.

I don't actually know what to predict.

If the input data,

the unseen data falls within a region where I don't actually have any data.

So, this problem gets worse as you keep adding dimensions.

So, if we go into a dataset with three mentions and the same number of observations,

we now have something like,

if we get the data points,

we have something like four by four by four.

This would be 64 regions.

So, it's even worse.

And if you think about it,

the density of my data in the one dimensional case where I had one input attribute.

So, 1D let's say.

In the 1D case,

the density of my samples was something like 20 samples of the four regions equals five,

about five observations per region on average.

In the 2D case,

where I have the same number of observations but now I have four times four, 16 regions.

My sampling density is something like 1.25 average observations per region.

And in the 3D case,

where I have 20 of the 64,

I now have something like 0.31 observations per region.

So, the more dimensions I add the worse it gets.

And you could say,

"Well, can we then increase N,

can we increase the number of observations here and get better density."

Absolutely, yes we can.

So, we can increase N,

but let's see if I want to keep

the same sampling density then I have to make proportional.

So, N_1/D where is the dimensionality it will give me basically the proportions.

So, what I can do is I can say,

I have 20 observations

in one dimensional dataset and I want to get the same number of density.

So, how many observations?

X that's why unknown,

will be needed if I want to keep this in 3D space one of the three.

And if I sold for X what I will get is that X will be 8000.

So, if I want

to keep the same sampling density is in

the 1D case will increase the dimensions to three,

I will have to increase my N to 8000.

So, you can see that adding dimensions

also increases the number of observations that are needed to keep the sampling density.

And this number grows very quickly.

So, to summarize.

One aspect of the curse of dimensionality is the data becomes sparse we add dimensions.

The second problem is the distances lose meaning in higher dimensions, which means that,

if you define some kind of a distance measure between your data points,

if you have many dimensions and

two data points are very distance in one of the dimensions,

you just have a very tiny impact on the overall distance measure,

because you are averaging essentially across many dimensions.

So, distances tend of lose meaning.

And the curse of dimentionality is well recognized

problem and there are different techniques for alleviating it.

So, let's see what we can do.