The data used in machine learning processes often have many variables.
This is what we call high-dimensional data.
These dimensions may or may not matter in the context of our application and
the questions we are asking.
Reducing such high-dimensional data to a more manageable set of related and
useful variables improves the performance and accuracy of our analysis.
After this video, you will be able to explain what dimensionality reduction is,
discuss the benefits of dimensionality reduction, and
describe how PCA transforms your data.
The number of features or variables you have in your data set determines
the number of dimensions or dimensionality of your data.
If your dataset has two features, then it is two-dimensional data.
If it has three features, then it is three-dimensional, and so on.
You want to use as many features as possible to capture the characteristics of
your data, but
you also don't want the dimensionality of your data to be too high.
As the dimensionality increases, the problem space you are looking at grows,
requiring substantially more samples to adequately cover that space.
So as the dimensionality increases,
the space that you are looking at grows exponentially.
As the space grows data becomes increasingly sparse.
In this diagram we see how the problem space grows as
the dimensionality increases from 1 to 2 to 3.
In the left plot, we have a one-dimensional space partitioned into four
regions, each 5 units in size.
The middle plot shows a two-dimensional space partitioned into regions of 5 by 5 units.
The number of regions has now gone from 4 to 16.
In the third plot, the problem space is three-dimensional, partitioned into regions of 5 by 5 by 5 units.
The number of regions has increased even more, to 64.
We see that as the number of dimensions increases, the number of regions
increases exponentially and the data becomes increasingly sparse.
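To make that growth concrete, here is a tiny Python sketch (not part of the lecture's diagram) that simply counts the regions when each dimension is split into four 5-unit intervals:

```python
# Number of regions when each dimension is partitioned into four 5-unit intervals.
cells_per_dimension = 4

for num_dimensions in (1, 2, 3):
    print(f"{num_dimensions}D: {cells_per_dimension ** num_dimensions} regions")

# Prints: 1D: 4 regions, 2D: 16 regions, 3D: 64 regions
```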
With a small dataset relative to the problem space, analysis results degrade.
In addition, certain calculations used in analysis become much more difficult to
define and calculate effectively.
For example, distances between samples are harder to compare since all samples
are far away from each other.
All of these challenges represent the difficulty of dealing with high
dimensional data and are referred to as the curse of dimensionality.
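As a rough illustration of the distance problem, here is a small Python sketch, assuming uniformly distributed random points and Euclidean distance, that shows how the relative spread between the closest and farthest pairs of samples shrinks as the number of dimensions grows, so "near" and "far" become harder to tell apart:

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(100, d))            # 100 random samples in d dimensions
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))     # pairwise Euclidean distances
    upper = dists[np.triu_indices(100, k=1)]       # each pair counted once
    spread = (upper.max() - upper.min()) / upper.min()
    print(f"d={d:4d}  relative spread of pairwise distances: {spread:.2f}")
```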
To avoid the curse of dimensionality,
you want to reduce the dimensionality of your data.
This means finding a smaller subset of features that can effectively capture
the characteristics of your data.
This reduces the dimensionality of the data while eliminating irrelevant features,
making the subsequent analysis simpler.
A technique commonly used to find the subset of most important dimensions is
called principal component analysis, or PCA for short.
The goal of PCA is to map the data from the original high dimensional space
to a lower dimensional space that captures as much of the variation in
the data as possible.
In other words,
PCA aims to find the most useful subset of dimensions to summarize the data.
Here, we have data samples in a two dimensional space that is defined
by the x axis and the y axis.
You can see that most of the variation in the data lies along the red diagonal line.
This means that the data samples are best differentiated along this dimension
because they're spread out, not clumped together along this dimension.
This dimension, indicated by the red line, is the first principal component,
labeled as PC1 in the plot.
It captures the largest amount of variance along a single dimension in the data.
PC1, indicated by the red line, does not correspond to either axis.
The next principal component is determined by looking in the direction that is
orthogonal, in other words perpendicular, to the first principal component and
captures the next largest amount of variance in the data.
This is the second principal component PC2 and
it's indicated by the green line in the plot.
This process can be repeated to find as many principal components as desired.
Note that the principal components do not align with either the x-axis or
the y-axis.
And that they are orthogonal, in other words, perpendicular to each other.
This is what PCA does.
It finds the underlying dimensions, the principal
components, that capture as much of the variation in the data as possible.
These principal components form a new coordinate system to transform
the data to, instead of the conventional dimensions like x, y, and z.
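As a rough sketch of this idea, the following Python example uses synthetic correlated 2-D data (not the plot from the lecture) and the eigendecomposition of the covariance matrix, which is one standard way to compute principal components; notice that PC1 and PC2 come out orthogonal and align with neither original axis:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic correlated 2-D data: most variation lies along a diagonal direction.
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.3, size=500)
data = np.column_stack([x, y])

centered = data - data.mean(axis=0)               # PCA works on mean-centered data
cov = np.cov(centered, rowvar=False)              # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh returns orthonormal eigenvectors

# Sort from largest to smallest variance: first column becomes PC1, second PC2.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("PC1 direction:", eigenvectors[:, 0])       # aligns with neither the x nor the y axis
print("PC2 direction:", eigenvectors[:, 1])       # orthogonal to PC1
print("fraction of variance captured:", eigenvalues / eigenvalues.sum())
```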
So how does PCA help with dimensionality reduction?
Let's look again at this plot with the first principal component.
Since the first principal component captures most of the variation in
the data, the original data samples can be mapped to this dimension, indicated by
the red line, with minimal loss of information.
In this case then, we map a two-dimensional dataset to
a one-dimensional space while keeping as much information as possible.
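A minimal sketch of that two-dimensional to one-dimensional mapping, using synthetic data and scikit-learn's PCA with a single component, might look like this:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 0.8 * x + rng.normal(scale=0.3, size=500)
data = np.column_stack([x, y])           # 500 samples in 2 dimensions

pca = PCA(n_components=1)                # keep only the first principal component
data_1d = pca.fit_transform(data)        # shape (500, 1): one coordinate per sample

print("original shape:", data.shape)     # (500, 2)
print("reduced shape:", data_1d.shape)   # (500, 1)
print("variance retained:", pca.explained_variance_ratio_[0])
```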
Here are some main points about principal component analysis.
PCA finds a new coordinate system for your data,
such that the first coordinate, defined by the first principal
component, captures the greatest variance in your data.
The second coordinate, defined by the second principal component, captures
the second greatest variance in the data, and so on.
The first few principal components that capture most of the variance in the data
can be used to define a lower-dimensional space for your data.
PCA can be a very useful technique for dimensionality reduction,
especially when working with high-dimensional data.
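In practice, a common way to choose the lower dimension is to keep the first few components whose cumulative explained variance crosses some threshold. Here is a hedged sketch using a hypothetical 10-feature dataset and a 95% threshold:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical dataset: 300 samples with 10 correlated features.
data = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))

pca = PCA().fit(data)                                    # fit all 10 components
cumulative = np.cumsum(pca.explained_variance_ratio_)    # running total of variance explained
n_keep = int(np.searchsorted(cumulative, 0.95)) + 1      # smallest number of components reaching 95%

print("cumulative explained variance:", np.round(cumulative, 3))
print("components needed for 95% of the variance:", n_keep)
```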
While PCA is a useful technique for reducing the dimensionality of your
data, which can help with downstream analysis,
it can also make the resulting analysis and models more difficult to interpret.
The original features in your data set have specific meanings such as income,
age and occupation.
By mapping the data to a new coordinate system defined by principal components,
the dimensions in your transformed data no longer have natural meanings.
This should be kept in mind when using PCA for dimensionality reduction.