In cluster analysis, variables with large values

contribute more to the distance calculations.

Variables measured on different scales should be standardized

prior to clustering, so

that the solution is not driven by variables measured on larger scales.

We use the following code to standardize the clustering variables to have a mean of

0, and a standard deviation of 1.

To standardize the clustering variables we will first create a copy of the cluster

data frame and name it clustervar, then we use the preprocessing.scale function to

transform the clustering variables to have a mean of 0 and a standard deviation of 1.

First, we list the name of the clustering variable, then an equal sign, and

preprocessing.scale, then in parentheses we type the name of our variable again and

add .astype, and then in another set of parentheses, float64 in quotes.

Astype float64 ensures that our clustering variables have a numeric format,

and we will do this for all the clustering variables,
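The standardization step described above can be sketched as follows. This is a minimal illustration, not the course's actual dataset: the data frame name cluster and the variable names income and age are hypothetical stand-ins for the clustering variables.

```python
import pandas as pd
from sklearn import preprocessing

# hypothetical cluster data frame; variable names are assumed for illustration
cluster = pd.DataFrame({'income': [25000.0, 48000.0, 61000.0, 39000.0],
                        'age': [23, 45, 52, 31]})

# create a copy of the cluster data frame named clustervar
clustervar = cluster.copy()

# standardize each clustering variable to have mean 0 and standard deviation 1;
# .astype('float64') ensures the variable has a numeric format
clustervar['income'] = preprocessing.scale(clustervar['income'].astype('float64'))
clustervar['age'] = preprocessing.scale(clustervar['age'].astype('float64'))
```

After this step each column of clustervar has a mean of 0 and a standard deviation of 1, so no single variable dominates the distance calculations.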

then we will use the train_test_split function in the sklearn cross_validation

library to randomly split the clustering variable data set into a training data set

consisting of 70% of the total observations, and

a test data set consisting of the other 30% of the observations.

First we'll type the name of the data set which we'll call clus_train followed

by the name of the test data set which we'll call clus_test,

then we type the function name,

train_test_split, and in parentheses we type the name of the full

standardized cluster variable data set which we called clustervar.

The test_size option tells Python to randomly place 0.3,

that is 30% of the observations in the test data set that we named clus_test.

By default the other 70% of the observations are placed in

the clus_train training dataset.

The random_state option specifies a random number seed

to ensure that the data are randomly split the same way if we run the code again.
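A minimal sketch of the random split described above. Note that in current scikit-learn the cross_validation module has been removed and train_test_split lives in sklearn.model_selection, so the sketch imports it from there; the data frame here is synthetic stand-in data, since the real clustervar comes from the earlier standardization step.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split  # modern home of train_test_split

# synthetic stand-in for the standardized clustering variables (names assumed)
rng = np.random.default_rng(0)
clustervar = pd.DataFrame(rng.normal(size=(200, 3)), columns=['v1', 'v2', 'v3'])

# test_size=.3 places 30% of observations in clus_test; the other 70% go to
# clus_train by default; random_state fixes the seed so the split is repeatable
clus_train, clus_test = train_test_split(clustervar, test_size=.3, random_state=123)
```

With 200 observations this yields 140 rows in clus_train and 60 in clus_test, and re-running the code with the same random_state reproduces the same split.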

Now we are ready to run our cluster analysis. Because

we don't know how many clusters actually exist in the population,

we will run the cluster analysis for a range of values on the number of clusters.

Before we begin, we'll import

the cdist function from the scipy.spatial.distance library.

In this example we will use it to calculate the average distance of

the observations from the cluster centroids.

Later, we can plot this average distance measure to help us figure

out how many clusters may be optimal, then we will create an object called clusters

that will include numbers in the range between 1 and 10.

We will use this object when we specify the number of clusters we want to test,

which will give us the cluster solutions for k equals 1 to k equals 9 clusters.

In the next line of code we create an object called meandist

that will be used to store the average distance values that we will calculate for

the 1 to 9 cluster solutions.

The for k in clusters: code tells Python to run the cluster analysis code below for

each value of k in the clusters object.
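The loop described above can be sketched as follows, assuming a k-means cluster analysis (as in the rest of this lesson) and a standardized training data set clus_train; here clus_train is replaced with synthetic stand-in data so the sketch runs on its own, and the n_init and random_state settings are illustrative choices, not part of the original code.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# synthetic stand-in for the standardized training data set clus_train
rng = np.random.default_rng(123)
clus_train = rng.normal(size=(150, 4))

# clusters holds the numbers 1 through 9, i.e. the k values we want to test
clusters = range(1, 10)

# meandist stores the average distance value for each of the 1 to 9 cluster solutions
meandist = []

for k in clusters:
    model = KMeans(n_clusters=k, n_init=10, random_state=123)
    model.fit(clus_train)
    # cdist gives each observation's distance to every centroid; taking the
    # minimum over centroids and averaging gives the mean distance of the
    # observations from their nearest cluster centroid
    meandist.append(
        np.mean(np.min(cdist(clus_train, model.cluster_centers_, 'euclidean'),
                       axis=1)))
```

Plotting meandist against clusters then gives the elbow-style curve used to help figure out how many clusters may be optimal: the average distance shrinks as k grows, and a bend in the curve suggests a reasonable number of clusters.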