The split criterion used by PROC HPFOREST is the Gini index.
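As a reminder of what the Gini index measures, here is a minimal sketch in plain Python (the function names are my own, not part of HPFOREST): impurity is 0 for a pure node, 0.5 at worst for a binary target, and the forest favors the candidate split with the lowest weighted child impurity.

```python
def gini(counts):
    """Gini impurity of a node, given class counts, e.g. [smokers, nonsmokers].
    0 means a pure node; 0.5 is the maximum for a binary target."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(left, right):
    """Weighted impurity of a candidate split into two child nodes.
    A tree prefers the split with the lowest value."""
    n = sum(left) + sum(right)
    return (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)

print(gini([10, 0]))               # 0.0 (pure node)
print(gini([5, 5]))                # 0.5 (50/50 node)
print(gini_split([8, 2], [2, 8]))  # lower than the parent's 0.5
```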

In terms of missing data, if the value of our target or

response variable is missing, the observation is excluded from the model.

If the value of an explanatory variable is missing,

PROC HPFOREST uses the missing value as a legitimate value by default.

Notice, too, that the number of observations read from my data set was

6,504, while the number of observations used was 6,500.

Within the baseline fit statistics output,

you can see that the misclassification rate of the random forest is displayed.

Here we see that the forest misclassified 19.8% of the sample,

suggesting that the forest correctly classified the remaining 80.2%.

Now I'll show the first ten and last ten observations of the fit statistics table.

PROC HPFOREST computes fit statistics for

a sequence of forests that have an increasing number of trees.

As the number of trees increases, the fit statistics usually improve; that is,

they decrease at first, then level off and fluctuate within a small range.
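Why the error drops quickly for the first trees and then flattens can be seen with a toy majority-vote simulation in plain Python. This is an idealized illustration, not HPFOREST: it assumes the individual classifiers are independent, which real trees grown on the same data are not.

```python
import random

random.seed(0)

def ensemble_error(n_trees, p_correct=0.7, n_cases=2000):
    """Misclassification rate of a majority vote over n_trees simulated
    classifiers, each independently correct with probability p_correct."""
    errors = 0
    for _ in range(n_cases):
        correct_votes = sum(random.random() < p_correct for _ in range(n_trees))
        if correct_votes <= n_trees // 2:  # majority voted for the wrong class
            errors += 1
    return errors / n_cases

# Error falls steeply with the first trees, then levels off near zero.
for n in (1, 11, 51, 101):
    print(n, ensemble_error(n))
```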

Forest models provide an alternative estimate of average squared error and

misclassification rate, called the out-of-bag (OOB) estimate.

The OOB estimate is a convenient substitute for one based on test data, and

it is a less biased measure of how the model will perform on future data.

We end up with near-perfect prediction in the training sample as the number of

trees grown approaches 100.

When those same models are tested on the out-of-bag sample,

the misclassification rate is around 16%.
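The logic behind the OOB estimate: each tree is grown on a bootstrap sample, n draws with replacement from the n observations, so each tree never sees roughly a third of the data, and those held-out observations serve as a built-in test set for that tree. A small plain-Python simulation (the 6,500 simply mirrors the observation count above; this is not HPFOREST code):

```python
import random

random.seed(1)

n = 6500     # observations used, matching the output above
trees = 20   # number of bootstrap samples to simulate

oob_fractions = []
for _ in range(trees):
    # Grow one "tree": a bootstrap sample of n draws with replacement.
    in_bag = {random.randrange(n) for _ in range(n)}
    # Every observation never drawn is out of bag for this tree.
    oob_fractions.append((n - len(in_bag)) / n)

# Averages close to (1 - 1/n)**n, i.e. about 1/e = 0.368 of the data per tree.
print(sum(oob_fractions) / trees)
```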

The final table in our output represents arguably the largest contribution of

random forests: the variable importance rankings.

The number of rules column shows the number of splitting rules

that use a variable.

Each measure is computed twice, once on the training data and

once on the out-of-bag data.

As with the fit statistics, the out-of-bag estimates are less biased.

The rows are sorted by the out-of-bag (OOB) Gini measure.

The variables are listed from highest importance to lowest importance

in predicting regular smoking.

In this way, random forests are sometimes used as a data reduction technique,

where variables are chosen in terms of their importance to be

included in regression and other types of future statistical models.

Here we see that some of the most important variables in predicting regular

smoking include marijuana use, alcohol use, race,

cigarette availability in the home, cocaine use, deviant behavior, etc.
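Impurity-based importance of the kind shown in this table can be illustrated on toy data: for each variable, measure how much splitting on it reduces the Gini impurity of the target. A plain-Python sketch with made-up data echoing the lecture's variables (an illustration only; the HPFOREST computation aggregates over many trees and uses the OOB data):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 target labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def importance(feature, labels):
    """Impurity decrease from splitting on one binary feature:
    parent impurity minus the weighted impurity of the two children."""
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x == 1]
    n = len(labels)
    return gini(labels) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# Toy data: 'marijuana' tracks regular smoking closely; 'noise' does not.
smoking   = [1, 1, 1, 1, 0, 0, 0, 0]
marijuana = [1, 1, 1, 0, 0, 0, 0, 0]
noise     = [1, 0, 1, 0, 1, 0, 1, 0]

features = {'marijuana': marijuana, 'noise': noise}
ranking = sorted(features, key=lambda v: importance(features[v], smoking),
                 reverse=True)
print(ranking)  # ['marijuana', 'noise']
```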

To summarize, like decision trees, random forests are a type of data mining

algorithm that can select, from among a large number of variables,

those that are most important in determining the target or

response variable to be explained.

Also, like decision trees,

the target variable in a random forest can be categorical or quantitative.

And the group of explanatory variables can be categorical or

quantitative, or any combination.

Unlike decision trees, however,

the results of random forests generalize well to new data

since the strongest signals are able to emerge through the growing of many trees.

Further, small changes in the data do not substantially change the results of random forests, as they can with a single tree.

In my opinion, the main weakness of random forests is simply that results

are somewhat less satisfying, since no trees are actually interpreted.

Instead, the forest of trees is used to rank the importance of variables

in predicting the target.

Thus, we get a sense of the most important predictive variables,

but not their relationship to one another.
