For an example of logistic regression,

we're going to use the urine data set from the boot package in R.

First, we'll need to load the boot package.

And if it is not already installed, you'll have to do that as well.

After loading the package, we can load the data which is called "urine".
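In R, these two steps look like the following (a minimal sketch; the boot package ships with R, so it usually only needs to be loaded):

```r
library("boot")   # load the boot package (install.packages("boot") if missing)
data("urine")     # load the urine data set from boot
dim(urine)        # 79 rows, 7 columns
```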

This data set contains seven characteristics of 79 urine specimens.

The goal of the analysis is to determine

if certain physical characteristics of the urine are related to the formation

of calcium oxalate crystals.

Our seven variables are, first, r, which is an indicator

of the presence of a calcium oxalate crystal, so the value of r will be zero or one,

and our explanatory variables or covariates are the specific gravity,

the pH reading (this is for acidity),

the osmolarity, conductivity,

the urea concentration, and finally, the calcium concentration.

For more information about these data,

see the help file for the data set (run ?urine in R).

Let's see what the data looks like.

We'll take the head of the urine data set.

We can see that calcium oxalate crystals were not observed in the first six specimens,

and we have values for the six other explanatory variables.
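In R, inspecting the first few rows looks like this:

```r
library("boot")
data("urine")
head(urine)   # first six specimens; r is 0 for all of them
```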

Notice that we have missing values in this data set.

So, before we conduct the analysis, let's remove those missing values.

'dat' is the data set that we're going to use.

We went from 79 rows down to 77 rows.
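A sketch of this step in R, using na.omit to drop any row containing a missing value:

```r
library("boot")
data("urine")
dat <- na.omit(urine)   # drop rows with any missing value
nrow(urine)             # 79 rows before
nrow(dat)               # 77 rows after
```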

Now, let's look at a pairwise scatterplot for each

of the seven variables using pairs on our data set.
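The plot itself can be produced like this (a sketch, assuming `dat` is the NA-free data set from above):

```r
library("boot")
data("urine")
dat <- na.omit(urine)
pairs(dat)   # pairwise scatterplots of all seven variables
```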

One thing that stands out

is that several of these variables are strongly correlated with one another.

For example, the specific gravity and the osmolarity

appear to have a very close linear relationship, as you can see in this scatterplot.
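One optional way to quantify what the scatterplot shows (this check is not part of the original walkthrough) is to compute the correlation matrix of the variables:

```r
library("boot")
data("urine")
dat <- na.omit(urine)
round(cor(dat), 2)   # e.g., gravity and osmo show a strong positive correlation
```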

Collinearity between x variables or explanatory variables in linear regression models

can cause trouble for statistical inference.

Two correlated variables will compete for the ability to predict the response variable,

leading to unstable estimation.

This is not a problem for prediction of the response.

So, if prediction is our only goal of the analysis, then it's not a problem.

But, if our objective is to discover how the variables relate to the response,

we should avoid collinearity.

Because the primary goal of this analysis

is to determine which variables are related to the presence

of calcium oxalate crystals,

we will have to deal with the collinearity between the predictors.

This problem of determining which variables relate to the response

is often called variable selection.

We have already seen one way to do this.

We could fit several models that include different sets of variables

and see which one of those models has the best deviance information criterion (DIC) value.

Another way to do this

is to use a linear model where the priors for the linear coefficients

favor values near zero.

Values near zero on the coefficients indicate weak relationships.

This way, the burden of establishing

an association between the variables lies with the data.

If there is not a strong signal, we assume that it does not exist

because the prior favors that scenario.
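One family of priors with this property is the double exponential (Laplace) distribution centered at zero, which places more mass near zero than a normal distribution. The sketch below is only an illustration of that shape; the specific prior used in the model comes later.

```r
# Double exponential (Laplace) density, centered at mu with rate tau.
ddexp <- function(x, mu = 0.0, tau = 1.0) {
  0.5 * tau * exp(-tau * abs(x - mu))
}

# Compare with a standard normal: the Laplace piles more mass near zero.
curve(ddexp(x), from = -5, to = 5, ylab = "density",
      main = "Double exponential vs. normal prior")
curve(dnorm(x), from = -5, to = 5, lty = 2, add = TRUE)
legend("topright", legend = c("double exponential", "normal"),
       lty = c(1, 2), bty = "n")
```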

Rather than tailoring a specific prior for each individual beta coefficient

based on the scale of its covariate, it is customary

to subtract the mean and divide by the standard deviation for each variable.

It's easy to do this in R.

We can create the design matrix X by scaling the data.

We want to omit the first column

because the first column contains our response variable, so we want the next six columns.

We want to scale those by centering them, so center=TRUE,

and we also want to scale them, so scale=TRUE, dividing each by its standard deviation.

This only makes sense to do for variables that are continuous.

If we have categorical variables, we should exclude them from this

centering and scaling exercise.

Let's run this line to create X.

Oops, I called it 'data' instead of 'dat'.

Let's rerun that,

and take a look at the first several values.
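Putting the whole step together in R (a sketch; `dat` is the NA-free data set from above):

```r
library("boot")
data("urine")
dat <- na.omit(urine)

# Design matrix: all columns except the first (the response r),
# centered and scaled so each covariate has mean 0 and sd 1.
X <- scale(dat[, -1], center = TRUE, scale = TRUE)
head(X)

colMeans(X)       # each approximately 0
apply(X, 2, sd)   # each equal to 1
```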