Ideally, when designing a study,

you would like to know all the potential confounding factors

and plan how to control for them in advance,

but some confounders may only be identified as such once the data are analyzed.

In this lecture, we will look at the two main strategies to control for confounding at

the data analysis stage: stratification and multivariable regression models.

So, let's say we have conducted a study where the exposure is smoking,

and the outcome is chronic lung disease.

We suspect that age is a confounder in this association.

What can we do at the data analysis stage?

One option would be stratification.

The first step is to stratify our data by age group,

and obtain an estimate for the association between

smoking and chronic lung disease in each stratum.

This means that we calculate an odds ratio, for example,

for people 20-29 years old,

then the odds ratio for those 30-39 years old, and so on.
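This first step can be sketched in a few lines of Python. The counts below are hypothetical, purely for illustration: each 2x2 table cross-tabulates smoking against chronic lung disease within one age stratum.

```python
# Hypothetical 2x2 table for the 20-29 age stratum:
#                 disease   no disease
#   smokers          10         195
#   non-smokers      10         390

def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table: a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases."""
    return (a * d) / (b * c)

print(odds_ratio(10, 195, 10, 390))    # 20-29 years -> 2.0
print(odds_ratio(100, 200, 20, 80))    # 30-39 years -> 2.0
```

In this made-up example the stratum-specific odds ratios happen to be identical; in real data they would typically differ somewhat from stratum to stratum.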

In the second step of the process,

we calculate a weighted average of the stratum-specific odds ratios.

This weighted estimate is called the Mantel-Haenszel adjusted odds ratio,

and it is essentially the result of

our study after controlling for confounding by age.
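The second step, the weighted average, uses the standard Mantel-Haenszel formula: sum the quantities a·d/n across strata, sum b·c/n across strata, and divide. A minimal sketch, with the same hypothetical stratum counts as above:

```python
def mh_odds_ratio(strata):
    """Mantel-Haenszel adjusted odds ratio from a list of per-stratum
    2x2 tables given as (a, b, c, d) tuples, where n = a + b + c + d."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical counts (a, b, c, d) for each age stratum
strata = [(10, 195, 10, 390),   # 20-29 years
          (100, 200, 20, 80)]   # 30-39 years

print(mh_odds_ratio(strata))    # ~ 2.0, the age-adjusted odds ratio
```

Note how the weighting happens implicitly: larger strata contribute larger a·d/n and b·c/n terms, so they pull the adjusted estimate toward their own stratum-specific odds ratio.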

This method allows us to get a sense of what is happening within the different strata,

but it becomes really burdensome if you try to control for multiple confounders,

and it doesn't really work for confounding variables which are continuous.

A second option, which is what the majority of researchers do nowadays,

is statistical adjustment using regression models.

In our example, we can estimate the association between

smoking and chronic lung disease by fitting a logistic regression model,

where the exposure is the independent variable,

and the outcome is the dependent variable.

If smoking is the only independent variable we include in the model,

we will calculate an unadjusted odds ratio.

If we wish to control for confounding by age,

we simply need to add it as an additional independent variable in the regression model,

and we can easily calculate an odds ratio that is adjusted for age.

Modern statistical software packages have made this process very easy and quick.
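To make the regression approach concrete, here is a minimal sketch that fits a binomial logistic regression by Newton-Raphson (iteratively reweighted least squares) using only NumPy. The grouped counts are hypothetical and deliberately constructed so that age (coded as a binary young/old indicator) confounds the smoking-disease association; in practice you would use a statistics package rather than hand-rolling the fit.

```python
import numpy as np

# Grouped data: one row per covariate pattern.
# Columns of X: intercept, smoking (0/1), older age group (0/1).
# Counts are hypothetical, chosen so that age confounds the association.
X = np.array([[1.0, 0.0, 0.0],    # young non-smokers
              [1.0, 1.0, 0.0],    # young smokers
              [1.0, 0.0, 1.0],    # old non-smokers
              [1.0, 1.0, 1.0]])   # old smokers
cases  = np.array([10.0, 10.0, 20.0, 100.0])   # chronic lung disease
totals = np.array([400.0, 205.0, 100.0, 300.0])

def fit_logistic(X, cases, totals, iters=25):
    """Binomial logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # fitted probabilities
        w = totals * p * (1.0 - p)              # IRLS weights
        grad = X.T @ (cases - totals * p)       # score vector
        hess = X.T @ (X * w[:, None])           # information matrix
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Unadjusted model: smoking only
b_crude = fit_logistic(X[:, :2], cases, totals)
# Adjusted model: smoking plus age
b_adj = fit_logistic(X, cases, totals)

print(np.exp(b_crude[1]))   # crude OR, about 4.36
print(np.exp(b_adj[1]))     # age-adjusted OR, about 2.0
```

Exponentiating the smoking coefficient turns it into an odds ratio. In this constructed example the crude estimate is more than twice the adjusted one, illustrating how ignoring a confounder can exaggerate an association; adding `age` as a second column is all the "adjustment" amounts to.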

The great advantage of multivariable regression is

that we can control for multiple confounding factors at the same time,

although including too many variables can sometimes cause problems.

You are now familiar with a range of methods to control for confounding,

both at the design and the data analysis stage.

The choice of the most appropriate method may not always be straightforward,

but it is worth discussing all options, and

their strengths and weaknesses, with your research team.