This eBook is based on a section of a short course on concepts and methods in causal inference run by the PATHWAYS node. The original practical, on mediation, asked participants to fit models using Stata themselves, whereas this eBook fits models in Stata on the user's behalf. You therefore have to have access to Stata, and to tell Stat-JR where it is: see the third-party software page on the Stat-JR site for more information on how to specify the path to Stata and other software packages.
Body mass index is known to affect the level of gamma-glutamyl transpeptidase, but is also potentially affected by alcohol consumption. Socio-economic position is potentially a common cause of alcohol consumption, body mass index and systolic blood pressure.
For simplicity let's assume that the following diagram represents the relationships between the variables (where BMI = body mass index, GGT = gamma-glutamyl transpeptidase, SEP = socio-economic position, and SBP = systolic blood pressure):
We will use a simulated dataset, inspired by the population-based controls from a case-control study (the Izhevsk Family Study), which was carried out to assess the effects of hazardous alcohol-drinking on premature mortality in men living in Izhevsk, Russia.
Here are the variables:
Please wait a moment whilst the dataset summary loads...
First, we will perform a traditional mediation analysis, as described, for example, in the PATHWAYS workshop 'Current and new approaches to estimating causal pathways from observational data'.
Below you'll see some quiz questions...
* NB: up to six model-fitting opportunities are available, but note you don't have to use all of these - fitting four should be sufficient to answer the questions.
The quiz questions will appear here once they've loaded...
Once you have submitted your choices, it'll indicate here how many you got correct...
Once the first model has been run, the compiled table of coefficient estimates and standard errors will appear here. As you then run each subsequent model (up to six), the estimates from those will be added too...
Parameter | MODEL 1, response: | MODEL 2, response: | MODEL 3, response: | MODEL 4, response: | MODEL 5, response: | MODEL 6, response: | ||||||
cons | () | () | () | () | () | () | ||||||
sep_2 | () | () | () | () | () | () | ||||||
sep_3 | () | () | () | () | () | () | ||||||
alc | () | () | () | () | () | () | ||||||
bmi | () | () | () | () | () | () | ||||||
log_ggt | () | () | () | () | () | () | ||||||
sbp | () | () | () | () | () | () | ||||||
log_ggt_alc | () | () | () | () | () | () |
The list of explanatory variables will appear here; you can then choose which of these you want to include in your model...
Using the box below you can choose which response variable and explanatory variables to include in your linear regression models...
...once you have made your choices, and pressed Submit, a do-file will be created; this will be run in Stata, and the results will be compiled in a table, above, as well as in more detailed results tables returned below.
Remember, you have up to six model-fitting opportunities.
Model 1: 6 model-fitting opportunities remaining...
Model 2: 5 model-fitting opportunities remaining...
Model 3: 4 model-fitting opportunities remaining...
Model 4: 3 model-fitting opportunities remaining...
Model 5: 2 model-fitting opportunities remaining...
Model 6: this is the last model-fitting opportunity...
The specific results table for Model 1 will appear here.
Parameter | Coefficient | Standard Error | 95% C.I. lower | 95% C.I. upper | t | p>|t| |
Again, once you have made your choices, the specific results table for Model 2 will appear here.
Parameter | Coefficient | Standard Error | 95% C.I. lower | 95% C.I. upper | t | p>|t| |
Again, once you have made your choices, the specific results table for Model 3 will appear here.
Parameter | Coefficient | Standard Error | 95% C.I. lower | 95% C.I. upper | t | p>|t| |
Again, once you have made your choices, the specific results table for Model 4 will appear here.
Parameter | Coefficient | Standard Error | 95% C.I. lower | 95% C.I. upper | t | p>|t| |
Again, once you have made your choices, the specific results table for Model 5 will appear here.
Parameter | Coefficient | Standard Error | 95% C.I. lower | 95% C.I. upper | t | p>|t| |
Again, once you have made your choices, the specific results table for Model 6 will appear here.
Parameter | Coefficient | Standard Error | 95% C.I. lower | 95% C.I. upper | t | p>|t| |
Once you have submitted your answers on the previous page, you'll find out here which questions you got correct / incorrect, and why...
We can be more flexible in our modelling approach using the g-computation formula (Robins, 1986), recently implemented in a Stata routine called gformula. This is what we will do next, comparing our inferences with those from the traditional approach explored on the previous page.
The gformula command is "an implementation of the g-computation procedure...used to estimate the causal effect of time-varying exposures on an outcome in the presence of time-varying confounders that are themselves also affected by the exposures. The procedure also addresses the related problem of estimating direct and indirect effects when the causal effect of the exposures on an outcome is mediated by intermediate variables, and in particular when confounders of the mediator-outcome relationships are themselves affected by the exposures" (Daniel et al, 2011).
If you wish to use it yourself, you can download it within any version of Stata, provided that you have an internet connection, by typing:
. ssc install gformula
We'll fit our first model using the gformula command on the next page...
Daniel, R.M., De Stavola, B.L. and Cousens, S.N. (2011) gformula: Estimating causal effects in the presence of time-varying confounding or mediation using the g-computation formula. The Stata Journal, 11(4), 479-517.
Robins, J (1986) A new approach to causal inference in mortality studies with a sustained exposure period - applicaiton to control of the healthy worker survivor effect. Mathmematical Modelling, 7, 1393-1512.
. gformula sep alc bmi log_ggt sbp, mediation out(Q1) ///
eq(bmi:i.sep alc, Q2, sbp:bmi alc log_ggt i.sep) ///
com(bmi:regress, log_ggt:regress, sbp:regress) ex(Q3) mediator(Q4) ///
control(log_ggt:3) baseline(alc:0) post_confs(bmi) base_confs(sep) sam(50)
What's missing? Enter your answers here...
The quiz questions will appear here once they've loaded...
Once you have submitted your choices, it'll indicate here how many you got correct...
How did you get on? Here's the complete command:
. gformula sep alc bmi log_ggt sbp, mediation out(sbp) ///
eq(bmi:i.sep alc, log_ggt:bmi alc, sbp:bmi alc log_ggt i.sep) ///
com(bmi:regress, log_ggt:regress, sbp:regress) ex(alc) mediator(log_ggt) ///
control(log_ggt:3) baseline(alc:0) post_confs(bmi) base_confs(sep) sam(50)
This command will be automatically run, in Stata, once you indicate you want it to do so, beneath.
The results will then be returned, below, but note this will take some time...
Once the model has run (this will take some time), the results will appear here...
G-computation estimate | Bootstrap Std. Err. | |
TCE | ||
NDE | ||
NIE | ||
CDE |
Note, due to the Monte Carlo nature of the procedure, you will obtain slightly different results each time you fit this model.
Once you have submitted your choice, it'll indicate here whether you were correct or not...
Make sure you understand the comparison being made here; we are comparing the population under different hypothetical conditions.
To test your understanding, choose the correct 'hypothetical populations', from the list that follows, to answer the question beneath:
Once you have submitted your choice, it'll indicate here whether you were correct or not...
Now we'll repeat the gformula analysis, allowing for exposure-mediator interaction as follows:
. gformula sep alc bmi log_ggt sbp log_ggt_alc, mediation out(sbp) ///
eq(bmi:i.sep alc, log_ggt:bmi alc, sbp:bmi alc log_ggt log_ggt_alc i.sep) ///
com(bmi:regress, log_ggt:regress, sbp:regress) ex(alc) mediator(log_ggt) ///
control(log_ggt:3) baseline(alc:0) post_confs(bmi) base_confs(sep) derived(log_ggt_alc) ///
derrules(log_ggt_alc:log_ggt*alc) sam(50)
Once the model has run (again, this will take some time!), the results will appear here...
G-computation estimate | Bootstrap Std. Err. | |
TCE | ||
NDE | ||
NIE | ||
CDE |
Once you have submitted your choice, it'll indicate here whether you were correct or not...
Turn to the next page to see our concluding remarks...
Let's now compare the results from the analyses we ran in gformula with those obtained from the traditional approach we adopted earlier in the eBook.
First of all, note that the results from the traditional analyses are not easily comparable with the ones from gformula since they relate to a unit change in alc and so the hypothetical worlds being compared are different. Nevertheless, we see that indirect effects implied by the standard approaches are rather larger than was found using the g-formula. This can be explained by the uncontrolled confounding due to bmi. Higher alcohol intake leads to higher BMI, but since high alcohol and high BMI both lead to higher GGT, controlling for log_GGT reduces the association between alcohol and BMI, and so the direct effect of alcohol appears reduced.
Controlling for BMI in fact makes things worse in this example. The direct effect of alcohol on SBP no longer includes the part of this effect which acts through BMI.
There is also apparently an important exposure--mediator interaction which cannot be taken into account in the traditional analysis.
So, all things considered, we would prefer the analysis using the gformula command which allows for intermediate confounding and exposure-mediator interaction.
Given that alcohol intake is notoriously poorly estimated, if there is measurement error, then GGT could in fact be a better proxy of true alcohol intake than the alcohol variable itself. Then, even if there were no mediation (suppose GGT has no causal effect on SBP), measurement error may still lead us to believe that an indirect effect (through GGT) exists.