In this eBook we introduce a Stat-JR super-template for 2-level data that allows for missing values in explanatory or response variables, and can handle normal or categorical variables.
Stat-JR super-templates are so-called because they call other templates; the super-template we will use, 2LevelImpute, allows the user separately to specify the imputation and model of interest (MOI). The multiply-imputed datasets are then produced, and the MOI fitted to each (via a variety of other Stat-JR templates 2LevelImpute calls); these are combined using Rubin's rules, and the results for the final MOI fit are then returned (together with results from a complete case analysis, for comparison). If your computer has multiple processors, these will be used in parallel for the imputation models.
In this eBook, we run through the following:
It will be assumed that you have a knowledge of the basics of multiple imputation using a joint modelling approach. If not, you might like to look at the Missing data web site for an introduction and also references to papers and recent developments.
The template we will be using incorporates the existing REALCOM procedures, but provides a very much faster implementation.
Authors of templates: | Core team templates written by Chris Charlton and William Browne |
---|---|
Authors of eBook documentation: | Harvey Goldstein and Richard Parker |
Here we give a brief overview of the inputs Stat-JR requires in order to run the 2LevelImpute super-template:
Note that all variables are stored as vectors of the same length as those in the full data set. Where these are declared as level 2 variables the first one in each level 2 unit is chosen. In fact the template checks to determine whether such values are actually constant within each level 2 unit and the user will be notified if not.
In the following examples we will consider the tutorial dataset that has been used many times as an example of a 2-level educational dataset (although we treat it as a 1-level dataset in the first example). See the MLwiN manual, for example:
Rasbash, J., Steele, F., Browne, W.J. and Goldstein, H. (2012). A user's guide to MLwiN Version 2.27. Centre for Multilevel Modelling, University of Bristol.
The dataset consists of a sample of records of school achievement for 4059 pupils within 65 schools - some missing values have been randomly introduced. The dataset (saved as tutmiss) is summarised in Table 1, below.
COLUMN | N | MISSING | MIN | MAX | DESCRIPTION |
---|---|---|---|---|---|
school | 4059 | 0 |
1.00 |
65.00 |
Numeric school identifier |
student | 4059 | 0 |
1.00 |
198.00 |
Numeric student identifier |
normexam | 4059 | 0 |
-3.67 |
3.67 |
Students' exam score at age 16, normalised to have approximately a standard Normal distribution. |
cons | 4059 | 0 |
1.00 |
1.00 |
A column of ones. If included as an explanatory variable in a regression model, its coefficient is the intercept. |
standlrt | 4059 | 0 |
-2.93 |
3.02 |
Students' score at age 11 on the London Reading Test (LRT), standardised using Z-scores. |
girl | 4059 | 0 |
0.00 |
1.00 |
Students' gender: 0=boy; 1=girl |
schgend | 4059 | 0 |
1.00 |
3.00 |
School gender: 1=mixed; 2=boys' school; 3=girls' school |
avslrt | 4059 | 0 |
-0.76 |
0.64 |
Average LRT score in school |
schav | 4059 | 0 |
1.00 |
3.00 |
Average LRT score in school, coded into 3 categories: 1=bottom 25%; 2=middle 50%; 3=top 25% |
vrband | 4059 | 0 |
1.00 |
3.00 |
Students' score in test of verbal reasoning at age 11, coded into 3 categories: 1=top 25%; 2=middle 50%; 3=bottom 25% |
schgendmiss | 4059 | 439 |
0.00 |
2.00 |
|
avslrtmiss | 4059 | 431 |
-0.76 |
0.64 |
|
standlrtmiss | 4059 | 400 |
-2.93 |
3.02 |
|
girlmiss | 4059 | 435 |
0.00 |
1.00 |
If the inputs don't appear in the box, below, you need to submit all the inputs for the step above first...
If the inputs don't appear in the box, below, you need to submit all the inputs for the steps above first...
Multiple imputation theory strictly applies only where the set of (response plus explanatory) variables for which we wish to impute missing values have a joint multivariate normal distribution for the variables all treated as responses. Where we have categorical data this is clearly not the case and we therefore use a latent normal formulation (See: Goldstein, H., Carpenter, J., Kenward, M. and Levin, K. (2009). Multilevel models with multivariate mixed response types. Statistical Modelling, 9(3), 173-197). This works as follows:
If the inputs don't appear in the box, below, you need to submit all the inputs for the steps above first...
Soon after all imputed datasets have been returned (see counter below), the results will appear beneath...
Imputed dataset #1 returned...
Imputed dataset #2 returned...
Imputed dataset #3 returned...
Imputed dataset #4 returned...
Imputed dataset #5 returned...
Imputed dataset #6 returned...
Imputed dataset #7 returned...
Imputed dataset #8 returned...
Imputed dataset #9 returned...
Imputed dataset #10 returned...
Imputed dataset #11 returned...
Imputed dataset #12 returned...
Imputed dataset #13 returned...
Imputed dataset #14 returned...
Imputed dataset #15 returned...
Imputed dataset #16 returned...
Imputed dataset #17 returned...
Imputed dataset #18 returned...
Imputed dataset #19 returned...
Imputed dataset #20 returned...
That's all the imputed datasets returned!
Once Stat-JR has finished executing, a large range of different results are returned, including both tables and graphs; you can view all of these via the 'Resources' button at the top of this page, and they are also available via the drop-down list in the right-hand frame of the results page if running the template in the Stat-JR:TREE interface.
For example 'ResultsTable:Imputation' returns the estimates from the multiple imputation we've requested. Below (Table 2) we've selected the estimates from that output, and placed them side-by-side, in a combined table, with those from an analysis of complete cases only (available as 'ResultsTable:CompleteCases'; NB if the table below does not render properly, both these resources are available via the collapsible menus below it).
Since the data were set missing at random we do not expect the estimates to differ much, but note the increased standard errors for the complete case analysis where 30% of the records have at least one variable in the MOI missing and have been deleted (due to rounding this is more apparent in the tables accessible via the collapsible menus below).
Parameter |
Variable name |
Estimates from Multiple Imputation |
Estimates from analysis of |
||
---|---|---|---|---|---|
\(\beta_0\) | () |
() |
|||
\(\beta_1\) | () |
() |
|||
\(\beta_2\) | () |
() |
|||
\(\beta_3\) | () |
() |
|||
\(\beta_4\) | () |
() |
|||
\(\beta_5\) | () |
() |
|||
\(\beta_6\) | () |
() |
|||
\(\beta_7\) | () |
() |
|||
\(\beta_8\) | () |
() |
|||
\(\beta_9\) | () |
() |
|||
\(\sigma^2\) | () |
() |
|||
\(\sigma^2_u\) | () |
() |
|||
Mean* DIC | |||||
Mean* pD |
* 'Mean' refers to multiple imputation only
You may like to try different variables, numbers of iterations etc. by re-running this ebook. See Stat-JR web site for further details about Stat-JR including instructions on how to use the templates using Stat-JR:TREE, and also the final page of this eBook.
We can also look at various diagnostics for either the imputation model or the MOI. We can obtain these by clicking on the appropriate 'svg' files in the drop down box, so let's look at one of the imputation traces 'Combined_sigma2.svg' which is the level 1 variance chains for the model of interest. Here there is one chain for each completed dataset and each of the chains is displayed in a different colour.
If the inputs don't appear in the box, below, you need to submit all the inputs for the step above first...
If the inputs don't appear in the box, below, you need to submit all the inputs for the steps above first...
If the inputs don't appear in the box, below, you need to submit all the inputs for the steps above first...
Multiple imputation theory strictly applies only where the set of (response plus explanatory) variables for which we wish to impute missing values have a joint multivariate normal distribution for the variables all treated as responses. Where we have categorical data this is clearly not the case and we therefore use a latent normal formulation (See: Goldstein, H., Carpenter, J., Kenward, M. and Levin, K. (2009). Multilevel models with multivariate mixed response types. Statistical Modelling, 9(3), 173-197). This works as follows:
If the inputs don't appear in the box, below, you need to submit all the inputs for the steps above first...
Soon after all imputed datasets have been returned (see counter below), the results will appear beneath...
Imputed dataset #1 returned...
Imputed dataset #2 returned...
Imputed dataset #3 returned...
Imputed dataset #4 returned...
Imputed dataset #5 returned...
Imputed dataset #6 returned...
Imputed dataset #7 returned...
Imputed dataset #8 returned...
Imputed dataset #9 returned...
Imputed dataset #10 returned...
Imputed dataset #11 returned...
Imputed dataset #12 returned...
Imputed dataset #13 returned...
Imputed dataset #14 returned...
Imputed dataset #15 returned...
Imputed dataset #16 returned...
Imputed dataset #17 returned...
Imputed dataset #18 returned...
Imputed dataset #19 returned...
Imputed dataset #20 returned...
That's all the imputed datasets returned!
Once Stat-JR has finished executing, a large range of different results are returned, including both tables and graphs; you can view all of these via the 'Resources' button at the top of this page, and they are also available via the drop-down list in the right-hand frame of the results page if running the template in the Stat-JR:TREE interface.
For example 'ResultsTable:Imputation' returns the estimates from the multiple imputation we've requested. Below (Table 2) we've selected the estimates from that output, and placed them side-by-side, in a combined table, with those from an analysis of complete cases only (available as 'ResultsTable:CompleteCases'; NB if the table below does not render properly, both these resources are available via the collapsible menus below it).
Since the data were set missing at random we do not expect the estimates to differ much, but note the standard errors tend to be increased for the complete case analysis where 30% of the records have at least one variable in the MOI missing and have been deleted.
Parameter |
Variable name |
Estimates from Multiple Imputation |
Estimates from analysis of |
||
---|---|---|---|---|---|
\(\beta_0\) | () |
() |
|||
\(\beta_1\) | () |
() |
|||
\(\beta_2\) | () |
() |
|||
\(\beta_3\) | () |
() |
|||
\(\beta_4\) | () |
() |
|||
\(\beta_5\) | () |
() |
|||
\(\beta_6\) | () |
() |
|||
\(\beta_7\) | () |
() |
|||
\(\beta_8\) | () |
() |
|||
\(\beta_9\) | () |
() |
|||
\(\sigma^2\) | () |
() |
|||
\(\sigma^2_u\) | () |
() |
|||
Mean* DIC | |||||
Mean* pD |
* 'Mean' refers to multiple imputation only
You may like to try different variables, numbers of iterations etc. by re-running this ebook. See Stat-JR web site for further details about Stat-JR including instructions on how to use the templates using Stat-JR:TREE, and also the final page of this eBook.
We can also look at various diagnostics for either the imputation model or the MOI. We can obtain these by clicking on the appropriate 'svg' files in the drop down box, so let's look at one of the imputation traces 'Combined_sigma2.svg' which is the level 1 variance chains for the model of interest. As you will see the mixing seems fine. There is one chain for each completed dataset and each of the chains is displayed in a different colour.
If you wish to try your own dataset, then you can easily do so via Stat-JR's Template Reading and Execution Environment (TREE) interface (formerly 'webtest'; you can run TREE alongside the DEEP interface you're reading now).
Once you've opened TREE (for further details see A Beginner's Guide to Stat-JR's TREE interface), and pressed Begin you'll see the black bar at the top of the TREE interface includes Dataset and Template options. If you select Choose from each of these, you will be able to change the dataset and template, respectively, away from their defaults.
Below we have a screenshot from TREE, in which we have selected the template 2LevelImpute and the dataset tutmiss (as used in this eBook), and have started to specify our inputs.
In the screenshot below, we have specified all the inputs, and have pressed the Next button a final time, after which the Run button appears, along with some other outputs in the pane at the bottom of the browser window (not shown here).
If you then press the Run button, the template will then run (the progress gauge in the black bar will change from Ready to Working to indicate it is busy). Once it has finished, and all the results have been returned, you can view the outputted files in the pane towards the bottom of the browser window (see below for details of the various outputs returned). Use the drop-down list above the pane to choose particular outputs to view: for example, to view the imputed datasets, choose "Imputation_Model_impute_datafile_chain0_iter1000", "Imputation_Model_impute_datafile_chain1_iter1000", etc. Note that the nomenclature here will depend on both the inputs you have chosen, and the number of processors on your machine. For example, if I chose Number of imputed data sets: 5, Number of iterations before first imputation: 1000 and Number of iterations between subsequent imputations: 500, on my machine with four processors, then I would see "...chain0_iter1000", "...chain1_iter1000", "...chain2_iter1000", "...chain3_iter1000", "...chain0_iter1500", "...chain1_iter1500", "...chain2_iter1500", and "...chain3_iter1500". Here it has ran a chain / thread on each of the four available processors, and derived a dataset from each after the 1000th iteration (as stated earlier, this interval includes the burnin). However, since this number is less than the requested 5 imputed datasets I asked for, it has also constructed four more datasets following a further 500 iterations (since I specified Number of iterations between subsequent imputations: 500). However, it will only use the first of these ("...chain0_iter1500") to make up the five datasets it needs.
The imputed datasets are available in the list of datasets accessible via Dataset > Choose in the black bar at the top. You can download these (as Stata format *.dta files) by first selecting the relevant dataset from the list, pressing the neighbouring Use button, and then downloading via Dataset > Download (again via the black bar at the top; note that both Imputation_Model_impute_datafile_chain0_iter1000, etc. and impute_datafile_chain0_iter1000, etc. will appear in this list, but there's no need to download both (they're the same, but simply saved twice, with and without the Imputation_Model prefix; also, don't confuse these with the level 2 datasets (e.g. impute__L2Data_chain0_iter1000 or Imputation_Model_impute__L2Data_chain0_iter1500).
Alternatively, one can press the green Download button to download all these outputted files (now supported for the 2LevelImpute template from Stat-JR version 1.0.2). Note that the .dta extension will have to be manually-added.