Introduction

In this eBook we introduce a Stat-JR super-template for 2-level data that allows for missing values in explanatory or response variables, and can handle normal or categorical variables.

Stat-JR super-templates are so-called because they call other templates. The super-template we will use, 2LevelImpute, allows the user to specify the imputation model and the model of interest (MOI) separately. The multiply-imputed datasets are then produced and the MOI is fitted to each (via a variety of other Stat-JR templates that 2LevelImpute calls); the results are combined using Rubin's rules, and the final MOI fit is returned (together with results from a complete case analysis, for comparison). If your computer has multiple processors, these will be used in parallel for the imputation models.
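The combination step via Rubin's rules can be sketched as follows (a minimal illustration with hypothetical numbers, not output from this eBook): the pooled point estimate is the mean of the per-dataset estimates, and its variance adds the within-imputation variance to an inflated between-imputation variance.

```python
import math

def rubin_combine(estimates, variances):
    """Combine point estimates and squared standard errors from m
    imputed-dataset fits using Rubin's rules."""
    m = len(estimates)
    qbar = sum(estimates) / m                     # pooled point estimate
    w = sum(variances) / m                        # within-imputation variance
    b = sum((q - qbar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = w + (1 + 1 / m) * b                       # total variance
    return qbar, math.sqrt(t)                     # estimate and its standard error

# Hypothetical slope estimates from m = 4 imputed-data fits, each with SE 0.012
est, se = rubin_combine([0.56, 0.58, 0.55, 0.57], [0.012 ** 2] * 4)
```

Note how the pooled standard error exceeds the per-fit standard error: the between-imputation spread reflects the uncertainty due to the missing values.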

In this eBook, we run through the following:

  1. Overview of inputs for the 2LevelImpute super-template
  2. The example dataset
  3. An example fitting a 1-level model
  4. An example fitting a 2-level model
  5. Running the 2LevelImpute super-template in Stat-JR's TREE interface

We assume that you have a knowledge of the basics of multiple imputation using a joint modelling approach. If not, you might like to look at the Missing data web site for an introduction, and also for references to papers and recent developments.

The template we will be using incorporates the existing REALCOM procedures, but provides a very much faster implementation.


Authors of templates: Core team templates written by Chris Charlton and William Browne
Authors of eBook documentation: Harvey Goldstein and Richard Parker

Overview of inputs

Here we give a brief overview of the inputs Stat-JR requires in order to run the 2LevelImpute super-template:

Multilevel or not?
You are first asked whether the model of interest (MOI) and/or the imputation model has two levels (or just one). In general you would want to fit the same number of levels in your imputation model as in the MOI, but there may be situations where, for simplicity, the MOI is a single level model (for example, where the level 2 random effects are small) while you still wish to fit a 2-level imputation model. In such a case you will still need to specify the level 2 ID.

About your MOI
You are then asked a few questions about the structure of your MOI, including whether it is a 2-level model or not (see note, above), the distribution you would like to use to model the response variable, whether you would like to fit a random slope (or coefficient) model or not (if applicable), and your response and explanatory variables.

About your imputation model
After that, you are prompted for the level 1 variables to be used as responses in the imputation model, and to specify their distribution and explanatory variables. Then, if applicable, you are asked about the variables at level 2 in the imputation model as well.
Typically these response variables are any that have missing values. Those variables without missing values, if they are to be used in the MOI, can be declared as either response or explanatory variables in the imputation model. In addition you may wish to include 'auxiliary' variables (not in the MOI) in the imputation model if these are associated with the propensity to be missing.
You may have a different set of explanatory variables for each response. This may be useful where you wish to have auxiliary variables relevant to certain responses only.

Estimation options
Finally, you are asked various questions about the estimation procedures, including the number of imputed datasets to use, the interval between iterations at which to impute a complete dataset (to ensure approximate independence) and, for the MOI, the burn-in and number of iterations.

Note that all variables are stored as vectors of the same length as those in the full dataset. Where variables are declared at level 2, the first value in each level 2 unit is used. In fact, the template checks whether such values are actually constant within each level 2 unit, and notifies the user if not.
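The constancy check described above can be sketched as follows (an illustration of the logic only, not the template's actual code):

```python
def check_level2_constant(values, level2_ids):
    """Return the level 2 IDs whose supposedly level-2 variable is not
    constant within the unit; the first value per unit would be the one used."""
    seen = {}
    bad = set()
    for v, g in zip(values, level2_ids):
        if g not in seen:
            seen[g] = v          # remember the first value in each level 2 unit
        elif seen[g] != v:
            bad.add(g)           # value varies within this unit, so flag it
    return sorted(bad)

# Unit 2 has two different values for the "level 2" variable, so it is flagged
flagged = check_level2_constant([1.0, 1.0, 0.5, 0.6], [1, 1, 2, 2])
```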

Example dataset

In the following examples we will consider the tutorial dataset that has been used many times as an example of a 2-level educational dataset (although we treat it as a 1-level dataset in the first example). See the MLwiN manual, for example:
Rasbash, J., Steele, F., Browne, W.J. and Goldstein, H. (2012). A user's guide to MLwiN Version 2.27. Centre for Multilevel Modelling, University of Bristol.

The dataset consists of a sample of records of school achievement for 4059 pupils within 65 schools - some missing values have been randomly introduced. The dataset (saved as tutmiss) is summarised in Table 1, below.


Table 1: Summary of the example dataset

Column        N     Missing  Min     Max     Description
school        4059  0        1.00    65.00   Numeric school identifier
student       4059  0        1.00    198.00  Numeric student identifier
normexam      4059  0        -3.67   3.67    Students' exam score at age 16, normalised to have approximately a standard Normal distribution
cons          4059  0        1.00    1.00    A column of ones; if included as an explanatory variable in a regression model, its coefficient is the intercept
standlrt      4059  0        -2.93   3.02    Students' score at age 11 on the London Reading Test (LRT), standardised using Z-scores
girl          4059  0        0.00    1.00    Students' gender: 0=boy; 1=girl
schgend       4059  0        1.00    3.00    School gender: 1=mixed; 2=boys' school; 3=girls' school
avslrt        4059  0        -0.76   0.64    Average LRT score in school
schav         4059  0        1.00    3.00    Average LRT score in school, coded into 3 categories: 1=bottom 25%; 2=middle 50%; 3=top 25%
vrband        4059  0        1.00    3.00    Students' score in test of verbal reasoning at age 11, coded into 3 categories: 1=top 25%; 2=middle 50%; 3=bottom 25%
schgendmiss   4059  439      0.00    2.00    As schgend, recoded (0=mixed; 1=boys' school; 2=girls' school), with randomly introduced missing values
avslrtmiss    4059  431      -0.76   0.64    As avslrt, with randomly introduced missing values
standlrtmiss  4059  400      -2.93   3.02    As standlrt, with randomly introduced missing values
girlmiss      4059  435      0.00    1.00    As girl, with randomly introduced missing values
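A summary along the lines of Table 1 can be computed as in the following sketch (the toy values below are illustrative, not taken from tutmiss):

```python
# Toy data with None marking missing values (illustrative, not the real tutmiss)
data = {
    "normexam":     [0.26, -1.67, None, 1.24],
    "standlrtmiss": [0.62, None, None, 0.91],
}

def summarise(columns):
    """N, number missing, min and max for each column, as in Table 1."""
    rows = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        rows[name] = {
            "N": len(values),
            "missing": len(values) - len(observed),
            "min": min(observed),
            "max": max(observed),
        }
    return rows

summary = summarise(data)
```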

An example fitting a 1-level model

  • First you will be asked if either the MOI or imputation model has two levels; in this first example we're going to treat the data as if it is 1-level, so we'll answer No, they are both 1-level.

  • Next, you will be asked what distribution you would like to use to model the response variable in your MOI. Since our response variable, normexam, is normalised, we will select Normal in this example.

  • The 2LevelImpute super-template takes this input, and uses it to choose the best Stat-JR template to fit your MOI. If you make the choices suggested above, it will choose a template called 1LevelMod, which fits 1-level models for Normal, binomial and Poisson responses.
  • You will then be asked for the response and explanatory variables for the MOI. In this example we will fit:

    • normexam as the response;

    • cons, standlrtmiss and girlmiss as explanatory variables (as you click on these you'll see that they appear in the box beneath; to deselect any chosen in error, simply click on the variable name in the lower box). If any of our variables are categorical, we can indicate this using the checkboxes which appear below the input box when you select your explanatory variables; here, though, we can leave these unchecked (girlmiss is a binary 0/1 variable, so will not be treated any differently if we specify it as categorical).

  • Once you have specified the MOI, you will then be asked to enter the responses for the imputation model.

    Note that all explanatory variables must have no missing values - if any variable does, then include it as a response.
  • Given our MOI, we suggest the variables normexam, standlrtmiss and girlmiss for level 1; you'll then need to specify their distributions, here as Normal, Normal, and Binary, respectively.

  • You will also be asked to enter the explanatory (predictor) variables for each of these (if you scroll down this page, you'll see all the boxes), as follows:

    • for normexam just use an intercept cons;

    • for standlrtmiss use cons and also vrband (making sure that you tick the box that asks whether vrband should be treated as categorical - this will then use appropriate dummy variables where the final category is taken as the reference);

    • for girlmiss use cons.

    Since we have selected values to be missing purely at random, the use of a second predictor for standlrtmiss is not necessary, but you can include it to demonstrate that, as an auxiliary variable (not in the MOI), it can be used to help ensure missingness at random (MAR).


Multiple imputation theory strictly applies only where the set of variables (response plus explanatory) for which we wish to impute missing values, all treated as responses, has a joint multivariate normal distribution. Where we have categorical data this is clearly not the case, and we therefore use a latent normal formulation (see: Goldstein, H., Carpenter, J., Kenward, M. and Levin, K. (2009). Multilevel models with multivariate mixed response types. Statistical Modelling, 9(3), 173-197). This works as follows:

  • For a binary response we utilise a probit model where we assume that the (0,1) response is derived from an underlying standard normal distribution with a mean value (determined by whatever explanatory variable predictors are in the imputation model for this response) that acts as a threshold - values above which are observed as a '1' and below which as a '0'. The MCMC algorithm incorporates a step that takes a random draw from the corresponding part of the standard normal distribution according to whether a '1' or '0' is observed, or imputes randomly if a value is missing.

  • For an ordered categorical response a similar procedure is used with additionally a set of thresholds defined on the standard normal scale that delineate the ordered categories.

  • For an unordered categorical variable with p categories a (p-1) dimensional multivariate normal distribution is generated.

  • For each of these the underlying normal distributions condition on the other responses (and explanatory variables) so that a joint multivariate normal is finally generated.
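The probit step for a binary response can be sketched as follows (a simplified single-variable illustration using rejection sampling from the normal distribution; the real MCMC step uses an efficient truncated-normal sampler and conditions on the other responses):

```python
import random

def draw_latent(observed, mean):
    """Draw the latent normal value behind a binary (0/1) response.
    If observed is 1, we need a draw above the threshold (0 on the
    latent scale, with the linear predictor as the mean); if 0, below;
    if missing (None), an unconstrained draw, which imputes the value."""
    while True:
        z = random.gauss(mean, 1.0)
        if observed is None:
            return z                 # missing: impute from the full normal
        if observed == 1 and z > 0:
            return z                 # consistent with an observed '1'
        if observed == 0 and z <= 0:
            return z                 # consistent with an observed '0'

random.seed(1)
z1 = draw_latent(1, 0.3)     # latent value behind an observed '1'
z0 = draw_latent(0, 0.3)     # latent value behind an observed '0'
zm = draw_latent(None, 0.3)  # imputed latent value for a missing response
```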
  • Next, you will be asked if you want to use the conditional algorithm or not; we suggest answering Yes, since the conditional algorithm is faster.

  • You will then be asked to specify the number of imputed datasets to use, the interval before the first imputation (this includes any burn-in period), the interval before any subsequent imputations (to ensure approximate independence) and, for the MOI, the burn-in and number of iterations. In the interests of time to completion we suggest some small numbers here (e.g. 4, 1000, 50, 100, 250, respectively), but you will generally need considerably longer runs than this, and a greater interval between iterations!

  • After clicking on the last Submit button (you'll also have to click Run if using the template via the Stat-JR:TREE interface), a counter at the top left will indicate that the software is running and tell you when results are available, plus on the next page you'll see a counter indicating when each imputed dataset has been returned...

    If your computer has more than one core processor, the imputation models and associated MOIs will in fact be run with parallel chains.
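With the suggested settings (4 datasets, first imputation at iteration 1000, then every 50 iterations), the iterations at which complete datasets would be imputed on a single chain can be sketched as follows (an illustration of the scheduling logic only, not Stat-JR's actual code):

```python
def imputation_iterations(n_datasets, first, between):
    """MCMC iteration numbers at which complete datasets are imputed:
    the first after `first` iterations (which includes the burn-in),
    then every `between` iterations thereafter."""
    return [first + i * between for i in range(n_datasets)]

# Suggested settings: 4 datasets, first at iteration 1000, then every 50
schedule = imputation_iterations(4, 1000, 50)   # [1000, 1050, 1100, 1150]
```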


Inspecting the results

Soon after all imputed datasets have been returned, the results will appear beneath; a counter indicates each imputed dataset as it is returned.


Once Stat-JR has finished executing, a large range of different results are returned, including both tables and graphs; you can view all of these via the 'Resources' button at the top of this page, and they are also available via the drop-down list in the right-hand frame of the results page if running the template in the Stat-JR:TREE interface.

For example 'ResultsTable:Imputation' returns the estimates from the multiple imputation we've requested. Below (Table 2) we've selected the estimates from that output, and placed them side-by-side, in a combined table, with those from an analysis of complete cases only (available as 'ResultsTable:CompleteCases'; NB if the table below does not render properly, both these resources are available via the collapsible menus below it).

Since the data were set missing at random we do not expect the estimates to differ much, but note the increased standard errors for the complete case analysis, in which 30% of the records have at least one MOI variable missing and have therefore been deleted (due to rounding this is more apparent in the tables accessible via the collapsible menus below).

Table 2: Estimates from multiple imputation and complete case analysis

Parameter       Variable name   Estimates from Multiple Imputation   Estimates from analysis of
                                (combined as per Rubin's rules)      Complete Cases only
\(\beta_0\)                     ()                                   ()
\(\beta_1\)                     ()                                   ()
\(\beta_2\)                     ()                                   ()
\(\beta_3\)                     ()                                   ()
\(\beta_4\)                     ()                                   ()
\(\beta_5\)                     ()                                   ()
\(\beta_6\)                     ()                                   ()
\(\beta_7\)                     ()                                   ()
\(\beta_8\)                     ()                                   ()
\(\beta_9\)                     ()                                   ()
\(\sigma^2\)                    ()                                   ()
\(\sigma^2_u\)                  ()                                   ()
Mean* DIC
Mean* pD

* 'Mean' refers to multiple imputation only

Separate tables for imputation and complete case analysis


You may like to try different variables, numbers of iterations, etc., by re-running this eBook. See the Stat-JR web site for further details about Stat-JR, including instructions on how to use the templates via Stat-JR:TREE, and also the final page of this eBook.

We can also look at various diagnostics for either the imputation model or the MOI, obtained by clicking on the appropriate 'svg' files in the drop-down box. For example, the imputation trace 'Combined_sigma2.svg' shows the level 1 variance chains for the model of interest. There is one chain for each completed dataset, and each chain is displayed in a different colour.

An example fitting a 2-level model

  • First you will be asked if either the MOI or imputation model has two levels; we're going to fit a 2-level MOI, so answer Yes.

    Note that normally you will wish to have the same number of levels or classifications for the MOI and the imputation model. In some situations, however (for example, if level 2 effects are very small), you may want to fit a 1-level MOI for simplicity, while still carrying out imputation at 2 levels. If either the MOI or the imputation model has 2 levels, you will be asked for the level 2 ID.
  • You will then be asked to enter the level 2 ID, which in this example is school; this will be used in the MOI and/or imputation model, as appropriate.

  • Next, you will be asked a series of questions about your MOI: whether you want to model 1 or 2 levels (we're modelling 2 levels), the distribution you would like to use to model the response variable in your MOI (Normal, in our example), and finally whether you would like to fit random slopes / coefficients or not (we'll answer No; note, as indicated in the hover-over help available for this input, if you would like to fit a random slope/coefficient model, make sure that you include the response(s) in your MOI, AND the variables whose coefficients you wish to vary randomly at level 2, as responses in the imputation model).

  • The 2LevelImpute template takes this input, and uses it to choose the best Stat-JR template to fit your MOI. If you make the choices suggested above, it will choose a template called 2LevelMod, which fits random intercept models for Normal, binomial and Poisson responses.
  • You will then be asked for the response and explanatory variables for the MOI. We suggest a simple 2 level variance components model with:

    • normexam as the response;

    • cons, schgendmiss, standlrtmiss and girlmiss as explanatory variables (as you click on these you'll see that they appear in the box beneath; to deselect any chosen in error, simply click on the variable name in the lower box); remember to specify that schgendmiss is categorical (you can do so by using the checkboxes which appear below the input box when you select your explanatory variables).

  • Once you have specified the MOI, you will then, for each level, be asked to enter the responses for the imputation model; we'll work through each of the two levels in the sections below.

    Note that all explanatory variables must have no missing values - if any variable does, then include it as a response.
  • Given our MOI, we suggest the variables normexam, standlrtmiss and girlmiss for level 1; you'll then need to specify their distributions, here as Normal, Normal, and Binary, respectively.

  • You will also be asked to enter the explanatory (predictor) variables for each of these (if you scroll down this page, you'll see all the boxes), as follows:

    • for normexam just use an intercept cons;

    • for standlrtmiss use cons and also vrband (making sure that you tick the box that asks whether vrband should be treated as categorical - this will then use appropriate dummy variables where the final category is taken as the reference);

    • for girlmiss use cons.

    Since we have selected values to be missing purely at random, the use of a second predictor for standlrtmiss is not necessary, but you can include it to demonstrate that, as an auxiliary variable (not in the MOI), it can be used to help ensure missingness at random (MAR).
  • Finally, you'll be asked if there are any responses at level 2 for the imputation model; we'll answer Yes.

  • Next, you'll be asked to enter the responses for the imputation model at this level. We suggest using the variable schgendmiss, which is coded 0 for mixed schools, 1 for boys' schools and 2 for girls' schools: i.e. when asked, indicate that it has an Unordered distribution, with 3 categories (note, as indicated in the hover-over help available for this input, your categories, as represented in your dataset, need to be numbered from zero, sequentially in steps of one (i.e. 0, 1, 2 if you have 3 categories); if they are not, an error message will be returned).

  • You will then be asked for explanatory variables for each category dummy - we suggest you use cons for each.

    Note: If we have a response at level 2, it is not meaningful, in general, to have explanatory variables at level 1. So you should not specify these for any level 2 responses in the imputation model. You may, of course, specify level 2 explanatory variables for level 1 responses in the imputation model: this may be particularly useful for auxiliary variables.


Multiple imputation theory strictly applies only where the set of variables (response plus explanatory) for which we wish to impute missing values, all treated as responses, has a joint multivariate normal distribution. Where we have categorical data this is clearly not the case, and we therefore use a latent normal formulation (see: Goldstein, H., Carpenter, J., Kenward, M. and Levin, K. (2009). Multilevel models with multivariate mixed response types. Statistical Modelling, 9(3), 173-197). This works as follows:

  • For a binary response we utilise a probit model where we assume that the (0,1) response is derived from an underlying standard normal distribution with a mean value (determined by whatever explanatory variable predictors are in the imputation model for this response) that acts as a threshold - values above which are observed as a '1' and below which as a '0'. The MCMC algorithm incorporates a step that takes a random draw from the corresponding part of the standard normal distribution according to whether a '1' or '0' is observed, or imputes randomly if a value is missing.

  • For an ordered categorical response a similar procedure is used with additionally a set of thresholds defined on the standard normal scale that delineate the ordered categories.

  • For an unordered categorical variable with p categories a (p-1) dimensional multivariate normal distribution is generated.

  • For each of these the underlying normal distributions condition on the other responses (and explanatory variables) so that a joint multivariate normal is finally generated.
  • Next, you will be asked if you want to use the conditional algorithm or not; we suggest answering Yes, since the conditional algorithm is faster.

  • You will then be asked to specify the number of imputed datasets to use, the interval before the first imputation (this includes any burn-in period), the interval before any subsequent imputations (to ensure approximate independence) and, for the MOI, the burn-in and number of iterations. In the interests of time to completion we suggest some small numbers here (e.g. 4, 1000, 50, 100, 250, respectively), but you will generally need considerably longer runs than this, and a greater interval between iterations!

  • After clicking on the last Submit button (you'll also have to click Run if using the template via the Stat-JR:TREE interface), a counter at the top left will indicate that the software is running and tell you when results are available, plus on the next page you'll see a counter indicating when each imputed dataset has been returned...

    If your computer has more than one core processor, the imputation models and associated MOIs will in fact be run with parallel chains.


Inspecting the results

Soon after all imputed datasets have been returned, the results will appear beneath; a counter indicates each imputed dataset as it is returned.


Once Stat-JR has finished executing, a large range of different results are returned, including both tables and graphs; you can view all of these via the 'Resources' button at the top of this page, and they are also available via the drop-down list in the right-hand frame of the results page if running the template in the Stat-JR:TREE interface.

For example 'ResultsTable:Imputation' returns the estimates from the multiple imputation we've requested. Below (Table 2) we've selected the estimates from that output, and placed them side-by-side, in a combined table, with those from an analysis of complete cases only (available as 'ResultsTable:CompleteCases'; NB if the table below does not render properly, both these resources are available via the collapsible menus below it).

Since the data were set missing at random we do not expect the estimates to differ much, but note that the standard errors tend to be larger for the complete case analysis, in which 30% of the records have at least one MOI variable missing and have therefore been deleted.

Table 2: Estimates from multiple imputation and complete case analysis

Parameter       Variable name   Estimates from Multiple Imputation   Estimates from analysis of
                                (combined as per Rubin's rules)      Complete Cases only
\(\beta_0\)                     ()                                   ()
\(\beta_1\)                     ()                                   ()
\(\beta_2\)                     ()                                   ()
\(\beta_3\)                     ()                                   ()
\(\beta_4\)                     ()                                   ()
\(\beta_5\)                     ()                                   ()
\(\beta_6\)                     ()                                   ()
\(\beta_7\)                     ()                                   ()
\(\beta_8\)                     ()                                   ()
\(\beta_9\)                     ()                                   ()
\(\sigma^2\)                    ()                                   ()
\(\sigma^2_u\)                  ()                                   ()
Mean* DIC
Mean* pD

* 'Mean' refers to multiple imputation only

Separate tables for imputation and complete case analysis


You may like to try different variables, numbers of iterations, etc., by re-running this eBook. See the Stat-JR web site for further details about Stat-JR, including instructions on how to use the templates via Stat-JR:TREE, and also the final page of this eBook.

We can also look at various diagnostics for either the imputation model or the MOI, obtained by clicking on the appropriate 'svg' files in the drop-down box. For example, the imputation trace 'Combined_sigma2.svg' shows the level 1 variance chains for the model of interest; as you will see, the mixing seems fine. There is one chain for each completed dataset, and each chain is displayed in a different colour.

Multiple Imputation via Stat-JR:TREE

If you wish to try your own dataset, then you can easily do so via Stat-JR's Template Reading and Execution Environment (TREE) interface (formerly 'webtest'); you can run TREE alongside the DEEP interface you're reading now.

Once you've opened TREE (for further details see A Beginner's Guide to Stat-JR's TREE interface) and pressed Begin, you'll see that the black bar at the top of the TREE interface includes Dataset and Template options. If you select Choose from each of these, you will be able to change the dataset and template, respectively, away from their defaults.

Below we have a screenshot from TREE, in which we have selected the template 2LevelImpute and the dataset tutmiss (as used in this eBook), and have started to specify our inputs.

Uploading your own dataset

In the screenshot below, we have specified all the inputs, and have pressed the Next button a final time, after which the Run button appears, along with some other outputs in the pane at the bottom of the browser window (not shown here).



If you then press the Run button, the template will run (the progress gauge in the black bar will change from Ready to Working to indicate it is busy). Once it has finished, and all the results have been returned, you can view the outputted files in the pane towards the bottom of the browser window (see below for details of the various outputs returned). Use the drop-down list above the pane to choose particular outputs to view: for example, to view the imputed datasets, choose "Imputation_Model_impute_datafile_chain0_iter1000", "Imputation_Model_impute_datafile_chain1_iter1000", etc.

Note that the nomenclature here depends on both the inputs you have chosen and the number of processors on your machine. For example, if I chose Number of imputed data sets: 5, Number of iterations before first imputation: 1000 and Number of iterations between subsequent imputations: 500, on my machine with four processors, then I would see "...chain0_iter1000", "...chain1_iter1000", "...chain2_iter1000", "...chain3_iter1000", "...chain0_iter1500", "...chain1_iter1500", "...chain2_iter1500" and "...chain3_iter1500". Here it has run a chain / thread on each of the four available processors, and derived a dataset from each after the 1000th iteration (as stated earlier, this interval includes the burn-in). Since four datasets are fewer than the five requested, it has also constructed four more datasets following a further 500 iterations (as specified by Number of iterations between subsequent imputations: 500), but it will only use the first of these ("...chain0_iter1500") to make up the five datasets it needs.
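The naming and selection pattern described in this paragraph can be sketched as follows (an illustration of that pattern only, not Stat-JR's actual code):

```python
def imputed_dataset_names(n_datasets, n_processors, first, between):
    """Which impute_datafile_chain{c}_iter{i} datasets are produced:
    each pass draws one dataset per chain (one chain per processor),
    and passes repeat every `between` iterations until at least
    n_datasets have been produced; only the first n_datasets are used."""
    names = []
    iteration = first
    while len(names) < n_datasets:
        for chain in range(n_processors):
            names.append(f"impute_datafile_chain{chain}_iter{iteration}")
        iteration += between
    return names[:n_datasets]

# 5 datasets on 4 processors, first imputation at 1000, then every 500:
# chains 0-3 at iteration 1000, plus chain 0 at iteration 1500
used = imputed_dataset_names(5, 4, 1000, 500)
```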

The imputed datasets are available in the list of datasets accessible via Dataset > Choose in the black bar at the top. You can download these (as Stata format *.dta files) by first selecting the relevant dataset from the list, pressing the neighbouring Use button, and then downloading via Dataset > Download (again via the black bar at the top). Note that both Imputation_Model_impute_datafile_chain0_iter1000, etc. and impute_datafile_chain0_iter1000, etc. will appear in this list, but there's no need to download both: they're the same dataset, simply saved twice, with and without the Imputation_Model prefix. Also, don't confuse these with the level 2 datasets (e.g. impute__L2Data_chain0_iter1000 or Imputation_Model_impute__L2Data_chain0_iter1500).

Alternatively, you can press the green Download button to download all these outputted files (supported for the 2LevelImpute template from Stat-JR version 1.0.2 onwards). Note that the .dta extension will have to be manually added.

What is returned in the results pane?