REALCOM: Developing multilevel models for REAListically COMplex social science data

This software specialises in three areas: models with responses at several levels of a data hierarchy, multilevel structural equation models, and measurement error modelling. The models developed under the project were estimated using Markov Chain Monte Carlo (MCMC) estimation.

REALCOM downloads

We no longer support the original mixed-responses module of REALCOM as the functionality is included in REALCOM-Impute. Because of this the original REALCOM installer has been split into realcom-factor and realcom-measerr. Please note that these installers no longer contain the training manual, so you will need to download this from the link above.

Note: During installation you may get a message on your screen: ".Net Framework is not installed - do you want to stop this installation and install .Net first?". Answer: You do not need .Net to run the application. Further instructions are available in the training manual on page 2 (page 5 of the PDF).

Other materials

  • In the Realcom module "Structural equation modelling" you can also save the level 1 or level 2 factor score estimates at any specified iterations. Click on the Store Factor Scores button to specify these. They will be called "factorsiterlevel1n" or "factorsiterleveltwon", where n is the iteration number. After the model has been fitted you can also save the mean factor scores, together with their standard deviations, by clicking on Save factor means. This will be an (N x 2) matrix at level 1 or an (m x 2) matrix at level 2. You will be asked to specify the file names.
  • Training workshop presentation (PowerPoint, 4.3 MB)

Bug fixes

  • Sept 2014 - A new version of the measurement error software is now available. This fixes a previous problem whereby the estimates were biased unless the true predictor variables with measurement errors were uncorrelated with the remaining predictors. For most data sets this is unlikely to be the case and we recommend that you rerun any analyses with the new version. Note that the same problem affects the measurement error option in MLwiN, and we are seeking to correct that.
  • Version 16 Jan 2012 fixes a bug affecting models that contain one or more level 2 responses together with one or more level 1 responses with random effects at level 2. Results using the new version will generally differ only by a small amount from those using previous versions, unless the number of level 1 units is only a little larger than the number of level 2 units.
  • For some models with missing data in categorical variables, incorrect values may have been imputed. Generally this will have given clearly incorrect results. This bug is now fixed (25-Mar-09).
  • A bug has now been fixed in the Realcom Factor Module. This affected models with ordered categorical responses with more than 3 categories.
  • The March 2009 version fixed certain bugs for level 2 responses that could have caused crashes.

Previous bugs (earlier versions of the software)

To resolve the following bugs please ensure you have the latest version of REALCOM.

  • When using MLwiN in conjunction with REALCOM-impute, only the first imputed data set is used when running the imputation analysis. The workaround is to manually specify the data sets to use in the ISTA command. For example, if you have 10 data sets to analyse, then instead of selecting:
    Model > Imputation > Start Analysis
    go to the command window and run the command: ISTA 1 2 3 4 5 6 7 8 9 10
  • There was a bug in the mixed response modelling macros that would have affected some models with ordered categorical responses at level 2 (28-Nov-07). This has been corrected.
  • The class size data set originally supplied for the missing data example was incorrect. The new one is now part of the installation. The REALCOM training manual (PDF, 791kB) is now supplied with corrections to the data set description. Revised dataset (TXT, 27kB).

The research project

The project developed new methodology and associated training materials in the following areas of multilevel modelling: structural equation models, measurement errors and multivariate mixed response types at more than one level of the data hierarchy. The models developed under the project were estimated using Markov Chain Monte Carlo (MCMC) estimation.

The methodology builds upon that already implemented in MLwiN, which is described in the MLwiN manuals. The training materials are written in MATLAB and are available as free-standing programs. They are designed to interface with MLwiN in terms of data transfer but have their own graphical user interfaces for setting up models and displaying results. There is a set of training materials (PDF, 791kB) which provides an introduction to the methodology and a guide to using the software.

Applications are to a variety of problems, including flexible prediction models, multiple imputation for missing data in multilevel models, and misclassification errors in social status data.

Three repeated 1-day workshops were held in Bristol, London and Birmingham, June/July 2007.

The ESRC has rated this project as outstanding. The outstanding grade indicates that a project has fully met its objectives and has provided an exceptional research contribution well above average or very high in relation to the level of award. Go to ESRC award details.

The methodology builds upon that already implemented in MLwiN version 2.02 which is described in the MLwiN manuals. The training materials are written in MATLAB and are available as free-standing programs. They are designed to interface with MLwiN in terms of data transfer but have their own graphical user interfaces for setting up models and displaying results.

Measurement errors

Many of the variables used in the social and medical sciences contain measurement errors. These can arise from unreliable measuring instruments or problems with variable definitions, or may simply reflect temporal fluctuations, for example within individual units. The errors we are concerned with are essentially considered as random and are distinct from systematic errors, which can lead to biases.

There is a large statistical literature on the modelling of such errors, mostly dealing with the case of continuously distributed variables in single level linear and non-linear models. Fuller (2006) provides a comprehensive treatment. Here we extend existing work based upon MCMC estimation for multilevel models (Browne et al., 2001), which is incorporated in the MLwiN software (Browne, 2004). We deal with the 2-level case in detail, with extensions to three levels being relatively straightforward. Extensions to handle cross classified and multiple membership models (Goldstein, 2003, Chapters 11 & 12) also involve just the addition of appropriate sampling steps within the MCMC algorithm. The consequences of ignoring measurement errors are well known: doing so typically leads to underestimation of coefficients and biased standard errors. In multilevel models we will also obtain biased estimates of covariance matrices.
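The underestimation of coefficients described above can be illustrated with a small simulation. The following is a minimal sketch and not the MCMC method used in REALCOM: it assumes a single-level linear model with one predictor and classical (random, independent) measurement error, and all function names and parameter values are hypothetical. Under those assumptions the least-squares slope fitted to the error-prone predictor shrinks towards zero by the reliability ratio, that is, the true-score variance divided by the observed-score variance.

```python
import random
import statistics


def ols_slope(x, y):
    """Ordinary least squares slope of y on a single predictor x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_attenuation(n=20000, beta=1.0, sd_true=1.0, sd_error=1.0, seed=1):
    """Compare slopes fitted to the true and the error-prone predictor.

    All values are illustrative assumptions, not taken from the project.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, sd_true) for _ in range(n)]
    # Response depends on the true predictor plus residual noise.
    y = [beta * x + rng.gauss(0.0, 0.5) for x in x_true]
    # Observed predictor = true value + independent random measurement error.
    x_obs = [x + rng.gauss(0.0, sd_error) for x in x_true]
    reliability = sd_true**2 / (sd_true**2 + sd_error**2)
    return {
        "slope_true_x": ols_slope(x_true, y),
        "slope_observed_x": ols_slope(x_obs, y),
        "expected_attenuated_slope": beta * reliability,
    }


result = simulate_attenuation()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

With equal true-score and error variances the reliability is 0.5, so the slope on the observed predictor should come out at roughly half the true coefficient. Correlated errors, misclassification and the multilevel structure handled by the software are deliberately left out of this sketch.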

The innovations introduced are to handle correlated measurement errors and also misclassification errors in binary predictor variables. The main example is taken from a study of class size which involves both continuous predictors with correlated measurement errors and a binary predictor with misclassification error.

Latent variable (factor) models

The MATLAB routines that have been developed extend existing models for multilevel factor analysis in the following ways. First, they allow certain constraints across parameters that are important for interpretation. Secondly, they allow different ways of specifying level 2 latent variables. Thirdly, they use MCMC estimation rather than maximum likelihood (ML). One problem with ML estimation is that it becomes very slow when the number of parameters in the model becomes large, with computing time typically increasing factorially with the number of parameters; MCMC estimation avoids this kind of dependence on the number of parameters.

The workshops presented two examples, one from demography and one from education, that illustrate, for two levels, how to set up and analyse such models.

Responses at more than one level

Multivariate models, including those which incorporate a multilevel structure, are traditionally confined to responses at the lowest level of the data hierarchy and usually also deal with Normally distributed responses. One exception to the latter, implemented in MLwiN, is where the responses are binary, sometimes together with Normal responses. Browne (2004) discusses such models and gives examples. There are also some examples of the use of Normal responses jointly at levels 1 and 2: Steele et al. (2007) model pupil and school level Normal responses in a multiprocess model for evaluating the impact of school resources on student achievement, and Goldstein (1989) fits a model with repeated measures on individuals during growth (level 1) jointly with their adult height (level 2) as the basis for an efficient prediction model. The MATLAB routines additionally allow any of the responses to be ordered or unordered categorical variables. This is particularly useful when we wish to carry out multiple imputation for missing data, where missingness may occur with continuous or discrete data. Examples are given using growth data and class size data.

Papers

  • Modelling measurement errors and category misclassifications in multilevel models (PDF, 144kB). Harvey Goldstein, Daphne Kounali and Anthony Robinson: Statistical Modelling 2008; 8 (3): 243-261. Models are developed to adjust for measurement errors in normally distributed predictor and response variables and categorical predictors with misclassification errors. The models allow for a hierarchical data structure and for correlations among the errors and misclassifications. Markov Chain Monte Carlo (MCMC) estimation is used. The models with examples are also described in the REALCOM training manual and users can fit these in the REALCOM software.
  • Multilevel structural equation models for the analysis of comparative data on educational performance (PDF, 220kB). Harvey Goldstein, Gérard Bonnet, Thierry Rocher (Ministère de l’Education Nationale, de l’Enseignement Supérieur et de la Recherche, Direction de l’Évaluation et de la Prospective, Paris). The Programme for International Student Assessment comparative study of reading performance among 15-year-olds is reanalyzed using statistical procedures that allow the full complexity of the data structures to be explored. The article extends existing multilevel factor analysis and structural equation models and shows how this can extract richer information from the data and provide better fits to the data. It shows how these models can be used fully to explore the dimensionality of the data and to provide efficient, single-stage models that avoid the need for multiple imputation procedures. Markov Chain Monte Carlo methodology for parameter estimation is described.
  • Multilevel Models with multivariate mixed response types (Sage publication). Harvey Goldstein, James Carpenter, Michael G Kenward, Kate A Levin. We build upon the existing literature to formulate a class of models for multivariate mixtures of Gaussian, ordered or unordered categorical responses and continuous distributions that are not Gaussian, each of which can be defined at any level of a multilevel data hierarchy. We describe an MCMC algorithm for fitting such models. We show how this unifies a number of disparate problems, including partially observed data and missing data in generalised linear modelling. The 2-level model is considered in detail with worked examples of applications to a prediction problem and to multiple imputation for missing data. We conclude with a discussion outlining possible extensions and connections in the literature. Software for estimating the models is freely available. Note: other useful resources for treating missing data can be found at the London School of Hygiene and Tropical Medicine's Missing data website.

The REALCOM team

Harvey Goldstein (project director), Jon Rasbash, Fiona Steele (co-directors), Christopher Charlton (research officer), Hilary Browne (web developer), Sophie Pollard (project assistant)

This three-year ESRC-funded research project developed multilevel modelling techniques, software and training materials in three areas: models with responses at several levels of a data hierarchy, multilevel structural equation models, and measurement error modelling. The models developed under the project were estimated using Markov Chain Monte Carlo (MCMC) estimation.

Missing data

Missing data are a persistent problem in social and other datasets. A standard technique for handling missing values efficiently is known as multiple imputation, and the software REALCOM-IMPUTE is unique in that it has been designed to implement this procedure for 2-level data. Apart from being able to deal with 2-level data, it can also properly handle categorical data, whether in the response or predictor variables in a model. An interface is provided with MLwiN that allows users to carry out the full procedure and fit their final model semi-automatically.
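In multiple imputation, the final model is fitted to each imputed data set and the results are then combined. As a minimal sketch of that combination step, the function below applies the standard combination rules (Rubin's rules) to one parameter. It is an illustration only: the source does not describe how MLwiN or REALCOM-IMPUTE performs this step internally, and the function name and the example numbers are hypothetical.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine one parameter's results from m imputed data sets.

    estimates: point estimate from each imputed data set
    variances: squared standard error from each imputed data set
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists with at least two imputations")
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance
    total = within + (1 + 1 / m) * between     # total variance of pooled estimate
    return pooled, total ** 0.5


# Made-up results from five imputed data sets, for illustration only.
estimates = [0.52, 0.48, 0.50, 0.55, 0.45]
variances = [0.010, 0.010, 0.010, 0.010, 0.010]
pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled_estimate:.3f}, pooled SE = {pooled_se:.3f}")
```

The between-imputation term is what distinguishes this from simply averaging the standard errors: it carries the extra uncertainty due to the missing values into the pooled standard error.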

References

Blatchford, P., Goldstein, H., Martin, C. and Browne, W. (2002). A study of class size effects in English school reception year classes. British Educational Research Journal 28: 169-185.

Browne, W. J. (2004). MCMC estimation in MLwiN. Version 2.0. London, Institute of Education.

Browne, W., Goldstein, H., Woodhouse, G. and Yang, M. (2001). An MCMC algorithm for adjusting for errors in variables in random slopes multilevel models. Multilevel modelling newsletter 13(1): 4-9.

Fuller, W. A. (2006). Measurement Error Models. New York, Wiley.

Goldstein, H. (1989). Models for Multilevel Response variables with an application to Growth Curves. Multilevel Analysis of Educational Data. R. D. Bock. New York, Academic Press: 107-125.

Goldstein, H. (2003). Multilevel Statistical Models. Third edition. London, Edward Arnold.

Goldstein, H. and Browne, W. (2005). Multilevel factor analysis models for continuous and discrete data. Contemporary Psychometrics. A Festschrift to Roderick P. McDonald. A. Olivares and J. J. McArdle. Mahwah, NJ, Lawrence Erlbaum.

Lawley, D. N. and Maxwell, A. E. (1971). Factor analysis as a statistical method. London, Butterworth.

MathWorks (2004). MATLAB.

McDonald, R. P. and Goldstein, H. (1989). Balanced versus unbalanced designs for linear structural relations in two-level data. British Journal of Mathematical and Statistical Psychology 42: 215-232.

Rabe-Hesketh, S., Pickles, A. and Skrondal, A. (2001). GLLAMM: a general class of multilevel models and a STATA program. Multilevel modelling newsletter 13(1): 17-23.

Steele, F., Vignoles, A. and Jenkins, A. (2007). The Impact of School Resources on Pupil Attainment: A Multilevel Simultaneous Equation Modelling Approach. Journal of the Royal Statistical Society, A. 170.
