Sample sizes for multilevel models
- Sample sizes for multilevel models
- Sample size, significance testing and suitability of data
- How can I calculate confidence intervals in MLwiN?
- What significance tests for both random and fixed parameters are available in MLwiN?
- How do I test whether two coefficients are equal?
- How do I test whether more than 2 coefficients are equal?
- How do I test if an interaction between a continuous and a categorical variable is significant?
Fitting a 2-level multilevel model means that we are assuming that
- we have observations which contain values that vary randomly according to a distribution e.g. normal, Poisson etc. and
- these observations are contained in identifiable clusters (represented by a categorical variable) which themselves are in some sense randomly drawn from some population e.g. a sample of schools from a population of schools.
It is clearly wrong to fit some categorical variables e.g. gender as a level in a model, more because it doesn't make much sense to think of male and female as a sample of genders from a population of genders than because there are only 2 of them! This said (and to complicate things) people do fit random effects when they have the full population of units e.g. all schools in a district. In such cases we consider an underlying superpopulation.
Different software packages and estimation algorithms have more or less success in fitting a multilevel model depending on the size of the data including the numbers of level 2 units and level 1 units per level 2 unit. Problems will occur if
- the data or model is too large - memory problems, slow execution etc.
- the data is too small - not enough data to properly estimate the model.
This is not just a problem for multilevel models. If you only collected 10 observations for a regression you might not be overly confident of the estimated regression line and the same is true if in a multilevel model if you only collect data on 10 schools - you will not get a very accurate estimate of their variability.
This doesn't mean you cannot attempt to fit a multilevel model with 10 schools however you may find that the software estimates the variance as zero and if the variance is estimated when you look at the 10 residuals it will be hard to justify that they fit a normal model (as opposed to another distribution).
Rules of thumb such as only doing multilevel modelling with 15 or 30 or 50 level 2 units can be found and are often personal opinions based on personal experience and varying reasons e.g. getting a non zero variance, being able to check the normality assumption etc. For example, in Gelman's new book recently he will attempt multilevel modelling with as little as 3 level 2 units! (note he is generally using MCMC so will not get a 0 variance estimate). It is clear that, if you use IGLS in MLwiN and want the estimate not to be 0 then the more units the better, and how many will depend on how variable the cluster means are relative to the data.
There are also sample size considerations which you might like to consider (prior to collecting data) which will give you desirable numbers of level 1 and level 2 units. This is generally to achieve power for testing fixed effects, see Sample Size calculations in multilevel modelling (PowerPoint, 0.5 mb) for some slides on this. Of course you could not worry about clustering by simply taking your sample from 1 cluster but this will mean your population and hence your conclusions also only correspond to the 1 cluster!
We are pleased to make available a new free piece of software MLPowSim that is designed for performing sample size/power calculations in multilevel models via simulation.
The software package has been developed as part of a UK ESRC funded-project and is an 'old-fashioned' text input program that creates files that can be used in conjunction with MLwiN or R to perform the necessary computations to perform complex power calculations.
The program is available along with an extensive 150 page manual from my web page
We will describe the software currently as a Beta version as we have only had time to do preliminary testing and we haven't included much error trapping. The software is FREE and as such comes with no guarantees in terms of producing correct answers (although we hope it does!) and no guarantee of fast response to fixing of any bugs reported (although we hope there aren't many). We will however be genuinely pleased if people use it and let us know of any bugs they find or if they have a 'wish list' of additional features they might like.
Good luck with the software,
You can calculate the confidence intervals by multiplying the standard errors of the parameters by the appropriate amount using the Calculate window (available from the Data Manipulation menu) or the CALC command, or in a different package of your choice, or using a pocket calculator. The standard errors can be found in the Equations window displayed in brackets after the parameter estimate. If you have many parameters, you may find it quicker to use a macro. There is a macro available on our website which will calculate the z-ratios and confidence intervals. If you want to write your own, then information on where the standard errors are stored in the worksheet can be found here
If you are using MCMC, then instead of confidence intervals based on standard errors and the assumption that the sampling distribution of each parameter is Normal, you may want confidence intervals derived from quantiles of the chains of parameter estimates. These can be obtained by selecting Trajectories from the Model menu, and then in the window that appears clicking on the graph for one of the parameters (and selecting Yes in answer to the question that appears). This brings up the MCMC diagnostics window which contains some information based on the chain of estimates for that parameter. The information you want is at the bottom, in the second row of the Summary Statistics section. This gives the 2.5%, 5%, 50%, 95% and 97.5% quantiles. If you want different quantiles or if you want to calculate the quantiles for many parameters at once without having to click on each graph individually to bring up the relevant window, then again you might want to write a macro. The MCMC manual ('MCMC Estimation in MLwiN') gives details of where to find the stored chains of parameter estimates and how to split the one long variable containing this information into a separate variable for each parameter (p58-59).
For Normal response models, Wald tests and, likelihood ratio tests as well as t-tests can be conducted for testing the significance of fixed parameters. They all produce similar conclusions when testing on a single parameter. The Wald test can be carried out through the Intervals and Tests window available from the Model drop down menu (for some examples of tests using this window see How do I test whether two coefficients are equal? and How do I test if an interaction between a continuous and a categorical variable is significant?). The likelihood ratio test can be calculated by taking the difference in the -2log-likelihood values of any two nested models, available from column C1091. The degrees of freedom for the test is the difference in the number of parameters between the two models. The t-test statistic is given by the ratio of the estimate over its standard error. For the random parameter estimates, we recommend using the Wald test or the likelihood ratio test. For non-linear models, no likelihood values are available as these models are fitted using quasi likelihood, hence the likelihood ratio test is not available. One can use the Wald test for an approximate answer. In some extreme cases, MCMC and bootstrapping estimation procedures available in MLwiN should be considered. Both methods provide parameter chains for the empirical distribution or confidence intervals for each parameter estimate.
Example question: I have a categorical variable with 6 categories which I have entered into my model as an explanatory variable, and I want to test whether the coefficients for two of the dummy variables are significantly different from each other. How do I do this?
After running the model, select Intervals and tests from the Model menu. In the window that appears, select fixed (at the bottom of the window) and in the box # of functions type 1 (if it does not already say 1). In the top part of the window type a 1 in the box next to one of the dummy variables and a -1 in the box next to the other (it does not matter which way round you do this). In the box next to constant(k) type 0 (if it does not already say 0). Now press the Calc button at the bottom of the window. The box chi sq, (f-k)=0. (1df) now contains the test statistic to be compared against the Χ2 distribution with 1 degree of freedom, and we can make the comparison using the Tail Areas window from the Basic Statistics menu. The null hypothesis is that there is no difference between the coefficients. (Important note: it may be necessary to resize the column in order to make sure you can see the entire number in the chi sq, (f-k)=0. (1df) box. To do this put the cursor over the column divider right at the top (in the blue row) and it should become a double arrow; you can then click and drag to resize).
tutorial dataset supplied with MLwiN and that we have set up in the Equations window a random intercepts model with explanatory variables standlrt and vrband (with vb1 as the reference category). Suppose we want to test whether the coefficients of vb2 and vb3 are equal. We run the model, and then select Intervals and tests from the Model menu. In the window that appears, we select fixed (at the bottom of the window) and in the box # of functions we type 1. In the top part of the window we type a 1 in the box next to fixed : vb2 and a -1 in the box next to the fixed : vb3. In the box next to constant(k) we type 0 (which is the default so may already be entered). Then we press the Calc button at the bottom of the window. We see from the box chi sq, (f-k)=0. (1df) that our test statistic is 73.879 (we check it really is 73.879 and not 273.879 or 5673.879 by resizing the column). We select Tail Areas window from the Basic Statistics menu. In the window that appears, we leave Chi Squared selected, in the Value box we type 73.879, and in the Degrees of freedom box we type 1. We press Calculate and the Output window appears giving us a probability of 8.3055x10-18 (= 0.0000000000000000083055). We can thus reject the null hypothesis that there is no difference between the coefficients of vb2 and vb3 and conclude that the coefficients are significantly different from each other.
An important point to note is that we put a 1 next to one of the coefficients we wish to test and a -1 next to the other. This is because the Intervals and tests window adds together everything in the column and tests whether the result is significantly different from 0. So by filling in the column in this way we are testing whether Β2-Β3=0, which is equivalent to testing whether Β2=Β3. We are testing whether there is a difference between the two coefficients as was our aim. If we instead put a 1 next to each of the coefficients we would be testing whether Β2+Β3=0, which is equivalent to testing whether Β2=-Β3. We would be testing whether the coefficients had the same magnitude but opposite signs, which is not what we wanted to test.
Note that an alternative way to test whether two categories have equal coefficients is to refit the model so that one of the categories is the reference. In that case the coefficient of the other category will be a contrast between that category and the omitted category, and the usual Z-test can be carried out.
Example question: I have a categorical variable with 6 categories which I have entered into my model as an explanatory variable, and I want to test whether there is a significant difference between any of the coefficients for the dummy variables, or whether all the coefficients are equal. How do I do this?
To test whether more than 2 coefficients are equal, we use the Intervals and tests window as in the FAQ How do I test whether two coefficients are equal? We proceed as described in the answer to that question, but in the box # of functions we type the number of pairs of coefficients that we need to compare. For example, if we wanted to test whether 3 coefficients Β1, Β2 and Β3 are equal, we would type 2 because we need enough pairs for each coefficient to enter into at least one comparison (so for example we might compare Β1 with Β2 and Β2 with Β3). If we wanted to test whether 5 coefficients were equal (as in the case of testing whether the coefficients for all dummy variables for a 6-category variable are equal) we would enter 4. In general, we enter n -1 where n is the number of coefficients we want to test.
We fill out the columns that appear in the top part of the window according to the same principle as when testing whether two coefficients are equal: each column corresponds to one comparison between two coefficients, and in that column we enter a 1 in the row corresponding to one of the coefficients and a -1 in the row corresponding to the other. There are many possible ways of choosing the pairs to compare, but one way that will always work is to take the first coefficient and compare it with each of the others. Thus when comparing 3 coefficients we can compare Β1 with Β2 in the first column and Β1 with Β3 in the second column; and when comparing 5 coefficients we can compare Β1 with Β2 in the first column, Β1 with Β3 in the second column, Β1 with Β4 in the third column, and Β1 with Β5 in the fourth column. So when comparing 3 coefficients, supposing that the model consists of cons as the first explanatory variable and the 4-category explanatory variable as the only other explanatory variable, one correct way to fill in the columns (down to the constant(k) row) would be:
|Column 1||Column 2|
and when comparing 5 coefficients (again with cons and the 6-category explanatory variable as the only explanatory variables in the model), one correct way of filling in the columns would be:
If taking this approach, the 1s should all be in the same row and the -1s should form a diagonal line going downwards from left to right.
As for testing whether 2 coefficients are equal, we put 0 in each column in the constant(k) row (and in any rows corresponding to other explanatory variables) and press Calc.
In the chi sq, (f-k)=0. (1df) row of each column appears a test statistic which allows us to make the comparison between the particular pair of coefficients set up in that column, but the statistic we need to make our test that all the coefficients are equal is to be found near the bottom of the window where it says joint chi sq test(?df) (where the question mark will be the number of comparisons we made in the top part of the window). We can get a p-value for this test statistic using the Tail Areas window as described in the answer to How do I test whether two coefficients are equal? (remembering to input the appropriate value for Degrees of freedom, which will not now be 1). This will be the p-value for the test with null hypothesis that all the coefficients are equal; thus if the p-value is small enough we can reject this hypothesis and conclude that (at least some of) the coefficients are significantly different from each other (although there may be no significant difference between some pairs and we would probably want to investigate further to determine for which pairs there is a significant difference and for which we cannot reject the hypothesis that there is no difference).
What exactly is the test statistic that appears at the bottom of the window next to joint chi sq test (?df) and why does it test that all the coefficients are equal? This test statistic is the test statistic for making all the comparisons specified in the columns simultaneously: for example, in the test described above to compare 3 coefficients, it is the statistic for the test that Β1=Β2 AND ALSO Β1=Β3. In other words, when we reject the null hypothesis (that both these equalities hold), this implies that there is a significant difference between Β1 and Β2 and/or a significant difference between Β1 and Β3 and/or a significant difference between Β2 and Β3 (at whatever level of significance we are using). Note that last point: although none of our columns made a comparison between Β2 and Β3 , if Β2≠Β3 then our null hypothesis is not true. This is because Β1=Β2 and Β1=Β3 means Β2=Β3 . If there is a significant difference between Β2 and Β3 then it cannot simultaneously be true that Β1=Β2 and Β1=Β3 (even though the test statistics in the chi sq, (f-k)=0. (1df) row may indicate that when we make the comparisons separately, we find there is no significant difference between Β1 and Β2 and that there is no significant difference between Β1 and Β3). Therefore another way to describe the joint test is to say that it is a test of whether Β1=Β2=Β3 . So we can see that indeed the joint test is the test that we wanted, that tests whether all the coefficients are equal.
After running the model, select Intervals and tests from the Model menu. In the window that appears, select fixed (at the bottom of the window) and in the box # of functions enter the number of terms in the interaction (which will usually be the number of dummy variables that the categorical variable is entered as). In the top part of the window you will now see a number of columns equal to the number you have just typed in the box. In the first column, enter a 1 in the row corresponding to the first interaction term, in the second column enter a 1 in the row corresponding to the second interaction term, and so on for all the columns. Check that all the other rows in each column from the top to the constant(k) row read 0.000 (if not change the number to 0). Select the Calc button at the bottom of the window. The numbers in the chi sq, (f-k)=0. (1df) row now give the test statistic for each separate test (that the coefficient of the interaction term is 0). So for example the number in this row in the first column is the test statistic for the test that the coefficient of the first interaction term is 0. If we compare any of these statistics to the Χ2 distribution with 1 degree of freedom we should get very similar results to if we were to divide the coefficient of the corresponding interaction term by its standard error and compare against the standard normal distribution. We use these statistics to test separately whether the coefficient of each interaction term is significantly different from 0. (Important note: it may be necessary to resize the columns in order to make sure you can see the entire numbers in the chi sq, (f-k)=0. (1df) row. To resize a column put the cursor over the column divider right at the top (in the blue row) and it should become a double arrow; you can then click and drag to resize). At the bottom of the window there appears a number next to joint chi sq test(?df) = (where the ? is the number of columns at the top of the window). This is the test statistic for the test of the null hypothesis that all of the coefficients are equal to 0. If the null hypothesis can be rejected then at least one of the coefficients of the interaction terms is significantly different from zero. In this case, it is common to leave all the interaction terms in the model, just as it is common to leave in all dummy variables for a categorical variable when the coefficients of only some of them are found to be significantly different from zero. In some cases, it may be possible to group together categories with interaction coefficients not significantly different from zero (retaining the full set of categories as main effects if these are significantly different from one another). However, as usual, care should be taken to ensure that it is substantively sensible to combine categories.
For example suppose that we are working with the tutorial dataset supplied with MLwiN and that we have set up in the Equations window a random intercepts model with explanatory variables standlrt, vrband (with vb1 as the reference category), and an interaction between standlrt and vrband (again with vb1 as the reference category). We now run the model and select Intervals and tests from the Model menu. In the window that appears, we select fixed (at the bottom of the window) and in the box # of functions we type 2 because there are 2 terms in the interaction, standlrt.vb2 and standlrt.vb3. In the top part of the window there are now two columns. In the first column, we enter a 1 in the fifth row since this row corresponds to standlrt.vb2, and in the second column we enter a 1 in the sixth row since this row corresponds to standlrt.vb3. We then select the Calc button at the bottom of the window. Looking at the bottom of the Intervals and tests window we see joint chi sq test(2df) = 5.141. Using the Tail Areas window, we find that the p-value for this test is 0.076497. Therefore we cannot reject at the 5% level the null hypothesis that the coefficients of both the interaction terms are zero, and we might wish to remove the interaction terms from our model. However, in the chi sq, (f-k)=0. (1df) row we now see 1.745 in the first column and 5.086 in the second column (we resize the columns to check these numbers really are 1.745 and 5.086, and not for example 11.745 and 35.086). Using the Tail Areas window from the Basic Statistics menu, we discover that the p-value for the first column is 0.18651 and the p-value for the second column is 0.024120. Thus looking at the interaction terms separately, we could reject at the 5% level the null hypothesis that the coefficient of standlrt.vb3 is 0, but we could not reject at the 5% level the null hypothesis that the coefficient of standlrt.vb2 is 0. From a statistical point of view, this might lead us to drop the standlrt.vb2 term from the model while retaining standlrt.vb3. This would imply that the effect of standlrt is the same in categories 1 and 2 of vb, but different in category 3 of vb. If the main effects of vb2 and vb3 are both significantly different from zero, we would retain these in the model. So the model would include main effects of standlrt, vb2 and vb3 and standlrt.vb3. Whether we decide to use this model, to drop the interaction terms altogether, or to retain both of them will depend on how much sense it makes substantively to fit a model which specifies the same effect of standlrt in categories 1 and 2 of vb.