Estimating the Context Effect in a Multilevel Latent Model with Small Sample Sizes: A Monte Carlo Simulation Study

In multilevel modeling, the relationships between the criterion and predictors are investigated at different levels. Often, the cluster-level predictors are measured by aggregating the individual-level measures. However, the aggregated cluster-level predictors are not always reliable measures of the cluster-level construct, which biases the estimate of the cluster-level regression coefficient and therefore of the context coefficient. This study investigates an alternative approach: regressing the criterion on the latent cluster mean of the predictor by using a multilevel latent model. The accuracy of the context coefficient estimate and of its standard error is compared under a wide range of conditions. Results reveal that bias for the context effect is small in the multilevel latent model. The maximum likelihood (ML) estimator yields more accurate standard error estimates than robust maximum likelihood (MLR) when the number of clusters is small (less than 50). Very small within-cluster sample sizes (less than 10) should be avoided because they yield low power and large empirical sampling variance.


Introduction
Data collected in educational research are often multilevel, for example with students clustered within schools or repeated measures clustered within individuals. When data are multilevel and a predictor variable varies both within clusters and between clusters (such as individual socioeconomic status (SES) scores varying within schools and average SES scores varying between schools), researchers are frequently interested in estimating the within-cluster and between-cluster relationships of the predictor to the criterion. Often the quantity of interest is the context coefficient [1,2,3], that is, the difference between the regression coefficients for the between-cluster and within-cluster relationships. Contextual analysis evaluates whether the aggregated group characteristic (L2) has an effect on the outcome variable after controlling for individual-level characteristics (L1).
In many cases, L2 variables are based on the aggregation of L1 variables. One problematic aspect of context effect analysis is that the observed group average obtained by aggregating individual observations may not be a very reliable measure of the unobserved group average if only a small number of L1 individuals is sampled from each L2 group [1,4]. A few researchers have applied the integration of structural equation modeling (SEM) and multilevel modeling (MLM) to contextual analysis, taking both measurement error and sampling error into account [1,4,5]. The software Mplus is recommended as being particularly versatile for all forms of latent variable modeling, including the integration of SEM and MLM [1].
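In the multilevel latent mean approach to estimating context effects, the following equation is estimated [4,6]:

$Y_{ij} = \gamma_0 + \gamma_W (X_{ij} - \mu_{Xj}) + \gamma_B \mu_{Xj} + \delta_j + \varepsilon_{ij}$

In this equation, $Y_{ij}$ and $X_{ij}$ are the criterion and predictor scores for individual $i$ in cluster $j$, $\mu_{Xj}$ is the expected value of the predictor scores within the $j$th cluster, $\delta_j$ is a between-cluster residual, $\varepsilon_{ij}$ is a within-cluster residual, $\gamma_0$ is the intercept, $\gamma_W$ is the within-cluster regression coefficient, and $\gamma_B$ is the between-cluster regression coefficient. The context coefficient is the difference $\gamma_C = \gamma_B - \gamma_W$.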
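To see why the difference between the two slopes is read as a context effect, it can help to expand the centered predictor. The short derivation below is only a sketch: it assumes the model has the usual group-mean-centered form, with $Y_{ij}$ and $X_{ij}$ the criterion and predictor scores of individual $i$ in cluster $j$, and it uses $\gamma_C$ as a label for the context coefficient. Any symbol not defined in the surrounding text is an assumption made for illustration.

```latex
\begin{align*}
Y_{ij} &= \gamma_0 + \gamma_W (X_{ij} - \mu_{Xj}) + \gamma_B \mu_{Xj} + \delta_j + \varepsilon_{ij} \\
       &= \gamma_0 + \gamma_W X_{ij} + (\gamma_B - \gamma_W)\,\mu_{Xj} + \delta_j + \varepsilon_{ij} \\
       &= \gamma_0 + \gamma_W X_{ij} + \gamma_C\,\mu_{Xj} + \delta_j + \varepsilon_{ij},
\qquad \gamma_C = \gamma_B - \gamma_W .
\end{align*}
```

Read this way, $\gamma_C$ is the expected difference in the criterion between two individuals who have the same predictor score but whose clusters differ by one unit in $\mu_{Xj}$.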
A multilevel latent mean approach corrects the bias in parameter estimates of contextual effects due to sampling error associated with aggregating L1 variables to L2 constructs [1]. The cluster-level averages ($\bar{X}_j$) are estimates of the cluster-level expected values ($\mu_{Xj}$). In the traditional multilevel approach to estimating context effects, $\bar{X}_j$ is used in place of $\mu_{Xj}$:

$Y_{ij} = \gamma_0 + \gamma_W (X_{ij} - \bar{X}_j) + \gamma_B \bar{X}_j + \delta_j + \varepsilon_{ij}$

In this approach, the cluster-level averages are assumed to be measured without sampling error [3,7]. The unreliability of the sample cluster average leads to biased estimation of the between-cluster regression coefficient, which in turn leads to bias in the context coefficient.
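The size of this problem can be illustrated with a small calculation. The sketch below assumes balanced clusters and a large number of clusters, and uses two standard results: the reliability of an observed cluster mean based on n members is n·ICC / (1 + (n − 1)·ICC), and the slope on the observed cluster means is then, in expectation, approximately a reliability-weighted mix of the between and within coefficients. The function names and the coefficient values (0.5 and 0.2) are hypothetical and chosen only for illustration.

```python
def cluster_mean_reliability(n: int, icc_x: float) -> float:
    """Reliability of an observed cluster mean based on n members."""
    return n * icc_x / (1.0 + (n - 1) * icc_x)


def expected_manifest_between(gamma_b: float, gamma_w: float,
                              n: int, icc_x: float) -> float:
    """Approximate expected slope on observed cluster means (large K, balanced n)."""
    lam = cluster_mean_reliability(n, icc_x)
    return lam * gamma_b + (1.0 - lam) * gamma_w


if __name__ == "__main__":
    gamma_b, gamma_w = 0.5, 0.2  # hypothetical population values, so gamma_c = 0.3
    for icc_x in (0.05, 0.10, 0.20):
        for n in (5, 10, 15, 30):
            lam = cluster_mean_reliability(n, icc_x)
            between = expected_manifest_between(gamma_b, gamma_w, n, icc_x)
            context = between - gamma_w  # attenuated context coefficient
            print(f"ICC_X={icc_x:.2f} n={n:2d} "
                  f"reliability={lam:.3f} manifest context={context:.3f}")
```

Under these assumptions the context coefficient based on observed means shrinks toward zero by the reliability factor, which is why small n combined with a small predictor ICC is the least favorable case for the traditional approach.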
In the present study, the multilevel latent model is the focus and an alternative set of conditions was investigated. Specifically, the study investigated level-2 sample sizes (smaller than were investigated by Ludtke et al. [4]) and a range of intraclass correlation coefficients (ICCs) for the outcome variable. Ludtke et al. did not investigate the latter factor and the range of ICCs no doubt varies across multilevel studies. The smaller level-2 sample size was investigated because it is not unusual to find multilevel studies with the number of groups smaller than 50 [8].
In multilevel analysis, the problem with sample size is usually at the group level [9,10,11]. Previous research shows that a small sample size at level two leads to biased estimates of the second-level standard errors [10], and that increasing the higher-level sample size improves power more than increasing the within-cluster sample size [11]. However, in practical research, increasing the number of groups may be difficult because of the cost and inconvenience of recruiting new organizations [10,12]. Thus an alternative strategy for obtaining accurate estimates of parameters and standard errors is needed for the multilevel latent model.
A few studies have investigated estimation methods in the multilevel framework. Hox et al. [9] compared full information maximum likelihood (ML), robust maximum likelihood (MLR), and diagonally weighted least squares (DWLS) estimation in multilevel structural equation modeling. They found a clear interaction effect between the number of clusters and the estimation method. They also found that ML yielded less biased estimates than DWLS and MLR when the sample size was small. Maas and Hox [13] showed that restricted ML estimation had better coverage rates for the main fixed effects than robust estimation. Although the studies are not completely in agreement, they all conclude that the estimated coefficients are unbiased and that the standard errors tend to be underestimated when the sample size is small [1,9,13]. In this study, MLR and ML are compared for context effect estimation. MLR is the default estimator for multilevel models in Mplus because it offers some protection against heterogeneity. The robust standard errors use the observed residual variance to correct the asymptotic standard errors. The likelihood function of the multilevel full ML approach in the context of SEM is defined as follows [9]:

$F = \sum_{i=1}^{N} \log|\Sigma_i| + \sum_{i=1}^{N} (x_i - \mu_i)' \Sigma_i^{-1} (x_i - \mu_i)$

where the subscript $i$ refers to the observed cases, $x_i$ refers to the variables observed for case $i$, and $\mu_i$ and $\Sigma_i$ contain the population means and covariances of the variables observed for case $i$. Multilevel data fit this framework when clusters are treated as the observations and the individuals within a cluster are treated as the variables.
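The idea of treating clusters as observations and individuals as variables can be made concrete with a small sketch. It assumes the casewise fit function has the standard form, a sum over cases of log|Σ_i| + (x_i − μ_i)' Σ_i⁻¹ (x_i − μ_i), and that each cluster's covariance matrix has a compound-symmetry structure (a shared between-cluster variance plus an independent within-cluster variance). The function names and the numeric values are hypothetical; this is not the Mplus implementation.

```python
import numpy as np


def casewise_fit(cases):
    """Sum of log|Sigma_i| + (x_i - mu_i)' Sigma_i^{-1} (x_i - mu_i) over cases.

    `cases` is an iterable of (x_i, mu_i, Sigma_i) triples; here one case is
    one cluster, and its "variables" are the scores of the cluster members.
    """
    total = 0.0
    for x, mu, sigma in cases:
        x = np.asarray(x, dtype=float)
        mu = np.asarray(mu, dtype=float)
        sigma = np.asarray(sigma, dtype=float)
        sign, logdet = np.linalg.slogdet(sigma)
        if sign <= 0:
            raise ValueError("covariance matrix must have a positive determinant")
        resid = x - mu
        total += logdet + resid @ np.linalg.solve(sigma, resid)
    return float(total)


def cluster_covariance(n, between_var, within_var):
    """Compound-symmetry covariance for the n members of one cluster."""
    return between_var * np.ones((n, n)) + within_var * np.eye(n)


if __name__ == "__main__":
    # Two hypothetical clusters of different sizes, each with its own Sigma_i.
    cases = [
        ([0.2, -0.1, 0.4], np.zeros(3), cluster_covariance(3, 0.2, 0.8)),
        ([1.0, 0.7], np.zeros(2), cluster_covariance(2, 0.2, 0.8)),
    ]
    print(casewise_fit(cases))
```

Because each case carries its own mean vector and covariance matrix, clusters of unequal size pose no difficulty in this formulation.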
In this simulation study, the accuracy of the context effect estimate was examined under various conditions by using two estimators. The conditions varied were the within-cluster sample size, the number of clusters, the ICC for the predictor variable, the ICC for the criterion variable, the between coefficient, and the context effect. The two estimators were MLR and ML.

Data Generation
Simulated data were generated by using the multilevel latent model. The first step was to generate the data on the predictor. The predictor variable was decomposed into two uncorrelated components, a between-cluster component and a within-cluster component, and the variance of the predictor was decomposed correspondingly into between-cluster and within-cluster variance. The criterion variance was also set equal to one without loss of generality. The relationship between the cluster-level means for the criterion and predictor follows [12], and the remaining population values were obtained by using standard results in regression theory.
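One way such a data-generating scheme could be set up is sketched below. It is an illustration under stated assumptions, not the generation code used in the study: both total variances are fixed at one, so the between-cluster variances equal the ICCs; the intercept is zero; all components are normal; and the within coefficient is taken as the between coefficient minus the context effect. The residual variances are chosen so that the criterion has the requested ICC. The function name, the seed, and the parameter values are hypothetical.

```python
import math
import random


def generate_dataset(n, k, icc_x, icc_y, gamma_b, gamma_c, rng):
    """Return a list of (cluster, x, y) rows for k clusters of size n."""
    gamma_w = gamma_b - gamma_c  # context effect = between - within
    # Residual variances implied by unit total variances (assumption).
    var_delta = icc_y - gamma_b ** 2 * icc_x
    var_eps = (1.0 - icc_y) - gamma_w ** 2 * (1.0 - icc_x)
    if var_delta < 0 or var_eps < 0:
        raise ValueError("parameter combination implies a negative variance")

    rows = []
    for j in range(k):
        mu_xj = rng.gauss(0.0, math.sqrt(icc_x))        # latent cluster mean
        delta_j = rng.gauss(0.0, math.sqrt(var_delta))  # between-cluster residual
        for _ in range(n):
            dev = rng.gauss(0.0, math.sqrt(1.0 - icc_x))  # within-cluster deviation
            eps = rng.gauss(0.0, math.sqrt(var_eps))      # within-cluster residual
            x = mu_xj + dev
            y = gamma_w * dev + gamma_b * mu_xj + delta_j + eps
            rows.append((j, x, y))
    return rows


if __name__ == "__main__":
    rng = random.Random(2024)  # hypothetical seed
    data = generate_dataset(n=10, k=20, icc_x=0.10, icc_y=0.20,
                            gamma_b=0.5, gamma_c=0.3, rng=rng)
    print(len(data), data[0])
```

The explicit check on the implied residual variances matters because not every combination of ICCs and coefficients is admissible once the total variances are fixed.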

Conditions
The within-cluster sample size was set to n = 5, 10, 15, or 30. A group size of 5 is usual in small-group educational research and in longitudinal research. A group size of 15 or 30 is a typical class size in schools. The number of clusters was K = 20 or 40. These values were chosen because fewer than 50 clusters is not unusual in empirical multilevel research, whereas simulation studies often focus on larger numbers of clusters.

Data Analysis
For every condition, the generated data were analyzed in Mplus by using the multilevel latent model to estimate the context effect with each of two estimators: MLR (maximum likelihood estimation with robust standard errors) and ML (maximum likelihood estimation). MLR is the default estimator for multilevel models in Mplus and is increasingly the default in other available software. MLR is assumed to offer protection against unmodeled heterogeneity; however, Hox et al. [9] found that when the number of clusters is small (less than 50) and the data follow the normality assumption, MLR does not perform as well as ML. Thus both MLR and ML were used for data analysis and their results were compared.
To investigate the accuracy of interval estimation, the coverage of the 95% confidence interval (CI) was examined. Coverage, that is, whether or not the CI contained the population context effect value, was coded 0-1 for each replication. Estimated coverage probability was then calculated as the mean of this dichotomous variable over the 5000 replications in each condition. Power was also investigated, based on rejection of $H_0: \gamma_C = 0$ in each replication. Bias and the empirical sampling variance (ESV) of the context effect estimate were examined as well; in what follows, $\hat{\gamma}_{Ciq}$ is the context effect estimate for the $i$th replication in the $q$th condition and $\bar{\hat{\gamma}}_{Cq}$ is the mean context effect estimate for the $q$th condition. To investigate how the conditions in the study affected the empirical sampling variance, the recommendation by O'Brien [14] was followed and an ANOVA was conducted. The results of these analyses were used to calculate effect sizes. For coverage and power, the PROC GENMOD analysis was selected to take into account the dichotomous nature of the dependent variable. To measure the relative size of the effects, the proportion of effect variance (PEV) was used: the variance associated with each effect was expressed as a proportion of the sum over the 63 effects in the ANOVA, a sum that is subsequently referred to as the total effect variance.
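The per-condition summaries described above can be sketched in a few lines. The sketch assumes Wald-type intervals (estimate ± z × standard error), a two-sided test of a zero context effect based on the same interval, bias defined as the mean estimate minus the population value, and the ESV computed as the sample variance of the estimates across replications. The function names and the toy effect values are hypothetical, and the PEV helper simply divides each effect's share by the total over all effects.

```python
from statistics import NormalDist, mean, variance


def summarize_condition(estimates, std_errors, gamma_c, level=0.95):
    """Coverage, power, bias, and ESV for one simulation condition."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    pairs = list(zip(estimates, std_errors))
    # 1 if the interval contains the population value, else 0.
    covered = [1.0 if abs(est - gamma_c) <= z * se else 0.0 for est, se in pairs]
    # 1 if the interval excludes zero (H0: gamma_c = 0 rejected), else 0.
    rejected = [1.0 if abs(est) > z * se else 0.0 for est, se in pairs]
    return {
        "coverage": sum(covered) / len(covered),
        "power": sum(rejected) / len(rejected),
        "bias": mean(estimates) - gamma_c,
        "esv": variance(estimates),  # divisor R - 1 (assumption)
    }


def proportion_of_effect_variance(effect_variances):
    """Express each effect as a share of the total effect variance."""
    total = sum(effect_variances.values())
    return {name: value / total for name, value in effect_variances.items()}


if __name__ == "__main__":
    # Three toy replications with a population context effect of 0.3.
    print(summarize_condition([0.1, 0.3, 0.5], [0.1, 0.1, 0.1], gamma_c=0.3))
    print(proportion_of_effect_variance({"n": 5.0, "K": 3.0, "n*K": 2.0}))
```

Note that when the population context effect is nonzero the "power" entry is a true power estimate, whereas for a zero effect the same quantity estimates the Type I error rate.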

Coverage Probability
The percentiles of the coverage probability by the between coefficient and context effect using the MLR and ML estimators are presented in Table 1. With the MLR estimator, the coverage rates range from 0.895 to 0.946 across all conditions, which indicates that the estimated standard errors of the context effect are typically negatively biased. With the ML estimator, the coverage rates range from 0.933 to 0.983, and the median coverage rates are closer to 0.95. This shows that the ML estimator improves the accuracy of the standard error estimates and has more appropriate control of the Type I error rate. To further investigate which factors influence the coverage rates, a logistic regression analysis was first conducted with seven main factors: n, K, $ICC_X$, $ICC_Y$, $\gamma_B$, $\gamma_C$, and estimator. Results showed that the estimator factor accounted for the majority of the effect variance (62.2% of the total effect variance) and that the ML estimator was more accurate than MLR. Thus the remainder of this section focuses on the ML results only. Six factors along with their interactions were investigated for the ML coverage results by using logistic regression; the main factors were n, K, $ICC_X$, $ICC_Y$, $\gamma_B$, and $\gamma_C$. A number of effects were significant, which is to be expected given that each cell of the design was replicated 5000 times, so the proportion of effect variance was the focus. Within-cluster sample size (n) accounted for 50.6% of the total effect variance, followed by $ICC_Y$ with 10.6% and the n by K interaction with 9.7%. Table 2 presents mean coverage probability as a function of sample size for the ML estimator. The effect of n on coverage probability differed from what was expected. Inspection of the estimated standard errors indicated that there were exceptionally large estimated standard errors for some replications and that the prevalence of these large standard errors declined as n increased, especially when K = 20.
The most appropriate coverage occurred when the within-cluster sample size was 10 or 15. When the number of clusters increases from 20 to 40, the estimated standard errors tend to be more accurate at all levels of within-cluster sample size. For the factor $ICC_Y$, coverage probability decreases as $ICC_Y$ increases.
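When reading coverage rates like these, it helps to know how much deviation from 0.95 could arise from Monte Carlo error alone. The sketch below is not part of the original analysis; it assumes independent replications and uses the binomial standard error of a proportion, which for a nominal rate of 0.95 and 5000 replications is about 0.003. The function names are hypothetical.

```python
import math


def coverage_mc_standard_error(nominal: float, replications: int) -> float:
    """Binomial standard error of an estimated coverage rate."""
    return math.sqrt(nominal * (1.0 - nominal) / replications)


def within_mc_band(observed: float, nominal: float = 0.95,
                   replications: int = 5000, z: float = 1.96) -> bool:
    """True if the observed rate is within z standard errors of the nominal rate."""
    half_width = z * coverage_mc_standard_error(nominal, replications)
    return abs(observed - nominal) <= half_width


if __name__ == "__main__":
    print(coverage_mc_standard_error(0.95, 5000))
    # Extremes of the coverage ranges reported for MLR and ML.
    for rate in (0.895, 0.946, 0.933, 0.983):
        print(rate, within_mc_band(rate))
```

Under this rough criterion, a single-condition coverage rate that differs from 0.95 by more than about 0.006 is unlikely to reflect simulation noise alone.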

Power
Power for detecting the context effect is higher when $\gamma_C$ is larger. The ML estimator appropriately controls the Type I error rate, even though this comes at the cost of power. As in the analysis of coverage, six factors along with their interactions were investigated for the ML power results by using logistic regression. The factors $ICC_X$ and $\gamma_C$ play an important role individually and interactively, together accounting for 60.2% of the total effect variance. The within-cluster sample size (n) accounted for 9.6%. As shown in Table 3, power increases as $ICC_X$ and $\gamma_C$ increase. The effect of $ICC_X$ is much larger when $\gamma_C$ is larger. As expected, power increases when n increases: as n increases from 5 to 30, power increases from 0.076 to 0.212. Even though the number of clusters K does not account for more than 5% of the total effect variance (the PEV of K is 4.2%), the results showed that power increases from 0.115 to 0.181 as K increases from 20 to 40.

Bias
Results indicate that bias tends to be small in most conditions. Percentiles of bias by the between coefficient and context coefficient were examined for the two estimators. Across all conditions, bias with MLR ranged from -0.078 to 0.151. Similar results were found with the ML estimator, for which bias ranged from -0.004 to 0.175. Median bias was 0.024 or smaller when $\gamma_C$ was 0.3, and 0.010 or smaller when $\gamma_C$ was 0.1.
Since the choice of estimator does not affect the coefficient estimates themselves, a six-way ANOVA was conducted on the combined MLR and ML results, which contain 10000 replications in each condition. $ICC_X$, n, and $\gamma_C$, along with their interactions, account for 79.2% of the total effect variance, so each has a considerable influence on bias. Table 4 shows mean bias by these three factors. When n is 10 or larger, bias decreases as $ICC_X$ increases. Bias tends to increase as the context effect $\gamma_C$ increases. When $\gamma_C$ is .3 and $ICC_X$ is .10 or larger, bias decreases as n increases.

Empirical Sampling Variance
The percentiles of the ESV of $\hat{\gamma}_C$ by the between coefficient and context effect for the MLR and ML estimators were also checked. Overall, the ESVs range from 0.002 to 7.239 across estimators. As in the analysis of bias, a six-way ANOVA was conducted on the combined MLR and ML results for ESV. The factors $ICC_X$, n, and K accounted for 84.9% of the total effect variance for empirical sampling variance. These effects were large relative to the other effects. Mean ESVs get smaller as $ICC_X$ gets larger (Figure 1 and Figure 2). Figure 1 also shows that the mean ESV declines as n gets larger. Results in Figure 2 indicate that the mean ESV decreases when K increases, even though the number of clusters is small in both conditions (K = 20 and 40).

Discussion
One notable result in this study is that ML yields more accurate standard error estimates than MLR. The parameter estimates for ML and MLR are identical, so the estimates of the standard errors can be compared directly. MLR, as a robust standard error estimator, performs well only when the number of clusters is large. If the data violate the distributional assumption, the robust MLR method is found to be more accurate than ML, but it still requires a large sample size [9]. In this simulation study, all data are normally distributed and the sample size (especially the level-2 sample size) is relatively small. The results clearly showed that ML has more accurate standard error estimates than MLR. It is worthwhile to note that a Bayesian approach may have great potential for estimating the latent covariate model even with a small number of groups [15,16].
The results also indicate that the coverage probability improves as the number of clusters increases, even though both numbers of clusters studied are small. Unexpectedly, the coverage rates did not improve when the within-cluster sample size increased. The most appropriate coverage occurred when the within-cluster sample size was 10 or 15. The results in Ludtke et al. [4] demonstrated an inconsistent effect of cluster sample size on coverage probability. Maas and Hox [13] argued that more groups lead to better coverage but that having larger groups does not improve the estimation. However, Maas and Hox [10] found that coverage rates improved slightly when cluster sample size increased, although the effect of cluster sample size was smaller than that of the number of clusters. Taken together, the findings suggest that the number of groups has a positive effect on the coverage probability, whereas the effect of group size is inconsistent. In addition, coverage probability decreases as $ICC_Y$ increases, but $ICC_Y$ has a smaller effect on coverage than sample size.
Statistical power, in essence, is the probability of detecting an effect when it does exist. When the effect is smaller, the power to detect it is lower, as expected. The results show that power increases as the context effect increases. The simulation study of Scherbaum and Ferreter [11] found that, at a small effect size level, the estimates of statistical power varied from approximately .05 to .28; they considered ES = 0.20 a small effect and ES = 0.50 a medium effect. The context effects ($\gamma_C$) chosen in this study are relatively small. At $\gamma_C$ = 0.1 the power ranges from 0.05 to 0.106 as $ICC_X$ increases from .05 to .20; at $\gamma_C$ = 0.3 the power ranges from 0.07 to 0.427. A number of factors can influence statistical power in both single-level and multilevel designs. One of these factors is the Type I error rate. Power and the Type I error rate move in the same direction: when the Type I error rate is lowered, power decreases as well. This study showed that the MLR estimator consistently underestimated the standard errors under the studied conditions, which inflates the Type I error rate above its nominal level. The ML estimator, on the other hand, improved the accuracy of the standard error estimates. Thus the lower power in the study is partially due to the appropriate control of the Type I error rate by the ML estimator.
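The link between standard error accuracy, Type I error control, and power can be illustrated with a simple normal approximation. The sketch below assumes a normally distributed estimator and a two-sided Wald test, and represents an inaccurate standard error by a ratio of reported to true standard error. The function name and the values used (an effect of 0.3, a true standard error of 0.2, a ratio of 0.9) are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def rejection_rate(effect: float, true_se: float,
                   se_ratio: float = 1.0, alpha: float = 0.05) -> float:
    """Probability that a two-sided Wald test rejects a zero effect.

    The estimate is taken to be normal with mean `effect` and standard
    deviation `true_se`; the reported standard error is se_ratio * true_se.
    """
    if true_se <= 0 or se_ratio <= 0:
        raise ValueError("standard errors must be positive")
    crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0) * se_ratio
    shift = effect / true_se
    upper = 1.0 - _STD_NORMAL.cdf(crit - shift)  # estimate far above zero
    lower = _STD_NORMAL.cdf(-crit - shift)       # estimate far below zero
    return upper + lower


if __name__ == "__main__":
    for ratio in (1.0, 0.9):
        type1 = rejection_rate(0.0, 0.2, se_ratio=ratio)
        power = rejection_rate(0.3, 0.2, se_ratio=ratio)
        print(f"SE ratio {ratio:.1f}: Type I rate {type1:.3f}, power {power:.3f}")
```

In this approximation an underestimated standard error raises the rejection rate both when the effect is zero and when it is not, so part of the apparent power obtained with underestimated standard errors is an artifact of an inflated Type I error rate.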
Statistical power for multilevel models is more complicated than for single-level designs, since some additional factors need to be taken into account. The results show that power increases as $\gamma_C$, $ICC_X$, n, and K increase. These results are expected and support the validity of the simulation method. Further, $\gamma_B$ and $ICC_Y$ did not play an important role in power. Scherbaum and Ferreter [11] argued that the intraclass correlation, the total sample size and the sample size at each level, and the inclusion of covariates all affect the computation of power. They also found that increasing the number of clusters improves power more than increasing the cluster sample size. However, the effects of K and n on power in this study do not reflect this argument. Power increased much more rapidly when n increased from 5 to 10 than when n increased beyond 10. Thus it is suggested that, in terms of power, the cluster sample size should be at least 10 in practical research.
Bias and variance are the two criteria used to assess the accuracy of parameter estimation. In regard to bias, the results show that even when the number of clusters was quite small, that is, between 20 and 40, bias of the context effect estimator was quite small and relatively unaffected by $\gamma_B$, K, and $ICC_Y$. The factors with the largest effects on bias were $ICC_X$ and n. Bias decreased as $ICC_X$ and n increased. The direction of each of the effects of n and $ICC_X$ is consistent with results in Ludtke et al. [4]. The study also indicated that bias increases as the size of the context effect ($\gamma_C$) increases. The effect of $ICC_X$ and $\gamma_C$ on the bias of the context effect estimator was unstable when n was 5 but not when n was 10 or larger. Consequently, we suggest that the within-cluster sample size should be at least 10. Furthermore, $ICC_Y$, $\gamma_B$, and K do not play an important role in the bias of the context effect estimator under the conditions in the present study.

Conclusion
The context effect is often an important feature of multilevel data analysis. Past research has shown that, when the within-cluster sample is regarded as a sample from a larger population, the multilevel latent model, rather than the traditional multilevel model, should be used to estimate the context effect. The results suggest that the ML estimator should be used rather than MLR when the sample size is small, especially the higher-level sample size. ML yields more accurate estimates of the standard error, so it appropriately controls the Type I error rate. The results also suggest that bias of context effect estimation tends to be small even when the number of clusters is small. Very small within-cluster sample sizes (less than 10) should be avoided in terms of power and empirical sampling variance. If the ICC for the predictor is small (.10 or less), bias is more of a problem. The fact that bias is small even when the number of clusters is small should not be taken as an argument for routinely using a small number of clusters. The number of clusters has a relatively strong effect on the sampling variance of the context effect estimator. Therefore, even an increase in the number of clusters from 20 to 40 is desirable in practical educational research.