Approximate Confidence Intervals for Standardized Effect Sizes in the Two-Independent and Two-Dependent Samples Design
University of Maastricht
Standardized effect sizes and confidence intervals thereof are extremely useful devices for comparing results across different studies using scales with incommensurable units. However, exact confidence intervals for standardized effect sizes can usually be obtained only via iterative estimation procedures. The present article summarizes several closed-form approximations to the exact confidence interval bounds in the two-independent and two-dependent samples design. Monte Carlo simulations were conducted to determine the accuracy of the various approximations under a wide variety of conditions. All methods except one provided accurate results for moderately large sample sizes and converged to the exact confidence interval bounds as sample size increased.
Key Words: effect size; standardized mean difference; confidence intervals; two-independent samples design; two-dependent samples design
There is a growing consensus that all null hypothesis significance tests should be supplemented with effect size estimates and confidence intervals (e.g., American Psychological Association, 2001; Cohen, 1994; Cumming & Finch, 2001; Hyde, 2001; Kirk, 1996; Schmidt, 1996; Thompson, 2002; Wilkinson & APA Task Force on Statistical Inference, 1999). Procedures for obtaining confidence intervals (CIs) in the raw (unstandardized) units for the two-independent and two-dependent samples design are covered in most introductory textbooks on statistics, commonly known to researchers, and implemented in statistical software packages. However, the units of the measurement scales used by researchers are often chosen arbitrarily. Reporting effect sizes and corresponding CIs in standardized units allows comparisons between measurements on scales that use incommensurable units. For example, Marcus, Marquis, and Sakai (1997) conducted a study to investigate the effectiveness of eye movement desensitization and reprocessing (EMDR), a controversial treatment for a variety of psychological disorders including posttraumatic stress disorder (PTSD). Patients in the EMDR group scored on average 19.76 points below patients in a standard care (SC) group following treatment, as measured by the modified PTSD Symptom Scale. Another study investigating EMDR treatment for PTSD was conducted by Carlson, Chemtob, Rusnak, Hedlund, and Muraoka (1998). Here, the mean difference between the EMDR and a control condition amounted to 3.2 points on a self-report measure devised by the authors of this study. Those two outcomes are not directly comparable because the scales are based on a different number of items and scoring criteria. A common solution is to standardize the mean difference by the pooled standard deviation of the two groups. Doing so yields standardized mean differences of 0.76 and 1.33 points in the first and second study, respectively. 
In other words, the EMDR group scored 0.76 standard deviations below the SC group in the study by Marcus et al. (1997), whereas Carlson et al. (1998) found a difference of 1.33 standard deviations between the two groups. The differences in treatment efficacy between the two studies are now more apparent. Moreover, CIs can be used to indicate the precision of these effect size estimates. The corresponding 95% CIs are given by (0.26, 1.25) and (0.39, 2.25), respectively. Obtaining exact CIs in standardized units usually requires the use of noncentral distributions and iterative estimation procedures (Cumming & Finch, 2001; Hedges & Olkin, 1985; Smithson, 2003a; Steiger & Fouladi, 1997). At the time of this writing, the methods required to find exact CIs are not covered in most textbooks, are not commonly known to researchers, and have not been implemented in most statistical analysis software. In an effort to address this problem, six articles published in the August 2001 issue of Educational and Psychological Measurement (Thompson, 2001) provided researchers with the necessary information to calculate effect sizes and CIs for a wide variety of experimental designs. In addition, a monograph dealing with CIs based on central and non-central distributions was published recently (Smithson, 2003a). The article by Steiger and Fouladi (1997) also provides an excellent introduction to this topic. Finally, specialized software and scripts to be used in conjunction with standard statistical software packages are available (Cumming, 2003; Smithson, 2003b). Researchers can either familiarize themselves with the specialized tools or rely on various approximate methods based on central distributions and closed-form expressions to calculate CIs for standardized effect sizes. Numerous such approximations have been suggested in the literature. 
The purpose of the present article is to examine the accuracy of such approximations in the context of the two-independent and the two-dependent samples design.
In the two-independent samples design, participants are randomly assigned to an experimental (E) or a control (C) group. Assume that the scores within each group are sampled from normal distributions with expectations µE and µC and common variance σ². The null hypothesis H0: µE – µC = 0 (i.e., the absence of a difference in the population means) can be tested with the familiar two-independent samples t test.
This null hypothesis significance test provides us with a simple dichotomous decision rule, namely, whether to reject H0 or not, but it does not inform us about the direction, magnitude, or precision of the measured effect. Clearly,
where tm,1–
In the ideal case where one is investigating a particular outcome variable whose raw units can be compared across related experiments, the unstandardized mean difference µE – µC represents a reasonable choice for the population effect size. However, as discussed earlier, the measurement units are often chosen arbitrarily. Therefore, working with standardized units can be more informative as this allows comparisons of parameter estimates across scales using different units. The population effect size is then defined as
which reflects the difference in the population means in standard deviation units. An estimate of
However, d2 is a positively biased estimator of
where
Based on Hedges (1981, 1982, 1983), we note the following set of results. The exact variances of d2 and g2 are given by Equations 21 and 22 in Table 1. However,
Researchers often choose to measure the same set of n participants on two different occasions, such as before and after receiving some treatment. Because the same group of participants is measured twice, the two sets of scores are no longer independent. Assume that the scores X1 and X2 obtained at Time 1 and Time 2 are sampled from normal distributions with expected values µ1 and µ2 and common variance σ². Now define the random variable D = X2 – X1. It follows that D is normally distributed with expected value µD = µ2 – µ1 and variance σD² = 2σ²(1 – ρ), where ρ is the correlation between the scores at Time 1 and Time 2.
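The variance of the difference scores follows from the usual rule for the variance of a difference of two correlated variables. A short derivation, assuming only that both occasions share the same variance and that ρ denotes their correlation:

```latex
\begin{aligned}
\operatorname{Var}(D) &= \operatorname{Var}(X_2 - X_1) \\
  &= \operatorname{Var}(X_2) + \operatorname{Var}(X_1) - 2\operatorname{Cov}(X_1, X_2) \\
  &= \sigma^2 + \sigma^2 - 2\rho\sigma^2 \\
  &= 2\sigma^2(1 - \rho).
\end{aligned}
```

The higher the correlation between the two occasions, the smaller the variance of the D scores relative to the variance of the raw scores.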
The null hypothesis H0: µ2 – µ1 = 0 (i.e., H0: µD = 0) can be tested by carrying out a one-sample t test on the D scores. The value of µD = µ2 – µ1 is easily estimated with
where sD2 is the observed variance in the D scores. Again, we would like to obtain a standardized point estimate. Two different standardized parameters have been suggested in the literature (Becker, 1988; Gibbons, Hedeker, & Davis, 1993; Morris, 2000; Morris & DeShon, 2002), namely,
based on the standard deviation in the D scores, and
which is of the same form as
If
and
respectively, where m = n – 1. The exact variances and variance estimates for these effect sizes are given in Table 1 with ñ = n and N = n (Becker, 1988; Gibbons et al., 1993; Morris, 2000; Morris & DeShon, 2002).
Working with
and
have been suggested in the literature and are biased and unbiased estimators of
Note that one must use an unbiased estimate of ρ in Equations 36 and 37 to obtain unbiased estimates of the sampling variances of dD2 and gD2. An unbiased estimate of ρ, derived by Olkin and Pratt (1958), is given by
where
denotes the hypergeometric function. Olkin and Pratt also suggested
as a simple yet accurate approximation to the unbiased estimator of
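The two correlation estimators mentioned here can be sketched in a few lines. The formulas below are the forms in which the Olkin and Pratt (1958) estimator and its approximation are commonly stated, with n the number of paired observations; they are supplied as an assumption, since the equations themselves are not reproduced in this text. The function names are illustrative, and the series evaluation assumes n ≥ 5 and r ≠ 0.

```python
import math

def hyp2f1_series(a, b, c, z, tol=1e-12, max_terms=100000):
    """Gauss hypergeometric function 2F1(a, b; c; z) by direct series summation.

    Adequate for 0 <= z < 1 (and z = 1 when c > a + b); not a general-purpose routine.
    """
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * z
        total += term
        if abs(term) < tol:
            break
    return total

def rho_unbiased(r, n):
    """Assumed form of the unbiased estimator: r * 2F1(1/2, 1/2; (n - 2)/2; 1 - r^2)."""
    return r * hyp2f1_series(0.5, 0.5, (n - 2) / 2.0, 1.0 - r * r)

def rho_approx(r, n):
    """Assumed form of the closed-form approximation: r + r(1 - r^2) / (2(n - 4))."""
    return r + r * (1.0 - r * r) / (2.0 * (n - 4))

if __name__ == "__main__":
    r, n = 0.7, 10  # illustrative values, not taken from the article
    print(round(rho_unbiased(r, n), 4), round(rho_approx(r, n), 4))
```

Both functions adjust the sample correlation away from zero, which reflects the fact that r slightly underestimates the magnitude of ρ in small samples.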
In the following sections, δ refers to one of the three different effect size parameters discussed previously, namely, δ2, δD, or δD2. We will now consider methods for finding a CI for δ. Finding the exact CI bounds for δ is problematic because the shape of the distribution of d and g depends directly on the parameter for which the interval is being constructed. Therefore, the CI cannot be given as a closed-form expression.
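To make the iterative character of the exact interval concrete, the following is a minimal sketch for the two-independent samples case. It assumes the standard pivoting approach based on the noncentral t distribution: the observed t statistic is held fixed and the noncentrality parameter is searched until the distribution function equals 1 – α/2 (lower bound) or α/2 (upper bound). The function names, the numerical integration scheme, and the bisection bracket are all illustrative choices, not the procedure used in the article.

```python
import math

def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def nct_cdf(t, m, ncp, steps=2000):
    """P(T <= t) for a noncentral t variable with m > 1 df and noncentrality ncp.

    Uses P(T <= t) = E[Phi(t * S - ncp)], where S = sqrt(chi2_m / m), and
    integrates over the density of S with Simpson's rule.
    """
    log_const = math.log(2.0) + 0.5 * m * math.log(0.5 * m) - math.lgamma(0.5 * m)
    s_max = 1.0 + 12.0 / math.sqrt(2.0 * m)  # S concentrates around 1
    h = s_max / steps
    total = 0.0  # the integrand is 0 at s = 0 when m > 1
    for i in range(1, steps + 1):
        s = i * h
        dens = math.exp(log_const + (m - 1) * math.log(s) - 0.5 * m * s * s)
        weight = 1 if i == steps else (4 if i % 2 else 2)
        total += weight * dens * norm_cdf(t * s - ncp)
    return total * h / 3.0

def solve_ncp(t, m, prob, lo=-50.0, hi=50.0, iterations=50):
    """Bisection for the ncp at which nct_cdf(t, m, ncp) equals prob."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if nct_cdf(t, m, mid) > prob:  # the cdf decreases as ncp increases
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def exact_ci_delta(d, n_e, n_c, level=0.95):
    """Exact CI for the standardized mean difference from the biased estimate d."""
    scale = math.sqrt(1.0 / n_e + 1.0 / n_c)
    m = n_e + n_c - 2
    t_obs = d / scale
    alpha = 1.0 - level
    lower = solve_ncp(t_obs, m, 1.0 - alpha / 2.0) * scale
    upper = solve_ncp(t_obs, m, alpha / 2.0) * scale
    return lower, upper

if __name__ == "__main__":
    # Illustrative inputs: d = 1.58 with 10 participants per group.
    print(exact_ci_delta(1.58, 10, 10))
```

The search has to be repeated for every data set and every confidence level, which is what makes closed-form approximations attractive in practice.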
Iterative methods to find the exact CI for the two-independent and two-dependent samples design with parameter
We will now consider several approximations to the exact CI bounds. Let q be equal to the 100 x (1 –
Another approach to obtain approximate CI bounds for δ is to first use a variance stabilizing transformation. The variance stabilizing transformation for the standardized mean difference, as suggested by Hedges and Olkin (1985), is here generalized to include the two-dependent samples design. Based on the delta method, the random variable
can be shown to be asymptotically normal with expectation h(
and then transforms these bounds back to the original metric with
the inverse function of h(g). This method (denoted by the letter H) could also be applied using the biased estimate d in place of g.
Finally, Fidler and Thompson (2001) suggested that a CI for
In total, this yields 21 different approximations (see again Table 3). Specifically, methods B, U, L1, L2, and H can either be based on d or g and can either employ critical values from the normal or the t distribution, whereas method F is by default always based on d and the t distribution. The letter z or t will be appended to the name of the approximation to indicate which distribution was used. For example, method B based on g and the normal distribution will be denoted by gBz.
Three examples will illustrate the various approximations discussed in the previous section. First, consider the two-independent samples case. Assume that (25, 19, 21, 14, 16, 23, 24, 24, 22, 22) and (28, 26, 27, 19, 23, 29, 25, 31, 32, 30) represent test scores from students randomly assigned to a control and an experimental group, respectively, in an experiment investigating the impact of a new teaching technique on students' performance. The mean difference in test scores between the two groups is 6.0 points with a 95% CI given by (2.44, 9.56). To obtain the standardized estimate g2, we use Equation 4 with m = 18 and c(18) = .96. In standardized units, it turns out that the students in the experimental group scored g2 = 1.52 standard deviations above the control group. Using iterative estimation procedures, we find that an exact 95% CI for δ2 is given by (.55, 2.58).
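The descriptive quantities in this first example can be reproduced from the listed scores. The sketch below assumes that c(m) is the usual gamma-function bias correction of Hedges (1981), which gives values consistent with the c(18) = .96 quoted above; the t critical value for 18 degrees of freedom is hardcoded rather than computed, and all variable names are illustrative.

```python
import math
from statistics import mean, variance

control = [25, 19, 21, 14, 16, 23, 24, 24, 22, 22]
experimental = [28, 26, 27, 19, 23, 29, 25, 31, 32, 30]

def hedges_c(m):
    """Bias correction factor, assumed to be Gamma(m/2) / (sqrt(m/2) * Gamma((m-1)/2))."""
    return math.exp(math.lgamma(m / 2.0) - math.lgamma((m - 1) / 2.0)) / math.sqrt(m / 2.0)

n_e, n_c = len(experimental), len(control)
m = n_e + n_c - 2

# Raw mean difference and pooled standard deviation
diff = mean(experimental) - mean(control)
s_pooled = math.sqrt(((n_e - 1) * variance(experimental)
                      + (n_c - 1) * variance(control)) / m)

# 95% CI in raw units; 2.100922 is the .975 quantile of t with 18 df (hardcoded)
T_CRIT_18 = 2.100922
half_width = T_CRIT_18 * s_pooled * math.sqrt(1.0 / n_e + 1.0 / n_c)
raw_ci = (diff - half_width, diff + half_width)

# Biased and bias-corrected standardized mean differences
d2 = diff / s_pooled
g2 = hedges_c(m) * d2

if __name__ == "__main__":
    print(f"mean difference = {diff:.2f}, raw CI = ({raw_ci[0]:.2f}, {raw_ci[1]:.2f})")
    print(f"s_pooled = {s_pooled:.2f}, d2 = {d2:.2f}, g2 = {g2:.2f}")
```

Running the script yields values that round to the figures reported in the text for the mean difference, its raw-unit CI, and g2.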
We now compare the exact CI with the approximate CI bounds obtained by the methods discussed earlier (for conciseness, the examples in this section will be based on the normal distribution and g only). For method gBz, we first calculate
Next, consider the dependent samples case where interest is focused on
Using Equation 10 with m = 9 and c(9) = 0.91, we find gD = 2.16. An exact CI for
As a final example, consider the case where an estimate and CI for
Method gBz in this case is based on Equation 35, which yields an approximate CI of (.60, 2.43). Continuing through the list of variance estimates in Table 2 and approximation methods in Table 3, we obtain the bounds (.64, 2.38) with method gUz, (.70, 2.33) with method gL1z, and (.73, 2.30) with method gL2z. Method gHz, based on the variance stabilizing transformation, yields the approximate bounds (.86, 2.47). And finally, an approximate CI is given by (1.11, 2.04) when using method F.
The examples illustrate that the methods can differ widely in terms of how well they approximate the exact CI bounds. Hedges (1982) studied method gL2z in the two-independent samples design (using sample sizes in the range 10
Three sets of simulations were conducted. The first set of simulations corresponds to the two-independent samples case. Values of δ2 between –2 and 2 in steps of .25, seven different CI widths (1 – α = .50, .60, .70, .80, .90, .95, and .99), and various sample size configurations were used: (a) equal sample sizes of nE = nC = 4, 8, 16, 32, and 64 participants per group; (b) unequal sample sizes with (nE, nC) = (2, 6), (4, 12), (8, 24), (16, 48), and (32, 96), each corresponding to a 25/75% split of participants; and (c) unequal sample sizes with (nE, nC) = (2, 14), (4, 28), (8, 56), and (16, 112), each corresponding to a 12.5/87.5% split of participants. Therefore, there were a total of 17 δ2 values × 14 sample size configurations × 7 CI widths = 1,666 conditions. On each of the 100,000 iterations for a particular condition, a d2 value was directly simulated from d2 = Z/√(X/m), where Z is a random normal variable with distribution N(δ2, 1/nE + 1/nC) and X is a random chi-square variable with m = nE + nC – 2 degrees of freedom. The exact (1 – α) × 100% CI bounds for δ2 were then determined using iterative methods. Next, approximate CI bounds were obtained with each of the 21 methods discussed earlier.
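The simulation design for a single condition can be sketched as follows. The draw of d2 from a normal and a chi-square variable follows the description above. The interval checked here is a generic large-sample normal approximation using the variance (nE + nC)/(nE nC) + d²/(2(nE + nC)), chosen only for illustration; it is not claimed to be identical to any of the article's named methods, and the function names, seed, and number of replications are assumptions.

```python
import math
import random

def simulate_d2(delta, n_e, n_c, rng):
    """Draw one d2 value as Z / sqrt(X / m), without generating raw scores."""
    m = n_e + n_c - 2
    z = rng.gauss(delta, math.sqrt(1.0 / n_e + 1.0 / n_c))
    x = rng.gammavariate(m / 2.0, 2.0)  # chi-square with m df is Gamma(m/2, scale 2)
    return z / math.sqrt(x / m)

def empirical_coverage(delta, n_e, n_c, reps=20000, z_crit=1.959964, seed=1):
    """Proportion of approximate 95% CIs that capture the true delta."""
    rng = random.Random(seed)
    n_total = n_e + n_c
    hits = 0
    for _ in range(reps):
        d = simulate_d2(delta, n_e, n_c, rng)
        # Illustrative large-sample variance of d (an assumption, see lead-in)
        var_d = n_total / (n_e * n_c) + d * d / (2.0 * n_total)
        half_width = z_crit * math.sqrt(var_d)
        if d - half_width <= delta <= d + half_width:
            hits += 1
    return hits / reps

if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, empirical_coverage(0.5, n, n))
```

Simulating d2 directly from its sampling distribution avoids generating and summarizing raw scores on every iteration, which keeps a design with many conditions and 100,000 iterations per condition computationally manageable.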
The second set of simulations corresponds to the two-dependent samples case with parameter
The third set of simulations corresponds to the two-dependent samples case with parameter
The accuracy of the various approximations was assessed with two measures: the empirical coverage probability and the ratio of the length of the approximate CI compared to the exact interval. Specifically, the empirical coverage probability was estimated with
where
The empirical coverage probability indicates whether the approximation captures the parameter as often as it should. On the other hand, the ratio of the length of the approximate CI to the true CI indicates to what degree the width of the true interval was over- or underestimated. The average width ratio for a particular method was estimated with
where CLi and CUi are the lower and upper bounds of the exact CI.
The error in the empirical coverage probability of an approximation is given by
Two-Independent Samples Design
The accuracy of methods B, U, L1, L2, and H depended only on the total sample size (N = nE + nC) and not on the proportion of scores falling into each group. Also, all these methods converged to the nominal coverage probabilities and interval widths as N increased, regardless of whether the method was based on d2 or g2 and on critical values from the normal or the t distribution. In fact, the maximum value of (empirical coverage probability – (1 – α)) × 100% over all conditions and methods (excluding method F) was never larger than 6.5% for N = 16, 3.0% for N = 32, 1.5% for N = 64, and 1.0% for N = 128. In terms of width ratios, convergence was somewhat slower. The maximum value of (average width ratio – 1) × 100% was 28.6%, 11.9%, 5.5%, and 2.6% for a total sample size of 16, 32, 64, and 128 participants, respectively.
The convergence of the empirical coverage probabilities of methods B, U, L1, L2, and H to the nominal (1 –
However, for small sample sizes, some approximations were noticeably more accurate than others. Table 4 indicates the maximum value of (
Table 4 also indicates that methods gUz, gL1z, dL1z, dL2z, gHz, and dHz provided the most accurate results when considering empirical coverage probabilities and width ratios simultaneously. To determine whether one of these methods is the most accurate, one should examine individual values of the empirical coverage probability and the width ratio instead of focusing on the maximum errors as done in Table 4. Accordingly, the empirical coverage probabilities and width ratios of these methods are plotted by δ2 values for the 95% CI condition in Figure 1 when N = 8. In addition, the results for method F are shown to illustrate its distinctive performance. Clearly, method dL1z provided the most accurate approximation to the exact CI bounds for δ2. Its width ratios were slightly above 1 but with no substantial impact on coverage probabilities.
Method F showed a very different pattern of results when compared to the other methods. At δ2 = 0, the empirical coverage probabilities were equal to the nominal (1 – α) values for all sample sizes (even for N = 8) and all values of (1 – α), although its width ratios were quite inaccurate. For larger values of |δ2|, the empirical coverage probability fell below the nominal (1 – α) value, with no improvement in accuracy for larger sample sizes (see last rows in Table 4). This finding is not surprising as method F essentially assumes that δ2 is equal to zero. Therefore, no matter how large N becomes, method F will not provide accurate results when δ2 ≠ 0. Finally, method F actually showed an increase in accuracy in the unequal sample size conditions, with more extreme splits yielding more accurate values (again, see last rows in Table 4).
Two-Dependent Samples Design With Parameter δD
Figure 2 shows the empirical coverage probabilities and width ratios for methods gUz, gL1z, dL1z, dL2z, gHz, dHz, and F for the 95% CI condition when n = 8, which again indicates that method dL1z performed favorably. Because of the wider range of δD values considered in this set of simulations, method F performed even worse when compared to the two-independent samples case and again showed no sign of improvement with larger sample sizes.
Two-Dependent Samples Design With Parameter δD2
Once again, all of the approximations except method F converged to the exact CI bounds in terms of empirical coverage probabilities and widths. The largest error in the coverage probabilities amounted to 8.3% for n = 16, 3.9% for n = 32, 2.7% for n = 64, and 1.7% for n = 128. For the width ratios, the largest errors were 27.1%, 16.4%, 8.5%, and 4.4%, respectively. Table 6 provides the maximum value of (empirical coverage probability – (1 – α)) × 100% and (average width ratio – 1) × 100% for each method over all values of δD2 and ρ in the two-dependent samples case when n = 8. The approximations were slightly less accurate than in the previous two sets of simulations, which can be attributed to the additional error introduced by having to estimate ρ. Methods gL1z, gL2z, dL2z, and gHz provided the most accurate coverage probabilities and width ratios for small sample sizes.
Empirical coverage probabilities and width ratios for these four methods are plotted in Figure 3 across individual values of δD2 for the 95% CI condition when ρ = .7 and n = 8. Method dL1z, which provided the most accurate results in the previous two sets of simulations, is also plotted for comparison purposes, whereas method gHz was omitted, as it made the graphs difficult to read (it did not provide more accurate results than the two methods discussed in the following). Here, none of the methods could be considered generally superior in all aspects. In fact, all of these approximations tended to capture the parameter less often than the nominal level requires, despite the fact that intervals provided by methods dL1z and dL2z were usually too wide on average. Overall, methods gL1z and dL2z appear to be most accurate in terms of interval widths while still providing quite accurate coverage probabilities.
Finding the exact CI for δ requires iterative estimation procedures. However, the present article shows that various approximate methods can be used without concern as long as sample sizes are at least moderately large. On the other hand, when sample sizes are small, researchers can consult Tables 4, 5, and 6 when choosing an approximation. These tables provide the maximum error over conditions that are not under the control of the experimenter, namely, the true value of δ and the true value of ρ in the two-dependent samples case. In other words, because these parameters are unknown in practice, one cannot pick a method that would be optimal for particular δ and ρ values. A reasonable alternative is a mini-max approach, choosing a method that minimizes the maximum possible error.
Ideally, one method would be most accurate for all the designs considered. Method dL1z was extremely accurate in the two-independent and two-dependent samples case with parameter
A final word of caution is warranted when discussing standardized effect sizes. The standardized effect size
Cohen (1994) emphasized that researchers must begin to respect the units they work with (p. 1001). In the ideal case where the raw units of measurement have a natural interpretation and are consistent across multiple studies, it is not necessary to standardize the effect size. Using raw units eliminates the dependency of the effect size on the population variance. CIs for effect sizes in raw units are easily obtained and are exact. However, in the social sciences, the multitude of scales and measurement instruments will necessitate the use of standardized effect sizes in the future.
WOLFGANG VIECHTBAUER is Assistant Professor, Department of Methodology and Statistics, at the University of Maastricht; P.O. Box, 6200 MD Maastricht, The Netherlands; e-mail:wolfgang.viechtbauer{at}stat.unimaas.nl. His research interests include mixed-effects models, meta-analysis, multilevel modeling, longitudinal analysis, and effect size measures. I would like to thank David Thissen and three anonymous reviewers for their valuable comments on an earlier draft of this article. Manuscript received June 28, 2004. Accepted for publication May 4, 2005.
American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.
Becker, B. J. (1988). Synthesizing standardized mean-change measures. British Journal of Mathematical and Statistical Psychology, 41, 257-278.
Bird, K. D. (2002). Confidence intervals for effect sizes in analysis of variance. Educational and Psychological Measurement, 62, 197-226.
Journal of Educational and Behavioral Statistics, Vol. 32, No. 1, 39-60 (2007)
E –
) x100% CI for µE – µC, given by 

2 is given by 


d22 (B) and
E[g22]
), where
= 









. However, in practice,
and
are distributed central t with m degrees of freedom, which suggests use of the t distribution for small 
in the two-independent samples case,
in the two-dependent samples case with parameter
in the two-dependent samples case with parameter 

. This in turn reveals that the Fidler and Thompson approach is essentially the same as using the biased estimate d2 and constructing a CI based on the t distribution and the large sample variance
) (Equation 27) where
would provide an approximate CI for
provides approximate CI bounds (.48, 2.55). For method gUz, we find that
then provides the approximate bounds (.50, 2.54). Continuing with methods gL1z and gL2z, we find the CI bounds to be equal to (.51, 2.52) and (.52, 2.51), respectively. For the variance stabilizing transformation (method gHz), we use Equation 17 with
and obtain zg = .73. We then calculate the CI in the transformed units with Equation 17, yielding
. Transforming these bounds back into the original units using Equation 18 results in (.58, 2.60). Finally, for method F, we divide the bounds of the CI based on the raw units by sp = 3.79, yielding (.65, 2.52).
nE = nC
, where Z is a random normal variable with distribution N(
Li and
(
values is .002 for R = 100,000 and .005 for R = 10,000.
– 1. Because of the large number of approximations considered in the present article, the methods were first examined in terms of their maximum error in coverage probability and width ratio. These values represent the worst-case results and allow us to rule out methods that can yield grossly inaccurate approximations (the full set of results can be obtained by contacting the author). 






