Journal of Educational and Behavioral Statistics
Articles

Approximate Confidence Intervals for Standardized Effect Sizes in the Two-Independent and Two-Dependent Samples Design

Wolfgang Viechtbauer

University of Maastricht


    Abstract
 Top
 Abstract
 Introduction
 The Two-Independent Samples...
 The Two-Dependent Samples Design
 Confidence Intervals for δ
 Examples
 Method
 Results
 Discussion
 References
 
Standardized effect sizes and confidence intervals thereof are extremely useful devices for comparing results across different studies using scales with incommensurable units. However, exact confidence intervals for standardized effect sizes can usually be obtained only via iterative estimation procedures. The present article summarizes several closed-form approximations to the exact confidence interval bounds in the two-independent and two-dependent samples design. Monte Carlo simulations were conducted to determine the accuracy of the various approximations under a wide variety of conditions. All methods except one provided accurate results for moderately large sample sizes and converged to the exact confidence interval bounds as sample size increased.

Key Words: effect size • standardized mean difference • confidence intervals • two-independent samples design • two-dependent samples design


    Introduction
 
There is a growing consensus that all null hypothesis significance tests should be supplemented with effect size estimates and confidence intervals (e.g., American Psychological Association, 2001; Cohen, 1994; Cumming & Finch, 2001; Hyde, 2001; Kirk, 1996; Schmidt, 1996; Thompson, 2002; Wilkinson & APA Task Force on Statistical Inference, 1999). Procedures for obtaining confidence intervals (CIs) in the raw (unstandardized) units for the two-independent and two-dependent samples design are covered in most introductory textbooks on statistics, commonly known to researchers, and implemented in statistical software packages. However, the units of the measurement scales used by researchers are often chosen arbitrarily. Reporting effect sizes and corresponding CIs in standardized units allows comparisons between measurements on scales that use incommensurable units.

For example, Marcus, Marquis, and Sakai (1997) conducted a study to investigate the effectiveness of eye movement desensitization and reprocessing (EMDR), a controversial treatment for a variety of psychological disorders including posttraumatic stress disorder (PTSD). Patients in the EMDR group scored on average 19.76 points below patients in a standard care (SC) group following treatment, as measured by the modified PTSD Symptom Scale. Another study investigating EMDR treatment for PTSD was conducted by Carlson, Chemtob, Rusnak, Hedlund, and Muraoka (1998). Here, the mean difference between the EMDR and a control condition amounted to 3.2 points on a self-report measure devised by the authors of this study.

Those two outcomes are not directly comparable because the scales are based on a different number of items and scoring criteria. A common solution is to standardize the mean difference by the pooled standard deviation of the two groups. Doing so yields standardized mean differences of 0.76 and 1.33 points in the first and second study, respectively. In other words, the EMDR group scored 0.76 standard deviations below the SC group in the study by Marcus et al. (1997), whereas Carlson et al. (1998) found a difference of 1.33 standard deviations between the two groups. The differences in treatment efficacy between the two studies are now more apparent. Moreover, CIs can be used to indicate the precision of these effect size estimates. The corresponding 95% CIs are given by (0.26, 1.25) and (0.39, 2.25), respectively.

Obtaining exact CIs in standardized units usually requires the use of noncentral distributions and iterative estimation procedures (Cumming & Finch, 2001; Hedges & Olkin, 1985; Smithson, 2003a; Steiger & Fouladi, 1997). At the time of this writing, the methods required to find exact CIs are not covered in most textbooks, are not commonly known to researchers, and have not been implemented in most statistical analysis software. In an effort to address this problem, six articles published in the August 2001 issue of Educational and Psychological Measurement (Thompson, 2001) provided researchers with the necessary information to calculate effect sizes and CIs for a wide variety of experimental designs. In addition, a monograph dealing with CIs based on central and non-central distributions was published recently (Smithson, 2003a). The article by Steiger and Fouladi (1997) also provides an excellent introduction to this topic. Finally, specialized software and scripts to be used in conjunction with standard statistical software packages are available (Cumming, 2003; Smithson, 2003b).

Researchers can either familiarize themselves with the specialized tools or rely on various approximate methods based on central distributions and closed-form expressions to calculate CIs for standardized effect sizes. Numerous such approximations have been suggested in the literature. The purpose of the present article is to examine the accuracy of such approximations in the context of the two-independent and the two-dependent samples design.


    The Two-Independent Samples Design
 
In the two-independent samples design, participants are randomly assigned to an experimental (E) or a control (C) group. Assume that the scores within each group are sampled from normal distributions with expectations $\mu_E$ and $\mu_C$ and common variance $\sigma^2$. The null hypothesis $H_0\colon \mu_E - \mu_C = 0$ (i.e., the absence of a difference in the population means) can be tested with the familiar two-independent samples t test.

This null hypothesis significance test provides us with a simple dichotomous decision rule, namely, whether to reject $H_0$ or not, but it does not inform us about the direction, magnitude, or precision of the measured effect. Clearly, $\bar{X}_E - \bar{X}_C$ provides an unbiased estimate of $\mu_E - \mu_C$, where $\bar{X}_E$ and $\bar{X}_C$ denote the sample means of the $n_E$ and $n_C$ scores in the two groups. The precision of this estimate can be indicated by a $(1 - \alpha) \times 100\%$ CI for $\mu_E - \mu_C$, given by


$$(\bar{X}_E - \bar{X}_C) \pm t_{m,1-\alpha/2}\, s_p \sqrt{\frac{1}{n_E} + \frac{1}{n_C}} \tag{1}$$

where $t_{m,1-\alpha/2}$ denotes the $100 \times (1 - \alpha/2)$th quantile of a central t distribution with $m = n_E + n_C - 2$ degrees of freedom and $s_p^2$ the pooled variance of the two groups.

In the ideal case where one is investigating a particular outcome variable whose raw units can be compared across related experiments, the unstandardized mean difference µE – µC represents a reasonable choice for the population effect size. However, as discussed earlier, the measurement units are often chosen arbitrarily. Therefore, working with standardized units can be more informative as this allows comparisons of parameter estimates across scales using different units. The population effect size is then defined as


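$$\delta_2 = \frac{\mu_E - \mu_C}{\sigma} \tag{2}$$

which reflects the difference in the population means in standard deviation units. An estimate of $\delta_2$ is given by


$$d_2 = \frac{\bar{X}_E - \bar{X}_C}{s_p} \tag{3}$$

However, $d_2$ is a positively biased estimator of $\delta_2$. The bias of $d_2$ was first demonstrated by Hedges (1981), who also derived the unbiased estimator


$$g_2 = c(m)\, d_2 \tag{4}$$

where


$$c(m) = \frac{\Gamma(m/2)}{\sqrt{m/2}\;\Gamma\!\left((m-1)/2\right)} \tag{5}$$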
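
As a minimal sketch of the bias-correction factor, the following Python snippet evaluates c(m) in its gamma-function form and compares it with the closed-form approximation 1 − 3/(4m − 1) that is commonly attributed to Hedges (1981). The function names `c_exact` and `c_approx` are illustrative only, and the approximation is not part of the text above.

```python
import math


def c_exact(m: int) -> float:
    """Bias-correction factor c(m) = Gamma(m/2) / (sqrt(m/2) * Gamma((m-1)/2)), m >= 2."""
    # Work on the log scale so that large m does not overflow the gamma function.
    log_ratio = math.lgamma(m / 2) - math.lgamma((m - 1) / 2)
    return math.exp(log_ratio) / math.sqrt(m / 2)


def c_approx(m: int) -> float:
    """Closed-form approximation 1 - 3 / (4m - 1) (an assumption added for comparison)."""
    return 1 - 3 / (4 * m - 1)


for m in (9, 18, 100):
    print(m, round(c_exact(m), 4), round(c_approx(m), 4))
```

The factor is always below 1 and approaches 1 as m grows, so g shrinks d toward zero most strongly in small samples.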

Based on Hedges (1981, 1982, 1983), we note the following set of results. The exact variances of $d_2$ and $g_2$ are given by Equations 21 and 22 in Table 1. However, $\sigma^2_{d_2}$ and $\sigma^2_{g_2}$ depend on the unknown value $\delta_2$, which in practice is replaced by either $d_2$ or $g_2$, leading to estimates $\hat{\sigma}^2_{d_2}(B)$ and $\hat{\sigma}^2_{g_2}(B)$ (Equations 23 and 24). This introduces a certain amount of bias into the estimated sampling variances because $E[d_2^2] \neq E[g_2^2] \neq \delta_2^2$. Unbiased estimates of $\sigma^2_{d_2}$ and $\sigma^2_{g_2}$, denoted by $\hat{\sigma}^2_{d_2}(U)$ and $\hat{\sigma}^2_{g_2}(U)$, are given by Equations 25 and 26. Finally, $d_2$ and $g_2$ are asymptotically normal with mean $\delta_2$ and variance $1/\tilde{n} + \delta_2^2/(2m)$ (Equation 27), where $\tilde{n} = n_E n_C/(n_E + n_C)$. Replacing the unknown value of $\delta_2$ by either sample estimate leads to $\hat{\sigma}^2_{d_2}(L1)$ and $\hat{\sigma}^2_{g_2}(L1)$, the large sample variance estimators (Equations 28 and 29). However, in the literature, one usually finds m replaced with the total sample size $N = n_E + n_C$ in Equation 27. This leads to the large sample estimators $\hat{\sigma}^2_{d_2}(L2)$ and $\hat{\sigma}^2_{g_2}(L2)$ (Equations 30 and 31).



 
TABLE 1 Variances and Variance Estimators in the Two-Independent Samples Case ($\delta_2$) and the Two-Dependent Samples Case ($\delta_D$)

 

    The Two-Dependent Samples Design
 
Researchers often choose to measure the same set of n participants on two different occasions, such as before and after receiving some treatment. Because the same group of participants is measured twice, the two sets of scores are no longer independent. Assume that the scores $X_1$ and $X_2$ obtained at Time 1 and Time 2 are sampled from normal distributions with expected values $\mu_1$ and $\mu_2$ and common variance $\sigma^2$. Now define the random variable $D = X_2 - X_1$. It follows that D is normally distributed with expected value $\mu_D = \mu_2 - \mu_1$ and variance $\sigma_D^2 = 2\sigma^2(1 - \rho)$, where $\rho$ is the correlation between the scores at Time 1 and Time 2.

The null hypothesis $H_0\colon \mu_2 - \mu_1 = 0$ (i.e., $H_0\colon \mu_D = 0$) can be tested by carrying out a one-sample t test on the D scores. The value of $\mu_D = \mu_2 - \mu_1$ is easily estimated with $\bar{D} = \bar{X}_2 - \bar{X}_1$, with a $(1 - \alpha) \times 100\%$ CI given by


$$\bar{D} \pm t_{m,1-\alpha/2}\, \frac{s_D}{\sqrt{n}} \tag{6}$$

where $s_D^2$ is the observed variance in the D scores and $m = n - 1$.

Again, we would like to obtain a standardized point estimate. Two different standardized parameters have been suggested in the literature (Becker, 1988; Gibbons, Hedeker, & Davis, 1993; Morris, 2000; Morris & DeShon, 2002), namely,


$$\delta_D = \frac{\mu_2 - \mu_1}{\sigma_D} \tag{7}$$

based on the standard deviation in the D scores, and


$$\delta_{D2} = \frac{\mu_2 - \mu_1}{\sigma} \tag{8}$$

which is of the same form as $\delta_2$ defined in Equation 2 for the two-independent samples design. Because $\sigma_D = \sigma\sqrt{2(1 - \rho)}$, the two parameters are related by $\delta_{D2} = \delta_D \sqrt{2(1 - \rho)}$. When $\rho = .5$, then $\delta_D = \delta_{D2}$. However, in many cases, one would expect the correlation between the scores at the two occasions to be greater than .5, which implies $|\delta_D| > |\delta_{D2}|$ (see Ray & Shadish, 1996, for some empirical evidence relevant to this issue). Simply assuming that $\rho = .5$ is not recommended, and estimates of $\delta_D$ are therefore not directly comparable to estimates of $\delta_{D2}$.

If $\delta_D$ is chosen as the effect size parameter of interest, then biased and unbiased estimates are given by


$$d_D = \frac{\bar{D}}{s_D} \tag{9}$$

and


$$g_D = c(m)\, d_D \tag{10}$$

respectively, where m = n – 1. The exact variances and variance estimates for these effect sizes are given in Table 1 with ñ = n and N = n (Becker, 1988; Gibbons et al., 1993; Morris, 2000; Morris & DeShon, 2002).

Working with $\delta_{D2}$ is usually preferable because it is directly comparable to $\delta_2$ from two-independent samples designs. However, estimating $\delta_{D2}$ from dependent samples data poses some additional difficulties. Naturally, one might consider estimators of the form given by $d_2$ and $g_2$ (Equations 3 and 4) to be appropriate here. The problem with this approach is that their exact distributions, expected values, and variances are unknown. Instead, the estimators


$$d_{D2} = \frac{\bar{D}}{s_1} \tag{11}$$

and


$$g_{D2} = c(m)\, d_{D2} \tag{12}$$

have been suggested in the literature and are biased and unbiased estimators of $\delta_{D2}$, respectively, where $m = n - 1$ and $s_1$ is the pretreatment standard deviation (Becker, 1988; Morris, 2000; Morris & DeShon, 2002). The exact variance of $d_{D2}$ and $g_{D2}$ and the large sample approximations are given in Table 2 (Becker, 1988; Morris & DeShon, 2002). In addition, biased estimates of the exact sampling variance are obtained by replacing the parameter $\delta_{D2}$ by either $d_{D2}$ or $g_{D2}$ and $\rho$ by the sample correlation r. Unbiased estimates of the exact sampling variance can also be derived and are given in Table 2 (Equations 36 and 37).



 
TABLE 2 Variances and Variance Estimators in the Two-Dependent Samples Case ($\delta_{D2}$)

 
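Note that one must use an unbiased estimate of $\rho$ in Equations 36 and 37 to obtain unbiased estimates of the sampling variances of $d_{D2}$ and $g_{D2}$. An unbiased estimate of $\rho$, derived by Olkin and Pratt (1958), is given by


$$\hat{\rho} = r\, F\!\left(\tfrac{1}{2}, \tfrac{1}{2}; \tfrac{n-2}{2}; 1 - r^2\right) \tag{13}$$

where


$$F(a, b; c; z) = \sum_{k=0}^{\infty} \frac{\Gamma(a+k)\,\Gamma(b+k)\,\Gamma(c)}{\Gamma(a)\,\Gamma(b)\,\Gamma(c+k)}\, \frac{z^k}{k!} \tag{14}$$

denotes the hypergeometric function. Olkin and Pratt also suggested


$$\hat{\rho} \approx r + \frac{r(1 - r^2)}{2(n - 4)} \tag{15}$$

as a simple yet accurate approximation to the unbiased estimator of $\rho$.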
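
The following Python sketch compares the two estimators of the correlation numerically. It assumes the standard Olkin and Pratt forms, namely r · F(1/2, 1/2; (n − 2)/2; 1 − r²) for the unbiased estimator and r + r(1 − r²)/(2(n − 4)) for the approximation. The function names and the truncated series evaluation are illustrative choices, not taken from the article.

```python
def hyp2f1(a, b, c, z, tol=1e-12, max_terms=10_000):
    """Gauss hypergeometric series sum_k (a)_k (b)_k / (c)_k * z**k / k! for 0 <= z <= 1."""
    term, total = 1.0, 1.0
    for k in range(max_terms):
        # Ratio of consecutive series terms.
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * z
        total += term
        if abs(term) < tol:
            break
    return total


def rho_unbiased(r, n):
    """Olkin-Pratt unbiased estimate of rho (assumed form, n > 4)."""
    return r * hyp2f1(0.5, 0.5, (n - 2) / 2, 1 - r * r)


def rho_approx(r, n):
    """Closed-form approximation to the unbiased estimate (assumed form, n > 4)."""
    return r + r * (1 - r * r) / (2 * (n - 4))


r, n = 0.78, 10
print(round(rho_unbiased(r, n), 3), round(rho_approx(r, n), 3))
```

For positive r, both corrections move the sample correlation slightly away from zero, and the adjustment vanishes as n increases.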


    Confidence Intervals for δ
 
In the following sections, $\delta$ refers to one of the three different effect size parameters discussed previously, namely, $\delta_2$, $\delta_D$, or $\delta_{D2}$. We will now consider methods for finding a CI for $\delta$. Finding the exact CI bounds for $\delta$ is problematic because the shape of the distribution of d and g depends directly on the parameter for which the interval is being constructed. Therefore, the CI cannot be given as a closed-form expression.

Iterative methods to find the exact CI for the two-independent and two-dependent samples design with parameter $\delta_D$ have been discussed in the literature (Cumming & Finch, 2001; Hedges & Olkin, 1985; Smithson, 2003a; Steiger & Fouladi, 1997). Exact CI bounds for the two-dependent samples design with parameter $\delta_{D2}$ can also be obtained when $\rho$ is known by multiplying the bounds for $\delta_D$ by $\sqrt{2(1 - \rho)}$. However, in practice, $\rho$ must be estimated from the data. The additional variability in estimates of $\rho$ must be considered when using an iteration procedure to find the exact CI for $\delta_{D2}$. No such method has been developed yet.

We will now consider several approximations to the exact CI bounds. Let q be equal to the $100 \times (1 - \alpha/2)$th quantile of either the standard normal or the central t distribution with m degrees of freedom. Approximate $(1 - \alpha) \times 100\%$ CIs for $\delta$ are given by methods B, U, L1, and L2 as shown in Table 3. Two different sets of approximations are given, depending on whether one uses the biased or the unbiased estimate of the corresponding population parameter. Also, one can use either the normal distribution or the t distribution to construct such approximate CIs. Use of the normal distribution for obtaining CIs can be justified based on the fact that d and g are asymptotically normal with expectation $\delta$ and variances given by Equations 27 or 38. On the other hand, when $\delta = 0$, then $\sqrt{\tilde{n}}\, d$ and $\sqrt{\tilde{n}}\, g/c(m)$ are distributed central t with m degrees of freedom, which suggests use of the t distribution for small $\delta$.



 
TABLE 3 Methods to Obtain Approximate Confidence Intervals for $\delta$

 
Another approach to obtain approximate CI bounds for $\delta$ is to first use a variance stabilizing transformation. The variance stabilizing transformation for the standardized mean difference, as suggested by Hedges and Olkin (1985), is here generalized to include the two-dependent samples design. Based on the delta method, the random variable


$$z_g = h(g) = \sqrt{2}\, \sinh^{-1}\!\left(\frac{g}{a}\right) \tag{16}$$

can be shown to be asymptotically normal with expectation $h(\delta)$ and variance 1/N, where $a = \sqrt{4 + 2(n_E/n_C) + 2(n_C/n_E)}$ in the two-independent samples case, $a = \sqrt{2}$ in the two-dependent samples case with parameter $\delta_D$, and $a = \sqrt{4(1 - r)}$ in the two-dependent samples case with parameter $\delta_{D2}$. Therefore, one calculates the lower and upper bounds of a CI using the distribution of $z_g$ with


$$z_g \pm \frac{q}{\sqrt{N}} \tag{17}$$

and then transforms these bounds back to the original metric with


$$h^{-1}(z) = a \sinh\!\left(\frac{z}{\sqrt{2}}\right) \tag{18}$$

the inverse function of h(g). This method (denoted by the letter H) could also be applied using the biased estimate d in place of g.
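
The three steps of method H (transform, build the interval, transform back) can be sketched in Python as follows. The sketch assumes the transformation has the form $\sqrt{2}\,\sinh^{-1}(g/a)$ with inverse $a \sinh(z/\sqrt{2})$ and uses the normal quantile. The function name `ci_method_h` and the inputs (g = 1.52 with 10 participants per group) are illustrative.

```python
import math
from statistics import NormalDist


def ci_method_h(g, a, N, level=0.95):
    """Approximate CI via the variance stabilizing transformation (normal quantile)."""
    q = NormalDist().inv_cdf(1 - (1 - level) / 2)
    z_g = math.sqrt(2) * math.asinh(g / a)        # transform the estimate
    half_width = q / math.sqrt(N)                 # variance of z_g is about 1/N

    def back(z):
        return a * math.sinh(z / math.sqrt(2))    # inverse transformation

    return back(z_g - half_width), back(z_g + half_width)


n_e = n_c = 10
a = math.sqrt(4 + 2 * n_e / n_c + 2 * n_c / n_e)  # equals sqrt(8) for equal groups
lower, upper = ci_method_h(g=1.52, a=a, N=n_e + n_c)
print(round(lower, 2), round(upper, 2))
```

Because the back-transformation is nonlinear, the resulting interval is not symmetric around g, unlike the intervals built directly from a variance estimate.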

Finally, Fidler and Thompson (2001) suggested that a CI for $\delta_2$ in the two-independent samples design could be approximated by dividing each $X_E$ and $X_C$ score by $s_p$ and obtaining a CI for the raw mean difference as described by Equation 1 using the transformed data. It is easy to show that this method is identical to dividing the CI bounds for $\mu_E - \mu_C$ obtained from Equation 1 by $s_p$. A similar approach was also discussed by Bird (2002). However, after dividing Equation 1 by $s_p$, we obtain $d_2 \pm t_{m,1-\alpha/2} \sqrt{1/n_E + 1/n_C}$. This in turn reveals that the Fidler and Thompson approach is essentially the same as using the biased estimate $d_2$ and constructing a CI based on the t distribution and the large sample variance $\sigma^2_{d/g}(\infty)$ (Equation 27) where $\delta_2$ is assumed to be zero. The same principle could be extended to the two-dependent samples design where a CI for $\delta_D$ is sought by dividing the CI bounds for $\mu_D$ obtained from Equation 6 by $s_D$. Finally, dividing the CI bounds for $\mu_D$ obtained from Equation 6 by Formula would provide an approximate CI for $\delta_{D2}$. This method of finding approximate CIs for $\delta$ will be denoted by the letter F.

In total, this yields 21 different approximations (see again Table 3). Specifically, methods B, U, L1, L2, and H can either be based on d or g and can either employ critical values from the normal or the t distribution, whereas method F is by default always based on d and the t distribution. The letter z or t will be appended to the name of the approximation to indicate which distribution was used. For example, method B based on g and the normal distribution will be denoted by gBz.
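
As a quick check of the count and the naming scheme, the following Python sketch enumerates the labels. The single label "F" for method F is an assumption made here for illustration, since that method has only one variant.

```python
from itertools import product

variance_methods = ("B", "U", "L1", "L2", "H")
estimators = ("d", "g")        # biased and unbiased estimate
distributions = ("z", "t")     # normal and t critical values

# Names follow the pattern <estimator><method><distribution>, e.g. "gBz".
labels = [f"{est}{meth}{dist}"
          for meth, est, dist in product(variance_methods, estimators, distributions)]
labels.append("F")             # always based on d and the t distribution

print(len(labels))
print(labels)
```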


    Examples
 
Three examples will illustrate the various approximations discussed in the previous section. First, consider the two-independent samples case. Assume that (25, 19, 21, 14, 16, 23, 24, 24, 22, 22) and (28, 26, 27, 19, 23, 29, 25, 31, 32, 30) represent test scores from students randomly assigned to a control and an experimental group, respectively, in an experiment investigating the impact of a new teaching technique on students’ performance. The mean difference in test scores between the two groups is 6.0 points with a 95% CI given by (2.44, 9.56). To obtain the standardized estimate $g_2$, we use Equation 4 with m = 18 and c(18) = .96. In standardized units, it turns out that the students in the experimental group scored $g_2 = 1.52$ standard deviations above the control group. Using iterative estimation procedures, we find that an exact 95% CI for $\delta_2$ is given by (.55, 2.58).

We now compare the exact CI with the approximate CI bounds obtained by the methods discussed earlier (for conciseness, the examples in this section will be based on the normal distribution and g only). For method gBz, we first calculate $\hat{\sigma}^2_{g_2}(B)$ (Equation 24), which is equal to .28. Next, we find that $g_2 \pm 1.96\, \hat{\sigma}_{g_2}(B)$ provides approximate CI bounds (.48, 2.55). For method gUz, we find that $\hat{\sigma}^2_{g_2}(U)$ is equal to .27 and $g_2 \pm 1.96\, \hat{\sigma}_{g_2}(U)$ then provides the approximate bounds (.50, 2.54). Continuing with methods gL1z and gL2z, we find the CI bounds to be equal to (.51, 2.52) and (.52, 2.51), respectively. For the variance stabilizing transformation (method gHz), we use Equation 16 with $a = \sqrt{8}$ and obtain $z_g = .73$. We then calculate the CI in the transformed units with Equation 17, yielding $.73 \pm 1.96/\sqrt{20}$. Transforming these bounds back into the original units using Equation 18 results in (.58, 2.60). Finally, for method F, we divide the bounds of the CI based on the raw units by $s_p = 3.79$, yielding (.65, 2.52).
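
The large sample methods gL1z and gL2z are simple enough to retrace directly from the raw scores. The Python sketch below does so under the reading that the two variance estimates are $1/\tilde{n} + g^2/(2m)$ and $1/\tilde{n} + g^2/(2N)$. Variable and function names are illustrative.

```python
import math
from statistics import NormalDist, mean, variance

control = [25, 19, 21, 14, 16, 23, 24, 24, 22, 22]
experimental = [28, 26, 27, 19, 23, 29, 25, 31, 32, 30]

n_c, n_e = len(control), len(experimental)
N = n_e + n_c
m = N - 2
n_tilde = n_e * n_c / N

# Pooled standard deviation, biased estimate d, and bias-corrected estimate g.
s_p = math.sqrt(((n_e - 1) * variance(experimental) + (n_c - 1) * variance(control)) / m)
d2 = (mean(experimental) - mean(control)) / s_p
c_m = math.exp(math.lgamma(m / 2) - math.lgamma((m - 1) / 2)) / math.sqrt(m / 2)
g2 = c_m * d2

q = NormalDist().inv_cdf(0.975)  # normal critical value for a 95% CI


def wald_ci(estimate, var):
    """Symmetric interval: estimate plus/minus q times the standard error."""
    half = q * math.sqrt(var)
    return estimate - half, estimate + half


ci_l1 = wald_ci(g2, 1 / n_tilde + g2 ** 2 / (2 * m))  # method gL1z
ci_l2 = wald_ci(g2, 1 / n_tilde + g2 ** 2 / (2 * N))  # method gL2z
print(round(s_p, 2), round(g2, 2), ci_l1, ci_l2)
```

Methods gBz and gUz follow the same pattern with the variance estimates from Table 1 substituted for the large sample expressions.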

Next, consider the dependent samples case where interest is focused on $\delta_D$. Assume that the scores given earlier were obtained from 10 participants tested before and after receiving some treatment. The change scores $D = X_2 - X_1$ are equal to (3, 7, 6, 5, 7, 6, 1, 7, 10, 8). The unstandardized mean difference is estimated by $\bar{D}$, which is 6.0 as in the two-independent samples case. A 95% CI for $\mu_D$, obtained with Equation 6, yields the bounds (4.18, 7.82). Because the scores are highly correlated (r = .78), this CI is narrower than the one obtained from the same data when treating the two sets of scores as coming from two independent samples.

Using Equation 10 with m = 9 and c(9) = 0.91, we find $g_D = 2.16$. An exact CI for $\delta_D$ is given by (1.11, 3.59). Approximate 95% CIs are obtained in the same manner as before except that $\tilde{n} = n = 10$, $N = n = 10$, and $a = \sqrt{2}$. The approximate bounds are equal to (.84, 3.48) using method gBz, (.89, 3.43) using method gUz, (.99, 3.33) using method gL1z, (1.03, 3.29) using method gL2z, (1.20, 3.54) using method gHz, and (1.65, 3.08) using method F.
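
The same calculations carry over to the change scores. The Python sketch below retraces $g_D$ and three of the approximate intervals (gL1z, gL2z, and gHz), assuming the large sample variances $1/n + g^2/(2m)$ and $1/n + g^2/(2n)$ and the transformation $\sqrt{2}\,\sinh^{-1}(g/\sqrt{2})$. Names are illustrative.

```python
import math
from statistics import NormalDist, mean, stdev

time1 = [25, 19, 21, 14, 16, 23, 24, 24, 22, 22]
time2 = [28, 26, 27, 19, 23, 29, 25, 31, 32, 30]
diffs = [after - before for before, after in zip(time1, time2)]

n = len(diffs)
m = n - 1
c_m = math.exp(math.lgamma(m / 2) - math.lgamma((m - 1) / 2)) / math.sqrt(m / 2)
g_d = c_m * mean(diffs) / stdev(diffs)      # bias-corrected estimate of delta_D

q = NormalDist().inv_cdf(0.975)


def wald_ci(estimate, var):
    half = q * math.sqrt(var)
    return estimate - half, estimate + half


ci_l1 = wald_ci(g_d, 1 / n + g_d ** 2 / (2 * m))   # method gL1z
ci_l2 = wald_ci(g_d, 1 / n + g_d ** 2 / (2 * n))   # method gL2z

# Method gHz: transform, add and subtract q / sqrt(n), transform back.
a = math.sqrt(2)
z_g = math.sqrt(2) * math.asinh(g_d / a)
ci_h = tuple(a * math.sinh((z_g + sign * q / math.sqrt(n)) / math.sqrt(2))
             for sign in (-1, 1))

print(round(g_d, 2), ci_l1, ci_l2, ci_h)
```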

As a final example, consider the case where an estimate and CI for $\delta_{D2}$ is sought. Using Equation 12 with m = 9 and c(9) = 0.91, we find $g_{D2} = 1.51$. An exact CI for $\delta_{D2}$ is obtained by multiplying the bounds for $\delta_D$, namely, (1.11, 3.59), by $\sqrt{2(1 - \rho)}$. The scores given earlier were drawn from a population where $\rho = .80$, and therefore, the exact CI for $\delta_{D2}$ is given by (.70, 2.27).
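
The rescaling step is a one-line computation. The Python sketch below applies the factor $\sqrt{2(1 - \rho)}$ to the exact bounds quoted above, with the function name chosen for illustration. It is only valid when $\rho$ is treated as known.

```python
import math


def rescale_to_delta_d2(bounds, rho):
    """Convert CI bounds for delta_D into bounds for delta_D2 when rho is known."""
    factor = math.sqrt(2 * (1 - rho))
    return tuple(bound * factor for bound in bounds)


lower, upper = rescale_to_delta_d2((1.11, 3.59), rho=0.80)
print(round(lower, 2), round(upper, 2))
```

With $\rho = .5$ the factor equals 1 and the two sets of bounds coincide, and for larger correlations the bounds for $\delta_{D2}$ are pulled toward zero.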

Method gBz in this case is based on Equation 35, which yields an approximate CI of (.60, 2.43). Continuing through the list of variance estimates in Table 2 and approximation methods in Table 3, we obtain the bounds (.64, 2.38) with method gUz, (.70, 2.33) with method gL1z, and (.73, 2.30) with method gL2z. Method gHz, based on the variance stabilizing transformation, yields the approximate bounds (.86, 2.47). And finally, an approximate CI is given by (1.11, 2.04) when using method F.

The examples illustrate that the methods can differ widely in terms of how well they approximate the exact CI bounds. Hedges (1982) studied method gL2z in the two-independent samples design (using sample sizes in the range $10 \le n_E = n_C \le 100$ and values of $\delta_2$ between 0.25 and 1.50) and found the approximation to be quite accurate. Morris (2000) studied method dUz in the two-dependent samples design with parameter $\delta_{D2}$. However, $\delta_{D2}$ and $\rho$ were treated as known in the simulations, which bears little relevance to practice, where only sample estimates of $\delta_{D2}$ and $\rho$ are available. It is still unknown how well CIs based on the remaining methods approximate the exact CI bounds and, in particular, whether one method should be preferred over the others. Results for unequal and very small sample sizes and values of $|\delta|$ above 1.5 are also still needed. Finally, the two-dependent samples design with parameter $\delta_D$ has not been studied at all so far. Therefore, Monte Carlo simulations were conducted to compare the accuracy of the various approximations.


    Method
 
Three sets of simulations were conducted. The first set of simulations corresponds to the two-independent samples case. Values of $\delta_2$ between −2 and 2 in steps of .25, seven different CI widths ($1 - \alpha$ = .50, .60, .70, .80, .90, .95, and .99), and various sample size configurations were used: (a) equal sample sizes of $n_E = n_C$ = 4, 8, 16, 32, and 64 participants per group; (b) unequal sample sizes with $(n_E, n_C)$ = (2, 6), (4, 12), (8, 24), (16, 48), and (32, 96), each corresponding to a 25/75% split of participants; and (c) unequal sample sizes with $(n_E, n_C)$ = (2, 14), (4, 28), (8, 56), and (16, 112), each corresponding to a 12.5/87.5% split of participants. Therefore, there were a total of 17 $\delta_2$ × 14 sample size × 7 CI width = 1,666 conditions. On each of the 100,000 iterations for a particular condition, a $d_2$ value was directly simulated from $d_2 = Z/\sqrt{X/m}$, where Z is a random normal variable with distribution $N(\delta_2, 1/n_E + 1/n_C)$ and X is a random chi-square variable with $m = n_E + n_C - 2$ degrees of freedom. The exact $(1 - \alpha) \times 100\%$ CI bounds for $\delta_2$ were then determined using iterative methods. Next, approximate CI bounds were obtained with each of the 21 methods discussed earlier.
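
A single condition of this first simulation set can be sketched in Python as follows. The sketch draws $d_2$ directly as a normal variate divided by the square root of a scaled chi-square variate and records how often an approximate interval contains $\delta_2$. Method gL2z, the reduced number of replications, the seed, and all names are illustrative choices, and the exact bounds are not computed here.

```python
import math
import random
from statistics import NormalDist


def simulate_coverage(delta, n_e, n_c, level=0.95, reps=20_000, seed=1):
    """Estimate the coverage of method gL2z for one simulation condition."""
    rng = random.Random(seed)
    N = n_e + n_c
    m = N - 2
    n_tilde = n_e * n_c / N
    c_m = math.exp(math.lgamma(m / 2) - math.lgamma((m - 1) / 2)) / math.sqrt(m / 2)
    q = NormalDist().inv_cdf(1 - (1 - level) / 2)

    hits = 0
    for _ in range(reps):
        z = rng.gauss(delta, math.sqrt(1 / n_e + 1 / n_c))  # Z ~ N(delta, 1/n_E + 1/n_C)
        x = rng.gammavariate(m / 2, 2.0)                    # chi-square with m df
        g = c_m * z / math.sqrt(x / m)                      # bias-corrected draw
        half = q * math.sqrt(1 / n_tilde + g * g / (2 * N))
        hits += (g - half) < delta < (g + half)
    return hits / reps


coverage = simulate_coverage(delta=0.5, n_e=32, n_c=32)
print(coverage)
```

Drawing $d_2$ from its sampling distribution avoids generating raw scores, which is why this set of simulations is much cheaper than one that simulates the data themselves.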

The second set of simulations corresponds to the two-dependent samples case with parameter $\delta_D$. Because values of $\delta_D$ tend to be larger than $\delta_2$ values, values of $\delta_D$ between −4 and 4 in steps of .50 were used. Five sample size conditions (n = 8, 16, 32, 64, and 128) were simulated. This yields a total of 17 $\delta_D$ × 5 sample size × 7 CI width = 595 conditions. Again, 100,000 values of $d_D$ were directly simulated for each condition from $d_D = Z/\sqrt{X/m}$, where Z is a random normal variable with distribution $N(\delta_D, 1/n)$ and X is a random chi-square variable with $m = n - 1$ degrees of freedom. Exact and approximate CI bounds were then obtained for $\delta_D$.

The third set of simulations corresponds to the two-dependent samples case with parameter $\delta_{D2}$. For each trial, two vectors of random standard normal data were generated for various values of n with population correlation coefficient $\rho$ using the Cholesky decomposition. Next, a constant was added to one of the two sets such that all values of $\delta_{D2}$ between −2 and 2 in steps of .25 were represented in the simulations. Sample sizes of n equal to 8, 16, 32, 64, and 128 and values of $\rho$ equal to 0, .1, .3, .5, .7, and .9 were included in the simulations. This yields a total of 17 $\delta_{D2}$ × 5 sample size × 6 $\rho$ × 7 CI width = 3,570 conditions. After generating the data, $d_{D2}$ and $g_{D2}$ were calculated and the exact CI bounds for $\delta_{D2}$ were determined using iterative methods (assuming known $\rho$). Next, approximate CI bounds for $\delta_{D2}$ were obtained with each of the approximation methods. The third set of simulations was based on 10,000 trials per condition because simulating the raw data required substantially more computing time.
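
The data-generating step can be sketched in Python as follows. For two variables, the Cholesky factor of the correlation matrix has rows (1, 0) and ($\rho$, $\sqrt{1 - \rho^2}$), so the second score vector is a weighted sum of two independent standard normal draws plus the constant. Function names, the seed, and the large n used for the check are illustrative.

```python
import math
import random
from statistics import mean, stdev


def simulate_pair(n, rho, delta_d2, rng):
    """Generate n correlated score pairs with unit variances and mean shift delta_d2."""
    l21, l22 = rho, math.sqrt(1 - rho * rho)   # second row of the Cholesky factor
    x1, x2 = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1.append(z1)
        x2.append(l21 * z1 + l22 * z2 + delta_d2)
    return x1, x2


def d_d2(x1, x2):
    """Biased estimate of delta_D2: mean change divided by the Time 1 standard deviation."""
    return mean(b - a for a, b in zip(x1, x2)) / stdev(x1)


def sample_corr(x1, x2):
    m1, m2 = mean(x1), mean(x2)
    cross = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    return cross / ((len(x1) - 1) * stdev(x1) * stdev(x2))


rng = random.Random(0)
x1, x2 = simulate_pair(n=20_000, rho=0.7, delta_d2=0.5, rng=rng)
print(round(sample_corr(x1, x2), 3), round(d_d2(x1, x2), 3))
```

Because both score vectors have unit variance, the added constant equals $\delta_{D2}$ directly.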

The accuracy of the various approximations was assessed with two measures: the empirical coverage probability and the ratio of the length of the approximate CI compared to the exact interval. Specifically, the empirical coverage probability was estimated with


$$P = \frac{1}{R} \sum_{i=1}^{R} I_{(C_{Li},\, C_{Ui})}[\delta] \tag{19}$$

where $C_{Li}$ and $C_{Ui}$ denote the lower and upper CI bounds on the ith iteration, $I_{(C_{Li},\, C_{Ui})}[\delta] = 1$ if $\delta \in (C_{Li}, C_{Ui})$ and 0 otherwise, and R = 100,000 for the first two sets of simulations, and R = 10,000 for the third set. The maximum standard error of the P values is .002 for R = 100,000 and .005 for R = 10,000.
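
The quoted maximum standard errors can be checked with a short calculation, assuming that P is treated as a binomial proportion based on R independent iterations. The standard error is then largest when the true coverage probability equals .5:

$$\mathrm{SE}(P) = \sqrt{\frac{\pi(1 - \pi)}{R}} \le \sqrt{\frac{.25}{R}}, \qquad \sqrt{\frac{.25}{100{,}000}} \approx .0016, \qquad \sqrt{\frac{.25}{10{,}000}} = .005$$

Here $\pi$ denotes the true coverage probability of a method (a symbol introduced only for this check). Rounded to three decimals, the first value gives the .002 reported for R = 100,000.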

The empirical coverage probability indicates whether the approximation captures the parameter as often as it should. On the other hand, the ratio of the length of the approximate CI to the true CI indicates to what degree the width of the true interval was over- or underestimated. The average width ratio for a particular method was estimated with


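$$W = \frac{1}{R} \sum_{i=1}^{R} \frac{C_{Ui} - C_{Li}}{\tilde{C}_{Ui} - \tilde{C}_{Li}} \tag{20}$$

where $\tilde{C}_{Li}$ and $\tilde{C}_{Ui}$ are the lower and upper bounds of the exact CI.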

The error in the empirical coverage probability of an approximation is given by $P - (1 - \alpha)$. The error in the width ratio is given by $W - 1$. Because of the large number of approximations considered in the present article, the methods were first examined in terms of their maximum error in coverage probability and width ratio. These values represent the worst-case results and allow us to rule out methods that can yield grossly inaccurate approximations (the full set of results can be obtained by contacting the author).


    Results
 
Two-Independent Samples Design
The accuracy of methods B, U, L1, L2, and H depended only on the total sample size ($N = n_E + n_C$) and not on the proportion of scores falling into each group. Also, all these methods converged to the nominal coverage probabilities and interval widths as N increased, regardless of whether the method was based on $d_2$, $g_2$, the normal, or the t distribution critical values. In fact, the maximum value of $(P - (1 - \alpha)) \times 100\%$ over all conditions and methods (excluding method F) was never larger than 6.5% for N = 16, 3.0% for N = 32, 1.5% for N = 64, and 1.0% for N = 128. In terms of width ratios, convergence was somewhat slower. The maximum value of $(W - 1) \times 100\%$ was 28.6%, 11.9%, 5.5%, and 2.6% for a total sample size of 16, 32, 64, and 128 participants, respectively.

The convergence of the empirical coverage probabilities of methods B, U, L1, L2, and H to the nominal $(1 - \alpha)$ values and CI widths was to be expected for three reasons. First, the distributions of d and g are asymptotically normal with expectation $\delta$. Second, these methods are based on consistent variance estimates. And finally, for large sample sizes, the normal and t distributions converge, yielding very similar quantiles.

However, for small sample sizes, some approximations were noticeably more accurate than others. Table 4 indicates the maximum value of $(P - (1 - \alpha)) \times 100\%$ and $(W - 1) \times 100\%$ for each method and each CI width over all values of $\delta_2$ for N = 8 in the equal sample size condition. Negative signs indicate underestimation of coverage probabilities and widths, whereas positive signs indicate overestimation. The table shows that the approximate bounds based on the normal distribution generally yielded more accurate results than the corresponding bounds based on the t distribution, especially with respect to width ratios and 90% or wider CIs.



 
TABLE 4 Maximum Error (in %) in Empirical Coverage Probabilities and Width Ratios Over All Values of $\delta_2$ for N = 8 in the Two-Independent Samples Case (Equal Group Sizes)

 
Table 4 also indicates that methods gUz, gL1z, dL1z, dL2z, gHz, and dHz provided the most accurate results when considering empirical coverage probabilities and width ratios simultaneously. To determine whether one of these methods is the most accurate, one should examine individual values of P and W instead of focusing on the maximum errors as done in Table 4. Accordingly, the empirical coverage probabilities and width ratios of these methods are plotted by $\delta_2$ values for the 95% CI condition in Figure 1 when N = 8. In addition, the results for method F are shown to illustrate its distinctive performance. Clearly, method dL1z provided the most accurate approximation to the exact CI bounds for $\delta_2$. Its width ratios were slightly above 1 but with no substantial impact on coverage probabilities.


FIGURE 1 Empirical coverage probability and width ratio in the two-independent sample case for N = 8 and 95% confidence intervals (CIs) (Equal Group Sizes).

 
Method F showed a very different pattern of results when compared to the other methods. At {delta}2 = 0, the empirical coverage probabilities were equal to the nominal (1 – {alpha}) values for all sample sizes (even for N = 8) and all values of (1 – {alpha}), although its width ratios were quite inaccurate. For larger values of |{delta}2|, P fell below the nominal (1 – {alpha}) value, with no improvement in accuracy for larger sample sizes (see last rows in Table 4). This finding is not surprising as method F essentially assumes that {delta}2 is equal to zero. Therefore, no matter how large N becomes, method F will not provide accurate results when {delta}2 ≠ 0. Finally, method F actually showed an increase in accuracy in the unequal sample size conditions, with more extreme splits yielding more accurate P values (again, see last rows in Table 4).

Two-Dependent Samples Design With Parameter {delta}D
All methods except F again converged to the correct coverage probabilities and interval widths as sample size increased. Specifically, (P – (1 – {alpha})) x 100% never exceeded 7.6% when n = 16, 3.9% when n = 32, 2.1% when n = 64, and 1.1% when n = 128 over all values of {delta}D and (1 – {alpha}) for methods B, U, L1, L2, and H. For n = 16, 32, 64, and 128, the highest width ratio error was 31.7%, 13.9%, 7.5%, and 3.9%, respectively. Table 5 provides the maximum value of (P – (1 – {alpha})) x 100% and (W – 1) x 100% for each method and each CI width over all values of {delta}D for n = 8. The results are similar to those from the two-independent samples case.


TABLE 5 Maximum Error (in %) in Empirical Coverage Probabilities and Width Ratios Over All Values of {delta}D for n = 8 in the Two-Dependent Samples Case With Parameter {delta}D

 
Figure 2 shows the empirical coverage probabilities and width ratios for methods gUz, gL1z, dL1z, dL2z, gHz, dHz, and F for the 95% CI condition when n = 8, which again indicates that method dL1z performed favorably. Because of the wider range of {delta}D values considered in this set of simulations, method F performed even worse when compared to the two-independent samples case and again showed no sign of improvement with larger sample sizes.


FIGURE 2 Empirical coverage probability and width ratio in the two-dependent sample case with parameter {delta}D for n = 8 and 95% confidence intervals (CIs).

 
Two-Dependent Samples Design With Parameter {delta}D2
Once again, all of the approximations except method F converged to the exact CI bounds in terms of empirical coverage probabilities and widths. The largest error in the coverage probabilities amounted to 8.3% for n = 16, 3.9% for n = 32, 2.7% for n = 64, and 1.7% for n = 128. For the width ratios, the largest errors were 27.1%, 16.4%, 8.5%, and 4.4%, respectively. Table 6 provides the maximum value of (P – (1 – {alpha})) x 100% and (W – 1) x 100% for each method over all values of {delta}D2 and {rho} in the two-dependent samples case when n = 8. The approximations were slightly less accurate than in the previous two sets of simulations, which can be attributed to the additional error introduced by having to estimate {rho}. Methods gL1z, gL2z, dL2z, and gHz provided the most accurate coverage probabilities and width ratios for small sample sizes.


TABLE 6 Maximum Error (in %) in Empirical Coverage Probabilities and Width Ratios Over All Values of {delta}D2 and {rho} for n = 8 in the Two-Dependent Samples Case With Parameter {delta}D2

 
Empirical coverage probabilities and width ratios for these four methods are plotted in Figure 3 across individual values of {delta}D2 for the 95% CI condition when {rho} = .7 and n = 8. Method dL1z, which provided the most accurate results in the previous two sets of simulations, is also plotted for comparison purposes, whereas method gHz was omitted, as it made the graphs difficult to read (it did not provide more accurate results than the two methods discussed in the following). Here, none of the methods could be considered generally superior in all aspects. In fact, all of these approximations tended to capture the parameter too infrequently, even though the intervals provided by methods dL1z and dL2z were usually too wide on average. Overall, methods gL1z and dL2z appear to be most accurate in terms of interval widths while still providing quite accurate coverage probabilities.


FIGURE 3 Empirical coverage probability and width ratio in the two-dependent sample case with parameter {delta}D2 for n = 8, {rho} = .7, and 95% confidence intervals (CIs).

 

    Discussion
Finding the exact CI for {delta} requires iterative estimation procedures. However, the present article shows that various approximate methods can be used without concern as long as sample sizes are at least moderately large. On the other hand, when sample sizes are small, then researchers can consult Tables 4, 5, and 6 when choosing an approximation. These tables provide the maximum error over conditions that are not under the control of the experimenter, namely, the true value of {delta} and the true value of {rho} in the two-dependent samples case. In other words, because these parameters are unknown in practice, one cannot pick a method that would be optimal for particular {delta} and {rho} values. A reasonable alternative is a mini-max approach, choosing a method that minimizes the maximum possible error.
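The mini-max idea can be sketched in a few lines. This is only an illustration under stated assumptions: the method labels and error values below are hypothetical placeholders (they are not taken from Tables 4, 5, or 6), the function name minimax_choice is made up, and coverage and width errors are treated on a single common scale, which is a simplification a researcher may not want.

```python
def minimax_choice(max_errors):
    """Pick the method whose worst-case absolute error is smallest."""
    worst = {m: max(abs(e) for e in errs) for m, errs in max_errors.items()}
    best = min(worst, key=worst.get)
    return best, worst


# Hypothetical (coverage error %, width error %) pairs per method.
table = {
    "method_a": (-1.2, 6.5),
    "method_b": (-0.8, 3.1),
    "method_c": (-2.9, 1.4),
}

best, worst = minimax_choice(table)
print(best, worst[best])
```

A researcher who cares more about coverage than about width could instead apply the same selection to the coverage errors alone.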

Ideally, one method would be most accurate for all the designs considered. Method dL1z was extremely accurate in the two-independent and two-dependent samples case with parameter {delta}D. However, it did not perform quite as well in the two-dependent samples case with parameter {delta}D2, where methods gL1z and dL2z provided the most accurate results. The results in the two-dependent samples case were of particular interest because methods to find the exact CI of {delta}D2 without knowledge of {rho} have not yet been developed. Therefore, researchers must rely on approximations in practice, and methods gL1z and dL2z are to be recommended here. The fact that methods based on the large-sample variances provided some of the most accurate approximations to the exact CI bounds is certainly a welcome finding because these methods are also the easiest to compute among all the suggested approximations.

A final word of caution is warranted when discussing standardized effect sizes. The standardized effect size {delta} is useful when comparing results from multiple studies using measurement instruments whose raw units are not directly comparable. If the different instruments provide scores that are linear transformations of each other, then standardizing the raw effect sizes allows comparisons across different instruments. The problem with standardized effect sizes is their dependence on the amount of variability in the population. The problem is twofold. First of all, {delta} assumes homoscedasticity of the scores in the groups or across repeated measures. When {sigma} is not homogeneous, use of {delta} might be problematic. However, using standardized units creates an even more notable problem because {delta} depends on the particular characteristics of the population being studied, specifically, its variance (Cohen, 1994). In other words, two d or g values for the same outcome measure obtained from two experiments could be incommensurable if the samples were drawn from populations with unequal variances.
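The dependence on population variability can be made concrete with a small arithmetic sketch. The numbers are hypothetical, and the function name is made up for this example; it assumes only that {delta} is a raw mean difference divided by the population standard deviation.

```python
def standardized_effect(mean_diff: float, sigma: float) -> float:
    """Raw mean difference expressed in population standard deviation units."""
    return mean_diff / sigma


# Same hypothetical raw difference, two populations with different spread.
raw_difference = 5.0
for sigma in (10.0, 20.0):
    print(f"sigma={sigma:5.1f}  delta={standardized_effect(raw_difference, sigma):.2f}")
```

An identical raw effect therefore yields a standardized effect only half as large in the population with twice the standard deviation.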

Cohen (1994) emphasized that researchers must begin to "respect the units they work with" (p. 1001). In the ideal case where the raw units of measurement have a natural interpretation and are consistent across multiple studies, it is not necessary to standardize the effect size. Using raw units eliminates the dependency of the effect size on the population variance. CIs for effect sizes in raw units are easily obtained and are exact. However, in the social sciences, the multitude of scales and measurement instruments will necessitate the use of standardized effect sizes in the future.
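As a reminder of why raw-unit intervals are straightforward, the familiar CI for a raw mean difference in the two-independent samples case can be written as below. This is the standard textbook result under normality and equal variances; the symbols ($\bar{x}_j$, $s_j^2$, $n_j$, $s_p$) are chosen for this sketch and may not match the notation used elsewhere in the article.

```latex
\[
  (\bar{x}_1 - \bar{x}_2) \;\pm\; t_{1-\alpha/2;\,n_1+n_2-2}\; s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}},
  \qquad
  s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}
\]
```

The bounds follow directly from a central t quantile, so no iterative search over a noncentrality parameter is needed.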


    Footnotes
 
WOLFGANG VIECHTBAUER is Assistant Professor, Department of Methodology and Statistics, at the University of Maastricht; P.O. Box, 6200 MD Maastricht, The Netherlands; e-mail:wolfgang.viechtbauer{at}stat.unimaas.nl. His research interests include mixed-effects models, meta-analysis, multilevel modeling, longitudinal analysis, and effect size measures.

I would like to thank David Thissen and three anonymous reviewers for their valuable comments on an earlier draft of this article.

Manuscript received June 28, 2004. Accepted for publication May 4, 2005.


    References
American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.

Becker, BJ. (1988). Synthesizing standardized mean-change measures. British Journal of Mathematical and Statistical Psychology, 41, 257-278.

Bird, KD. (2002). Confidence intervals for effect sizes in analysis of variance. Educational and Psychological Measurement, 62, 197-226.

Carlson, JG, Chemtob, CM, Rusnak, K, Hedlund, NL, & Muraoka, MY. (1998). Eye movement desensitization and reprocessing (EMDR) treatment for combat-related posttraumatic stress disorder. Journal of Traumatic Stress, 11, 3-24.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Cumming, G. (2003). Exploratory software for confidence intervals [Computer software]. Retrieved January 25, 2007, from http://www.latrobe.edu.au/psy/esci/index.html

Cumming, G, & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532-574.

Fidler, F, & Thompson, B. (2001). Computing correct confidence intervals for ANOVA fixed- and random-effects effect sizes. Educational and Psychological Measurement, 61, 575-604.

Gibbons, RD, Hedeker, DR, & Davis, JM. (1993). Estimation of effect size from a series of experiments involving paired comparisons. Journal of Educational Statistics, 18, 271-279.

Hedges, LV. (1981). Distribution theory for Glass’s estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107-128.

Hedges, LV. (1982). Estimation of effect size from a series of independent experiments. Psychological Bulletin, 92, 490-499.

Hedges, LV. (1983). A random effects model for effect sizes. Psychological Bulletin, 93, 388-395.

Hedges, LV, & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego, CA: Academic Press.

Hyde, JS. (2001). Reporting effect sizes: The roles of editors, textbook authors, and publication manuals. Educational and Psychological Measurement, 61, 225-228.

Kirk, RE. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.

Marcus, SV, Marquis, P, & Sakai, C. (1997). Controlled study of treatment of PTSD using EMDR in an HMO setting. Psychotherapy, 34, 307-315.

Morris, SB. (2000). Distribution of the standardized mean change effect size for meta-analysis on repeated measures. British Journal of Mathematical and Statistical Psychology, 53, 17-29.

Morris, SB, & DeShon, RP. (2002). Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychological Methods, 7, 105-125.

Olkin, I, & Pratt, JW. (1958). Unbiased estimation of certain correlation coefficients. Annals of Mathematical Statistics, 29, 201-211.

Ray, JW, & Shadish, WR. (1996). How interchangeable are different estimators of effect size? Journal of Consulting and Clinical Psychology, 64, 1316-1325.

Schmidt, FL. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129.

Smithson, MJ. (2003a). Confidence intervals. Thousand Oaks, CA: Sage.

Smithson, MJ. (2003b). Scripts and software for noncentral confidence interval and power calculations [Computer software]. Retrieved January 25, 2007, from http://www.anu.edu.au/psychology/people/smithson/details/CIstuff/CI.html

Steiger, JH, & Fouladi, RT. (1997). Noncentrality interval estimation and the evaluation of statistical models. In Harlow, LL, Mulaik, SA, & Steiger, JH (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Lawrence Erlbaum.

Thompson, B (Ed.). (2001). Confidence intervals for effect sizes [Special section]. Educational and Psychological Measurement, 61, 517-667.

Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31, 25-32.

Wilkinson, L, & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Journal of Educational and Behavioral Statistics, Vol. 32, No. 1, 39-60 (2007)
DOI: 10.3102/1076998606298034

