Journal of Educational and Behavioral Statistics
Odds Ratio, Delta, ETS Classification, and Standardization Measures of DIF Magnitude for Binary Logistic Regression

Patrick O. Monahan

Indiana University

Colleen A. McHorney

Merck & Co., Inc

Timothy E. Stump
Anthony J. Perkins

Regenstrief Institute, Inc. and Indiana University


    Abstract
 
Previous methodological and applied studies that used binary logistic regression (LR) for detection of differential item functioning (DIF) in dichotomously scored items either did not report an effect size or did not employ several useful measures of DIF magnitude derived from the LR model. Equations are provided for these effect size indices. Using two large data sets, the authors demonstrate the usefulness of these effect sizes for judging practical importance: the LR adjusted odds ratio and its conversions to the delta metric, the Educational Testing Service (ETS) classification system, and the p metric; the LR model-based standardization indices, using various weights for averaging stratum-specific differences in fitted probabilities; and a p metric classification system. Pros and cons of these effect sizes are discussed. Recommendations are offered. These LR effect sizes will be valuable to practitioners, particularly for preventing flagging of statistically significant but practically unimportant DIF in large samples.

Keywords: differential item functioning • logistic regression • effect sizes

In differential item functioning (DIF) analyses, groups are compared on item performance after adjusting for overall performance on the measured trait (Holland & Wainer, 1993). Since Swaminathan and Rogers (1990) applied the binary logistic regression (LR) procedure to the detection of DIF in dichotomous test items, the LR method has become increasingly popular for this purpose. However, Swaminathan and colleagues focused on hypothesis testing (Narayanan & Swaminathan, 1996; Rogers & Swaminathan, 1993; Swaminathan & Rogers, 1990). It is important to incorporate an effect size into flagging rules, especially in large samples, because high power can yield significance for practically unimportant effect sizes (e.g., Kirk, 1996).

Several methodological and applied studies investigating binary LR for DIF have flagged items for DIF based only on statistical significance (Clauser, Nungester, Mazor, & Ripkey, 1996; Huang & Dunbar, 1998; Kwak, Davenport, & Davison, 1998; Marshall, Mungas, Weldon, Reed, & Haan, 1997; Mazor, Kanjee, & Clauser, 1995; Whitmore & Schumacker, 1999; Woodard, Auchus, Godsall, & Green, 1998). Previous attempts to report effect sizes for binary LR have included using the LR Wald chi-square value (Huang & Dunbar, 1998), reporting raw or standardized LR coefficients on the log odds scale (Borsboom, Mellenbergh, & Heerden, 2002; Clauser & Mazor, 1998; Millsap & Everson, 1993; Swanson, Clauser, Case, Nungester, & Featherman, 2002), presenting $R^2$-like measures (Swanson et al., 2002; Zumbo, 1999), calculating the partial gamma (Groenvold, Bjorner, Klee, & Kreiner, 1995), listing eta-squared (Whitmore & Schumacker, 1999), adopting a chance-corrected proportion of correct classification (Hess, Olejnik, & Huberty, 2001), and plotting fitted probabilities or fitted logits (Schmitt, Holland, & Dorans, 1993). These attempts contributed to the DIF literature. However, none of these works focused on several intuitive effect sizes that can easily be derived from binary LR: the adjusted odds ratio, delta statistic, Educational Testing Service (ETS) classification system, adjusted odds ratio reported on the p metric, and model-based standardization indices of conditional differences in proportions. We found only one DIF study that reported odds ratios for binary LR (Volk, Cantor, Steinbauer, & Cass, 1997).

The purposes of this article are to (a) provide and explain the equations for obtaining these useful effect sizes for the LR procedure, (b) demonstrate the application of these effect sizes, and (c) present the pros and cons of these effect sizes and offer guidance in how to use them. We focus here on effect sizes for uniform DIF. We are investigating LR effect sizes for nonuniform DIF. Although a strength of LR is powerful detection of nonuniform DIF, corresponding effect sizes require more research because the choice of weights for averaging stratum-specific measures is especially critical when interactions are present (e.g., Mosteller & Tukey, 1977).


    Theoretical Foundation and Effect Size Formulas
 
The Logistic Regression (LR) Procedure for DIF Detection
In the binary LR model, the probability of endorsing a dichotomously scored item is


$$P(u = 1) = \frac{\exp(\beta_0 + \beta_1 x + \beta_2 g + \beta_3 xg)}{1 + \exp(\beta_0 + \beta_1 x + \beta_2 g + \beta_3 xg)} \qquad (1)$$

and the log odds (or logit) of endorsing the item is modeled as

$$\ln\left[\frac{P(u = 1)}{1 - P(u = 1)}\right] = \beta_0 + \beta_1 x + \beta_2 g + \beta_3 xg \qquad (2)$$

where ln is the natural logarithm, x is a measure of overall proficiency (usually total score), g is a dummy variable representing group membership (traditionally, 1 = reference group, 0 = focal group), xg is the interaction term between total score and group membership, and $\beta_0$ is the intercept (Swaminathan & Rogers, 1990). The 1 df test of $\beta_3 = 0$ is a test of nonuniform DIF. If nonuniform DIF is absent, the xg term can be deleted from the model, and then the 1 df test of $\beta_2 = 0$ provides a test of uniform DIF (see Footnote 1). Effect sizes complement these tests. We describe two categories of effect sizes distinguished by the metric of defining departures from the null hypothesis: conditional log odds ratios versus conditional differences in proportions.
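
To make the model concrete, the following is a minimal Python sketch of Equations 1 and 2. The function names and the coefficient values are illustrative assumptions, not estimates from the article. It shows that when the interaction coefficient is zero, the reference-minus-focal difference in logits equals the group coefficient at every score level, which is what uniform DIF means in this model.

```python
import math

def item_logit(x, g, b0, b1, b2, b3=0.0):
    """Log odds of endorsing the item (Equation 2)."""
    return b0 + b1 * x + b2 * g + b3 * x * g

def item_probability(x, g, b0, b1, b2, b3=0.0):
    """Probability of endorsing the item (Equation 1)."""
    z = item_logit(x, g, b0, b1, b2, b3)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients chosen only for illustration.
B0, B1, B2 = -3.0, 0.4, 0.5

# With b3 = 0 (uniform DIF), the logit gap between groups is b2 at every score.
logit_gaps = [
    item_logit(x, 1, B0, B1, B2) - item_logit(x, 0, B0, B1, B2)
    for x in range(0, 21, 5)
]
```

A nonzero `b3` would make the gap change with `x`, which is the nonuniform case that the article sets aside.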

Effect Sizes for LR Based on the Conditional-Log-Odds-Ratio Definition of DIF
A natural effect size for LR is the odds ratio. LR coefficients ($\hat{\beta}_j$) are estimated on the log odds scale. The exponential of $\hat{\beta}_j$ [i.e., $\exp(\hat{\beta}_j)$] yields the maximum likelihood estimated odds ratio of the event of interest for every one-unit increase in the jth predictor, adjusted for other covariates in the LR model (Hosmer & Lemeshow, 2000). Thus, the exponential of $\hat{\beta}_2$ provides the reference-to-focal odds ratio of endorsing the item, conditional on proficiency (see Footnote 2):

$$\hat{\alpha}_{LR} = \exp(\hat{\beta}_2) \qquad (3)$$

Odds ratios range from 0 to $\infty$. Values of $\hat{\alpha}_{LR}$ further from 1.0 represent greater DIF magnitude. An odds ratio and its reciprocal are equivalent in strength but not symmetrical in distance from the null value of 1.0 (e.g., 4.0 and 0.25).

Another option for an effect size is to transform $\hat{\alpha}_{LR}$ to the logistic definition of the delta scale, used by ETS to measure item difficulty. We use the formula Holland and Thayer (1988) used to convert the Mantel-Haenszel (MH) odds ratio ($\hat{\alpha}_{MH}$) to the MH delta-DIF statistic (MH-D-DIF or $\hat{\Delta}_{MH}$):

$$\text{LR-D-DIF} = \hat{\Delta}_{LR} = -2.35 \ln(\hat{\alpha}_{LR}) = -2.35\,\hat{\beta}_2 \qquad (4)$$

It is apparent that LR-D-DIF is a simple linear rescaling of the regression coefficient, $\hat{\beta}_2$ (see Footnote 3).
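
The two conversions just described can be sketched in a few lines of Python. The sketch assumes the delta conversion has the form -2.35 ln(odds ratio), which matches the equivalence between a delta of 1.0 and a coefficient of .4255 given later in this section; the function names are hypothetical.

```python
import math

DELTA_SCALE = -2.35  # assumed multiplier of the log odds ratio on the delta scale

def odds_ratio_from_beta(beta2):
    """Adjusted reference-to-focal odds ratio from the group coefficient."""
    return math.exp(beta2)

def d_dif_from_beta(beta2):
    """LR-D-DIF as a linear rescaling of the group coefficient."""
    return DELTA_SCALE * beta2

def d_dif_from_odds_ratio(alpha):
    """LR-D-DIF computed from an odds ratio."""
    return DELTA_SCALE * math.log(alpha)

# An odds ratio and its reciprocal are asymmetric around 1.0 but map to
# delta values of equal size and opposite sign around 0.
delta_for_4 = d_dif_from_odds_ratio(4.0)
delta_for_quarter = d_dif_from_odds_ratio(0.25)
```

This is one practical reason to report the delta metric alongside the odds ratio: DIF favoring either group is read on the same distance scale.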

In addition, one can calculate the ETS classification system (Dorans & Holland, 1993):

Category A. Items with negligible or nonsignificant DIF. Defined by LR-D-DIF not significantly different from zero or absolute value less than 1.0.
Category B. Items with slight to moderate magnitude of statistically significant DIF. Defined by LR-D-DIF significantly different from zero and absolute value of at least 1.0 and either less than 1.5 or not significantly greater than 1.0.
Category C. Items with moderate to large magnitude of statistically significant DIF. Defined by absolute value of LR-D-DIF of at least 1.5 and significantly greater than 1.0.

Notice that assigning Categories A and B entails using LR to test $H_0\colon \beta_2 = 0$. Assigning Categories B and C requires testing $H_0\colon |\text{LR-D-DIF}| \le 1.0$. Practitioners can perform the latter test in LR by testing $H_0\colon |\beta_2| \le .4255$ (i.e., $\Delta_{LR}$ of 1.0 equals $\beta_2$ of $-.4255$).
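
The classification rules above can be sketched as a small Python function. Several details here are assumptions rather than the authors' procedure: both tests are Wald-type tests based on the standard normal distribution, the test against 1.0 is one-sided, the standard error of LR-D-DIF is taken as 2.35 times the standard error of the coefficient, and a single significance level is used for both tests. The function name and arguments are hypothetical.

```python
from statistics import NormalDist

DELTA_PER_BETA = 2.35  # |LR-D-DIF| = 2.35 * |beta2|

def ets_category(beta2, se_beta2, alpha_level=0.05):
    """Return 'A', 'B', or 'C' for one item (a sketch under the stated assumptions)."""
    normal = NormalDist()
    abs_delta = DELTA_PER_BETA * abs(beta2)
    se_delta = DELTA_PER_BETA * se_beta2

    # Two-sided Wald test of H0: beta2 = 0.
    p_zero = 2.0 * (1.0 - normal.cdf(abs(beta2) / se_beta2))
    differs_from_zero = p_zero < alpha_level

    # One-sided Wald-type test of H0: |LR-D-DIF| <= 1.0 (assumed form).
    p_one = 1.0 - normal.cdf((abs_delta - 1.0) / se_delta)
    exceeds_one = p_one < alpha_level

    if not differs_from_zero or abs_delta < 1.0:
        return "A"
    if abs_delta >= 1.5 and exceeds_one:
        return "C"
    return "B"
```

Note that an item with an absolute delta above 1.5 can still land in Category B when its standard error is too large to establish that the delta exceeds 1.0.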

It is also possible for reporting purposes to convert a conditional log-odds-ratio-based index to the metric of differences in item-proportion-correct called the p metric. We use the formula that Dorans and Holland (1993) used to convert $\hat{\alpha}_{MH}$ to MH-P-DIF:

$$\text{LR-P-DIF} = P_f - P_r^{\Psi} \qquad (5)$$

where

$$P_r^{\Psi} = \frac{\hat{\alpha}_{LR} P_f}{(1 - P_f) + \hat{\alpha}_{LR} P_f} \qquad (6)$$

The term $P_r^{\Psi}$ is the predicted proportion of examinees endorsing the item in the reference group based on $\hat{\alpha}_{LR}$, and $P_f$ is the proportion of examinees endorsing the item in the focal group.
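
As an illustration of this conversion, the sketch below assumes the Dorans and Holland (1993) form: the focal proportion is converted to odds, multiplied by the odds ratio, converted back to a proportion, and subtracted from the focal proportion. Function names and input values are hypothetical.

```python
def predicted_reference_proportion(alpha_lr, p_f):
    """Reference-group proportion implied by the odds ratio and the focal proportion."""
    return alpha_lr * p_f / ((1.0 - p_f) + alpha_lr * p_f)

def lr_p_dif(alpha_lr, p_f):
    """Focal proportion minus the predicted reference proportion."""
    return p_f - predicted_reference_proportion(alpha_lr, p_f)

# The same odds ratio (2.0) at three hypothetical focal endorsement rates.
p_dif_by_rate = {p_f: lr_p_dif(2.0, p_f) for p_f in (0.05, 0.50, 0.95)}
```

With an odds ratio of 2.0, the index is about -.045 at a focal rate of .05, about -.167 at .50, and about -.024 at .95, so a fixed odds ratio corresponds to a smaller p metric difference for items that are rarely or almost always endorsed.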

Effect Sizes for LR Based on the Conditional-Difference-in-Proportions Definition of DIF
The contingency-table standardization (STD) procedure defines departures from the null DIF hypothesis with conditional differences in proportions, and the resulting measure is usually reported in the p metric (STD-P-DIF) (Dorans & Holland, 1993; Dorans & Kulick, 1986). One could estimate a LR model-based STD measure of DIF:


$$\text{LR-STD-P-DIF} = \frac{\sum_m w_m \left(P_{fm}^{LR} - P_{rm}^{LR}\right)}{\sum_m w_m} \qquad (7)$$

where $P_{fm}^{LR}$ and $P_{rm}^{LR}$ are the focal and reference group proportions at total score level m predicted from the LR model, and $w_m$ is the weight for stratum m.

This index is reminiscent of item response theory (IRT) model-based standardization (Wainer, 1993) except instead of integrating over $\theta$, averaging occurs over total scores. Historically, absolute values between .05 and .10 are inspected to ensure that no possible DIF is overlooked, and absolute values above .10 are considered more unusual and should be examined (Dorans & Kulick, 1986). One could implement a p metric classification system for LR, applicable to LR-STD-P-DIF or LR-P-DIF:

Category A. Items with negligible or nonsignificant DIF. Defined by p index not significantly different from zero or 0 ≤ |p index| ≤.05.
Category B. Items with marginal magnitude of statistically significant DIF. Defined by p index significantly different from zero and .05 < |p index| ≤.10.
Category C. Items with definite magnitude of statistically significant DIF. Defined by p index significantly different from zero and |p index|> .10.
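
The three p metric categories can be written as a short Python function. This is only a sketch: the function name is hypothetical, and the significance decision is passed in as a flag because the source does not tie the system to one particular test.

```python
def p_metric_category(p_index, significant):
    """Classify a p metric index (LR-P-DIF or LR-STD-P-DIF) as 'A', 'B', or 'C'."""
    magnitude = abs(p_index)
    if not significant or magnitude <= 0.05:
        return "A"  # negligible or nonsignificant DIF
    if magnitude <= 0.10:
        return "B"  # marginal magnitude, statistically significant
    return "C"      # definite magnitude, statistically significant
```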

The weights ($w_m$) for averaging conditional differences in proportions in the STD procedure have traditionally been based on intuitive rationale. In DIF studies, $w_m$ is often chosen to be the number of focal group examinees at each stratum ($N_{fm}$) (Dorans & Kulick, 1986). Other plausible weights include (Dorans & Holland, 1993; Mosteller & Tukey, 1977) (a) the number of reference group examinees at each stratum ($N_{rm}$), (b) the number of the total examinees at each stratum ($N_{tm}$), or (c) the relative frequency of some real or hypothetical standard group. One could also use Cochran’s (1954) statistically driven weights (Dorans & Holland, 1993):

$$w_m = c_m = \frac{N_{rm} N_{fm}}{N_{rm} + N_{fm}} \qquad (8)$$

Another option, available in the STDIF software (Robin, 2001; Zenisky, Hambleton, & Robin, 2003), is the equal weight ($w_m = 1$), which yields an unweighted average.
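
The model-based standardization index and the weighting options discussed above can be sketched as follows. The sketch assumes a uniform-DIF model (no interaction term), takes the index to be a weighted mean of focal-minus-reference fitted probabilities across total score levels, and takes the Cochran weight to be the product of the two group counts divided by their sum. All function names, scheme labels, and data are hypothetical.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def stratum_weights(n_focal, n_reference, scheme):
    """Weights w_m for each total score stratum m."""
    weights = []
    for n_f, n_r in zip(n_focal, n_reference):
        if scheme == "focal":
            weights.append(n_f)
        elif scheme == "reference":
            weights.append(n_r)
        elif scheme == "total":
            weights.append(n_f + n_r)
        elif scheme == "cochran":
            total = n_f + n_r
            weights.append(n_f * n_r / total if total else 0.0)
        elif scheme == "equal":
            weights.append(1.0)
        else:
            raise ValueError(f"unknown weighting scheme: {scheme}")
    return weights

def lr_std_p_dif(scores, n_focal, n_reference, b0, b1, b2, scheme="focal"):
    """Weighted mean of focal-minus-reference fitted probabilities."""
    weights = stratum_weights(n_focal, n_reference, scheme)
    numerator = 0.0
    for x, w in zip(scores, weights):
        p_focal = expit(b0 + b1 * x)           # g = 0
        p_reference = expit(b0 + b1 * x + b2)  # g = 1
        numerator += w * (p_focal - p_reference)
    return numerator / sum(weights)

# Hypothetical strata: total scores 0 to 4 with a right-skewed focal group.
SCORES = [0, 1, 2, 3, 4]
N_FOCAL = [50, 30, 10, 5, 5]
N_REFERENCE = [20, 20, 20, 20, 20]
index_by_scheme = {
    scheme: lr_std_p_dif(SCORES, N_FOCAL, N_REFERENCE, -2.0, 1.0, 0.5, scheme)
    for scheme in ("focal", "reference", "total", "cochran", "equal")
}
```

Because the focal group in this made-up example is concentrated at low scores, where fitted probabilities are near zero, the focal-weighted index is smaller in absolute value than the equal-weighted one.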


    Examples
 
We performed a gender (1 = male, 0 = female) DIF analysis in two data sets. The first data set was the Supplement on Aging (SOA) to the 1984 National Health Interview Survey (U.S. Department of Health and Human Services, 1997). This study was designed to assess the future needs of the elderly in the United States. Participants were 55 and older (n = 12,943). We analyzed 23 dichotomous functional status items. Each item measured whether participants reported a problem (1 = yes, 0 = no) performing an activity. The second data set was from the Established Populations for Epidemiologic Studies of the Elderly (EPESE). Persons (age ≥ 65) were interviewed to identify predictors of mortality (Taylor, Wallace, Ostfeld, & Blazer, 1998). We analyzed the 20 dichotomous items of the Center for Epidemiologic Studies Depression Scale (CES-D; Radloff, 1977) obtained at baseline on 3,401 participants from the Duke site. The CES-D is a widely used self-report measure of depressive symptomatology for the general population. Each item was scored for presence (1) or absence (0) of a depressive symptom.

Statistical Methods
We modeled Equation 2, without the interaction term, using binary LR with the SAS LOGISTIC procedure. The matching score was the total score including the studied item (Holland & Thayer, 1988). For the purposes of this article, we examined only uniform DIF. Graphs of empirical logits and LOWESS smoothed curves indicated the LR assumption of linearity of the logit was reasonably satisfied for all items (Monahan, 2004). The SOA and EPESE items were approximately unidimensional because the cross-validated DETECT index was .16 and .28, respectively (Monahan, Stump, Finch, & Hambleton, in press; Roussos & Ozbek, 2006). In SOA data, there were 7,822 women and 5,121 men. In EPESE data, there were 2,203 women and 1,198 men. For both data sets, total score was skewed right (less skewed for EPESE), and men had a slightly lower mean and variance than women.

Result of Using LR Statistical Test Alone to Detect Uniform DIF
Table 1 displays DIF effect sizes for functional status items from the SOA data set. Each row (studied item) in Table 1 represents a different LR model. The right-most column shows the observed significance of the two-sided LR Wald chi-square test of uniform DIF. Based on this test alone, even if we used a conservative Bonferroni-adjusted significance level of .00217 (i.e., .05/23), 11 of 23 items would be flagged for DIF (Table 1, bold items). Most DIF studies using LR have relied on Wald tests alone. We now illustrate the importance of effect sizes.
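
The Bonferroni-adjusted flagging rule used in this paragraph amounts to the following arithmetic, shown here as a minimal Python sketch. The function names and the example p values are hypothetical and are not taken from Table 1.

```python
def bonferroni_threshold(family_alpha, n_tests):
    """Per-test significance level that keeps the familywise level at family_alpha."""
    return family_alpha / n_tests

def flag_by_significance(p_values, family_alpha=0.05):
    """Flag each item whose Wald p value falls below the adjusted level."""
    threshold = bonferroni_threshold(family_alpha, len(p_values))
    return [p < threshold for p in p_values]

# 23 studied items, as in the SOA analysis.
soa_threshold = bonferroni_threshold(0.05, 23)
```

A rule of this kind looks only at the p value, which is why it can flag items whose effect sizes are small when samples are large.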


TABLE 1 Logistic Regression (LR) Effect Sizes for Measuring Uniform Differential Item Functioning (DIF): Gender DIF in Supplement on Aging (SOA) Functional Status Items

 
Effect Sizes for LR Based on the Conditional-Log-Odds-Ratio Definition of DIF
We will interpret the effect sizes in Table 1 from left to right, beginning with the LR odds ratio ($\hat{\alpha}_{LR}$). By sorting on ascending LR-D-DIF (equivalently, descending $\hat{\alpha}_{LR}$), items were conveniently grouped by direction and magnitude of DIF. Thus, for the 11 statistically significant items, the 6 items at the top were more greatly endorsed by men and the 5 items at the bottom were more greatly endorsed by women after adjusting for total score. The $\hat{\alpha}_{LR}$ for these 11 items varied in strength from 1.55 to 2.94 (men displayed greater functional problems after adjustment) and from 0.74 to 0.21 (women displayed greater functional problems after adjustment) (Table 1). For example, the LR model estimated that the odds that men reported having a problem with lifting and carrying 25 pounds were about one fifth (0.21) of the odds that women reported this problem, adjusted for overall functional status. The LR-estimated odds of having a problem with using the telephone were almost 3 (2.94) times greater for men compared to women, controlling for overall functional status (Table 1). However, the odds of having a problem with walking were only 1 1/2 times greater for men. Therefore, $\hat{\alpha}_{LR}$ indicates that not all 11 statistically significant items exhibited equally important DIF magnitude.

Likewise, LR-D-DIF ($\hat{\Delta}_{LR}$) for the 11 statistically significant items varied in strength from –1.02 to –2.53 and from 0.71 to 3.63 (Table 1). According to the LR-based ETS classification system (LR-ETS-class in Table 1), only 5 of 11 items displayed moderate to large magnitude of statistically significant DIF (C), and 5 items showed slight to moderate DIF (B). We also used the transformation of $\hat{\alpha}_{LR}$ to the p metric (LR-P-DIF in Table 1) to classify items according to the p metric system described earlier (LR-P-DIF-class in Table 1). This LR-P-DIF classification system indicated weaker DIF than the LR-ETS classification system for 5 items and stronger DIF than LR-ETS classification for 2 items (Table 1). This was because the nonlinear relationship between LR-D-DIF and LR-P-DIF depends on the proportion of focal group examinees endorsing the item ($P_f$). Notice that the $P_f$ values for Items 2, 3, 9, 11, and 23 were closer to zero than the $P_f$ values for Items 14 and 16 (Table 1). Using Equations 4, 5, and 6, it can be shown that for a given value of $\hat{\alpha}_{LR}$ or $\hat{\Delta}_{LR}$, LR-P-DIF will be smaller in absolute value for lowly or highly endorsed items than for moderately endorsed items. Likewise, for a given value of LR-P-DIF, $\hat{\alpha}_{LR}$ and $\hat{\Delta}_{LR}$ will be further from their null values for $P_f$ near zero or one than for $P_f$ near .50 (see Discussion).

Effect Sizes for LR Based on the Conditional-Difference-in-Proportions Definition of DIF
Using the traditional weight for DIF analyses ($w_m = N_{fm}$), the magnitudes of the LR-based standardization index (LR-STD-P-DIF-focal in Table 1) for the 11 statistically significant items were even smaller than the log-odds-ratio-based p metric magnitudes of LR-P-DIF (Table 1). A p metric classification system based on LR-STD-P-DIF-focal (i.e., LR-STD-P-DIF-focal-class in Table 1) resulted in flagging only 1 item as definite DIF (C) and only 1 item as marginal DIF (B). For both data sets, the LR standardization effect size was very similar when two other weights were used [total group distribution ($N_{tm}$) and Cochran ($c_m$)], differing at most from LR-STD-P-DIF-focal by .02 for any item but usually by .01 or less (data not shown). The standardization index using $w_m = 1$ (LR-STD-P-DIF-equal in Table 1) differed from the other three LR-STD-P-DIF indices ($w_m = N_{fm}, N_{tm}, c_m$), generally displaying slightly greater absolute values, resulting in 1 item demonstrating definite DIF (C) and 5 items revealing marginal DIF (B).

Abbreviated Results for EPESE Data
Table 2 shows effect sizes for the CES-D depression items. Again, fewer items were flagged when effect sizes supplemented the LR Wald test. Of five statistically significant items, only one item displayed moderate to large DIF (C) by any classification system (Table 2). The main difference between Table 1 and Table 2 is that compared to functional status items (Table 1), depression items (Table 2) showed less difference between the log-odds-ratio-based p index (LR-P-DIF) and the focal-group standardization p index (LR-STD-P-DIF-focal). This was because depression items revealed less DIF magnitude and less skewness of total score. Specifically, using Equations 5, 6, and 7, where $w_m = N_{fm}$, it can be shown that the difference between LR-P-DIF and LR-STD-P-DIF-focal depends on $\hat{\alpha}_{LR}$, $P_{fm}^{LR}$, and $N_{fm}$ ($P_f$ is a function of $N_{fm}$ and $P_{fm}^{LR}$; $P_{rm}^{LR}$ is a function of $\hat{\alpha}_{LR}$ and $P_{fm}^{LR}$ because the odds ratio is assumed to be constant across strata in uniform-DIF LR). In short, the magnitude of LR-STD-P-DIF-focal increases as $P_{fm}^{LR}$ values near .50 receive greater weight (i.e., larger $N_{fm}$) than $P_{fm}^{LR}$ values near zero or one.


TABLE 2 Logistic Regression (LR) Effect Sizes for Measuring Uniform Differential Item Functioning (DIF): Gender DIF in Established Populations for Epidemiologic Studies of the Elderly (EPESE) Depression Items

 
Sensitivity of Results
Results were very similar after deleting examinees at the floor and ceiling. We computed Cochran’s (1954) test criterion by specifying $w_m = c_m$ in LR-STD-P-DIF and by using LR-predicted proportions in the standard error; the observed significance was extremely similar to the LR Wald observed significance for all items in both data sets (differed at most by .008). This is not surprising given that Cochran (1954) derived these weights for a test criterion that would be powerful for detecting an alternative hypothesis of a constant difference on either the logit or probit scale. Thus, in LR, although Cochran weights are an option when computing LR-STD-P-DIF, the Cochran test might be an unnecessary adjunct to the LR Wald test.


    Discussion
 
Choosing an Effect Size: Pros and Cons
The effect sizes can be contrasted on a number of dimensions. First, as for ease of interpretation, indices reported on the delta and p metric are symmetrical around their null value of zero, which facilitates interpreting DIF in opposite directions. However, those experienced with interpreting odds ratios may find $\hat{\alpha}_{LR}$ easier to interpret than LR-D-DIF, which is on the ETS-preferred delta metric. For data conforming to the two-parameter logistic (2PL) IRT model, one advantage of LR-D-DIF is that the MH-D-DIF parameter ($\Delta_{2PL}$) can be written as a linear rescaling of the difference between b parameters (Roussos, Schnipke, & Pashley, 1999; see Footnote 4): $\Delta_{2PL} = 4a(b_R - b_F)$. The $\alpha_{MH}$ parameter also shares the advantage of being related, although nonlinearly, to IRT b-DIF (Roussos et al., 1999). The p metric is probably the most universally understood metric, conveniently connected to total and true score metrics. Practitioners should choose an effect size that they and their readership can easily interpret.
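
The factor of 4 in the relation between the delta metric and the 2PL b parameters can be traced with a short derivation. This is a reconstruction under two assumptions that the text does not state: the 2PL logit is written with the conventional scaling constant 1.7, and the studied item has the same discrimination a in the reference (R) and focal (F) groups.

$$\ln\frac{P_R(\theta)}{1 - P_R(\theta)} - \ln\frac{P_F(\theta)}{1 - P_F(\theta)} = 1.7a(\theta - b_R) - 1.7a(\theta - b_F) = 1.7a(b_F - b_R)$$

$$\Delta_{2PL} = -2.35 \times 1.7a(b_F - b_R) = 3.995\,a(b_R - b_F) \approx 4a(b_R - b_F)$$

Under these assumptions the conditional log odds ratio does not depend on $\theta$, so multiplying it by -2.35 gives a delta value that is proportional to the difference in item difficulty between the groups.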

Second, practitioners should choose an effect size whose metric for defining departures from the null hypothesis supplies the most valid definition of DIF for their purpose. Specifically, relative to conditional odds ratios ($\hat{\alpha}_{LR}$) or log odds ratios ($\hat{\Delta}_{LR}$), conditional differences between proportions (LR-STD-P-DIF) will be compressed for items with low or high endorsement rates. However, from another perspective, odds ratios and log odds ratios play up small differences in proportions that are near zero or near one. LR-P-DIF is based on the conditional log-odds-ratio definition but is reported on the p metric. Therefore, LR-P-DIF shares some properties with LR-STD-P-DIF (ease of interpretation and lower magnitude, relative to odds ratios, for lowly or highly endorsed items).

Third, in terms of fundamental connections to the LR model, $\hat{\alpha}_{LR}$ is a natural estimator, fundamentally connected to a parameter of the LR model. LR-D-DIF is also a simple rescaling of an estimated LR parameter and therefore is fundamentally connected to the LR model. A disadvantage of the LR-STD-P-DIF index is that it is the most removed from the LR parameter estimation method. However, this does not invalidate it as a descriptive measure of DIF.

Fourth, as far as ease of programming, standard software for LR automatically provides $\hat{\alpha}_{LR}$. LR-D-DIF can easily be calculated. LR-P-DIF requires slightly more programming to convert $\hat{\alpha}_{LR}$ to LR-P-DIF. LR-STD-P-DIF requires the most additional programming because predicted proportions at each total score level must be first computed and then weighted.

Fifth, the purpose of weights in standardization is not only to standardize according to the distribution of interest but also to yield smaller weight to sparse strata that provide less precise information. Using equal weights could be dangerous if one or more strata are sparse. (None of the values for total score were sparse for the present data sets.) If one chooses an outside (real or hypothetical) standard distribution, one must be careful to not combine large weights with ill-determined differences in proportions (Mosteller & Tukey, 1977).

Recommendations on How to Use the Effect Sizes
First, we recommend using an effect size and a statistical test when deciding whether items exhibit DIF. The effect size prevents flagging unimportant differences in large samples, and the statistical test prevents flagging noise in small samples. Unimportant differences can be significant, as demonstrated here, when using a statistical test alone, even if statistical tests are conservatively adjusted for multiple comparisons. In addition, effect sizes and their classifications help distinguish between levels of nonnegligible DIF (e.g., B vs. C).

Second, practitioners must decide what values of the effect size represent negligible, moderate, and large magnitudes for the intended purpose. For example, ETS uses thresholds of 1.0 and 1.5 on the absolute value of the delta metric, which are equivalent to odds ratios greater than 1.53 (or less than 0.65) and greater than 1.89 (or less than 0.53), respectively. Users of STD procedures often use .05 and .10 thresholds on the absolute value of the p metric. In the medical sciences, the $\hat{\alpha}_{LR}$ thresholds of 1.5 and 2.0 are common due to convenient interpretations of one and one-half and twice the odds, respectively. However, smaller thresholds are used (e.g., 1.1) if the exposure is prevalent and disease serious (e.g., when determining whether risk of heart disease is associated with hormone supplements). Interestingly, $\hat{\alpha}_{LR}$ thresholds of 1.5 and 2.0 are nearly equivalent to the delta thresholds used in the ETS classification system.
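
The threshold equivalences quoted in this paragraph can be checked with a few lines of Python. The sketch assumes the delta conversion has the form -2.35 ln(odds ratio); the function names are hypothetical.

```python
import math

def odds_ratio_from_delta(delta):
    """Invert delta = -2.35 * ln(odds ratio)."""
    return math.exp(-delta / 2.35)

def delta_from_odds_ratio(alpha):
    """Delta value for a given odds ratio."""
    return -2.35 * math.log(alpha)

# Odds ratios matching the ETS thresholds on |delta|.
ets_as_odds_ratios = {
    abs_delta: (odds_ratio_from_delta(-abs_delta), odds_ratio_from_delta(abs_delta))
    for abs_delta in (1.0, 1.5)
}

# Absolute delta values matching the odds-ratio thresholds common in medicine.
medical_as_deltas = {alpha: abs(delta_from_odds_ratio(alpha)) for alpha in (1.5, 2.0)}
```

Rounded to two decimals, the delta thresholds of 1.0 and 1.5 give odds ratios of 1.53 (or 0.65) and 1.89 (or 0.53), and odds ratios of 1.5 and 2.0 give absolute delta values of about 0.95 and 1.63, which is the near equivalence noted above.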

Third, one can take steps to facilitate interpretations. One could calculate the reciprocal of odds ratios less than one. By calculating 1/.74 = 1.35, one can readily see that .74 for Item 21 in Table 1 is not as strong as 1.55 for Items 5 and 16. One can use graphs (e.g., scatter, line, and bar), which help discern relative distances between DIF magnitudes. In addition, sorting items by direction and magnitude of DIF in tables, as we did here, aids interpretation.

Fourth, these effect sizes can be used to facilitate comparisons of DIF procedures. One could compare MH, LR, SIBTEST, STD, and IRT procedures on the p metric (using LR-P-DIF or LR-STD-P-DIF for LR). Likewise, one could compare procedures on the odds ratio or delta metric, where STD-P-DIF and the latent-true-score adjusted difference in proportions of SIBTEST are converted using a formula similar to Equation 22 in Dorans and Holland (1993).

Limitations
The present analyses employed large sample sizes. In smaller sample sizes, the discrepancy between statistical significance versus flagging items with the combination of effect sizes and statistical significance should not be as great. The degree of discrepancy observed here between LR-D-DIF, LR-P-DIF, and LR-STD-P-DIF may differ for other data sets.

Conclusions
These effect sizes and classification systems have received little attention in the DIF literature for binary LR: the adjusted odds ratio ($\hat{\alpha}_{LR}$), LR-D-DIF ($\hat{\Delta}_{LR}$), LR-P-DIF, LR model-based standardization indices (LR-STD-P-DIF), and the ETS and p metric classification systems. The present examples demonstrate that these effect sizes are quite useful for preventing practically unimportant DIF from being flagged, especially in large samples. There are various pros and cons for choosing among these effect sizes. When steps are taken for their proper use, these effect sizes should be of great benefit to practitioners.


    Footnotes
 
1 The original proposal was to use the two df simultaneous test of uniform and nonuniform differential item functioning (DIF); however, when only uniform DIF is present, including the interaction term in the test may decrease power (Swaminathan & Rogers, 1990).

2 We considered the ML subscript (maximum likelihood estimation); however, the LR subscript in Equation 3 reminds practitioners that the odds ratio was estimated by assuming a logistic regression (LR) model.

3 Jodoin and Gierl (2001) suggested that $R^2$-like indices are preferable to effect sizes based on $\hat{\beta}_2$ because the latter would depend on the coding of the group variable [i.e., reference cell (0/1) vs. deviations-from-means method (–1/1)]. However, we agree with Hosmer and Lemeshow (2000) that the reference cell method is more useful for LR because the exponential of $\hat{\beta}_2$ is interpreted as a ratio of odds for one group versus the other group. If one codes the focal group as 0 and the reference group as 1 and then models item endorsement, as we did here, $\hat{\alpha}_{LR}$ and $\hat{\Delta}_{LR}$ have the same interpretations as for the Mantel-Haenszel (MH) procedure. Reference cell coding is no more arbitrary than row and column specification in the MH procedure.

4 In this formula (i.e., Equation 16 in Roussos, Schnipke, & Pashley, 1999), item discrimination (a) for the two-parameter logistic (2PL) item response theory (IRT) model varies over items and the MH delta-DIF (MH-D-DIF) parameter is conditional on theta, whereas in Equation 13 in Donoghue, Holland, and Thayer (1993), a is constant across items because the MH-D-DIF parameter is conditional on observed total score where the corresponding IRT model is the Rasch model.

PATRICK O. MONAHAN is assistant professor, Division of Biostatistics, Department of Medicine, School of Medicine, Indiana University, 410 West 10th Street Suite 3000, Indianapolis, IN 46202; pmonahan{at}iupui.edu. His area of interest is measurement and statistics applied to the behavioral and social sciences.

COLLEEN A. McHORNEY, PhD, is director of outcomes research at Merck & Co., Inc., WP39-166, 770 Sumneytown Pike, West Point, PA 19486-0004. Her areas of expertise relate to the measurement and evaluation of patient-reported outcomes, including health status, quality of life, patient satisfaction, and patient preferences.

TIMOTHY E. STUMP is statistician, Regenstrief Institute, Inc. and the Indiana University Center for Aging Research; tstump{at}regenstrief.org. His area of interest is measurement and statistics in the medical sciences.

ANTHONY J. PERKINS is a statistical consultant for the Regenstrief Institute, Inc. and the Indiana University Center for Aging Research; tperkins348{at}sbcglobal.net. His area of interest is item bias in quality of life instruments.

This research was supported by NIA Grant R01 AG022067, NCI Grant R03 CA 113099-01, and the Mary Margaret Walther Program for Cancer Care Research. Suggestions by the editor and two anonymous reviewers led to improved presentation.

Manuscript received July 15, 2004. Accepted for publication August 2, 2005.


    References
 
Borsboom, D, Mellenbergh, GJ, & Heerden, Jv. (2002). Different kinds of DIF: A distinction between absolute and relative forms of measurement invariance and bias. Applied Psychological Measurement, 26, 433-450.[Abstract/Free Full Text]

Clauser, BE, & Mazor, KM. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31-44.

Clauser, BE, Nungester, RJ, Mazor, K, & Ripkey, D. (1996). A comparison of alternative matching strategies for DIF detection in tests that are multidimensional. Journal of Educational Measurement, 33, 202-214.[CrossRef][Web of Science]

Cochran, WG. (1954). Some methods for strengthening the common {chi}2 tests. Biometrics, 10, 417-451.[Medline] [Order article via Infotrieve]

Donoghue, JR, Holland, PW, & Thayer, DT. In Holland, PW, & Wainer, H (Eds.). (1993). A Monte Carlo study of factors that affect the Mantel-Haenszel and standardization measures of differential item functioning. Differential item functioning (p. 137-166). Hillsdale, NJ: Lawrence Erlbaum.

Dorans, NJ, & Holland, PW. In Holland, PW, & Wainer, H (Eds.). (1993). DIF detection and description: Mantel-Haenszel and standardization. Differential item functioning (p. 35-66). Hillsdale, NJ: Lawrence Erlbaum.

Dorans, NJ, & Kulick, E. (1986). Demonstrating the utility of the standardization approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23, 355-368.

Groenvold, M, Bjorner, JB, Klee, MC, & Kreiner, S. (1995). Test for item bias in a quality of life questionnaire. Journal of Clinical Epidemiology, 48, 805-816.

Hess, B, Olejnik, S, & Huberty, CJ. (2001, April). The efficacy of two improvement-over-chance effect size measures for two-group univariate comparisons under variance heterogeneity and nonnormality. Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.

Holland, PW, & Thayer, DT. (1988). Differential item performance and the Mantel-Haenszel procedure. In Wainer, H, & Braun, HI (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum.

Holland, PW, & Wainer, H (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.

Hosmer, DW, & Lemeshow, S. (2000). Applied logistic regression (2nd ed.). New York: John Wiley.

Huang, C-Y, & Dunbar, SB. (1998, April). Factors influencing the reliability of DIF detection methods. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.

Jodoin, MG, & Gierl, MJ. (2001). Evaluating Type I error and power rates using an effect size measure with the logistic regression procedure for DIF detection. Applied Measurement in Education, 14, 329-349.

Kirk, RE. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.

Kwak, N, Davenport, EC, Jr, & Davison, ML. (1998, April). A comparative study of observed score approaches and purification procedures for detecting differential item functioning. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Marshall, SC, Mungas, D, Weldon, M, Reed, B, & Haan, M. (1997). Differential item functioning in the Mini-Mental State Examination in English- and Spanish-speaking older adults. Psychology and Aging, 12, 718-725.

Mazor, KM, Kanjee, A, & Clauser, BE. (1995). Using logistic regression and the Mantel-Haenszel with multiple ability estimates to detect differential item functioning. Journal of Educational Measurement, 32, 131-144.

Millsap, RE, & Everson, HT. (1993). Methodological review: Statistical approaches for assessing measurement bias. Applied Psychological Measurement, 17, 297-334.

Monahan, PO. (2004, April). Examining the assumption of linearity of the logit in the logistic regression procedure for detecting DIF. Paper presented at the annual meeting of the National Council on Measurement in Education, San Diego, CA.

Monahan, PO, Stump, TE, Finch, H, & Hambleton, RK. (in press). Bias of exploratory and cross-validated DETECT index under unidimensionality. Applied Psychological Measurement.

Mosteller, F, & Tukey, JW. (1977). Data analysis and regression: A second course in statistics. Reading, MA: Addison-Wesley.

Narayanan, P, & Swaminathan, H. (1996). Identification of items that show nonuniform DIF. Applied Psychological Measurement, 20, 257-274.

Radloff, LS. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385-401.

Robin, F. (2001). STDIF: Standardization-DIF analysis program. Amherst: University of Massachusetts, School of Education.

Rogers, HJ, & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105-116.

Roussos, LA, & Ozbek, O. (2006). Formulation of the DETECT population parameter and evaluation of DETECT estimator bias. Journal of Educational Measurement, 43, 215-243.

Roussos, LA, Schnipke, DL, & Pashley, PJ. (1999). A generalized formula for the Mantel-Haenszel differential item functioning parameter. Journal of Educational and Behavioral Statistics, 24, 293-322.

Schmitt, AP, Holland, PW, & Dorans, NJ. (1993). Evaluating hypotheses about differential item functioning. In Holland, PW, & Wainer, H (Eds.), Differential item functioning (pp. 281-315). Hillsdale, NJ: Lawrence Erlbaum.

Swaminathan, H, & Rogers, J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27, 361-370.

Swanson, DB, Clauser, BE, Case, SM, Nungester, RJ, & Featherman, C. (2002). Analysis of differential item functioning (DIF) using hierarchical logistic regression models. Journal of Educational and Behavioral Statistics, 27, 53-75.

Taylor, JO, Wallace, RB, Ostfeld, AM, & Blazer, DG. (1998). Established populations for epidemiologic studies of the elderly, 1981–1993 (3rd ICPSR version) [Electronic version]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research.

U.S. Department of Health and Human Services. (1997). Longitudinal study of aging, 1984–1990 (6th ICPSR version) [Electronic version]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research.

Volk, RJ, Cantor, SB, Steinbauer, JR, & Cass, AR. (1997). Item bias in the CAGE screening test for alcohol use disorders. Journal of General Internal Medicine, 12, 763-769.

Wainer, H. (1993). Model-based standardized measurement of an item’s differential impact. In Holland, PW, & Wainer, H (Eds.), Differential item functioning (pp. 123-135). Hillsdale, NJ: Lawrence Erlbaum.

Whitmore, ML, & Schumacker, RE. (1999). A comparison of logistic regression and analysis of variance differential item functioning detection methods. Educational and Psychological Measurement, 59, 910-927.

Woodard, JL, Auchus, AP, Godsall, RE, & Green, RC. (1998). An analysis of test bias and differential item functioning due to race on the Mattis Dementia Rating Scale. Journals of Gerontology. Series B, Psychological Sciences and Social Sciences, 53, P370-P374.

Zenisky, AL, Hambleton, RK, & Robin, F. (2003). Detection of differential item functioning in large-scale state assessments: A study evaluating a two-stage approach. Educational and Psychological Measurement, 63, 51-64.

Zumbo, BD. (1999). A handbook on the theory and methods of differential item functioning (DIF): Logistic regression modeling as a unitary framework for binary and Likert-type (ordinal) item scores. Ottawa, Canada: Directorate of Human Resources Research and Evaluation, Department of National Defense.

Journal of Educational and Behavioral Statistics, Vol. 32, No. 1, 92-109 (2007)
DOI: 10.3102/1076998606298035




This article has been cited by other articles:


J. Schiffman, H. J. Sorensen, J. Maeda, E. L. Mortensen, J. Victoroff, K. Hayashi, N. M. Michelsen, M. Ekstrom, and S. Mednick
Childhood Motor Coordination and Adult Schizophrenia Spectrum Disorders
Am J Psychiatry, September 1, 2009; 166(9): 1041-1047.


C. E. DeMars
Modification of the Mantel-Haenszel and Logistic Regression DIF Procedures to Incorporate the SIBTEST Regression Correction
Journal of Educational and Behavioral Statistics, June 1, 2009; 34(2): 149-170.


A. Robitzsch and A. A. Rupp
Impact of Missing Data on the Detection of Differential Item Functioning: The Case of Mantel-Haenszel and Logistic Regression Analysis
Educational and Psychological Measurement, February 1, 2009; 69(1): 18-34.

