Testing for Local Dependence in Rasch's Multiplicative Gamma Model for Speed Tests
University of Groningen, the Netherlands
The author considers a latent trait model for the response time on a (set of) pure speed test(s), the multiplicative gamma model (MGM), which is based on the assumption that the test response times are approximately gamma distributed, with known index parameters and scale parameters depending on subject ability and test difficulty parameters. Like any other parametric latent trait model, the MGM is based on strong assumptions. One of these assumptions is local independence. Two statistical tests for checking if the local independence assumption holds are compared, using generated and empirical data.
Key Words: speed tests; Rasch models; local independence; Lagrange multiplier tests
By an (itemized) test of pure speed, we mean a set of easy items that any member of the subject population could solve given sufficient time, administered under strict time-limit conditions. The test items are supposed to be of the same type and depend on a single latent trait only. This definition of a pure speed test goes back to Gulliksen (1950; Lord & Novick, 1968). Speed tests can also be administered without imposing a time limit, instructing the subjects to complete the entire test as rapidly as possible. Response processes on speed tests can be conceived as a stationary process with constant success probability, to be modeled as a series of observations of the values of identically distributed response variables (Lord & Novick, 1968, chap. 5). For such processes, the numbers of items solved in a fixed time interval or the average response time per item are variables that contain the same information with regard to the subject abilities.
If one considers a test as a pure speed test, test scoring under time-limit constraints involves counting the number of items completed and, under unlimited time conditions, recording the time it takes to complete the test. Timing of individual-item latencies is usually not feasible, except when the tests are administered by computer. Obviously, pure speed and pure power tests are only idealizations. In a test where speed and power aspects are present, item responses and item latencies are no longer equivalent and should therefore both be collected.
Latent trait models for speed tests are rather rare, and models combining speed and power are even rarer. Verhelst, Verstralen, and Jansen (1997) formulated a latent trait model for time-limit tests that combined speed and power aspects.
The model has a logistic form, derived from the assumptions of a gamma distribution for an item response time and a generalized extreme-value distribution for a latent response given the response time. Other examples can be found in the work of White (1983), Thissen (1983), and Roskam (1997).
In educational measurement, the focus is usually on how well a respondent can perform on a task, and not on the time it takes to complete the task. One area, however, where speed is considered important is reading. Children learning to read are supposed to learn to read both correctly and fluently, and tests developed to measure reading performance, for example, single-word reading tests, tend to have scoring rules in which both aspects are scored separately, or in some combination. Reading tests are given in different formats. Oral reading tests are usually given in a continuous format, where the respondents are instructed to read a list of single words or a reading passage as rapidly as possible without making errors. If the responses cannot be registered electronically, the timing of individual items is usually not feasible. Here, I will restrict myself to modeling response speed in the unlimited time condition.
The Multiplicative Gamma Model for response times on (pure) speed tests (MGM) was originally designed to model reading speed (Rasch, 1960/1980). The MGM is a latent trait model for tests rather than items. Therefore, to be applicable, sets of at least two tests measuring the same trait are needed. The MGM can be used for solving some of the practical problems we encounter in achievement testing, such as test scoring and test calibration, in particular in the context of incomplete designs (Jansen, 1997a, 1997b; Jansen & Glas, 2001). The MGM is based on strong assumptions, and if these assumptions are not fulfilled, the validity of these uses is open to doubt. The availability of suitable model tests is therefore of prime importance.
The aim of this article is to present procedures for the evaluation of model fit. In particular, I will describe tests for checking if the local independence assumption holds.
Adopting the assumptions of a homogeneous Poisson process, Rasch (1960/1980) derived a latent trait model for simple time-limit tests. If the Poisson distribution holds for the number of items completed, then the interresponse times are exponentially distributed with scale parameter $\lambda$, and the time t it takes to complete m items is gamma distributed with a known index parameter m and scale (rate) parameter $\lambda$:
$$f(t \mid \lambda) = \frac{\lambda^{m}\, t^{m-1} e^{-\lambda t}}{\Gamma(m)}, \qquad t > 0. \tag{1}$$
The mean and the variance of the distribution in Equation 1 are $m/\lambda$ and $m/\lambda^{2}$, respectively.
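The link between exponential interresponse times and a gamma-distributed completion time can be checked by simulation. The following is a minimal sketch, not part of the original article: the function name, the rate value 2.0, and the number of draws are illustrative assumptions. It sums m exponential interresponse times and compares the sample mean and variance with m/rate and m/rate².

```python
import numpy as np

def simulate_completion_times(m, rate, n_draws, seed=0):
    """Time to complete m items, built as the sum of m exponential interresponse times."""
    rng = np.random.default_rng(seed)
    # Each interresponse time is exponential with mean 1 / rate.
    gaps = rng.exponential(scale=1.0 / rate, size=(n_draws, m))
    return gaps.sum(axis=1)

# Illustrative values (assumptions): a 15-item test and a rate of 2 items per time unit.
m, rate = 15, 2.0
t = simulate_completion_times(m, rate, n_draws=200_000)

print("sample mean:", t.mean(), "theory:", m / rate)
print("sample variance:", t.var(), "theory:", m / rate**2)
```

With this many draws, the sample moments should lie close to the theoretical values, which is what the gamma result predicts.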
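Now, suppose that we have a set of K (speed) tests, measuring the same trait, possibly varying in length and rate parameters. For the rate parameters $\lambda_{nj}$ of Respondent n on Test j, it is assumed that
$$\lambda_{nj} = \theta_n \varepsilon_j, \tag{2}$$
where $\theta_n$ refers to the ability and $\varepsilon_j$ to the easiness of Test j. A higher value for $\varepsilon_j$ gives a higher rate and therefore a shorter expected response time.
The model presented here assumes that the number of items or units in the test is known. The index parameter m is supposed to be (exactly) equal to the number of items and interpreted as the test length. A slight modification consists of assuming that the index parameters are known, up to a common scaling factor, which has to be estimated. In some cases, the basic units of the test may be more or less arbitrary.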
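To make the multiplicative structure concrete, the sketch below generates response times for a set of tests whose rate is the product of a respondent ability and a test easiness. It is an illustration under stated assumptions: the function name `simulate_mgm`, the gamma ability distribution with index 5, and the easiness values (borrowed from the test parameters quoted later in the simulation design) are chosen for the example and are not the author's code.

```python
import numpy as np

def simulate_mgm(n_subjects, eps, m, mu=1.0, gamma=5.0, seed=0):
    """Draw abilities and response times under a multiplicative gamma model.

    eps   : easiness per test (assumed values)
    m     : known index parameter (test length) per test
    mu    : mean of the gamma ability distribution
    gamma : index parameter of the ability distribution (assumed value)
    """
    rng = np.random.default_rng(seed)
    eps = np.asarray(eps, dtype=float)
    m = np.asarray(m, dtype=float)
    # Abilities: gamma with index `gamma` and mean `mu` (scale = mu / gamma).
    theta = rng.gamma(shape=gamma, scale=mu / gamma, size=n_subjects)
    # Rate for respondent n on test j is the product ability * easiness.
    rate = theta[:, None] * eps[None, :]
    # Response times: gamma with index m_j and the rate above (scale = 1 / rate).
    times = rng.gamma(shape=m[None, :], scale=1.0 / rate)
    return theta, times

eps = [0.640, 0.800, 1.000, 1.250, 1.563]
m = [15] * 5
theta, times = simulate_mgm(1000, eps, m)
print(times.shape)
print(times.mean(axis=0))  # easier tests (larger eps) show shorter mean times
```

Because the rate multiplies the response time inside the gamma density, the product of ability, easiness, and response time has mean m for every test, which gives a simple sanity check on generated data.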
As has been mentioned in the previous section, $\theta$ is assumed to be gamma distributed with mean $\mu$ and index parameter $\gamma$, implying a variance of $\mu^{2}/\gamma$. For increasing $\gamma$, the shape of the distribution becomes similar to the normal distribution. For smaller values of $\gamma$, the distribution is skewed to the right. Now, if $\xi' = (\mu, \gamma, \varepsilon)$ is the vector of population and test parameters, the log likelihood can be written as
$$\ln L(\xi; T) = \sum_{n} \ln p(t_n; \xi), \tag{3}$$
where $t_n$ stands for the vector of response times of Respondent n and T for the data matrix of all the respondents. To obtain the marginal maximum likelihood (MML) estimation equations, the first-order derivatives of Equation 3 with respect to $\xi$ are set equal to zero. In these derivatives, $\psi(\cdot)$ refers to the psi or digamma function.
If we consider only one subject population, we will need one suitable restriction on either the test parameters or the parameters of the subject distribution for identifiability.
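For a gamma ability distribution, the integral over ability in the marginal likelihood has a closed form, because the gamma density is conjugate to the gamma response-time model. The sketch below is an illustration rather than the author's implementation: it evaluates that closed form and checks it against brute-force numerical integration for one respondent. All function names and numerical values are assumptions made for the example.

```python
import math
import numpy as np

def marginal_loglik(times, m, eps, mu, gamma):
    """Closed-form marginal log likelihood, summed over respondents (rows of `times`)."""
    times = np.atleast_2d(np.asarray(times, dtype=float))
    m = np.asarray(m, dtype=float)
    eps = np.asarray(eps, dtype=float)
    m_tot = m.sum()
    t_star = times @ eps  # weighted total response time per respondent
    const = (np.sum(m * np.log(eps)) - sum(math.lgamma(v) for v in m)
             + gamma * math.log(gamma / mu) - math.lgamma(gamma)
             + math.lgamma(m_tot + gamma))
    per_person = (const + ((m - 1.0) * np.log(times)).sum(axis=1)
                  - (m_tot + gamma) * np.log(t_star + gamma / mu))
    return float(per_person.sum())

def loglik_by_quadrature(t, m, eps, mu, gamma, upper=30.0, n_grid=300_001):
    """Same quantity for one respondent, integrating over ability on a fine grid."""
    theta = np.linspace(1e-6, upper, n_grid)
    log_f = np.zeros_like(theta)
    for t_j, m_j, e_j in zip(t, m, eps):
        # Log gamma density of t_j with index m_j and rate theta * e_j.
        log_f += (m_j * np.log(theta * e_j) + (m_j - 1.0) * math.log(t_j)
                  - theta * e_j * t_j - math.lgamma(m_j))
    # Log gamma density of ability with index gamma and mean mu.
    log_g = (gamma * math.log(gamma / mu) + (gamma - 1.0) * np.log(theta)
             - gamma * theta / mu - math.lgamma(gamma))
    y = np.exp(log_f + log_g)
    integral = np.sum((y[1:] + y[:-1]) * np.diff(theta)) / 2.0  # trapezoidal rule
    return math.log(integral)

# Hypothetical data: one respondent, two 3-item tests.
t_one, m, eps = [2.5, 2.0], [3, 3], [1.0, 1.25]
closed = marginal_loglik(t_one, m, eps, mu=1.0, gamma=5.0)
numeric = loglik_by_quadrature(t_one, m, eps, mu=1.0, gamma=5.0)
print(closed, numeric)
```

In this sketch the mean of the ability distribution is held at 1, which is one possible identifying restriction of the kind mentioned above; fixing one easiness parameter instead would serve the same purpose.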
The model is based on strong assumptions, and if these assumptions are not fulfilled, misleading results may be the consequence. Jansen (1997b) described a number of statistical tests for assessing model fit. The distributional assumptions imply that the sum scores obtained by summing the weighted test response times (where the weights are the easiness parameters) follow a Pearson Type VI distribution. This can be checked by comparing the observed with the expected distribution. The power of this test is unknown, but possibly low. A second test will be described more fully in the next paragraph. A potential source of misfit can be found in subgroup differences in large heterogeneous target populations. Differences in test parameters between groups of respondents, a phenomenon known as differential test functioning or DTF, form a serious threat to the generalizability of the test. In the case of DTF, we consider subpopulations defined on external observable variables. The same sort of reasoning, but now with an unobservable internal variable, can be followed to arrive at a second source of misfit. Namely, that in different segments of the ability scale the test response behavior is supposed to be properly described by the same test characteristic function. A third basic model assumption is local independence. The assumption of local independence (LI) in item response models is equivalent to the assumption that the latent trait under consideration spans the complete latent space (Lord & Novick, 1968, pp. 360–362). The assumption of a unidimensional latent trait implies, in our situation, that the response on a certain test is, given the trait value, independent of the responses on the other tests. Multidimensionality, therefore, is one of the most obvious sources of LI violation. Another situation leading to the violation of LI is when the response on a certain item/test is dependent on the response on preceding items/tests. 
In applications, we may encounter tests where some form of (item) clustering is present. If this is the case, LI is not likely to hold (Bradlow, Wainer, & Wang, 1999; Ip, 2001). In the context of item response theory (IRT) models, several goodness-of-fit tests have been developed both in the framework of parametric and nonparametric item response modeling, and their behavior has been studied. In general, compared with model violations such as differential item or test functioning, the violation of LI has attracted less attention. In IRT modeling practice, the requirement of LI is often replaced by the less stringent requirement that the conditional covariance (between items) is zero (Junker, 1993). Especially in the context of nonparametric IRT, we find methods for local dependence assessment using statistics based on summing conditional covariances between the items (Douglas, Kim, Habing, & Gao, 1998; Stout, Froelich, & Gao, 2001). Chen and Thissen (1997) have evaluated several indexes for the detection of local dependence, among these the Q3 statistic, basically a correlation between residual item scores. Another useful approach, based on the principle of Lagrange multiplier tests, has been proposed by Glas (1998) and Glas and Verhelst (1995; see also Glas & Suárez Falcón, 2003). This approach has been used for developing a test for DTF for the MGM (Jansen & Glas, 2001). In the following paragraphs, I will describe the derivation of a test for assessing local dependence, using the Lagrange multiplier (LM) tests approach. The results of the LM test will be compared with a model test, using correlational methods, proposed by Jansen (1997b), but first, I will explain the principle of LM tests.
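LM Tests
The LM test is used for testing a special model, with parameters $\eta_1$, against a generalization in which a basic assumption has been relaxed by adding parameters $\eta_2$. The test outcome depends on the magnitude of the first-order derivatives of the log likelihood of the general model, evaluated at the point of the maximum likelihood estimates of $\eta_1$ under the special model. The test statistic is asymptotically $\chi^2$ distributed with degrees of freedom equal to the number of parameters fixed. To calculate the LM test statistic, only the estimates of the parameters $\eta_1$ of the special model are needed.
The framework of LM tests is also well suited for developing a statistical test for model violations in case of the speed test model (Jansen & Glas, 2001).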
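A generic LM statistic can be computed from per-respondent first-order derivatives. The sketch below assumes the quadratic form h' W⁻¹ h that is common in LM tests for IRT models, with the information matrix approximated by sums of outer products of the per-respondent derivatives. That approximation, the function name, and the argument layout are assumptions of this illustration and are not taken from the article.

```python
import numpy as np

def lm_statistic(h1, h2):
    """LM statistic from per-respondent first-order derivatives.

    h1 : (N, p) derivatives with respect to the estimated parameters of the special model
    h2 : (N, q) derivatives with respect to the added parameters, which are held fixed
    Both are evaluated at the ML estimates of the special model.
    """
    h1 = np.asarray(h1, dtype=float)
    h1 = h1.reshape(len(h1), -1)
    h2 = np.asarray(h2, dtype=float)
    h2 = h2.reshape(len(h2), -1)
    s11 = h1.T @ h1
    s12 = h1.T @ h2
    s22 = h2.T @ h2
    # Information about the added parameters, corrected for estimating the others.
    w = s22 - s12.T @ np.linalg.solve(s11, s12)
    h = h2.sum(axis=0)  # total derivative with respect to the added parameters
    return float(h @ np.linalg.solve(w, h))

# Tiny hand-checkable example with one estimated and one added parameter.
h1 = [1.0, -1.0, 1.0, -1.0]   # sums to zero, as at an ML estimate
h2 = [2.0, 0.0, 1.0, 1.0]
print(lm_statistic(h1, h2))   # 16 / 5 = 3.2
```

The resulting value would be referred to a chi-square distribution with as many degrees of freedom as there are added parameters.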
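Tests for LI
Now, let us assume that for Test k the probability of observing score $t_{nk}$ is given by:
$$f(t_{nk} \mid \theta_n, t_{nj}) = \frac{(\theta_n \varepsilon_k + \delta_{jk} m_j / t_{nj})^{m_k}\, t_{nk}^{m_k - 1} \exp[-(\theta_n \varepsilon_k + \delta_{jk} m_j / t_{nj})\, t_{nk}]}{\Gamma(m_k)}. \tag{12}$$
The intensity parameter of the score distribution of Test k is augmented by a factor depending on the inverse of the score on Test j, modeling a positive association between the two scores if $\delta_{jk}$ is positive. Modeling a negative association in this way is possible but problematic. Because the rate parameter of a gamma distribution has to be positive, negative values of $\delta_{jk}$ can lead to inadmissible rates.
The interpretation of $h_n(\delta_{jk})$, the first-order derivative for Respondent n with respect to $\delta_{jk}$ evaluated at $\delta_{jk} = 0$, is that of the difference between the expected and the observed response time on Test k, times a factor depending on the observed score $t_{nj}$. Substituting Equation 13 in Equation 11 gives the desired test statistic. The test statistic, in the following referred to as LM test, is supposed to be approximately $\chi^2$ distributed with one degree of freedom.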
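The per-respondent derivative for the dependence parameter can be written out under the assumption that the rate of Test k is the product ability × easiness plus the dependence parameter times m_j / t_nj. Differentiating the log density at a dependence parameter of zero and averaging over the posterior distribution of ability gives (m_j / t_nj) × (expected time on Test k minus observed time on Test k). The sketch below follows that derivation; it is not the author's code, and the gamma ability distribution, the parameter values, and the function name are assumptions.

```python
import numpy as np

def dependence_scores(times, m, eps, mu, gamma, j, k):
    """Per-respondent derivative with respect to the dependence of Test k on Test j,
    evaluated at zero dependence."""
    times = np.asarray(times, dtype=float)
    m = np.asarray(m, dtype=float)
    eps = np.asarray(eps, dtype=float)
    t_star = times @ eps
    # Given the response times, ability is gamma distributed with index
    # sum(m) + gamma and rate t_star + gamma / mu, so the posterior mean of
    # 1 / ability equals rate / (index - 1).
    inv_theta = (t_star + gamma / mu) / (m.sum() + gamma - 1.0)
    expected_tk = m[k] * inv_theta / eps[k]  # posterior expected time on Test k
    return (m[j] / times[:, j]) * (expected_tk - times[:, k])

# Check on data generated WITHOUT dependence: the derivatives should average about zero.
rng = np.random.default_rng(1)
eps = np.array([0.640, 0.800, 1.000, 1.250, 1.563])
m = np.full(5, 15.0)
theta = rng.gamma(shape=5.0, scale=1.0 / 5.0, size=20_000)
times = rng.gamma(shape=m, scale=1.0 / (theta[:, None] * eps))
h = dependence_scores(times, m, eps, mu=1.0, gamma=5.0, j=2, k=3)
print(h.mean())
```

In an actual test the true parameters would be replaced by their estimates, and these derivatives would be combined with those for the estimated parameters in the LM quadratic form.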
Another statistical test, similar to the residual correlation indices used in IRT, that might be used to assess the presence of local dependence was suggested by Jansen (1997b). It can be shown that the marginal likelihood can be written as a product of two separate parts, defined in terms of
$$u_{nj} = \frac{\varepsilon_j t_{nj}}{t_n^*} \quad \text{and} \quad t_n^* = \sum_{j} \varepsilon_j t_{nj}.$$
The first part, $L_c$, is the distribution of $(u_{n1}, \ldots, u_{nK})$, conditional on $t_n^*$, the weighted total response time. This part of the likelihood is independent of the parameters of the ability distribution and involves the test parameters only. The second part involves the distribution of $t_n^*$. It can be shown that $h(u_1, \ldots, u_K \mid t_n^*)$ is a Dirichlet distribution with (known) parameters $m_1, \ldots, m_K$:
$$h(u_1, \ldots, u_K \mid t_n^*) = \frac{\Gamma(m_\cdot)}{\prod_j \Gamma(m_j)} \prod_j u_j^{m_j - 1}, \qquad m_\cdot = \sum_j m_j.$$
The model predicts a very specific pattern of correlations between the weighted relative response rates U, which suggests a model check in the form of comparing the observed correlations between pairs of Us with the predicted correlations. For each $u_j$ and $u_l$ the expected correlation is:
$$\rho(u_j, u_l) = -\sqrt{\frac{m_j\, m_l}{(m_\cdot - m_j)(m_\cdot - m_l)}}.$$
In case of a moderate or small number of tests, this correlation will be substantial. A complication arises from the fact that both $t^*$ and the us are functions of the unknown test parameters $\varepsilon$.
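The predicted correlations follow directly from the moments of a Dirichlet distribution whose parameters are the test lengths. A short sketch (the function name is an assumption) computes the full matrix and reproduces the values –.333 and –.50 quoted in the empirical example for four and three tests of equal length.

```python
import numpy as np

def expected_correlations(m):
    """Model-implied correlations between weighted relative response times,
    using the correlation of Dirichlet components with parameters m."""
    m = np.asarray(m, dtype=float)
    rest = m.sum() - m  # total length of all other tests
    corr = -np.sqrt(np.outer(m, m) / np.outer(rest, rest))
    np.fill_diagonal(corr, 1.0)
    return corr

print(expected_correlations([50, 50, 50, 50])[0, 1])  # -1/3 for four equal-length tests
print(expected_correlations([50, 50, 50])[0, 1])      # -1/2 for three equal-length tests
```

For K tests of equal length the correlation reduces to –1/(K – 1), which explains why it is substantial only when the number of tests is small or moderate.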
Method
To study the properties of the LM and correlation statistics, we carried out a number of simulation studies.
In a first series of simulations, I used a set of five tests with test difficulties equal to .640, .800, 1.000, 1.250, and 1.563. A two-by-three design was used for test length and sample size. The test length was specified to be 15 or 25 items for each test. The respondents were randomly sampled from a gamma distribution with µ = 1 and index parameter
In a second series of simulations, I generated scores on 10 short tests of fairly homogeneous difficulties (varying between .10 and .08). A two-by-three design was used for test length and sample size. The test length was specified to be 3 or 10 items for each test. The respondents were randomly sampled from a gamma distribution with µ = 1 and index parameter
In a following series of simulations, I investigated the sensitivity of both statistics to violations of LI, using the same specifications for the test parameters and the subject parameter distribution as before.
The Type I Error Rates
In the first half of Table 1, we find the error rates of the LM and ZCOR tests for the small set of tests and in the second half for the larger set. Each cell in the table is based on 500 replications. For both tests, the error rates are close to the nominal level. In case of the LM test, the rates are slightly higher than 5%, whereas the ZCOR test tends to be more conservative.
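When judging whether an empirical error rate is "close to the nominal level," it helps to know how much Monte Carlo noise 500 replications leave. The sketch below uses the normal approximation to the binomial, which is an assumption of this illustration rather than a procedure described in the article, to give the range in which an observed rejection rate would typically fall if the true rate were exactly 5%.

```python
import math

def rejection_rate_band(alpha=0.05, replications=500, z=1.96):
    """Approximate 95% band for an empirical rejection rate when the true rate is alpha."""
    se = math.sqrt(alpha * (1.0 - alpha) / replications)  # binomial standard error
    return alpha - z * se, alpha + z * se

low, high = rejection_rate_band()
print(round(low, 4), round(high, 4))  # roughly 0.031 to 0.069
```

Observed rates inside this band are compatible with a correct nominal level, while rates well outside it point to a liberal or conservative test.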
Power Studies
In the following, I compare the power of the two tests, LM and ZCOR, for the detection of local dependence. Local dependence was introduced by specifying a nonzero value for an additional parameter modeling the dependence of the rate parameter of Test k on the response on Test j. In the simulations, the model violation was imposed by augmenting the subject speed parameters for Test k with a factor depending on the standardized response rate of a preceding Test j. For this, I rewrote the term ($\theta_n \varepsilon_k + \delta_{jk} m_j / t_{nj}$) in Equation 12 as follows:
where snj = mj/(tnj
I will first present the results for the small set of long tests, where the model violation is imposed on the fourth test. The results of applying the LM and ZCOR tests can be found in Tables 2 and 3. The hit rates show clear main effects of sample size and effect size. The hit rate is also larger for longer tests. In all cases, the power of the LM test is higher than the power of the ZCOR test. The false-alarm rates, averaged over the four other tests in the set of five, become larger for increasing sample sizes and larger effect sizes.
For the set of 10 tests, I have imposed a model violation on Test 4 and Test 6 (20% misfit) by imposing dependencies on the responses on the directly preceding tests. The results are shown in Tables 4 and 5. The hits are the rates of rejects averaged over Tests 4 and 6.
From the tables, it can be inferred that the effects of sample size, test length, and effect size show the same pattern as was found for the small set of five tests. Again, the hit rates increase with the sample size and the effect size. The LM test performs better than the ZCOR test. The relatively small false-alarm rates are consistent with results by Glas and Suárez Falcón (2003), who studied, together with other fit statistics, an LM test for violation of LI in the framework of the Three-Parameter Logistic Model.
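Data with a local dependence of the kind studied in the power simulations can be generated by adding a term proportional to m_j / t_nj to the rate of Test k. The sketch below uses that unstandardized form; the standardized variant used in the article's simulations is not reproduced here, and the function name, effect size, and ability distribution are assumptions. Standard gamma variates are drawn first so that data sets with and without dependence share the same random numbers.

```python
import numpy as np

def simulate_with_dependence(n_subjects, eps, m, j, k, delta,
                             mu=1.0, gamma=5.0, seed=0):
    """Response times in which the rate of Test k also depends on the time on Test j."""
    rng = np.random.default_rng(seed)
    eps = np.asarray(eps, dtype=float)
    m = np.asarray(m, dtype=float)
    theta = rng.gamma(shape=gamma, scale=mu / gamma, size=n_subjects)
    unit = rng.gamma(shape=m, size=(n_subjects, len(m)))  # gamma(m_j, 1) draws
    times = unit / (theta[:, None] * eps)                 # locally independent times
    # Augmented rate for Test k: ability * easiness + delta * m_j / t_nj.
    rate_k = theta * eps[k] + delta * m[j] / times[:, j]
    if np.any(rate_k <= 0):
        raise ValueError("delta yields a nonpositive rate for some respondents")
    times[:, k] = unit[:, k] / rate_k
    return times

eps = [0.640, 0.800, 1.000, 1.250, 1.563]
m = [15] * 5
indep = simulate_with_dependence(2000, eps, m, j=2, k=3, delta=0.0)
dep = simulate_with_dependence(2000, eps, m, j=2, k=3, delta=0.3)
print(indep[:, 3].mean(), dep[:, 3].mean())  # a positive delta shortens times on Test k
```

The explicit check on the rate mirrors the point made about negative values of the dependence parameter: they are admissible only as long as every augmented rate stays positive.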
The data in this example are taken from a study on the development of early reading and the factors relating to reading problems (van den Bos & Lutje Spelberg, 1997). The participants are children attending schools of special education. Among several other measurements, data were collected for four single-word reading tests. The tests were individually administered, and timing per item was not feasible. The first two tests, the EMT and the KLEPEL, were administered in their usual time-limit format. In addition, the time needed to finish the first 50 items was registered. The EMT and the KLEPEL are highly similar. Both consist of a list of stimulus words, ordered by increasing difficulty from one-syllable consonant-vowel-consonant (CVC) words to complicated three-syllable words. The difference between the EMT and KLEPEL is that whereas the stimulus words of the EMT are real words, those of the KLEPEL are pseudo-words. The third test, the AARON, consisted of short real words in no specific order. The fourth test consisted of blocks of five different color names (CLN) in random order. Both AARON and CLN were 50-item tests. The test stimuli differ in type as well as presentation, and it is possible that the four tests tap different abilities. Under these circumstances, local dependencies may arise when I analyze the data using a unidimensional model.
Results
The observed correlations between the EMT and the KLEPEL, and the AARON and the CLN, were found to be positive, whereas the other correlations were (strongly) negative and therefore also not in accordance with the model predictions of r = –.333 for all six intercorrelations. The results of the LM test statistic, which is approximately $\chi^2$ distributed with one degree of freedom, also show strong indications for the violation of LI. Especially the CLN test shows large values for the LM test statistic. The pattern of correlations between the relative weighted responses suggests that the set of four falls apart into two pairs. However, removing only the CLN test resulted in a matrix of correlations for the remaining three tests that was more in line with the predicted value, which is now –.50. All three are covered by the asymptotic 95% confidence interval. The LM tests still indicate a lack of fit.
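A comparison of observed and predicted correlations of this kind can be sketched as follows. The example computes the relative weighted response times, their correlation matrix, and an asymptotic 95% interval based on Fisher's z transformation. The use of Fisher's z is an assumption of this illustration and may differ from the interval used in the article; the data are simulated under the model (four 50-item tests, hypothetical easiness values), not the reading data.

```python
import numpy as np

def relative_weighted_times(times, eps):
    """Share of each test in the easiness-weighted total response time."""
    w = np.asarray(times, dtype=float) * np.asarray(eps, dtype=float)
    return w / w.sum(axis=1, keepdims=True)

def fisher_interval(r, n, z=1.96):
    """Approximate 95% interval for a correlation via Fisher's z transformation."""
    center = np.arctanh(r)
    half = z / np.sqrt(n - 3)
    return np.tanh(center - half), np.tanh(center + half)

# Simulated data that satisfy the model (all values are assumptions).
rng = np.random.default_rng(7)
n = 5000
eps = np.array([0.8, 1.0, 1.0, 1.25])
m = np.full(4, 50.0)
theta = rng.gamma(shape=5.0, scale=1.0 / 5.0, size=n)
times = rng.gamma(shape=m, scale=1.0 / (theta[:, None] * eps))

u = relative_weighted_times(times, eps)
r_obs = np.corrcoef(u, rowvar=False)
low, high = fisher_interval(r_obs[0, 1], n)
print(r_obs[0, 1], (low, high))  # predicted value under the model: -1/3
```

In practice the easiness parameters are estimated rather than known, which is the complication noted earlier for this type of check.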
Until recently, the availability of suitable statistical tests to assess the fit of the Rasch model for speed tests was rather limited. The principle of LM tests, which has been introduced by Glas and Verhelst (1995) as a guiding principle for deriving statistical tests in an IRT context, can also be applied to the MGM. In this article, I focused on the performance of the LM test aimed at detecting local dependence. In a simulation study, the LM test statistic was found to be performing adequately. The model, if valid, predicts a specific pattern of correlations between the weighted relative test response rates. The observed correlations were also sensitive to violations of LI. However, in comparison, the LM statistics were more powerful than the correlational indices. The relatively small false-alarm rates are consistent with results reported by Glas and Suárez Falcón (2003) for a similar test in an IRT context. Nonetheless, we must keep in mind that the simulation design that was used here, although inspired by empirical applications, covers a rather limited number of a large set of possible situations we may encounter in practice. Compared to other IRT models that model binary or polytomous item responses, the range of possible model specifications for continuous test response variables is much wider. For instance, tests may vary in length, difficulty, and number. More application studies, as well as simulation studies, are necessary. More research is also needed to assess if the LM test aimed at detecting LI is also sensitive to other sources of lack of model fit, such as DTF, and violations of the equality of the test characteristic function assumption.
MARGO G. H. JANSEN is an associate professor in the Department of Educational Sciences at the University of Groningen, Grote Rozenstraat 38, 9712 TJ, Groningen-NL;g.g.h.jansen{at}rug.nl. Her areas of specialization are test theory, in particular IRT, and applied statistics. Manuscript received July 11, 2002. Revision received February 5, 2005. Accepted for publication August 24, 2005.
Bradlow, E., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.
Chen, W. H., & Thissen, D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22, 265-289.
Douglas, J., Kim, H. R., Habing, B., & Gao, F. (1998). Investigating local dependence with conditional covariance functions. Journal of Educational and Behavioral Statistics, 23, 129-151.
Journal of Educational and Behavioral Statistics, Vol. 32, No. 1, 24-38 (2007)