Journal of Educational and Behavioral Statistics

Linear Prediction of a True Score From a Direct Estimate and Several Derived Estimates

Shelby J. Haberman
Jiahe Qian

Educational Testing Service


    Abstract
 
Statistical prediction problems often involve both a direct estimate of a true score and covariates of this true score. Given the criterion of mean squared error, this study determines the best linear predictor of the true score given the direct estimate and the covariates. Results yield an extension of Kelley’s formula for estimation of the true score to cases in which covariates are present. The best linear predictor is a weighted average of the direct estimate and of the linear regression of the direct estimate onto the covariates. The weights depend on the reliability of the direct estimate and on the multiple correlation of the true score with the covariates. One application of the best linear predictor is to use essay features provided by computer analysis and an observed holistic score of an essay provided by a human rater to approximate the true score corresponding to the holistic score.

Keywords: Kelley’s formula • mean squared error • automatic essay scoring • reliability


    Introduction
 
Statistical prediction of a true score on a test may involve both direct estimation of a true score and covariates related to the true score. For example, in a graduate admission test denoted by GRAD in this article, a final essay score was based on a holistic score provided by a human rater, the direct estimate, and essay features such as number of words in the essay, error rates per word in grammar or usage, and numerical measures of word diversity. The essay features, the covariates, were determined by computer analysis of the essay (Attali & Burstein, 2004). The procedure in GRAD for essays employed an integer holistic score in the range 1 to 6 and an integer e-rater score between 1 and 6 generated from computer analysis. Normally, the reported score was the average of the holistic score from the reader and of the e-rater score; however, an additional reader was employed if the reader score and e-rater score differed by more than 1.

The approach used in GRAD was not necessarily an optimal approach to assignment of a final score to an essay. This remark applies even if the true essay score is regarded as the average holistic score an essay would receive if rated by an arbitrarily large number of raters (Lord & Novick, 1968).

In this study, a continuation of work presented earlier (Qian & Haberman, 2003), the criterion of mean squared error is used to determine the best linear predictor of a true score based on a direct estimate and on covariates. In Section 1, this predictor is considered under the assumption that all relevant population parameters are known. In this ideal case, the best linear predictor is shown to be a weighted average of two components. The first component is the direct estimate. The second component is the regression of the direct estimate onto the covariates. The weights assigned to the components depend on the reliability of the direct estimate and on the multiple correlation between the direct estimate and the covariates. The mean squared error of the optimal linear predictor is shown to depend on the variance of the direct estimate, reliability of the direct estimate, and multiple correlation of the true score and the covariates. Results of this section can be regarded as a generalization of Kelley’s formula to the case of covariates (Kelley, 1947). Required arguments are familiar from treatments of linear prediction in classical test theory (Holland & Hoskens, 2003; Lord & Novick, 1968). Results are related to other efforts to combine information from several tests to provide improved estimation of the true scores for each of the tests or for a composite test (Longford, 1997; Wainer et al., 2001; Wainer & Thissen, 2001). As evident from the cited references, arguments used here are readily related to arguments used in Bayesian inference.

In Section 2, estimation of the best linear predictor and of the mean squared error is considered. Estimation is described for a simple random sample of essays from a large population. Because reliability must be estimated, it is assumed that at least for some essays, more than one independently obtained holistic score is available in the sample. This assumption has commonly been fulfilled in applications of e-rater. For each essay in the sample, it is assumed that at least one holistic score and all covariates are observed. Given these data, estimation of parameters is relatively straightforward, at least for large samples. Standard treatments of classical test theory provide basic background (Lord & Novick, 1968), as do classical treatments of statistical inference (Rao, 1973). Some readers may recognize relationships to empirical Bayesian inference (Wainer & Thissen, 2001).

In Section 3, the methods developed in Sections 1 and 2 are applied to essays from GRAD and from the Test of English as a Foreign Language (TOEFL). A notable feature of the analysis is the relatively low weight assigned to the holistic score provided by the reader. This result reflects some limitations in the reliability of holistic scores and a relatively high multiple correlation of holistic scores and computer-derived essay features.

As discussed in Section 4, results in this report suggest that scoring procedures such as those used in GRAD give considerably higher weight to computer-generated essay features than has generally been the case. Policy issues may arise that involve public perceptions concerning the reduced weight given to the human rater, and there is some question to consider concerning the effect on examinee performance if they are aware that a very large fraction of the grade on their essay is determined by a computer program.


    1. The Best Linear Predictor of the True Score
 
To obtain the best linear predictor of the true score from a direct estimate and from the available covariates, some elementary notation and a basic probability model are required. Let θ, the true score, be a random variable with expectation E(θ) and positive variance V(θ), and let h, the direct estimate, be a random variable such that the error e = h − θ in estimation of θ has expectation 0 and positive variance V(e). Let e and θ be uncorrelated (Lord & Novick, 1968). Then the observed score h has mean E(h) = E(θ) and variance

$$V(h) = V(\theta) + V(e), \tag{1}$$

and the covariance C(h, θ) of h and θ is V(θ) (Lord & Novick, 1968). The reliability coefficient is

$$\tau^2 = \frac{V(\theta)}{V(h)} = \frac{V(\theta)}{V(\theta) + V(e)} \tag{2}$$

(Lord & Novick, 1968). Under the assumptions made concerning the variances of the true score θ and the error e, the reliability coefficient τ² must be positive and less than 1.

Let d be a q-dimensional vector of covariates d_j, 1 ≤ j ≤ q, with mean E(d) and positive definite covariance matrix C(d). Assume that the estimation error e is uncorrelated with the covariates d_j, 1 ≤ j ≤ q, so that the vector C(d, e) of covariances of the error e and the covariates d_j, 1 ≤ j ≤ q, is the zero vector. This information suffices to specify the best linear predictor of the true score θ based on the observed score h and the vector d of covariates.

To describe the best linear predictor of the true score, first consider the standard formula for the best linear predictor of the direct estimate h based on the covariate vector d. For q-dimensional vectors x and y with respective coordinates x_i and y_i, let

$$x'y = \sum_{i=1}^{q} x_i y_i.$$

Then the best linear predictor of h from d is

$$f = E(h) + \gamma'[d - E(d)], \tag{3}$$

where the vector of regression coefficients is

$$\gamma = [C(d)]^{-1} C(d, h). \tag{4}$$

Note that C(d, h) is the vector of covariances of d_j and h for 1 ≤ j ≤ q (Lord & Novick, 1968, p. 267).

The best linear predictor of the direct estimate h from the covariate vector d is the same as the best linear predictor of the true score θ from the covariate vector d. This claim is easily verified. Because the error e is assumed to have expectation 0 and to be uncorrelated with the covariate vector d, the covariance vector C(d, θ) for the covariates d_j and the true score θ is the same as the covariance vector C(d, h) for the covariates d_j and the direct estimate h (Holland & Hoskens, 2003). As already noted, the direct estimate h and the true score θ satisfy E(h) = E(θ). Thus, the best linear predictor of θ from d is

$$f = E(\theta) + \gamma'[d - E(d)],$$

where the vector γ of Equation 4 also satisfies

$$\gamma = [C(d)]^{-1} C(d, \theta). \tag{5}$$

The residual for prediction of the direct estimate h by the covariate vector d is r = h − f. The corresponding residual for prediction of the true score θ by the covariate vector d is u = θ − f, so that r = u + e.

The mean squared error for linear prediction of the direct estimate h by the covariate vector d is then

$$V(r) = V(h) - V(f), \tag{6}$$

where

$$V(f) = \gamma' C(d) \gamma = \gamma' C(d, h) \tag{7}$$

(Rao, 1973). If ρ(h, d) is the multiple correlation coefficient of the direct estimate h and the covariate vector d (Lord & Novick, 1968) and ρ²(h, d) is the square of ρ(h, d), then

$$\rho^2(h, d) = V(f)/V(h), \tag{8}$$

and Equations 6 and 8 imply that

$$V(r) = V(h)[1 - \rho^2(h, d)]. \tag{9}$$

In like manner, the mean squared error for linear prediction of the true score θ by the covariate vector d is

$$V(u) = V(\theta) - V(f). \tag{10}$$

It is assumed in this article that the residual variance V(u) is positive so that the true score is not determined by an affine function of the covariate vector d. This assumption implies that V(f) < V(θ) and V(f)/V(θ) < 1. By Equation 1,

$$V(r) = V(u) + V(e). \tag{11}$$

In analogy to Equations 8 and 9, the multiple correlation ρ(θ, d) of the true score θ and the covariate vector d satisfies

$$\rho^2(\theta, d) = V(f)/V(\theta) < 1 \tag{12}$$

and

$$V(u) = V(\theta)[1 - \rho^2(\theta, d)]. \tag{13}$$

By Equations 2, 8, and 12,

$$\rho^2(\theta, d) = \rho^2(h, d)/\tau^2, \tag{14}$$

so that the multiple correlation coefficient of the true score θ and the covariate vector d is larger than is the multiple correlation coefficient of the observed score h and d. The ratio of the multiple correlation coefficients is determined by the reliability coefficient τ². By Equation 14 and the inequality in Equation 12,

$$\rho^2(h, d) < \tau^2. \tag{15}$$

Given these basic results, it is then relatively easily shown that the best linear predictor of the true score θ based on the direct estimate h and on the covariate vector d is a weighted linear combination

$$t = \alpha h + (1 - \alpha) f \tag{16}$$

of the direct estimate h and of the best linear predictor f of θ from the covariate vector d.

Prior to verification of Equation 16, it is helpful to interpret the result. The weight α assigned to the direct estimate h is the ratio

$$\alpha = \frac{V(u)}{V(r)} = \frac{V(u)}{V(u) + V(e)} \tag{17}$$

of the residual variance V(u) from linear prediction of the true score θ by the covariate vector d and the residual variance V(r) from linear prediction of the observed score h by the covariate vector d. The second equation in Equation 17 follows from Equation 11.

In Equation 17, the weight α assigned to the direct estimate is always between 0 and 1 so that the weight 1 − α assigned to the best linear predictor f based on the covariate vector d is also positive and less than 1. Equal weighting (α = 1/2) corresponds to a residual variance V(u) from estimation of the true score by covariates equal to the error variance V(e) associated with the direct estimate h. If V(u) is half of V(e), then α is 1/3 so that the weight of the direct estimate is half the weight of the linear predictor f. The weight α assigned to the direct estimate can be expressed in terms of the reliability τ² and the multiple correlation coefficient ρ(θ, d) of the true score θ and the covariate vector d because Equations 2, 9, and 14 imply that

$$\alpha = \frac{\tau^2[1 - \rho^2(\theta, d)]}{1 - \tau^2 \rho^2(\theta, d)}.$$

The weight α increases with an increase in the reliability τ² and decreases with an increase in the multiple correlation ρ(θ, d) of the true score θ and the covariate vector d. If ρ(θ, d) is 0, then f is just the expected value E(h) = E(θ), the weight α = τ², and

$$t = \tau^2 h + (1 - \tau^2) E(h)$$

reduces to the estimate of Kelley’s formula. In general, α ≤ τ² and 1 − α ≥ 1 − τ².
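
As an illustrative sketch that is not part of the original analysis, the weight on the direct estimate can be computed either from the two residual variances or from the reliability and the squared multiple correlation of the true score with the covariates. The function names and the numerical inputs below are assumptions chosen for illustration only.

```python
def direct_weight_from_variances(var_u: float, var_e: float) -> float:
    """Weight on the direct estimate: V(u) / (V(u) + V(e))."""
    if var_u <= 0 or var_e <= 0:
        raise ValueError("both variances are assumed positive")
    return var_u / (var_u + var_e)


def direct_weight(tau2: float, rho2_theta_d: float) -> float:
    """Same weight written with the reliability tau2 and the squared
    multiple correlation rho2_theta_d of the true score and covariates."""
    if not 0.0 < tau2 < 1.0:
        raise ValueError("reliability must lie strictly between 0 and 1")
    if not 0.0 <= rho2_theta_d < 1.0:
        raise ValueError("squared multiple correlation must lie in [0, 1)")
    return tau2 * (1.0 - rho2_theta_d) / (1.0 - tau2 * rho2_theta_d)


# Hypothetical values: V(h) = 1, reliability 0.75, squared correlation 0.6.
var_h, tau2, rho2 = 1.0, 0.75, 0.6
var_e = var_h * (1.0 - tau2)           # error variance of the direct estimate
var_u = var_h * tau2 * (1.0 - rho2)    # residual variance of the true score
alpha_a = direct_weight_from_variances(var_u, var_e)
alpha_b = direct_weight(tau2, rho2)
```

With no useful covariates (squared multiple correlation 0) the second function returns the reliability itself, which is the Kelley weight; with V(u) equal to half of V(e) the first function returns 1/3.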

To verify that the best linear predictor t of the true score satisfies Equation 16, consider the mean squared error

$$L(a, c, b) = E([\theta - z]^2) \tag{18}$$

from prediction of the true score θ by an affine function z = a + ch + b'd of the observed score h and the covariate vector d. In z, a and c are real constants and b is a constant q-dimensional vector. Thus, a is a constant term, c is the weight for the direct estimate, and the coordinate b_j of b is the weight for the coordinate d_j of d. The mean squared error L(a, c, b) is minimized if

$$a = E(\theta) - cE(h) - b'E(d), \tag{19}$$

$$cV(h) + b'C(d, h) = C(h, \theta) = V(\theta), \tag{20}$$

and

$$C(d)b + cC(d, h) = C(d, \theta) \tag{21}$$

(Rao, 1973). Recall that the covariance vector C(d, h) is the same as the covariance vector C(d, θ) so that a comparison of Equations 4 and 21 shows that

$$b = (1 - c)\gamma. \tag{22}$$

By Equation 19, it then follows that

$$a = (1 - c)[E(\theta) - \gamma'E(d)]. \tag{23}$$

By Equations 6, 7, 10, 17, and 20, the optimal c is α so that the optimal predictor is z = t.
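
The last step can be spelled out as follows. This is a reader's sketch rather than the authors' own derivation, and it assumes that the minimizing condition for c is the usual normal equation C(h, θ) = cV(h) + b'C(d, h), together with b = (1 − c)γ and V(f) = γ'C(d)γ.

```latex
\begin{align*}
V(\theta) &= c\,V(h) + b'C(d,h)
  && \text{normal equation for } c,\ \text{with } C(h,\theta) = V(\theta) \\
          &= c\,V(h) + (1-c)\,\gamma' C(d,h)
  && \text{since } b = (1-c)\gamma \\
          &= c\,V(h) + (1-c)\,V(f)
  && \text{since } \gamma' C(d,h) = \gamma' C(d)\gamma = V(f) \\
c &= \frac{V(\theta) - V(f)}{V(h) - V(f)} = \frac{V(u)}{V(r)} = \alpha .
\end{align*}
```

The identity γ'C(d, h) = γ'C(d)γ used in the third line follows from the definition γ = [C(d)]^{-1}C(d, h).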

The residual from prediction of θ by t is

$$v = \theta - t = (1 - \alpha)u - \alpha e. \tag{24}$$

Because u and e have 0 expectations, v also has 0 expectation. Because u, a linear function of θ and d, is uncorrelated with the error e, it follows from Equation 11 that the mean squared error of prediction of the true score θ by the direct estimate h and the covariate vector d is the variance V(v) of v, and

$$V(v) = (1 - \alpha)^2 V(u) + \alpha^2 V(e) = \alpha V(e) = (1 - \alpha) V(u). \tag{25}$$

Application of Equations 12 and 13 yields

$$V(v) = \frac{V(h)\tau^2(1 - \tau^2)[1 - \rho^2(\theta, d)]}{1 - \tau^2\rho^2(\theta, d)}.$$

Note that V(v) is less than either the variance V(e) of the error of the direct estimate or the variance V(u) of the error from use of the predictor f as an estimate of the true score θ. If the multiple correlation ρ(θ, d) is 0, then the variance V(v) is the error variance V(h)τ²(1 − τ²) of Kelley’s estimate.
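
The comparison of the three error variances can be checked numerically. The following minimal sketch assumes hypothetical population values for V(h), the reliability, and the squared multiple correlation of the true score with the covariates; the function name and dictionary keys are illustrative.

```python
def error_variances(var_h: float, tau2: float, rho2_theta_d: float) -> dict:
    """Error variances of three predictors of the true score.

    var_h: variance of the direct estimate h.
    tau2: reliability of h (strictly between 0 and 1).
    rho2_theta_d: squared multiple correlation of the true score with d.
    """
    var_theta = tau2 * var_h                     # true-score variance
    var_e = var_h - var_theta                    # error of h alone
    var_u = var_theta * (1.0 - rho2_theta_d)     # error of f alone
    alpha = var_u / (var_u + var_e)              # weight on h
    var_v = alpha * var_e                        # error of the combination t
    return {"var_e": var_e, "var_u": var_u, "alpha": alpha, "var_v": var_v}


# Hypothetical inputs, not taken from the article's data.
res = error_variances(var_h=1.0, tau2=0.75, rho2_theta_d=0.6)
closed_form = 1.0 * 0.75 * 0.25 * (1 - 0.6) / (1 - 0.75 * 0.6)
kelley = error_variances(var_h=1.0, tau2=0.75, rho2_theta_d=0.0)
```

In this example the combined predictor has smaller error variance than either component, and setting the squared multiple correlation to 0 recovers the Kelley error variance V(h)τ²(1 − τ²).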

In the analysis, it is important to distinguish between the best linear predictor and the best general predictor. In general, the optimal predictor of θ based on the direct estimate h and the covariate vector d is the conditional expectation E(θ|h, d) of θ given h and d so that the prediction error is w = θ − E(θ|h, d), and the mean squared error from use of E(θ|h, d) as a predictor of θ is σ²(w). It is always true that σ²(w) does not exceed the mean squared error σ²(v) from the best linear predictor of θ based on the direct score h and the covariate vector d, although σ²(w) = σ²(v) if e, θ, and d have a joint multivariate normal distribution (Rao, 1973). The multivariate normality condition will certainly not hold exactly for the application to essay scoring under study, so no claim is made that an optimal approach has been found. Indeed, analysis of the data does suggest that modest gains may be achieved by use of cumulative logistic models; however, it should be emphasized that gains are quite modest.


    2. Estimation of the Best Linear Predictor
 
To estimate the best linear predictor t, consider a random sample of size n > q + 1 from the population used to define t. Assume that the underlying population is either infinite or so large that finite sampling corrections can be ignored. For each observation i, 1 ≤ i ≤ n, let m_i ≥ 1 direct estimates h_ij, 1 ≤ j ≤ m_i, be available, and assume that at least one m_i exceeds 1 and that the m_i are selected without regard to any characteristics of the essays under study. The requirement of some multiple direct estimates is essential to determine the variance V(e) of the errors. In use of e-rater, essays used to construct the regression analysis are assessed by more than one rater so that the requirement imposed here is consistent with current practice with e-rater. The assumption that selection of essays for multiple grading is independent of essay characteristics is not necessarily valid in practice to the extent that formal randomization procedures for essay selection are not used. In the analysis of essays in Section 3, each m_i will be 2; however, little is lost by consideration of the more general case. In addition, cost considerations will often require that most m_i be 1. Nonetheless, estimation of the error variance V(e) with reasonable accuracy will normally require several hundred essays that are multiply scored.

Let the true score for observation i be θ_i so that the error for replication j and observation i is e_ij = h_ij − θ_i. Let the vector of covariates for observation i be d_i. For each observation i and replication j, it is assumed that the joint distribution of h_ij, θ_i, and d_i is the same as the joint distribution of h, θ, and d. This assumption may not be correct in practice due to lack of formal randomization of readers to essays and due to lack of randomization of the order in which essays are read. The added assumption is imposed that the errors e_ij for the direct estimates are all uncorrelated. This assumption may again not be correct in practice due to lack of formal randomization. Available data do not permit analysis of the actual effects of errors in these assumptions.

To assist in some formulas, a variable ē will be introduced that is uncorrelated with d and θ, has mean 0, and has variance V(e)/m, where

$$m = \left[ n^{-1} \sum_{i=1}^{n} m_i^{-1} \right]^{-1}$$

is the harmonic mean of the m_i (Kendall & Stuart, 1977). If m is an integer and m_i is at least m, then ē has the same mean and variance as does the average of the e_ij, 1 ≤ j ≤ m.
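
A small sketch may help to show why the harmonic mean is the relevant summary of the numbers of ratings: the average over essays of V(e)/m_i equals V(e) divided by the harmonic mean of the m_i. The scoring design and the error variance below are hypothetical, and the function names are illustrative.

```python
def harmonic_mean_count(m_counts):
    """Harmonic mean m of the numbers m_i of direct estimates per essay."""
    if not m_counts or any(m_i < 1 for m_i in m_counts):
        raise ValueError("each m_i must be at least 1")
    return len(m_counts) / sum(1.0 / m_i for m_i in m_counts)


def mean_error_variance(var_e, m_counts):
    """Average over essays of V(e)/m_i, the variance of the mean error."""
    return sum(var_e / m_i for m_i in m_counts) / len(m_counts)


m_counts = [1, 1, 2, 2]     # hypothetical design: two essays scored twice
var_e = 0.36                # hypothetical error variance of one rating
m = harmonic_mean_count(m_counts)
avg_var = mean_error_variance(var_e, m_counts)
```

Here `avg_var` and `var_e / m` agree, and when every essay has the same number of ratings the harmonic mean is simply that number.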

Given these conditions, estimation of the best linear predictor t is straightforward. For each observation i, let h̄_i be the average of the m_i direct estimates h_ij, 1 ≤ j ≤ m_i, of the true score θ_i of essay i so that the average error ē_i = h̄_i − θ_i for observation i has mean 0 and variance V(e)/m_i and is uncorrelated with d_i. One may then estimate the expectation E(h) = E(θ) by the grand mean

$$\bar{h} = n^{-1} \sum_{i=1}^{n} \bar{h}_i. \tag{26}$$

The expectation E(d) is then estimated by the sample mean

$$\bar{d} = n^{-1} \sum_{i=1}^{n} d_i. \tag{27}$$

The covariance matrix C(d) is estimated by the sample covariance

$$\hat{C}(d) = (n - 1)^{-1} \sum_{i=1}^{n} (d_i - \bar{d})(d_i - \bar{d})', \tag{28}$$

where xy' is the q by q matrix with elements x_j y_k for 1 ≤ j ≤ q and 1 ≤ k ≤ q if x and y are vectors of dimension q with respective coordinates x_j and y_j for 1 ≤ j ≤ q. The covariance vector C(d, θ) = C(d, h) is then estimated by

$$\hat{C}(d, h) = (n - 1)^{-1} \sum_{i=1}^{n} (\bar{h}_i - \bar{h})(d_i - \bar{d}). \tag{29}$$

Thus, the vector γ of regression coefficients may be estimated by

$$\hat{\gamma} = [\hat{C}(d)]^{-1} \hat{C}(d, h). \tag{30}$$

The approximation to f is then

$$\hat{f} = \bar{h} + \hat{\gamma}'(d - \bar{d}). \tag{31}$$

For observation i, the covariate vector d_i predicts an observed score

$$\hat{f}_i = \bar{h} + \hat{\gamma}'(d_i - \bar{d}). \tag{32}$$

To complete estimation, it is necessary to approximate α. To do so, the variance V(e) of the error e and the variance V(u) of the residual u must be estimated. Estimation of V(e) is a straightforward matter given customary results for one-way analysis of variance. An unbiased estimate of V(e) is

$$\hat{V}(e) = \left[ \sum_{i=1}^{n} (m_i - 1) \right]^{-1} \sum_{i=1}^{n} \sum_{j=1}^{m_i} (h_{ij} - \bar{h}_i)^2 \tag{33}$$

(Lord & Novick, 1968).
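
For readers who want to see the error-variance estimate in computational form, the sketch below assumes that the estimate is the pooled within-essay mean square from a one-way analysis of variance, with essays scored only once contributing no degrees of freedom. The function name and the toy ratings are hypothetical.

```python
import numpy as np


def error_variance(ratings):
    """Pooled within-essay variance of the holistic scores.

    ratings: a sequence with one entry per essay, each entry being the
    sequence of holistic scores given to that essay (lengths may differ).
    """
    sum_sq = 0.0
    deg_free = 0
    for scores in ratings:
        scores = np.asarray(scores, dtype=float)
        sum_sq += float(np.sum((scores - scores.mean()) ** 2))
        deg_free += scores.size - 1      # single-scored essays add 0
    if deg_free <= 0:
        raise ValueError("at least one essay must have two or more scores")
    return sum_sq / deg_free


# Toy data: two double-scored essays and one single-scored essay.
toy = [[3, 4], [5, 5], [2]]
v_e = error_variance(toy)
```

In the toy data the within-essay sum of squares is 0.5 on 2 degrees of freedom, so the estimate is 0.25.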

The case of the residual variance V(u) is a bit more complex. Let

$$\hat{r}_i = \bar{h}_i - \hat{f}_i \tag{34}$$

be the residual from regression of the h̄_i on the d_i for 1 ≤ i ≤ n. Then the residual mean square error

$$\hat{V}(r) = (n - q - 1)^{-1} \sum_{i=1}^{n} \hat{r}_i^2 \tag{35}$$

is a consistent estimate of the variance

$$V(u) + V(e)/m.$$

If d has a continuous distribution, if each m_i is m, and if the residual u is independent of d, then V̂(r) is unbiased (Rao, 1973). These assumptions do not hold for essay scoring. Nonetheless, V(u) still has the consistent estimate

$$\hat{V}(u) = \hat{V}(r) - \hat{V}(e)/m. \tag{36}$$

At this point, the natural estimate of α based on Equation 17 is

$$\hat{\alpha} = \frac{\hat{V}(u)}{\hat{V}(u) + \hat{V}(e)}. \tag{37}$$

The only complication is that V̂(e) and V̂(u) need not be positive. One may adopt the convention that α̂ is 0 if V̂(u) ≤ 0 (Bock & Petersen, 1975).

Given α̂, h, and f̂ and given Equation 16, t may be estimated by

$$\hat{t} = \hat{\alpha} h + (1 - \hat{\alpha}) \hat{f}. \tag{38}$$

Given Equation 25, the mean square error V(v) may then be approximated by

$$\hat{V}(v) = \hat{\alpha} \hat{V}(e). \tag{39}$$
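
The estimation steps of this section can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' software: every essay is assumed to have the same number m ≥ 2 of ratings, sample covariances use divisor n − 1, the residual mean square uses divisor n − q − 1, and the data are simulated with made-up parameter values.

```python
import numpy as np


def estimate_predictor(H, D):
    """Estimate the weight on the direct estimate and related quantities.

    H: (n, m) array of m >= 2 holistic scores per essay (equal m is an
       assumption of this sketch).
    D: (n, q) array of covariates.
    """
    H = np.asarray(H, dtype=float)
    D = np.asarray(D, dtype=float)
    n, m = H.shape
    q = D.shape[1]

    h_bar_i = H.mean(axis=1)            # average holistic score per essay
    h_bar = h_bar_i.mean()              # grand mean
    Dc = D - D.mean(axis=0)             # centered covariates

    C_d = Dc.T @ Dc / (n - 1)                    # sample covariance of d
    C_dh = Dc.T @ (h_bar_i - h_bar) / (n - 1)    # covariance of d and h
    gamma = np.linalg.solve(C_d, C_dh)           # regression coefficients
    f_hat = h_bar + Dc @ gamma                   # covariate-based prediction

    # Pooled within-essay variance of the ratings estimates V(e).
    var_e = np.sum((H - h_bar_i[:, None]) ** 2) / (n * (m - 1))
    # Residual mean square estimates V(u) + V(e)/m.
    var_r = np.sum((h_bar_i - f_hat) ** 2) / (n - q - 1)
    var_u = var_r - var_e / m
    alpha = 0.0 if var_u <= 0 else var_u / (var_u + var_e)

    return {"gamma": gamma, "f_hat": f_hat, "var_e": var_e,
            "var_u": var_u, "alpha": alpha, "mse": alpha * var_e}


def simulate(n=5000, seed=0):
    """Synthetic data with V(u) = 0.16 and V(e) = 0.36 (made-up values)."""
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(n, 2))
    theta = 3.5 + D @ np.array([0.6, 0.3]) + rng.normal(scale=0.4, size=n)
    H = theta[:, None] + rng.normal(scale=0.6, size=(n, 2))
    return theta, H, D


theta, H, D = simulate()
est = estimate_predictor(H, D)
# Combine the first rating with the covariate-based prediction.
t_hat = est["alpha"] * H[:, 0] + (1 - est["alpha"]) * est["f_hat"]
realized_mse = np.mean((theta - t_hat) ** 2)
```

Because the simulation retains the true scores, the realized mean squared error of the combined predictor can be compared with the estimate `est["mse"]`; with real essay data the true scores are not observed and only the estimate is available.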


    3. Data Sources and Empirical Results
 
The results of Sections 1 and 2 are readily applied to essay assessment. In this section, data and variables used in the analysis are described, and results of the analysis are presented.

Data Sources and Prompts Used in Essay Assessment
The data used in the study are essays generated by four essay prompts, with the first two prompts from GRAD and the other two from TOEFL. For each prompt, about 5,000 essays are available. Each essay has been marked by at least two raters, although another rater is employed if the holistic scores differ by more than 1. Essays are only used if assigned scores from 1 to 6 by both initial raters and if they contain at least 25 words (Haberman, 2004). These restrictions remove responses that do not satisfy minimal criteria for essays responsive to the prompt. For each essay, the initial m_i = 2 holistic scores obtained from readers are used in the analysis.

Covariates in the Analysis
Several choices of covariate vectors were considered in the analysis. These vectors are based on the following essay features (Attali & Burstein, 2004; Burstein, Chodorow, & Leacock, 2004; Haberman, 2004).

Number of Words
The number W of words in the essay. It is generally found that holistic scores on essays are strongly positively related to essay length.

Number of Characters
The number C of alphanumeric characters in the essay. As in the case of W, C is highly related to holistic score.

Average Word Length
The ratio A = C/W is the average number of characters per word. Increasing values of A may indicate increasingly sophisticated vocabulary so that one might reasonably expect A to be positively related to holistic score.

Error Rates
For a given essay, let N_G be the number of grammatical errors detected by e-rater Version 2.0, let N_U be the number of usage errors detected, let N_M be the number of detected errors in mechanics, and let N_S be the number of detected errors in style. The corresponding rates per word are R_G = N_G/W, R_U = N_U/W, R_M = N_M/W, and R_S = N_S/W. A summary total is R_T = R_G + R_U + R_M + R_S. A special case of mechanical errors, spelling errors, is also of interest. Here, N_P is the number of detected spelling errors, and R_P = N_P/W is the rate per word. It is natural to expect that holistic scores are negatively related to error rates.

Number of Arguments
Let D be the number of discourse elements in the essay, and let D_8 be the minimum of D and 8. (In a standard five-paragraph essay, there are 8 discourse elements.) Essays that are more developed are likely to have more discourse elements so that holistic scores are expected to be positively correlated with D and D_8.

Average Argument Length
The ratio L = W/D is the average number of words in a discourse element. It is quite possible that more developed arguments may be positively associated with higher holistic scores.

Standard Frequency Index
The Breland Standard Frequency Index (SFI; Breland, 1996; Breland, Jones, & Jenkins, 1994) is a measure of word frequency. The measure is on a logarithmic scale, and lower numbers indicate less frequent words. In Version 2.0 of e-rater, the fifth lowest SFI value (B_5) is used for essay words in the list of 179,195 words with an SFI. The median B of the SFI for essay words in the list is also considered in the regression analysis in this report. It is believed that lower SFI values are indications of more sophisticated word usage so that lower SFI may be associated with higher holistic scores.

Measures of Word Diversity
Simpson’s index S (Gini, 1912; Simpson, 1949) measures the probability that two distinct randomly selected words from an essay are the same. The ratio T is the ratio D/M in an essay of the number D of distinct content words to the total number M of content words. Here, content words are words that are normally used in search engines and indexes. Thus, words such as "the" and "and" are excluded. Because word variety is expected in good writing, it is possible that higher diversity is associated with higher holistic scores.
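
As a rough illustration of the two diversity measures, the sketch below computes Simpson's index as the probability that two word positions drawn without replacement hold the same word, and the ratio of distinct to total content words. The tokenizer and the short stop-word list are hypothetical stand-ins; the tokenization and the content-word list actually used with e-rater are not described here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the real content-word definition may differ.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def simpson_index(words):
    """Probability that two distinct word positions hold the same word."""
    n = len(words)
    if n < 2:
        raise ValueError("at least two words are required")
    counts = Counter(words)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def type_token_ratio(words):
    """Distinct content words divided by total content words."""
    content = [w for w in words if w not in STOP_WORDS]
    if not content:
        raise ValueError("no content words found")
    return len(set(content)) / len(content)


words = tokenize("The cat and the dog saw the cat")
s_value = simpson_index(words)       # 8 matching ordered pairs out of 56
t_value = type_token_ratio(words)    # 3 distinct of 4 content words
```

Note that a larger Simpson index indicates more repetition, so higher diversity corresponds to a smaller value of S and a larger value of T.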

Selection of Specific Words
Let Z_j be the jth most frequently used content word among all essays available for a particular prompt, and let F_j be the number of times Z_j appears in an essay. The variable U_j is (F_j/M)^{1/2}. Two other variables, ψ and e_6, used in the regression analysis are obtained from the content vector analysis of e-rater (Attali & Burstein, 2004; Burstein et al., 2004; Haberman, 2004). The variable ψ is the score group with the highest similarity measure to the observed essay in terms of the observed ratios F_j/M, and e_6 is a cosine measure of similarity of the F_j/M to the observed F_j/M in the highest score group of essays. The variables ψ and e_6 are not entirely satisfactory for use in the analysis considered in this article because their calculation is affected by essays other than the essay under study. They are considered in this report to provide some indication of the behavior of the regression used in e-rater; however, any results involving ψ and e_6 should be approached with great caution. The definition of U_j is also affected by the specific essays found in the sample, but the effect is rather small in large samples (Haberman, 2004). The theory behind these measures is that the content of the essay should be related to the holistic score.

Sources of Variables
Variables W, C, A, N_G, N_U, N_M, N_S, R_G, R_U, R_M, R_S, R_T, D, D_8, L, B_5, B, T, ψ, and e_6 are computed by e-rater software. The variables S and U_j were obtained by one of the authors (Haberman, 2004).

Covariate Vectors Used
In all, seven covariate vectors were considered. These vectors are a sample of possible vectors to be used. One consideration in choice of variables was the variables already used in e-rater. A second consideration involved modifications of variables in e-rater to improve basic statistical properties in regression analysis.

Vector 0 provides results based on Kelley’s formula. No predictor other than a constant is employed so that the linear predictor f is the mean E(θ) and α is τ².

In Vector 1, the elements were the number of words W, W², the average argument length L, the truncated number D_8 of discourse elements, the error rate R_G for grammar, the error rate R_U for usage, the error rate R_M for mechanics, the error rate R_S for style, the average word length A, the diversity measure T, the measures ψ and e_6 from content vector analysis, and the measure B_5 of the fifth lowest Standard Frequency Index. This vector is the one used in e-rater Version 2.0.

In Vector 2, the e-rater variables from content vector analysis were removed from Vector 1 so that the elements were W, W², L, D_8, R_G, R_U, R_M, R_S, A, T, and B_5. This omission is considered to eliminate variables defined by reference to essays other than the one to be rated.

In Vector 3, the only variables are the logarithm log(C) of the number of characters and the logarithm log(R_T) of the overall error rate. This vector is a rather minimal selection that only considers a length measure and an error rate measure. The logarithms were selected to improve distributions of independent variables by reduction of the influence of outliers and to increase the multiple correlation coefficient of the average holistic score with the independent variables. Some measure of essay length is included here due to the rather high correlation of essay length to holistic score.

In Vector 4, Vector 3 is supplemented by the median Standard Frequency Index B so that log(C), log(R_T), and B are the coordinates. Addition of B provides a measure of vocabulary level.

In Vector 5, the square root C^{1/2} of the number of characters, average word length A, the square root (R_G + R_U)^{1/2} of the combined error rate for grammar and usage, the square root R_P^{1/2} of the error rate for spelling, the square root (R_M − R_P)^{1/2} of the error rate for other mechanical errors, and the median SFI B are the covariates. This choice is based on empirical work by one of the authors (Haberman, 2004). There is a length measure, a word length measure, error rate measures that reflect types of errors that appear to correlate with holistic scores, and a vocabulary measure. Again, transformations are designed to improve the distributions of the independent variables and to increase the multiple correlation coefficient of the average holistic score with the independent variables.

In Vector 6, C^{1/2}, (R_G + R_U)^{1/2}, R_P^{1/2}, (R_M − R_P)^{1/2}, B, and S^{1/2} are the covariates. The measure of word length A has been replaced by the square root of the Simpson index of word diversity.

In Vector 7, C^{1/2}, (R_G + R_U)^{1/2}, R_P^{1/2}, (R_M − R_P)^{1/2}, B, S^{1/2}, and U_j, 1 ≤ j ≤ 50, are the covariates. Thus, Vector 6 is supplemented by measures of specific word choice.

These vectors are examples of possible choices. Many others are quite reasonable. One possible improvement of the analysis that is not possible from the available data is to include independent variables that provide information on the raters. It is quite possible that rater effects are sufficient to affect analysis. It is also possible that the number of papers previously rated by the same rater has an effect on the score and that the scores previously assigned by the rater have an effect.

Results
Results are summarized in Tables 1 and 2. In Table 1, the sample size and the variance estimates V̂(e) are provided for each prompt. In Table 2, variance estimates V̂(u) for the residual u, estimated weights α̂, and estimated variances V̂(v) for the difference θ − t are provided. Of note is the consistent finding that the estimated optimal weight on the holistic score is less than 0.5, with the optimal weight at times less than 0.2. For each prompt, it is possible to find a vector of covariates such that the estimated variance of v is less than 0.1. The covariates used in e-rater perform quite well relative to other selections, although interpretation of results is complicated if e_6 and ψ are included. It is worth noting that an appreciable improvement in results, especially for GRAD prompts, is achieved by use of more U_j terms than are found in Vector 7. For instance, in the first GRAD prompt, use of the first 172 of the U_j rather than just the first 50 yields V̂(v) of 0.059, whereas in the second GRAD prompt, use of the first 174 of the U_j yields V̂(v) of 0.033 (Haberman, 2004).


TABLE 1 Variability of Holistic Scores

 

TABLE 2 Mean Squared Errors and Weights for Covariate Vectors of Section 3

 
For some perspective on these results, note that the estimated mean squared error from use of the average of $m$ holistic scores to predict $\theta$ is $V(e)/m$. For the first GRAD prompt, it follows that the average of holistic scores from five raters yields a mean squared error comparable to that provided by one rater and a careful selection of features. Achievable results for TOEFL are comparable to those for three or four readers.
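The comparison with multiple raters can be checked with a short calculation. The sketch below uses two values reported in this section for the first GRAD prompt: 0.356 for the estimated mean squared error of a single holistic score, and 0.081 for the combined estimate with the seventh covariate vector. The function names are illustrative, and the calculation relies only on the stated $V(e)/m$ relationship.

```python
import math


def mse_of_rater_average(v_e: float, m: int) -> float:
    """Estimated mean squared error when m holistic scores are averaged: V(e)/m."""
    return v_e / m


def raters_needed(v_e: float, target_mse: float) -> int:
    """Smallest number of raters whose average has MSE at or below target_mse."""
    return math.ceil(v_e / target_mse)


V_E_GRAD1 = 0.356       # single-rater MSE, first GRAD prompt
MSE_COMBINED = 0.081    # combined estimate, Vector 7, first GRAD prompt

if __name__ == "__main__":
    for m in range(1, 7):
        print(m, round(mse_of_rater_average(V_E_GRAD1, m), 4))
    print("raters needed:", raters_needed(V_E_GRAD1, MSE_COMBINED))
```

With these inputs, four raters give an estimated mean squared error of 0.089 and five give about 0.071, so five is the first count that falls at or below 0.081.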

An alternative perspective involves percentage reductions in error. Consider the first GRAD prompt and the seventh covariate vector. The estimated mean squared error obtained by use of $h$ to estimate $\theta$ is 0.356. If Kelley’s formula is used, then the estimated mean squared error drops to 0.267, a reduction of about 25%. This reduction is just one minus the estimated rater reliability. If the linear predictor $f$ based on covariates is used to estimate $\theta$, then the mean squared error is estimated to be 0.105. This mean squared error is about 71% less than the mean squared error associated with the direct predictor $h$ and about 61% less than the mean squared error associated with Kelley’s formula. In the case of the combined estimate $t$, the estimated mean squared error is 0.081, a value about 77.2% smaller than the original mean squared error from the direct estimate and about 70% smaller than the estimated mean squared error from use of Kelley’s formula. The gain over use of $f$ is more modest: the reduction is only about 22.8%.
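The percentage reductions quoted above follow from the four estimated mean squared errors by simple arithmetic. A minimal sketch, with illustrative names and the rounded values from the text as inputs:

```python
def pct_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` is smaller than `baseline`."""
    return 100.0 * (1.0 - new / baseline)


# Estimated mean squared errors, first GRAD prompt, seventh covariate vector.
mse = {
    "h": 0.356,       # direct estimate (single holistic score)
    "kelley": 0.267,  # Kelley's formula
    "f": 0.105,       # linear predictor based on covariates only
    "t": 0.081,       # combined estimate
}

if __name__ == "__main__":
    pairs = [("h", "kelley"), ("h", "f"), ("kelley", "f"),
             ("h", "t"), ("kelley", "t"), ("f", "t")]
    for base, new in pairs:
        print(f"{new} vs {base}: {pct_reduction(mse[base], mse[new]):.1f}%")
```

Because the inputs are rounded to three decimals, the computed values agree with the quoted percentages only up to rounding. The gain of $t$ over $f$, for example, comes out near 22.9% rather than 22.8%.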


    4. Findings
 
This study determines the best linear predictor of a true score based on a direct estimate and a vector of covariates and determines the resulting mean squared error. A simple estimation procedure is also presented for this linear predictor. Application of results to essay scoring suggests that the true score for holistic essay scores assigned by raters can be estimated with relatively good accuracy by use of one human rater and by use of covariates generated by computer analysis of essays.

The proposed estimation procedure differs considerably from the procedure found in GRAD in that a continuous approximation of the true essay score is produced that gives the human holistic score for the essay a relatively small weight. Use of the continuous approximation requires the view that there is a population of raters who might grade an essay and that the distribution of their holistic scores has a mean and a variance. In this framework, there is no pretense that the essay has a true rating, an integer from 1 to 6, that an infinitely skilled reader would assign.

Given that the mean squared error of the proposed essay rating is somewhat smaller than the mean squared error of the current system of score assignment, it is plausible that the proposed weighting might improve reliability and validity of essay scores; however, this possibility can only be verified with further research.

Improvement in holistic scoring by human readers can improve computer predictions of holistic scores. Random assignment of scorers to essays and models for rater effects can reduce mean squared error. In addition, it is not obvious that the customary scoring system is optimal in any sense. The analysis here is bound by the scoring rubric that exists.

The proposed method of essay scoring has potential problems. It is not clear whether the public can be persuaded that a reduced weight on holistic scores assigned by raters is desirable, no matter what statistical arguments may be made. Perhaps this potential concern can be reduced by emphasizing that the essay features used by the computer analysis do provide measures of writing quality that are strongly related to holistic scores and that the collection of holistic scores of essay responses has been employed to determine the final predictor of the essay score.

A further potential difficulty is that behavior of essay writers might change if they are aware of the scoring procedure used to evaluate the essay. Exploiting this knowledge might be difficult in practice, and in any event, research concerning the relationship of essay features to holistic scores assigned by readers is publicly available, at least to a substantial extent (Haberman, 2004).

In conclusion, it appears that the proposed regression-based method of essay assessment should be seriously considered in those cases in which essays are available in computer-readable form and in which holistic scores are assigned by raters.


    Footnotes
 
SHELBY J. HABERMAN is Research Director, Statistical and Psychometric Theory and Practice, Educational Testing Service, Rosedale Road, Princeton, NJ 08541; SHaberman@ets.org. His principal research interests are analysis of qualitative data, asymptotic approximations, and application of statistics to educational measurement.

JIAHE QIAN is Senior Research Scientist, Statistical and Psychometric Theory and Practice, Educational Testing Service, Rosedale Road, Princeton, NJ 08541; JQian@ets.org. His principal research interests are sampling techniques and the analysis of complex surveys in educational measurement.

The authors thank John Mazzeo and Ida Lawrence for their support and suggestions. Any opinions expressed in this article are those of the authors and not necessarily of Educational Testing Service.

Manuscript received July 25, 2004. Accepted for publication July 18, 2005.


    References
 
Attali, Y., & Burstein, J. (2004, June). Automated essay scoring with e-rater v.2.0. Paper presented at the Annual Conference of the International Association for Educational Assessment (IAEA), Philadelphia.

Bock, R. D., & Petersen, A. C. (1975). A multivariate correction for attenuation. Biometrika, 62, 673-678.

Breland, H. M. (1996). Word frequency and word difficulty: A comparison of counts in four corpora. Psychological Science, 7, 96-99.

Breland, H. M., Jones, R. J., & Jenkins, L. (1994). The college board vocabulary study (College Board Report No. 94-4). New York: College Board.

Burstein, J., Chodorow, M., & Leacock, C. (2004). Automated essay evaluation: The criterion online service. AI Magazine, 25, 27-36.

Gini, C. (1912). Variabilità e mutabilità: Contributo allo studio delle distribuzioni e delle relazioni statistiche [Variability and mutability: Contribution to the study of distributions and statistical relations]. Bologna, Italy: Cuppini.

Haberman, S. J. (2004). Statistical and measurement properties of features used in essay assessment (Research Report RR-04-21). Princeton, NJ: Educational Testing Service.

Holland, P. W., & Hoskens, M. (2003). Classical test theory as a first-order item response theory: Application to true-score prediction from a possibly non-parallel test. Psychometrika, 68, 123-149.

Kelley, T. L. (1947). Fundamentals of statistics. Cambridge, MA: Harvard University Press.

Kendall, M., & Stuart, A. (1977). The advanced theory of statistics (Vol. 1, 4th ed.). New York: Macmillan.

Longford, N. T. (1997). Shrinkage estimation of linear combinations of true scores. Psychometrika, 62, 237-244.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Qian, J., & Haberman, S. J. (2003, August). The best linear predictor for true score from a direct estimate and a derived estimate. Paper presented at the annual Joint Statistical Meetings of the American Statistical Association, San Francisco.

Rao, C. R. (1973). Linear statistical inference and its applications. New York: John Wiley.

Simpson, E. H. (1949). The measurement of diversity. Nature, 163, 688.

Wainer, H., & Thissen, D. (2001). True score theory: The traditional method. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 23-72). Mahwah, NJ: Lawrence Erlbaum.

Wainer, H., Vevea, J. L., Camacho, F., Reeve, B. B., Rosa, K., Nelson, L., et al. (2001). Augmented scores: "Borrowing strength" to compute scores based on small numbers of items. In D. Thissen & H. Wainer (Eds.), Test scoring (pp. 343-387). Mahwah, NJ: Lawrence Erlbaum.

Journal of Educational and Behavioral Statistics, Vol. 32, No. 1, 6-23 (2007)
DOI: 10.3102/1076998606298036

