Linear Prediction of a True Score From a Direct Estimate and Several Derived Estimates
Educational Testing Service
Statistical prediction problems often involve both a direct estimate of a true score and covariates of this true score. Given the criterion of mean squared error, this study determines the best linear predictor of the true score given the direct estimate and the covariates. Results yield an extension of Kelley's formula for estimation of the true score to cases in which covariates are present. The best linear predictor is a weighted average of the direct estimate and of the linear regression of the direct estimate onto the covariates. The weights depend on the reliability of the direct estimate and on the multiple correlation of the true score with the covariates. One application of the best linear predictor is to use essay features provided by computer analysis and an observed holistic score of an essay provided by a human rater to approximate the true score corresponding to the holistic score.
Key Words: Kelley's formula; mean squared error; automatic essay scoring; reliability
Statistical prediction of a true score on a test may involve both direct estimation of a true score and covariates related to the true score. For example, in a graduate admission test denoted by GRAD in this article, a final essay score was based on a holistic score provided by a human rater, the direct estimate, and essay features such as number of words in the essay, error rates per word in grammar or usage, and numerical measures of word diversity. The essay features, the covariates, were determined by computer analysis of the essay (Attali & Burstein, 2004). The procedure in GRAD for essays employed an integer holistic score in the range 1 to 6 and an integer e-rater score between 1 and 6 generated from computer analysis. Normally, the reported score was the average of the holistic score from the reader and of the e-rater score; however, an additional reader was employed if the reader score and e-rater score differed by more than 1. The approach used in GRAD was not necessarily an optimal approach to assignment of a final score to an essay. This remark applies even if the true essay score is regarded as the average holistic score an essay would receive if rated by an arbitrarily large number of raters (Lord & Novick, 1968). In this study, a continuation of work presented earlier (Qian & Haberman, 2003), the criterion of mean squared error is used to determine the best linear predictor of a true score based on a direct estimate and on covariates. In Section 1, this predictor is considered under the assumption that all relevant population parameters are known. In this ideal case, the best linear predictor is shown to be a weighted average of two components. The first component is the direct estimate. The second component is the regression of the direct estimate onto the covariates. The weights assigned to the components depend on the reliability of the direct estimate and on the multiple correlation between the direct estimate and the covariates. 
The mean squared error of the optimal linear predictor is shown to depend on the variance of the direct estimate, reliability of the direct estimate, and multiple correlation of the true score and the covariates. Results of this section can be regarded as a generalization of Kelley's formula to the case of covariates (Kelley, 1947). Required arguments are familiar from treatments of linear prediction in classical test theory (Holland & Hoskens, 2003; Lord & Novick, 1968). Results are related to other efforts to combine information from several tests to provide improved estimation of the true scores for each of the tests or for a composite test (Longford, 1997; Wainer et al., 2001; Wainer & Thissen, 2001). As evident from the cited references, arguments used here are readily related to arguments used in Bayesian inference. In Section 2, estimation of the best linear predictor and of the mean squared error is considered. Estimation is described for a simple random sample of essays from a large population. Because reliability must be estimated, it is assumed that at least for some essays, more than one independently obtained holistic score is available in the sample. This assumption has commonly been fulfilled in applications of e-rater. For each essay in the sample, it is assumed that at least one holistic score and all covariates are observed. Given these data, estimation of parameters is relatively straightforward, at least for large samples. Standard treatments of classical test theory provide basic background (Lord & Novick, 1968), as do classical treatments of statistical inference (Rao, 1973). Some readers may recognize relationships to empirical Bayesian inference (Wainer & Thissen, 2001). In Section 3, the methods developed in Sections 1 and 2 are applied to essays from GRAD and from the Test of English as a Foreign Language (TOEFL). A notable feature of the analysis is the relatively low weight assigned to the holistic score provided by the reader.
This result reflects some limitations in the reliability of holistic scores and a relatively high multiple correlation of holistic scores and computer-derived essay features. As discussed in Section 4, results in this report suggest that scoring procedures such as those used in GRAD give considerably higher weight to computer-generated essay features than has generally been the case. Policy issues may arise that involve public perceptions concerning the reduced weight given to the human rater, and there is some question to consider concerning the effect on examinee performance if they are aware that a very large fraction of the grade on their essay is determined by a computer program.
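The GRAD scoring rule described in the introduction can be written out as a small function. This is only a sketch of the rule as stated there: the function name and return convention are hypothetical, and because the text does not say how the additional reader's score is combined, the sketch simply flags that case instead of guessing.

```python
from typing import Optional, Tuple


def grad_reported_score(holistic: int, erater: int) -> Tuple[Optional[float], bool]:
    """Return (reported_score, needs_additional_reader).

    Hypothetical helper: averages the reader's holistic score and the
    e-rater score unless they differ by more than 1 point.
    """
    for name, score in (("holistic", holistic), ("erater", erater)):
        if not 1 <= score <= 6:
            raise ValueError(f"{name} score must be an integer from 1 to 6")
    if abs(holistic - erater) > 1:
        # The source does not specify how the extra reading is combined,
        # so no score is returned in this case.
        return None, True
    return (holistic + erater) / 2, False


# Illustrative calls only (not data from the article)
agree = grad_reported_score(4, 5)
disagree = grad_reported_score(2, 5)
```

The rule produces scores on a half-point grid, which is one reason it need not be optimal under mean squared error.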
To obtain the best linear predictor of the true score from a direct estimate and from the available covariates, some elementary notation and a basic probability model are required. Let $\tau$, the true score, be a random variable with expectation $E(\tau)$ and positive variance $V(\tau)$, and let $h$, the direct estimate, be a random variable such that the error $e = h - \tau$ in estimation of $\tau$ has expectation 0 and positive variance $V(e)$. Let $e$ and $\tau$ be uncorrelated (Lord & Novick, 1968). Then the observed score $h$ has mean $E(h) = E(\tau)$ and variance
$$V(h) = V(\tau) + V(e), \tag{1}$$
and the covariance $C(h, \tau)$ of $h$ and $\tau$ is $V(\tau)$, so that the reliability coefficient of $h$ is
$$\rho^2 = \frac{V(\tau)}{V(h)} \tag{2}$$
(Lord & Novick, 1968). Under the assumptions made concerning the variances of the true score $\tau$ and the error $e$, $\rho^2$ must be positive and less than 1.
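Let $d$ be a $q$-dimensional vector of covariates $d_j$, $1 \le j \le q$. To describe the best linear predictor of the true score, first consider the standard formula for the best linear predictor of the direct estimate $h$ based on the covariate vector $d$. For $q$-dimensional vectors $x$ and $y$ with respective coordinates $x_i$ and $y_i$, let
$$x'y = \sum_{i=1}^{q} x_i y_i. \tag{3}$$
Then the best linear predictor of $h$ from $d$ is
$$f = E(h) + \beta'[d - E(d)], \tag{4}$$
where the vector of regression coefficients is
$$\beta = [C(d)]^{-1} C(d, h). \tag{5}$$
Note that $C(d, h)$ is the vector of covariances of $d_j$ and $h$ for $1 \le j \le q$, and $C(d)$ is the covariance matrix of $d$.
The best linear predictor of the direct estimate $h$ from the covariate vector $d$ is the same as the best linear predictor of the true score $\tau$ from $d$, because the covariance vectors $C(d, h)$ and $C(d, \tau)$ coincide, the error $e$ being uncorrelated with $d$. Thus
$$f = E(\tau) + \beta'[d - E(d)],$$
where the vector $\beta$ of Equation 4 also satisfies
$$\beta = [C(d)]^{-1} C(d, \tau).$$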
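As a rough numerical illustration of the regression step, the sketch below solves $C(d)\beta = C(d, h)$ for the coefficient vector and evaluates $f = E(h) + \beta'[d - E(d)]$. It assumes the population moments are known, as in this section. The function names and all numbers are hypothetical, and a plain Gaussian elimination is used only to keep the example free of external libraries.

```python
from typing import List, Sequence


def solve_linear(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def regression_coefficients(cov_d, cov_dh) -> List[float]:
    """beta = C(d)^{-1} C(d, h), computed without forming the inverse."""
    return solve_linear(cov_d, cov_dh)


def predict_from_covariates(mean_h, beta, d, mean_d) -> float:
    """f = E(h) + beta'[d - E(d)]."""
    return mean_h + sum(b * (x - mu) for b, x, mu in zip(beta, d, mean_d))


# Illustrative moments only (not taken from the article)
cov_d = [[2.0, 0.0], [0.0, 4.0]]   # C(d)
cov_dh = [1.0, 2.0]                # C(d, h)
beta = regression_coefficients(cov_d, cov_dh)
f = predict_from_covariates(3.5, beta, d=[1.0, 2.0], mean_d=[0.0, 0.0])
```

Because $C(d, h) = C(d, \tau)$ under the model, the same coefficients serve for prediction of the true score from the covariates.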
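The residual for prediction of the direct estimate $h$ by the covariate vector $d$ is $r = h - f$. The corresponding residual for prediction of the true score $\tau$ is $u = \tau - f$, so that $r = u + e$. The mean squared error for linear prediction of the direct estimate $h$ by the covariate vector $d$ is then
$$V(r) = V(h) - V(f) = V(h)[1 - \rho^2(h, d)],$$
where $\rho(h, d)$ is the multiple correlation coefficient of the direct estimate $h$ and the covariate vector $d$ (Rao, 1973), so that
$$\rho^2(h, d) = \frac{V(f)}{V(h)}.$$
In like manner, the mean squared error for linear prediction of the true score $\tau$ by the covariate vector $d$ is
$$V(u) = V(\tau) - V(f) = V(r) - V(e).$$
It is assumed in this article that the residual variance $V(u)$ is positive so that the true score is not determined by an affine function of the covariate vector $d$. This assumption implies that $V(f) < V(\tau)$.
In analogy to the results for $h$, the multiple correlation $\rho(\tau, d)$ of the true score $\tau$ and the covariate vector $d$ satisfies
$$V(u) = V(\tau)[1 - \rho^2(\tau, d)]$$
and
$$\rho^2(\tau, d) = \frac{V(f)}{V(\tau)}.$$
By Equation 2 and the two expressions for $V(f)$,
$$\rho^2(\tau, d) = \frac{\rho^2(h, d)}{\rho^2},$$
so that the multiple correlation coefficient of the true score $\tau$ and the covariate vector $d$ is at least as large as the multiple correlation coefficient of the direct estimate $h$ and $d$.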
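Given these basic results, it is then relatively easily shown that the best linear predictor of the true score $\tau$ from the direct estimate $h$ and the covariate vector $d$ is the weighted average
$$t = \alpha h + (1 - \alpha) f \tag{16}$$
of the direct estimate $h$ and of the best linear predictor $f$ of $\tau$ from $d$.
Prior to verification of Equation 16, it is helpful to interpret the result. The weight $\alpha$ assigned to the direct estimate $h$ is the ratio
$$\alpha = \frac{V(u)}{V(r)} = \frac{V(u)}{V(u) + V(e)} \tag{17}$$
of the residual variance $V(u)$ from linear prediction of the true score $\tau$ by the covariate vector $d$ to the residual variance $V(r)$ from linear prediction of the direct estimate $h$ by $d$.
In Equation 17, the weight $\alpha$ is positive and less than 1 because $V(u)$ and $V(e)$ are both positive. In terms of the reliability $\rho^2$ and the multiple correlation $\rho(\tau, d)$,
$$\alpha = \frac{\rho^2[1 - \rho^2(\tau, d)]}{1 - \rho^2 \rho^2(\tau, d)}.$$
The weight $1 - \alpha$ assigned to $f$ is $V(e)/V(r)$. If no covariates are available, so that $f = E(h)$ and $\rho(\tau, d) = 0$, then $\alpha = \rho^2$ and
$$t = \rho^2 h + (1 - \rho^2) E(h)$$
reduces to the estimate of Kelley's formula. In general, $\alpha$ does not exceed $\rho^2$.
To verify that the best linear predictor $t$ of the true score satisfies Equation 16, consider the mean squared error
$$E([\tau - a - b'd - ch]^2)$$
from prediction of the true score $\tau$ by a linear function $a + b'd + ch$ of $d$ and $h$, where $a$ and $c$ are real and $b$ is a $q$-dimensional vector. For fixed $c$, the mean squared error is minimized if
$$b = [C(d)]^{-1} C(d, \tau - ch)$$
and
$$a = E(\tau - ch) - b'E(d)$$
(Rao, 1973). Recall that the covariance vector $C(d, h)$ is the same as the covariance vector $C(d, \tau)$, so that $b = (1 - c)\beta$ and $a = (1 - c)[E(h) - \beta'E(d)]$.
It then follows that the predictor is $ch + (1 - c)f$, with residual
$$\tau - ch - (1 - c)f = (1 - c)u - ce.$$
Because $u$ and $e$ are uncorrelated, the mean squared error is $(1 - c)^2 V(u) + c^2 V(e)$, and the optimal $c$ is
$$c = \frac{V(u)}{V(u) + V(e)} = \alpha.$$
The residual from prediction of $\tau$ by $t$ is then
$$v = \tau - t = (1 - \alpha)u - \alpha e.$$
Because $u$ and $e$ have 0 expectations, $v$ also has 0 expectation. Because $u$, a linear function of $\tau$ and $d$, is uncorrelated with $e$,
$$V(v) = (1 - \alpha)^2 V(u) + \alpha^2 V(e).$$
Application of Equation 17 yields
$$V(v) = \alpha V(e) = (1 - \alpha)V(u) = \frac{V(u)V(e)}{V(u) + V(e)}. \tag{25}$$
Equivalently, $V(v) = V(h)\rho^2(1 - \rho^2)[1 - \rho^2(\tau, d)]/[1 - \rho^2\rho^2(\tau, d)]$.
Note that $V(v)$ is less than either the variance $V(e)$ of the error of the direct estimate or the variance $V(u)$ of the error from use of the predictor $f$ as an estimate of the true score $\tau$.
In the analysis, it is important to distinguish between the best linear predictor and the best general predictor. In general, the optimal predictor of $\tau$ under mean squared error is the conditional expectation of $\tau$ given $h$ and $d$, which need not be a linear function of $h$ and $d$.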
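The weighted-average predictor of this section can be illustrated numerically. The sketch below assumes the classical model stated above (error uncorrelated with the true score and with the covariates), under which the weight on the direct estimate is $V(u)/[V(u) + V(e)]$. Written in terms of the reliability $\rho^2$ and the squared multiple correlation $\rho^2(\tau, d)$, this is $\rho^2[1 - \rho^2(\tau, d)]/[1 - \rho^2\rho^2(\tau, d)]$. Function names and inputs are hypothetical.

```python
def true_score_r2(r2_h_d: float, reliability: float) -> float:
    """Squared multiple correlation of the true score with the covariates,
    obtained by dividing the observed-score value by the reliability."""
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must lie strictly between 0 and 1")
    r2 = r2_h_d / reliability
    if not 0.0 <= r2 < 1.0:
        raise ValueError("implied true-score R^2 must lie in [0, 1)")
    return r2


def direct_weight(reliability: float, r2_tau_d: float) -> float:
    """Weight alpha placed on the direct estimate h."""
    return reliability * (1.0 - r2_tau_d) / (1.0 - reliability * r2_tau_d)


def predict_true_score(h: float, f: float, alpha: float) -> float:
    """t = alpha * h + (1 - alpha) * f."""
    return alpha * h + (1.0 - alpha) * f


def predictor_mse(var_h: float, reliability: float, r2_tau_d: float) -> float:
    """V(v) = alpha * V(e), with V(e) = (1 - reliability) * V(h)."""
    var_e = (1.0 - reliability) * var_h
    return direct_weight(reliability, r2_tau_d) * var_e


# Illustrative values only (not estimates from the article)
r2_tau = true_score_r2(r2_h_d=0.25, reliability=0.5)
alpha = direct_weight(0.5, r2_tau)
t = predict_true_score(h=5.0, f=3.5, alpha=alpha)
mse = predictor_mse(var_h=6.0, reliability=0.5, r2_tau_d=r2_tau)
kelley_weight = direct_weight(0.5, 0.0)  # no covariates: weight equals reliability
```

With no covariates the weight equals the reliability, which is the Kelley case; as the covariates become more predictive of the true score, the weight on the direct estimate falls.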
To estimate the best linear predictor $t$, consider a random sample of size $n > q + 1$ from the population used to define $t$. Assume that the underlying population is either infinite or so large that finite sampling corrections can be ignored. For each observation $i$, $1 \le i \le n$, let $m_i \ge 1$ direct estimates $h_{ij}$, $1 \le j \le m_i$, be available, and assume that at least one $m_i$ exceeds 1 and that the $m_i$ are selected without regard to any characteristics of the essays under study. The requirement of some multiple direct estimates is essential to determine the variance $V(e)$ of the errors $e_i$. In use of e-rater, essays used to construct the regression analysis are assessed by more than one rater so that the requirement imposed here is consistent with current practice with e-rater. The assumption that selection of essays for multiple grading is independent of essay characteristics is not necessarily valid in practice to the extent that formal randomization procedures for essay selection are not used. In the analysis of essays in Section 3, each $m_i$ will be 2; however, little is lost by consideration of the more general case. In addition, cost considerations will often require that most $m_i$ be 1. Nonetheless, estimation of the error variance $V(e)$ with reasonable accuracy will normally require several hundred essays that are multiply scored.
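One common way to estimate an error variance from multiply scored essays is the pooled within-essay variance. The sketch below uses that estimator as an assumption; it is not necessarily the estimator used by the authors, and the function name and data are hypothetical. Essays with a single score contribute no degrees of freedom, which is why at least one essay must be scored more than once.

```python
from typing import Sequence


def pooled_error_variance(scores: Sequence[Sequence[float]]) -> float:
    """Pooled within-essay variance of the direct estimates.

    scores[i] holds the m_i holistic scores for essay i (hypothetical layout).
    """
    sum_sq = 0.0
    dof = 0
    for ratings in scores:
        m = len(ratings)
        if m < 1:
            raise ValueError("every essay needs at least one score")
        mean = sum(ratings) / m
        sum_sq += sum((x - mean) ** 2 for x in ratings)
        dof += m - 1  # singly scored essays add nothing here
    if dof == 0:
        raise ValueError("at least one essay must be scored more than once")
    return sum_sq / dof


# Illustrative scores only (not data from the article)
v_e_hat = pooled_error_variance([[3, 4], [2, 2], [5]])
```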
Let the true score for observation i be
To assist in some formulas, a variable
is the harmonic mean of the mi (Kendall & Stuart, 1977). If m is an integer and mi is at least m, then
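Given these conditions, estimation of the best linear predictor $t$ is straightforward. For each observation $i$, let $\bar{h}_i$ be the average of the $m_i$ direct estimates $h_{ij}$, $1 \le j \le m_i$.
The expectation $E(d)$ is then estimated by the sample mean
$$\bar{d} = n^{-1} \sum_{i=1}^{n} d_i,$$
where $d_i$ is the covariate vector for observation $i$.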
The covariance matrix C(d) is estimated by the sample covariance
where xy' is the q by q matrix with elements xjyk for 1
Thus, the vector
The approximation to f is then
For observation i, the covariate vector di predicts an observed score
To complete estimation, it is necessary to approximate
The case of the residual variance V(u) is a bit more complex. Let
be the residual from regression of the
is a consistent estimate of the variance
If d has a continuous distribution, if each mi is m, and if the residual u is independent of d, then
At this point, the natural estimate of
The only complication is that
Given
Given Equation 25, the mean square error V(v) may then be approximated by
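The final estimation steps can be sketched for the simple case in which every essay has the same number $m$ of scores. The sketch rests on assumptions that are not confirmed by the text: the residual variance from regressing the average score on the covariates is taken to estimate $V(u) + V(e)/m$, and the implied estimate of $V(u)$ is truncated at 0. All names are hypothetical.

```python
from typing import Tuple


def estimate_weight(resid_var_mean_score: float,
                    error_var: float,
                    m: int) -> Tuple[float, float, float]:
    """Return (V(u) estimate, alpha estimate, V(v) estimate).

    resid_var_mean_score: residual variance from regressing the average of
        m direct estimates on the covariates (assumed input).
    error_var: estimate of V(e) for a single direct estimate, must be > 0.
    """
    if error_var <= 0.0 or m < 1:
        raise ValueError("error_var must be positive and m at least 1")
    # Subtract the part of the residual variance that is due to rater error;
    # truncate at 0 so the variance estimate cannot be negative.
    var_u = max(0.0, resid_var_mean_score - error_var / m)
    alpha = var_u / (var_u + error_var)  # weight for one direct estimate
    mse = alpha * error_var              # V(v) = alpha * V(e)
    return var_u, alpha, mse


# Illustrative inputs only (not estimates from the article)
var_u_hat, alpha_hat, mse_hat = estimate_weight(2.0, 1.0, m=2)
```

The estimated predictor then combines an observed score and its regression prediction with weights `alpha_hat` and `1 - alpha_hat`.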
The results of Sections 1 and 2 are readily applied to essay assessment. In this section, data and variables used in the analysis are described, and results of the analysis are presented.
Data Sources and Prompts Used in Essay Assessment
Covariates in the Analysis
Number of Words
Number of Characters
Average Word Length
Error Rates
Number of Arguments
Average Argument Length
Standard Frequency Index
Measures of Word Diversity
Selection of Specific Words
Sources of Variables
Covariate Vectors Used
Vector 0 provides results based on Kelley's formula. No predictor other than a constant is employed so that the linear predictor $f$ is the mean $E(\tau)$.
In Vector 1, the elements were the number of words $W$, $W^2$, the average argument length $L$, the truncated number $D_8$ of discourse elements, the error rate $R_G$ for grammar, the error rate $R_U$ for usage, the error rate $R_M$ for mechanics, the error rate $R_S$ for style, the average word length $A$, the diversity measure $T$, and further measures, among them $e_6$, that are obtained from the content vector analysis of e-rater.
In Vector 2, the e-rater variables from content vector analysis were removed from Vector 1 so that the elements were $W$, $W^2$, $L$, $D_8$, $R_G$, $R_U$, $R_M$, $R_S$, $A$, $T$, and $B_5$. This omission is considered to eliminate variables defined by reference to essays other than the one to be rated.
In Vector 3, the only variables are the logarithm $\log(C)$ of the number of characters and the logarithm $\log(R_T)$ of the overall error rate. This vector is a rather minimal selection that only considers a length measure and an error rate measure. The logarithms were selected to improve distributions of independent variables by reduction of the influence of outliers and to increase the multiple correlation coefficient of the average holistic score with the independent variables. Some measure of essay length is included here due to the rather high correlation of essay length to holistic score.
In Vector 4, Vector 3 is supplemented by the median Standard Frequency Index $B$ so that $\log(C)$, $\log(R_T)$, and $B$ are the coordinates. Addition of $B$ provides a measure of vocabulary level.
In Vector 5, the square root $C^{1/2}$ of the number of characters, average word length $A$, the square root $(R_G + R_U)^{1/2}$ of the combined error rate for grammar and usage, the square root $R_P^{1/2}$ of the error rate for spelling, the square root $(R_M - R_P)^{1/2}$ of the error rate for other mechanical errors, and the median SFI $B$ are the covariates. This choice is based on empirical work by one of the authors (Haberman, 2004). There is a length measure, a word length measure, error rate measures that reflect types of errors that appear to correlate with holistic scores, and a vocabulary measure. Again, transformations are designed to improve the distributions of the independent variables and to increase the multiple correlation coefficient of the average holistic score with the independent variables.
In Vector 6, $C^{1/2}$, $(R_G + R_U)^{1/2}$, $R_P^{1/2}$, $(R_M - R_P)^{1/2}$, $B$, and $S^{1/2}$ are the covariates. The measure of word length $A$ has been replaced by the square root of the Simpson index of word diversity.
In Vector 7, $C^{1/2}$, $(R_G + R_U)^{1/2}$, $R_P^{1/2}$, $(R_M - R_P)^{1/2}$, $B$, $S^{1/2}$, and the variables $U_j$ are the covariates.
These vectors are examples of possible choices. Many others are quite reasonable. One possible improvement of the analysis that is not possible from the available data is to include independent variables that provide information on the raters. It is quite possible that rater effects are sufficient to affect analysis. It is also possible that the number of papers previously rated by the same rater has an effect on the score and that the scores previously assigned by the rater have an effect.
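The transformed covariates of Vectors 3 and 5 are straightforward to compute from raw essay features. The sketch below assumes the raw features arrive in a dictionary keyed by the symbols used in the text; that layout, the function names, and the feature values are hypothetical. The logarithms require a positive character count and a positive overall error rate.

```python
import math
from typing import Dict, List


def vector3(features: Dict[str, float]) -> List[float]:
    """Covariates log(C) and log(R_T): length and overall error rate."""
    return [math.log(features["C"]), math.log(features["RT"])]


def vector5(features: Dict[str, float]) -> List[float]:
    """Square-root length and error-rate measures, plus A and median SFI B."""
    return [
        math.sqrt(features["C"]),                    # number of characters
        features["A"],                                # average word length
        math.sqrt(features["RG"] + features["RU"]),   # grammar plus usage
        math.sqrt(features["RP"]),                    # spelling
        math.sqrt(features["RM"] - features["RP"]),   # other mechanics
        features["B"],                                # median SFI
    ]


# Illustrative feature values only (not data from the article)
essay = {"C": 1600.0, "A": 4.5, "RG": 0.01, "RU": 0.03,
         "RM": 0.05, "RP": 0.01, "RT": 0.1, "B": 60.0}
v3 = vector3(essay)
v5 = vector5(essay)
```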
Results
For some perspective on these results, note that the estimated mean squared error from use of the average of $m$ holistic scores to predict $\tau$ is the estimate of $V(e)$ divided by $m$. For the first GRAD prompt, it follows that the average of holistic scores from five raters yields a mean squared error comparable to that provided by one rater and a careful selection of features. Achievable results for TOEFL are comparable to those for three or four readers.
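The comparison with multiple raters is simple arithmetic: the average of $m$ holistic scores has mean squared error $V(e)/m$, so a predictor with mean squared error $V(v)$ is matched by roughly $V(e)/V(v)$ raters. The sketch below uses hypothetical function names and made-up variances, not the estimates reported in the article.

```python
def averaged_score_mse(error_var: float, m: int) -> float:
    """Mean squared error of the average of m holistic scores: V(e) / m."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return error_var / m


def equivalent_raters(error_var: float, predictor_mse: float) -> float:
    """Number of averaged raters whose error matches the predictor's."""
    if predictor_mse <= 0.0:
        raise ValueError("predictor_mse must be positive")
    return error_var / predictor_mse


# Illustrative variances only (not estimates from the article)
raters = equivalent_raters(error_var=1.0, predictor_mse=0.2)
five_rater_mse = averaged_score_mse(error_var=1.0, m=5)
```

When the predictor's mean squared error equals the weight on the direct estimate times $V(e)$, the equivalent number of raters is the reciprocal of that weight, so an equivalence to five raters corresponds to a weight of about one fifth.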
An alternative perspective involves percentage reductions in error. Consider the first GRAD prompt and the seventh covariate vector. The estimated mean squared error obtained by use of h to estimate
This study determines the best linear predictor of a true score based on a direct estimate and a vector of covariates and determines the resulting mean squared error. A simple estimation procedure is also presented for this linear predictor. Application of results to essay scoring suggests that the true score for holistic essay scores assigned by raters can be estimated with relatively good accuracy by use of one human rater and by use of covariates generated by computer analysis of essays. The proposed estimation procedure differs considerably from the procedure found in GRAD in that a continuous approximation of the true essay score is produced that gives the human holistic score for the essay a relatively small weight. Use of the continuous approximation requires the perception that there is a population of raters who might grade an essay and that there is a distribution of human holistic scores that has a mean and a variance. In this framework, there is no pretense that there is a true rating of the essay that is an integer from 1 to 6 that an infinitely skilled reader would provide the essay. Given that the mean squared error of the proposed essay rating is somewhat smaller than the mean squared error of the current system of score assignment, it is plausible that the proposed weighting might improve reliability and validity of essay scores; however, this possibility can only be verified with further research. Improvement in holistic scoring by human readers can improve computer predictions of holistic scores. Random assignment of scorers to essays and models for rater effects can reduce mean squared error. In addition, it is not obvious that the customary scoring system is optimal in any sense. The analysis here is bound by the scoring rubric that exists. The proposed method of essay scoring has potential problems. 
It is not clear whether the public can be persuaded that a reduced weight to holistic scores assigned by raters is desirable, no matter what statistical arguments may be made. Perhaps this potential concern can be reduced by emphasis that the essay features used by the computer analysis do provide measures of writing quality that are strongly related to holistic scores and that the collection of holistic scores of essay responses has been employed to determine the final predictor of the essay score. A further potential difficulty is that behavior of essay writers might change if they are aware of the scoring procedure used to evaluate the essay. Exploiting this knowledge might be difficult in practice, and in any event, research concerning the relationship of essay features to holistic scores assigned by readers is publicly available, at least to a substantial extent (Haberman, 2004). In conclusion, it appears that the proposed regression-based method of essay assessment should be seriously considered in those cases in which essays are available in computer-readable form and in which holistic scores are assigned by raters.
SHELBY J. HABERMAN is Research Director, Statistical and Psychometric Theory and Practice, Educational Testing Service, Rosedale Road, Princeton, NJ 08541; SHaberman@ets.org. His principal research interests are analysis of qualitative data, asymptotic approximations, and application of statistics to educational measurement.
JIAHE QIAN is Senior Research Scientist, Statistical and Psychometric Theory and Practice, Educational Testing Service, Rosedale Road, Princeton, NJ 08541; JQian@ets.org. His principal research interests are sampling techniques and the analysis of complex surveys in educational measurement.
The authors thank John Mazzeo and Ida Lawrence for their support and suggestions. Any opinions expressed in this article are those of the authors and not necessarily of Educational Testing Service. Manuscript received July 25, 2004. Accepted for publication July 18, 2005.
Attali, Y., & Burstein, J. (2004, June). Automated essay scoring with e-rater v.2.0. Paper presented at the Annual Conference of the International Association for Educational Assessment (IAEA), Philadelphia.
Bock, R. D., & Petersen, A. C. (1975). A multivariate correction for attenuation. Biometrika, 62, 673-678.
Journal of Educational and Behavioral Statistics, Vol. 32, No. 1, 6-23 (2007)