Presentation Transcript

Comparative Judgment as a Novel Approach to Operational Scoring, Rangefinding, and other Assessment Activities
Jeffrey Steedle and Steve Ferrara
Center for Next Generation Learning and Assessment
CCSSO National Conference on Student Assessment, June 24, 2015

Which of these essays is of higher quality?

Response A: "A time when i felt free was, when i finally got released from being in the hospital for four days. The reason i was in the hospital was because i had a kidney stones which hurted really bad that i couldn't eat and stand up straight. So i decided to go to the emergency room to see what was going on. This was before i found out i had kidney stones…"

Response B: "A time I felt like I was free was when I was fifteen years old. At age fifteen, everybody is curious and anxious to do things on there own without parental consent. I was just another one of those fifteen year olds anxious to get my turn at something, but then I learned how to drive. A lot of people enjoy driving around, some people do it because they have to get to their job or because they need to go from one place to another…"

Traditional, Rubric-Based Scoring
"Those responsible for test scoring should establish and document quality control processes and criteria. Adequate training should be provided. The quality of scoring should be monitored and documented. Any systematic source of scoring errors should be documented and corrected" (AERA, APA, & NCME, 2014).

Comparative Judgment
(Prompt and responses from http://tea.texas.gov/student.assessment/staar/writing/)

Comparative Judgment Background
- Not a new idea (Law of Comparative Judgment; Thurstone, 1927)
- Relative judgments are more accurate than absolute judgments for:
  - psychophysical phenomena (Stewart et al., 2005)
  - estimating distances, counting spelling errors (Shah et al., 2014)
  - evaluating physics and history exams (Gill & Bramley, 2008)
- Past uses in educational assessment:
  - Comparing the alignment of passing standards over time (Bramley, Bell, & Pollitt, 1998; Curcin et al., 2009)
  - Estimating item difficulty (Walker et al., 2005)
  - Scoring essays, portfolios, and short-answer responses (Pollitt, 2004; Whitehouse & Pollitt, 2012; Kimbell et al., 2009; Pollitt, 2012; Attali, 2014)

Comparative Judgments: Rubric-Based Scoring vs. Comparative Judgment

- Rubric-based scoring: Scorers must internalize the definition of each score point.
  Comparative judgment: Judges must internalize the definition of "quality."
- Rubric-based scoring: Scorers must agree exactly with the trainer and "anchor papers."
  Comparative judgment: Judges must agree with the trainer about the relative quality of responses.
- Rubric-based scoring: Lengthy training and qualification (e.g., 16 hours).
  Comparative judgment: Brief training and qualification (e.g., 3 hours).
- Rubric-based scoring: Longer time per evaluation.
  Comparative judgment: Shorter time per evaluation.
- Rubric-based scoring: Requires fewer evaluations per response.
  Comparative judgment: Requires more evaluations per response.

Potential Comparative Judgment Advantages
- Eliminating certain scorer biases / increased validity
- Faster time per evaluation
- Reduced cognitive demand
- Minimal training, qualification, and monitoring
- Reduced costs
Research is needed to test these potential advantages.

Potential Applications in Scoring

Field test scoring (few responses to a large number of prompts):
- Rubric scoring: many lengthy trainings, shorter overall evaluation time.
- Comparative judgment: many brief trainings, longer overall evaluation time; possibly more efficient overall.

Educator scoring (educators get buy-in and professional development):
- Rubric scoring: fewer teachers in lengthy trainings; lower overall productivity, narrow PD reach.
- Comparative judgment: more teachers in brief trainings; greater overall productivity, expanded PD reach.

Research Questions
1. How closely do comparative judgment measures correspond to rubric scores?
2. Do comparative judgments take less time than rubric scoring decisions?
3. How do comparative judgment measures and rubric scores compare in terms of validity coefficients?
4. How is the reliability of comparative judgment measures associated with the number of judgments per essay response?

Method: Essay Prompts
- Two essay prompts from online administrations of a high school achievement testing program in a large state
- 4-point holistic rubric scoring, at least two scores per response, exact agreement required
- Samples of 200 responses for each prompt

Prompt   Exact Agmt.   Adj. Agmt.   r     Rubric score distribution (1 / 2 / 3 / 4)
1        70%           29%          .81   25% / 40% / 25% / 10%
2        69%           30%          .85   25% / 40% / 25% / 10%

Method: Participants
- 4 judges for Prompt 1; 5 judges for Prompt 2
- All with secondary English teaching experience
- No professional scorers, to avoid interference between methods of evaluating student responses

Method: Training
- Conducted via web conference by an experienced scoring trainer
- Judges learned rubric criteria (focus, organization, development, etc.), but the rubric was never shown
- Judges practiced making comparative judgments on "anchor pairs" involving "anchor papers" used in rubric-based training
- Qualification test accuracy ranged from 11 to 15 out of 15
- Training durations were 3 and 3.75 hours

Method: Statistical Model
- Multivariate generalization of the Bradley-Terry model (Bradley & Terry, 1952)
- μ_A is the latent location of response A on a continuum of writing quality
- When μ_B < μ_A, "Prefer A" is the most probable judgment
- When μ_B > μ_A, "Prefer B" is the most probable judgment
- "Options equal" is never the most probable judgment
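For reference, the core two-alternative Bradley-Terry form that this model generalizes can be written as follows; the multivariate version used in the study also models the "options equal" category, which is not shown here:

\[
P(\text{Prefer A}) \;=\; \frac{\exp(\mu_A - \mu_B)}{1 + \exp(\mu_A - \mu_B)}
\]

so "Prefer A" is the more probable judgment whenever \(\mu_A > \mu_B\), consistent with the statements above.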

Method: Pairing Responses
Note: the most information about a response's latent location is obtained by comparing it to another response of similar quality.
- The Generalized Grading Model (GGM) provided a predicted score for each response on the 1-4 rubric scale (based on text complexity, coherence, length, spelling, and vocabulary).
- Each response was paired with 16 other responses (with the same or adjacent predicted score).
- 2 anchor papers
- 2,000 judgments per prompt
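A minimal sketch of the pairing rule described above, assuming predicted scores on the 1-4 scale are already available. The function name and inputs are illustrative; this is not the study's code, and it does not reproduce the anchor-paper pairings or the exact balancing used operationally.

```python
import random
from collections import defaultdict

def make_pairs(predicted_scores, pairs_per_response=16, seed=0):
    """Pair each response with others whose predicted rubric score (1-4)
    is the same or adjacent. `predicted_scores` maps response_id -> predicted
    score (e.g., the output of an automated scoring model)."""
    rng = random.Random(seed)
    by_score = defaultdict(list)
    for rid, score in predicted_scores.items():
        by_score[score].append(rid)

    pairs = []
    for rid, score in predicted_scores.items():
        # Candidates share the same or an adjacent predicted score.
        candidates = [other for s in (score - 1, score, score + 1)
                      for other in by_score[s] if other != rid]
        for other in rng.sample(candidates, min(pairs_per_response, len(candidates))):
            pairs.append((rid, other))
    return pairs
```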

Method: Data Collection
Responses were "chained" so that a judge only read one new response per judgment:
A vs. B → B vs. C → C vs. D → D vs. E
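A small sketch of the chaining idea, assuming responses are presented to a judge in a fixed order (illustrative only):

```python
def chain_judgments(response_ids):
    """Build a judging chain (A vs. B, B vs. C, C vs. D, ...) so that each
    new judgment reuses the response the judge has just read."""
    return [(response_ids[i], response_ids[i + 1])
            for i in range(len(response_ids) - 1)]

# Example: chain_judgments(["A", "B", "C", "D", "E"])
# -> [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
```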

Results: Parameter Estimation
The scale was anchored by the anchor paper scores, so most measures fall between 1.0 and 4.0.

Results: Correspondence

Measure            Prompt 1 (Rubric / Rounded CJ)   Prompt 2 (Rubric / Rounded CJ)
Mean               2.20 / 2.40                      2.20 / 2.21
Std. deviation     0.93 / 0.97                      0.93 / 0.98
Exact agreement    60.0%                            64.0%
Adj. agreement     38.5%                            33.5%
Correlation        .78                              .76

- 60.0% exact agreement between rubric scores and rounded comparative judgment scores on Prompt 1
- Slight tendency for comparative judgment to overestimate on Prompt 1
- Better agreement overall on Prompt 2
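A minimal sketch of how agreement statistics like these can be computed from two score vectors on the 1-4 scale; the function and inputs are assumptions for illustration, not the study's analysis code.

```python
import numpy as np

def agreement_stats(rubric_scores, rounded_cj_scores):
    """Exact agreement, adjacent agreement, and correlation between rubric
    scores and rounded comparative-judgment scores."""
    rubric = np.asarray(rubric_scores)
    cj = np.asarray(rounded_cj_scores)
    diff = np.abs(rubric - cj)
    return {
        "exact": float(np.mean(diff == 0)),
        "adjacent": float(np.mean(diff == 1)),
        "correlation": float(np.corrcoef(rubric, cj)[0, 1]),
    }
```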

Results: Judgment Time

                 Prompt 1   Prompt 2   Both
Mean (Rubric)    121.2 s    116.4 s    119.4 s
Mean (CJ)        116.7 s     70.45 s    93.5 s
Median (CJ)       83.0 s     45.0 s     62.0 s

- Some huge outliers in these data (e.g., 2,760 seconds)
- Medians likely provide better measures of central tendency

Results: Validity Coefficients
Correlations with a multiple-choice writing test:
- Rubric score: .63, .69
- Continuous comparative judgment measure: .67, .72
- Rounded comparative judgment measure: .66, .71

Results: Reliability
Reliability = consistency in judgments about the quality of a response relative to other responses.
- In this context, "reliability" reflects judge behavior and is therefore akin to inter-rater reliability.
- High reliability translates into greater precision in estimating the perceived relative quality of responses.
- Reliability does not reflect correspondence between estimated scores and "true" scores; studying this would require multiple responses from each student.
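The slides do not give a formula, but comparative-judgment studies commonly report a scale-separation reliability of roughly the following form; this is an assumption about the computation, not a statement of the authors' exact method:

\[
R \;=\; \frac{\operatorname{Var}(\hat{\mu}) - \overline{SE^{2}}}{\operatorname{Var}(\hat{\mu})}
\]

where \(\operatorname{Var}(\hat{\mu})\) is the variance of the estimated response measures and \(\overline{SE^{2}}\) is the mean squared standard error of those estimates.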

Results: Reliability
Procedure: remove random samples of judgments, refit the model, and recalculate reliability.
Reliability drops below .80 with a 50% reduction (~9 judgments per response).

A Note on Number of Judgments
TRUE or FALSE: If you have 200 responses and you want reliability of .80, you need about 200 × 9 = 1,800 judgments.
FALSE: A judgment provides information about 2 responses, so you would need about 900 judgments (or 4.5 judgments per unique response).
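The arithmetic behind that answer:

\[
\frac{200 \times 9}{2} = 900 \ \text{judgments}, \qquad \frac{900}{200} = 4.5 \ \text{judgments per unique response}.
\]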

Conclusions
- Scores from comparative judgment correspond to rubric scores at a rate similar to that observed between two scorers (60-70% exact agreement; Ferrara & DeMauro, 2006).
- Comparative judgment measures appear to have higher validity coefficients than rubric scores.
- With 3-4 hours of comparative judgment training, judges can consistently judge the relative quality of responses, as reflected by high reliability coefficients.
- Time per comparative judgment appears to be less than time per rubric score.

Future Research
- Agreement might be improved by refining the pairing process
- Potentially improve accuracy and efficiency by implementing adaptive comparative judgment (Pollitt, 2012):
  - Initial pairings are random
  - Subsequent pairings are based on preliminary score estimates
- Pilot rangefinding study
- Data-free form assembly and equating

Pilot Rangefinding Results
Six panelists made 106 judgments about 15 responses in 16 minutes (with reliability = .97).
[Figure: responses ordered along the comparative-judgment scale, grouped as 1s through 5s]

Data-Free Forms Assembly and Equating
Field testing (especially embedded field testing) is useful for estimating item difficulties for forms assembly and/or pre-equating.
Problems with field testing:
- It is not permitted or valued in some countries
- There is backlash against it in the U.S. (i.e., using kids as unpaid laborers)
- Test security may be compromised because performance tasks and essays are highly memorable
- Examinees may not be motivated

Which of these items is more difficult?

Item A: What single transformation is shown below?
- Reflection
- Rotation
- Translation
- No single transformation is shown.

Item B: The masses of two gorillas are given below. A female gorilla has a mass of 85,000 grams. A male gorilla has a mass of 220 kilograms. What is the difference between these two masses in grams?
- 135,000 g
- 84,780 g
- 63,000 g
- 305,000 g

(Items from http://tea.texas.gov/Student_Testing_and_Accountability/Testing/State_of_Texas_Assessments_of_Academic_Readiness_(STAAR)/STAAR_Released_Test_Questions/)

Data-Free Forms Assembly and Equating
To the extent that such judgments are accurate, comparative judgment can be used to put items (from different test forms) on a common scale of perceived item difficulty. Those measures could be used for:
- Developing test forms of similar difficulty
- Equating test forms (with no common items or persons)

Example Equating Process
1. Calibrate Form X (prior administration)
2. Calibrate Form Y (current administration)
3. Compare a sample of Form Y items to a sample of Form X "equating" items to calculate an equating constant
4. Apply the constant to all of Form Y
5. Locate the Form X performance standard on Form Y
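A minimal sketch of steps 3-5, assuming both forms have already been calibrated and using a simple mean-shift equating constant. The function name, inputs, and the mean-shift choice are illustrative assumptions, not the study's documented procedure.

```python
import numpy as np

def equate_form_y_to_x(form_y_measures, form_x_measures, equating_item_ids):
    """Shift Form Y item-difficulty measures onto the Form X scale using a
    mean-shift constant computed from the 'equating' items that have
    measures on both forms' scales."""
    y_anchor = np.array([form_y_measures[i] for i in equating_item_ids])
    x_anchor = np.array([form_x_measures[i] for i in equating_item_ids])
    constant = float((x_anchor - y_anchor).mean())   # equating constant (step 3)
    # Apply the constant to every Form Y item (step 4); the Form X performance
    # standard can then be located directly on the shifted Form Y scale (step 5).
    return {item: m + constant for item, m in form_y_measures.items()}
```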

Data-Free Forms Assembly and Equating
- Prior research has demonstrated that comparative judgment measures can be highly correlated with empirical item difficulties (e.g., Heldsinger & Humphry, 2014).
- Our study will focus on the accuracy of the comparative judgment measures and the subsequent accuracy of raw-to-theta pre-equating tables, the equating of performance standards across forms, and inferences about the relative difficulty of test forms.

THANK YOU!
Center for Next Generation Learning and Assessment
Research and Innovation Network
jeffrey.steedle@pearson.com
steve.ferrara@pearson.com

References
AERA, APA, & NCME. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Attali, Y. (2014). A ranking method for evaluating constructed responses. Educational and Psychological Measurement, Online First, 1-14.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: The method of paired comparisons. Biometrika, 39, 324-345.
Bramley, T., Bell, J. F., & Pollitt, A. (1998). Assessing changes in standards over time using Thurstone paired comparisons. Education Research and Perspectives, 25(2), 1-24.
Curcin, M., Black, B., & Bramley, T. (2009). Standard maintaining by expert judgment on multiple-choice tests: A new use for the rank-ordering method. Paper presented at the British Educational Research Association Annual Conference, Manchester.
Elliot, S., Ferrara, S., Fisher, T., Klein, S., Pitoniak, M., & Steedle, J. (2010). Developing the EdSteps continuum. Washington, DC: Council of Chief State School Officers.
Ferrara, S., & DeMauro, G. E. (2006). Standardized assessment of individual achievement in K-12. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 579-621). Westport, CT: Praeger.
Gill, T., & Bramley, T. (2008). How accurate are examiners' judgments of script quality? An investigation of absolute and relative judgments in two units, one with a wide and one with a narrow 'zone of uncertainty'. Paper presented at the British Educational Research Association Annual Conference, Edinburgh, Scotland.
Heldsinger, S., & Humphry, S. (2010). Using the method of pairwise comparison to obtain reliable teacher assessments. The Australian Educational Researcher, 37(2), 1-19.
Heldsinger, S., & Humphry, S. (2014). Maintaining consistent metrics in standard setting. Murdoch, Western Australia: Murdoch University.
Kimbell, R., Wheeler, T., Stables, K., Shepard, T., Martin, F., Davies, D., . . . Whitehouse, G. (2009). E-scape portfolio assessment: Phase 3 report. London: Technology Education Research Unit, Goldsmiths College, University of London.
Pollitt, A. (2004). Let's stop marking exams. Paper presented at the IAEA Conference, Philadelphia, PA.
Pollitt, A. (2012). The method of adaptive comparative judgement. Assessment in Education: Principles, Policy & Practice, 19(3), 281-300.
Shah, N. B., Balakrishnan, S., Bradley, J., Parekh, A., Ramchandran, K., & Wainwright, M. (2014). When is it better to compare than to score? arXiv. http://arxiv.org/abs/1406.6618
Stewart, N., Brown, G. D. A., & Chater, N. (2005). Absolute identification by relative judgment. Psychological Review, 112(4), 881-911.
Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273-286.
Walker, M. E., Dorans, N. J., Kim, S., Vafis, G., & Fecko-Curtis, E. (2005). Alternative methods for obtaining item difficulty information. Paper presented at the Annual Meeting of the American Educational Research Association, Montreal, Canada.
Whitehouse, C., & Pollitt, A. (2012). Using adaptive comparative judgement to obtain a highly reliable rank order in summative assessment. Manchester: The Assessment and Qualifications Alliance.
Wolfe, E. W., & McVay, A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31(3), 31-37.
Zahner, D., & Steedle, J. T. (2014). Evaluating performance task scoring comparability in an international testing program. Paper presented at the National Council on Measurement in Education Annual Meeting, Philadelphia, PA.