Presentation Transcript


Issues in Comparability of Test Scores Across States

Liru Zhang, Delaware DOE
Shudong Wang, NWEA
Presented at the 2014 CCSSO NCSA
New Orleans, LA, June 25-27, 2014

Performance-Based Assessments

The educational reform movements over the past decades have promoted the use of performance tasks in K–12 assessments, which mirror classroom instruction and provide authentic information about what students know and are able to do.

The next generation assessments based on the Common Core State Standards are designed to address the important 21st Century competencies of mastering and applying core academic content and cognitive strategies related to complex thinking, communication, and problem solving.


Advantages and Limitations

Advantages of performance-based assessments:
- Provide direct measures of both process and product
- Promote higher-level thinking and problem-solving skills
- Generate more information about what students know and can do
- Motivate student learning and improve classroom instruction

Limitations of performance-based assessments:
- Low reliability and inconsistency across occasions over time
- Limited generalizability across performance tasks
- Subjectivity and error in scoring

Rater Errors in Scoring

One of the essential features of performance-based assessments is that they fundamentally depend on the quality of professional judgment from raters (Engelhard, 1994). Raters are human, and they are therefore subject to rating errors (Guilford, 1936).

- Severity or leniency is the tendency of raters to consistently score lower or higher than warranted by student performance.
- Halo effect is the tendency of raters to fail to distinguish between aspects or specific dimensions of a task that are conceptually distinct.
- Central tendency occurs when ratings cluster about the midpoint of the rating scale.
- Restriction of score range occurs when ratings cannot discriminate among levels of student performance.
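The four error types above can be screened with simple descriptive statistics before any model-based analysis. The sketch below is illustrative only (it is not from the presentation) and assumes a hypothetical long-format table of ratings paired with expert "validity" scores; the column names are assumptions made for illustration.

```python
# Minimal sketch: descriptive flags for severity/leniency, central tendency,
# restriction of range, and halo, from a hypothetical table with columns:
# rater, response_id, trait, rating, validity_score.
import numpy as np
import pandas as pd

def rater_error_flags(df: pd.DataFrame) -> pd.DataFrame:
    scale_sd = df["validity_score"].std()  # spread of expert scores as a baseline
    rows = []
    for rater, g in df.groupby("rater"):
        bias = (g["rating"] - g["validity_score"]).mean()  # > 0 lenient, < 0 severe
        spread_ratio = g["rating"].std() / scale_sd        # small -> central tendency /
                                                           # restriction of score range
        # Halo: average correlation among trait scores given by this rater
        wide = g.pivot_table(index="response_id", columns="trait", values="rating")
        if wide.shape[1] > 1:
            corr = wide.corr().to_numpy()
            halo = corr[np.triu_indices_from(corr, k=1)].mean()  # high -> possible halo
        else:
            halo = np.nan
        rows.append({"rater": rater, "severity_leniency": bias,
                     "spread_ratio": spread_ratio, "mean_trait_corr": halo})
    return pd.DataFrame(rows)
```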

Rater Effects and Rater Drift

A variety of factors that can affect the scoring behavior of raters have been identified, such as the characteristics of raters, quality of training and preparation, time pressure, productivity, and scoring conditions.

Operationally, rater effects are far from identical; rather, they are inconsistent from task to task, within a scoring event, across occasions, and over time.

Rater effects may result in item-parameter drift from the original values, which not only introduces bias into the estimates of student ability but also threatens the validity of the underlying test construct. In an extreme case, construct shift could cause scale drift, particularly for a vertical scale.


Comparability of Test Scores

In practice, rater effects or rater drift could be more serious when variation is present in the scoring process from state to state.

The disparities across states may involve rater quality (e.g., source of qualified raters, recruiting, screening, training), scoring method (e.g., human, automated, or a combination), scoring design (e.g., number of raters, scoring rules, monitoring system), and scoring conditions (e.g., on-site or distributed, timeline, workload). These differences may cause rater effects that vary across states.

Statistically, shifts in student motivation, changes in the item-assignment mode, and the lack of post-adjustment may contribute to item-parameter drift from the field-test results.

These sources of construct-irrelevant variability create challenges to the comparability of test scores across states.

Validity of Scoring

In validating test scores based on student responses, it is important to document that the raters' cognitive processes are consistent with the intended construct being measured (Standards, 1999). In principle, constant monitoring of the scoring process is necessary whether human scoring or automated scoring is applied, since automated systems are often "trained" on human scores (McClellan, 2010; Bejar, 2012).

In an effort to improve performance assessments, psychometric models and statistical methods have been proposed for analyzing empirical data to detect rater effects and monitor the scoring process.


Purpose of the Presentation (1 of 2)

With the implementation of the next generation assessments, the challenge that each consortium will encounter is how to monitor a scoring process that is operated separately by states, prevent potential rater effects and rater drift across occasions, and retain the comparability of test scores across states, particularly for performance tasks.

Are the commonly used procedures (e.g., second rater and read-behind) and criteria for evaluating rating quality (e.g., inter-rater agreement, agreement with validity papers) sufficient for the challenge?
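For concreteness, the agreement criteria mentioned here can be computed as below. This is a minimal sketch, not part of the presentation, assuming integer ratings on a common scale (e.g., two raters, or a rater versus a validity paper).

```python
# Minimal sketch of common rating-quality criteria: exact agreement,
# adjacent agreement, and quadratically weighted kappa.
import numpy as np

def agreement_stats(r1, r2, n_categories):
    r1, r2 = np.asarray(r1), np.asarray(r2)
    exact = np.mean(r1 == r2)                 # identical scores
    adjacent = np.mean(np.abs(r1 - r2) <= 1)  # within one score point
    # Quadratic weighted kappa
    K = n_categories
    observed = np.zeros((K, K))
    for a, b in zip(r1, r2):
        observed[a, b] += 1
    observed /= observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    i, j = np.meshgrid(np.arange(K), np.arange(K), indexing="ij")
    weights = (i - j) ** 2 / (K - 1) ** 2
    kappa = 1 - (weights * observed).sum() / (weights * expected).sum()
    return {"exact": exact, "adjacent": adjacent, "qwk": kappa}

# Example on a hypothetical 0-5 scale:
# agreement_stats([3, 2, 5, 0], [3, 3, 4, 1], n_categories=6)
```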


Purpose of the Presentation (2 of 2)

How can we monitor the scoring process in practice?

How can we detect the impact of rater drift on student test scores across states?

Issues discussed in this presentation are in the context of score comparability.


Monitor the Scoring Process (1 of 2)

With advancements in technology, automated scoring, with its obvious potential (e.g., efficiency, consistency, and robustness to internal and external influences), provides a new tool for scoring and a supplement to human scoring. To capitalize on the strengths and narrow the limitations of both human scoring and automated engines, procedures that combine the two have been investigated as a mechanism for monitoring the process of scoring performance tasks.


Monitor the Scoring Process (2 of 2)

The most extensively used procedure in recent years is to monitor human raters with an automated scoring engine, as reported by Pacific Metrics (Lottridge, Schulz, & Mitzel, 2012), Pearson (Shin, Wolfe, Wilson, & Foltz, 2014; Kieftenbeld & Barrentt, 2014), and ETS (Yoon, Chen, & Zechner, 2014).

This procedure could be adapted to monitor automated scoring with expert raters (random read-behind) and to support a combination or weighted combination of human and automated scoring.
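One way to picture this monitoring mechanism is a simple discrepancy check between the human score and the engine score, routing large disagreements to an expert read-behind and otherwise reporting a (possibly weighted) combination. The sketch below is illustrative only; the weighting scheme and tolerance are assumptions, not the procedure used by any of the vendors cited above.

```python
# Minimal sketch: combine a human score and an automated-engine score,
# flagging responses whose discrepancy exceeds a tolerance for expert review.
from typing import Optional, Tuple

def monitored_score(human: int,
                    engine: float,
                    tolerance: int = 1,
                    human_weight: float = 0.5) -> Tuple[Optional[float], bool]:
    """Return (reported_score, needs_expert_review).

    If human and engine disagree by more than `tolerance` score points,
    no score is reported yet and the response is routed to read-behind.
    Otherwise the reported score is a weighted combination of the two.
    """
    if abs(human - engine) > tolerance:
        return None, True                     # route to expert read-behind
    combined = human_weight * human + (1 - human_weight) * engine
    return round(combined, 2), False

# Example: monitored_score(human=4, engine=2.6) -> (None, True)
#          monitored_score(human=3, engine=3.4) -> (3.2, False)
```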


Considerations in Monitoring Process (1 of 3)

Prior to operational scoring, much preparation must take place. For instance, the design of the monitoring process, including controls for rater effects, is critical not only for the quality of ratings but also for achieving the comparability of scores.

Benchmark Papers must be selected from the expert-scored responses from the field test. They should represent the performance of the target population along the scoring scale, with at least 250–300 responses per score point. The same benchmark papers are used for training the automated engine in the monitoring process or for the actual scoring.
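A benchmark set like the one described here amounts to stratified sampling of expert-scored field-test responses by score point. The sketch below is illustrative only; the column names ("response_id", "expert_score") and the default target of 300 papers per point are assumptions for illustration.

```python
# Minimal sketch: assemble a benchmark package with up to `per_point`
# expert-scored responses for each score point on the scale.
import pandas as pd

def sample_benchmark_papers(field_test: pd.DataFrame,
                            per_point: int = 300,
                            seed: int = 2014) -> pd.DataFrame:
    samples = []
    for score, group in field_test.groupby("expert_score"):
        n = min(per_point, len(group))   # take every paper if fewer are available
        samples.append(group.sample(n=n, random_state=seed))
    benchmark = pd.concat(samples).reset_index(drop=True)
    return benchmark[["response_id", "expert_score"]]
```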

Considerations in Monitoring Process (2 of 3)

Rater Training must achieve two goals in addition to regular rater training. One is to help raters build a "mental construct" of student proficiency across states for a certain subject, at a given grade, and on specific topic(s). During the scoring process, frequent retraining should be offered based on issues uncovered through the constant monitoring process.

Standardized Scoring Conditions are essential for the comparability of test scores. Such conditions can be grouped into four categories, which are:

Considerations for Monitoring Process (3 of 3)

- Rater (e.g., criteria for selection, qualification, training)
- Scoring method (e.g., use human scoring with automated scoring to monitor the process)
- Scoring design (e.g., number of raters; random distribution or first-in, first-served; one rater per student, or one rater per item or per trait in analytic scoring; and use of the same package of benchmark papers)
- Monitoring system (e.g., design and functions)
- Scoring environment (e.g., on-site or distributed)

Although identical scoring conditions may not be easily achieved due to state policy, budget, schedule, and availability, certain essential conditions must be considered and implemented.
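One way to make these conditions actionable is to record them as a shared, machine-checkable configuration that each state fills in and compares against a consortium baseline. The sketch below is purely illustrative; every field name and the choice of "essential" conditions are assumptions, not part of the presentation.

```python
# Minimal sketch: a configuration record of standardized scoring conditions
# and a check for departures from a consortium baseline.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ScoringConditions:
    state: str
    rater_qualification: str            # e.g., "certified; passed calibration set"
    scoring_method: str                 # "human", "automated", or "hybrid"
    raters_per_response: int            # scoring design: number of raters
    distribution_rule: str              # "random" or "first-in-first-served"
    analytic_scoring: bool              # one rater per trait vs. per student
    benchmark_package_id: str           # same package of benchmark papers
    monitoring_functions: List[str] = field(default_factory=list)
    environment: str = "distributed"    # "on-site" or "distributed"

def differs_from_baseline(cond: ScoringConditions,
                          baseline: ScoringConditions) -> List[str]:
    """Return the essential conditions that differ from the baseline."""
    essential = ["scoring_method", "raters_per_response",
                 "distribution_rule", "benchmark_package_id"]
    return [f for f in essential if getattr(cond, f) != getattr(baseline, f)]
```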

Many-Facets Measurement Model

The Rasch Measurement Model generalized by Linacre (1989) provides a framework for examining the psychometric quality of professional judgments on constructed responses by students on performance tasks.

The model may include many facets, such as rater severity, item difficulty, and student ability. The Facets model is a unidimensional model with a collection of many facets as independent variables and a single student competence parameter as the dependent variable.

Engelhard (1994) demonstrated procedures for detecting four general categories of rater errors with Facets for a large-scale writing assessment. He concluded that this model offers a promising approach: it is likely to discover most rater errors and minimize their potential effects.


Detect Rater Effects by Facets (1 of 3)

Facets is commonly used to detect rater effects in performance-based assessments. The output of Facets provides detailed information about rater behavior.

Because the rater facet is centered at 0, positive or negative values indicate the presence of rater effects. The degree of rater severity or leniency can be determined by how far the mean of a rater's ratings is from 0. In addition, the reliability of the separation index provides evidence of whether raters systematically differ in severity or leniency in scoring.

Rater severity, central tendency, and halo effect could all lead to restriction of score range, and the halo effect may also cause central tendency and/or restriction of score range.
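The separation reliability referenced here is the proportion of observed variance in rater severity measures that is not attributable to measurement error. The sketch below is illustrative (it is not Facets output) and assumes rater measures in logits with their standard errors, such as those a many-facets Rasch analysis would produce.

```python
# Minimal sketch: reliability of the rater separation index.
import numpy as np

def separation_reliability(measures, std_errors):
    measures = np.asarray(measures, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    observed_var = measures.var(ddof=1)     # variance of rater severity measures
    error_var = np.mean(std_errors ** 2)    # mean squared standard error
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var          # near 1 -> raters reliably differ in severity

# Example with five hypothetical raters (made-up values):
# separation_reliability([-0.42, -0.10, 0.05, 0.18, 0.36],
#                        [0.08, 0.07, 0.09, 0.08, 0.07])
```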

Detect Rater Effects by Facets (2 of 3)

Observed positively or negatively skewed frequency distributions of test scores could be an indication of rater errors. Rasch fit statistics, Infit and Outfit, can be used to detect the halo effect.

The Infit statistic, with an expectation of 1.0, measures the degree of intra-rater consistency. Infit < 0.6 indicates too little variation and overuse of inner scale categories, such as 2 and 3 on a 0–5 scale, while Infit > 1.5 indicates excess variation and overuse of outer scale categories. The Outfit statistic, also with an expectation of 1.0, measures the same thing as Infit but is more sensitive to outliers.
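For readers unfamiliar with these statistics, the sketch below shows how Infit and Outfit mean squares are typically computed for one rater from observed ratings, model-expected ratings, and model variances; it is illustrative only and assumes these quantities come from an already-estimated many-facets Rasch model rather than from the Facets program itself.

```python
# Minimal sketch: Infit (information-weighted) and Outfit (unweighted)
# mean-square fit statistics for a single rater.
import numpy as np

def infit_outfit(observed, expected, variance):
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)
    residual_sq = (observed - expected) ** 2
    outfit = np.mean(residual_sq / variance)    # unweighted mean square: outlier-sensitive
    infit = residual_sq.sum() / variance.sum()  # information-weighted mean square
    return infit, outfit

# Interpretation from the slide: values near 1.0 are expected;
# Infit < 0.6 suggests too little variation, Infit > 1.5 suggests excess variation.
```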

Detect Rater Effects by Facets (3 of 3)

The halo effect usually appears when analytic judgments are influenced by the rater's overall impression of student performance. With a halo effect, the correlations of ratings on different dimensions are statistically exaggerated, while the variance components are diminished. The presence of a halo effect may alter the construct and minimize the opportunity for students to demonstrate their strengths and weaknesses. Using a single rater to score all traits of a given response may introduce bias due to the halo effect.


A Facets Model for Cross-State Consistency (1 of 3)

The Facets model below, based on the Rasch Measurement Model by Linacre (1989), is proposed for detecting the consistency/inconsistency in scoring across states. It defines the probability of a person n with ability θ_n receiving a rating x in category k, with threshold τ_k, on item i with item difficulty δ_i, scored by a rater r with scoring severity λ_r, under scoring consistency γ_s in state s:

log(P_nirsk / P_nirs(k−1)) = θ_n − δ_i − λ_r − γ_s − τ_k

Where:

- P_nirsk is the probability of person n on essay/item i, scored by rater r, receiving a rating in category k under state consistency γ_s
- θ_n is the ability parameter for person n
- δ_i is the item difficulty parameter for item or essay i
- γ_s is the consistency parameter for state s
- λ_r is the severity parameter for rater r
- τ_k is the step difficulty parameter on a rating scale of k categories
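To show what the model implies numerically, the sketch below computes the probability of each rating category under the standard adjacent-category form of the equation above. The Greek letters follow the reconstruction above (θ ability, δ item difficulty, λ rater severity, γ state consistency, τ thresholds), and all parameter values in the example are made up for illustration; this is not the Facets program.

```python
# Minimal sketch: category probabilities implied by
# log(P_k / P_{k-1}) = theta - delta - lambda - gamma - tau_k.
import numpy as np

def category_probs(theta, delta, lam, gamma, taus):
    """Return P(rating = 0..m) given ability, item difficulty, rater severity,
    state consistency, and thresholds tau_1..tau_m (tau_0 is fixed at 0)."""
    taus = np.asarray(taus, dtype=float)
    step_logits = theta - delta - lam - gamma - taus              # one logit per step k = 1..m
    cumulative = np.concatenate(([0.0], np.cumsum(step_logits)))  # numerator exponents
    probs = np.exp(cumulative - cumulative.max())                 # stabilize before normalizing
    return probs / probs.sum()

# Illustrative values on a 0-5 scale: an average item, a slightly severe rater,
# and a state whose scoring runs 0.2 logits harsher than the consortium center.
p = category_probs(theta=0.5, delta=0.0, lam=0.3, gamma=0.2,
                   taus=[-1.6, -0.8, 0.0, 0.8, 1.6])
print(p.round(3))   # probability of each score point 0..5
```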

A Facets Model for Cross-State Consistency (2 of 3)

The proposed model can be used to identify and evaluate the consistency in scoring constructed responses and essays across states. For each element, such as student, performance task, rater, and rating consistency across states, Facets provides a measure, its standard error, and fit statistics for aberrant observations, and quantifies the functioning of each.

Through the analysis, the probability of obtaining a rating category for an item (e.g., a score point of 2 on a 0–5 scale) is estimated as a function of student ability, item difficulty, rater severity, and scoring consistency in a state.

The proposed model creates an opportunity to put student ability, rater performance, and cross-state consistency on the same scale for comparison.

A Facets Model for Cross-State Consistency (3 of 3)

It is important to note that simulation studies and empirical analyses are needed to validate the proposed model and examine its potential for identifying and quantifying the consistency in scoring across states. For evaluation purposes, a threshold should be determined.

In extreme cases, the results from the cross-state consistency analyses could be used to adjust student scores.

A Final Note

Among many factors, a quality system and constant monitoring of the scoring process are necessary, and standardized scoring conditions are essential for the validity of scoring performance tasks and for enhancing the comparability of test scores across states.

The proposed model can be used to quantify rater effects by state, examine the potential impact of variations in scoring on student performance, and identify the consistency in scoring across states. Thus, the model can be used to validate the scoring process. For evaluation purposes, a threshold should be determined.

In reality, unidentified or hard-to-quantify factors, such as the availability of qualified raters, the possible adaptive selection of performance tasks, and state policy (e.g., budget, schedule), may introduce additional challenges to scoring.

Thanks!!

We would like to hear from you. Please contact us if you have comments, suggestions, and/or questions regarding the presentation, particularly about the notion of the comparability of test scores across states and the proposed Cross-State Facets Model.

Liru Zhang: liru.zhang@doe.k12.de.us

Shudong Wang: shudong.wang@NWEA.org