
Slide1

Chapter 3

Getting It Right Every Time: Reliability

Slide2

Outline

Reliability

Definition

Test score theory

True score

Error

Sources of error

Test Reliability

Test-retest

Parallel forms

Single administration methods

Split-half

Internal consistency

Coefficient alpha

KR 20

Using the relevant measure of internal consistency

Slide3

Outline

Inter-scorer reliability

Standard error of measurement

Reliability and nature of sample

Heterogeneous samples

Homogeneous samples

Restriction of range

Increasing the reliability of a test

Slide4

Suppose you want to know how much you weigh. You have an inexpensive bathroom scale and never get the same weight twice. Solution: weigh yourself five times and take the mean.

Slide5

Reliability

Your bathroom scale is not a perfect measurement device.

Variation in readings is due to measurement error.

Psychological tests are imperfect measurement devices.

Always some error associated with a psychological test.

Psychologists approach problem of error in same way we approached bathroom scale problem.

Slide6

Psychologists assume there is a true level of the characteristic we are trying to measure. If the test were taken many times, without memory, practice, or fatigue effects, the scores would cluster around the true score. The mean would be the best estimate of the individual's true score.

Slide7

Reliability

Take bathroom scale analogy even further.

Similarity among observed weights reflects scale’s reliability.

The more consistent measurement devices are, the higher their reliability.

Why is this important?

We cannot administer a test many times to determine a person's true score.

So much rides on a test score.

Slide8

Charles Spearman is the father of test score theory. Suggested observed test score (X) has two components: True score (T), and Error component (E). Represented by the following formula:

X = T + E

Slide9

Test Score Theory

True score (T)

Factors that lead to consistency in measurement of characteristics in question.

Include:

Individual’s true level of the characteristic, and

Situational variables.

Think about how anxiety scores might vary during the course of the semester.

Slide10

Test Score Theory

Error component (E)

Factors that contribute to inconsistency in measurement.

Three reasons for error in measurement:

People change from T1 to T2

Differences in tasks from T1 to T2

Limited sample of behavior

Need indicator of how much error there is in measurement (how reliable a measure is).

Slide11

Test Reliability

There are a number of ways to assess reliability.

We will consider three:

Test-retest (or retesting with the same form)

Parallel forms

Single administration methods

Split-half

Internal consistency

Coefficient alpha

KR 20

Slide12

Test-Retest Reliability: Provides an indication of the stability of test scores over time. Most psychological tests measure relatively stable characteristics or traits. Expect test results to reflect this stability. Low variation in a person's performance over repeated measurements.

Slide13

Suppose you took an intelligence test today. Then took the same exam a few days later. Expect to receive very similar scores on the two tests. Unlikely your intelligence would change drastically over a few days. Yet, there should be some variability in test scores (error).

Slide14

Test-Retest Reliability

Calculation of test-retest reliability is straightforward.

Administer the same test at two different times to:

Same group of people, and

Same conditions.

Yields two scores for each person.

Compute the correlation between these two sets of scores.

This is the test-retest reliability coefficient.
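The computation is easy to sketch in a few lines of Python (a minimal illustration, not from the text; the scores and the use of NumPy are my own assumptions):

```python
import numpy as np

# Hypothetical scores for the same five people tested twice under the same conditions.
time1 = np.array([85, 92, 78, 88, 95])
time2 = np.array([83, 94, 80, 85, 96])

# The test-retest reliability coefficient is the Pearson correlation
# between the two sets of scores.
r_tt = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: {r_tt:.2f}")
```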

Slide15

Warning! Interval between the two test administrations is important. Should be long enough so that memory effects are minimized. Not so long that true score changes have occurred.

Slide16

Test-Retest Reliability

Example

Took achievement test in the morning and then again in the afternoon.

Might remember responses and repeat.

Produces an artificially high estimate of reliability of the test.

Took second test six months later.

Newly learned skills may result in a meaningfully higher score.

Produces an artificially low estimate of the reliability of the test.

Slide17

Test-Retest Reliability

No simple rule.

Decision depends on:

Nature of the test, and

Intended use.

When in doubt check the test manual.

Interval between testing reported in test manual.

Evaluate information in terms of the characteristics being measured and the people being tested.

Slide18

Test-Retest Reliability

How high should the correlation coefficient be?

Obviously, the higher the better.

But difficult to provide specific numbers.

Depends on the type of test and its intended use.

Slide19

Test-Retest Reliability

Example

The Wechsler Adult Intelligence Scale-Revised (WAIS-R) reports a test-retest reliability coefficient of .97.

Scale consists of 11 subscales

Most have reliability coefficients in the high .80s and .90s

Some are in the .60s!

Point?

Look at specific reliability of subtests before the full scale correlation coefficient.

Slide20

Test-Retest Reliability

Why the differences among the subscales?

Different types of intelligence assessed.

Subscales with higher correlation coefficients related to vocabulary.

A relatively stable variable.

Lower correlation coefficients related to short-term memory.

A relatively less stable variable.

Slide21

Advantages of Test-Retest Reliability

To evaluate advantages and disadvantages of test-retest reliability, recall sources of error:

People change from T1 to T2

Differences in tasks from T1 to T2

Limited sample of behavior

Slide22

Advantages of Test-Retest Reliability

Advantage:

Test-retest reliability can provide good estimates of error on:

How people change from T1 to T2

Differences in tasks from T1 to T2

Disadvantage:

Memory and practice effects may come into play.

Cannot evaluate the effects of limited sample of behavior.

Slide23

Alternative Form Reliability

Test-retest reliability cannot evaluate error associated with the sample of tasks that represents a domain of behavior.

Need another procedure.

Procedure developed was alternative or parallel form reliability.

Slide24

It is desirable to have more than one form of a test. Think about the SAT. Take the test in September and research the questions you got wrong. Take it again in December. Scores would improve.

Slide25

Alternative Form Reliability

Problem!

If two forms exist, are they measuring the same thing?

How confident are we they are?

When more than one form exists, it is important that the tests are measuring the same thing!

Content of the various forms of the test must be equivalent.

Solution!

Alternative (parallel) form reliability.

Slide26

Alternative Form Reliability

Calculation of alternative form reliability is straightforward.

Administer two forms of the test at two different times to:

Same group of people, and

Same conditions.

Yields two scores for each person.

Compute the correlation between these two sets of scores.

This is the alternative form reliability coefficient.

Slide27

Alternative Form Reliability

The correlation coefficient reflects the degree to which content is similar on the two forms.

Thus, error variance is said to result from content sampling.

The higher the correlation coefficient, the more confident we are that the content sampled in Form A is equivalent to the content sampled in Form B.

Slide28

Alternative Form Reliability

Evaluating alternative form reliability in a test manual:

Determine if the forms were administered on the same occasion.

For short tests (30 minutes or less):

It is not unreasonable to ask people to take both forms at one sitting.

For long tests (e.g., the SAT):

It is reasonable to wait a week or two before administering the second form of the test.

Slide29

Alternative Form Reliability

Advantage:

Provides an estimate of variability due to a limited sample of behavior.

If testing is separated by an interval:

Alternative form reliability provides an estimate of all three sources of variability:

People change from T1 to T2

Differences in tasks from T1 to T2

Limited sample of behavior

Slide30

Alternative Form Reliability

Disadvantage:

Need two parallel forms of the test.

Need examinees to be available at two different times.

Solution!

Subdivide tests.

Reliability estimates are gleaned from one test divided into two equivalent halves.

Slide31

Split-half Reliability

The majority of psychological tests consist of a single form.

We are still interested in the degree to which content sampling influences reliability.

When a single form exists, the method used is split-half reliability.

Split-half reliability measures internal consistency.

Slide32

Split-half Reliability

To calculate split-half reliability:

Single test administered to a group of people.

Each person’s test is divided into two equal halves.

Each person has two scores.

Split-half reliability is the correlation between these two sets of scores.

Error variance results from content sampling.

Slide33

Split-half Reliability

Problem!

How is the test divided?

Solution!

Odd-even split resulting in two scores:

One for the odd items

One for the even items.

Usually the preferred method.

Items often become progressively more difficult across a test.

An odd-even split ensures the halves are approximately equal in difficulty level.

Slide34

Two exceptions! The first concerns multiple questions on a single topic. All questions concerning the topic should be included in one of the halves. Consider the SAT, where a paragraph is presented and a series of questions is asked based on the paragraph.

Slide35

The second case concerns speed tests. In speed tests, the items are so easy that almost everyone answers every item attempted correctly. The goal is to measure how quickly a person can perform the task.

Slide36

Split-half Reliability

Disadvantage:

Split-half reliability examines the relationship between two tests that are each half as long as the full-length test.

For example, calculate the split-half reliability for a final exam consisting of 100 questions.

The test is divided into odd and even items of 50 items each.

Why is this a problem?

Slide37

Split-half Reliability

Consider the tests:

100 item test scores can range from 0 to 100.

50 item test scores can only range from 0 to 50.

When a range of scores is restricted, the resulting correlation coefficient is lowered.

The 50-item halves underestimate the reliability of the 100-item test.

Slide38

Restriction of range deflates correlation.

Why?


Slide40

Spearman-Brown Correction Formula

Solution!

Spearman-Brown Correction Formula

Estimates what our split-half reliability would be if we could calculate it for the full 100-item test.

r_est = 2r_ob / (1 + r_ob)

where r_ob represents our obtained split-half reliability coefficient, and r_est represents the split-half reliability estimated by the Spearman-Brown formula.

Slide41

Spearman-Brown Correction Formula

Example

Correlation between two 50-item halves of a final exam is .80

Estimated reliability would be:

r_est = 2r_ob / (1 + r_ob)

r_est = 2(.80) / (1 + .80)

r_est = 1.60 / 1.80

r_est = .89

Test authors usually report the “corrected” split-half reliability coefficient in test manuals and journal articles.
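A sketch of the full procedure in Python (mine, not the text's): split a score matrix odd-even, correlate the halves, then apply the Spearman-Brown correction. The final lines verify the worked example above.

```python
import numpy as np

def corrected_split_half(items):
    """items: (people x items) score matrix. Returns (r_ob, r_est)."""
    odd = items[:, 0::2].sum(axis=1)    # scores on items 1, 3, 5, ...
    even = items[:, 1::2].sum(axis=1)   # scores on items 2, 4, 6, ...
    r_ob = np.corrcoef(odd, even)[0, 1]  # obtained half-test correlation
    r_est = (2 * r_ob) / (1 + r_ob)      # Spearman-Brown correction
    return r_ob, r_est

# The worked example: an obtained half-test correlation of .80 corrects to .89.
r_ob = 0.80
print(f"r_est = {(2 * r_ob) / (1 + r_ob):.2f}")  # 0.89
```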

Slide42

Kuder-Richardson Reliability

A second measure of internal consistency.

Provides:

Estimate of error variance resulting from content sampling.

Estimate of error variance resulting from content heterogeneity.

Content heterogeneity:

All items are measuring different traits, abilities, etc.

Content homogeneity:

All items are measuring same traits, abilities, etc.

Slide43

Kuder-Richardson Reliability

Example

Exam contains analytical and factual questions.

Assume the difficulty of the questions is distributed evenly.

Split-half reliability is high.

However, some people excel at analytical questions while others excel at factual questions.

Kuder-Richardson reliability coefficient takes this into account.

Reliability coefficient would be lower than split-half reliability coefficient.

Slide44

Kuder-Richardson Reliability

A low Kuder-Richardson reliability coefficient reflects heterogeneity of item content.

Can be a good or bad thing.

More on this later.

How to calculate the KR-20 reliability estimate?

Slide45

Kuder-Richardson Reliability

The formula for KR20 is:

KR20 = (N / (N - 1)) × ((s² - Σpq) / s²)

Where:

KR20 = reliability estimate

N = number of items on the test

s² = variance of the total test

p = proportion of people answering each item correctly

q = proportion of people answering each item incorrectly

Calculation not difficult BUT time consuming!

Leave calculations to computer.
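Since the slides recommend leaving the arithmetic to a computer, here is a minimal Python sketch of KR20 for a matrix of dichotomous (0/1) item scores (my own illustration; the data are hypothetical):

```python
import numpy as np

def kr20(items):
    """KR-20 for a (people x items) matrix of dichotomous 0/1 scores."""
    n = items.shape[1]                   # N: number of items
    p = items.mean(axis=0)               # proportion answering each item correctly
    q = 1 - p                            # proportion answering each item incorrectly
    s2 = items.sum(axis=1).var(ddof=1)   # s^2: variance of total test scores
    return (n / (n - 1)) * ((s2 - np.sum(p * q)) / s2)

# Hypothetical responses: 5 people, 4 items.
scores = np.array([[1, 1, 1, 0],
                   [1, 0, 1, 1],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 0]])
print(f"KR-20: {kr20(scores):.2f}")
```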

Slide46

Coefficient Alpha

Disadvantage:

Kuder-Richardson reliability can only be used for tests where scoring is dichotomous.

Right or wrong, yes or no, 0 or 1.

Solution:

Need a statistic that can be used with continuous measurement scales:

Coefficient alpha (α).

Slide47

Coefficient Alpha

The formula for coefficient alpha is:

α = (N / (N - 1)) × ((s² - Σs_i²) / s²)

Where:

α = coefficient alpha reliability

N = number of items in the test

s² = variance of the total test score

s_i² = variance of each individual item
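A matching sketch for coefficient alpha (again my own illustration, with hypothetical rating data); the only change from KR20 is swapping Σpq for the sum of the item variances:

```python
import numpy as np

def coefficient_alpha(items):
    """Coefficient alpha for a (people x items) score matrix."""
    n = items.shape[1]                     # N: number of items
    item_vars = items.var(axis=0, ddof=1)  # s^2_i: variance of each item
    s2 = items.sum(axis=1).var(ddof=1)     # s^2: variance of total test scores
    return (n / (n - 1)) * ((s2 - item_vars.sum()) / s2)

# Hypothetical 5-point ratings: 4 people, 3 items.
ratings = np.array([[4, 5, 4],
                    [2, 3, 3],
                    [5, 5, 4],
                    [1, 2, 2]])
print(f"Coefficient alpha: {coefficient_alpha(ratings):.2f}")
```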

Slide48

Coefficient Alpha

Difference between coefficient alpha and KR20:

The sum of the item variances (Σs_i²) is substituted for the sum of the products of the proportions of people answering each item correctly and incorrectly (Σpq).

Why is this important?

The variance of each item can be calculated even when the response is dichotomous.

Coefficient alpha can be used for tests with both response formats.

Slide49

Using the Relevant Measure of Internal Consistency

When should we calculate split-half reliability versus KR20/coefficient alpha?

Split-half reliability reflects the error variance from content sampling.

KR20 and coefficient alpha reflect error variance from content sampling AND content heterogeneity.

Slide50

Using the Relevant Measure of Internal Consistency

If it is desirable to have high split-half reliability but not desirable to have a homogeneous group of items:

Split-half reliability should be calculated.

Example: intelligence.

If it is desirable to have every item measure a single construct (homogeneity of items):

KR20 or coefficient alpha should be calculated.

Example: anxiety.

Slide51

Inter-Scorer Reliability

For many tests, reliability of scoring is not an issue.

Examples include:

Tests where a computer records which alternative was selected.

Personality tests where scorer counts the number of true or false responses.

Point:

Under objective circumstances scoring of the tests will be accurate and reliable.

Slide52

Inter-Scorer Reliability

But what about tests that depend on the judgment of the person doing the scoring?

Examples include:

Essay tests.

Projective tests of personality.

In subjective scoring cases, it is important to establish the inter-scorer reliability of the test.

Slide53

Inter-Scorer Reliability

Calculating inter-scorer reliability (inter-judge reliability) is a straightforward process.

Step 1:

Administer test to a group of people on a single occasion

Step 2:

Two scorers or judges independently score each test.

Step 3:

Correlate the two sets of scores.
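The same correlation machinery applies here; a tiny Python sketch with hypothetical essay scores from two judges (my own illustration):

```python
import numpy as np

# Hypothetical essay scores assigned independently by two judges to six test-takers.
judge_a = np.array([7, 5, 9, 6, 8, 4])
judge_b = np.array([6, 5, 9, 7, 8, 5])

# Inter-scorer reliability is the correlation between the two sets of scores.
r_scorer = np.corrcoef(judge_a, judge_b)[0, 1]
print(f"Inter-scorer reliability: {r_scorer:.2f}")
```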

Slide54

Inter-Scorer Reliability

There will be differences between how the judges score the test.

Perfect agreement = +1.0

Any correlation less than +1.0 indicates error variance.

Judges not rating the test in identical fashion.

Slide55

Inter-Scorer Reliability

How can error variance resulting from scorer differences be reduced?

Offer numerous, clear examples of scoring guidelines in test manual.

Train those who will be doing the scoring.

Slide56

The Standard Error of Measurement

Recall the purpose of calculating reliability coefficients.

It determines the proportion of true variance to total variance in a set of test scores.

Problem:

This is an aggregate measure.

Does not tell us much about the test score received by a particular person.

Slide57

The Standard Error of Measurement

Tests used to make decisions about individuals.

Important to know how error variance might influence the way we interpret an individual's test score.

Need a statistic that tells us effect error variance has on estimate of individual’s true score.

Solution:

Standard error of measurement

Slide58

The Standard Error of Measurement

Calculating the standard error of measurement is straightforward:

σ_meas = SD_x × √(1 - r_xx)

Where:

σ_meas = standard error of measurement

SD_x = standard deviation of the test scores

r_xx = reliability of the test

Slide59

The Standard Error of Measurement

Example

Susan, a fifth-grader, is being considered for a gifted program at her school.

Students must receive a score of 130 on a standardized intelligence test to qualify.

Susan receives a score of 128.

Standard deviation for test is 15

Reliability is .90.

Calculate the standard error of measurement.

Slide60

The Standard Error of Measurement

Calculate the standard error of measurement.

Standard deviation for test is 15

Reliability is .90.

σ_meas = SD_x × √(1 - r_xx)

σ_meas = 15 × √(1 - .90)

σ_meas = 15 × √.10

σ_meas = 15(.32)

σ_meas = 4.8

How do we apply standard error of measurement to Susan’s score?

Slide61

The Standard Error of Measurement

Recall that 68% of the area under the normal curve falls between -1 and +1 standard deviations.

Thus, there is a 68% chance Susan's true intelligence test score falls between 123.2 and 132.8 points.

Slide62

The Standard Error of Measurement

How’s that?

Susan’s score = 128.

Standard error of measurement is 4.8.

The standard error of measurement acts as the standard deviation of the error distribution around Susan's score.

So 1 standard deviation of measurement error is equal to 4.8 units.

Add 4.8 to 128 and subtract 4.8 from 128 to come up with the estimate.

Susan's true score likely falls between 123.2 and 132.8 points.

Slide63

The Standard Error of Measurement

What implication does this have on Susan getting into the gifted program?

Recall a score of 128 does not qualify.

BUT error variance associated with the test is reasonably high.

Susan’s true score ranged between 123.2 and 132.8 points!

Justifies a request for retesting.

Slide64

Reliability and the Nature of the Sample

To this point, we focused on the reliability of tests.

Now we will consider the sample.

Why is the sample important?

Reliability coefficients are dependent on samples.

Slide65

Reliability and the Nature of the Sample

For reliability issues, there are two types of samples:

Heterogeneous samples

Many different types of people comprise the sample.

Homogeneous samples

Many similar types of people comprise the sample.

Slide66

Reliability and the Nature of the Sample

Reliability coefficients based on heterogeneous samples are likely to be higher than coefficients based on homogeneous samples.

Why?

Restricted range of scores!

Slide67

Reliability and the Nature of the Sample

Suppose we calculated test-retest reliability for the math part of the SAT at two different universities. The first is a large urban university. The second is an exclusive liberal arts college (minimum math score: 750).

       Large Urban University      Exclusive Liberal Arts College
       T1        T2                T1        T2
s1     780       750               800       790
s2     740       770               800       790
s3     700       730               800       800
s4     660       630               790       800
s5     630       660               790       780
s6     600       570               790       800
s7     560       590               780       790
s8     520       550               780       790
s9     480       450               770       780
s10    440       470               760       770
       r_xx = .96                  r_xx = .70

Slide68

Reliability and the Nature of the Sample


Test-retest reliability coefficients:

.96 for large urban university,

.70 for students at the exclusive liberal arts college.

Huh?

But scores are more consistent at exclusive liberal arts college!
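The table's coefficients can be checked directly from the scores; a Python sketch (the data come from the table above):

```python
import numpy as np

# SAT math scores from the table: Time 1 and Time 2 for students s1-s10.
urban_t1   = np.array([780, 740, 700, 660, 630, 600, 560, 520, 480, 440])
urban_t2   = np.array([750, 770, 730, 630, 660, 570, 590, 550, 450, 470])
liberal_t1 = np.array([800, 800, 800, 790, 790, 790, 780, 780, 770, 760])
liberal_t2 = np.array([790, 790, 800, 800, 780, 800, 790, 790, 780, 770])

# Even though the liberal arts scores are more consistent, their restricted
# range (760-800 vs. 440-780) lowers the correlation.
print(np.corrcoef(urban_t1, urban_t2)[0, 1])      # ~ .96
print(np.corrcoef(liberal_t1, liberal_t2)[0, 1])  # ~ .70
```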

Slide69

Reliability and the Nature of the Sample

Point:

The magnitude of a reliability coefficient reflects the degree to which the first and second sets of scores are associated with each other.

Says nothing about magnitude of differences between the first and second administrations.

Why make such a big deal over the restriction of range?

Slide70

Increasing the Reliability of a Test

What happens if you created a test that has low reliability?

How can you increase the reliability?

The simplest way is to increase the number of items!

This results in a greater range of scores.

Slide71

Increasing the Reliability of a Test

Suppose you have two instructors who give an exam.

Instructor A:

Gives test that consists of 100 questions.

You are confident you got 85 questions right.

You think you might have gotten two other questions right.

So you might end up with a score of 85 if you were unlucky or 87 if you were lucky.

Slide72

Increasing the Reliability of a Test

Instructor B.

Gives a test of only five questions.

All five correct results in an A, four correct results in a B, and so on.

The difference between being lucky and unlucky on two items would mean the difference between two letter grades.

Point:

For shorter tests, error variance is likely to be greater compared to longer tests.

Slide73

Increasing the Reliability of a Test

How can we predict the effect that adding items has on the reliability of a test?

A variation of the Spearman-Brown formula can be used:

r' = n × r_xx / (1 + (n - 1) × r_xx)

Where:

r' = estimated reliability of the new test

n = factor by which the test length is increased

r_xx = reliability of the original test
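A quick sketch applying the formula (my illustration): doubling a test whose reliability is .70 is predicted to raise it to about .82.

```python
def lengthened_reliability(r_xx, n):
    """Spearman-Brown prophecy: predicted reliability when length is multiplied by n."""
    return (n * r_xx) / (1 + (n - 1) * r_xx)

# Doubling (n = 2) a test with reliability .70:
print(f"{lengthened_reliability(0.70, 2):.2f}")  # 0.82
```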

Slide74

Next Class

Validity

Test Validity

Types

Content-related validity

Criterion-related validity

Construct-related validity

Content-related validity

Definition

Methods for establishing

Issues

Criterion-related validity

Definition

Types

Concurrent

Predictive

Methods for establishing

Construct-related validity

Definition

Methods for establishing

Multitrait-Multimethod

Factor Analysis

Recent Developments

All construct validity?

Values and social consequences