Slide 1
Chapter 3
Getting It Right Every Time: Reliability
Slide 2: Outline
Reliability
Definition
Test score theory
True score
Error
Sources of error
Test Reliability
Test-retest
Parallel forms
Single administration methods
Split-half
Internal consistency
Coefficient alpha
KR 20
Using the relevant measure of internal consistency
Slide 3: Outline
Inter-scorer reliability
Standard error of measurement
Reliability and nature of sample
Heterogeneous samples
Homogeneous samples
Restriction of range
Increasing the reliability of a test
Slide 4
Suppose you want to know how much you weigh, and you have an inexpensive bathroom scale. You never get the same weight twice. So you weigh yourself five times and take the mean.
Slide 5: Reliability
Your bathroom scale is not a perfect measurement device.
The varying readings are due to measurement error.
Psychological tests are also imperfect measurement devices.
There is always some error associated with a psychological test.
Psychologists approach the problem of error in the same way we approached the bathroom scale problem.
Slide 6
Psychologists assume there is a true level of the characteristic we are trying to measure. If the test were taken many times, without memory, practice, or fatigue effects, the scores would cluster around a true score. The mean would be the best estimate of the individual's true score.
Slide 7: Reliability
Take the bathroom scale analogy even further.
The similarity among observed weights reflects the scale's reliability.
The more consistent measurement devices are, the higher their reliability.
Why is this important?
We cannot administer a test many times to determine a person's true score.
And so much rides on a test score.
Slide 8
Charles Spearman is the father of test score theory. He suggested that an observed test score (X) has two components: a true score (T) and an error component (E), represented by the following formula:

X = T + E
Slide 9: Test Score Theory
True score (T)
Factors that lead to consistency in measurement of the characteristic in question.
These include:
The individual's true level of the characteristic, and
Situational variables.
Think about how anxiety scores might vary during the course of the semester.
Slide 10: Test Score Theory
Error component (E)
Factors that contribute to inconsistency in measurement.
Three reasons for error in measurement:
People change from T1 to T2
Differences in tasks from T1 to T2
Limited sample of behavior
We need an indicator of how much error there is in a measurement (i.e., how reliable the measure is).
Slide 11: Test Reliability
There are a number of ways to assess reliability.
We will consider three:
Test-retest (or retesting with the same form)
Parallel forms
Single administration methods
Split-half
Internal consistency
Coefficient alpha
KR 20
Slide 12: Test-Retest Reliability
Provides an indication of the stability of test scores over time. Most psychological tests measure relatively stable characteristics or traits, so we expect test results to reflect this stability: low variation in a person's performance over repeated measurement.
Slide 13
Suppose you took an intelligence test today and then took the same exam a few days later. You would expect to receive very similar scores on the two tests; it is unlikely your intelligence would change drastically over a few days. Yet there should be some variability in test scores (error).
Slide 14: Test-Retest Reliability
Calculation of test-retest reliability is straightforward.
Administer the same test at two different times to:
The same group of people, under
The same conditions.
Yields two scores for each person.
Compute the correlation between these two sets of scores.
This is the test-retest reliability coefficient.
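The procedure above amounts to computing a Pearson correlation between the two sets of scores. A minimal sketch in Python; the scores for five people are hypothetical:

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical scores: the same five people tested on two occasions
time1 = [85, 72, 90, 64, 78]
time2 = [83, 75, 92, 60, 80]
test_retest = pearson_r(time1, time2)  # the test-retest reliability coefficient
```

A coefficient near +1.0 indicates stable scores across the two administrations.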
Slide 15: Warning!
The interval between the two test administrations is important. It should be long enough that memory effects are minimized, but not so long that true score changes have occurred.
Slide 16: Test-Retest Reliability
Example
You took an achievement test in the morning and then again in the afternoon.
You might remember responses and repeat them.
This produces an artificially high estimate of the reliability of the test.
You took the second test six months later.
Newly learned skills may result in a meaningfully higher score.
This produces an artificially low estimate of the reliability of the test.
Slide 17: Test-Retest Reliability
There is no simple rule.
The decision depends on:
The nature of the test, and
Its intended use.
When in doubt, check the test manual.
The interval between testings is reported in the test manual.
Evaluate this information in terms of the characteristics being measured and the people being tested.
Slide 18: Test-Retest Reliability
How high should the correlation coefficient be?
Obviously, the higher the better.
But it is difficult to provide specific numbers.
It depends on the type of test and its intended use.
Slide 19: Test-Retest Reliability
Example
The Wechsler Adult Intelligence Scale (WAIS-R) reports a test-retest reliability coefficient of .97.
The scale consists of 11 subscales.
Most have reliability coefficients in the high .80s and .90s.
Some are in the .60s!
The point?
Look at the specific reliabilities of the subtests before the full-scale correlation coefficient.
Slide 20: Test-Retest Reliability
Why the differences among the subscales?
Different types of intelligence are assessed.
Subscales with higher correlation coefficients are related to vocabulary,
a relatively stable variable.
Subscales with lower correlation coefficients are related to short-term memory,
a relatively less stable variable.
Slide 21: Advantages of Test-Retest Reliability
To evaluate the advantages and disadvantages of test-retest reliability, recall the sources of error:
People change from T1 to T2
Differences in tasks from T1 to T2
Limited sample of behavior
Slide 22: Advantages of Test-Retest Reliability
Advantage:
Test-retest reliability can provide good estimates of error from:
How people change from T1 to T2
Differences in tasks from T1 to T2
Disadvantage:
Memory and practice effects may come into play.
Cannot evaluate the effects of a limited sample of behavior.
Slide 23: Alternative Form Reliability
Test-retest reliability cannot evaluate error associated with the sample of tasks that represent a domain of behavior.
We need another procedure.
The procedure developed is alternative (or parallel) form reliability.
Slide 24
It is desirable to have more than one form of a test. Think about the SAT: you take the test in September and research the questions you got wrong, then take it again in December. Your score would improve.
Slide 25: Alternative Form Reliability
Problem!
If two forms exist, are they measuring the same thing?
How confident are we that they are?
When more than one form exists, it is important that the forms measure the same thing!
The content of the various forms of the test must be equivalent.
Solution!
Alternative (parallel) form reliability.
Slide 26: Alternative Form Reliability
Calculation of alternative form reliability is straightforward.
Administer two forms of the test at two different times to:
The same group of people, under
The same conditions.
This yields two scores for each person.
Compute the correlation between these two sets of scores.
This is the alternative form reliability coefficient.
Slide 27: Alternative Form Reliability
The correlation coefficient reflects the degree to which content is similar on the two forms.
Thus, error variance is said to result from content sampling.
The higher the correlation coefficient, the more confident we are that the content sampled in Form A is equivalent to the content sampled in Form B.
Slide 28: Alternative Form Reliability
Evaluating alternative form reliability in a test manual:
Determine whether the forms were administered on the same occasion.
For short tests (30 minutes or less):
It is not unreasonable to ask people to take both forms in one sitting.
For long tests (e.g., the SAT):
It is reasonable to wait a week or two before administering the second form of the test.
Slide 29: Alternative Form Reliability
Advantage:
Provides an estimate of variability due to a limited sample of behavior.
If testing is separated by an interval:
Alternative form reliability provides an estimate of all three sources of variability:
People change from T1 to T2
Differences in tasks from T1 to T2
Limited sample of behavior
Slide 30: Alternative Form Reliability
Disadvantage:
Requires two parallel forms of the test.
Requires examinees to be available at two different times.
Solution!
Subdivide the test.
Reliability estimates are gleaned from one test divided into two equivalent halves.
Slide 31: Split-half Reliability
The majority of psychological tests consist of a single form.
We are still interested in the degree to which content sampling influences reliability.
When a single form exists, the method used is split-half reliability.
Split-half reliability measures internal consistency.
Slide 32: Split-half Reliability
To calculate split-half reliability:
Single test administered to a group of people.
Each person’s test is divided into two equal halves.
Each person has two scores.
Split-half reliability is the correlation between these two sets of scores.
Error variance results from content sampling.
Slide 33: Split-half Reliability
Problem!
How is the test divided?
Solution!
An odd-even split resulting in two scores:
One for the odd items, and
One for the even items.
This is usually the preferred method because items become more difficult as the test progresses.
It ensures the halves are approximately equal in difficulty level.
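A sketch of the odd-even split in Python, using hypothetical 0/1 item responses; the two resulting score sets would then be correlated to obtain the split-half reliability:

```python
# Each row is one person's answers (1 = correct, 0 = incorrect) on a 6-item test.
responses = [
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]

# Odd items (1st, 3rd, 5th) vs. even items (2nd, 4th, 6th)
odd_scores = [sum(row[0::2]) for row in responses]
even_scores = [sum(row[1::2]) for row in responses]
# Correlating odd_scores with even_scores gives the split-half reliability.
```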
Slide 34
Two exceptions! The first concerns multiple questions for a single topic: all questions concerning the topic should be included in one of the halves. Consider the SAT, where a paragraph is presented and a series of questions is asked based on the paragraph.
Slide 35
The second case concerns speed tests. In speed tests, the items are so easy that almost everyone answers every item attempted correctly. The goal is to measure how quickly a person can perform the task.
Slide 36: Split-half Reliability
Disadvantage:
Split-half reliability examines the relationship between two tests that are each half as long as the full-length test.
For example, calculate the split-half reliability for a final exam consisting of 100 questions.
The test is divided into odd and even halves of 50 items each.
Why is this a problem?
Slide 37: Split-half Reliability
Consider the tests:
Scores on the 100-item test can range from 0 to 100.
Scores on a 50-item half can only range from 0 to 50.
When a range of scores is restricted, the resulting correlation coefficient is lowered.
The 50-item halves therefore underestimate the reliability of the 100-item test.
Slide38Restriction of range deflates correlation.
Why?
Slide39Restriction of range deflates correlation.
Why?
Slide 40: Spearman-Brown Correction Formula
Solution!
The Spearman-Brown correction formula estimates what our split-half reliability would be if we could calculate it for the full 100-item test:

r_est = (2 × r_ob) / (1 + r_ob)

where r_ob represents our obtained split-half reliability coefficient, and r_est represents the split-half reliability estimated by the Spearman-Brown formula.
Slide 41: Spearman-Brown Correction Formula
Example
The correlation between two 50-item halves of a final exam is .80. The estimated reliability would be:

r_est = (2 × r_ob) / (1 + r_ob)
r_est = 2(.80) / (1 + .80)
r_est = 1.60 / 1.80
r_est = .89
Test authors usually report the “corrected” split-half reliability coefficient in test manuals and journal articles.
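The correction is easy to express in code. A minimal sketch in Python, reproducing the example above:

```python
def spearman_brown(r_ob):
    """Estimate full-length reliability from an obtained split-half correlation."""
    return (2 * r_ob) / (1 + r_ob)

r_est = spearman_brown(0.80)
print(round(r_est, 2))  # .89, as in the slide's example
```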
Slide 42: Kuder-Richardson Reliability
A second measure of internal consistency.
Provides:
An estimate of error variance resulting from content sampling, and
An estimate of error variance resulting from content heterogeneity.
Content heterogeneity:
The items measure different traits, abilities, etc.
Content homogeneity:
All items measure the same trait, ability, etc.
Slide 43: Kuder-Richardson Reliability
Example
An exam contains analytical and factual questions.
Assume the difficulty of the questions is distributed evenly.
The split-half reliability is high.
However, some people excel at analytical questions while others excel at factual questions.
The Kuder-Richardson reliability coefficient takes this into account.
Its reliability coefficient would be lower than the split-half reliability coefficient.
Slide 44: Kuder-Richardson Reliability
A low Kuder-Richardson reliability coefficient reflects heterogeneity of item content.
This can be a good or a bad thing.
More on this later.
How do we calculate the KR-20 reliability estimate?
Slide 45: Kuder-Richardson Reliability
The formula for KR20 is:

KR20 = [N / (N − 1)] × [(s² − Σpq) / s²]

where
KR20 = reliability estimate
N = number of items on the test
s² = variance of the total test
p = proportion of people answering each item correctly
q = proportion of people answering each item incorrectly
The calculation is not difficult, BUT it is time consuming!
Leave the calculations to a computer.
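A minimal sketch of the KR-20 calculation in Python, using population variances and made-up 0/1 data:

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 for dichotomous (0/1) data; rows = people, columns = items."""
    n = len(responses[0])                     # N: number of items
    people = len(responses)
    totals = [sum(row) for row in responses]  # each person's total score
    s2 = pvariance(totals)                    # variance of the total test
    sum_pq = 0.0
    for i in range(n):
        p = sum(row[i] for row in responses) / people  # proportion correct
        sum_pq += p * (1 - p)                          # q = 1 - p
    return (n / (n - 1)) * ((s2 - sum_pq) / s2)

# Hypothetical data: four people, three items
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(data))  # 0.75 for this data
```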
Slide 46: Coefficient Alpha
Disadvantage:
The Kuder-Richardson formula can only be used for tests where scoring is dichotomous:
Right or wrong, yes or no, 0 or 1.
Solution:
We need another statistic that can be used with continuous measurement scales:
Coefficient alpha (α).
Slide 47: Coefficient Alpha
The formula for coefficient alpha is:

α = [N / (N − 1)] × [(s² − Σs²ᵢ) / s²]

where
α = coefficient alpha reliability
N = number of items in the test
s² = variance of the total test score
s²ᵢ = variance of each individual item
Slide 48: Coefficient Alpha
Difference between coefficient alpha and KR20:
The variance of each item (s²ᵢ) is substituted for the sum of the products of the proportions of people answering each item correctly and incorrectly (Σpq).
Why is this important?
The variance of each item can be calculated even when the response is dichotomous.
So coefficient alpha can be used for tests with both response formats.
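A matching sketch for coefficient alpha, with hypothetical 1-5 ratings. Note that the population variance of a 0/1 item equals p × q, so on dichotomous data this function reproduces KR-20 exactly:

```python
from statistics import pvariance

def coefficient_alpha(responses):
    """Cronbach's alpha; rows = people, columns = items (any numeric scale)."""
    n = len(responses[0])                        # N: number of items
    totals = [sum(row) for row in responses]
    s2_total = pvariance(totals)                 # variance of the total score
    sum_item_var = sum(
        pvariance([row[i] for row in responses]) # variance of each item
        for i in range(n)
    )
    return (n / (n - 1)) * ((s2_total - sum_item_var) / s2_total)

# Works for continuous/Likert-type data (hypothetical ratings)
ratings = [[4, 5, 4], [3, 3, 2], [5, 5, 5], [2, 3, 2]]
alpha = coefficient_alpha(ratings)
```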
Slide 49: Using the Relevant Measure of Internal Consistency
When should we calculate split-half reliability versus KR20/coefficient alpha?
Split-half reliability reflects the error variance from content sampling.
KR20 and coefficient alpha reflect error variance from content sampling AND content heterogeneity.
Slide 50: Using the Relevant Measure of Internal Consistency
If it is desirable to have high split-half reliability but not desirable to have a homogeneous set of items:
Split-half reliability should be calculated.
Example: intelligence.
If it is desirable to have every item measure a single construct (homogeneity of items):
KR20 or coefficient alpha should be calculated.
Example: anxiety.
Slide 51: Inter-Scorer Reliability
For many tests, reliability of scoring is not an issue.
Examples include:
Tests where a computer records which alternative was selected.
Personality tests where the scorer counts the number of true or false responses.
The point:
Under such objective circumstances, scoring of the tests will be accurate and reliable.
Slide 52: Inter-Scorer Reliability
But what about tests that depend on the judgment of the person doing the scoring?
Examples include:
Essay tests.
Projective tests of personality.
In these subjective scoring cases, it is important to establish the inter-scorer reliability of the test.
Slide 53: Inter-Scorer Reliability
Calculating inter-scorer reliability (inter-judge reliability) is a straightforward process.
Step 1: Administer the test to a group of people on a single occasion.
Step 2: Have two scorers or judges independently score each test.
Step 3: Correlate the two sets of scores.
Slide 54: Inter-Scorer Reliability
There will be differences between how the judges score the test.
Perfect agreement = +1.0.
Any correlation less than +1.0 indicates error variance:
The judges are not rating the test in identical fashion.
Slide 55: Inter-Scorer Reliability
How can error variance resulting from scorer differences be reduced?
Offer numerous, clear examples of scoring guidelines in the test manual.
Train those who will be doing the scoring.
Slide 56: The Standard Error of Measurement
Recall the purpose of calculating reliability coefficients:
To determine the proportion of true variance to total variance in a set of test scores.
Problem:
This is an aggregate measure.
It does NOT tell us much about the test score received by a particular person.
Slide 57: The Standard Error of Measurement
Tests are used to make decisions about individuals.
It is important to know how error variance might influence the way we interpret an individual's test score.
We need a statistic that tells us the effect error variance has on the estimate of an individual's true score.
Solution:
The standard error of measurement.
Slide 58: The Standard Error of Measurement
Calculating the standard error of measurement is straightforward:

σ_meas = SD_x × √(1 − r_xx)

where
σ_meas = standard error of measurement
SD_x = standard deviation of the test scores
r_xx = reliability of the test
Slide 59: The Standard Error of Measurement
Example
Susan, a fifth-grader, is being considered for the gifted program at her school.
Students must receive a score of 130 on a standardized intelligence test to qualify.
Susan receives a score of 128.
The standard deviation for the test is 15.
The reliability is .90.
Calculate the standard error of measurement.
Slide 60: The Standard Error of Measurement
Calculate the standard error of measurement. The standard deviation for the test is 15 and the reliability is .90:

σ_meas = SD_x × √(1 − r_xx)
σ_meas = 15 × √(1 − .90)
σ_meas = 15 × √.10
σ_meas = 15(.32)
σ_meas = 4.8

How do we apply the standard error of measurement to Susan's score?
Slide 61: The Standard Error of Measurement
Recall that 68% of the area under the normal curve falls between −1 standard deviation and +1 standard deviation.
Thus, there is a 68% chance that Susan's true intelligence test score falls between 123.2 and 132.8 points.
Slide 62: The Standard Error of Measurement
How's that?
Susan's score = 128.
The standard error of measurement is 4.8.
IQ scores are standardized.
We know that 1 standard error of measurement is equal to 4.8 IQ points.
Add 4.8 to 128 and subtract 4.8 from 128 to come up with the estimate.
Susan's true score likely falls between 123.2 and 132.8 points.
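Susan's band is easy to recompute. A minimal sketch in Python; note the slide rounds √.10 to .32, which gives 4.8 (unrounded, σ_meas ≈ 4.74):

```python
def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - r_xx)."""
    return sd * (1 - reliability) ** 0.5

s = sem(15, 0.90)             # about 4.74; the slide rounds to 4.8
low, high = 128 - s, 128 + s  # 68% band around Susan's observed score of 128
```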
Slide 63: The Standard Error of Measurement
What implication does this have for Susan getting into the gifted program?
Recall that a score of 128 does not qualify.
BUT the error variance associated with the test is reasonably high:
Susan's true score could fall anywhere between 123.2 and 132.8 points!
This justifies a request for retesting.
Slide 64: Reliability and the Nature of the Sample
To this point, we have focused on the reliability of tests.
Now we will consider the sample.
Why is the sample important?
Reliability coefficients are dependent on samples.
Slide 65: Reliability and the Nature of the Sample
For reliability issues, there are two types of samples:
Heterogeneous samples
Many different types of people comprise the sample.
Homogeneous samples
People who are similar to one another comprise the sample.
Slide 66: Reliability and the Nature of the Sample
Reliability coefficients based on heterogeneous samples are likely to be higher than coefficients based on homogeneous samples.
Why?
Restricted range of scores!
Slide 67: Reliability and the Nature of the Sample

          Large Urban University     Exclusive Liberal Arts College
          T1        T2               T1        T2
s1        780       750              800       790
s2        740       770              800       790
s3        700       730              800       800
s4        660       630              790       800
s5        630       660              790       780
s6        600       570              790       800
s7        560       590              780       790
s8        520       550              780       790
s9        480       450              770       780
s10       440       470              760       770
r_xx = .96                           r_xx = .70

Suppose we calculated the test-retest reliability for the math part of the SAT at two different universities. The first is a large urban university. The second is an exclusive liberal arts college (minimum math score: 750).
Slide 68: Reliability and the Nature of the Sample
[The same table of scores as Slide 67 is repeated here.]
Test-retest reliability coefficients:
.96 for large urban university,
.70 for students at the exclusive liberal arts college.
Huh?
But scores are more consistent at exclusive liberal arts college!
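The two coefficients can be verified directly from the scores in Slide 67. A sketch in Python; the correlations come out to about .96 and .71, matching the slide's .96 and .70 to rounding:

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

urban_t1   = [780, 740, 700, 660, 630, 600, 560, 520, 480, 440]
urban_t2   = [750, 770, 730, 630, 660, 570, 590, 550, 450, 470]
college_t1 = [800, 800, 800, 790, 790, 790, 780, 780, 770, 760]
college_t2 = [790, 790, 800, 800, 780, 800, 790, 790, 780, 770]

r_urban = pearson_r(urban_t1, urban_t2)        # wide range of scores
r_college = pearson_r(college_t1, college_t2)  # restricted range deflates r
```

The college students' scores barely move between administrations, yet their reliability coefficient is lower: the restricted range leaves little between-person variance for the correlation to pick up.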
Slide 69: Reliability and the Nature of the Sample
The point:
The magnitude of a reliability coefficient reflects the degree to which the first and second sets of scores are associated with each other.
It says nothing about the magnitude of the differences between the first and second administrations.
So why make such a big deal over restriction of range?
Slide 70: Increasing the Reliability of a Test
What happens if you create a test that has low reliability?
How can you increase the reliability?
The simplest way is by increasing the number of items!
This results in a greater range of scores.
Slide 71: Increasing the Reliability of a Test
Suppose you have two instructors who give an exam.
Instructor A:
Gives test that consists of 100 questions.
You are confident you got 85 questions right.
You think you might have gotten two other questions right.
So you might end up with a score of 85 if you were unlucky or 87 if you were lucky.
Slide 72: Increasing the Reliability of a Test
Instructor B:
Gives a test of only five questions.
All five correct results in an A, four correct results in a B, and so on.
The difference between being lucky and unlucky on two items would mean the difference of two letter grades.
The point:
For shorter tests, error variance is likely to be greater compared to longer tests.
Slide 73: Increasing the Reliability of a Test
How can we predict the effect that adding items has on the reliability of a test?
A variation of the Spearman-Brown formula can be used:

r′ = (n × r_xx) / (1 + (n − 1) × r_xx)

where
r′ = estimated reliability of the new test
n = factor by which the test length is increased
r_xx = reliability of the original test
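A sketch of the prophecy formula in Python, with a hypothetical starting reliability of .70:

```python
def spearman_brown_prophecy(r_xx, n):
    """Estimated reliability after lengthening a test by a factor of n."""
    return (n * r_xx) / (1 + (n - 1) * r_xx)

# Hypothetical: doubling the length of a test whose reliability is .70
print(round(spearman_brown_prophecy(0.70, 2), 2))  # 0.82
```

Note that with n = 2 this reduces to the split-half correction formula from Slide 40.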
Slide 74: Next Class
Validity
Test Validity
Types
Content-related validity
Criterion-related validity
Construct-related validity
Content-related validity
Definition
Methods for establishing
Issues
Criterion-related validity
Definition
Types
Concurrent
Predictive
Methods for establishing
Construct-related validity
Definition
Methods for establishing
Multitrait-Multimethod
Factor Analysis
Recent Developments
All construct validity?
Values and social consequences