Data Annotation for Classification
Prediction
Develop a model which can infer a single aspect of the data (predicted variable) from some combination of other aspects of the data (predictor variables)
Which students are off-task?
Which students will fail the class?
Classification
Develop a model which can infer a categorical predicted variable from some combination of other aspects of the data
Which students will fail the class?
Is the student currently gaming the system?
Which type of gaming the system is occurring?
We will…
We will go into detail on classification methods tomorrow
In order to use prediction methods
We need to know what we’re trying to predict
And we need to have some labels of it in real data
For example…
If we want to predict whether a student using educational software is off-task, or gaming the system, or bored, or frustrated, or going to fail the class…
We need to first collect some data
And within that data, we need to be able to identify which students are off-task (or the construct of interest), and ideally when
So we need to label some data
We need to obtain outside knowledge to determine what the value is for the construct of interest
In some cases
We can get a gold-standard label
For instance, if we want to know if a student passed a class, we just go ask their instructor
But for behavioral constructs…
There’s no one to ask
We can’t ask the student (self-presentation)
There’s no gold-standard metric
So we use data labeling methods or observation methods
(e.g. quantitative field observations, video coding)
To collect bronze-standard labels
Not perfect, but good enough
One such labeling method
Text replay coding
Text replays
Pretty-prints of student interaction behavior from the logs
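What a pretty-print looks like depends entirely on the tutor's logging format. Below is a minimal Python sketch under an assumed log format; the field names ("time", "step", "input", "evaluation") and the render_text_replay helper are hypothetical, not part of any actual text replay package.

```python
# A minimal sketch of a text replay "pretty-print".
# Assumes a hypothetical clip format: one dict per logged student action.
def render_text_replay(actions):
    """Format a clip of logged actions as a human-readable text replay."""
    start = actions[0]["time"]
    lines = []
    for a in actions:
        lines.append(
            f"{a['time'] - start:6.1f}s  step={a['step']:<12} "
            f"input={a['input']:<6} {a['evaluation']}"
        )
    return "\n".join(lines)

clip = [
    {"time": 120.0, "step": "x-intercept", "input": "3",    "evaluation": "INCORRECT"},
    {"time": 124.5, "step": "x-intercept", "input": "help", "evaluation": "HINT"},
    {"time": 127.9, "step": "x-intercept", "input": "4",    "evaluation": "CORRECT"},
]
print(render_text_replay(clip))
```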
Examples
[Example text replay screenshots]
Sampling
You can set up any sampling scheme you want, if you have enough log data
5-action sequences
20-second sequences
Every behavior on a specific skill, with other skills omitted
Sampling
Equal number of observations per lesson
Equal number of observations per student
Observations that machine learning software needs help to categorize (“biased sampling”)
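A rough Python sketch of the 5-action-sequence scheme and the equal-number-per-student scheme (the log/clip data structures and field names such as "student" and "time" are assumptions; real tutor logs will differ):

```python
import random
from collections import defaultdict

def clips_of_n_actions(log, n=5):
    """Split each student's (time-ordered) actions into consecutive n-action clips."""
    by_student = defaultdict(list)
    for action in log:
        by_student[action["student"]].append(action)
    clips = []
    for actions in by_student.values():
        clips += [actions[i:i + n] for i in range(0, len(actions) - n + 1, n)]
    return clips

def sample_equal_per_student(clips, per_student=10, seed=0):
    """Draw the same number of clips for every student."""
    rng = random.Random(seed)
    by_student = defaultdict(list)
    for clip in clips:
        by_student[clip[0]["student"]].append(clip)
    sample = []
    for student_clips in by_student.values():
        rng.shuffle(student_clips)
        sample += student_clips[:per_student]
    return sample
```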
Major Advantages
Both video and field observations hold some risk of observer effects
Text replays are based on logs that were collected completely unobtrusively
Major Advantages
Blazing fast to conduct
8 to 40 seconds per observation
Notes
Decent inter-rater reliability is possible
(Baker, Corbett, & Wagner, 2006; Baker, Mitrovic, & Mathews, 2010; Sao Pedro et al., 2010; Montalvo et al., 2010)
Agree with other measures of constructs
(Baker, Corbett, & Wagner, 2006)
Can be used to train machine-learned detectors
(Baker & de Carvalho, 2008; Baker, Mitrovic, & Mathews, 2010; Sao Pedro et al., 2010)
Major Limitations
Limited range of constructs you can code
Gaming the System – yes
Collaboration in online chat – yes
(Prata et al., 2008)
Frustration, Boredom – sometimes
Off-Task Behavior outside of software – no
Collaborative Behavior outside of software – no
Major Limitations
Lower precision (because of the lower bandwidth of observation)
Hands-on exercise
Find a partner
Could be your project team-mate, but doesn’t have to be
You will do this exercise with them
Get a copy of the text replay software
On your flash drive
Or at http://www.joazeirodebaker.net/algebra-obspackage-LSRM.zip
Skim the instructions
At Instructions-LSRM.docx
Log into text replay software
Using exploratory login
Try to figure out what the student’s behavior means, with your partner
Do this for ~5 minutes
Now pick a category you want to code
With your partner
Now code data
According to your coding scheme
(is-category versus is-not-category)
Separate from your partner
For 20 minutes
Now put your data together
Using the observations-NAME files you obtained
Make a table (in Excel?) showing:
              Coder 1 Y   Coder 1 N
Coder 2 Y        15           2
Coder 2 N         3           8
Now…
We can compute your inter-rater reliability… (also called agreement)
Agreement/Accuracy
The easiest measure of inter-rater reliability is agreement, also called accuracy
Agreement = (# of agreements) / (total number of codes)
Agreement/Accuracy
There is general agreement across fields that agreement/accuracy is not a good metric
What are some drawbacks of agreement/accuracy?
Agreement/Accuracy
Let’s say that Tasha and Uniqua agreed on the classification of 9,200 out of 10,000 time sequences
For a coding scheme with two codes
92% accuracy
Good, right?
Non-even assignment to categories
Percent Agreement does poorly when there is non-even assignment to categories
Which is almost always the case
Imagine an extreme case
Uniqua (correctly) picks category A 92% of the time
Tasha always picks category A
Agreement/accuracy of 92%
But essentially no information
An alternate metric
Kappa
(Agreement – Expected Agreement) / (1 – Expected Agreement)
Kappa
Expected agreement is computed from a table of the form:

                     Rater 2 Category 1   Rater 2 Category 2
Rater 1 Category 1        Count                Count
Rater 1 Category 2        Count                Count
Kappa
Expected agreement is computed from a table of the form above
Note that Kappa can be calculated for any number of categories (but only 2 raters)
Cohen’s (1960) Kappa
The formula for 2 categories: Kappa = (Agreement – Expected Agreement) / (1 – Expected Agreement)
Fleiss’s (1971) Kappa, which is more complex, can be used for 3+ categories (and for more than 2 raters)
I have an Excel spreadsheet which calculates multi-category Kappa, which I would be happy to share with you
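For anyone who prefers code to a spreadsheet, a minimal two-rater, multi-category Kappa can be sketched in Python as below (this is just one straightforward implementation of the formula, not the author's spreadsheet). Applying it to the Tasha/Uniqua extreme case from a few slides back shows how 92% agreement can still carry no information.

```python
from collections import Counter

def cohens_kappa(codes1, codes2):
    """Two-rater Kappa for any number of categories.
    codes1, codes2: equal-length lists of category labels, one per clip."""
    n = len(codes1)
    observed = sum(c1 == c2 for c1, c2 in zip(codes1, codes2)) / n
    p1, p2 = Counter(codes1), Counter(codes2)
    # Expected agreement: product of the two coders' proportions, summed over categories
    expected = sum((p1[c] / n) * (p2[c] / n) for c in set(codes1) | set(codes2))
    return (observed - expected) / (1 - expected)

# The extreme case: Uniqua codes "A" 92% of the time, Tasha always codes "A"
uniqua = ["A"] * 92 + ["B"] * 8
tasha = ["A"] * 100
print(cohens_kappa(tasha, uniqua))   # 0.0 -- 92% agreement, but no information
```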
Expected agreement
Look at the proportion of labels each coder gave to each category
To find the proportion of category-A agreements that could be expected by chance, multiply pct(coder1/categoryA) * pct(coder2/categoryA)
Do the same thing for categoryB
Add these two values together
This is your expected agreement
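In symbols, if $p_{1k}$ and $p_{2k}$ are the proportions of labels coder 1 and coder 2 gave to category $k$, and $p_o$ is the observed agreement, this recipe is:

$$p_e = \sum_{k} p_{1k}\,p_{2k}, \qquad \kappa = \frac{p_o - p_e}{1 - p_e}$$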
Example

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20               5
Tyrone On-Task          15              60

Example
What is the percent agreement?
(20 + 60) / 100 = 80%
Example
What is Tyrone’s expected frequency for on-task?
75%

Example
What is Pablo’s expected frequency for on-task?
65%
Example
What is the expected on-task agreement?
0.65 * 0.75 = 0.4875

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20               5
Tyrone On-Task          15              60 (48.75)
Example
What are Tyrone and Pablo’s expected frequencies for off-task behavior?
25% and 35%

Example
What is the expected off-task agreement?
0.25 * 0.35 = 0.0875

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20 (8.75)        5
Tyrone On-Task          15              60 (48.75)
Example
What is the total expected agreement?
0.4875 + 0.0875 = 0.575

Example
What is kappa?
(0.8 – 0.575) / (1 – 0.575)
0.225 / 0.425
0.529
So is that any good?
Kappa = 0.529
What is your Kappa?
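If you would rather compute it in code than by hand, here is a self-contained Python sketch for a 2×2 is-category / is-not-category table; the counts are just the example numbers from the earlier Coder 1 / Coder 2 slide, so substitute your own.

```python
# Kappa for a 2x2 agreement table.
# Rows: coder 2 (Y, N); columns: coder 1 (Y, N).
# These counts are the example numbers from the earlier slide -- replace with yours.
table = [[15, 2],
         [3,  8]]

n = sum(sum(row) for row in table)
observed = (table[0][0] + table[1][1]) / n            # proportion of agreements

coder2_y = sum(table[0]) / n                          # coder 2's proportion of Y (row marginal)
coder1_y = (table[0][0] + table[1][0]) / n            # coder 1's proportion of Y (column marginal)
expected = coder2_y * coder1_y + (1 - coder2_y) * (1 - coder1_y)

kappa = (observed - expected) / (1 - expected)
print(round(kappa, 3))                                # about 0.62 for these counts
```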
Interpreting Kappa
Kappa = 0
Agreement is at chance
Kappa = 1
Agreement is perfect
Kappa = negative infinity
Agreement is perfectly inverse
Kappa > 1
You messed up somewhere
Kappa<0
It does happen, but usually not in the case of inter-rater reliability
Occasionally seen when Kappa is used for EDM or other types of machine learning
0<Kappa<1
What’s a good Kappa?
There is no absolute standard
For inter-rater reliability, 0.8 is usually what ed. psych. reviewers want to see
You can usually make a case that values of Kappa around 0.6 are good enough to be usable for some applications
Particularly if there’s a lot of data
Or if you’re collecting observations to drive EDM, and remembering that this is a “bronze-standard”
Landis & Koch’s (1977) scale
κ              Interpretation
< 0            No agreement
0.0 – 0.20     Slight agreement
0.21 – 0.40    Fair agreement
0.41 – 0.60    Moderate agreement
0.61 – 0.80    Substantial agreement
0.81 – 1.00    Almost perfect agreement
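A small helper that maps a Kappa value onto these labels (the cut-offs are Landis & Koch's; the function is just a convenience sketch):

```python
def landis_koch_label(kappa):
    """Return the Landis & Koch (1977) interpretation for a Kappa value."""
    if kappa < 0:
        return "No agreement"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                         (0.80, "Substantial"), (1.00, "Almost perfect")]:
        if kappa <= upper:
            return label + " agreement"
    return "Kappa > 1 -- check your calculation"

print(landis_koch_label(0.529))   # Moderate agreement
```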
Why is there no standard?
Because Kappa is scaled by the proportion of each category
When one class is much more prevalent
Expected agreement is higher than if classes are evenly balanced
Because of this…
Comparing Kappa values between two studies, in a principled fashion, is highly difficult
A lot of work went into statistical methods for comparing Kappa values in the 1990s
No real consensus
Informally, you can compare two studies if the proportions of each category are “similar”
There is a way to statistically compare two inter-rater reliabilities…
“Junior high school” meta-analysis
Do a 1 df Chi-squared test on each reliability, convert the Chi-squared values to Z, and then compare the two Z values using the method in Rosenthal & Rosnow (1991)
Or in other words, nyardley nyardley nyoo
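One possible reading of that recipe in Python, using scipy (a sketch only: treating each agreement table as a 2×2 contingency table, converting the 1-df chi-squared to Z by taking its square root, and comparing the two Zs as (Z1 - Z2)/sqrt(2) is my interpretation, and should be checked against Rosenthal & Rosnow, 1991, before relying on it):

```python
from math import sqrt
from scipy.stats import chi2_contingency, norm

def compare_reliabilities(table_a, table_b):
    """Compare two inter-rater agreement tables (each a 2x2 coder-by-coder table)."""
    z_scores = []
    for table in (table_a, table_b):
        chi2, p, dof, expected = chi2_contingency(table, correction=False)
        z_scores.append(sqrt(chi2))               # 1-df chi-squared -> Z
    z_diff = (z_scores[0] - z_scores[1]) / sqrt(2)
    return z_diff, 2 * norm.sf(abs(z_diff))       # two-tailed p for the difference

# Placeholder inputs: the two example tables from these slides, standing in for two studies
z, p = compare_reliabilities([[20, 5], [15, 60]], [[15, 2], [3, 8]])
print(z, p)
```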
Additional thoughts/comments
About inter-rater reliability
Additional thoughts/comments
About text replays