Data Annotation for Classification
Presentation Transcript


Prediction

Develop a model which can infer a single aspect of the data (predicted variable) from some combination of other aspects of the data (predictor variables)

Which students are off-task?

Which students will fail the class?

Classification

Develop a model which can infer a categorical predicted variable from some combination of other aspects of the data

Which students will fail the class?

Is the student currently gaming the system?

Which type of gaming the system is occurring?

We will…

We will go into detail on classification methods tomorrow

In order to use prediction methods

We need to know what we’re trying to predict

And we need to have some labels of it in real data

For example…

If we want to predict whether a student using educational software is off-task, or gaming the system, or bored, or frustrated, or going to fail the class…

We need to first collect some data

And within that data, we need to be able to identify which students are off-task (or the construct of interest), and ideally when

So we need to label some data

We need to obtain outside knowledge to determine what the value is for the construct of interest

In some cases

We can get a gold-standard label

For instance, if we want to know if a student passed a class, we just go ask their instructor

But for behavioral constructs…

There’s no one to ask

We can’t ask the student (self-presentation)

There’s no gold-standard metric

So we use data labeling methods or observation methods (e.g. quantitative field observations, video coding)

To collect bronze-standard labels

Not perfect, but good enough

One such labeling method

Text replay coding

Text replays

Pretty-prints of student interaction behavior from the logs

Examples

[The example slides contained images that are not included in this transcript.]

Sampling

You can set up any sampling schema you want, if you have enough log data

5 action sequences

20 second sequences

Every behavior on a specific skill, but other skills omitted

Sampling

Equal number of observations per lesson

Equal number of observations per student

Observations that machine learning software needs help to categorize (“biased sampling”)
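As an illustration of how a sampling scheme like these might be set up in code, here is a minimal sketch (not part of the original slides) that chunks a log into 5-action clips and samples an equal number of clips per student; the column names `student` and `timestamp` are hypothetical and would need to match your own log format.

```python
# Sketch only: chunk an interaction log into 5-action clips and sample
# an equal number of clips per student.  Column names ("student",
# "timestamp") are hypothetical -- adapt them to your own log format.
import pandas as pd

def make_clips(log: pd.DataFrame, actions_per_clip: int = 5) -> pd.DataFrame:
    log = log.sort_values(["student", "timestamp"]).copy()
    # Number each student's actions 0, 1, 2, ... and group them into clips
    log["clip"] = log.groupby("student").cumcount() // actions_per_clip
    return log

def sample_clips(log: pd.DataFrame, clips_per_student: int = 10,
                 seed: int = 0) -> pd.DataFrame:
    clips = make_clips(log)
    # One row per (student, clip); draw the same number of clips per student
    clip_ids = clips[["student", "clip"]].drop_duplicates()
    chosen = (clip_ids.groupby("student", group_keys=False)
              .apply(lambda g: g.sample(min(clips_per_student, len(g)),
                                        random_state=seed)))
    return clips.merge(chosen, on=["student", "clip"])
```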

Major Advantages

Both video and field observations hold some risk of observer effects

Text replays are based on logs that were collected completely unobtrusively

Major Advantages

Blazing fast to conduct

8 to 40 seconds per observation

Notes

Decent inter-rater reliability is possible

(Baker, Corbett, & Wagner, 2006) (Baker, Mitrovic, & Mathews, 2010) (Sao Pedro et al., 2010) (Montalvo et al., 2010)

Agree with other measures of constructs

(Baker, Corbett, & Wagner, 2006)

Can be used to train machine-learned detectors

(Baker & de Carvalho, 2008) (Baker, Mitrovic, & Mathews, 2010) (Sao Pedro et al., 2010)

Major Limitations

Limited range of constructs you can code

Gaming the System – yes

Collaboration in online chat – yes (Prata et al., 2008)

Frustration, Boredom – sometimes

Off-Task Behavior outside of software – no

Collaborative Behavior outside of software – no

Major Limitations

Lower precision (because of the lower bandwidth of observation)

Hands-on exercise

Find a partner

Could be your project team-mate, but doesn’t have to be

You will do this exercise with them

Get a copy of the text replay software

On your flash drive

Or at http://www.joazeirodebaker.net/algebra-obspackage-LSRM.zip


Skim the instructions

At Instructions-LSRM.docx

Log into text replay software

Using exploratory login

Try to figure out what the student’s behavior means, with your partner

Do this for ~5 minutes

Now pick a category you want to code

With your partner

Now code data

According to your coding scheme

(is-category versus is-not-category)

Separate from your partner

For 20 minutes

Now put your data together

Using the observations-NAME files you obtained

Make a table (in Excel?) showing:

             Coder 1: Y   Coder 1: N
Coder 2: Y       15            2
Coder 2: N        3            8

Now…

We can compute your inter-rater reliability… (also called agreement)

Agreement / Accuracy

The easiest measure of inter-rater reliability is agreement, also called accuracy

Agreement = (# of agreements) / (total number of codes)
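The formula is simple enough to compute directly; a minimal sketch, assuming the two coders' labels are aligned lists:

```python
# Agreement/accuracy: the fraction of clips where the two coders agree.
def agreement(labels1, labels2):
    assert len(labels1) == len(labels2)
    matches = sum(a == b for a, b in zip(labels1, labels2))
    return matches / len(labels1)

print(agreement(["Y", "Y", "N", "Y"], ["Y", "N", "N", "Y"]))   # 0.75
```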

Agreement / Accuracy

There is general agreement across fields that agreement/accuracy is not a good metric

What are some drawbacks of agreement/accuracy?

Agreement / Accuracy

Let’s say that Tasha and Uniqua agreed on the classification of 9200 time sequences, out of 10000 actions

For a coding scheme with two codes

92% accuracy

Good, right?

Non-even assignment to categories

Percent Agreement does poorly when there is non-even assignment to categories

Which is almost always the case

Imagine an extreme case

Uniqua (correctly) picks category A 92% of the time

Tasha always picks category A

Agreement/accuracy of 92%

But essentially no information

An alternate metric

Kappa

Kappa = (Agreement – Expected Agreement) / (1 – Expected Agreement)
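As code, the formula is a one-liner (expected agreement is worked out on the following slides):

```python
# Kappa, exactly as in the formula above.
def kappa(agreement, expected_agreement):
    return (agreement - expected_agreement) / (1 - expected_agreement)
```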

Kappa

Expected agreement computed from a table of the form

                      Rater 2: Category 1   Rater 2: Category 2
Rater 1: Category 1          Count                 Count
Rater 1: Category 2          Count                 Count

Kappa

Expected agreement computed from a table of the form

Note that Kappa can be calculated for any number of categories (but only 2 raters)

                      Rater 2: Category 1   Rater 2: Category 2
Rater 1: Category 1          Count                 Count
Rater 1: Category 2          Count                 Count

Cohen’s (1960) Kappa

The formula for 2 categories

Fleiss’s (1971) Kappa, which is more complex, can be used for 3+ categories

I have an Excel spreadsheet which calculates multi-category Kappa, which I would be happy to share with you
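If you would rather not hand-roll the calculation, off-the-shelf implementations exist. The sketch below assumes scikit-learn (and, optionally, statsmodels) is available in your environment; the labels are hypothetical.

```python
# Off-the-shelf alternatives to a hand-rolled spreadsheet.
from sklearn.metrics import cohen_kappa_score

coder1 = ["off", "on", "on", "off", "on"]   # hypothetical labels
coder2 = ["off", "on", "off", "off", "on"]
print(cohen_kappa_score(coder1, coder2))    # Cohen's kappa: 2 raters, any number of categories

# For 3+ raters, statsmodels provides Fleiss' kappa (if installed):
# from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters
# table, _ = aggregate_raters(ratings)      # ratings: subjects x raters array
# print(fleiss_kappa(table))
```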

Expected agreement

Look at the proportion of labels each coder gave to each category

To find the number of agreements on category A that could be expected by chance, multiply pct(coder1/categoryA) * pct(coder2/categoryA) * (total number of labels)

Do the same thing for category B

Add these two values together and divide by the total number of labels

This is your expected agreement
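The same recipe as code, working directly with proportions (so the final division by the number of labels is already folded in); a sketch assuming a table of counts with rows for rater 1 and columns for rater 2:

```python
# Expected (chance) agreement from a table of counts.
# Rows = rater 1's codes, columns = rater 2's codes, e.g. [[A_A, A_B], [B_A, B_B]].
def expected_agreement(table):
    total = sum(sum(row) for row in table)
    row_pct = [sum(row) / total for row in table]           # rater 1's marginals
    col_pct = [sum(col) / total for col in zip(*table)]     # rater 2's marginals
    # Chance agreement per category, summed over categories
    return sum(r * c for r, c in zip(row_pct, col_pct))
```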

Example

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60

Example

What is the percent agreement?

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60

Example

What is the percent agreement?

80%

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60

Example

What is Tyrone’s expected frequency for on-task?

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60

Example

What is Tyrone’s expected frequency for on-task?

75%

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60

Example

What is Pablo’s expected frequency for on-task?

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60

Example

What is Pablo’s expected frequency for on-task?

65%

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60

Example

What is the expected on-task agreement?

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60

Example

What is the expected on-task agreement?

0.65 * 0.75 = 0.4875

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60

Example

What is the expected on-task agreement?

0.65 * 0.75 = 0.4875

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60 (48.75)

Example

What are Tyrone and Pablo’s expected frequencies for off-task behavior?

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60 (48.75)

Example

What are Tyrone and Pablo’s expected frequencies for off-task behavior?

25% and 35%

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60 (48.75)

Example

What is the expected off-task agreement?

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60 (48.75)

Example

What is the expected off-task agreement?

0.25 * 0.35 = 0.0875

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20                5
Tyrone On-Task          15               60 (48.75)

Example

What is the expected off-task agreement?

0.25 * 0.35 = 0.0875

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20 (8.75)         5
Tyrone On-Task          15               60 (48.75)

Example

What is the total expected agreement?

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20 (8.75)         5
Tyrone On-Task          15               60 (48.75)

Example

What is the total expected agreement?

0.4875 + 0.0875 = 0.575

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20 (8.75)         5
Tyrone On-Task          15               60 (48.75)

Example

What is kappa?

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20 (8.75)         5
Tyrone On-Task          15               60 (48.75)

Example

What is kappa?

(0.8 – 0.575) / (1 – 0.575) = 0.225 / 0.425 ≈ 0.529

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20 (8.75)         5
Tyrone On-Task          15               60 (48.75)

So is that any good?

What is kappa?

(0.8 – 0.575) / (1 – 0.575) = 0.225 / 0.425 ≈ 0.529

                  Pablo Off-Task   Pablo On-Task
Tyrone Off-Task         20 (8.75)         5
Tyrone On-Task          15               60 (48.75)
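As a check, the whole worked example can be reproduced in a few lines; the counts below are the ones from the table above:

```python
# Tyrone x Pablo table from the example: rows = Tyrone, columns = Pablo,
# category order = [Off-Task, On-Task]
table = [[20, 5],
         [15, 60]]

total = sum(sum(row) for row in table)                            # 100
observed = sum(table[i][i] for i in range(len(table))) / total    # (20 + 60) / 100 = 0.80

row_pct = [sum(row) / total for row in table]                     # Tyrone: [0.25, 0.75]
col_pct = [sum(col) / total for col in zip(*table)]               # Pablo:  [0.35, 0.65]
expected = sum(r * c for r, c in zip(row_pct, col_pct))           # 0.0875 + 0.4875 = 0.575

kappa = (observed - expected) / (1 - expected)                    # 0.225 / 0.425 ≈ 0.529
print(observed, expected, round(kappa, 3))
```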

What is your Kappa?

Interpreting Kappa

Kappa = 0

Agreement is at chance

Kappa = 1

Agreement is perfect

Kappa = negative infinity

Agreement is perfectly inverse

Kappa > 1

You messed up somewhere

Kappa<0

It does happen, but usually not in the case of inter-rater reliability

Occasionally seen when Kappa is used for EDM or other types of machine learning

0<Kappa<1

What’s a good Kappa?

There is no absolute standard

For inter-rater reliability, 0.8 is usually what ed. psych. reviewers want to see

You can usually make a case that values of Kappa around 0.6 are good enough to be usable for some applications

Particularly if there’s a lot of data

Or if you’re collecting observations to drive EDM, and remembering that this is a “bronze-standard”

Landis & Koch’s (1977) scale

κ              Interpretation
< 0            No agreement
0.00 – 0.20    Slight agreement
0.21 – 0.40    Fair agreement
0.41 – 0.60    Moderate agreement
0.81 – 1.00    Almost perfect agreement
0.61 – 0.80    Substantial agreement
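For convenience, the same bands as a small lookup function (a sketch, not part of the original slides):

```python
# The Landis & Koch (1977) bands above, as a quick lookup.
def landis_koch(kappa: float) -> str:
    if kappa < 0:
        return "No agreement"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
                         (0.80, "Substantial"), (1.00, "Almost perfect")]:
        if kappa <= upper:
            return label + " agreement"
    return "Check your computation (kappa > 1)"

print(landis_koch(0.529))   # Moderate agreement
```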

Why is there no standard?

Because Kappa is scaled by the proportion of each category

When one class is much more prevalent, expected agreement is higher than if classes are evenly balanced

Because of this…

Comparing Kappa values between two studies, in a principled fashion, is highly difficult

A lot of work went into statistical methods for comparing Kappa values in the 1990s

No real consensus

Informally, you can compare two studies if the proportions of each category are “similar”

There is a way to statistically compare two inter-rater reliabilities…

“Junior high school” meta-analysis

There is a way to statistically compare two inter-rater reliabilities…

“Junior high school” meta-analysis

Do a 1 df Chi-squared test on each reliability, convert the Chi-squared values to Z, and then compare the two Z values using the method in Rosenthal & Rosnow (1991)

There is a way to statistically compare two inter-rater reliabilities…

“Junior high school” meta-analysis

Do a 1 df Chi-squared test on each reliability, convert the Chi-squared values to Z, and then compare the two Z values using the method in Rosenthal & Rosnow (1991)

Or in other words, nyardley nyardley nyoo

Additional thoughts/comments

About inter-rater reliability

Additional thoughts/comments

About text replays