Presentation Transcript

Slide1

Adjudicator Agreement and System Rankings for Person Name Search

Mark Arehart, Chris Wolf, Keith Miller

The MITRE Corporation

{marehart, cwolf, keith}@mitre.org

Slide2

Summary

Matching multicultural name variants is knowledge intensive

Ground truth dataset requires tedious adjudication

Guidelines not comprehensive, adjudicators often disagree

Previous evaluations: multiple adjudication, voting

Results of study: high agreement, multiple adjudication not needed

"Nearly" same payoff for much less effort

Slide3

Dataset

Watchlist, ~71K

Deceased persons lists

Mixed cultures

1.1K variants for 404 base names

Ave. 2.8 variants per base record

Queries, 700

404 base names

296 randomly selected from watchlist

Subset of 100 randomly selected for this study

Slide4

Method

Adjudication pools as in TREC: pool from 13 algorithms

Four judges complete pools (1712 pairs, excluding exact matches)

Compare system rankings under different versions of ground truth

Type           | Criteria for true match
1 Consensus    | Tie or majority vote (baseline)
2 Union        | Judged true by anyone
3 Intersection | Judged true by everyone
4 Single       | Judgments from a single adjudicator (4 GT versions)
5 Random       | Randomly choose adjudicator per item (1000 GT versions)
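A minimal sketch of how the five ground-truth variants could be assembled from the four adjudicators' binary judgments. The pair IDs, judge labels, and helper names are illustrative assumptions, not taken from the paper.

```python
# Hypothetical data: pair_id -> {judge: judged_a_true_match}
import random

judgments = {
    "q17-w042": {"A": True, "B": True, "C": False, "D": True},
    "q23-w118": {"A": False, "B": False, "C": True, "D": False},
}

def consensus(votes):
    """Tie or majority vote counts as a true match (baseline)."""
    yes = sum(votes.values())
    return yes * 2 >= len(votes)

def union(votes):
    """Judged true by anyone."""
    return any(votes.values())

def intersection(votes):
    """Judged true by everyone."""
    return all(votes.values())

def single(votes, judge):
    """Judgments from a single adjudicator."""
    return votes[judge]

def random_judge(votes, rng):
    """Randomly choose one adjudicator per item."""
    return votes[rng.choice(sorted(votes))]

rng = random.Random(0)
truth_consensus = {p: consensus(v) for p, v in judgments.items()}
truth_random = {p: random_judge(v, rng) for p, v in judgments.items()}
```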

Slide5

Adjudicator Agreement Measures

Agreement between two adjudicators on item judgments (+ = match, - = non-match):

      +   -
  +   a   b
  -   c   d

overlap = a / (a + b + c)

p+ = 2a / (2a + b + c)

p- = 2d / (2d + b + c)
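A small sketch of these pairwise agreement measures, plus Cohen's kappa as reported on the next slide; the counts passed in are made up for illustration.

```python
def agreement_measures(a, b, c, d):
    """Agreement between two adjudicators from the 2x2 table above."""
    overlap = a / (a + b + c)          # overlap on positive judgments
    p_pos = 2 * a / (2 * a + b + c)    # positive specific agreement
    p_neg = 2 * d / (2 * d + b + c)    # negative specific agreement

    # Cohen's kappa: observed agreement corrected for chance agreement
    n = a + b + c + d
    p_obs = (a + d) / n
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return overlap, p_pos, p_neg, kappa

# Illustrative counts only, not the study's data
print(agreement_measures(a=120, b=25, c=30, d=1537))
```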

Slide6

Adjudicator Agreement

Lowest is A~B: kappa 0.57

Highest is C~D: kappa 0.78

Slide7

So far…

Test watchlist and query list

Results from 13 algorithms

Adjudications by 4 volunteers

Ways of compiling alternate ground truth sets

Still need…

Slide8

Comparing System Rankings

A complete ranking: B C D A E

A second complete ranking: C B E A D

How similar?

Kendall's tau

Spearman's rank correlation
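As an illustration of these rank-correlation measures, a sketch using scipy.stats; the two rankings are the example ones above, and the study's exact computation may differ.

```python
from scipy.stats import kendalltau, spearmanr

systems = ["A", "B", "C", "D", "E"]
ranking_1 = ["B", "C", "D", "A", "E"]   # best to worst under one ground truth
ranking_2 = ["C", "B", "E", "A", "D"]   # best to worst under another ground truth

# Convert each ranking to a rank position per system, then correlate.
ranks_1 = [ranking_1.index(s) for s in systems]
ranks_2 = [ranking_2.index(s) for s in systems]

tau, _ = kendalltau(ranks_1, ranks_2)
rho, _ = spearmanr(ranks_1, ranks_2)
print(f"Kendall's tau = {tau:.2f}, Spearman's rho = {rho:.2f}")
```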

Slide9

Significance Testing

Not all differences are significant (duh)

F1-measure: harmonic mean of precision & recall

Not a proportion or mean of independent observations

Not amenable to traditional significance tests

Like other IR measures, e.g. MAP

Bootstrap resampling:

Sample with replacement from data

Compute difference for many trials

Produces a distribution of differences (sketched below)
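A sketch of bootstrap resampling for the difference in F1 between two systems, assuming each system's per-query match decisions are available; the function names and data layout are hypothetical.

```python
import random

def f1(decisions):
    """decisions: list of (predicted_match, true_match) booleans."""
    tp = sum(p and t for p, t in decisions)
    fp = sum(p and not t for p, t in decisions)
    fn = sum(t and not p for p, t in decisions)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def bootstrap_diff(sys_a, sys_b, trials=10000, seed=0):
    """sys_a, sys_b: per-query decision lists, aligned on the same query set."""
    rng = random.Random(seed)
    n = len(sys_a)
    diffs = []
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]       # resample queries with replacement
        resampled_a = [d for i in idx for d in sys_a[i]]
        resampled_b = [d for i in idx for d in sys_b[i]]
        diffs.append(f1(resampled_a) - f1(resampled_b))
    diffs.sort()
    lo, hi = diffs[int(0.025 * trials)], diffs[int(0.975 * trials)]
    return lo, hi    # difference is significant at ~0.05 if this interval excludes 0
```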

Slide10

Incomplete Ranking

B C D A E

Not all differences significant → partial ordering

How similar?

B C D A E

Slide11

Evaluation Statements

B C D A E

A>B, A>C, A>D, A>E, B=C, B>D, B>E, C>D, C>E, D=E

B C D A E

A<B, A>C, A>D, A>E, B>C, B>D, B>E, C>D, C>E, D=E

Slide12

Similarity

A>B, A>C, A>D, A>E, B=C, B>D, B>E, C>D, C>E, D=E

A<B, A>C, A>D, A>E, B>C, B>D, B>E, C>D, C>E, D=E

n systems → n(n-1) / 2 evaluation statements

Sensitivity: proportion of relations with a significant difference (80% for the first set, 90% for the second; see the sketch below)

Reversal rate: proportion of reversed relations: 10%

Total disagreement: 20%
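A sketch of these three measures computed over two sets of pairwise evaluation statements; the '>' / '<' / '=' encoding and the example data reproduce the statements above, not the study's actual code.

```python
def sensitivity(statements):
    """Proportion of system pairs with a significant difference (not '=')."""
    return sum(rel != "=" for rel in statements.values()) / len(statements)

def reversal_rate(ref, alt):
    """Proportion of pairs where the two truth sets order the systems oppositely."""
    return sum({ref[p], alt[p]} == {">", "<"} for p in ref) / len(ref)

def disagreement(ref, alt):
    """Proportion of pairs where the statements differ at all."""
    return sum(ref[p] != alt[p] for p in ref) / len(ref)

baseline = {("A","B"): ">", ("A","C"): ">", ("A","D"): ">", ("A","E"): ">",
            ("B","C"): "=", ("B","D"): ">", ("B","E"): ">",
            ("C","D"): ">", ("C","E"): ">", ("D","E"): "="}
alternate = {("A","B"): "<", ("A","C"): ">", ("A","D"): ">", ("A","E"): ">",
             ("B","C"): ">", ("B","D"): ">", ("B","E"): ">",
             ("C","D"): ">", ("C","E"): ">", ("D","E"): "="}

print(sensitivity(baseline))                 # 0.8
print(sensitivity(alternate))                # 0.9
print(reversal_rate(baseline, alternate))    # 0.1
print(disagreement(baseline, alternate))     # 0.2
```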

Slide13

Comparisons With Baseline

Truth Set    | Sensitivity | Disagree | Reversal
Consensus    | 0.744       | n/a      | n/a
Union        | 0.782       | 0.064    | 0
Intersection | 0.538       | 0.423    | 0.038
Judge A      | 0.769       | 0.051    | 0
Judge B      | 0.705       | 0.038    | 0
Judge C      | 0.756       | 0.115    | 0
Judge D      | 0.692       | 0.179    | 0

No reversals except with intersection GT (one algorithm)

Highest and lowest agreement with consensus

Low!

Slide14

GT Comparisons

Slide15

Comparison With Random

1000 GT versions created by randomly selecting a judge per item (see the sketch below)

Consensus sensitivity = 74.4%

Average random sensitivity = 72.9% (sig diff at 0.05)

Average disagreement with consensus = 7.3%

5% disagreement expected (actually more)

2.3% remainder (actually less) attributable to GT method

No reversals in any of the 1000 sets
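A sketch of how the 1000 random-judge ground-truth versions could be generated; the judgment dictionary and pair IDs are hypothetical, and rescoring the 13 systems against each version is omitted.

```python
import random

def random_gt_versions(judgments, n_versions=1000, seed=0):
    """Yield ground-truth dicts built by picking one judge at random per pair."""
    rng = random.Random(seed)
    for _ in range(n_versions):
        yield {
            pair: votes[rng.choice(sorted(votes))]   # one randomly chosen judge per item
            for pair, votes in judgments.items()
        }

# Hypothetical adjudication pool: pair_id -> {judge: judged_a_true_match}
judgments = {"q17-w042": {"A": True, "B": True, "C": False, "D": True}}
for gt in random_gt_versions(judgments, n_versions=3):
    print(gt)
```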

Slide16

Conclusion

Multiple adjudicators judge everything → expensive

Single adjudicator → variability in sensitivity

Multiple adjudicators randomly divide the pool:

Slightly less sensitivity

No reversals of results

Much less labor

Differences wash out, approximating consensus

Practically the same result for less effort