Slide 1
Adjudicator Agreement and System Rankings for Person Name Search
Mark Arehart, Chris Wolf, Keith Miller
The MITRE Corporation
{marehart, cwolf, keith}@mitre.org

Slide 2
Summary
- Matching multicultural name variants is knowledge intensive
- A ground truth dataset requires tedious adjudication
- Guidelines are not comprehensive; adjudicators often disagree
- Previous evaluations: multiple adjudication with voting
- Results of this study: high agreement; multiple adjudication not needed
- "Nearly" the same payoff for much less effort

Slide 3
Dataset
- Watchlist: ~71K records from deceased-persons lists, mixed cultures
- 1.1K variants for 404 base names (avg. 2.8 variants per base record)
- Queries: 700 total (404 base names plus 296 randomly selected from the watchlist)
- Subset of 100 randomly selected for this study

Slide 4
Method
- Adjudication pools as in TREC: pooled from 13 algorithms
- Four judges completed the pools (1712 pairs, excluding exact matches)
- Compare system rankings under different versions of ground truth

Type | Name         | Criteria for true match
1    | Consensus    | Tie or majority vote (baseline)
2    | Union        | Judged true by anyone
3    | Intersection | Judged true by everyone
4    | Single       | Judgments from a single adjudicator (4 versions)
5    | Random       | Randomly chosen adjudicator per item (1000 versions)

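The five ground-truth variants in the table can be derived mechanically from raw per-pair votes. A minimal sketch, assuming a hypothetical `judgments` mapping of name pairs to four boolean votes (the name pairs are illustrative, not from the dataset):

```python
import random

# Hypothetical per-pair votes from four adjudicators (True = match).
judgments = {
    ("Abd al-Rahman", "Abdurrahman"): [True, True, True, False],
    ("Yusuf", "Joseph"): [True, False, False, False],
}

def consensus(votes):
    # Baseline: true on a tie or majority of true votes.
    return 2 * sum(votes) >= len(votes)

def union(votes):
    # Judged true by anyone.
    return any(votes)

def intersection(votes):
    # Judged true by everyone.
    return all(votes)

def random_judge(votes, rng=random):
    # Randomly pick one adjudicator's judgment for this item.
    return rng.choice(votes)

for pair, votes in judgments.items():
    print(pair, consensus(votes), union(votes), intersection(votes))
```

The "Single" variants are just the per-judge columns; "Random" repeats `random_judge` over the whole pool to generate each of the 1000 versions.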
Slide 5
Adjudicator Agreement Measures

2x2 agreement table for a pair of judges (a = both judge true,
b and c = exactly one judges true, d = both judge false):

           | Judge 2: + | Judge 2: -
Judge 1: + |     a      |     b
Judge 1: - |     c      |     d

overlap = a / (a + b + c)
p+ = 2a / (2a + b + c)
p- = 2d / (2d + b + c)

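These measures, plus Cohen's kappa (reported on the next slide), follow directly from the four cell counts. A sketch with illustrative counts:

```python
def agreement_measures(a, b, c, d):
    """Agreement between two judges from a 2x2 table:
    a = both say match, d = both say non-match,
    b and c = the judges disagree."""
    overlap = a / (a + b + c)
    p_pos = 2 * a / (2 * a + b + c)  # positive specific agreement
    p_neg = 2 * d / (2 * d + b + c)  # negative specific agreement
    # Cohen's kappa: observed agreement corrected for chance agreement.
    n = a + b + c + d
    p_obs = (a + d) / n
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return overlap, p_pos, p_neg, kappa

overlap, p_pos, p_neg, kappa = agreement_measures(40, 10, 10, 40)
```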
Slide 6
Adjudicator Agreement
- Lowest pairwise agreement is A~B: kappa 0.57
- Highest is C~D: kappa 0.78

Slide 7
So far…
- Test watchlist and query list
- Results from 13 algorithms
- Adjudications by 4 volunteers
- Ways of compiling alternate ground truth sets
Still need…
- A way to compare the resulting system rankings

Slide 8
Comparing System Rankings
Two complete rankings (best first):
  B > C > D > A > E
  C > B > E > A > D
How similar are they?
- Kendall's tau
- Spearman's rank correlation

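Kendall's tau counts concordant versus discordant pairs across the two rankings. A self-contained sketch for tie-free complete rankings, applied to the two orderings above:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two complete (tie-free) rankings,
    each given as an ordered list of system names, best first."""
    pos_a = {s: i for i, s in enumerate(rank_a)}
    pos_b = {s: i for i, s in enumerate(rank_b)}
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # Same relative order in both rankings -> concordant pair.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) / 2
    return (concordant - discordant) / n_pairs

tau = kendall_tau(list("BCDAE"), list("CBEAD"))
```

Tau ranges from 1 (identical order) to -1 (exactly reversed).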
Slide 9
Significance Testing
- Not all differences are significant (duh)
- F1-measure: harmonic mean of precision and recall
  - Not a proportion or a mean of independent observations
  - Not amenable to traditional significance tests
  - Like other IR measures, e.g. MAP
- Bootstrap resampling:
  - Sample with replacement from the data
  - Compute the difference for many trials
  - Produces a distribution of differences

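The bootstrap steps above can be sketched as follows, assuming per-query true-positive / false-positive / false-negative counts for the two systems being compared (the data layout is illustrative, not the paper's):

```python
import random

def f1(tp, fp, fn):
    # Harmonic mean of precision and recall, from raw counts.
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def bootstrap_f1_diffs(per_query, trials=1000, seed=0):
    """Bootstrap distribution of the F1 difference between two systems.
    per_query: list of (tp1, fp1, fn1, tp2, fp2, fn2) tuples, one per query."""
    rng = random.Random(seed)
    n = len(per_query)
    diffs = []
    for _ in range(trials):
        # Resample queries with replacement, then recompute both F1 scores.
        sample = [rng.choice(per_query) for _ in range(n)]
        totals = [sum(col) for col in zip(*sample)]
        diffs.append(f1(*totals[:3]) - f1(*totals[3:]))
    return diffs
```

A difference is then called significant when nearly all of the resampled diffs share one sign (e.g. 95% of trials).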
Slide 10
Incomplete Ranking
Ranking: B > C > D > A > E
Not all differences are significant, so the result is a partial ordering.
How similar are two partial orderings?

Slide 11
Evaluation Statements
Two sets of pairwise statements for systems A through E:

Set 1: A>B  A>C  A>D  A>E  B=C  B>D  B>E  C>D  C>E  D=E
Set 2: A<B  A>C  A>D  A>E  B>C  B>D  B>E  C>D  C>E  D=E

Slide 12
Similarity
For n systems there are n(n-1)/2 evaluation statements.

Set 1: A>B  A>C  A>D  A>E  B=C  B>D  B>E  C>D  C>E  D=E   (sensitivity = 80%)
Set 2: A<B  A>C  A>D  A>E  B>C  B>D  B>E  C>D  C>E  D=E   (sensitivity = 90%)

- Sensitivity: proportion of relations with a significant difference
- Reversal rate: proportion of reversed relations (here A>B vs. A<B: 10%)
- Total disagreement between the sets: 20%

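Under these definitions, sensitivity, disagreement, and reversal rate fall out of a direct comparison of the two statement sets. A sketch using the statements from the slide, with each set encoded as a mapping from system pair to '>', '<', or '=':

```python
def compare_statements(baseline, alternate):
    """Compare pairwise evaluation statements from two ground-truth versions.
    Each argument maps a system pair to '>', '<', or '='."""
    total = len(baseline)
    # Sensitivity: fraction of the alternate's relations that are significant.
    sensitivity = sum(rel != "=" for rel in alternate.values()) / total
    # Disagreement: any relation that differs from the baseline's.
    disagreement = sum(alternate[p] != baseline[p] for p in baseline) / total
    # Reversal: a significant relation flipped in direction.
    reversals = sum({alternate[p], baseline[p]} == {">", "<"}
                    for p in baseline) / total
    return sensitivity, disagreement, reversals

base = {("A", "B"): ">", ("A", "C"): ">", ("A", "D"): ">", ("A", "E"): ">",
        ("B", "C"): "=", ("B", "D"): ">", ("B", "E"): ">",
        ("C", "D"): ">", ("C", "E"): ">", ("D", "E"): "="}
alt = {("A", "B"): "<", ("A", "C"): ">", ("A", "D"): ">", ("A", "E"): ">",
       ("B", "C"): ">", ("B", "D"): ">", ("B", "E"): ">",
       ("C", "D"): ">", ("C", "E"): ">", ("D", "E"): "="}

sens, dis, rev = compare_statements(base, alt)  # 0.9, 0.2, 0.1
```

For these two sets this reproduces the slide's numbers: 90% sensitivity, 10% reversals, 20% total disagreement.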
Slide 13
Comparisons With Baseline

Truth Set    | Sensitivity | Disagree | Reversal
Consensus    | 0.744       | n/a      | n/a
Union        | 0.782       | 0.064    | 0
Intersection | 0.538       | 0.423    | 0.038
Judge A      | 0.769       | 0.051    | 0
Judge B      | 0.705       | 0.038    | 0
Judge C      | 0.756       | 0.115    | 0
Judge D      | 0.692       | 0.179    | 0

- No reversals except with the intersection GT (one algorithm)
- Single judges span the highest and lowest agreement with consensus
- Intersection sensitivity (0.538) is notably low

Slide 14
GT Comparisons
[Chart comparing the ground truth versions]

Slide 15
Comparison With Random
- 1000 GT versions created by randomly selecting a judge per item
- Consensus sensitivity = 74.4%
- Average random sensitivity = 72.9% (significant difference at 0.05)
- Average disagreement with consensus = 7.3%
  - 5% disagreement expected by chance (actually more)
  - Remaining 2.3% (actually less) attributable to the GT method
- No reversals in any of the 1000 sets

Slide 16
Conclusion
- Having multiple adjudicators judge everything is expensive
- A single adjudicator gives variable sensitivity
- Multiple adjudicators randomly dividing the pool:
  - Slightly less sensitivity
  - No reversals of results
  - Much less labor
- Differences wash out, approximating consensus
- Practically the same result for much less effort