Slide1
Enabling Search for Facts and Implied Facts in Historical Documents
David W. Embley, Stephen W. Liddle, Deryle W. Lonsdale, Spencer Machado, Thomas Packer, Joseph Park, Nathan Tate, Andrew Zitzelberger
Brigham Young University
BYU Data Extraction Research Group
Slide2
WoK-HD
(A Web of Knowledge Superimposed over Historical Documents)
…
…
…
…
Slide3
WoK-HD
(A Web of Knowledge Superimposed over Historical Documents)
…
…
grandchildren of Mary Ely
…
…
Slide4
WoK-HD
(A Web of Knowledge Superimposed over Historical Documents)
…
…
…
…
grandchildren of Mary Ely
Slide5
WoK-HD
(A Web of Knowledge Superimposed over Historical Documents)
…
…
grandchildren of Mary Ely
…
…
Slide6
grandchildren of Mary Ely
WoK-HD
(A Web of Knowledge Superimposed over Historical Documents)
…
…
…
…
Slide7
WoK-HD Input
Slide8
Querying for Facts & Implied Facts
Slide9
Querying for Facts & Implied Facts
Animation of:
  Extraction query, results, highlighting
  Reasoned query, results, reasoning chain, highlighting
Slide10
Extraction Ontologies
Slide11
Extraction Ontologies
Slide12
Fact Extraction
Slide13
Fact Extraction
Slide14
Fact Extraction
Slide15
Reasoning for Implied Facts
Slide16
Reasoning for Implied Facts
Slide17
Reasoning for Implied Facts
Slide18
Reasoning for Implied Facts
Slide19
Query Interpretation
“Mary Ely” grandchild
Slide20
Query Interpretation
“Mary Ely” grandchild
Slide21
Query Interpretation
“Mary Ely” grandchild
Slide22
Generated SPARQL Query
Slide23
Generated SPARQL Query
Slide24
Query Results
Slide25
Results of Processing the Ely Ancestry (all 830 Pages)
Number of facts extracted: 22,251
  8,740 Person-Birthdate facts
  3,803 Person-Deathdate facts
  9,708 children facts, including
    5,020 Child-has-parent-Person facts
    2,394 Son-of-Person facts
    2,294 Daughter-of-Person facts
Number of implied grandchild facts inferred: 5,277
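The slides do not show the reasoning rules themselves, so the following is only a minimal sketch of how implied grandchild facts could be derived from extracted child-parent facts: a grandchild of X is a child of a child of X. The function name `infer_grandchildren`, the (child, parent) tuple layout, and the sample people other than the queried name "Mary Ely" are hypothetical and are not taken from the Ely Ancestry or from the WoK-HD implementation. Son-of-Person and Daughter-of-Person facts are assumed to have been normalized to the same (child, parent) form beforehand.

```python
from collections import defaultdict


def infer_grandchildren(child_parent_facts):
    """Join (child, parent) facts with themselves to get (grandchild, grandparent) pairs."""
    # Index the stated facts: child -> set of that child's parents.
    parents_of = defaultdict(set)
    for child, parent in child_parent_facts:
        parents_of[child].add(parent)

    implied = set()
    for child, parents in parents_of.items():
        for parent in parents:
            # .get avoids adding keys to the defaultdict while iterating over it.
            for grandparent in parents_of.get(parent, ()):
                implied.add((child, grandparent))
    return implied


# Made-up facts for illustration only (not real extracted data).
facts = [
    ("Child A", "Parent B"),
    ("Parent B", "Mary Ely"),
    ("Child C", "Parent B"),
    ("Parent D", "Mary Ely"),
]

# Answering a query such as: grandchildren of Mary Ely.
grandchildren_of_mary = sorted(
    grandchild
    for grandchild, grandparent in infer_grandchildren(facts)
    if grandparent == "Mary Ely"
)
print(grandchildren_of_mary)
```

Because the result is a set of pairs, a grandchild reachable through more than one stated fact is counted once, which is one plausible reason an inferred-fact count need not match a simple product of the extracted counts.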
Processing time:
  ~18 seconds per page
  CPU time: ~4 hours
  Processing 10 in parallel: ~24 minutes
Slide26
Results of Processing the Ely Ancestry (all 830 Pages)
Precision: .52 (by randomly selecting & checking 100 of the 22,251 facts)
Recall: .33 & Precision: .40 (by randomly selecting and checking 2 fact-filled family pages)
Errors:
  Name recognizer
  Text pattern expectations
  OCR
Varying accuracy (for pages checked)
  Recall: .11, Precision: .11 (bad combination of all problems)
  Recall: .50, Precision: .68 (some problems, but closer to expectations)
  Recall: .59, Precision: .71 (10 pages, mostly as expected)
  Recall: .91, Precision: .94 (tuned, no problems except a few OCR errors)
Slide27
Current and Future Work
Implementation Status:
  Full line works (but is fragile & needs finishing touches)
  HyKSS integrated (but not all features)
Scalability:
  Handcrafted extraction ontologies & reasoning rules (worth the work for certain applications)
  ListReader (plus bootstrapping for lists and general extraction)
  Optimization (especially for query processing)
Integration:
  Mapping extraction ontologies to domain ontologies
  Object identity for people and places
Slide28
Summary and Conclusion
WoK-HD
  Superimposes a web of knowledge over a collection of historical documents
  Works as a proof-of-concept prototype
To build and deploy the WoK-HD successfully:
  Efficient implementation
  Better, more cost-effective extraction
  Integration and record linkage
www.deg.byu.edu