Slide1
DAISY
Dutch lAnguage Investigation of Summarization technologY
Katholieke Universiteit Leuven
Rijksuniversiteit Groningen
Q-go
Slide2
DAISY on one slide
Segmentation
Rhetorical classification
Sentence compression
Sentence generation
Multi-document summarization: detect differences
Improving question answering, e.g. e-mail answering
Summarization of web content
Slide3
Overview
Report of our current progress in:
Corpus building and preprocessing
Segmentation
Sentence generation
Slide4
Corpus Building and Preprocessing
Target: corpus of questions, short texts and web pages about the same topic
Freely available: UWV (questions and answer texts), SVB (questions)
Available for internal use: KLM (questions, answer texts, web pages)
To do: web pages SVB
ABN AMRO (committed, not delivered)
Slide5
Corpus Building and Preprocessing
POS-tagged and parsed: KLM and UWV
SVB corpus: in progress
Coreference resolution: in progress
Slide6
Segmentation
Find main content in webpage
Smaller segments
Can be obtained from HTML structure
<H#>, <P>, <BR>, <UL>, ...
Hierarchical
Will be refined in relation to rhetorical roles
Slide7
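The idea of obtaining hierarchical segments from the HTML structure could be sketched as follows. This is a minimal illustration, not the DAISY implementation: it assumes that headings (<H1> to <H6>) open nested sections and that <P>, <BR>, <UL>, <OL>, <LI> and <DIV> delimit leaf segments. The names Segmenter, segment and the node layout are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: heading tags define the hierarchy, the tags below split text.
HEADINGS = {f"h{i}": i for i in range(1, 7)}
BREAKS = {"p", "br", "ul", "ol", "li", "div"}


def new_node(level, title):
    return {"level": level, "title": title, "segments": [], "children": []}


class Segmenter(HTMLParser):
    """Collect text segments and nest them under the nearest heading."""

    def __init__(self):
        super().__init__()
        self.root = new_node(0, "")
        self.stack = [self.root]      # path from the root to the open section
        self.buffer = []              # text seen since the last boundary
        self.heading_level = None     # set while inside a heading tag

    def _take_text(self):
        text = " ".join("".join(self.buffer).split())
        self.buffer = []
        return text

    def _flush(self):
        text = self._take_text()
        if text:
            self.stack[-1]["segments"].append(text)

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._flush()
            self.heading_level = HEADINGS[tag]
        elif tag in BREAKS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in HEADINGS and self.heading_level == HEADINGS[tag]:
            level, self.heading_level = self.heading_level, None
            node = new_node(level, self._take_text())
            # Close sections at the same or a deeper level.
            while self.stack[-1]["level"] >= level:
                self.stack.pop()
            self.stack[-1]["children"].append(node)
            self.stack.append(node)
        elif tag in BREAKS:
            self._flush()

    def handle_data(self, data):
        self.buffer.append(data)


def segment(html):
    """Return a tree of sections, each with its own list of text segments."""
    parser = Segmenter()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.root
```

Each node in the returned tree holds its heading text, the leaf segments directly under it, and its subsections, so a later step could attach rhetorical roles to individual nodes.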
Segmentation
Slide8
Segmentation
Slide9
Segmentation
Search for block with highest density of text
Slide10
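The search for the block with the highest text density might look like the sketch below. The slides do not define "density", so the measure used here is an assumption: characters of visible text divided by one plus the number of tags inside the block. The candidate tags (DIV, TD, ARTICLE, SECTION) and the names DensityScanner, density and densest_block are likewise hypothetical.

```python
from html.parser import HTMLParser


class DensityScanner(HTMLParser):
    """Record text and tag counts for every candidate block."""

    # Assumption: only container elements are candidates for main content.
    CANDIDATES = {"div", "td", "article", "section"}

    def __init__(self):
        super().__init__()
        self.open_blocks = []    # candidate blocks not yet closed
        self.closed_blocks = []  # candidate blocks whose end tag was seen

    def handle_starttag(self, tag, attrs):
        # Every tag counts as markup inside all currently open blocks.
        for block in self.open_blocks:
            block["tags"] += 1
        if tag in self.CANDIDATES:
            self.open_blocks.append(
                {"tag": tag, "attrs": dict(attrs), "text": [], "tags": 0}
            )

    def handle_data(self, data):
        words = " ".join(data.split())
        if words:
            for block in self.open_blocks:
                block["text"].append(words)

    def handle_endtag(self, tag):
        # Close the nearest open block with this tag, plus anything nested
        # inside it that was left unclosed.
        for i in range(len(self.open_blocks) - 1, -1, -1):
            if self.open_blocks[i]["tag"] == tag:
                self.closed_blocks.extend(self.open_blocks[i:])
                del self.open_blocks[i:]
                break


def density(block):
    """Assumed measure: text characters per tag inside the block."""
    return len(" ".join(block["text"])) / (1 + block["tags"])


def densest_block(html):
    """Return the candidate block with the highest density, or None."""
    scanner = DensityScanner()
    scanner.feed(html)
    scanner.close()
    blocks = scanner.closed_blocks + scanner.open_blocks
    return max(blocks, key=density) if blocks else None
```

On a page with a navigation list, a body of running text and a footer, this measure favours the body, because menus and footers carry many tags for little text. Other definitions of density (for example text length relative to total markup length) would fit the same structure.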
Segmentation
Slide11
Segmentation
Additional heuristics to extend the selection:
Find closing tags for all tags that were opened in the selection
Include all text delimited by known tag patterns occurring just before and after the selection
Take the smallest enclosing DIV block
Slide12
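The last heuristic above, taking the smallest enclosing DIV block, can be sketched with character offsets. This is an illustration under assumptions: the selection is given as a (start, end) offset pair into the HTML string, DIV tags are matched with a simple regular expression rather than a full parser, and the names div_spans and smallest_enclosing_div are hypothetical.

```python
import re

# Matches opening and closing DIV tags; other markup is ignored.
DIV_TOKEN = re.compile(r"<div\b[^>]*>|</div\s*>", re.IGNORECASE)


def div_spans(html):
    """Return (start, end) character offsets of every balanced DIV block."""
    spans, stack = [], []
    for match in DIV_TOKEN.finditer(html):
        if match.group().startswith("</"):
            if stack:  # ignore a stray closing tag
                spans.append((stack.pop(), match.end()))
        else:
            stack.append(match.start())
    return spans


def smallest_enclosing_div(html, start, end):
    """Extend the selection [start, end) to the smallest DIV containing it.

    Returns the original offsets when no DIV encloses the selection.
    """
    enclosing = [(s, e) for s, e in div_spans(html) if s <= start and end <= e]
    if not enclosing:
        return start, end
    return min(enclosing, key=lambda span: span[1] - span[0])
```

When the markup is well nested, extending to the enclosing DIV also pulls in the closing tags of everything opened inside the selection, so it partly covers the first heuristic as well.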
Sentence generation
Specification of abstract dependency trees
Specify grammatical relations between lexical items and the constituents that dominate them
Alpino dependency trees without adjacency information
More variation through underspecification in lexical items, handling of particles
Slide13
Sentence generation
Initial implementation of the generator:
Chart generator (Kay, 1996)
Top-down guidance through expected dependency relations
Generates a substantial part of the input created from the Alpino test suites
Included in recent Alpino versions
Further work: optimization (time and space)
Slide14
Sentence generation
Selecting the most fluent sentence through fluency ranking:
N-gram language model
Log-linear model
Experiments with Velldal (2007) and parse disambiguation feature templates
Need more insight into feature overlap
Experiment with more feature templates
Slide15
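The n-gram part of fluency ranking can be illustrated with a toy bigram model. This is a sketch under assumptions, not the ranker used in the project: the model order (bigrams), the add-one smoothing, the English toy corpus and the names train_bigram, log_prob and most_fluent are all chosen for illustration.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lower-case, split on whitespace, add sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]


def train_bigram(sentences):
    """Count bigrams and their left contexts in a list of sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {token for pair in bigrams for token in pair}
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log probability of a sentence under the add-one smoothed model."""
    contexts, bigrams, vocab_size = model
    tokens = tokenize(sentence)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


def most_fluent(candidates, model):
    """Pick the candidate realization with the highest model score."""
    return max(candidates, key=lambda s: log_prob(s, model))


# Toy data: the candidates are word-order variants of the same lexical items.
corpus = [
    "the passenger checked in the bag",
    "the bag was checked in",
    "the passenger lost the bag",
]
candidates = [
    "the passenger checked in the bag",
    "checked the passenger in the bag",
    "the bag the passenger in checked",
]
model = train_bigram(corpus)
best = most_fluent(candidates, model)
```

In a log-linear ranker, a language model score of this kind would typically be one feature among several, combined with learned weights; the feature templates mentioned above are not reproduced here.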
Sentence generation
Evaluation:
Corpus sentences used as a reference for the most fluent realization
Fairly strict, since there can be multiple fluent sentences
Where is the ceiling?
More annotated material!
FLAN: FLuency ANnotator (web application)
Slide16
Thanks!