Slide 1: Query Chain Focused Summarization
Tal Baumel, Rafi Cohen, Michael Elhadad
Jan 2014
Slide 2: Generic Summarization
Generic Extractive Multi-doc Summarization:
Given a set of documents Di, identify a set of sentences Sj s.t.:
|Sj| < L
The "central information" in Di is captured by Sj
Sj does not contain redundant information
Representative methods: KLSum, LexRank
Key concepts: centrality, redundancy
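As a concrete illustration of selecting for centrality while controlling redundancy, here is a minimal KLSum-style sketch. The whitespace tokenization, word-count budget, and smoothing constant are my own simplifications, not details from the slides:

```python
import math
from collections import Counter

def distribution(words):
    # unigram distribution over a bag of words
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_divergence(p, q, vocab, eps=1e-9):
    # KL(P || Q), smoothing missing words with a tiny epsilon
    return sum(p.get(w, eps) * math.log(p.get(w, eps) / q.get(w, eps))
               for w in vocab)

def klsum(sentences, length_limit):
    # Greedily add the sentence that keeps the summary's unigram
    # distribution closest (in KL divergence) to the corpus distribution.
    corpus_words = [w for s in sentences for w in s.split()]
    p_doc = distribution(corpus_words)
    vocab = set(corpus_words)
    summary, summary_words = [], []
    while True:
        best, best_kl = None, float("inf")
        used = len(summary_words)
        for s in sentences:
            toks = s.split()
            if s in summary or used + len(toks) > length_limit:
                continue
            kl = kl_divergence(p_doc, distribution(summary_words + toks), vocab)
            if kl < best_kl:
                best, best_kl = s, kl
        if best is None:
            return summary
        summary.append(best)
        summary_words.extend(best.split())
```

Because the objective compares the summary's distribution to the corpus distribution, a sentence repeating words already in the summary rarely helps, which is how redundancy is discouraged without an explicit penalty term.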
Slide 3: Update Summarization
Given a set of documents split as A = {ai} / B = {bj}, defined as background / new sets
Select a set of sentences Sk s.t.:
|Sk| < L
Sk captures central information in B
Sk does not repeat information conveyed by A
Key concepts: centrality, redundancy, novelty
Slide 4: Query-Focused Summarization
Given a set of documents Di and a query Q
Select a set of sentences Sj s.t.:
|Sj| < L
Sj captures information in Di relevant to Q
Sj does not contain redundant information
Key concepts: relevance, redundancy
Slide 5: Query-Chain Focused Summarization
We define a new task to clarify the relations among key concepts:
Relevance
Novelty
Contrast
Similarity
Redundancy
The task is also useful for Exploratory Search
Slide 6: QCFS Task
Given a set of topic-related documents Di and a chain of queries qj
Output a chain of summaries {Sjk} s.t.:
|Sjk| < L
Sjk is relevant to qj
Sjk does not contain information in Slk for l < j
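The task definition above can be sketched as a simple loop. This toy version uses Jaccard word overlap as a stand-in for both the relevance and redundancy models; the overlap threshold and sentence budget are illustrative choices, not part of the task definition:

```python
def overlap(a, b):
    # Jaccard word overlap; a crude stand-in for a real relevance model
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def qcfs_chain(sentences, queries, max_sentences, novelty_threshold=0.5):
    # For each query q_j, build S_j from sentences relevant to q_j that do
    # not repeat material already shown for earlier queries in the chain.
    shown, summaries = [], []
    for q in queries:
        ranked = sorted(sentences, key=lambda s: overlap(s, q), reverse=True)
        sj = []
        for s in ranked:
            if len(sj) >= max_sentences:
                break
            if overlap(s, q) == 0:
                continue  # not relevant to q_j
            if any(overlap(s, t) >= novelty_threshold for t in shown + sj):
                continue  # redundant w.r.t. S_1..S_{j-1}
            sj.append(s)
        shown.extend(sj)
        summaries.append(sj)
    return summaries
```

The key point the sketch captures: unlike plain query-focused summarization, each summary is conditioned on everything already shown earlier in the chain.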
Slide 7: Query Chains
Query Chains are observed in query logs:
PubMed search log mining
Extract query chains (length 3) of the same session / with related terms (manually)
Query Chain evolution may correspond to:
Zoom in (asthma → atopic dermatitis)
Query reformulation (respiratory problem → pneumonia)
Focus change (asthma → cancer)
Slide 8: Query Chains vs. Novelty Detection
TREC Novelty Detection Task (2005)
Task 1: Given a set of documents for the topic, identify all relevant and novel sentences.
Task 2: Given the relevant sentences in all documents, identify all novel sentences.
Task 3: Given the relevant and novel sentences in the first 5 docs only, find the relevant and novel sentences in the remaining docs.
Task 4: Given the relevant sentences from all documents and the novel sentences from the first 5 docs, find the novel sentences in the remaining docs.
Slide 9: Novelty Detection Task
Create 50 topics:
Compose topic (textual description)
Select 25 relevant docs from a news collection
Sort docs chronologically
Mark relevant sentences
Among relevant sentences, mark novel ones (not covered in previous relevant sentences)
28 "event" topics / 22 "opinion" topics
Slide 10: TREC Novelty – Dataset Analysis
Annotators selected parts of documents (not full docs).
Figures below are events / opinions:
Relevant rate: 25% / 15%
Consecutive relevant sentences: 85% / 65%
Relevance agreement: 68% / 50%
Novelty rate: 38% / 42%
Novelty agreement: 45% / 29%
Slide 11: TREC Novelty Methods
Relevance = similarity to the topic.
Novelty = dissimilarity to past sentences.
Methods:
Tf.idf and Okapi with a threshold for retrieval
Topic expansion
Sentence expansion
Named entities as features
Coreference resolution
Named-entity normalization (entity linking)
Results:
High recall / low precision
Almost no distinction between relevant and novel
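The tf.idf-with-threshold approach can be sketched as below. The smoothed idf and both threshold values are illustrative assumptions, not parameters reported by any TREC system:

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    # sentence-level tf.idf with a smoothed idf (1 + ln(N/df)) to avoid zeros
    tokenized = [t.lower().split() for t in texts]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(texts)
    return [{w: tf * (1 + math.log(n / df[w]))
             for w, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def relevant_and_novel(sentences, topic, rel_threshold=0.1, nov_threshold=0.8):
    # Relevance: cosine similarity to the topic above a threshold.
    # Novelty: low maximum similarity to previously accepted relevant sentences.
    vecs = tfidf_vectors(sentences + [topic])
    topic_vec, accepted = vecs[-1], []
    relevant, novel = [], []
    for s, v in zip(sentences, vecs[:-1]):
        if cosine(v, topic_vec) < rel_threshold:
            continue
        relevant.append(s)
        if all(cosine(v, a) < nov_threshold for a in accepted):
            novel.append(s)
        accepted.append(v)
    return relevant, novel
```

Note how both decisions reduce to thresholded similarity, which helps explain the result above: such systems struggle to separate "relevant" from "novel".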
Slide 12: QCFS and Contrast
QCFS is different from Query Focus: when generating S2, we must take S1 into account.
QCFS is different from Update: the A/B split is not given in advance.
QCFS is different from Novelty Detection: chronology is not relevant.
Key concepts:
Query relevance
Query distinctiveness (how qi+1 contrasts with qi)
Slide 13: Contrastive IR
CWS: A Comparative Web Search System (Sun et al., WWW 2006)
Given two queries q1 and q2, rank a set of "contrastive pairs" (p1, p2), where p1 and p2 are snippets of relevant docs.
Method:
Retrieve relevant snippets SR1 = {p1i} and SR2 = {p2j}
Score(p1, p2) = a·R(p1, q1) + b·R(p2, q2) + c·T(p1, p2, q1, q2)
T(p1, p2, q1, q2) = x·Sim(url1, url2) + (1−x)·Sim(p1\q1, p2\q2)
Greedy ranking of pairs:
Rank all pairs (p1, p2) by score – take the top pair
Remove p1top and p2top from all pairs – iterate
Cluster pairs into comparative clusters
Extract terms from comparative clusters
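The greedy ranking step can be sketched as follows; the scores are assumed to be precomputed by the weighted combination above, and the triple-based input format is my own choice:

```python
def greedy_pair_ranking(scored_pairs):
    # scored_pairs: list of (p1, p2, score) triples, score already computed
    # by the weighted combination of relevance and comparability.
    remaining = sorted(scored_pairs, key=lambda t: t[2], reverse=True)
    ranked = []
    while remaining:
        p1, p2, score = remaining[0]
        ranked.append((p1, p2, score))
        # drop every remaining pair that reuses either chosen snippet
        remaining = [(a, b, s) for a, b, s in remaining[1:]
                     if a != p1 and b != p2]
    return ranked
```

Removing both snippets of the chosen pair at each step guarantees every snippet appears at most once in the final ranking, which keeps the contrastive pairs diverse.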
Slide 14: Document Clustering
A Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results (Kummamuru et al., WWW 2004)
Desirable properties of clustering:
Coverage
Compactness
Sibling distinctiveness
Reach time
Incremental algorithm:
Decide on width n of the tree (# children per node)
Nodes are represented by "concepts" (terms)
Rank concepts by score and add them under the current node
Score(Sak, cj) = a·ScoreC(Sak−1, cj) + b·ScoreD(Sak−1, cj)
ScoreC = document coverage
ScoreD = sibling distinctiveness
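A rough sketch of the incremental child-selection loop; the concrete coverage and distinctiveness formulas here are simplified stand-ins for ScoreC and ScoreD, not the paper's definitions:

```python
def choose_children(concepts, docs, n, a=1.0, b=1.0):
    # docs: list of sets of terms under the current node.
    # At each step, score every remaining concept by
    # a * coverage + b * distinctiveness w.r.t. siblings chosen so far.
    chosen, remaining = [], list(concepts)
    for _ in range(min(n, len(remaining))):
        def score(c):
            cov = sum(1 for d in docs if c in d) / len(docs)       # ~ScoreC
            covered = [d for d in docs if any(s in d for s in chosen)]
            dup = sum(1 for d in covered if c in d)
            dist = 1.0 - (dup / len(covered) if covered else 0.0)  # ~ScoreD
            return a * cov + b * dist
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Because each concept is scored against the siblings chosen so far (the Sak−1 in the formula), the loop favors concepts that cover documents not yet reachable through an existing child, matching the coverage and sibling-distinctiveness goals above.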