Slide1
Information Retrieval with Time Series Query
Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai
University of Illinois at Urbana-Champaign
Malu Castellanos, Meichun Hsu
HP Laboratories
Slide2
IR for stock market analysis?
[Figure: Dow Jones Industrial Average over time. Source: Yahoo Finance]
What might have caused the stock market crash?
Sept 11 Attack!
Any clues in the companion news stream?
What documents to read to analyze such a “causal” topic?
Slide3
Analysis of Presidential Prediction Markets
[Figure: prediction-market price of a candidate, plotted over time]
What might have caused the sudden drop of price for this candidate?
Tax cut?
What “mattered” in this election?
Any clues in the companion news stream?
What documents to read to analyze such a “causal” topic?
Slide4
Analysis of Product Sales
[Figure: product sales over time]
What might have caused the decrease of sales?
Safety concerns
Any clues in the companion product reviews?
What reviews to read to analyze such a “causal” topic?
Slide5
Finding documents about “trendy” topics
[Figure: a query curve drawn over time]
Draw a “time series query”: find documents about a topic that emerged this summer and attracted much attention this October
Which documents cover such a “trendy” topic?
Slide6
Information Retrieval with Time Series Query
Instead of a keyword query, use a time series as the query
Retrieve documents that contain topics correlated with the query time series
Input:
Time series data with time stamps
Text stream: a collection of time-stamped documents from the same time period
Output:
Ranked list of documents
Slide7
Ideal Results of Information Retrieval with Time Series Query
[Figure: query time series, 2000 to 2001 …, with the companion news stream]
RANK  DATE        EXCERPT
1     9/29/2000   Expect earning will be far below
2     12/8/2000   $4 billion cash in company
3     10/19/2000  Disappointing earning report
4     4/19/2001   Dow and Nasdaq soar after rate cut by Federal Reserve
5     7/20/2001   Apple's new retail store
…     …           …
Slide8
IR w/ TS - Method Overview
[Diagram: the two inputs are a text stream of input documents (Sep 2001, Oct 2001, …) and a non-text time series. The text stream yields a vocabulary with word frequency curves (W1, W2, W3, W4, …). Rank by correlation. Output: ranked documents.]
Slide9
IR w/ TS - Method Overview
[Diagram: the method overview from the previous slide, repeated: text stream and non-text time series as inputs, vocabulary with word frequency curves (W1, W2, W3, W4, …), rank by correlation, ranked documents as output]
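The first stage of the diagram, turning the text stream into one frequency curve per word, could be sketched roughly as follows. This is only an illustration: the slides do not say how documents are tokenized or binned, so the daily binning, the whitespace tokenizer, and the function name `word_frequency_curves` are all assumptions.

```python
from collections import Counter, defaultdict
from datetime import date

def word_frequency_curves(docs, dates):
    """Build one frequency curve per word, aligned to `dates`.

    docs  : list of (date, text) pairs (hypothetical input format)
    dates : ordered list of dates covered by the query time series
    Returns {word: [count on dates[0], count on dates[1], ...]}.
    """
    index = {d: i for i, d in enumerate(dates)}
    curves = defaultdict(lambda: [0] * len(dates))
    for day, text in docs:
        if day not in index:
            continue  # document falls outside the query period
        # Assumed tokenizer: lowercase and split on whitespace.
        for word, n in Counter(text.lower().split()).items():
            curves[word][index[day]] += n
    return dict(curves)

# Toy text stream, invented for illustration only.
docs = [
    (date(2001, 9, 10), "markets open higher"),
    (date(2001, 9, 11), "attack halts markets"),
    (date(2001, 9, 11), "attack in new york"),
]
dates = [date(2001, 9, 10), date(2001, 9, 11)]
curves = word_frequency_curves(docs, dates)
print(curves["attack"])   # [0, 2]
print(curves["markets"])  # [1, 1]
```

Each curve has the same length as the query time series, so it can be compared with the query point by point.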
1. How to measure correlation between a word and the time series
2. How to aggregate word correlations to rank documents
Slide10
Correlation Function
Measure correlation between a word frequency curve and the input time series
Pearson Correlation
Basic correlation
Dynamic Time Warping [Senin`08]
Capture alignment of shifted or stretched time series
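As a rough illustration of the two measures, the sketch below computes a Pearson coefficient and a textbook dynamic time warping distance in plain Python. The function names, the absolute-difference local cost, and the toy series are assumptions. The slides do not state which DTW variant was used or how a DTW distance is converted into a correlation score, so that step is left out.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / (sx * sy)

def dtw_distance(x, y):
    """Classic DTW distance with |a - b| as the local cost (assumed)."""
    inf = float("inf")
    cost = [[inf] * (len(y) + 1) for _ in range(len(x) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            d = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three allowed warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(x)][len(y)]

# Toy series: the same peak, shifted by one time step.
query = [0, 0, 1, 3, 1, 0, 0]
shifted = [0, 0, 0, 1, 3, 1, 0]
print(pearson(query, shifted))       # well below 1 because the peaks do not line up
print(dtw_distance(query, shifted))  # 0.0, warping absorbs the shift
```

The example shows why DTW is attractive here: a word whose frequency peaks one step after the query still matches perfectly under warping, while its Pearson coefficient drops.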
[Figure: two time series before alignment and after time series alignment; axes: Time, Values]
Slide11
Aggregation Function
Score document correlation by aggregating word correlations
Weighted TF-IDF (BM25)
Use top K correlated words as a text query
Use IR formula such as BM25
Use correlation coefficient as a weight
Slide12
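The weighted BM25 aggregation described on the preceding Aggregation Function slide (top K correlated words as the query, correlation coefficient as the weight) might look like the sketch below. It is a guess at one reasonable reading, not the authors' implementation: the function name `weighted_bm25`, the smoothed IDF variant, the defaults `k1=1.2` and `b=0.75`, and the choice to multiply each term score by the absolute correlation are all assumptions.

```python
import math
from collections import Counter

def weighted_bm25(doc_tokens, corpus, word_corr, top_k=10, k1=1.2, b=0.75):
    """Score one document against the top-K correlated words.

    doc_tokens : list of tokens of the document to score
    corpus     : list of token lists (all documents, for IDF and average length)
    word_corr  : {word: correlation with the query time series}
    Each BM25 term score is multiplied by |correlation| (assumed weighting).
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    # The "text query" is the K words most correlated with the time series.
    query = sorted(word_corr, key=lambda w: abs(word_corr[w]), reverse=True)[:top_k]
    tf = Counter(doc_tokens)
    score = 0.0
    for w in query:
        df = sum(1 for d in corpus if w in d)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        norm = tf[w] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += abs(word_corr[w]) * idf * tf[w] * (k1 + 1) / norm
    return score

# Toy corpus and illustrative correlation values, invented for this example.
corpus = [
    ["taliban", "forces", "afghanistan"],
    ["apple", "retail", "store"],
    ["markets", "rally"],
]
word_corr = {"taliban": 0.84, "afghanistan": 0.86, "apple": 0.10}
scores = [weighted_bm25(d, corpus, word_corr, top_k=2) for d in corpus]
print(scores)  # only the first document contains a top-2 word
```

With `top_k=2` the query is "afghanistan taliban", so documents without either word score zero regardless of their other content.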
Aggregation Function
Average Correlation
Average over all terms: Not all the words are correlated?
Average over top-k terms: May be dominated by multiple occurrences of the same term
Average over top-k unique terms:
Slide13
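The three average-correlation variants listed on the preceding slide can be contrasted in a few lines. This is a sketch under assumptions: the function name `average_correlation`, the use of absolute correlation, and the treatment of unknown words as uncorrelated are illustrative choices that the slides do not specify.

```python
def average_correlation(doc_tokens, word_corr, top_k=None, unique=False):
    """Average |correlation| of a document's terms.

    top_k=None  : average over all term occurrences in the document
    top_k=K     : average over the K most correlated term occurrences
    unique=True : count each distinct term once before taking the top K
    Words missing from word_corr are treated as uncorrelated (assumption).
    """
    tokens = set(doc_tokens) if unique else doc_tokens
    corrs = sorted((abs(word_corr.get(w, 0.0)) for w in tokens), reverse=True)
    if top_k is not None:
        corrs = corrs[:top_k]
    return sum(corrs) / len(corrs) if corrs else 0.0

# Illustrative values, invented for this example.
word_corr = {"taliban": 0.75, "afghanistan": 0.5, "said": 0.25}
doc = ["taliban", "taliban", "taliban", "afghanistan", "said", "said"]
print(average_correlation(doc, word_corr))                        # diluted by weakly correlated words
print(average_correlation(doc, word_corr, top_k=2))               # 0.75, both slots taken by "taliban"
print(average_correlation(doc, word_corr, top_k=2, unique=True))  # 0.625, "taliban" and "afghanistan"
```

The second call shows the problem noted on the slide: repeated occurrences of one term fill the top-k slots, which the unique variant avoids.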
Evaluation
Data Set
New York Times corpus (Jul 2000~Dec 2001)
Entity annotated
Daily stock prices of 24 companies
Measure
Mean average precision (MAP)
Normalized discounted cumulative gain (NDCG)
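For readers unfamiliar with the two measures, here is a minimal sketch of how they are commonly computed. The function names, the `gain / log2(rank + 1)` discount, and the choice to normalize average precision by the number of relevant documents found in the ranked list are assumptions. The slides do not state which variants were used.

```python
import math

def average_precision(ranked_rel):
    """AP for one query; ranked_rel[i] is 1 if the i-th result is relevant.

    Assumes every relevant document appears somewhere in the ranked list.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant result
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP: mean of AP over several queries (here, one per time series)."""
    return sum(average_precision(r) for r in runs) / len(runs)

def ndcg(ranked_gain):
    """NDCG for one query, using gain / log2(rank + 1) discounting."""
    def dcg(gains):
        return sum(g / math.log2(r + 1) for r, g in enumerate(gains, start=1))
    ideal = dcg(sorted(ranked_gain, reverse=True))
    return dcg(ranked_gain) / ideal if ideal else 0.0

# Toy relevance judgments, invented for illustration.
run_a = [1, 0, 1, 0]
run_b = [0, 1]
print(average_precision(run_a))                # (1/1 + 2/3) / 2
print(mean_average_precision([run_a, run_b]))
print(ndcg(run_a))
```

MAP rewards placing relevant documents early in an absolute sense, while NDCG compares the ranking to the best possible ordering of the same list.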
Research questions
Can our method retrieve meaningful documents?
Does DTW outperform Pearson Correlation?
Which aggregation function works the best?
Slide14
Top ranked documents by American Airlines stock price
Rank  Date        Excerpt
1     10/22/2001  Fleeing the war
2     12/11/2001  US and anti-Taliban forces in Afghanistan
3     11/18/2001  Fate of Taliban Soldiers Under Discussion
4     11/12/2001  Tally of dead and missing in Sep 11 terrorist attacks
5     9/25/2001   Soldiers in Afghanistan …
6     11/19/2001  Recovery operation at World Trade Center
7     11/3/2001   4343 died or missing as a result of the attacks on Sep 11
8     11/17/2001  Dead and missing report of Sep 11 attack
…     …           …
All top ranked documents are related to the September 11 terrorist attack
Slide15
Top Correlated Words to American Airlines stock price
All top correlated terms to input time series are related to terrorist attack
Highly correlated terms contributed to retrieval of documents about this topic
Word         |ρ|
challenged   0.887031
afghanistan  0.861351
security     0.858745
sept         0.858309
terrorism    0.854865
pakistan     0.848829
aghans       0.844596
afghan       0.843481
islamic      0.842499
taliban      0.841455
Slide16
Top ranked ‘relevant’ documents for Apple stock price
Rank  Date        Excerpt
1     9/29/2000   Fourth-quarter earning far below estimates
2     12/8/2000   $4 billion reserve, not $11 billion
3     10/19/2000  Announced earnings report
4     4/29/2001   Dow and Nasdaq soar after rate cut by Federal Reserve
5     7/20/2001   Apple’s new retail stores
6     12/6/2000   Apple warns it will record quarterly loss
7     3/24/2001   Stocks perk up, with Nasdaq posting gain
8     8/10/2000   Mixing Mac and Windows
…     …           …
Retrieved relevant event: Disappointing earning report, store open, etc.
Useful as a new feature for re-ranking search results?
Slide17
Quantitative Evaluation
All our methods > Random precision (0.0013)
Dynamic time warping >> Pearson correlation
Method   MAP     NDCG
Pearson  0.0019  0.3515
DTW      0.0022  0.3609
- Average performance (Average correlation as aggregation method)
Slide18
Comparison of Aggregation Methods
AC << TopK, BM25
Top5-AC << Top20-AC, but not more than K=20
BM25 is sensitive to parameter setting
Scores of AC methods are more meaningful
Incomplete judgments
Possibly much better performance in reality
Method         MAP     NDCG
AC             0.0019  0.3515
Top5-AC        0.0021  0.361
Top10-AC       0.0023  0.3618
Top20-AC       0.0024  0.3629
Top5-AC-Uniq   0.0022  0.3613
Top10-AC-Uniq  0.0022  0.3616
Top20-AC-Uniq  0.0022  0.3619
Top5-BM25      0.0019  0.3584
Top10-BM25     0.0023  0.361
Top20-BM25     0.0019  0.3582
- Average performance (w/ Pearson correlation)
Slide19
“Higher” NDCG vs. Low MAP
Slide20
Summary
Introduced a novel retrieval problem: time series as query
Studied basic solutions:
Time series representation of terms
Term retrieval: correlation(query, term)
Document retrieval: aggregation of term retrieval results
Dynamic time warping + top-K average correlation seems to work well
Slide21
Limitations & Future Work
Evaluation is based on simulation
Highly incomplete judgments!
What’s a good way to evaluate such a new retrieval task?
Current solutions are heuristic
How can we develop a more principled model?
Different notions of relevance
“Local” relevance vs. global relevance?
All other issues relevant to a standard retrieval problem are worth exploring (e.g., feedback?)
Slide22
Thank You! Comments/Questions?