
Information Retrieval - PowerPoint Presentation

Uploaded On 2016-03-09


Information Retrieval with Time Series Query. Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai, University of Illinois at Urbana-Champaign; Malu Castellanos, Meichun …




Presentation Transcript

Slide1

Information Retrieval with Time Series Query

Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai

University of Illinois at Urbana-Champaign

Malu Castellanos, Meichun Hsu

HP Laboratories

Slide2

IR for stock market analysis?

[Figure: Dow Jones Industrial Average over time. Source: Yahoo Finance]

What might have caused the stock market crash? Sept 11 attack!

Any clues in the companion news stream?

What documents to read to analyze such a “causal” topic?

Slide3

Analysis of Presidential Prediction Markets

[Figure: prediction market price of a candidate over time]

What might have caused the sudden drop of price for this candidate? Tax cut?

What “mattered” in this election?

Any clues in the companion news stream?

What documents to read to analyze such a “causal” topic?

Slide4

Analysis of Product Sales

[Figure: product sales over time]

What might have caused the decrease of sales? Safety concerns

Any clues in the companion product reviews?

What reviews to read to analyze such a “causal” topic?

Slide5

Finding documents about “trendy” topics

[Figure: hand-drawn time series query over time]

Draw a “time series query”: find documents about a topic emerging this summer, which has attracted much attention this Oct

Which documents cover such a “trendy” topic?

Slide6

Information Retrieval with Time Series Query

Instead of a keyword query, use a time series as the query

Retrieve documents that contain topics that are correlated with the query time series

Input:

Time series data with time stamps

Text stream: a collection of documents with time stamps within the same time period

Output:

Ranked list of documents

Slide7

Ideal Results of Information Retrieval with Time Series Query

[Figure: query time series over 2000, 2001, … with the companion news stream]

RANK  DATE        EXCERPT
1     9/29/2000   Expect earning will be far below
2     12/8/2000   $4 billion cash in company
3     10/19/2000  Disappointing earning report
4     4/19/2001   Dow and Nasdaq soar after rate cut by Federal Reserve
5     7/20/2001   Apple's new retail store
…

Slide8

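IR w/ TS - Method Overview

[Diagram, flattened in extraction; labels "Input 1" and "Input 2" mark the two inputs]

Inputs: text stream of input documents (Sep 2001, Oct 2001, …) and non-text time series

Text stream → vocabulary, word frequency curves (W1, W2, W3, W4)

Rank by correlation

Output: ranked documents

Slide9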

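IR w/ TS - Method Overview

[Diagram, flattened in extraction; labels "Input 1" and "Input 2" mark the two inputs]

Inputs: text stream of input documents (Sep 2001, Oct 2001, …) and non-text time series

Text stream → vocabulary, word frequency curves (W1, W2, W3, W4)

Rank by correlation

Output: ranked documents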
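The word frequency curves in the overview could be built roughly as in the sketch below. The slides do not say how text is tokenized or whether counts are normalized, so the whitespace tokenizer, the raw daily counts, the `(date, text)` input format, and the function name `word_frequency_curves` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import date


def word_frequency_curves(docs, time_points):
    """Build one frequency curve per word.

    docs: iterable of (date, text) pairs (hypothetical input format).
    time_points: ordered list of dates that defines the time axis.
    Returns {word: [count at time_points[0], count at time_points[1], ...]}.
    """
    index = {t: i for i, t in enumerate(time_points)}
    curves = defaultdict(lambda: [0] * len(time_points))
    for day, text in docs:
        if day not in index:
            continue  # document falls outside the query period
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for word, n in Counter(text.lower().split()).items():
            curves[word][index[day]] += n
    return dict(curves)


# Toy text stream (made-up documents, not from the New York Times corpus).
docs = [
    (date(2001, 9, 10), "markets calm"),
    (date(2001, 9, 11), "attack attack markets"),
    (date(2001, 9, 12), "attack security"),
]
days = [date(2001, 9, 10), date(2001, 9, 11), date(2001, 9, 12)]
curves = word_frequency_curves(docs, days)
print(curves["attack"])
```

Each curve has the same length as the time axis, so it can be compared point by point with a query time series sampled on the same dates.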

1. How to measure correlation between word and time series

2. How to aggregate word correlations to rank documents

Slide10

Correlation Function

Measure correlation between word frequency curve and input time series

Pearson Correlation: basic correlation

Dynamic Time Warping [Senin '08]: captures alignment of shifted or stretched time series
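A minimal sketch of the two correlation functions named above, in plain Python. `pearson` is the standard coefficient. `dtw_distance` is the textbook dynamic-programming recurrence with absolute difference as the local cost, which is not necessarily the exact variant used in the talk. DTW yields a distance (lower means more similar), and the slides do not say how that distance is turned into a correlation score, so that step is left out.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant series: correlation undefined, treated as 0 here
    return cov / (sx * sy)


def dtw_distance(x, y):
    """Classic DTW alignment cost with |a - b| as the local distance."""
    inf = float("inf")
    n, m = len(x), len(y)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of: advance x only, advance y only, advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy example: the same spike, shifted by one time step.
query = [0, 0, 5, 1, 0, 0]
shifted = [0, 0, 0, 5, 1, 0]
print(pearson(query, shifted))       # close to zero: the spikes do not line up
print(dtw_distance(query, shifted))  # zero cost: warping aligns the spikes
```

The toy pair illustrates the point of the slide: a one-step shift nearly destroys the Pearson correlation, while DTW aligns the two spikes at no cost.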

[Figure: two time series before alignment and after time series alignment; axes: values vs. time]

Slide11

Aggregation Function

Score document correlation by aggregating word correlations

Weighted TF-IDF (BM25)

Use top K correlated words as a text query

Use IR formula such as BM25

Use correlation coefficient as a weight

Slide12
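The weighted BM25 aggregation (top K correlated words as a text query, correlation coefficient as a weight) might look like the sketch below. The slides give no formula, so multiplying each term's BM25 contribution by its absolute correlation is an assumption, as are the function name `weighted_bm25`, the defaults `k1=1.2` and `b=0.75`, and the non-negative `log(1 + ...)` form of IDF.

```python
import math
from collections import Counter


def weighted_bm25(docs, corr, top_k=10, k1=1.2, b=0.75):
    """Score documents with BM25 over the top_k most correlated words,
    weighting each query word by its absolute correlation (assumed scheme).

    docs: list of token lists; corr: {word: correlation with the query series}.
    """
    query = sorted(corr, key=lambda w: abs(corr[w]), reverse=True)[:top_k]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    df = Counter(w for d in docs for w in set(d))  # document frequency per word
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for w in query:
            if tf[w] == 0:
                continue
            idf = math.log(1 + (n_docs - df[w] + 0.5) / (df[w] + 0.5))
            norm = tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(d) / avg_len))
            score += abs(corr[w]) * idf * norm
        scores.append(score)
    return scores


# Toy documents; the first three correlations are taken from the American
# Airlines word table, the value for "apple" is made up for illustration.
docs = [
    ["taliban", "afghanistan", "war"],
    ["apple", "store", "opens"],
    ["security", "taliban", "taliban"],
]
corr = {"afghanistan": 0.861351, "security": 0.858745, "taliban": 0.841455, "apple": 0.05}
scores = weighted_bm25(docs, corr, top_k=3)
print(scores)
```

With `top_k=3` the weakly correlated word is never part of the query, so the second document scores zero regardless of its content.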

Aggregation Function

Average Correlation

Average over all terms: Not all the words are correlated?

Average over top-k terms: May be dominated by multiple occurrences of the same term

Average over top-k unique terms

Slide13
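The three average-correlation variants (all terms, top-k terms, top-k unique terms) can be sketched in one function. The name `average_correlation`, the use of absolute correlation values, and the choice to treat unknown words as uncorrelated are assumptions; the slides only name the variants.

```python
def average_correlation(doc_tokens, corr, top_k=None, unique=False):
    """Average the correlations of a document's terms.

    top_k=None averages over all term occurrences; otherwise only the top_k
    highest values are kept. unique=True counts each distinct term once.
    Terms missing from corr are treated as uncorrelated (0.0).
    """
    terms = set(doc_tokens) if unique else doc_tokens
    values = sorted((abs(corr.get(t, 0.0)) for t in terms), reverse=True)
    if top_k is not None:
        values = values[:top_k]
    return sum(values) / len(values) if values else 0.0


# Toy correlations and document (made-up values).
corr = {"taliban": 0.8, "war": 0.6, "the": 0.1}
doc = ["taliban", "taliban", "taliban", "war", "the", "the"]

ac_all = average_correlation(doc, corr)                          # all terms
ac_top2 = average_correlation(doc, corr, top_k=2)                # top-k terms
ac_top2_uniq = average_correlation(doc, corr, top_k=2, unique=True)  # top-k unique
print(ac_all, ac_top2, ac_top2_uniq)
```

In the toy document the plain top-2 average is made up entirely of repeated occurrences of one word, which is the domination effect the slide warns about; the unique variant brings in the second-best word instead.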

Evaluation

Data Set

New York Times corpus (Jul 2000 ~ Dec 2001)

Entity annotated

Daily stock prices of 24 companies

Measure

Mean average precision (MAP)

Normalized discounted cumulative gain (NDCG)
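For reference, the two measures can be computed as in the sketch below. These are the standard textbook definitions (average precision per query, NDCG with a log2 discount); the slides do not state which NDCG variant or cutoff was used, so those details are assumptions. MAP is the mean of `average_precision` over all queries, here one query per company time series.

```python
import math


def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant ids."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant) if relevant else 0.0


def ndcg(ranked, gains):
    """NDCG with log2 discount; gains maps doc id -> graded relevance."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))

    actual = dcg([gains.get(d, 0) for d in ranked])
    ideal = dcg(sorted(gains.values(), reverse=True)[:len(ranked)])
    return actual / ideal if ideal > 0 else 0.0


# Toy ranking with hypothetical document ids.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 1, "d2": 1}

ap = average_precision(ranked, relevant)
score = ndcg(ranked, gains)
print(ap, score)
```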

Research questions

Can our method retrieve meaningful documents?

Does DTW outperform Pearson Correlation?

Which aggregation function works the best?

Slide14

Top ranked documents by American Airlines stock price

Rank  Date        Excerpt
1     10/22/2001  Fleeing the war
2     12/11/2001  US and anti-Taliban forces in Afghanistan
3     11/18/2001  Fate of Taliban Soldiers Under Discussion
4     11/12/2001  Tally of dead and missing in Sep 11 terrorist attacks
5     9/25/2001   Soldiers in Afghanistan …
6     11/19/2001  Recovery operation at World Trade Center
7     11/3/2001   4343 died or missing as a result of the attacks on Sep 11
8     11/17/2001  Dead and missing report of Sep 11 attack
…

All top ranked documents are related to the September 11 terrorist attack

Slide15

Top Correlated Words to American Airlines stock price

All top correlated terms to input time series are related to terrorist attack

Highly correlated terms contributed to retrieval of documents about this topic

Word         |ρ|
challenged   0.887031
afghanistan  0.861351
security     0.858745
sept         0.858309
terrorism    0.854865
pakistan     0.848829
aghans       0.844596
afghan       0.843481
islamic      0.842499
taliban      0.841455

Slide16

Top ranked ‘relevant’ documents for Apple stock price

Rank  Date        Excerpt
1     9/29/2000   Fourth-quarter earning far below estimates
2     12/8/2000   $4 billion reserve, not $11 billion
3     10/19/2000  Announced earnings report
4     4/29/2001   Dow and Nasdaq soar after rate cut by Federal Reserve
5     7/20/2001   Apple’s new retail stores
6     12/6/2000   Apple warns it will record quarterly loss
7     3/24/2001   Stocks perk up, with Nasdaq posting gain
8     8/10/2000   Mixing Mac and Windows
…

Retrieved relevant events: disappointing earnings report, store opening, etc.

Useful as a new feature for re-ranking search results?

Slide17

Quantitative Evaluation

All our methods > Random precision (0.0013)

Dynamic time warping >> Pearson correlation

Correlation  MAP     NDCG
Pearson      0.0019  0.3515
DTW          0.0022  0.3609

- Average performance (Average correlation as aggregation method)

Slide18

Comparison of Aggregation Methods

AC << TopK, BM25

Top5-AC << Top20-AC, but not more than K=20

BM25 is sensitive to parameter setting

Scores of AC methods are more meaningful

Incomplete judgments → Possibly much better performance in reality

Method         MAP     NDCG
AC             0.0019  0.3515
Top5-AC        0.0021  0.361
Top10-AC       0.0023  0.3618
Top20-AC       0.0024  0.3629
Top5-AC-Uniq   0.0022  0.3613
Top10-AC-Uniq  0.0022  0.3616
Top20-AC-Uniq  0.0022  0.3619
Top5-BM25      0.0019  0.3584
Top10-BM25     0.0023  0.361
Top20-BM25     0.0019  0.3582

- Average performance (w/ Pearson correlation)

Slide19

“Higher” NDCG vs. Low MAP

Slide20

Summary

Introduced a novel retrieval problem: time series as query

Studied basic solutions:

Time series representation of terms

Term retrieval: correlation(query, term)

Document retrieval: aggregation of term retrieval results

Dynamic time warping + top-K average correlation seems to work well

Slide21

Limitations & Future Work

Evaluation is based on simulation

Highly incomplete judgments!

What’s a good way to evaluate such a new retrieval task?

Current solutions are heuristic

How can we develop a more principled model?

Different notions of relevance

“Local” relevance vs. global relevance?

All other issues relevant to a standard retrieval problem are worth exploring (e.g., feedback?)

Slide22

Thank You! Comments/Questions?