
Analysis of Causal Topics in Text Data and Time Series with Applications to Presidential - PowerPoint Presentation

Uploaded On 2020-08-27


Presentation Transcript

Slide1

Analysis of Causal Topics in Text Data and Time Series with Applications to Presidential Prediction Markets

Hyun Duk Kim, ChengXiang (“Cheng”) Zhai (UIUC)
Thomas A. Rietz (Univ. of Iowa)
Daniel Diermeier (Northwestern Univ.)
Meichun Hsu, Malu Castellanos, and Carlos Ceja (HP Labs)

Slide2

Text Mining for Understanding Time Series

[Figure: Dow Jones Industrial Average plotted over time. Source: Yahoo Finance]

What might have caused the stock market crash? Any clues in the companion news stream?

Example: Sept 11 attack!

Slide3

Analysis of Presidential Prediction Markets

What might have caused the sudden drop of price for this candidate? What “mattered” in this election?

[Figure: price for the candidate plotted over time]

Any clues in the companion news stream? Example: tax cut?

Slide4

Joint Analysis of Text and Time Series to Discover “Causal Topics”

Input:
- Time series
- Text data produced in a similar time period (text stream)

Output:
- Topics whose coverage in the text stream has strong correlations with the time series (“causal” topics)

Example topics: tax cut, gun control
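To make "strong correlations with the time series" concrete, here is a minimal sketch of a Pearson correlation between a topic's coverage and a time series, with an optional lag. The function names `pearson` and `lagged_pearson` and the toy numbers are illustrative assumptions and do not come from the slides.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def lagged_pearson(topic_coverage, series, lag=0):
    """Correlate topic coverage at time t with the series at time t + lag."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    if lag == 0:
        return pearson(topic_coverage, series)
    return pearson(topic_coverage[:-lag], series[lag:])


# Toy data (hypothetical): the series follows the topic one step later.
coverage = [0.1, 0.4, 0.2, 0.5, 0.3]
price = [9.0, 0.2, 0.8, 0.4, 1.0]
print(lagged_pearson(coverage, price, lag=1))
```

A lag of one or more steps lets the text lead the time series, which is the direction of interest when looking for explanations in the news stream.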

Slide5

Related Work

- Topic modeling (e.g., [Hofmann 99], [Blei et al. 03], …)
  - Extract topics from text data and reveal their patterns
  - No consideration of time series, so the topics extracted may not be correlated with the time series
- Stream data mining (e.g., [Agrawal 02])
  - Clustering & categorization of time series data
  - No topics being generated for text data
- Temporal text retrieval and prediction (e.g., [Efron 10], [Smith 10])
  - Incorporating the time factor in retrieval or text-based prediction
  - No topics being generated

New problem: discover causal topics from text streams with time series data for supervision

Slide6

Background: Topic Models

- Topic = multinomial distribution over words (unigram language model)
- Text is assumed to be a sample of words drawn from a mixture of multiple (unknown) topics
- Parameter estimation and Bayesian inference “reveal”
  - all the unknown topics in a text collection
  - the coverage of each topic in each document
- A prior can be imposed to bias the inference of both topics and topic coverage

Slide7

Document as a Sample of Mixed Topics

Topic 1:    government 0.3, response 0.2, ...
Topic 2:    city 0.2, new 0.1, orleans 0.05, ...
...
Topic k:    donate 0.1, relief 0.05, help 0.02, ...
Background: is 0.05, the 0.04, a 0.03, ...

Example document (brackets mark spans attributed to different topics):

[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …

[Diagram: the generative topic model maps the topics to the document; inference/estimation recovers the topics from the text, and a prior can be added on them.]
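The generative view on this slide can be sketched in a few lines: pick a topic for each word position according to the document's topic coverage, then pick a word from that topic. The word weights below are the ones listed on the slide; since the slide truncates each distribution with "...", the sketch renormalises over the listed words only. The `COVERAGE` values, the dictionary keys, and the function name are hypothetical.

```python
import random

# Word weights as listed on the slide (truncated distributions).
TOPICS = {
    "topic_1": {"government": 0.3, "response": 0.2},
    "topic_2": {"city": 0.2, "new": 0.1, "orleans": 0.05},
    "topic_k": {"donate": 0.1, "relief": 0.05, "help": 0.02},
    "background": {"is": 0.05, "the": 0.04, "a": 0.03},
}

# Hypothetical topic coverage for one document (not from the slides).
COVERAGE = {"topic_1": 0.4, "topic_2": 0.2, "topic_k": 0.1, "background": 0.3}


def sample_document(coverage, topics, length, rng):
    """Draw `length` (topic, word) pairs from a mixture of topics."""
    names = list(coverage)
    weights = [coverage[name] for name in names]
    document = []
    for _ in range(length):
        topic = rng.choices(names, weights=weights)[0]
        words = list(topics[topic])
        # random.choices treats weights as relative, so the truncated
        # distributions are implicitly renormalised over the listed words.
        word = rng.choices(words, weights=[topics[topic][w] for w in words])[0]
        document.append((topic, word))
    return document


rng = random.Random(0)
doc = sample_document(COVERAGE, TOPICS, 20, rng)
print(" ".join(word for _, word in doc))
```

Inference runs in the opposite direction: given only the words, it estimates both the topic word distributions and the coverage per document.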

Slide8

When a topic model is applied to a text stream…

[Figure: the topics from the previous slide (Topic 1, Topic 2, ..., Topic k, and the background model, each with its word distribution) shown against a time axis]
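One way to turn per-document topic coverage into a time series for each topic is to sum the coverage of all documents that share a time stamp. The sketch below assumes exactly that aggregation; the function name, the input layout, and the toy documents are illustrative and not taken from the slides.

```python
from collections import defaultdict


def topic_coverage_series(docs, num_topics):
    """Sum per-document topic coverage by date.

    docs: iterable of (date, coverage), where coverage is a list holding
    one share per topic. Returns (sorted dates, one series per topic).
    """
    totals = defaultdict(lambda: [0.0] * num_topics)
    for date, coverage in docs:
        if len(coverage) != num_topics:
            raise ValueError("coverage must have one entry per topic")
        for k, share in enumerate(coverage):
            totals[date][k] += share
    dates = sorted(totals)
    series = [[totals[d][k] for d in dates] for k in range(num_topics)]
    return dates, series


# Toy input (hypothetical): three documents, two topics.
docs = [
    ("2001-09-10", [0.9, 0.1]),
    ("2001-09-11", [0.2, 0.8]),
    ("2001-09-11", [0.1, 0.9]),
]
dates, series = topic_coverage_series(docs, num_topics=2)
print(dates)
print(series)
```

Each resulting series has the same time axis as the external time series, so the two can be compared directly.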

Slide9

New Text Mining Framework: Iterative Causal Topic Modeling

[Diagram: an iterative loop. Topic modeling over the text stream (Sep 2001, Oct 2001, …) yields Topics 1 to 4; the non-text time series is used to select causal topics; each causal topic is zoomed into at the word level to find causal words (Topic 1: W1 +, W2 --, W3 +, W4 --, W5); the words are split by sign (Topic 1-1: W1 +, W3 +; Topic 1-2: W2 --, W4 --); and the split words are fed back to topic modeling as a prior.]

Slide10

Iterative Causal Topic Modeling Framework

[Diagram: the same framework diagram as on Slide 9]

- General framework for any topic modeling and any causality measure
- Naturally incorporates the non-text time series in the process
- Topic level + word level → efficiency + granularity
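The loop in the framework diagram can be sketched as a small driver that takes the topic model and the causality tests as plug-in functions, which mirrors the claim that any topic model and any causality measure can be used. Everything here is a hypothetical skeleton: the function name, the callables, and the representation of a prior as lists of words are assumptions, not the authors' implementation.

```python
def iterative_causal_topics(fit_topics, is_causal, signed_causal_words, iterations):
    """Skeleton of the loop: model topics, test them, split words, feed back.

    fit_topics(prior)          -> list of topics (prior is None at first)
    is_causal(topic)           -> True if the topic passes the causality test
    signed_causal_words(topic) -> {word: "+" or "-"} for significant words
    """
    prior = None
    topics = []
    for _ in range(iterations):
        topics = fit_topics(prior)                     # topic modeling
        causal = [t for t in topics if is_causal(t)]   # topic-level test
        prior = []
        for topic in causal:
            signed = signed_causal_words(topic)        # word-level test
            for sign in ("+", "-"):                    # split words by sign
                group = sorted(w for w, s in signed.items() if s == sign)
                if group:
                    prior.append(group)                # feedback as prior
    return topics, prior


# Toy stand-ins (hypothetical) that reproduce the W1..W5 example.
def toy_fit(prior):
    return [["w1", "w2", "w3", "w4", "w5"]]


def toy_is_causal(topic):
    return True


def toy_signed_words(topic):
    return {"w1": "+", "w2": "-", "w3": "+", "w4": "-"}


topics, prior = iterative_causal_topics(toy_fit, toy_is_causal, toy_signed_words, 2)
print(prior)
```

Testing whole topics first and individual words only inside the causal topics is what the slide summarises as efficiency plus granularity.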

Slide11

Heuristic Optimization of Causality + Coherence

Slide12

Causality Measures

- Pearson correlation
  - Basic correlation
- Granger test
  - For two time series x (topic) and y (stock), with time lag p
  - Significance test of whether the lagged x terms should be retained or not

[Equation: regression for y with two labelled parts, "Auto-regression" and "Lagged values"]
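The regression behind the labels "Auto-regression" and "Lagged values" did not survive extraction. For reference, the textbook form of the Granger test is given below; it is assumed here rather than recovered from the slide, and the coefficient names a_i, b_i are illustrative.

```latex
y_t = a_0
    + \underbrace{\sum_{i=1}^{p} a_i \, y_{t-i}}_{\text{auto-regression}}
    + \underbrace{\sum_{i=1}^{p} b_i \, x_{t-i}}_{\text{lagged values}}
    + \varepsilon_t,
\qquad
H_0 :\; b_1 = b_2 = \dots = b_p = 0
```

If a significance test (typically an F-test) rejects H_0, the lagged x terms are retained, and x is said to Granger-cause y.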

Slide13

Feedback Prior Generation

Word-level causality test results:

Topic  Word            Impact  Significance (%)
1      Social          +       99
       Security        +       96
       Gun             -       98
       Control         -       96
5      September       -       99
       Airline         -       99
       Terrorism       -       97
       (5 more words)
       Attack          -       96
       Good            +       96

Resulting topic priors:

Topic  Word            Prob
1      Social          0.8
       security        0.2
2      Gun             0.75
       Control         0.25
3      September       0.1
       Airline         0.1
       Terrorism       0.075
       … (5 more)
       Attack          0.05
       Good            0.0
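The slide does not state how the prior probabilities are computed. One rule that reproduces the first two prior topics (0.8/0.2 and 0.75/0.25) is to split the words by impact sign and weight each word by its significance above a 95% cutoff; the values shown for the third prior topic do not follow this rule exactly, so the sketch below should be read as an assumption. The `threshold` and `minority_share` parameters and the function name are hypothetical.

```python
def feedback_prior(word_tests, threshold=95.0, minority_share=0.1):
    """Build prior word distributions from word-level test results.

    word_tests: list of (word, sign, significance_percent), sign "+" or "-".
    Assumed rule: weight = significance - threshold, normalised per group.
    """
    kept = [(w, s, sig - threshold) for w, s, sig in word_tests if sig > threshold]
    if not kept:
        return []
    groups = {"+": [], "-": []}
    for word, sign, weight in kept:
        groups[sign].append((word, weight))
    major, minor = sorted(groups.values(), key=len, reverse=True)
    if minor and len(minor) / len(kept) < minority_share:
        # Too few words of the other sign: keep one topic, zero them out.
        total = sum(weight for _, weight in major)
        prior = {word: weight / total for word, weight in major}
        prior.update({word: 0.0 for word, _ in minor})
        return [prior]
    priors = []
    for sign in ("+", "-"):
        if groups[sign]:
            total = sum(weight for _, weight in groups[sign])
            priors.append({word: weight / total for word, weight in groups[sign]})
    return priors


# Topic 1 from the slide: mixed signs, so it is split into two prior topics.
topic_1 = [("social", "+", 99), ("security", "+", 96),
           ("gun", "-", 98), ("control", "-", 96)]
priors = feedback_prior(topic_1)
print(priors)
```

The zero-probability branch mirrors the "Good 0.0" row on the slide, where a lone positive word stays in an otherwise negative topic without contributing to the prior.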

Slide14

Experiment Design 1: Stock Market Analysis

- Time: June 2000 to Dec. 2011
- Text data: New York Times
- Time series:
  - American Airlines stock (AAMRQ)
  - Apple stock (AAPL)
- Question: any “causal topics” to explain fluctuation of the stocks of the two companies?

Slide15

Experiment Design 2: 2000 Presidential Election Campaign

- Time: May 2000 to Oct. 2000
- Text data: New York Times (text mentioning Bush or Gore)
- Time series: normalized “Gore stock price” in the Iowa Electronic Markets (IEM), an online futures market
- Question: any “causal topics” to explain changes in opinions about Gore?

Slide16

Measuring Topic Quality

- Causality confidence of a topic
  - Based on the p-value of the causality test (Granger, Pearson) for the topic
- Topic purity
  - Consistency in the direction of the “causal” relation with the time series (“are all words in the topic positively correlated with the time series?”)
  - Based on the entropy of the distribution of positive/negative words

Slide17

Topic Purity

Topic  Word       Impact  Significance (%)
1      Social     +       99
       Security   +       96
       Gun        -       98
       Control    -       96
5      September  -       99
       Airline    -       99
       Terrorism  -       97
       Attack     -       96
       Good       +       96

[Plot: entropy H(T) against P(T=“pos”), with both axes running from 0 to 1.0]

Topic 1: P(T=“pos”) = P(T=“neg”) = 1/2 → highest entropy → lowest purity (0)
Topic 5: P(T=“pos”) = 1/5, P(T=“neg”) = 4/5 → lower entropy → higher purity
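A minimal sketch of an entropy-based purity score that is consistent with the two cases on this slide: purity is taken to be one minus the binary entropy (base 2) of the positive/negative split. The exact scaling used by the authors is not given in the slides, so this formula and the function name are assumptions.

```python
from math import log2


def topic_purity(signs):
    """Purity of a topic from the impact signs ("+" or "-") of its words.

    Assumed definition: 1 - H(T), where H is the base-2 entropy of the
    positive/negative split. An even split gives 0; a one-sided topic gives 1.
    """
    pos = sum(1 for s in signs if s == "+")
    neg = sum(1 for s in signs if s == "-")
    total = pos + neg
    if total == 0:
        raise ValueError("need at least one signed word")
    entropy = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            entropy -= p * log2(p)
    return 1.0 - entropy


# Topic 1 on the slide: two positive and two negative words.
print(topic_purity(["+", "+", "-", "-"]))
# Topic 5 on the slide: four negative words and one positive word.
print(topic_purity(["-", "-", "-", "-", "+"]))
```

With this definition the mixed topic scores lowest, and the mostly negative topic scores higher, in the same order as on the slide.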

Slide18

Sample Result 1: Topics discovered for AAMRQ vs. AAPL

Significant topic list of two different external time series (top three words per topic):

AAMRQ                        AAPL
russia russian putin         paid notice st
europe european germany      russia russian europe
bush gore presidential       olympic games olympics
police court judge           she her ms
airlines airport air         oil ford prices
united trade terrorism       black fashion blacks
food foods cheese            computer technology software
nets scott basketball        internet com web
tennis williams open         football giants jets
awards gay boy               japan japanese plane
moss minnesota chechnya      …

- AAMRQ: airline, terrorism topic
- AAPL: IT industry topic
→ Topics discovered depend on the external time series

Slide19

Effect of Iterations on Causality Confidence & Purity

Slide20

Different Feedback Strength (µ)

[Plots: x-axis "Iter"; one curve each for µ=10, µ=50, µ=100, µ=500, µ=1000]

- Significant improvement in confidence and number of significant topics by feedback → clear benefit of feedback
- Large µ guarantees topic purity improvement

Slide21

Sample Result 2: Major Topics in 2000 Presidential Election

Top Three Words in Significant Topics:

tax cut 1
screen pataki guiliani
enthusiasm door symbolic
oil energy prices
news w top
pres al vice
love tucker presented
partial abortion privatization
court supreme abortion
gun control nra

- Revealed several important issues, e.g., tax cut, abortion, gun control, oil energy
- Such topics are also cited in the political science literature [Pomper `01] and Wikipedia

Slide22

Additional Results:
http://sifaka.cs.uiuc.edu/~hkim277/InCaToMi/demo/2000_Presidential_Election/dashboard/Dashboard.html

Slide23

Conclusions & Future Work

- Meaningful topics can be extracted from a text stream by using time series for supervision
- Such “causal” topics provide potential explanations for changes in the time series data
- Preliminary experiment results on 2000 presidential prediction markets are promising
- Future work (discussion)
  - Issues related to topic models (e.g., local maxima, # of topics, interpretation of topics)
  - Issues related to causality analysis (e.g., “local” causality)
  - Unified analysis model
  - System to support online interactive analysis of causal topics (time series can be derived from text too)

Slide24

Thank You!

Questions/Comments?