Slide1
Analysis of Causal Topics in Text Data and Time Series with Applications to Presidential Prediction Markets
Hyun Duk Kim, ChengXiang (“Cheng”) Zhai (UIUC)
Thomas A. Rietz (Univ. of Iowa)
Daniel Diermeier (Northwestern Univ.)
Meichun Hsu, Malu Castellanos, and Carlos Ceja (HP Labs)
Slide2
Text Mining for Understanding Time Series
[Figure: Dow Jones Industrial Average over time. Source: Yahoo Finance]
What might have caused the stock market crash? Sept 11 Attack!
Any clues in the companion news stream?
Slide3
Analysis of Presidential Prediction Markets
[Figure: candidate price over time]
What might have caused the sudden drop of price for this candidate? Tax cut?
What “mattered” in this election?
Any clues in the companion news stream?
Slide4
Joint Analysis of Text and Time Series to Discover “Causal Topics”
Input
  Time series
  Text data produced in a similar time period (text stream)
Output
  Topics whose coverage in the text stream has strong correlations with the time series (“causal” topics)
  E.g., tax cut, gun control, …
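A minimal sketch of the correlation idea above, assuming each document already has per-topic coverage weights and a time-period index. The names `topic_coverage_series` and `pearson`, and all toy numbers, are illustrative assumptions and do not come from the slides:

```python
from math import sqrt


def topic_coverage_series(doc_topic_weights, doc_periods, topic, n_periods):
    """Sum one topic's coverage over the documents of each time period."""
    series = [0.0] * n_periods
    for weights, period in zip(doc_topic_weights, doc_periods):
        series[period] += weights[topic]
    return series


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Toy data (hypothetical): four documents, two topics, three time periods.
doc_topic_weights = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]]
doc_periods = [0, 1, 2, 2]
price = [10.0, 3.0, 9.0]  # external time series, one value per period

coverage = topic_coverage_series(doc_topic_weights, doc_periods, topic=0, n_periods=3)
score = pearson(coverage, price)
```

A topic whose `score` is far from zero would be a candidate “causal” topic in this simplified view; the lagged Granger test discussed later in the deck is a stricter alternative.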
Slide5
Related Work
Topic modeling (e.g., [Hofmann 99], [Blei et al. 03], …)
  Extract topics from text data and reveal their patterns
  No consideration of time series: topics extracted may not be correlated with time series
Stream data mining (e.g., [Agrawal 02])
  Clustering & categorization of time series data
  No topics being generated for text data
Temporal text retrieval and prediction (e.g., [Efron 10], [Smith 10])
  Incorporating time factor in retrieval or text-based prediction
  No topics being generated
New Problem: Discover causal topics from text streams with time series data for supervision
Slide6
Background: Topic Models
Topic = multinomial distribution over words (unigram language model)
Text is assumed to be a sample of words drawn from a mixture of multiple (unknown) topics
Parameter estimation and Bayesian inference “reveal”
  All the unknown topics in a text collection
  The coverage of each topic in each document
A prior can be imposed to bias the inference of both topics and topic coverage
Slide7
Document as a Sample of Mixed Topics
[Figure: a generative topic model produces a document from a mixture of topics; inference/estimation recovers the topics, and a prior can be added on them.]
Topic 1: government 0.3, response 0.2, ...
Topic 2: city 0.2, new 0.1, orleans 0.05, ...
Topic k: donate 0.1, relief 0.05, help 0.02, ...
Background: is 0.05, the 0.04, a 0.03, ...
Example document (brackets mark spans attributed to different topics):
[Criticism of government response to the hurricane primarily consisted of criticism of its response to the approach of the storm and its aftermath, specifically in the delayed response] to the [flooding of New Orleans. … 80% of the 1.3 million residents of the greater New Orleans metropolitan area evacuated] … [Over seventy countries pledged monetary donations or other assistance]. …
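The generative view on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' model code: the word probabilities are the truncated values shown on the slide (treated as relative weights), while the per-document topic coverage and the function name `sample_document` are hypothetical.

```python
import random

# Truncated word distributions from the slide; the "..." entries are missing,
# so these values are used only as relative weights.
TOPICS = {
    "topic_1": {"government": 0.3, "response": 0.2},
    "topic_2": {"city": 0.2, "new": 0.1, "orleans": 0.05},
    "topic_k": {"donate": 0.1, "relief": 0.05, "help": 0.02},
    "background": {"is": 0.05, "the": 0.04, "a": 0.03},
}

# Hypothetical topic coverage for one document (not given on the slide).
COVERAGE = {"topic_1": 0.4, "topic_2": 0.2, "topic_k": 0.1, "background": 0.3}


def sample_document(topics, coverage, length, rng):
    """Draw each word by first picking a topic, then a word from that topic."""
    topic_names = list(topics)
    topic_weights = [coverage[name] for name in topic_names]
    words = []
    for _ in range(length):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = list(topics[topic])
        word_weights = [topics[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


document = sample_document(TOPICS, COVERAGE, length=20, rng=random.Random(0))
```

Inference runs this process in reverse: given only the documents, it estimates the word distributions and the per-document coverage.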
Slide8
When a topic model is applied to a text stream…
[Figure: Topic 1 (government, response, ...), Topic 2 (city, new, orleans, ...), Topic k (donate, relief, help, ...), and Background (is, the, a, ...) shown along a time axis.]
Slide9
New Text Mining Framework: Iterative Causal Topic Modeling
[Figure: a loop over a text stream and a non-text time series (Sep 2001, Oct 2001, …).
  Topic Modeling: Topic 1, Topic 2, Topic 3, Topic 4
  Causal Topics: selected using the time series
  Zoom into Word Level, Causal Words: Topic 1 has W1 +, W2 --, W3 +, W4 --, W5, …
  Split Words: Topic 1-1 (W1 +, W3 +) and Topic 1-2 (W2 --, W4 --)
  Feedback as Prior: the split topics return to Topic Modeling]
Slide10
Iterative Causal Topic Modeling Framework
[Figure: text stream and non-text time series (Sep 2001, Oct 2001, …) feeding the loop Topic Modeling, Causal Topics, Zoom into Word Level (Causal Words), Split Words, Feedback as Prior.]
General framework for any topic modeling and any causality measure
Naturally incorporates non-text time series in the process
Topic level + word level: efficiency + granularity
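The loop in the figure can be outlined as a control-flow skeleton. Everything here is a hypothetical sketch: the function name, the four injected callables (`fit_topics`, `topic_confidence`, `word_impacts`, `build_prior`), the iteration count, and the 0.95 confidence cutoff are assumptions chosen for illustration, since the slides leave the topic model and the causality measure open.

```python
def iterative_causal_topics(docs, series, fit_topics, topic_confidence,
                            word_impacts, build_prior, n_iter=3, min_conf=0.95):
    """Alternate topic modeling and causality analysis, feeding back a prior.

    fit_topics(docs, prior)          -> list of topics (prior is None at first)
    topic_confidence(t, docs, series) -> confidence in [0, 1] for topic t
    word_impacts(t, docs, series)     -> word-level causality results for t
    build_prior(impacts)              -> list of prior topics (split by sign)
    """
    prior = None
    topics = []
    for _ in range(n_iter):
        # Step 1: topic modeling, biased by the prior from the last round.
        topics = fit_topics(docs, prior)
        # Step 2: keep topics whose coverage relates to the time series.
        causal = [t for t in topics if topic_confidence(t, docs, series) >= min_conf]
        if not causal:
            break
        # Steps 3-4: zoom into word level, split words, and build the next prior.
        prior = []
        for topic in causal:
            prior.extend(build_prior(word_impacts(topic, docs, series)))
    return topics
```

Passing the model and the causality measure as callables mirrors the slide's claim that the framework works with any topic model and any causality measure.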
Slide11
Heuristic Optimization of Causality + Coherence
Slide12
Causality Measures
Pearson correlation
  Basic correlation
Granger test
  For two time series x (topic) and y (stock), with time lag p
  Auto-regression of y, extended with lagged values of x
  Significance test of whether the lagged x terms should be retained or not
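A minimal sketch of the Granger idea in plain Python, assuming the standard formulation: fit y on its own p lags (restricted model), fit y on its own p lags plus p lags of x (full model), and compare residual sums of squares with an F statistic. The helper names `ols_rss` and `granger_f` are hypothetical, and the sketch stops at the F statistic because converting it to a p-value needs an F-distribution CDF, which the Python standard library does not provide.

```python
def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit via the normal equations."""
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        if abs(aug[col][col]) < 1e-12:
            raise ValueError("singular design matrix")
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    beta = [aug[i][k] / aug[i][i] for i in range(k)]
    return sum(
        (t - sum(b * v for b, v in zip(beta, r))) ** 2
        for r, t in zip(rows, targets)
    )


def granger_f(x, y, p):
    """F statistic for the hypothesis that p lags of x help predict y."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    restricted, full, targets = [], [], []
    for t in range(p, len(y)):
        y_lags = [y[t - i] for i in range(1, p + 1)]
        x_lags = [x[t - i] for i in range(1, p + 1)]
        restricted.append([1.0] + y_lags)        # auto-regression only
        full.append([1.0] + y_lags + x_lags)     # plus lagged x terms
        targets.append(y[t])
    dof = len(targets) - (2 * p + 1)
    if dof <= 0:
        raise ValueError("series too short for this lag")
    rss_restricted = ols_rss(restricted, targets)
    rss_full = ols_rss(full, targets)
    return ((rss_restricted - rss_full) / p) / (rss_full / dof)
```

A large F value suggests the lagged x terms should be retained. The normal-equation solver is adequate for short, well-conditioned series; a numerical library would be the safer choice for real data.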
Slide13
Feedback Prior Generation

Topic  Word             Impact  Significance (%)
1      Social           +       99
       Security         +       96
       Gun              -       98
       Control          -       96
5      September        -       99
       Airline          -       99
       Terrorism        -       97
       … (5 more words)
       Attack           -       96
       Good             +       96

Topic  Word        Prob
1      Social      0.8
       security    0.2
2      Gun         0.75
       Control     0.25
3      September   0.1
       Airline     0.1
       Terrorism   0.075
       … (5 more)
       Attack      0.05
       Good        0.0
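One way to read the two tables is sketched below: group a topic's significant words by impact sign, then weight each word by how far its significance exceeds a cutoff. The cutoff of 95 and the function name `feedback_priors` are assumptions; they reproduce the first two prior topics on the slide (0.8/0.2 and 0.75/0.25) but not the third one exactly, where words are elided, so this should not be taken as the authors' exact rule.

```python
def feedback_priors(words, threshold=95.0):
    """Split words by impact sign and turn significance into prior probabilities.

    words: list of (word, sign, significance_percent), with sign '+' or '-'.
    Returns {sign: {word: probability}}; each sign becomes one prior topic.
    """
    weights_by_sign = {}
    for word, sign, significance in words:
        if significance > threshold:
            # Assumed rule: weight = margin above the significance cutoff.
            weights_by_sign.setdefault(sign, {})[word] = significance - threshold
    priors = {}
    for sign, weights in weights_by_sign.items():
        total = sum(weights.values())
        priors[sign] = {word: w / total for word, w in weights.items()}
    return priors


# Topic 1 from the first table on the slide.
topic_1 = [
    ("social", "+", 99),
    ("security", "+", 96),
    ("gun", "-", 98),
    ("control", "-", 96),
]
priors = feedback_priors(topic_1)
```

Here `priors["+"]` corresponds to the slide's prior Topic 1 and `priors["-"]` to its prior Topic 2.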
Slide14
Experiment Design 1: Stock Market Analysis
Time: June 2000 – Dec. 2011
Text data
  New York Times
Time series
  American Airlines stock (AAMRQ)
  Apple stock (AAPL)
Question: any “causal topics” to explain fluctuation of the stocks of the two companies?
Slide15
Experiment Design 2: 2000 Presidential Election Campaign
Time: May 2000 – Oct. 2000
Text data
  New York Times (use text mentioning Bush or Gore)
Time series
  Normalized “Gore stock price” in Iowa Electronic Markets (IEM), an online futures market
Question: any “causal topics” to explain changes in opinions about Gore?
Slide16
Measuring Topic Quality
Causality confidence of a topic
  Based on p-value of causality test (Granger, Pearson) for the topic
Topic purity
  Consistency in the direction of “causal” relation with the time series (“are all words in the topic positively correlated with the time series?”)
  Based on entropy of distribution of positive/negative words
Slide17
Topic Purity

Topic  Word        Impact  Significance (%)
1      Social      +       99
       Security    +       96
       Gun         -       98
       Control     -       96
5      September   -       99
       Airline     -       99
       Terrorism   -       97
       Attack      -       96
       Good        +       96

[Figure: entropy H(T) plotted against P(T=“pos”), both axes from 0 to 1.0, with the maximum at P(T=“pos”) = 0.5.]
P(T=“pos”) = P(T=“neg”) = 1/2: highest entropy, lowest purity (0)
P(T=“pos”) = 1/5, P(T=“neg”) = 4/5: lower entropy, higher purity
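The entropy-based purity can be sketched as below. Defining purity as one minus the binary entropy (in bits) is an assumption that matches the two cases on the slide, purity 0 at an even split and higher purity at a 1/5 versus 4/5 split; the slides do not give the exact scaling, and the name `topic_purity` is hypothetical.

```python
from math import log2


def topic_purity(n_pos, n_neg):
    """Return 1 - H(T), where T is the sign of a topic's significant words."""
    total = n_pos + n_neg
    if total == 0:
        raise ValueError("topic has no significant words")
    entropy = 0.0
    for count in (n_pos, n_neg):
        if count:  # a zero count contributes nothing to the entropy
            p = count / total
            entropy -= p * log2(p)
    return 1.0 - entropy


mixed = topic_purity(2, 2)        # Topic 1 in the table: two '+' and two '-' words
mostly_neg = topic_purity(1, 4)   # Topic 5 as listed: one '+' and four '-' words
```

Under this definition a topic whose words all share one sign has purity 1.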
Slide18
Sample Result 1: Topics discovered for AAMRQ vs. AAPL
Significant topic list of two different external time series (three words per topic):
AAMRQ
  russia russian putin
  europe european germany
  bush gore presidential
  police court judge
  airlines airport air
  united trade terrorism
  food foods cheese
  nets scott basketball
  tennis williams open
  awards gay boy
  moss minnesota chechnya
AAPL
  paid notice st
  russia russian europe
  olympic games olympics
  she her ms
  oil ford prices
  black fashion blacks
  computer technology software
  internet com web
  football giants jets
  japan japanese plane
  …
AAMRQ: airline, terrorism topic
AAPL: IT industry topic
Topics discovered depend on external time series
Slide19
Effect of Iterations on Causality Confidence & Purity
Slide20
Different Feedback Strength (µ)
[Figure: results plotted against iteration (Iter), one curve per feedback strength: µ=10, µ=50, µ=100, µ=500, µ=1000]
Significant improvement in confidence and number of significant topics by feedback
  Clear benefit of feedback
Large µ guarantees topic purity improvement
Slide21
Sample Result 2: Major Topics in 2000 Presidential Election
Revealed several important issues
  E.g., tax cut, abortion, gun control, oil energy
  Such topics are also cited in political science literature [Pomper `01] and Wikipedia [Link]
Top Three Words in Significant Topics
  tax cut 1
  screen pataki guiliani
  enthusiasm door symbolic
  oil energy prices
  news w top
  pres al vice
  love tucker presented
  partial abortion privatization
  court supreme abortion
  gun control nra
Slide22
Additional Results: http://sifaka.cs.uiuc.edu/~hkim277/InCaToMi/demo/2000_Presidential_Election/dashboard/Dashboard.html
Slide23
Conclusions & Future Work
Meaningful topics can be extracted from a text stream by using time series for supervision
Such “causal” topics provide potential explanations for changes in the time series data
Preliminary experiment results on 2000 presidential prediction markets are promising
Future work (discussion)
  Issues related to topic models (e.g., local maxima, # of topics, interpretation of topics)
  Issues related to causality analysis (e.g., “local” causality)
  Unified analysis model
  System to support online interactive analysis of causal topics (time series can be derived from text too)
Slide24
Thank You!
Questions/Comments?