Slide1
Compass: COMPREHENSIVE ANALYTICS ON SENTIMENT FOR SPATIO-TEMPORAL DATA
USE CASE: SPATIO-TEMPORAL SENTIMENT ANALYSIS OF US ELECTION 2016
23rd SIGKDD Conference on Knowledge Discovery and Data Mining, APPLIED DATA SCIENCE TRACK
Debjyoti Paul, Feifei Li, Murali Krishna Teja, Xin Yu, Richie Frost
Slide2
Motivation
- Recognize emerging social issues and events
- Understand crowd reaction and societal response to events
Slide3
Motivation
People are open about sharing their views on public events:
- Twitter, Facebook, Google+
- Emergency and calamity services
- Apps with location information (Uber, Instagram, etc.)
Slide4
Motivation
Collective crowd opinion matters:
- Crowd response changes with time
- Crowd response changes with location
Slide5
Motivation
Factors that motivated us:
- Ubiquitous geotagged data
- Measuring community influence
- Local socio-economic events
- Local health and food indicators
  - PubMed 2017: "Social media indicators of the food environment and state health outcomes"
  - AJPH 2017: "Geotagged U.S. Tweets as Predictors of County-Level Health Outcomes"
- Local ad-targeting framework
Slide6
Challenges & Solution
Slide7
Challenges
- Large volume of data
- Real-time processing
- Cleaning up and filtering non-relevant data
- Generic topic classification techniques
- Sentiment analysis
- Ad hoc query support
- Information representation
Slide8
Solution - COMPASS
A framework with MINIMAL human hand curation:
- Scalable
- Streaming and batch
- Easily configurable (YAML)
- Pluggable (WebSocket / HTTP REST / program)
  - Data sources
  - Machine learning models
  - Ad hoc pluggable modules
- Bursty event detection module
- Analytics support
  - SIMBA: Spatial In-Memory Big data Analytics (SIGMOD 2016)
- APIs for UI
COMPASS: COMPrehensive Analytics on Sentiment for Spatiotemporal data
Slide9
Solution - COMPASS
- People's sentiment plays a vital role
- Cumulative opinion matters
- Geographical location matters
- Reaction to pre-election events matters
Use case: Spatio-Temporal Sentiment Analysis of US Election 2016
Slide10
US Election SENTIMENT MAP
http://estorm.org
Slide11
US Election SENTIMENT MAP
http://estorm.org
Slide12
COMPASS ARCHITECTURE
Slide13
COMPASS ARCHITECTURE - US ELECTION
Slide14
COMPASS Modules
- Geo-data Collection Module
- Tweet Classification Model
- Sentiment Analysis Model
- Bursty Event Detection Model
- Spatio-Temporal Analysis Framework
- Visualization
Slide15
GEOData Collection
Twitter APIs:
- 1% Streaming API
- Location-based search API
Gather the maximum number of geotagged tweets (286 million):
- Partition the search location
- Prioritize location queries based on density
- Create location-based search bounding-box queries such that they collect an almost equal number of tweets
- Round-robin schedule of queries
Observation:
- Places in New York City will generate more tweets than a place in Southern Utah
Slide16
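The geo-data collection strategy (partition the search area, balance the expected tweet count per bounding box, prioritize dense boxes, then schedule queries round-robin) can be sketched as follows. This is a minimal illustration, not the Compass implementation: `BBox`, `partition`, `schedule`, the quadrant-splitting rule, and the `density` callback are all hypothetical names and choices introduced here, and no real Twitter API call is made.

```python
from dataclasses import dataclass
from itertools import cycle, islice
from typing import Callable, List


@dataclass(frozen=True)
class BBox:
    """A search bounding box in degrees (hypothetical helper type)."""
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon: float, lat: float) -> bool:
        # Half-open intervals so a point belongs to exactly one quadrant.
        return self.west <= lon < self.east and self.south <= lat < self.north

    def split(self) -> List["BBox"]:
        """Split the box into four equal quadrants."""
        mid_lon = (self.west + self.east) / 2
        mid_lat = (self.south + self.north) / 2
        return [
            BBox(self.west, self.south, mid_lon, mid_lat),
            BBox(mid_lon, self.south, self.east, mid_lat),
            BBox(self.west, mid_lat, mid_lon, self.north),
            BBox(mid_lon, mid_lat, self.east, self.north),
        ]


def partition(box: BBox, density: Callable[[BBox], int],
              max_tweets: int, max_depth: int = 8) -> List[BBox]:
    """Recursively split until each box is expected to yield <= max_tweets."""
    if density(box) <= max_tweets or max_depth == 0:
        return [box]
    parts: List[BBox] = []
    for quadrant in box.split():
        parts.extend(partition(quadrant, density, max_tweets, max_depth - 1))
    return parts


def schedule(boxes: List[BBox], density: Callable[[BBox], int],
             n_queries: int) -> List[BBox]:
    """Order boxes by density (densest first) and cycle through them."""
    ordered = sorted(boxes, key=density, reverse=True)
    return list(islice(cycle(ordered), n_queries))


# Toy density estimate: count sample points (lon, lat) inside a box.
sample_points = [(0.5, 0.5), (0.6, 0.6), (1.5, 1.5), (1.2, 0.3), (0.2, 1.8), (3.0, 3.0)]


def sample_density(box: BBox) -> int:
    return sum(box.contains(lon, lat) for lon, lat in sample_points)


boxes = partition(BBox(0, 0, 4, 4), sample_density, max_tweets=2)
plan = schedule(boxes, sample_density, n_queries=10)
```

Dense areas end up covered by many small boxes and sparse areas by a few large ones, which is the effect the New York City versus Southern Utah observation calls for.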
Political Tweets
Political Tweets Classification:
Challenges:
- Keyword filter approach is not feasible
  - the keyword list is never comprehensive
  - does not guarantee correct classification (the same word can have two meanings)
- Machine learning methods
  - need a labeled dataset
  - need manual labor
  - need a large enough dataset
Slide17
Classification
Slide18
Political Tweets
Political Tweets Classification:
Approach:
- Minimal hand curation
- Get the best of both approaches (keywords, ML)
- Use topic modeling to get political keywords
- Use a word2vec model trained on 8.5 million tweets
- Prepare training data
- Prepare an unbiased model
Slide19
Political Tweets
Political Tweets Classification:
Approach:
- Seeds: keywords from the topic model
- Enriched keywords:
  - words similar to the seeds
  - k-nearest neighbors of the seeds in word2vec space
  - cosine distance
Slide20
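The keyword enrichment step (seeds from the topic model, expanded with their k nearest neighbors under cosine similarity in the embedding space) can be sketched as below. This is an illustration under stated assumptions: the function name `enrich_keywords`, the toy vocabulary, and the two-dimensional vectors are made up for the example, standing in for the word2vec model trained on 8.5 million tweets.

```python
import numpy as np


def enrich_keywords(seeds, vocab, vectors, k=2):
    """Return the seeds plus the k nearest vocabulary words of each seed.

    Nearness is cosine similarity between embedding rows; `vocab[i]` is the
    word for `vectors[i]`. Seeds missing from the vocabulary are kept as-is.
    """
    index = {word: i for i, word in enumerate(vocab)}
    # Normalize rows so a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    enriched = set(seeds)
    for seed in seeds:
        if seed not in index:
            continue
        sims = unit @ unit[index[seed]]
        sims[index[seed]] = -np.inf  # exclude the seed itself
        for j in np.argsort(-sims)[:k]:
            enriched.add(vocab[int(j)])
    return enriched


# Toy embedding (hypothetical values, not from a trained model).
toy_vocab = ["election", "vote", "ballot", "pizza", "burger"]
toy_vectors = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],
    [0.1, 0.9],
])

political = enrich_keywords(["election"], toy_vocab, toy_vectors, k=2)
```

With a trained model, the same lookup is usually delegated to the embedding library's own nearest-neighbor query rather than written by hand.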
Political Tweets
Slide21
PARTY CLASSIFICATION
Party Classification:
Approach:
- Map: enriched keywords <=> Democrat / Republican
  - the only manual task in the whole pipeline
  - 300 keywords labeled by experts (crowdsourcing)
- Prepare training data from these labeled keywords
- But how do we prepare an unbiased model?
  - remove the keywords used for filtering from the training data
  - a generic approach for all types of classification
  - Sports, Finance, Health, Food, etc.
Slide22
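The idea of labeling tweets by party keywords and then removing those same keywords from the training text, so the classifier cannot simply memorize the filter, can be sketched as follows. The tokenizer, the function names, and the two example hashtags are assumptions made for illustration; the slides do not specify how ambiguous tweets are handled, so this sketch simply skips tweets that match keywords of both parties.

```python
import re

TOKEN = re.compile(r"[#@]?\w+")


def tokenize(tweet):
    """Lowercase and split into words, keeping # and @ prefixes."""
    return TOKEN.findall(tweet.lower())


def label_by_keywords(tweet, keyword_labels):
    """Return the party if exactly one party's keywords occur, else None."""
    parties = {keyword_labels[t] for t in tokenize(tweet) if t in keyword_labels}
    return parties.pop() if len(parties) == 1 else None


def build_training_set(tweets, keyword_labels):
    """Weakly label tweets, then strip the labeling keywords from the text."""
    data = []
    for tweet in tweets:
        label = label_by_keywords(tweet, keyword_labels)
        if label is None:
            continue
        text = " ".join(t for t in tokenize(tweet) if t not in keyword_labels)
        if text:  # drop tweets that consisted only of keywords
            data.append((text, label))
    return data


# Hypothetical keyword map; the real one has about 300 labeled keywords.
keyword_labels = {"#maga": "Republican", "#imwithher": "Democrat"}
tweets = [
    "Rally tonight #MAGA",
    "Proud to vote #ImWithHer",
    "#MAGA vs #ImWithHer debate",
    "Nice weather today",
]
training = build_training_set(tweets, keyword_labels)
```

The same two functions work for any topic (sports, finance, health, food) once the keyword map is swapped.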
Sentiment Model
Compass' sentiment model (SENT) can be instantiated with different classifiers:
- SVM based (SENT1)
- LR based (SENT2)
- MNB based (SENT3)
- LSTM-RNN based (SENT4)
- FastText based (SENT5)
Training data: Stanford Twitter Sentiment Corpus (A. Go et al.)
Approach:
- Sentiment score range: [0.0, 1.0]
- Towards 0.0: negative sentiment
- Towards 1.0: positive sentiment
Slide23
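To make the [0.0, 1.0] score range concrete, here is a toy logistic-regression-style scorer. It is only a sketch: the weights in `TOY_WEIGHTS` are hand-set and hypothetical, whereas a real LR-based model such as SENT2 would learn its weights from the training corpus.

```python
import math

# Hypothetical per-word weights; positive pushes the score toward 1.0.
TOY_WEIGHTS = {"love": 2.0, "great": 1.5, "hate": -2.0, "awful": -1.5}


def sentiment_score(text, weights=TOY_WEIGHTS, bias=0.0):
    """Map a text to a score in [0.0, 1.0] with a logistic function."""
    z = bias + sum(weights.get(tok, 0.0) for tok in text.lower().split())
    z = max(-50.0, min(50.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))
```

A text with no known words scores exactly 0.5, which is the neutral midpoint of the range.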
GEOMAPPING MODULE
Mapping geotagged tweets to counties is important for faster processing:
- Web services are inefficient
- Compass uses high-precision GeoJSON
- A geo-tree based approach
  - country → state → county as nodes
  - efficient computation
Slide24
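A geo-tree lookup of the kind described (country, then state, then county) can be sketched as below. The names `GeoNode` and `locate`, the bounding-box pre-check, and the ray-casting test are assumptions chosen for illustration; the slides do not say how Compass tests containment, and real GeoJSON polygons may also carry holes and multi-polygons, which this sketch ignores.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]  # (lon, lat)


@dataclass
class GeoNode:
    """A region with an outer ring and child regions nested inside it."""
    name: str
    polygon: List[Point]
    children: List["GeoNode"] = field(default_factory=list)

    def bbox_contains(self, p: Point) -> bool:
        lons = [x for x, _ in self.polygon]
        lats = [y for _, y in self.polygon]
        return min(lons) <= p[0] <= max(lons) and min(lats) <= p[1] <= max(lats)

    def contains(self, p: Point) -> bool:
        """Cheap bounding-box rejection, then a ray-casting polygon test."""
        if not self.bbox_contains(p):
            return False
        x, y = p
        inside = False
        n = len(self.polygon)
        for i in range(n):
            x1, y1 = self.polygon[i]
            x2, y2 = self.polygon[(i + 1) % n]
            if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside
        return inside


def locate(node: GeoNode, p: Point) -> List[str]:
    """Return region names from the root down to the deepest match."""
    if not node.contains(p):
        return []
    for child in node.children:
        path = locate(child, p)
        if path:
            return [node.name] + path
    return [node.name]
```

Because a point outside a state is rejected before any of that state's counties are examined, only a small fraction of county polygons is tested per tweet.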
Bursty Event Detection
(Definition 1) Burst. A burst is a phenomenon identified when at least n occurrences of the same event happen within time τ, where τ is determined from the probability distribution of the gaps between occurrences.
Slide25
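Definition 1 can be checked with a sliding window over event timestamps. In this sketch, τ is passed in as a parameter because the slide does not give the rule for deriving it from the gap distribution; the function name `detect_bursts` is hypothetical.

```python
from collections import deque


def detect_bursts(timestamps, n, tau):
    """Return each time t at which the last n occurrences span at most tau.

    `timestamps` must be sorted in ascending order.
    """
    window = deque(maxlen=n)  # keeps only the n most recent occurrences
    bursts = []
    for t in timestamps:
        window.append(t)
        if len(window) == n and t - window[0] <= tau:
            bursts.append(t)
    return bursts
```

For example, with `n=3` and `tau=5`, the stream `[0, 10, 20, 21, 22, 23, 40]` is flagged at times 22 and 23, where three occurrences fall within five time units.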
Bursty Event Detection
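$S_e = \{a_0, a_1, a_2, \ldots, a_{m-1}, a_m, \ldots\}$ is a data stream of microblog articles $a_i$ about the same event $e$, each with timestamp $t_i$.
We define the term surge $s_{t,\tau_e}$.
[Equation for $s_{t,\tau_e}$ not recoverable from the extracted slide text.]
$\alpha$ is the time-variant smoothing parameter, with value $\alpha^*$ when $t - t_{i-1} = \nu_e$.
Slide26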
Bursty Event Detection
$S_e$ is extracted from a stream $S$ that contains a mixture of microblog articles mentioning different events.
(Definition 2) Burstiness. We define the burstiness of event $e$ at time $t$ as the area under the curve of the surge over time $\tau_e$.
$b_{t,\tau_e}$ is the basis of the bursty event detection module.
Slide27
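One reading of Definition 2 is $b_{t,\tau_e} = \int_{t-\tau_e}^{t} s_{x,\tau_e}\,dx$; the integration window $[t - \tau_e, t]$ is an assumption, since the slide only says "over time $\tau_e$". Under that assumption, and given a surge curve that has already been sampled at discrete times, the area can be approximated with the trapezoidal rule. The function name `burstiness` is hypothetical, and the surge values in the usage line are made up, since the surge formula itself is not recoverable here.

```python
def burstiness(times, surge, t, tau_e):
    """Trapezoidal area under the sampled surge curve over [t - tau_e, t].

    `times` must be sorted ascending and aligned with `surge`. Only samples
    inside the window are used; the window edges are not interpolated.
    """
    pts = [(x, s) for x, s in zip(times, surge) if t - tau_e <= x <= t]
    return sum((x2 - x1) * (s1 + s2) / 2
               for (x1, s1), (x2, s2) in zip(pts, pts[1:]))


# Made-up surge samples at unit time steps.
area = burstiness([0, 1, 2, 3, 4], [0, 2, 4, 2, 0], t=4, tau_e=2)
```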
Bursty Event Detection
Two types of query can be invoked:
- $Q_{e,[start,end]}$, where $e$ is an event type represented by a keyword and [start, end] defines a query time window.
- $Q_{[start,end],\gamma}$, where $\gamma$ is a threshold.
Query 1 returns the $s_{t,\tau_e}$ and $b_{t,\tau_e}$ streams for $e$.
Query 2 returns all elements whose $b_{t,\tau_e}$ value has exceeded the threshold in the interval [start, end].
Slide28
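The two query types can be sketched over an in-memory store that maps each event keyword to its (time, surge, burstiness) samples. The store layout and the function names `query_event` and `query_threshold` are assumptions for illustration; the slides do not describe how Compass stores or serves these streams.

```python
from typing import Dict, List, Tuple

# event keyword -> list of (t, surge, burstiness) samples (hypothetical layout)
Store = Dict[str, List[Tuple[float, float, float]]]


def query_event(store: Store, event: str, start: float, end: float):
    """Query 1: surge and burstiness samples for one event in [start, end]."""
    return [(t, s, b) for t, s, b in store.get(event, []) if start <= t <= end]


def query_threshold(store: Store, start: float, end: float, gamma: float):
    """Query 2: every (event, t, burstiness) above gamma in [start, end]."""
    return [
        (event, t, b)
        for event, series in store.items()
        for t, _s, b in series
        if start <= t <= end and b > gamma
    ]


# Made-up sample values.
store: Store = {
    "debate": [(1, 0.5, 1.0), (2, 2.0, 6.0), (3, 1.0, 4.0)],
    "storm": [(2, 0.1, 0.2)],
}
hot = query_threshold(store, start=1, end=3, gamma=3.0)
```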
Spatio-temporal Analytics: SIMBA (SIGMOD 2016)
Compass provides fast, scalable, and high-throughput spatial and temporal queries and analytics for location-based services (LBS).
Compass integrates and uses Simba (Spatial In-Memory Big data Analytics):
- Scalable
- Efficient in-memory spatial query processing and analytics
- Meant for spatiotemporal data
Compass uses an HTTP RESTful API interface that interacts with Simba in SQL.
Interactive web visualization system:
- Built with D3.js for the election sentiment map
- http://www.estorm.org
Slide29
Experiment & Result
Slide30
Experiment and result
Tweet Classification:
Political Model (POLM):
- POLM1: SVM based
- POLM2: LR based
Slide31
Experiment and result
Party Classification:
Party Model (PARTYM):
- PARTYM1: SVM based
- PARTYM2: LR based
- A0: no keywords in training data
- A1: only Trump & Hillary in training data
Performance on the crowdsourced labeled tweets and on the general testbed appears similar.
Slide32
Experiment and result
Sentiment Classification:
Sentiment Model (SENT):
- SENT1: SVM based
- SENT2: LR based
- SENT3: MNB based
- SENT4: LSTM-RNN based
- SENT5: FastText based
1.6 million tweets from Stanford Twitter Sentiment (STS) were used for training and testing.
Slide33
Experiment and result
Bursty Event Detection:
Slide34
Experiment and result
Overall:
- Florida actual election result
- Florida sentiment analysis
Slide35
Experiment and result
Overall:
- California actual election result
- California sentiment analysis
Slide36
Conclusion
Slide37
Conclusion
- The US Election 2016 use case suggests that our approach achieves what we aimed for.
- People's reactions change with time, so temporal query analysis makes more sense.
- In terms of the election result: Twitter mostly reflects the voice of the younger generation. Exit polls indicate that the older generation and non-graduates lean toward the Republican party, and their percentage is statistically significant. (Source: NY Times Exit Polls)
- Beyond the election, Compass is a useful and generic end-to-end framework for spatio-temporal sentiment analysis over any topic of interest.
- Compass provides a configurable framework for diverse applications.
Slide38
References
Slide39
References
[1] D. Agarwal and B.-C. Chen. fLDA: Matrix factorization through latent Dirichlet allocation. In WSDM, 2010.
[2] L. AlSumait, D. Barbará, and C. Domeniconi. On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. In ICDM. IEEE, 2008.
[3] A. Anandkumar, D. P. Foster, D. J. Hsu, S. M. Kakade, and Y.-K. Liu. A spectral algorithm for latent Dirichlet allocation. In NIPS, 2012.
[4] D. Anuta, J. Churchin, and J. Luo. Election bias: Comparing polls and Twitter in the 2016 US election. arXiv:1701.06232, 2016.
[5] AOL. 2016 presidential election timeline, 2016. [accessed 08-02-2017].
[6] D. M. Blei. Probabilistic topic models. CACM, 55(4):77–84, 2012.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3(Jan):993–1022, 2003.
[8] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
[9] A. Bovet, F. Morone, and H. A. Makse. Predicting election trends with Twitter: Hillary Clinton versus Donald Trump. arXiv:1610.01587, 2016.
[10] M. Cataldi, L. Di Caro, and C. Schifanella. Emerging topic detection on Twitter based on temporal and social terms evaluation. In MDM/KDD, 2010.
[11] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. In LATIN, 2004.
[12] C. N. Dos Santos and M. Gatti. Deep convolutional neural networks for sentiment analysis of short texts. In COLING, 2014.
[13] A. Duric and F. Song. Feature selection for sentiment analysis based on content and syntax models. Decision Support Systems, 53(4):704–711, 2011.
[14] A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable topical phrase mining from text corpora. PVLDB, 8(3), 2014.
[15] A. Genkin, D. D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3), 2007.
[16] A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant supervision. CS224N Project, Stanford, 1(12), 2009.
[17] F. Godin, V. Slavkovikj, W. De Neve, B. Schrauwen, and R. Van de Walle. Using topic models for Twitter hashtag recommendation. In WWW, 2013.
[18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[19] T. Hofmann. Probabilistic latent semantic analysis. In Uncertainty in Artificial Intelligence, pages 289–296, 1999.
[20] IETF. RFC 7946 - The GeoJSON format, 2017. [accessed 08-Feb-2017].
[21] L. Jiang, M. Yu, M. Zhou, X. Liu, and T. Zhao. Target-dependent Twitter sentiment classification. In ACL HLT, pages 151–160.
[22] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In ECML, 1998.
[23] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
[24] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[25] J. Kleinberg. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 7(4):373–397, 2003.
Slide40
References
[26] E. Kouloumpis, T. Wilson, and J. D. Moore. Twitter sentiment analysis: The good the bad and the OMG! ICWSM, 11(538-541), 2011.
[27] S. Lai, L. Xu, K. Liu, and J. Zhao. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267–2273, 2015.
[28] Q. Li, S. Shah, X. Liu, A. Nourbakhsh, and R. Fang. TweetSift: Tweet topic classification based on entity knowledge base and topic enhanced word embedding. In CIKM, 2016.
[29] R. Lu and Q. Yang. Trend analysis of news topics on Twitter. IJMLC, 2(3), 2012.
[30] A. McCallum, K. Nigam, et al. A comparison of event models for naive Bayes text classification. In AAAI, volume 752, pages 41–48, 1998.
[31] L. Medsker and L. C. Jain. Recurrent neural networks: Design and applications. CRC Press, 1999.
[32] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013.
[33] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
[34] D.-P. Nguyen, R. Gravel, R. Trieschnigg, and T. Meder. "How old do you think I am?" A study of language and age in Twitter. 2013.
[35] B. Pang, L. Lee, et al. Opinion mining and sentiment analysis. FTIR, 2(1–2):1–135, 2008.
[36] PRC. Demographics of social media users in 2016, 2016. [accessed 08-Feb-2017].
[37] D. A. Shamma, L. Kennedy, and E. F. Churchill. Peaks and persistence: Modeling the shape of microblog conversations. In CSCW, 2011.
[38] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2):267–307, 2011.
[39] D. Tang, B. Qin, and T. Liu. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, pages 1422–1432, 2015.
[40] N. Y. Times. Election 2016: Exit polls, 2016.
[41] S. Vosoughi, H. Zhou, and D. Roy. Enhanced Twitter sentiment classification using contextual information. arXiv:1605.05195, 2016.
[42] C. Wang and D. M. Blei. Collaborative topic modeling for recommending scientific articles. In SIGKDD, 2011.
[43] Z. Wei, G. Luo, K. Yi, X. Du, and J.-R. Wen. Persistent data sketching. In SIGMOD, 2015.
[44] Wikipedia. Swift gamma-ray burst mission - Wikipedia, the free encyclopedia, 2016. [accessed 08-Feb-2017].
[45] D. Xie, F. Li, B. Yao, G. Li, L. Zhou, and M. Guo. Simba: Efficient in-memory spatial analytics. In SIGMOD, 2016.
[46] W. Xie, F. Zhu, J. Jiang, E.-P. Lim, and K. Wang. TopicSketch: Real-time bursty topic detection from Twitter. In ICDM, pages 837–846, 2013.
[47] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing Twitter and traditional media using topic models. In ECIR, 2011.
[48] C. Zhou, C. Sun, Z. Liu, and F. Lau. A C-LSTM neural network for text classification. arXiv:1511.08630, 2015.
[49] Y. Zhu and D. Shasha. Efficient elastic burst detection in data streams. In SIGKDD, 2003.
Slide41
Questions?
Slide42
Thank You