Compass COMPREHENSIVE - PowerPoint Presentation

Uploaded On 2018-03-22

Comprehensive Analytics on Sentiment for Spatiotemporal Data. Use case: Spatiotemporal Sentiment Analysis of US Election 2016. 23rd SIGKDD Conference on Knowledge Discovery and Data Mining. ID: 661319

Presentation Transcript

Slide1

Compass

COMPREHENSIVE ANALYTICS ON SENTIMENT FOR SPATIO-TEMPORAL DATA
USE CASE: SPATIO-TEMPORAL SENTIMENT ANALYSIS OF US ELECTION 2016
23rd SIGKDD Conference on Knowledge Discovery and Data Mining, APPLIED DATA SCIENCE TRACK

Debjyoti Paul, Feifei Li, Murali Krishna Teja, Xin Yu, Richie Frost

Slide2

Motivation

- Recognize emerging social issues and events
- Understand crowd reaction and societal response to events

Slide3

Motivation

People openly share their views about public events:
- Twitter, Facebook, Google+
- Emergency and calamity services
- Apps with location information (Uber, Instagram, etc.)

Slide4

Motivation

Collective crowd opinion matters:
- Crowd response changes with time
- Crowd response changes with location

Slide5

Motivation

Factors that motivated us:
- Ubiquitous geotagged data
- Measure community influence
- Local socio-economic events
- Local health–food indicators
  - PubMed 2017: Social media indicators of the food environment and state health outcomes.
  - AJPH 2017: Geotagged U.S. Tweets as Predictors of County-Level Health Outcomes
- Local ad-targeting framework

Slide6

Challenges & Solution

Slide7

Challenges

- Large volume of data
- Real-time processing
- Clean up/filter non-relevant data
- Generic topic classification techniques
- Sentiment analysis
- Ad hoc query support
- Information representation

Slide8

Solution - COMPASS

A framework with MINIMAL human hand curation:
- Scalable
- Streaming and batch
- Easily configurable - YAML
- Pluggable (WebSocket/HTTP-REST/Program)
  - Data sources
  - Machine learning models
  - Ad hoc pluggable modules
- Bursty Event Detection module
- Analytics support
  - SIMBA: Spatial In-Memory Big Analytics (SIGMOD 2016)
- APIs for UI

Comprehensive Analytics on Sentiment for Spatiotemporal data

Slide9

Solution - COMPASS

- People's sentiment plays a vital role
- Cumulative opinion matters
- Geographical location matters
- Reaction to pre-election events matters

Use case: Spatio-Temporal Sentiment Analysis of US Election 2016

Slide10

US Election SENTIMENT MAP

http://estorm.org

Slide11

US Election SENTIMENT MAP

http://estorm.org

Slide12

COMPASS ARCHITECTURE

Slide13

COMPASS ARCHITECTURE - US ELECTION

Slide14

COMPASS Modules

- Geo-data Collection Module
- Tweet Classification Model
- Sentiment Analysis Model
- Bursty Event Detection Model
- Spatio-Temporal Analysis Framework
- Visualization

Slide15

GEOData Collection

Twitter APIs:
- 1% Streaming API
- Location-based search API

Gather the maximum number of geotagged tweets (286 million):
- Partition the search location
- Prioritize location queries based on density
- Create location-based search bounding-box queries such that they collect an almost equal number of tweets
- Round-robin schedule of queries

Observation:
- Places in New York City will generate more tweets than a place in Southern Utah

Slide16
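
The geodata collection strategy (partition the search area so each bounding box yields a similar number of tweets, prioritize dense boxes, then cycle through the queries) can be sketched as follows. This is a minimal illustration under stated assumptions, not the Compass implementation: the quadtree-style split, the `max_points` cap, and the function names `partition`, `prioritized_queries`, and `round_robin` are all hypothetical, and density is approximated by the count of previously seen sample points in each box.

```python
from itertools import cycle, islice


def partition(box, points, max_points, depth=0, max_depth=20):
    """Split box = (min_lon, min_lat, max_lon, max_lat) into quadrants until
    each leaf holds at most max_points sample points (or max_depth is hit).
    Returns a list of (box, point_count) leaves."""
    min_lon, min_lat, max_lon, max_lat = box
    inside = [(x, y) for x, y in points
              if min_lon <= x < max_lon and min_lat <= y < max_lat]
    if len(inside) <= max_points or depth >= max_depth:
        return [(box, len(inside))]
    mid_lon = (min_lon + max_lon) / 2
    mid_lat = (min_lat + max_lat) / 2
    quadrants = [
        (min_lon, min_lat, mid_lon, mid_lat),
        (mid_lon, min_lat, max_lon, mid_lat),
        (min_lon, mid_lat, mid_lon, max_lat),
        (mid_lon, mid_lat, max_lon, max_lat),
    ]
    leaves = []
    for quadrant in quadrants:
        leaves.extend(partition(quadrant, inside, max_points, depth + 1, max_depth))
    return leaves


def prioritized_queries(leaves):
    """Drop empty boxes and order the rest by descending sample count."""
    ranked = sorted(leaves, key=lambda leaf: leaf[1], reverse=True)
    return [box for box, count in ranked if count > 0]


def round_robin(queries, n_calls):
    """Return the first n_calls queries of an endless round-robin schedule."""
    return list(islice(cycle(queries), n_calls))


# Toy sample (assumed data): a dense cluster near New York City and a single
# point in southern Utah, inside a rough bounding box of the contiguous US.
nyc = [(-74.00 + 0.01 * i, 40.70 + 0.01 * i) for i in range(6)]
utah = [(-113.0, 37.5)]
leaves = partition((-125.0, 24.0, -66.0, 50.0), nyc + utah, max_points=2)
schedule = round_robin(prioritized_queries(leaves), n_calls=10)
```

Dense areas end up covered by many small boxes and sparse areas by a few large ones, which is one way to make each query return a comparable number of tweets.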

Political Tweets

Political Tweets Classification: Challenges:
- Keyword filter approach is not feasible
  - The keyword list is never comprehensive
  - It doesn't guarantee correct classification (the same word can have two meanings)
- Machine learning methods
  - Need a labeled dataset
  - Need manual labor
  - Need a large enough dataset

Slide17

Classification

Slide18

Political Tweets

Political Tweets Classification: Approach:
- Minimal hand curation
- Get the best of both approaches (keywords, ML)
- Use topic modeling to get political keywords
- Use a word2vec model trained on 8.5 million tweets
- Prepare training data
- Prepare an unbiased model

Slide19

Political Tweets

Political Tweets Classification: Approach:
- Seeds: keywords from the topic model
- Enriched keywords:
  - Words similar to the seeds
  - k-nearest neighbors of the seeds in word2vec
  - Cosine distance

Slide20
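
The keyword enrichment step (k-nearest neighbors of each seed under cosine similarity in an embedding space) can be illustrated with a small sketch. The three-dimensional vectors below are made-up toy values standing in for a trained word2vec model, and `cosine_similarity` and `enrich` are hypothetical helper names, not part of Compass.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def enrich(seeds, vectors, k):
    """Return the seeds plus the k most similar words to each seed."""
    enriched = set(seeds)
    for seed in seeds:
        if seed not in vectors:
            continue  # seed has no embedding; nothing to expand
        scored = sorted(
            ((cosine_similarity(vectors[seed], vec), word)
             for word, vec in vectors.items() if word != seed),
            reverse=True,
        )
        enriched.update(word for _, word in scored[:k])
    return enriched


# Toy embedding table (assumed values, for illustration only).
toy_vectors = {
    "election": [0.9, 0.1, 0.0],
    "vote": [0.8, 0.2, 0.1],
    "senate": [0.7, 0.3, 0.0],
    "pizza": [0.0, 0.1, 0.9],
    "soccer": [0.1, 0.9, 0.2],
}
keywords = enrich(["election"], toy_vectors, k=2)
```

With these toy vectors, the seed "election" pulls in "vote" and "senate" while unrelated words stay out.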

Political Tweets

Slide21

PARTY CLASSIFICATION

Party Classification: Approach:
- Map: enriched keywords <=> Democrat / Republican
  - Only manual task in the whole pipeline
  - 300 keywords labeled by the experts (crowdsourcing)
- Prepare training data from these data
- But how to prepare an unbiased model?
  - Remove from the data the keywords that were used to filter it
  - Generic approach for all types of classification
  - Sports, Finance, Health, Food, etc.

Slide22
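
The idea of labeling tweets by party keywords and then removing those keywords before training, so the classifier cannot simply memorize the filter terms, can be sketched as below. The keyword map, the tokenizer regex, and the function names are assumptions for illustration; tweets that match keywords from both parties are skipped here, which is a choice of this sketch rather than something the slides specify.

```python
import re

TOKEN = re.compile(r"[#@]?\w+")


def tokenize(text):
    """Lowercase tokens with any leading '#' or '@' removed."""
    return [t.lstrip("#@") for t in TOKEN.findall(text.lower())]


def build_training_set(tweets, keyword_to_party):
    """Label each tweet by the party of the keywords it contains, then strip
    those keywords so the model has to learn from the remaining words."""
    examples = []
    for tweet in tweets:
        tokens = tokenize(tweet)
        parties = {keyword_to_party[t] for t in tokens if t in keyword_to_party}
        if len(parties) != 1:
            continue  # unlabeled or ambiguous tweet: leave it out
        stripped = " ".join(t for t in tokens if t not in keyword_to_party)
        examples.append((stripped, parties.pop()))
    return examples


# Hypothetical two-entry keyword map; the deck's real map has 300 keywords.
keyword_map = {"trump": "republican", "hillary": "democrat"}
sample = [
    "Trump rally tonight",
    "Hillary speaks at noon",
    "Trump vs Hillary debate",
    "Nice weather today",
]
training = build_training_set(sample, keyword_map)
```

The same pattern applies to other topics (sports, finance, health, food) by swapping in a different keyword map.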

Sentiment Model

Compass' sentiment model (SENT) can be instantiated with different classifiers:
- SVM based (SENT1)
- LR based (SENT2)
- MNB based (SENT3)
- LSTM-RNN based (SENT4)
- FastText based (SENT5)

Training data: Stanford Twitter Sentiment Corpus (A. Go et al.)

Approach:
- Sentiment score range: [0.0, 1.0]
- Towards 0.0: negative sentiment
- Towards 1.0: positive sentiment

Slide23
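
To make the [0.0, 1.0] sentiment score concrete, here is a minimal multinomial Naive Bayes sketch (the model family behind SENT3) that returns the posterior probability of the positive class. It is a toy written from scratch under assumptions: whitespace tokenization, Laplace smoothing, and a four-tweet training set invented for illustration, whereas the deck trains on the Stanford Twitter Sentiment corpus. `TinyMNB` is a hypothetical name.

```python
import math
from collections import Counter


class TinyMNB:
    """Two-class multinomial Naive Bayes with Laplace smoothing.
    Labels: 0 = negative, 1 = positive."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(text.lower().split())
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def _log_joint(self, tokens, label):
        total = sum(self.counts[label].values())
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for tok in tokens:
            if tok in self.vocab:  # ignore words never seen in training
                score += math.log(
                    (self.counts[label][tok] + 1) / (total + len(self.vocab)))
        return score

    def score(self, text):
        """Posterior P(positive | text): near 0.0 negative, near 1.0 positive."""
        tokens = text.lower().split()
        neg = self._log_joint(tokens, 0)
        pos = self._log_joint(tokens, 1)
        return 1.0 / (1.0 + math.exp(neg - pos))


# Invented toy training data, for illustration only.
model = TinyMNB().fit(
    ["great win tonight", "love this rally",
     "terrible awful debate", "hate this mess"],
    [1, 1, 0, 0],
)
positive_score = model.score("great rally")
negative_score = model.score("awful mess")
```

A text made only of unseen words falls back to the class prior, which is 0.5 for this balanced toy set.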

GEOMAPping MODULE

Mapping geotagged tweets to counties is important for faster processing:
- Web services are inefficient
- Compass uses high-precision GeoJSON
- A geo-tree based approach
  - country → state → county as nodes
  - Efficient computation

Slide24
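
A geo-tree lookup of the kind described for the geomapping module (country → state → county) might look like the sketch below. It assumes each node carries a simple polygon as a list of (lon, lat) vertices, uses a bounding-box check to prune before a ray-casting point-in-polygon test, and relies on made-up square regions; `GeoNode` and `locate` are hypothetical names, and real GeoJSON geometries (multipolygons, holes) would need more handling.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: count edge crossings of a ray going east."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge straddles the ray's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


class GeoNode:
    """A region (country, state, or county) with optional child regions."""

    def __init__(self, name, polygon, children=()):
        self.name = name
        self.polygon = polygon
        xs = [x for x, _ in polygon]
        ys = [y for _, y in polygon]
        self.bbox = (min(xs), min(ys), max(xs), max(ys))
        self.children = list(children)

    def contains(self, lon, lat):
        min_x, min_y, max_x, max_y = self.bbox
        if not (min_x <= lon <= max_x and min_y <= lat <= max_y):
            return False  # cheap rejection before the polygon test
        return point_in_polygon(lon, lat, self.polygon)


def locate(node, lon, lat):
    """Return the path of region names containing the point, root first."""
    if not node.contains(lon, lat):
        return []
    for child in node.children:
        path = locate(child, lon, lat)
        if path:
            return [node.name] + path
    return [node.name]


def square(x0, y0, x1, y1):
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


# Made-up regions: one country, two states, two counties in the first state.
county_a1 = GeoNode("county_a1", square(0, 0, 5, 5))
county_a2 = GeoNode("county_a2", square(0, 5, 5, 10))
state_a = GeoNode("state_a", square(0, 0, 5, 10), [county_a1, county_a2])
state_b = GeoNode("state_b", square(5, 0, 10, 10))
country = GeoNode("country", square(0, 0, 10, 10), [state_a, state_b])
path = locate(country, 2, 7)
```

Descending the tree means only the counties of one state are tested per tweet, instead of every county in the country.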

Bursty Event Detection

(Definition 1) Burst. A burst is a phenomenon identified when at least $n$ occurrences of the same event happen within time $\tau$, where $\tau$ is determined from the probability distribution of gaps between occurrences.

Slide25
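
Definition 1 can be read as a sliding-window check over an event's timestamps. The sketch below is one possible reading, with assumptions: the window is the closed interval [t - τ, t], and τ is passed in as a parameter rather than estimated from the gap distribution as the slides describe. `detect_bursts` is a hypothetical name.

```python
from collections import deque


def detect_bursts(timestamps, n, tau):
    """Return the timestamps at which at least n occurrences fall within
    the trailing window [t - tau, t]."""
    window = deque()
    burst_times = []
    for t in sorted(timestamps):
        window.append(t)
        while window[0] < t - tau:
            window.popleft()  # drop occurrences older than the window
        if len(window) >= n:
            burst_times.append(t)
    return burst_times


# Occurrences of one event; the cluster around 20..23 forms a burst.
bursts = detect_bursts([0, 10, 20, 21, 22, 23, 40], n=3, tau=3)
```

Each timestamp enters and leaves the deque once, so the check runs in linear time after sorting.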

Bursty Event Detection

$S_e = \{a_0, a_1, a_2, \ldots, a_{m-1}, a_m, \ldots\}$ is a data stream of microblog articles $a_i$ about the same event $e$, each with timestamp $t_i$.

We define the term surge $s_{t,\tau_e}$.

$\alpha$ is the time-variant smoothing parameter; $\alpha^*$ when $t - t_{i-1} = \nu_e$.

Slide26

Bursty Event Detection

$S_e$ is extracted from a stream $S$ that contains a mixture of microblog articles mentioning different events.

(Definition 2) Burstiness. We define the burstiness of event $e$ at time $t$ as the area under the curve of surge over time $\tau_e$.

$b_{t,\tau_e}$ is the basis of the bursty event detection module.

Slide27

Bursty Event Detection

Two types of query can be invoked:
- $Q_{e,[start,end]}$, where $e$ is an event type represented by a keyword and $[start,end]$ defines a query time window.
- $Q_{[start,end],\gamma}$, where $\gamma$ is a threshold.

Query 1 returns the $s_{t,\tau_e}$ and $b_{t,\tau_e}$ streams for $e$.

Query 2 returns all elements whose $b_{t,\tau_e}$ value has exceeded the threshold in the interval $[start,end]$.

Slide28
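
Burstiness (the area under the surge curve over a trailing window) and the two query types can be sketched as follows. Several things here are assumptions: the surge values are taken as given input samples, since the surge formula itself is not reproduced in this transcript; the area is approximated with the trapezoidal rule; each series is assumed sorted by time; and the names `burstiness`, `query_event`, and `query_threshold` are hypothetical.

```python
def burstiness(series, t, tau):
    """Trapezoidal area under (time, surge) samples within [t - tau, t].
    The series must be sorted by time."""
    pts = [(ts, s) for ts, s in series if t - tau <= ts <= t]
    return sum((t2 - t1) * (s1 + s2) / 2
               for (t1, s1), (t2, s2) in zip(pts, pts[1:]))


def query_event(surges, event, start, end, tau):
    """Query 1: (time, surge, burstiness) for one event in [start, end]."""
    series = surges[event]
    return [(ts, s, burstiness(series, ts, tau))
            for ts, s in series if start <= ts <= end]


def query_threshold(surges, start, end, gamma, tau):
    """Query 2: events whose burstiness exceeds gamma within [start, end]."""
    hits = []
    for event, series in surges.items():
        if any(burstiness(series, ts, tau) > gamma
               for ts, _ in series if start <= ts <= end):
            hits.append(event)
    return sorted(hits)


# Made-up surge samples for two event keywords.
surges = {
    "debate": [(0, 0), (1, 2), (2, 4), (3, 2), (4, 0)],
    "weather": [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)],
}
stream = query_event(surges, "debate", start=2, end=3, tau=2)
bursty = query_threshold(surges, start=0, end=4, gamma=5.0, tau=2)
```

In this toy data only "debate" crosses the threshold, because its surge peaks sharply while "weather" stays flat.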

Spatio-temporal Analytics SIMBA (SIGMOD 2016)

Compass provides fast, scalable, and high-throughput spatial and temporal queries and analytics for location-based services (LBS).

Compass integrates and uses Simba (Spatial In-Memory Big data Analytics):
- Scalable
- Efficient in-memory spatial query processing and analytics
- Meant for spatiotemporal data

Compass uses an HTTP RESTful API interface that interacts with Simba in SQL.

Interactive web visualization system:
- Built with D3.js for the election sentiment map
- http://www.estorm.org

Slide29

Experiment & Result

Slide30

Experiment and result

Tweet Classification: Political Model (POLM):
- POLM1: SVM based
- POLM2: LR based

Slide31

Experiment and result

Party Classification: Party Model (PARTYM):
- PARTYM1: SVM based
- PARTYM2: LR based
- A0: no keywords in training data
- A1: only Trump & Hillary in training data

Performance on crowdsourced labeled tweets and on the general testbed appears similar.

Slide32

Experiment and result

Sentiment Classification: Sentiment Model (SENT):
- SENT1: SVM based
- SENT2: LR based
- SENT3: MNB based
- SENT4: LSTM-RNN based
- SENT5: FastText based

1.6 million tweets from Stanford Twitter Sentiment (STS) were used for training and testing.

Slide33

Experiment and result

Bursty Event Detection:

Slide34

Experiment and result

Overall:
- Florida actual election result
- Florida sentiment analysis

Slide35

Experiment and result

Overall:
- California actual election result
- California sentiment analysis

Slide36

Conclusion

Slide37

Conclusion

- The US Election 2016 use case shows that our approach is quite successful in achieving what we aim for.
- People's reactions change with time, so temporal query analysis makes more sense.
- In terms of the election result: Twitter mostly reflects the voice of the young generation. Exit polls indicate that the older generation and non-graduates lean toward the Republican Party, and their percentage is statistically significant. (source: NY Times Exit Polls)
- Beyond the election, Compass is a useful and generic end-to-end framework for spatio-temporal sentiment analysis over any topic of interest.
- Compass provides a configurable framework for diverse applications.

Slide38

References

Slide39

References

[1] D. Agarwal and B.-C. Chen. fLDA: Matrix factorization through latent Dirichlet allocation. In WSDM, 2010.
[2] L. AlSumait, D. Barbará, and C. Domeniconi. On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. In ICDM. IEEE, 2008.
[3] A. Anandkumar, D. P. Foster, D. J. Hsu, S. M. Kakade, and Y.-K. Liu. A spectral algorithm for latent Dirichlet allocation. In NIPS, 2012.
[4] D. Anuta, J. Churchin, and J. Luo. Election bias: Comparing polls and Twitter in the 2016 US election. arXiv:1701.06232, 2016.
[5] AOL. 2016 presidential election timeline, 2016. [accessed 08-02-2017].
[6] D. M. Blei. Probabilistic topic models. CACM, 55(4):77–84, 2012.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. JMLR, 3(Jan):993–1022, 2003.
[8] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
[9] A. Bovet, F. Morone, and H. A. Makse. Predicting election trends with Twitter: Hillary Clinton versus Donald Trump. arXiv:1610.01587, 2016.
[10] M. Cataldi, L. Di Caro, and C. Schifanella. Emerging topic detection on Twitter based on temporal and social terms evaluation. In MDM/KDD, 2010.
[11] G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. In LATIN, 2004.
[12] C. N. Dos Santos and M. Gatti. Deep convolutional neural networks for sentiment analysis of short texts. In COLING, 2014.
[13] A. Duric and F. Song. Feature selection for sentiment analysis based on content and syntax models. Decision Support Systems, 53(4):704–711, 2011.
[14] A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han. Scalable topical phrase mining from text corpora. PVLDB, 8(3), 2014.
[15] A. Genkin, D. D. Lewis, and D. Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3), 2007.
[16] A. Go, R. Bhayani, and L. Huang. Twitter sentiment classification using distant supervision. CS224N Project, Stanford, 1(12), 2009.
[17] F. Godin, V. Slavkovikj, W. De Neve, B. Schrauwen, and R. Van de Walle. Using topic models for Twitter hashtag recommendation. In WWW, 2013.
[18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[19] T. Hofmann. Probabilistic latent semantic analysis. In Uncertainty in Artificial Intelligence, pages 289–296, 1999.
[20] IETF. RFC 7946 - The GeoJSON format, 2017. [accessed 08-Feb-2017].
[21] L. Jiang, M. Yu, M. Zhou, X. Liu, and T. Zhao. Target-dependent Twitter sentiment classification. In ACL HLT, pages 151–160.
[22] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In ECML, 1998.
[23] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759, 2016.
[24] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
[25] J. Kleinberg. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 7(4):373–397, 2003.

Slide40

References

[26] E. Kouloumpis, T. Wilson, and J. D. Moore. Twitter sentiment analysis: The good the bad and the OMG! ICWSM, 11(538-541), 2011.
[27] S. Lai, L. Xu, K. Liu, and J. Zhao. Recurrent convolutional neural networks for text classification. In AAAI, volume 333, pages 2267–2273, 2015.
[28] Q. Li, S. Shah, X. Liu, A. Nourbakhsh, and R. Fang. TweetSift: Tweet topic classification based on entity knowledge base and topic enhanced word embedding. In CIKM, 2016.
[29] R. Lu and Q. Yang. Trend analysis of news topics on Twitter. IJMLC, 2(3), 2012.
[30] A. McCallum, K. Nigam, et al. A comparison of event models for naive Bayes text classification. In AAAI, volume 752, pages 41–48, 1998.
[31] L. Medsker and L. C. Jain. Recurrent neural networks: Design and applications. CRC Press, 1999.
[32] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv:1301.3781, 2013.
[33] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
[34] D.-P. Nguyen, R. Gravel, R. Trieschnigg, and T. Meder. "How old do you think I am?" A study of language and age in Twitter. 2013.
[35] B. Pang, L. Lee, et al. Opinion mining and sentiment analysis. FTIR, 2(1–2):1–135, 2008.
[36] PRC. Demographics of social media users in 2016, 2016. [accessed 08-Feb-2017].
[37] D. A. Shamma, L. Kennedy, and E. F. Churchill. Peaks and persistence: Modeling the shape of microblog conversations. In CSCW, 2011.
[38] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2):267–307, 2011.
[39] D. Tang, B. Qin, and T. Liu. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, pages 1422–1432, 2015.
[40] N. Y. Times. Election 2016: Exit polls, 2016.
[41] S. Vosoughi, H. Zhou, and D. Roy. Enhanced Twitter sentiment classification using contextual information. arXiv:1605.05195, 2016.
[42] C. Wang and D. M. Blei. Collaborative topic modeling for recommending scientific articles. In SIGKDD, 2011.
[43] Z. Wei, G. Luo, K. Yi, X. Du, and J.-R. Wen. Persistent data sketching. In SIGMOD, 2015.
[44] Wikipedia. Swift gamma-ray burst mission - Wikipedia, the free encyclopedia, 2016. [accessed 08-Feb-2017].
[45] D. Xie, F. Li, B. Yao, G. Li, L. Zhou, and M. Guo. Simba: Efficient in-memory spatial analytics. In SIGMOD, 2016.
[46] W. Xie, F. Zhu, J. Jiang, E.-P. Lim, and K. Wang. TopicSketch: Real-time bursty topic detection from Twitter. In ICDM, pages 837–846, 2013.
[47] W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing Twitter and traditional media using topic models. In ECIR, 2011.
[48] C. Zhou, C. Sun, Z. Liu, and F. Lau. A C-LSTM neural network for text classification. arXiv:1511.08630, 2015.
[49] Y. Zhu and D. Shasha. Efficient elastic burst detection in data streams. In SIGKDD, 2003.

Slide41

Questions?

Slide42

Thanking You
