Slide 1: Sentiment Analysis of Social Media Content Using N-Gram Graphs

Authors: Fotis Aisopos, George Papadakis, Theodora Varvarigou
Presenter: Konstantinos Tserpes
National Technical University of Athens, Greece
Slide 2: Social Media and Sentiment Analysis

Social networks enable users to:
- Chat about everyday issues
- Exchange political views
- Evaluate services and products
Estimating the average sentiment for a topic is useful (e.g., for social analysts).
Sentiments are expressed:
- Implicitly (e.g., through emoticons or specific words)
- Explicitly (e.g., via the "Like" button in Facebook)
In this work we focus on content-based patterns for detecting sentiments.
30/11/2011
2
International ACM Workshop on Social Media (WSM11)
Slide 3: Intricacies of Social Media Content

Inherent characteristics that render established, language-specific methods inapplicable:
- Sparsity: each message comprises just 140 characters on Twitter
- Multilinguality: many different languages and dialects
- Non-standard vocabulary: informal textual content (i.e., slang) and neologisms (e.g., "gr8" instead of "great")
- Noise: misspelled words and incorrect use of phrases
Solution: a language-neutral method that is robust to noise.
Slide 4: Focus on Twitter

We selected the Twitter micro-blogging service due to its:
- Popularity (200 million users, 1 billion posts per week)
- Strict rules of social interaction (i.e., sentiments are expressed through short, self-contained text messages)
- Publicly available data, accessible through a handy API
Slide 5: The Polarity Classification Problem

- Polarity: the expression of a non-neutral sentiment
- Polarized tweets: tweets that express either a positive or a negative sentiment (polarity is explicitly denoted by the respective emoticons)
- Neutral tweets: tweets lacking any polarity indicator
- Binary polarity classification: decide the polarity of a tweet on a binary scale (i.e., negative or positive)
- General polarity classification: decide the polarity of a tweet on a three-value scale (i.e., negative, positive, or neutral)
Slide 6: Representation Model 1: Term Vector Model

Aggregates the set of distinct words (i.e., tokens) contained in a set of documents. Each tweet t_i is then represented as a vector v_ti = (v_1, v_2, ..., v_j), where v_j is the TF-IDF value of the j-th term. The same model applies to polarity classes.
Drawbacks:
- It requires language-specific techniques to correctly identify semantically equivalent tokens (e.g., stemming, lemmatization, part-of-speech tagging)
- High dimensionality
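The term vector model can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's pipeline: it tokenizes by whitespace only and uses a simple log IDF, precisely skipping the language-specific normalization (stemming, lemmatization) that the slide lists as a drawback.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF term vectors for whitespace-tokenized documents.
    Minimal sketch of the term vector model: no stemming/lemmatization,
    which is exactly the language-specific step the slides criticize."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # document frequency of each distinct token
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vocab = sorted(df)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors(["great phone", "great service", "bad phone"])
```

Note how the vector dimensionality equals the vocabulary size, which for millions of tweets becomes the "high dimensionality" drawback mentioned above.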
Slide 7: Representation Model 2: Character n-grams

Each document and polarity class is represented as the set of substrings of length n of the original text (for n = 2: bigrams; n = 3: trigrams; n = 4: four-grams).
Example: "home phone" consists of the following trigrams: {hom, ome, "me ", "e p", " ph", pho, hon, one}.
Advantages: language-independent method.
Disadvantages: high dimensionality.
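Extracting the character n-gram set is a one-liner; the sketch below treats a document as the set of its length-n substrings, as the slide describes.

```python
def char_ngrams(text, n):
    """Return the set of character n-grams of `text`:
    all substrings of length n (a document is modeled as this set)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

trigrams = char_ngrams("home phone", 3)
```

Because spaces count as characters, "me " and " ph" are legitimate trigrams, which is part of what makes the method language-independent.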
Slide 8: Representation Model 3: Character n-gram Graphs

Each document and polarity class is represented as a graph, where:
- the nodes correspond to character n-grams,
- the undirected edges connect neighboring n-grams (i.e., n-grams that co-occur in at least one window of n characters), and
- the weight of an edge denotes the co-occurrence rate of the adjacent n-grams.
Typical values for n: n = 2 (bigram graphs), n = 3 (trigram graphs), and n = 4 (four-gram graphs).
Slide 9: Example of n-gram Graphs

The phrase "home_phone" is represented as follows: [graph figure not preserved in this extraction]
Slide 10: Features of the n-gram Graphs Model

To capture textual patterns, n-gram graphs rely on the following graph similarity metrics (computed between the polarity class graphs and the tweet graphs):
- Containment Similarity (CS): portion of common edges, regardless of their weights
- Size Similarity (SS): ratio of the sizes of the two graphs
- Value Similarity (VS): portion of common edges, taking their weights into account
- Normalized Value Similarity (NVS): value similarity without the effect of the relative graph size (i.e., NVS = VS/SS)
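These four metrics can be computed directly from two edge-weight dictionaries. The formulas below follow a common formulation of the n-gram graph framework (common edges normalized by the smaller graph for CS, min/max weight ratios summed for VS); the paper's exact definitions may differ in detail.

```python
def graph_similarities(g1, g2):
    """Similarity metrics between two n-gram graphs given as
    {edge: weight} dicts, following one common formulation:
    CS ignores weights, SS compares sizes, VS uses weight ratios."""
    common = set(g1) & set(g2)
    cs = len(common) / min(len(g1), len(g2))
    ss = min(len(g1), len(g2)) / max(len(g1), len(g2))
    vs = sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common) \
        / max(len(g1), len(g2))
    nvs = vs / ss  # value similarity with the size effect divided out
    return {"CS": cs, "SS": ss, "VS": vs, "NVS": nvs}

sims = graph_similarities(
    {frozenset(("a", "b")): 2, frozenset(("b", "c")): 1},
    {frozenset(("a", "b")): 1},
)
```

In the toy call above, the single common edge gives CS = 1.0, while the size mismatch pulls SS down to 0.5 and the weight mismatch pulls VS to 0.25.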
Slide 11: Feature Extraction

- Create G_pos and G_neg (and G_neu) by aggregating half of the training tweets of the respective polarity.
- For each tweet of the remaining training set:
  - create its n-gram graph G_ti
  - derive a feature "vector" from the graph comparisons
- Apply the same procedure to the testing tweets.
Slide 12: Discretized Graph Similarities

Discretized similarity values offer higher classification efficiency. They are created according to the following function: [function definition not preserved in this extraction]
Binary classification has three nominal features:
- dsim(CS_neg, CS_pos)
- dsim(NVS_neg, NVS_pos)
- dsim(VS_neg, VS_pos)
General classification has six more nominal features:
- dsim(CS_neg, CS_neu), dsim(NVS_neg, NVS_neu), dsim(VS_neg, VS_neu)
- dsim(CS_neu, CS_pos), dsim(NVS_neu, NVS_pos), dsim(VS_neu, VS_pos)
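The slide's dsim definition did not survive extraction, so the sketch below is a hypothetical discretization, consistent with the feature names above: it maps a pair of class similarities to a nominal value recording which class graph is more similar to the tweet.

```python
def dsim(sim_a, sim_b, names=("neg", "pos")):
    """Hypothetical discretization (the slide's actual function was
    lost): emit a nominal feature naming the class whose graph is
    more similar to the tweet, or 'equal' on a tie."""
    if sim_a > sim_b:
        return names[0]
    if sim_b > sim_a:
        return names[1]
    return "equal"
```

A nominal output like this is what makes the features cheap for classifiers such as C4.5 to split on, which is the efficiency gain the slide refers to.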
Slide 13: Data Set

Initial dataset: 475 million real tweets, posted by 17 million users.
Polarized tweets:
- 6.12 million negative
- 14.12 million positive
Data set for binary polarity classification: random selection of 1 million tweets from each polarity category.
Data set for general polarity classification: the above, plus a random selection of 1 million neutral tweets.
Slide 14: Experimental Setup

- 10-fold cross-validation
- Classification algorithms (default configuration of Weka): Naive Bayes Multinomial (NBM) and the C4.5 decision tree classifier
- Effectiveness metric: classification accuracy (correctly_classified_documents / all_documents)
- Frequency threshold for the term vector and n-grams models: only features that appear in at least 1% of all documents were considered
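The evaluation protocol (10-fold cross-validation with accuracy as the metric) can be sketched without Weka. The `train_and_predict` callback and the deterministic modulo fold split below are stand-ins for illustration, not the authors' code.

```python
def kfold_accuracy(instances, labels, train_and_predict, k=10):
    """Sketch of k-fold cross-validated accuracy
    (correctly classified / all). `train_and_predict(train, test)`
    stands in for a Weka classifier; folds are split by index modulo k
    for determinism."""
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(len(instances)) if i % k == fold]
        train = [(instances[i], labels[i])
                 for i in range(len(instances)) if i % k != fold]
        preds = train_and_predict(train, [instances[i] for i in test_idx])
        correct += sum(p == labels[i] for p, i in zip(preds, test_idx))
    return correct / len(instances)

def majority(train, test):
    """Toy baseline: always predict the majority training label."""
    from collections import Counter
    label = Counter(l for _, l in train).most_common(1)[0][0]
    return [label] * len(test)

acc = kfold_accuracy(list(range(20)), ["pos"] * 15 + ["neg"] * 5, majority)
```

On the toy 15/5 label split, the majority baseline scores 0.75, showing why accuracy alone should be read against the class balance of the data set.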
Slide 15: Evaluation Results

- n-grams outperform the vector model for n = 3 and n = 4 in all cases (language-neutral, noise-tolerant)
- n-gram graphs: low accuracy with NBM, higher values overall with C4.5
- Incrementing n by 1 increases performance by 3%-4%
Slide 16: Efficiency Performance Analysis

- n-grams involve by far the largest set of features -> high computational load
- Four-grams yield fewer features than trigrams (their numerous substrings are rather rare)
- n-gram graphs: significantly fewer features in all cases (<10) -> much higher classification efficiency!
Slide 17: Improvements (Work Under Submission)

- We lowered the frequency threshold to 0.1% for tokens and n-grams, to increase the performance of the term vector and n-grams models (at the cost of even lower efficiency).
- We included in the training stage the tweets that were used for building the polarity classes.
Outcomes:
- Higher performance for all methods
- n-gram graphs again outperform all other models
- Accuracy reaches significantly higher values (>95%)
Slide 18: Thank You!
SocIoS project: www.sociosproject.eu