
Presentation Transcript

Slide1

Sentiment Analysis of Social Media Content using N-Gram Graphs

Authors: Fotis Aisopos, George Papadakis, Theodora Varvarigou

Presenter: Konstantinos Tserpes

National Technical University of Athens, Greece

International ACM Workshop on Social Media (WSM11), 30/11/2011

Slide2

Social Media and Sentiment Analysis

Social Networks enable users to:
- Chat about everyday issues
- Exchange political views
- Evaluate services and products
Useful to estimate the average sentiment for a topic (e.g. for social analysts). Sentiments are expressed:
- Implicitly (e.g. through emoticons, specific words)
- Explicitly (e.g. the “Like” button on Facebook)
In this work we focus on content-based patterns for detecting sentiments.


Slide3

Intricacies of Social Media Content

Inherent characteristics render established, language-specific methods inapplicable:
- Sparsity: each message comprises at most 140 characters on Twitter
- Multilinguality: many different languages and dialects
- Non-standard Vocabulary: informal textual content (i.e., slang) and neologisms (e.g. “gr8” instead of “great”)
- Noise: misspelled words and incorrect use of phrases
Solution: a language-neutral method that is robust to noise.


Slide4

Focus on Twitter

We selected the Twitter micro-blogging service due to:
- Popularity (200 million users, 1 billion posts per week)
- Strict rules of social interaction (i.e., sentiments are expressed through short, self-contained text messages)
- Data publicly available through a handy API


Slide5

Polarity Classification Problem

Polarity: the expression of a non-neutral sentiment.
- Polarized tweets: tweets that express either a positive or a negative sentiment (polarity is explicitly denoted by the respective emoticons)
- Neutral tweets: tweets lacking any polarity indicator
Binary Polarity Classification: decide the polarity of a tweet on a binary scale (i.e., negative or positive).
General Polarity Classification: decide the polarity of a tweet on a three-value scale (i.e., negative, positive or neutral).


Slide6

Representation Model 1: Term Vector Model

Aggregates the set of distinct words (i.e., tokens) contained in a set of documents. Each tweet ti is then represented as a vector vti = (v1, v2, ..., vj), where vj is the TF-IDF value of the j-th term. The same model applies to polarity classes (a sketch follows below).
Drawbacks:
- It requires language-specific techniques that correctly identify semantically equivalent tokens (e.g., stemming, lemmatization, P-o-S tagging)
- High dimensionality
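A minimal sketch of this model, using scikit-learn's TfidfVectorizer as an illustration; the toy tweets and the default word tokenization are assumptions, not the authors' preprocessing:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["great phone, loving it", "worst phone ever"]  # toy data

vectorizer = TfidfVectorizer()        # aggregates the distinct tokens
X = vectorizer.fit_transform(tweets)  # one TF-IDF vector v_ti per tweet

print(vectorizer.get_feature_names_out())  # the term dimensions
print(X.toarray())                         # TF-IDF weights v_1 ... v_j
```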


Slide7

Representation Model 2: Character n-grams

Each document and polarity class is represented as the set of substrings of length n of the original text.
- For n = 2: bigrams, n = 3: trigrams, n = 4: four-grams
- Example: “home phone” consists of the following tri-grams (with “_” marking the space, as on the next slide): {hom, ome, me_, e_p, _ph, pho, hon, one} (see the sketch below)
Advantages: language-independent method.
Disadvantages: high dimensionality.
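A minimal sketch of this representation (the char_ngrams helper is illustrative):

```python
def char_ngrams(text: str, n: int) -> set[str]:
    """Return the set of character n-grams (all substrings of length n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

print(char_ngrams("home phone", 3))
# {'hom', 'ome', 'me ', 'e p', ' ph', 'pho', 'hon', 'one'}
```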


Slide8

Representation Model 3: Character n-gram graphs

Each document and polarity class is represented as a graph, where:
- the nodes correspond to character n-grams,
- the undirected edges connect neighboring n-grams (i.e., n-grams that co-occur in at least one window of n characters), and
- the weight of an edge denotes the co-occurrence rate of the adjacent n-grams.
Typical value space for n: n = 2 (i.e., bigram graphs), n = 3 (i.e., trigram graphs), and n = 4 (i.e., four-gram graphs). A construction sketch follows the example on the next slide.


Slide9

Example of n-gram graphs

The phrase “home_phone” is represented as follows:
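A minimal construction sketch for this example, assuming that edges link n-grams whose occurrences start within n characters of each other and that weights count raw co-occurrences; the exact window semantics of the original framework may differ:

```python
from collections import defaultdict

def ngram_graph(text: str, n: int) -> dict:
    """Build an n-gram graph: undirected edges between n-grams that
    co-occur within a window of n characters, weighted by count."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = defaultdict(int)
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + n + 1, len(grams))):
            edges[frozenset((g, grams[j]))] += 1  # undirected edge
    return edges

for edge, weight in ngram_graph("home_phone", 3).items():
    print(sorted(edge), weight)
```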


Slide10

Features of the n-gram graphs model

To capture textual patterns, n-gram graphs rely on the following graph similarity metrics, computed between the polarity class graphs and the tweet graphs (a sketch follows the list):
- Containment Similarity (CS): portion of common edges, regardless of their weights
- Size Similarity (SS): ratio of the sizes of the two graphs
- Value Similarity (VS): portion of common edges, taking into account their weights
- Normalized Value Similarity (NVS): value similarity without the effect of the relative graph size (i.e., NVS = VS/SS)
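A sketch of the four metrics over two weighted edge sets (as produced by the ngram_graph sketch above); the formulas follow the standard n-gram graph definitions that this work builds on, so treat the exact normalizations as assumptions:

```python
def containment_similarity(g1, g2):
    # CS: portion of common edges, ignoring weights
    return len(g1.keys() & g2.keys()) / min(len(g1), len(g2))

def size_similarity(g1, g2):
    # SS: ratio of the sizes of the two graphs
    return min(len(g1), len(g2)) / max(len(g1), len(g2))

def value_similarity(g1, g2):
    # VS: common edges, each weighted by the ratio of its two weights
    common = g1.keys() & g2.keys()
    return sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common) / max(len(g1), len(g2))

def normalized_value_similarity(g1, g2):
    # NVS = VS / SS: value similarity without the relative-size effect
    return value_similarity(g1, g2) / size_similarity(g1, g2)
```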


Slide11

Feature Extraction

1. Create Gpos and Gneg (and Gneu) by aggregating half of the training tweets with the respective polarity.
2. For each tweet of the remaining training set:
   - create the tweet n-gram graph Gti
   - derive a feature “vector” from the graph comparisons (see the sketch below)
3. The same procedure applies to the testing tweets.
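A sketch of this procedure, reusing ngram_graph and the similarity helpers from the previous sketches; merge_graphs and all variable names are illustrative, not the authors' code:

```python
def merge_graphs(texts, n=3):
    """Aggregate a set of tweets into one polarity class graph
    by summing the edge weights of the individual tweet graphs."""
    merged = {}
    for text in texts:
        for edge, w in ngram_graph(text, n).items():
            merged[edge] = merged.get(edge, 0) + w
    return merged

def feature_vector(tweet, g_neg, g_pos, n=3):
    """Compare the tweet graph against each polarity class graph."""
    g_t = ngram_graph(tweet, n)
    return [containment_similarity(g_t, g_neg), containment_similarity(g_t, g_pos),
            value_similarity(g_t, g_neg), value_similarity(g_t, g_pos),
            normalized_value_similarity(g_t, g_neg), normalized_value_similarity(g_t, g_pos)]

# g_neg = merge_graphs(first_half_of_negative_tweets)
# g_pos = merge_graphs(first_half_of_positive_tweets)
# X = [feature_vector(t, g_neg, g_pos) for t in remaining_training_tweets]
```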


Slide12

Discretized Graph Similarities

Discretized similarity values offer higher classification efficiency. They are created according to a discretization function dsim (sketched after the feature lists below).
Binary classification has three nominal features:
- dsim(CSneg, CSpos)
- dsim(NVSneg, NVSpos)
- dsim(VSneg, VSpos)
General classification has six more nominal features:
- dsim(CSneg, CSneu) and dsim(CSneu, CSpos)
- dsim(NVSneg, NVSneu) and dsim(NVSneu, NVSpos)
- dsim(VSneg, VSneu) and dsim(VSneu, VSpos)
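The discretization formula itself is not reproduced on the slide; a plausible sketch, assuming dsim maps a pair of class similarities to a nominal label naming the class with the larger value:

```python
def dsim(sim_a: float, sim_b: float, label_a: str, label_b: str) -> str:
    """Hypothetical discretization: emit the label of the larger similarity."""
    if sim_a > sim_b:
        return label_a
    if sim_b > sim_a:
        return label_b
    return "equal"

# e.g. dsim(cs_neg, cs_pos, "negative", "positive") yields one of the
# three nominal features used in binary classification.
```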


Slide13

Data set

Initial dataset: 475 million real tweets, posted by 17 million users.
Polarized tweets:
- 6.12 million negative
- 14.12 million positive
Data set for Binary Polarity Classification: random selection of 1 million tweets from each polarity category.
Data set for General Polarity Classification: the above, plus a random selection of 1 million neutral tweets.


Slide14

Experimental Setup

- 10-fold cross-validation
- Classification algorithms (default configuration of Weka; see the sketch below):
  - Naive Bayes Multinomial (NBM)
  - C4.5 decision tree classifier
- Effectiveness metric: classification accuracy (correctly_classified_documents / all_documents)
- Frequency threshold for the term vector and n-grams models: only features that appear in at least 1% of all documents were considered
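A sketch of this protocol with scikit-learn stand-ins: MultinomialNB for Weka's NBM, and DecisionTreeClassifier (a CART learner) standing in for C4.5/J48; X and y are placeholders for the extracted features and polarity labels:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 3))     # placeholder feature vectors (non-negative, as NBM requires)
y = rng.integers(0, 2, 1000)  # placeholder polarity labels

for clf in (MultinomialNB(), DecisionTreeClassifier()):
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(type(clf).__name__, round(scores.mean(), 3))
```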


Slide15

Evaluation results

- n-grams outperform the Vector Model for n = 3 and n = 4 in all cases (language-neutral, noise-tolerant)
- n-gram graphs: low accuracy for NBM, higher values overall for C4.5
- incrementing n by 1 increases performance by 3%-4%


Slide16

Efficiency Performance Analysis

- n-grams involve by far the largest set of features -> high computational load
- four-grams: fewer features than trigrams (their numerous substrings are rather rare)
- n-gram graphs: significantly fewer features in all cases (< 10) -> much higher classification efficiency!


Slide17

Improvements (work under submission)

- We lowered the frequency threshold to 0.1% for tokens and n-grams, to increase the performance of the term vector and n-grams models (at the cost of even lower efficiency).
- We included in the training stage the tweets that were used for building the polarity classes.
Outcomes:
- Higher performance for all methods
- N-gram graphs again outperform all other models
- Accuracy reaches significantly higher values (> 95%)


Slide18

Thank you!


SocIoS project: www.sociosproject.eu