411667 - PowerPoint Presentation

sherrill-nordquist
Uploaded On 2016-07-19

Presentation Transcript

Slide1

Team: Priya Iyer, Vaidy Venkat, Sonali Sharma
Mentor: Andy Schlaikjer

Twist: User Timeline Tweets Classifier

Slide2

Goal

Auto classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology
Input: user timeline tweets
Output: list of auto classified tweets

Slide3

Rationale

Twitter allows users to create custom Friend Lists based on the user handles.

Slide4

Rationale (contd.)

Our application is a twist on this functionality of Twitter where we auto classify tweets on the user’s timeline based on just the occurrence of terms in the tweet.

Slide5

Approach

Step 1: Data collection
Step 2: Text mining
Step 3: Creation of the training file for the library
Step 4: Evaluation of several classifiers
Step 5: Selecting the best classifier
Step 6: Validating the classification
Step 7: Tuning the parameters
Step 8: Repeat until correct classification

Slide6

Text Mining Process

Remove special characters
Tokenize
Remove redundant letters in words
Spell check
Stemming
Language identification
Remove stop words
Generate bigrams and change to lower case

Slide7
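The steps listed under Text Mining Process can be sketched in Python roughly as follows. This is a minimal illustration using only the standard library, not the team's actual code: the stop-word list is hypothetical, spell checking and stemming are left as pluggable placeholder functions (the slides mention an NLP toolkit and Porter's stemming for those), language identification is omitted, and lower-casing is done early rather than at the end.

```python
import re

# Hypothetical stop-word list; the slides do not say which list was used.
STOP_WORDS = {"go", "such", "an", "a", "the", "me"}


def strip_special_chars(text):
    """Replace anything that is not an ASCII letter, digit or whitespace with a space."""
    return re.sub(r"[^A-Za-z0-9\s]", " ", text)


def squeeze_repeats(token):
    """Collapse runs of three or more identical characters to a single one."""
    return re.sub(r"(.)\1{2,}", r"\1", token)


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


def preprocess(text, spell=lambda t: t, stem=lambda t: t):
    """Clean one tweet and return (tokens, bigrams).

    `spell` and `stem` are placeholders (identity by default) standing in
    for a real spell checker and stemmer.
    """
    tokens = strip_special_chars(text).lower().split()
    tokens = [stem(spell(squeeze_repeats(t))) for t in tokens]
    # Drop stop words and single-character leftovers such as the "m" from "\m/".
    tokens = [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]
    return tokens, bigrams(tokens)


tweet = r"Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D"
tokens, pairs = preprocess(tweet)
print(tokens)  # ['sf', 'giants', 'amaazing', 'feelin']

# A toy lookup table standing in for a spell checker (an assumption for this example).
fixes = {"amaazing": "amazing", "feelin": "feeling"}
print(preprocess(tweet, spell=lambda t: fixes.get(t, t))[0])
# ['sf', 'giants', 'amazing', 'feeling']
```

Squeezing repeated letters only gets "amaazzzing" as far as "amaazing", which is why a spell-check step is still needed afterwards.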

Example: Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D

Stopwords: SF Giants! amaazzzing feelin’!!!! \/ :D
Special chars: SF Giants amaazzzing feelin
Spell check: SF Giants amazing feeling
Stemming: SF Giants amazing feel me
Stopwords: SF Giants amazing feel

Slide8

Choice of ML technique

Logistic Regression Classifier
Reasons:
Most popular linear classification technique for text classification
Ability to handle multiple categories with ease
Gave the best cross-validation accuracy and precision-recall score
Library: LIBLINEAR for Python

Slide9

Creation of LIBLINEAR training input

Text: SF Giants amazing feel
Indexing: SF-1, Giants-2, amazing-3, feel-4
Boolean: SF-1 (1), Giants-2 (1), amazing-3 (1), feel-4 (1)
Training input for the SVM: 1 1:1 2:1 3:1 4:1

Slide10
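The indexing and Boolean encoding shown on the Creation of LIBLINEAR training input slide can be sketched as below. The helper names `build_vocabulary` and `to_liblinear_line` are made up for this illustration. The label 1 is used on the assumption that it stands for sports, as in the backup slides. LIBLINEAR's sparse format expects feature indices in ascending order, which the sketch enforces by sorting.

```python
def build_vocabulary(documents):
    """Assign each distinct term a 1-based index in order of first appearance."""
    vocab = {}
    for tokens in documents:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def to_liblinear_line(label, tokens, vocab):
    """Format one tweet as '<label> <index>:1 ...' using Boolean term presence."""
    # A set removes duplicate terms; sorting gives the ascending index order.
    indices = sorted({vocab[t] for t in tokens if t in vocab})
    features = " ".join(f"{i}:1" for i in indices)
    return f"{label} {features}".rstrip()


tokens = ["SF", "Giants", "amazing", "feel"]
vocab = build_vocabulary([tokens])
line = to_liblinear_line(1, tokens, vocab)
print(line)  # 1 1:1 2:1 3:1 4:1
```

Terms that are not in the vocabulary are silently skipped here, which is one possible way to handle unseen words at test time.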

Demo

Slide11

THANK YOU

Andy, Marti & the Twitter Team

Slide12

Questions?

Slide13

Data Collection Challenges – Backup Slides

Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”
Tweets were not purely “Sports” or “Business” related
Personal messages were prominent
Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly

Slide14
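The data collection solution (comparing tweets against a corpus of sports/business terms and assigning weights) might look something like the sketch below. The slides do not describe the actual weighting scheme, so everything here is an assumption: the term lists, the weight (the fraction of a tweet's tokens found in a category's list), and the 0.2 threshold are all hypothetical.

```python
# Hypothetical term lists; the real corpus is not given in the slides.
TERM_LISTS = {
    "sports": {"giants", "game", "score", "team"},
    "business": {"stock", "market", "earnings", "shares"},
}


def category_weights(tokens, term_lists):
    """Return, per category, the fraction of tokens found in that category's term list."""
    total = len(tokens) or 1  # avoid division by zero for empty tweets
    return {
        category: sum(t in terms for t in tokens) / total
        for category, terms in term_lists.items()
    }


def assign_label(tokens, term_lists, threshold=0.2):
    """Pick the highest-weighted category, or None if no weight reaches the threshold."""
    weights = category_weights(tokens, term_lists)
    best = max(weights, key=weights.get)
    return best if weights[best] >= threshold else None


print(assign_label(["giants", "win", "game"], TERM_LISTS))     # sports
print(assign_label(["happy", "birthday", "mom"], TERM_LISTS))  # None
```

Returning `None` for low-weight tweets is one way to filter out the personal messages mentioned above before building the training file.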

Text Mining Challenges

Noise in the data:
Tweets are in inconsistent format
Lots of meaningless words
Misspellings
More of individual expression
For example: BAAAAAAAAAAAASSKEttt!!!!, bskball, futball, %, :D, \m/, ^, xoxo
Solution: Regular expressions and NLP toolkit

Different words, same root:
Playing, plays, playful -> play
Solution: Stemming

Slide15

Sample LIBLINEAR input format (Train)

Slide16

LIBLINEAR output for a test file of 20 tweets

Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets
Output: comma-separated values of the category assigned to each tweet
Accuracy here is 94%. Precision: 0.89. Recall: 0.89
Experiment with different kernels for better accuracy

Slide17
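The accuracy, precision and recall figures reported for the 20-tweet test file can be computed from true and predicted category labels along these lines. The label lists below are made-up illustration data, not the team's results, and the slides do not say how precision and recall were averaged across the four categories, so this sketch reports them per category.

```python
def accuracy(y_true, y_pred):
    """Fraction of tweets whose predicted category matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one category label."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # tweets assigned this label
    actual = sum(t == label for t in y_true)     # tweets truly in this label
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Made-up labels: 1 = sports, 2 = finance, 3 = entertainment, 4 = technology.
y_true = [1, 1, 2, 2, 3, 4]
y_pred = [1, 2, 2, 2, 3, 4]

print(accuracy(y_true, y_pred))             # 5 of 6 correct
print(precision_recall(y_true, y_pred, 2))  # precision 2/3, recall 1.0
```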

Summary: Data Source/Software/Tools

Category-based tweets from https://twitter.com/i/#!/who_to_follow/interests
Coding done in Python
Database – sqlite3
ML tool – LIBSVM
Stemming – Porter’s stemming
NLP toolkit
