Twist: User Timeline Tweets Classifier. Team: Priya Iyer, Vaidy Venkat, Sonali Sharma. Mentor: Andy Schlaikjer. Goal: auto-classify tweets on the user's timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology.
Slide1
Team: Priya Iyer, Vaidy Venkat, Sonali Sharma
Mentor: Andy Schlaikjer
Twist: User Timeline Tweets Classifier
Slide2
Goal
Auto-classify tweets on the user’s timeline into 4 predefined categories: Sports, Finance, Entertainment, Technology
Input: user timeline tweets
Output: list of auto-classified tweets
Slide3
Rationale
Twitter allows users to create custom Friend Lists based on user handles.
Slide4
Rationale (contd.)
Our application is a twist on this Twitter functionality: we auto-classify tweets on the user’s timeline based solely on the occurrence of terms in each tweet.
Slide5
Approach
Step 1: Data collection
Step 2: Text mining
Step 3: Creation of the training file for the library
Step 4: Evaluation of several classifiers
Step 5: Selecting the best classifier
Step 6: Validating the classification
Step 7: Tuning the parameters
Step 8: Repeat until the classification is correct
Slide6
Text Mining Process
Remove special characters
Tokenize
Remove redundant letters in words
Spell check
Stemming
Language identification
Remove stop words
Generate bigrams and change to lower case
Slide7
Example:
Input tweet: Go SF Giants! Such an amaazzzing feelin’!!!! \m/ :D
After stop word removal: SF Giants! amaazzzing feelin ’!!!! \/ :D
After special character removal: SF Giants amaazzzing feelin
After spell check: SF Giants amazing feeling
After stemming: SF Giants amazing feel me
After a second stop word pass: SF Giants amazing feel
Slide8
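A minimal sketch of the stop word and bigram steps from the "Text Mining Process" slide, applied to the example tweet. The stop word list and the helper names `remove_stop_words` and `bigrams` are assumptions made for illustration; the slides do not say which list or toolkit functions the team used.

```python
# Hypothetical stop word list; a real run would likely use a fuller list
# from an NLP toolkit.
STOP_WORDS = {"go", "such", "an", "a", "the", "me", "is", "to"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form appears in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def bigrams(tokens):
    """Return adjacent token pairs, lowercased and joined by a space."""
    lowered = [t.lower() for t in tokens]
    return [f"{a} {b}" for a, b in zip(lowered, lowered[1:])]


tokens = remove_stop_words("Go SF Giants Such an amazing feel me".split())
print(tokens)           # ['SF', 'Giants', 'amazing', 'feel']
print(bigrams(tokens))  # ['sf giants', 'giants amazing', 'amazing feel']
```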
Choice of ML technique
Logistic Regression Classifier
Reasons:
  Most popular linear classification technique for text classification
  Ability to handle multiple categories with ease
  Gave the best cross-validation accuracy and precision-recall score
Library: LIBLINEAR for Python
Slide9
Creation of LIBLINEAR training input
Processed tweet: SF Giants amazing feel
Indexing: SF - 1, Giants - 2, amazing - 3, feel - 4
Boolean: SF-1 (1), Giants-2 (1), amazing-3 (1), feel-4 (1)
Training input for the SVM: 1 1:1 2:1 3:1 4:1
Slide10
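A minimal sketch of the indexing and Boolean encoding from the "Creation of LIBLINEAR training input" slide. It assumes terms are numbered from 1 in first-seen order and that each present term gets the value 1; the function names `build_index` and `to_liblinear_line` are hypothetical and not taken from the project.

```python
def build_index(documents):
    """Assign each distinct term a 1-based feature index in first-seen order."""
    index = {}
    for tokens in documents:
        for term in tokens:
            if term not in index:
                index[term] = len(index) + 1
    return index


def to_liblinear_line(label, tokens, index):
    """Format one document as 'label idx:1 idx:1 ...' using Boolean features.

    Indices are emitted in ascending order, as the sparse input format expects.
    Terms missing from the index are skipped.
    """
    feature_ids = sorted({index[t] for t in tokens if t in index})
    return " ".join([str(label)] + [f"{i}:1" for i in feature_ids])


docs = [["SF", "Giants", "amazing", "feel"]]
index = build_index(docs)
print(to_liblinear_line(1, docs[0], index))  # 1 1:1 2:1 3:1 4:1
```

The leading 1 is the category label (sports), followed by one index:value pair per term present in the tweet.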
Demo
Slide11
THANK YOU
Andy, Marti & The Twitter Team
Slide12
Questions?
Slide13
Data Collection Challenges – Backup Slides
Collected >2000 tweets from the “Who to follow” interest lists on Twitter for “Sports” and “Business”
Tweets were not purely “Sports” or “Business” related
Personal messages were prominent
Solution: Compared against a corpus of sports/business related terms and assigned weights accordingly
Slide14
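A rough sketch of the weighting idea from the "Data Collection Challenges" slide: score each collected tweet against a list of category terms and keep it only if the score is high enough. The terms, weights, threshold, and function names are all assumptions; the slides do not describe the actual corpus or weighting scheme.

```python
# Hypothetical term weights; the real corpus and weights are not in the slides.
CATEGORY_TERMS = {
    "sports": {"game": 1.0, "giants": 2.0, "score": 1.5},
    "business": {"stock": 2.0, "market": 1.5, "earnings": 2.0},
}


def score_tweet(tokens, category):
    """Sum the weights of the category terms that occur in the tweet."""
    weights = CATEGORY_TERMS[category]
    return sum(weights.get(t.lower(), 0.0) for t in tokens)


def is_relevant(tokens, category, threshold=1.0):
    """Keep a tweet as training data only if its score reaches the threshold."""
    return score_tweet(tokens, category) >= threshold


print(score_tweet(["Giants", "score"], "sports"))    # 3.5
print(is_relevant(["happy", "birthday"], "sports"))  # False
```

Under this scheme a personal message with no category terms scores 0 and is filtered out of the training set.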
Text Mining Challenges
Noise in the data:
Tweets are in an inconsistent format
Lots of meaningless words
Misspellings
Heavy use of individual expression
For example: BAAAAAAAAAAAASSKEttt!!!!, bskball, futball, %, :D, \m/, ^, xoxo
Solution: Regular expressions and NLP toolkit
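A minimal sketch of regular-expression cleanup for this kind of noise, using only Python's standard `re` module. The specific rules (collapse runs of three or more identical characters, strip non-alphanumeric characters, drop single-character leftovers) and the name `clean_tweet` are assumptions, not the team's actual patterns.

```python
import re


def clean_tweet(text):
    """Lowercase a tweet, collapse stretched letters, and strip symbols."""
    text = text.lower()
    # Collapse runs of three or more identical characters to a single one.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Drop single-character leftovers such as the "d" from ":D".
    return [tok for tok in text.split() if len(tok) > 1]


print(clean_tweet(r"BAAAAAAAAAAAASSKEttt!!!! :D \m/ xoxo"))  # ['bassket', 'xoxo']
print(clean_tweet("amaazzzing feelin'!!!!"))                  # ['amaazing', 'feelin']
```

The output still contains misspellings such as "bassket" and "amaazing", which is why a spell check step is still needed after this kind of cleanup.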
Different words, same root
Playing, plays, playful -> play
Solution: Stemming
Slide15
Sample LIBLINEAR input format (Train)
Slide16
LIBLINEAR output for a test file of 20 tweets
Mixed bag of sports (=1), finance (=2), entertainment (=3) and technology (=4) tweets
Comma-separated values of the category assigned to each tweet
Accuracy here is 94%. Precision: 0.89. Recall: 0.89
Experiment with different kernels for better accuracy
Slide17
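A short sketch of how the accuracy, precision, and recall quoted on the "LIBLINEAR output" slide can be computed from true and predicted category labels. The labels below are toy data made up for illustration, not the project's 20-tweet test file, and the slides do not say how precision and recall were averaged across categories.

```python
def accuracy(y_true, y_pred):
    """Fraction of tweets whose predicted category matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one category label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy labels: 1 = sports, 2 = finance, 3 = entertainment, 4 = technology.
y_true = [1, 1, 2, 2, 3, 4]
y_pred = [1, 2, 2, 2, 3, 4]
print(accuracy(y_true, y_pred))
print(precision_recall(y_true, y_pred, label=2))
```

In the toy data one sports tweet is misclassified as finance, which lowers precision for finance and recall for sports.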
Summary: Data Source/Software/Tools
Category-based tweets from https://twitter.com/i/#!/who_to_follow/interests
Coding done in Python
Database – sqlite3
ML tool – LIBSVM
Stemming – Porter’s stemmer
NLP toolkit