Slide1
Breaking News
Exploring Israeli News Bias using Simple Textual Analysis
Yuval Pinter, Shuki Tausig, Oren Persico
Slide2
Motivation / Hypotheses
Media is biased
Israeli media is super-biased
Machine Learning detects bias
Headlines could be enough
"Headlines are journalism in its purest form" (Simon Jenkins, 1992)
Which is more significant: class bias or agenda bias?
The idea: classify the news outlet using basic features
Most of the “agenda bias” part will have to wait
No prior work AFAIK; closest field: authorship attribution
Slide3
Data
General news sites only
Homepage headlines only
Scraped in 15-minute intervals
July 2014 to May 2015
Most experiments on February data
Data and extraction code are available: github.com/yuvalpinter/MediaAnalysis
Slide4
Data Samples
Nov 23, 15:00:
Feb 15, 15:30:
Slide5
Text Processing
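Consecutive appearance de-duping
Tokenization (inc. lemmatization, affix deletion) using hspell (Har’el and Kenigsberg)
Mostly good, sometimes not so much
הפרלמנט הירדני עמד דקת דומייה לזכר המחבלים => פרלמנט ירדן עימד דקה דומייה זכר מחבל
(English: "The Jordanian parliament stood for a minute of silence in memory of the terrorists"; the verb עמד, "stood", is lemmatized as עימד, "typeset")
(NRG, 20/11/2014, 0:15)
רעידת אדמה קטלנית כאלף נהרגו בנפאל: "שעות קריטיות" => רעידה דימה קטלוניה אילף נהרג נפאל שעה קריטי
(English: "Deadly earthquake, about a thousand killed in Nepal: 'critical hours'"; אדמה, "earth", becomes דימה, "imagined"; קטלנית, "deadly", becomes קטלוניה, "Catalonia"; כאלף, "about a thousand", becomes אילף, "tamed")
(Mako, 25/4/2015, 19:30)
Slide6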
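The consecutive-appearance de-duping step listed on the Text Processing slide could look roughly like the sketch below. It assumes each site's scrapes are available as a chronologically ordered list of (timestamp, headline) pairs; the function name and data layout are illustrative and are not taken from the MediaAnalysis repository.

```python
from itertools import groupby


def dedupe_consecutive(snapshots):
    """Collapse runs of identical headlines seen in consecutive scrapes.

    `snapshots` is a chronologically ordered list of (timestamp, headline)
    pairs for a single site (an assumed layout). The first sighting of each
    run is kept; a headline that returns after a different one is kept again.
    """
    return [next(run) for _, run in groupby(snapshots, key=lambda s: s[1])]


snaps = [("15:00", "A"), ("15:15", "A"), ("15:30", "B"), ("15:45", "A")]
print(dedupe_consecutive(snaps))
```

Because only adjacent repeats are merged, a headline that alternates with another one survives as several separate items.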
Features
Form: character length, word count, word length (average/min/median/max), punctuation token count
Lexicon: quantile word/lemma frequencies (average/min/median/max)
  Wordlists (Hermit Dave), Israblog (Linzen 2009)
Morphology: affix letters
Word features
  Probably the media cycle
Features and extraction code are available
http://www.the7eye.org.il/50916
Slide7
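As a rough illustration of the "Form" feature group on the Features slide, the sketch below computes character length, word count, word-length statistics, and punctuation count for one headline. It uses a simple regular-expression tokenizer as a stand-in; the deck itself tokenizes with hspell, so counts would differ on real data (for example, an abbreviation such as רה"מ is split at the quotation mark here). All names are hypothetical.

```python
import re
import statistics


def form_features(headline):
    """Form features for one headline (regex tokenization is an assumption)."""
    words = re.findall(r"\w+", headline)       # runs of letters/digits, incl. Hebrew
    punct = re.findall(r"[^\w\s]", headline)   # each punctuation character
    lengths = [len(w) for w in words] or [0]   # avoid errors on empty headlines
    return {
        "char_length": len(headline),
        "word_count": len(words),
        "word_len_avg": statistics.mean(lengths),
        "word_len_min": min(lengths),
        "word_len_median": statistics.median(lengths),
        "word_len_max": max(lengths),
        "punct_count": len(punct),
    }


print(form_features("Deadly quake: 1000 killed"))
```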
Setup & Results
7 classes, 1785 headlines (all of February)
Weka’s Random Forest
Accuracy:
  10 trees: 45.4%
  50 trees: 49.5%
Most significant features:
  Number of words
  Average word length
  Average position in word frequency table
Slide8
Feature Example
Character length
Character count
Slide9
Pairwise Setup
Binary classifier accuracy (%), 21 values, one per pair of sites (pair labels not recoverable):
72.3, 88, 92.1, 73.4, 78.5, 76.5, 84.5, 91.8, 75.8, 78.1, 77.9, 72.9, 86.7, 79, 79.4, 88.9, 74, 78.2, 69.4, 64.9, 58.6
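A pairwise setup of this kind can be sketched as a loop over all unordered pairs of sites, which gives 21 binary tasks for 7 sites. The site list below is assumed from the seven outlets named in this deck, and `evaluate` is a hypothetical caller-supplied function that trains and scores a binary classifier; neither comes from the original code.

```python
from itertools import combinations

# Assumed: the seven outlets named in the deck are the 7 classes.
SITES = ["Mako", "Walla", "NRG", "Ha'aretz", "Ma'ariv", "Israel Hayom", "ynet"]


def pairwise_accuracies(headlines_by_site, evaluate):
    """Return {(site_a, site_b): score} for every unordered pair of sites.

    `evaluate(examples)` is a hypothetical callback that receives a list of
    (headline, site) pairs for two sites and returns an accuracy.
    """
    results = {}
    for a, b in combinations(sorted(headlines_by_site), 2):
        examples = [(h, a) for h in headlines_by_site[a]]
        examples += [(h, b) for h in headlines_by_site[b]]
        results[(a, b)] = evaluate(examples)
    return results


demo = {site: [site + " headline"] for site in SITES}
print(len(pairwise_accuracies(demo, lambda examples: len(examples))))
```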
Class over agenda:
  Mako, Walla, NRG form a cluster: “online ethos”
  Ha’aretz and Ma’ariv relatively unique (newspaper-derived)
  Israel Hayom resembles tabloid competitor ynet most, more than agenda-sharing NRG
(Higher = easier to classify = less similar)
Slide10
Results – changing the scenery
Protective Edge: July 8-Aug. 26 (only 4 sites)
  2768 headlines, 53.4% acc (10 trees)
Control: Oct. 8-November 26
  1877, 54.2%
Single week: Jan. 1-7 (no Walla)
  426, 45.3% (10 trees), 51.4% (100)
Single day: December 2 (Tuesday)
  89, 39.3% (10), 46% (100)
Train on 5 months (Sep-Jan), test on Feb (no Walla)
  8113 train:1514 test, 45.8% (10), 49.9% (50)
Train on 9 months (July-Mar), test on Apr
  No Walla: 14628:1685, 40.9% (10), 45.6% (50)
  All sites: 15285:2001, 35.8% (10), 39.8% (50)
Slide11
Future Work
Better content (“agenda”) features
  Topic Models?
  Sentiment?
Some weird phenomena to be ironed out
  Alternating headlines: dedup based on recent k
  Very similar headlines: merge or use edit distance
Location-sensitive features
  Headlines starting with נתניהו ("Netanyahu"): roughly balanced
  Headlines starting with רה"מ ("PM"): 50% in Israel Hayom, another 25% in NRG
More text: main leads / other headlines
Slide12
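The two de-duplication ideas listed under Future Work (comparing against the recent k headlines, and treating very similar headlines as duplicates via edit distance) could be combined along the lines of this sketch. The window size `k`, the threshold `max_dist`, and both function names are illustrative assumptions, not values or code from the project.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def dedupe_recent(headlines, k=4, max_dist=3):
    """Drop a headline if it is within max_dist edits of any of the last k kept."""
    kept = []
    for h in headlines:
        recent = kept[-k:] if k > 0 else []
        if all(edit_distance(h, r) > max_dist for r in recent):
            kept.append(h)
    return kept


print(dedupe_recent(["A wins", "B loses", "A wins"], k=2, max_dist=0))
```

With `k=1` this reduces to consecutive de-duping, so alternating headlines are only merged when `k` is at least 2.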
Thanks!
github.com/yuvalpinter/MediaAnalysis
Slide13
Thanks!