Design and Evaluation of a Real-Time URL Spam Filtering Service

Uploaded On 2015-12-04



Presentation Transcript

Slide1

Design and Evaluation of a Real-Time URL Spam Filtering Service

Kurt Thomas, Chris Grier, Justin Ma, Vern Paxson, Dawn Song

University of California, Berkeley

International Computer Science Institute

Slide2

Motivation

Social Networks (Facebook, Twitter)

Web Mail (Gmail, Live Mail)

Blogs, Services (Blogger, Yelp)

Spam

Slide3

Motivation

Existing solutions:
Blacklists
Service-specific, account heuristics

Develop new spam filter service:
Filter spam: scams, phishing, malware
Real-time, fine-grained, generalizable

Slide4

Overview

Our system – Monarch:
Accepts millions of URLs from web service
Crawls, labels each URL in real-time

Spam Classification
Decision based on URL content, page behavior, hosting

Large-scale; distributed collection, classification
Implemented as a cloud service

Slide5

Monarch in Action

Diagram labels: Social Network, Spam Account, URL

1. Spam Message

Slide6

Monarch in Action

Diagram labels: Monarch, Social Network, Spam Account, URL

1. Spam Message

2. Message URL

Slide7

Monarch in Action

Diagram labels: Monarch, Social Network, Spam Account, URL, Spam URL Content

1. Spam Message

2. Message URL

3. Fetch Content

Slide8

Monarch in Action

Diagram labels: Monarch, Social Network, Spam Account, URL, Spam URL Content

1. Spam Message

2. Message URL

3. Fetch Content

4. Decision

Slide9

Monarch in Action

Diagram labels: Monarch, Social Network, Message Recipients, Spam Account, URL, Spam URL Content

1. Spam Message

2. Message URL

3. Fetch Content

4. Decision

Slide10

Challenges

Accuracy

Real-Time

Scalability

Tolerant to Feature Evolution

Slide11

Outline

Architecture

Results & Performance

Limitations

Conclusion

Slide12

System Architecture

Slide13

System Architecture

Slide14

System Architecture

Slide15

System Architecture

Slide16

URL Aggregation

Source                     Sample Size
Spam email URLs            1.25 million
Blacklisted Twitter URLs   567,000
Non-spam Twitter URLs      9 million

Collection period: 9/8/2010 – 10/29/2010

Slide17

Feature Collection

High Fidelity Browser

Navigation
Lexical features of URLs (length, subdomains)
Obfuscation (directory operations, nested encoding)

Hosting
IP/ASN
A, NS, MX records
Country, city if available

Slide18
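The Navigation features on the previous slide (URL length, subdomains, directory operations, nested encoding) can be illustrated with a small sketch. This is not Monarch's extractor: the function name, the feature names, the two-label guess at the registered domain, and the cap of five decoding rounds are all assumptions made for the example.

```python
from urllib.parse import urlsplit, unquote


def lexical_features(url: str) -> dict:
    """Toy lexical and obfuscation features for one URL (illustrative names)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = [label for label in host.split(".") if label]
    # Assumption: treat the last two labels as the registered domain
    # (wrong for suffixes such as co.uk; good enough for a sketch).
    subdomains = max(len(labels) - 2, 0)
    # Directory operations: "." and ".." segments left in the raw path.
    dir_ops = sum(1 for seg in parts.path.split("/") if seg in (".", ".."))
    # Nested encoding: rounds of percent-decoding that still change the URL.
    depth, current = 0, url
    while depth < 5:
        decoded = unquote(current)
        if decoded == current:
            break
        current, depth = decoded, depth + 1
    return {
        "url_length": len(url),
        "subdomains": subdomains,
        "dir_ops": dir_ops,
        "encoding_depth": depth,
    }


example = "http://a.b.example.com/x/../y/%252e%252e/z"
print(lexical_features(example))
```

In the example URL, `%252e%252e` decodes to `%2e%2e` and then to `..`, so it counts as two levels of nested encoding, while the literal `..` segment counts as one directory operation.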

Feature Collection

Content
Common HTML templates, keywords
Search engine optimization
Content of request, response headers

Behavior
Prevent navigating away
Pop-up windows
Plugin, JavaScript redirects

Slide19

Classification

Distributed Logistic Regression
Data overload for single machine

Slide20
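One common way to train logistic regression when the data does not fit on a single machine is data-parallel gradient descent: each shard of the data yields a partial gradient, and a driver sums the partials before updating the weights. The deck does not say which scheme Monarch uses, so the sketch below is an illustrative assumption; the function names, the toy data, and the learning rate are all hypothetical, and the "workers" are simulated by a loop in one process.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def shard_gradient(weights, shard):
    """Sum of log-loss gradients over one shard of (features, label) pairs."""
    grad = [0.0] * len(weights)
    for x, y in shard:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        for j, xi in enumerate(x):
            grad[j] += (p - y) * xi
    return grad


def train(shards, n_features, lr=1.0, epochs=300):
    weights = [0.0] * n_features
    n = sum(len(s) for s in shards)
    for _ in range(epochs):
        # Each call below stands in for work done on a separate machine.
        partials = [shard_gradient(weights, s) for s in shards]
        total = [sum(col) for col in zip(*partials)]
        weights = [w - lr * g / n for w, g in zip(weights, total)]
    return weights


def predict(weights, x) -> bool:
    return sigmoid(sum(w * xi for w, xi in zip(weights, x))) >= 0.5


# Toy data: features are [bias, score]; label 1 means spam.
SHARDS = [
    [([1.0, 0.0], 0), ([1.0, 1.0], 1)],
    [([1.0, 0.1], 0), ([1.0, 0.9], 1)],
]
weights = train(SHARDS, n_features=2)
print([predict(weights, x) for shard in SHARDS for x, _ in shard])
```

Because the log-loss gradient is a sum over examples, the summed partials equal the gradient computed on the combined data, which is why the work can be split across machines without changing the result.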

Classification

Distributed Logistic Regression
Data overload for single machine

L1-regularization
Reduces feature space, over-fitting
50 million features -> 100,000 features

Slide21

Implementation

System implemented as a cloud service on Amazon EC2

Aggregation: 1 machine

Feature Collection: 20 machines
Firefox, extension + modified source

Classification & Feature Extraction: 50 machines
Hadoop - Spark, Mesos

Straightforward to scale the architecture

Slide22

Result Overview

High-level summary:
Performance
Overall accuracy
Highlight important features
Feature evolution
Spam independence between services

Slide23

Performance

Rate: 638,000 URLs/day
Cost: $1,600/mo
Process time: 5.54 sec
Network delay: 5.46 sec

Can scale to 15 million URLs/day
Estimated $22,000/mo

Slide24

Measuring Accuracy

Dataset: 12 million URLs (<2 million spam)
Sample 500K spam (half tweets, half email)
Sample 500K non-spam

Training, Testing
5-fold validation
Vary training folds non-spam:spam ratio
Test fold equal parts spam, non-spam

Slide25

Overall Accuracy

Training Ratio   Accuracy   False Positive Rate   False Negative Rate
1:1              94%        4.23%                 7.5%
4:1              91%        0.87%                 17.6%
10:1             87%        0.29%                 26.5%
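Since the test fold is equal parts spam and non-spam, the three columns are tied together: accuracy should equal 100% minus the average of the two error rates. The check below assumes the false positive rate is measured over non-spam test samples and the false negative rate over spam test samples, which the deck does not state outright; the function name is made up for the example.

```python
def balanced_accuracy(fpr_pct: float, fnr_pct: float) -> float:
    """Accuracy (%) on a test set that is half spam, half non-spam."""
    return 100.0 - (fpr_pct + fnr_pct) / 2.0


# (false positive rate, false negative rate) per training ratio, from the table.
rows = {"1:1": (4.23, 7.5), "4:1": (0.87, 17.6), "10:1": (0.29, 26.5)}
for ratio, (fpr, fnr) in rows.items():
    print(ratio, round(balanced_accuracy(fpr, fnr), 1))
```

The computed values round to 94%, 91%, and 87%, matching the Accuracy column, so the table is internally consistent under that assumption.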

Accuracy: correctly labeled samples
False Positive Rate: non-spam labeled as spam
False Negative Rate: spam labeled as non-spam

Slide26

Overall Accuracy

Accuracy: correctly labeled samples
False Positive Rate: non-spam labeled as spam
False Negative Rate: spam labeled as non-spam

Training Ratio   Accuracy   False Positive Rate   False Negative Rate
1:1              94%        4.23%                 7.5%
4:1              91%        0.87%                 17.6%
10:1             87%        0.29%                 26.5%

Slide27

Error by Feature

Chart axis: Error (%), where Error = 1 - Accuracy

Slide28

Error by Feature

Chart axis: Error (%), where Error = 1 - Accuracy

Slide29

Error by Feature

Chart axis: Error (%), where Error = 1 - Accuracy

Slide30

Feature Evolution – Retraining Required

Chart axis: Accuracy (%)

Slide31

Spam Independence

Unexpected result: Twitter, email spam qualitatively different

Training Set   Testing Set   Accuracy   False Negatives
Twitter        Twitter       94%        22%
Twitter        Email         81%        88%
Email          Twitter       80%        99%
Email          Email         99%        4%

Slide32

Spam Independence

Unexpected result: Twitter, email spam qualitatively different

Training Set   Testing Set   Accuracy   False Negatives
Twitter        Twitter       94%        22%
Twitter        Email         81%        88%
Email          Twitter       80%        99%
Email          Email         99%        4%

Slide33

Distinct Email, Twitter Features

Slide34

Email Features Shorter Lived

Slide35

Limitations

Adversarial Machine Learning
We provide oracle to spammers
Can adversaries tweak content until passing?

Time-based Evasion
Change content after URL submitted for verification

Crawler Fingerprinting
Identify IP space of Monarch, fingerprint Monarch browser client
Dual-personality DNS, page behavior

Slide36

Related Work

C. Whittaker, B. Ryner, and M. Nazif, “Large-Scale Automatic Classification of Phishing Pages”

J. Ma, L. Saul, S. Savage, and G. Voelker, “Identifying suspicious URLs: an application of large-scale online learning”

Y. Zhang, J. Hong, and L. Cranor, “Cantina: a content-based approach to detecting phishing web sites”

M. Cova, C. Kruegel, and G. Vigna, “Detection and analysis of drive-by-download attacks and malicious JavaScript code”

Slide37

Conclusion

Monarch provides:
Real-time scam, phishing, malware detection
Experiments show 91% accuracy, 0.87% false positives
Readily scalable cloud service
Applicable to all URL-based spam

Spam not guaranteed to overlap between web services
Twitter, email qualitatively different
Despite overlap, can still provide generalizable filtering
Require training data from each service
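The "readily scalable" claim can be sanity-checked against the figures from the Performance slide: 638,000 URLs/day at $1,600/mo as measured, and an estimated $22,000/mo at 15 million URLs/day. The sketch below converts both to a cost per million URLs. The 30-day month and the function name are assumptions made for the arithmetic, not figures from the deck.

```python
def cost_per_million_urls(monthly_cost_usd: float,
                          urls_per_day: float,
                          days_per_month: float = 30.0) -> float:
    """Monthly cost divided by monthly URL volume, scaled to one million URLs."""
    return monthly_cost_usd / (urls_per_day * days_per_month) * 1_000_000


measured = cost_per_million_urls(1_600, 638_000)       # deployment as evaluated
projected = cost_per_million_urls(22_000, 15_000_000)  # estimated larger scale
print(f"measured:  ${measured:.2f} per million URLs")
print(f"projected: ${projected:.2f} per million URLs")
```

Under these assumptions the measured deployment costs roughly $84 per million URLs and the projected one roughly $49, so the estimate implies that unit cost falls as the service scales up.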