Slide1
Spam Email Detection
Ethan Grefe
December 13, 2013
Slide2
Motivation
Spam email is constantly cluttering inboxes
Commonly removed using rule-based filters
Spam often has very similar characteristics
This allows it to be detected using machine learning
  Naïve Bayes classifiers
  Support vector machines
Slide3
SVM Solution
Used training data from the CSDMC2010 SPAM corpus
  4327 labeled emails
  2949 non-spam messages (HAM)
  1378 spam messages (SPAM)
Extracted features from the subject and body of emails
Used resulting feature vectors to train an SVM classifier in Matlab
Slide4
Email Features
Features were determined by research and observation
Best results were obtained with the following features
  Percentage of letters that are capitalized
  Types of punctuation used
  Average length of a word
  Amount of HTML in the email
Slide5
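The four features listed on the Email Features slide could be computed roughly as follows. This is only a sketch in Python, not the Matlab code used in the project: the function names, the set of punctuation marks, and the use of a tag count as the measure of "amount of HTML" are all assumptions, since the slides do not give those details.

```python
import re

def capital_fraction(text):
    """Fraction of alphabetic characters that are uppercase (0.0 if no letters)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)

def punctuation_counts(text, marks="!?$.,;:"):
    """Occurrences of each mark in `marks`; the mark set is an assumption."""
    return [text.count(m) for m in marks]

def average_word_length(text):
    """Mean length of alphabetic words (0.0 if the text has no words)."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

def html_tag_count(text):
    """Rough proxy for the amount of HTML: number of tag-like substrings."""
    return len(re.findall(r"<[^<>]+>", text))

def extract_features(subject, body):
    """Build one feature vector from the subject and body of an email."""
    text = subject + "\n" + body
    return [
        capital_fraction(text),
        *punctuation_counts(text),
        average_word_length(text),
        float(html_tag_count(text)),
    ]

# Hypothetical email, invented for illustration only.
example = extract_features("FREE offer!!!", "<b>Click</b> here NOW")
```

Each email becomes a fixed-length numeric vector, which is the form an SVM classifier expects as input.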
Classifier Results
Trained on a random 35% of emails
Tested SVM classifier on remaining 65%
Trained SVM using three different kernel functions
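As a sketch of the evaluation setup described above, the Python below draws a random 35%/65% split and writes out one common form of each of the three kernels. It is not the Matlab code from the project: the seed, the RBF width `sigma`, and the `+ 1` offset in the quadratic kernel are assumptions, because the slides do not state the kernel parameters, and no SVM is actually trained here.

```python
import math
import random

def split_indices(n, train_fraction=0.35, seed=0):
    """Randomly partition range(n) into train and test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_fraction)
    return indices[:n_train], indices[n_train:]

def dot(x, y):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    """k(x, y) = x . y"""
    return dot(x, y)

def quadratic_kernel(x, y):
    """k(x, y) = (x . y + 1)^2, one common degree-2 polynomial form (assumed)."""
    return (dot(x, y) + 1.0) ** 2

def rbf_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is an assumed default."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

# Split the 4327 corpus emails into training and test sets.
train_idx, test_idx = split_indices(4327)
```

The kernel is the only part that changes between the three runs; the split and the feature vectors stay the same.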
Kernel Function   Spam Classification Rate   Ham Classification Rate   Total Classification Rate
RBF               80.06%                     92.33%                    86.20%
Linear            78.69%                     80.66%                    79.67%
Quadratic         82.75%                     84.85%                    83.80%
Slide6
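The slides do not say how the Total Classification Rate in the results table was computed. The short Python check below suggests it matches the simple average of the spam and ham rates to within rounding. For comparison it also computes a count-weighted rate, under the assumption (not stated in the slides) that the test set has the same spam-to-ham proportions as the full corpus.

```python
# Rates copied from the results table: (spam %, ham %, reported total %).
results = {
    "RBF": (80.06, 92.33, 86.20),
    "Linear": (78.69, 80.66, 79.67),
    "Quadratic": (82.75, 84.85, 83.80),
}

# Corpus class counts from the SVM Solution slide.
N_SPAM, N_HAM = 1378, 2949

def unweighted_mean(spam_rate, ham_rate):
    """Simple average of the two per-class rates."""
    return (spam_rate + ham_rate) / 2.0

def count_weighted(spam_rate, ham_rate, n_spam=N_SPAM, n_ham=N_HAM):
    """Overall rate if each class is weighted by its message count (assumed mix)."""
    return (spam_rate * n_spam + ham_rate * n_ham) / (n_spam + n_ham)

for kernel, (spam, ham, total) in results.items():
    print(
        f"{kernel}: reported {total:.2f}, "
        f"unweighted mean {unweighted_mean(spam, ham):.3f}, "
        f"count-weighted {count_weighted(spam, ham):.3f}"
    )
```

Because ham outnumbers spam roughly two to one in the corpus, the count-weighted figure leans toward the ham rate, so the two summaries can differ by a couple of percentage points.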
Possible Improvements
Use Naïve Bayes to classify emails using word frequency
Obtain a wider variety of input features
Test other types of learning algorithms
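To illustrate the first suggested improvement, here is a minimal sketch of a word-frequency Naïve Bayes classifier in Python. It is not part of the original project: the class name, the tokenizer, the add-one smoothing constant `alpha`, and the four toy messages are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes over word counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha          # smoothing constant (assumed value)
        self.word_counts = {}       # label -> Counter of word frequencies
        self.doc_counts = Counter() # label -> number of training messages
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counts[word] += 1
                self.vocab.add(word)
        return self

    def log_score(self, text, label):
        """Log prior plus the sum of smoothed log word likelihoods."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.word_counts[label]
        denom = sum(counts.values()) + self.alpha * len(self.vocab)
        for word in tokenize(text):
            if word in self.vocab:  # words never seen in training are skipped
                score += math.log((counts[word] + self.alpha) / denom)
        return score

    def predict(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))

# Toy training data, invented for illustration only.
toy_texts = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch on monday",
]
toy_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(toy_texts, toy_labels)
```

Unlike the SVM features above, this approach uses the words themselves as evidence, so it needs no hand-picked features beyond the tokenizer.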