An Introduction to Support Vector Machines
In part from slides of Jinwei Gu
Support Vector Machine
A classifier derived from statistical learning theory by Vapnik et al. in 1992.
SVM became famous when, using images as input, it gave accuracy comparable to neural networks in a handwriting recognition task.
Currently, SVM is widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.
It is still one of the best non-deep methods: in many tasks its performance is comparable.
Support Vector Machine
A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane (as in neural networks).
In other words, given labeled training data (supervised learning, for both classification and regression), the algorithm outputs an optimal hyperplane which categorizes new examples.
To understand SVMs we need to brush up on vector algebra a bit...
Function Learned by SVM
It can be an arbitrary function f of x, such as:
- Linear functions (like the perceptron)
- Nonlinear functions (like feed-forward neural nets)
Linear Function (Linear Separator)
f(x) is a linear function in R^n:
    f(x) = w^T x + b
In matrix/vector notation, w is the vector of weights, x is an input vector, and T denotes the transpose.
The hyperplane w^T x + b = 0 divides the space into two half-spaces: w^T x + b > 0 on one side and w^T x + b < 0 on the other. In R^2 the hyperplane is the line w1 x1 + w2 x2 + b = 0.
The normal vector n of the hyperplane is the unit-length vector w / ||w||: if you divide a vector by its norm, you obtain a unit vector (with norm = 1).
[Figure: the line w^T x + b = 0 in the (x1, x2) plane, with the half-spaces w^T x + b > 0 and w^T x + b < 0 and the normal vector n.]
Recall: Vector Representation and Dot Product
A hyperplane can be described through its normal vector w: the black line is perpendicular to the vector w, therefore for any two points x_a, x_b on the line the dot product w^T (x_a - x_b) = 0.
[Figure: the vector w, perpendicular to a line passing through points x_a, x_b, x_c, with offset b.]
Linear Separator
The classification function is:
    f(x, w, b) = sign(w^T x + b) = y
If sign(w^T x + b) > 0, then x is positively classified; if sign(w^T x + b) < 0, then x is negatively classified.
[Figure: the hyperplane w^T x + b = 0 separating points labeled +1 from points labeled -1, with w^T x + b > 0 on the positive side and w^T x + b < 0 on the negative side.]
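The decision rule above is simple enough to sketch directly. The weights, bias, and test points below are made-up illustrative values, not taken from the slides:

```python
def dot(u, v):
    """Dot product of two vectors given as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))

def classify(x, w, b):
    """Linear separator decision: +1 if w^T x + b > 0, else -1."""
    return 1 if dot(w, x) + b > 0 else -1

w, b = [1.0, -1.0], 0.0               # assumed hyperplane: x1 - x2 = 0
print(classify([3.0, 1.0], w, b))     # below the line x2 = x1 -> +1
print(classify([1.0, 3.0], w, b))     # above the line x2 = x1 -> -1
```

The same function is reused, unchanged, once w and b have been learned by the SVM training procedure.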
Linear Function
How would you classify these points using a linear function?
There is an infinite number of answers!
[Figure: separable +1 and -1 points in the (x1, x2) plane, with several candidate separating lines.]
Linear Discriminant Function
How would you classify these points using a linear discriminant function in order to minimize the error rate?
There is an infinite number of answers! Which one is the best?
[Figure: the same +1 and -1 points with several different separating lines drawn, all consistent with the training data.]
Linear Classifiers
    h(x) = sign(w^T x + b)
If the classifier is producing a correct classification, then:
    y_i * sign(w^T x_i + b) = 1
since both y_i and sign(w^T x_i + b) are either -1 or +1.
How would you classify this data? In this example, data points too close to the decision boundary may lead to future wrong classifications!
[Figure: a separating line passing very close to some training points; a nearby point ends up misclassified to the +1 class.]
Large Margin Linear Classifier
The linear discriminant function (classifier) with the maximum margin is the best!
Why is it the best? It is robust to outliers and thus has stronger generalization ability.
The margin is defined as the width by which the boundary could be increased before hitting a data point in the learning set (a "safe zone" around the boundary).
[Figure: a separating line with its margin shaded between the +1 and -1 classes.]
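The margin definition above can be computed directly: it is the distance from the boundary to the nearest training point. The boundary and points below are illustrative values chosen for this sketch, not taken from the slides:

```python
import math

def distance_to_boundary(x, w, b):
    """Geometric distance of point x from the hyperplane w^T x + b = 0."""
    wx = sum(wi * xi for wi, xi in zip(w, x))
    return abs(wx + b) / math.sqrt(sum(wi * wi for wi in w))

points = [[3.0, 1.0], [3.0, -1.0], [1.0, 0.0], [0.0, 2.0]]
w, b = [1.0, 0.0], -2.0   # assumed boundary: the vertical line x1 = 2
margin = min(distance_to_boundary(x, w, b) for x in points)
print(margin)             # the nearest points sit at |x1 - 2| = 1
```

Increasing the boundary's width beyond this minimum distance would "hit" a training point, which is exactly the slide's definition.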
Functional Margins
The functional margin of a training example (x_i, y_i) in the dataset D is defined as:
    fm_i = y_i (w^T x_i + b)
If fm_i > 0, then y_i and (w^T x_i + b) have the same sign, so the prediction is CORRECT. If fm_i is large, the prediction also has high CONFIDENCE, since fm_i is high iff |w^T x_i + b| is large, which means that x_i is "far enough" from the decision boundary w^T x + b = 0.
However, the definition of our classification function h(x) = sign(w^T x + b) is scale invariant: if we multiply w and b by any constant k, h(x) is the same (but the functional margin will be k times larger!). Therefore maximizing functional margins is not a good choice: we can make them arbitrarily large without changing anything meaningful!
Geometric Margins
The geometric margin is defined as the distance between a data point x_i and the decision boundary:
    gm(x_i) = y_i (w^T x_i + b) / ||w||
The geometric margin is scale invariant with respect to the parameters of the decision boundary: if we multiply w and b by k, the margin is the same!
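The contrast between the two margins can be checked numerically. This is a minimal sketch using the definitions above; the specific point, labels, and boundary are assumed for illustration:

```python
import math

def functional_margin(x, y, w, b):
    """fm = y * (w^T x + b): grows when (w, b) is rescaled."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)

def geometric_margin(x, y, w, b):
    """gm = y * (w^T x + b) / ||w||: invariant under rescaling of (w, b)."""
    return functional_margin(x, y, w, b) / math.sqrt(sum(wi * wi for wi in w))

x, y = [3.0, 1.0], +1
w, b = [1.0, 0.0], -2.0   # assumed boundary x1 = 2
k = 10.0
w_k, b_k = [k * wi for wi in w], k * b

print(functional_margin(x, y, w, b))      # 1.0
print(functional_margin(x, y, w_k, b_k))  # 10.0: k times larger
print(geometric_margin(x, y, w, b))       # 1.0
print(geometric_margin(x, y, w_k, b_k))   # 1.0: unchanged
```

This is exactly why the optimization objective is stated in terms of the geometric margin.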
Task Objective: Maximizing the Geometric Margin
Let's now consider the data points (x+, x-) in the training set that are placed "on the margins", i.e., those closest to the separating hyperplane. They are called SUPPORT VECTORS.
Since, as we said, geometric margins are scale invariant, we can apply a scale transformation to w and b and set the functional margins of the support vectors equal to 1:
    H1: w^T x+ + b = +1   (positive support vectors)
    H2: w^T x- + b = -1   (negative support vectors)
[Figure: the separating hyperplane with the two margin hyperplanes H1 and H2 passing through the support vectors of the +1 and -1 classes.]
Task Objective: Maximizing the Geometric Margin
Consider every instance of the training set in the form <x, y>, where y is the class of x:
    If y = +1 → w^T x + b > 0
    If y = -1 → w^T x + b < 0
Considering the margins defined by the support vectors, we obtain the equivalent formulation:
    If y = +1 → w^T x + b ≥ +1
    If y = -1 → w^T x + b ≤ -1
since, by construction, there are no points in the dataset lying within the margins. The equality applies ONLY to support vectors.
[Figure: the hyperplanes w^T x + b = -1, w^T x + b = 0, and w^T x + b = +1, with the normal vector n.]
The Margin Width
Since we know that for the support vectors |w^T x + b| = 1, the distance between H1 (w^T x + b = +1) and H (w^T x + b = 0) is 1 / ||w||.
The distance between H1 and H2 (w^T x + b = -1) is then:
    margin width = 2 / ||w||
[Figure: the three hyperplanes H1, H, H2, with the support vectors lying on H1 and H2 and the normal vector n.]
Maximizing the Margin
Remember: our objective is finding w and b such that the geometric margin is maximized.
So we need to maximize 2 / ||w|| such that, for any x_i in D:
    y_i = +1 → w^T x_i + b ≥ +1
    y_i = -1 → w^T x_i + b ≤ -1
The "such that" means that there must be no data points between H1 and H2.
Equivalently, we need to minimize (1/2) ||w||^2 subject to the same constraints.
Notice that the two constraints can be combined into one:
    y_i (w^T x_i + b) ≥ 1
Solving the Optimization Problem
We can rewrite all of this as an optimization problem:
    minimize (1/2) ||w||^2
    s.t. y_i (w^T x_i + b) ≥ 1 for all i
This is a quadratic problem with linear constraints, a well-known class of mathematical programming problems. We solve it with the Lagrangian method.
The Lagrangian
Given a function f(x) and a set of constraints c_1, ..., c_n, a Lagrangian is a function L(f, c_1, ..., c_n, α_1, ..., α_n) that "incorporates" the constraints into the optimization problem.
Karush-Kuhn-Tucker (KKT) conditions: the optimum is at a point where
    1. the derivative of L is 0, and
    2. α_i c_i(x) = 0 for every i.
The second condition is known as the complementarity condition (it means that at least one of α_i and c_i(x) must be zero).
Solving the Optimization Problem
The Lagrangian for our problem is:
    L(w, b, α) = (1/2) ||w||^2 - Σ_i α_i [ y_i (w^T x_i + b) - 1 ],   s.t. α_i ≥ 0
To minimize L we need to maximize the subtracted constraint term (the highlighted box in the slide). Note: our variables here are w, b, and the α_i.
Setting the derivatives of L to zero gives a system of equations:
    ∂L/∂w = 0  →  w = Σ_i α_i y_i x_i
    ∂L/∂b = 0  →  Σ_i α_i y_i = 0
Lagrangian Dual Problem Formulation
Since w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0, substituting into L we obtain the dual problem:
    maximize Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
    s.t. α_i ≥ 0, and Σ_i α_i y_i = 0
All the Steps
We know from the previous steps and the KKT conditions that:
    1. w = Σ_i α_i y_i x_i
    2. Σ_i α_i y_i = 0
    3. α_i ≥ 0
    4. α_i ( y_i (w^T x_i + b) - 1 ) = 0   (complementarity)
By 1., 2., and 3., the primal reduces to the dual:
    maximize Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j,   s.t. α_i ≥ 0 and Σ_i α_i y_i = 0
NOTE: because of condition 4, each non-zero α_i indicates that the corresponding x_i is a support vector, since y_i (w^T x_i + b) - 1 is zero ONLY for support vectors.
Why Do Only Support Vectors Have α > 0?
The complementarity condition implies that one of the two multiplicands α_i and y_i (w^T x_i + b) - 1 is zero. However, for a non-support vector y_i (w^T x_i + b) > 1, so the second multiplicand is non-zero. Therefore only data points on the margin (= support vectors) will have a non-zero α, because they are the only ones for which y_i (w^T x_i + b) = 1.
Final Lagrangian Formulation
To summarize, find α_1, ..., α_n such that
    Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
is maximized, subject to:
    1. Σ_i α_i y_i = 0
    2. α_i ≥ 0 for all α_i
Q(α) can be computed since we know the y_i y_j x_i^T x_j (they come from the pairs <x_i, y_i> of the training set D!), and the only variables are the α_i. But remember that constraint 2 can hold with strict inequality (α_i > 0) only for the support vectors!
So, in this final formulation the only variables to be computed are the α_i of the support vectors. With respect to the original formulation, b and w have disappeared!
Summary of the Optimization Problem Solution
Quadratic optimization problems (like the primal problem) are a well-known class of mathematical programming problems, and many algorithms exist for solving them.
The solution involves constructing a dual problem where a Lagrange multiplier α_i is associated with every constraint in the primal problem. In the dual problem, the only unknowns are the α_i.
Primal problem: find w and b such that (1/2) ||w||^2 is minimized, s.t. y_i (w^T x_i + b) ≥ 1 for all i.
Lagrangian dual problem: find α_1, ..., α_n such that Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j is maximized, s.t. Σ_i α_i y_i = 0 and α_i ≥ 0 for all α_i.
The Optimization Problem Solution
Given a solution α_1, ..., α_n to the Lagrangian dual problem Q(α), the solution to the primal (i.e., computing w and b) is:
    w = Σ_i α_i y_i x_i
    b = y_k - Σ_i α_i y_i x_i^T x_k   for any α_k > 0
Each non-zero α_i indicates that the corresponding x_i is a support vector.
Once we solve the problem, the classification function for a new point x is the sign of:
    f(x) = Σ_i α_i y_i x_i^T x + b   (the x_i are the support vectors; note that we don't need to compute w explicitly)
Notice that to predict the class of x we perform an inner (dot) product x_i^T x between the test point x and the support vectors x_i. Only the support vectors are useful for classification of new instances (since the other α values are zero).
Also keep in mind that solving the optimization problem involved computing the inner (dot) products x_i^T x_k between all training points.
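The recipe above can be exercised on a hand-solvable instance. The two training points below are an assumption made for this sketch, one per class; with Σ α_i y_i = 0 both multipliers are equal, so the dual Q(α) becomes a one-dimensional parabola we can maximize in closed form, and then w and b follow from the formulas on this slide:

```python
x1, y1 = [0.0, 0.0], -1   # assumed negative example
x2, y2 = [2.0, 0.0], +1   # assumed positive example

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

# With alpha1 = alpha2 = a, Q(a) = 2a - (1/2) a^2 (K11 - 2 K12 + K22),
# so dQ/da = 0 gives a* = 2 / curvature.
curv = dot(x1, x1) - 2 * dot(x1, x2) + dot(x2, x2)
alpha = 2.0 / curv

# Recover the primal solution: w = sum_i alpha_i y_i x_i, b = y_k - w^T x_k.
w = [alpha * (y1 * c1 + y2 * c2) for c1, c2 in zip(x1, x2)]
b = y2 - dot(w, x2)
print(alpha, w, b)   # 0.5 [1.0, 0.0] -1.0
```

The resulting boundary x1 = 1 sits halfway between the two points, with both of them as support vectors, as the theory predicts.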
Kernel Functions and the Linear Kernel
The inner (dot) product belongs to a category of mathematical functions known as kernel functions, i.e., SIMILARITY functions between vectors. Specifically, the dot product is a linear kernel.
Why is the dot product (linear kernel) a similarity function? Because
    x^T z = ||x|| ||z|| cos(θ)
where θ is the angle between the two vectors: the more two vectors of given length point in the same direction, the larger their dot product.
Example
Suppose we have a small dataset of positive and negative points in the plane.
[Figure: the positive and negative training points in the (x1, x2) plane.]
Example
By simple inspection, we can identify 3 support vectors: A, B, and C.
[Figure: the same points, with the support vectors A (negative class) and B, C (positive class) highlighted.]
Example: What Do We Know?
We know that in general w^T x + b = Σ_i α_i y_i x_i^T x + b, since w = Σ_i α_i y_i x_i (the first constraint from the derivative) and the α_i are all non-zero for support vectors. Therefore:
    Positive margin H1 (support vectors B, C): Σ_i α_i y_i x_i^T x + b = +1
    Negative margin H2 (support vector A):     Σ_i α_i y_i x_i^T x + b = -1
The kernel is the inner dot product for a linear separator: k(x_i, x_j) = x_i^T x_j.
From the optimization condition Σ_i α_i y_i = 0, with y_A = -1 and y_B = y_C = +1:
    -α_A + α_B + α_C = 0
Example: First Step
Now we can write a system of equations to find the "alphas", one equation for each of the 3 support vectors, where k(·, ·) (i.e., the kernel) is the dot product in this case (linear):
    -α_A k(A, A) + α_B k(B, A) + α_C k(C, A) + b = -1
    -α_A k(A, B) + α_B k(B, B) + α_C k(C, B) + b = +1
    -α_A k(A, C) + α_B k(B, C) + α_C k(C, C) + b = +1
Example: Second Step
Then we compute the inner dot products (e.g., K(B, C) = B^T C = 8):
    K(·, ·) |  A |  B |  C
    --------+----+----+----
       A    |  1 |  3 |  3
       B    |  3 | 10 |  8
       C    |  3 |  8 | 10
Example: The Resulting System
Substituting the kernel values, we obtain the system of equations:
    -α_A + 3 α_B + 3 α_C + b = -1
    -3 α_A + 10 α_B + 8 α_C + b = +1
    -3 α_A + 8 α_B + 10 α_C + b = +1
We have 3 equations and 4 variables, but...
Example: Solving the System
From the optimization conditions, we also know that Σ_i α_i y_i = 0, i.e.:
    -α_A + α_B + α_C = 0
Subtracting the third margin equation from the second gives α_B = α_C, and the constraint then gives α_A = 2 α_B. So the result is:
    α_B = α_C = 1/4,  α_A = 1/2,  b = -2
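The slide leaves the arithmetic implicit, but the four equations (three margin conditions plus Σ α_i y_i = 0) form an ordinary linear system in (α_A, α_B, α_C, b). A minimal sketch solving it by plain Gaussian elimination, using only the kernel table above and the labels y_A = -1, y_B = y_C = +1:

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Unknowns: (alpha_A, alpha_B, alpha_C, b).
# Rows: margin equations for A, B, C, then -alpha_A + alpha_B + alpha_C = 0.
A = [[-1.0,  3.0,  3.0, 1.0],
     [-3.0, 10.0,  8.0, 1.0],
     [-3.0,  8.0, 10.0, 1.0],
     [-1.0,  1.0,  1.0, 0.0]]
rhs = [-1.0, 1.0, 1.0, 0.0]
aA, aB, aC, b = solve(A, rhs)
print(aA, aB, aC, b)   # 0.5 0.25 0.25 -2.0
```

Substituting these values back into the three margin equations reproduces -1, +1, +1 exactly, confirming the solution.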
The Solution (Graphically)
The weights w_i are computed using w = Σ_i α_i y_i x_i.
[Figure: the resulting separating line, with the margins passing through the support vectors A, B, and C.]
Data with "Noise"
SVMs try to perfectly separate the training data with a separating hyperplane with maximum margins. However, data are rarely "perfectly" separable.
In some cases it is better to allow some classification error in order to favor high generalization power, rather than adapting to the training examples (overfitting).
Soft Margins: the Intuition
The red line correctly separates the data but, in a sense, it "overfits": the margin is very narrow and reduces the generalization power over unseen examples.
The green line allows us to define larger margins, but it makes one error (the green point that lies above the green line).
The idea behind soft margins is to "sacrifice" consistency over the training data in exchange for larger margins.
[Figure: two candidate separators: a narrow-margin red line with no errors and a wide-margin green line with one error.]
Soft Margins (2)
The general idea is to change the objective function. Rather than minimizing (1/2) ||w||^2, we minimize:
    (1/2) ||w||^2 + C (# of mistakes)
Here, C is a hyperparameter that decides the trade-off between maximizing the margin and minimizing the mistakes.
When C is small, classification mistakes are given less importance and the focus is more on maximizing the margin, whereas when C is large, the focus is more on avoiding misclassification at the expense of keeping the margin small.
However, not all mistakes are equal. Data points that are far away on the wrong side of the decision boundary should incur more penalty than the ones that are closer.
Soft Margins (3)
The idea is: for every data point x_i, we introduce a slack variable ξ_i. The value of ξ_i is the distance of x_i from the corresponding class's margin if x_i is on the wrong side of the margin, and zero otherwise. Thus the points that are far away from the margin on the wrong side get more penalty. The constraint becomes:
    y_i (w^T x_i + b) ≥ 1 - ξ_i
If the left side (remember: it is a functional margin) is ≥ 1, then x_i is correctly classified with full margin; if instead it is smaller (or even negative, meaning x_i is on the "wrong side"), we add a penalty ξ_i that is proportional to the distance of x_i from its class's margin. ξ_i is called a slack variable.
Soft Margin Classification
Soft margin: the SVM tries to find a hyperplane to separate the two classes, but tolerates a few misclassified points.
Slack variables ξ_i can be added to allow misclassification of outliers or noisy examples; the resulting margins are called soft.
The "relaxed" constraint is:
    y_i (w^T x_i + b) ≥ 1 - ξ_i,  with ξ_i ≥ 0
[Figure: a soft-margin separator, with misclassified examples x_i at distance ξ_i from their class's margin.]
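The slack variables have a simple closed form, ξ_i = max(0, 1 - y_i (w^T x_i + b)). A small sketch; the separator and points below are illustrative values, not taken from the slides:

```python
def slack(x, y, w, b):
    """Slack xi: 0 for points at or beyond their margin, positive otherwise."""
    return max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))

w, b = [1.0, 0.0], -2.0   # assumed boundary x1 = 2, margins at x1 = 1 and x1 = 3
print(slack([4.0, 0.0], +1, w, b))   # beyond its margin: xi = 0
print(slack([2.5, 0.0], +1, w, b))   # inside the margin but correct: 0 < xi <= 1
print(slack([1.0, 0.0], +1, w, b))   # on the wrong side of the margin: xi > 1
```

The soft-margin objective then adds C times the sum of these ξ_i to (1/2) ||w||^2.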
Soft Margin Classification: New Formulation of the Primal Problem
    minimize (1/2) ||w||^2 + C Σ_i ξ_i
    such that y_i (w^T x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0 for all i
The parameter C can be viewed as a way to control over-fitting: for large C, a large penalty is assigned to errors.
The slack variables allow an example with 0 < ξ_i ≤ 1 to lie inside the margin, while if ξ_i > 1 it is misclassified (an error).
When incorporating the constraints (as for the standard SVM), we obtain the corresponding Lagrangian dual problem.
Soft-Margin Classification: Effect of the Parameter C
C is a hyperparameter that decides the trade-off between maximizing the margin and minimizing the mistakes.
[Figure: decision boundaries and margins obtained with different values of C.]
Summary: Linear SVM
The classifier is a separating hyperplane.
The most "important" training points are the support vectors; they define the position of the hyperplane.
Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e., those with non-zero Lagrange multipliers α_i.
Both in the dual formulation of the problem and in the solution, the training points appear only inside dot products.
Slack variables allow us to tolerate errors in "quasi-linearly" separable data.
Non-linear SVMs
Datasets that are linearly separable (possibly with noise) work out great. But what are we going to do if the dataset is just too hard?
How about... mapping the data to a higher-dimensional space? For example, 1-D points on the x axis that are not linearly separable can become separable when mapped to the (x, x^2) plane.
[Figure: non-separable 1-D points on the x axis, and the same points mapped to (x, x^2), where a line separates them.]
(This slide is courtesy of LINK)
Non-linear SVMs: Feature Space
General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:
    Φ: x → φ(x)
Note that φ(x) is still a vector, in a higher-dimensional space!
(This slide is courtesy of LINK)
Space Mapping Function
The original points (left side of the schematic) are mapped by φ, i.e., rearranged, using non-linear kernels.
[Figure: schematic of points mapped from the input space to the feature space, where they become linearly separable.]
The "Kernel Trick"
The original formulation of the discriminant function was:
    f(x) = Σ_i α_i y_i x_i^T x + b
With this mapping, our discriminant function is now:
    f(x) = Σ_i α_i y_i φ(x_i)^T φ(x) + b
The Kernel Trick (2)
The linear classifier relies on the dot product between vectors, K(x_i, x_j) = x_i^T x_j. As we said, the dot product is a linear kernel, and it computes the similarity between two vectors in the feature-vector space.
In general, a kernel function is defined as a function that corresponds to a dot product of two feature vectors in some "expanded" feature space:
    K(x_i, x_j) = φ(x_i)^T φ(x_j)
Example: Second-Degree Polynomial Mapping
Consider the mapping φ(x) = (x1^2, √2 x1 x2, x2^2) for x = (x1, x2). The function K(x, z) = (x^T z)^2 satisfies the definition of a kernel function, since it corresponds to a dot product of two feature vectors in this expanded feature space:
    (x^T z)^2 = (x1 z1 + x2 z2)^2 = x1^2 z1^2 + 2 x1 z1 x2 z2 + x2^2 z2^2 = φ(x)^T φ(z)
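The identity above can be checked numerically: the dot product computed in the expanded space equals (x^T z)^2 computed directly in the input space. The two test vectors are arbitrary choices for this sketch:

```python
import math

def phi(x):
    """Second-degree polynomial mapping (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)."""
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x, z = [1.0, 2.0], [3.0, -1.0]
lhs = dot(x, z) ** 2          # kernel evaluated in the original space
rhs = dot(phi(x), phi(z))     # dot product in the expanded space
print(lhs, rhs)               # 1.0 1.0  (x^T z = 3 - 2 = 1)
```

This is the whole point of the trick: the left-hand side never materializes the expanded vectors, yet computes the same similarity.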
The Kernel Trick: Popular Kernel Functions
    Linear kernel:     K(x_i, x_j) = x_i^T x_j
    Polynomial kernel: K(x_i, x_j) = (1 + x_i^T x_j)^p
    Gaussian kernel (Radial-Basis Function, RBF): K(x_i, x_j) = exp( -||x_i - x_j||^2 / (2σ^2) )
    Sigmoid:           K(x_i, x_j) = tanh( β_0 x_i^T x_j + β_1 )
In general, functions that satisfy Mercer's condition (https://www.svms.org/mercer/) can be kernel functions.
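The four kernels listed above are short enough to sketch directly; the parameter names p, sigma, beta0, beta1 follow the slide, while their default values and the test vectors are assumptions for illustration:

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def linear(xi, xj):
    return dot(xi, xj)

def polynomial(xi, xj, p=2):
    return (1 + dot(xi, xj)) ** p

def rbf(xi, xj, sigma=1.0):
    """Gaussian / RBF kernel: 1 for identical vectors, decays with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def sigmoid(xi, xj, beta0=1.0, beta1=0.0):
    return math.tanh(beta0 * dot(xi, xj) + beta1)

x, z = [1.0, 0.0], [0.0, 1.0]
print(linear(x, z), polynomial(x, z), rbf(x, x))   # 0.0 1 1.0
```

Any of these can be plugged into the dual objective and the decision function in place of x_i^T x_j.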
Nonlinear SVM: the Optimization Problem
Lagrangian dual problem formulation: find α_1, α_2, ..., α_n such that
    Q(α) = Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j)
is maximized, such that
    Σ_i α_i y_i = 0, and α_i ≥ 0 for all i
The solution of the discriminant function (decision boundary) is:
    f(x) = Σ_i α_i y_i K(x_i, x) + b
The optimization technique for finding the α_i's is the same.
Nonlinear SVM: Summary
SVM locates a separating hyperplane in the feature space and classifies points in that space.
It does not need to represent the space explicitly; rather, it simply defines a kernel function.
The kernel function (a dot product in an "augmented" space) plays the role of the dot product in the feature space.
Support Vector Machine: Algorithm
1. Choose a kernel function
2. Choose a value for C
3. Solve the quadratic programming problem (many software packages are available)
4. Construct the discriminant function from the support vectors
Summary
Maximum margin classifier: better generalization ability & less over-fitting.
The kernel trick: map data points to a higher-dimensional space in order to make them linearly separable. Since only the dot product is used, we do not need to represent the mapping explicitly.
Issues
Choice of kernel function:
- The Gaussian or polynomial kernel is the default; if they are ineffective, more elaborate kernels are needed.
- Domain experts can give assistance in formulating appropriate similarity measures.
Choice of kernel parameters:
- e.g., σ in the Gaussian kernel: σ is on the order of the distance between the closest points with different classifications.
- In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.
Optimization criterion, hard margin vs. soft margin:
- A lengthy series of experiments in which various parameters are tested.
(This slide is courtesy of LINK)
Issues (2)
Maximum margin classifiers are known to be sensitive to scaling transformations applied to the features. Therefore it is essential to normalize the data! (See the slides on data transformation at the beginning of the course.)
Maximum margin classifiers are also sensitive to unbalanced data.
Hyper-parameter tuning (C, kernel): read
Properties of SVM
- Flexibility in choosing a similarity function.
- Ability to deal with large data sets: only support vectors are used to specify the separating hyperplane.
- Ability to handle large feature spaces: the complexity of the algorithm does not depend on the dimensionality of the feature space.
- Overfitting can be controlled by the soft margin approach.
- Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution (link).
Weakness of SVM
Sensitive to noise: a relatively small number of mislabeled examples can dramatically decrease the performance.
It only considers two classes: how to do multi-class classification with SVM?
Answer:
1) With output arity m, learn m SVMs:
   - SVM 1 learns "Output == 1" vs. "Output != 1"
   - SVM 2 learns "Output == 2" vs. "Output != 2"
   - ...
   - SVM m learns "Output == m" vs. "Output != m"
2) To predict the output for a new input, just predict with each SVM, and find out which one puts the prediction the furthest into the positive region.
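The one-vs-rest prediction step above can be sketched in a few lines. The per-class scoring functions below are made-up linear scores standing in for the m trained SVMs, purely for illustration:

```python
def predict_one_vs_rest(x, scorers):
    """scorers maps class label -> decision function f_c(x) = w_c^T x + b_c.
    The predicted class is the one whose score is furthest into the
    positive region."""
    return max(scorers, key=lambda c: scorers[c](x))

# Hypothetical decision functions for 3 classes (not real trained SVMs):
scorers = {
    1: lambda x: x[0] - 1.0,           # "is class 1" vs. rest
    2: lambda x: x[1] - 1.0,           # "is class 2" vs. rest
    3: lambda x: -x[0] - x[1] + 1.0,   # "is class 3" vs. rest
}
print(predict_one_vs_rest([2.0, 0.0], scorers))    # 1
print(predict_one_vs_rest([0.0, 2.0], scorers))    # 2
print(predict_one_vs_rest([-1.0, -1.0], scorers))  # 3
```

In a real pipeline each lambda would be replaced by the decision function Σ_i α_i y_i K(x_i, x) + b of the corresponding binary SVM.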
Additional Resources
Recommended lectures: LINK, LINK
Additional resource: LINK
LibSVM, a widely used implementation of SVM: LINK
Supplementary material: LINK, LINK, LINK, LINK, LINK
Interpretation of SVM learning: LINK, LINK
SVM sheet: LINK