A Reassuring Introduction to Support Vector Machines
Mark Stamp
Supervised vs Unsupervised
We often use supervised learning, where training relies on labeled data.
Training data must be pre-processed.
In contrast, unsupervised learning uses unlabeled data: no labels are required for training.
There are also semi-supervised algorithms. Supervised, but not too much?
HMM for Supervised Learning
Suppose we want to use an HMM for malware detection.
Train a model on a set of malware samples, all from one specific family, with the data labeled as malware of that type.
Test the resulting model to see how well it distinguishes malware from benign.
This is supervised learning.
Unsupervised Learning?
Recall the HMM English text example.
Using N = 2, we find hidden states correspond to consonants and vowels
We did not specify consonants/vowels; the HMM extracted this information from the raw data.
Unsupervised or semi-supervised? It seems to depend on your definition.
Unsupervised Learning
Clustering is the classic example of unsupervised learning. Other examples?
For a mixed dataset, clustering can reveal "hidden" structure.
No pre-processing is needed; often, we have no idea how to pre-process.
Usually used in "data exploration" mode.
Supervised Learning
SVM is one of the most popular supervised learning methods.
HMM, PHMM, PCA, MLP, etc., are also used for supervised learning.
SVM is for binary classification, i.e., 2 classes, such as malware vs benign.
SVM generalizes to multiple classes, as does LDA and many other techniques.
Support Vector Machine
According to another author, "SVMs are a rare example of a methodology where geometric intuition, elegant mathematics, theoretical guarantees, and practical algorithms meet."
We have something to say about each aspect of this: geometry, math, theory, and algorithms.
Support Vector Machine
SVM is based on four BIG ideas:
1. Separating hyperplane
2. Maximize the "margin", that is, maximize the minimum separation between the classes
3. Work in a higher dimensional space, where there is more "room", so it is easier to separate
4. Kernel trick, which is intimately related to idea 3
Ideas 1 and 2 are fairly intuitive.
SVM
SVM can apply to any training data.
Note that SVM yields a classifier, not just a scoring function.
With an HMM, for example, we first train a model, then generate scores and set a threshold.
SVM directly gives a classification, skipping the intermediate (testing) step.
Separating Classes
Consider labeled data for a binary classifier.
The red class is type "1" and the blue class is type "-1", where (x,y) are the features.
How to separate the two classes?
We'll use a "hyperplane", which is a line in this (2-dimensional) case.
Separating Hyperplanes
Consider labeled data that is easy to separate.
Draw a hyperplane to separate the points, then classify new data based on the separating hyperplane.
Which hyperplane is better? Or best? Why?
Maximize Margin
The margin is the minimum distance to misclassification; we want to maximize the margin.
Here, the yellow hyperplane is better than the purple one.
Maximizing the margin seems like a good idea, but it may not always be possible. See the next slide…
Separating… NOT
What about this case? The yellow line is not an option. Why not? It is no longer "separating".
What to do? Allow for some errors? E.g., the hyperplane need not completely separate the classes.
Soft Margin
Ideally, we want a large margin and no errors.
But allowing some misclassifications might increase the margin by a lot, i.e., we relax the "separating" requirement.
How many errors to allow? This will be a user-defined parameter.
The tradeoff: errors vs a larger margin.
In practice, trial and error determines the optimal tradeoff.
Feature Space
Transform the data to a "feature space".
The feature space is usually of higher dimension. But what about the curse of dimensionality?
Q: Why increase the dimensionality?
A: It is easier to separate in the feature space.
The goal is to make the data "linearly separable", so that we can separate the classes with a hyperplane, without paying a price for the high dimensionality.
Input Space & Feature Space
Why transform? Sometimes nonlinear can become linear…
[Figure: the map ϕ takes the input space to the feature space]
Feature Space in Higher Dimension
An example of what can happen when transforming to a higher dimension
Feature Space
Usually, a higher dimension is worse, from a computational complexity point of view, and from a statistical significance point of view.
But a higher dimensional feature space can make the data linearly separable.
Can we have our cake and eat it too? Linearly separable and easy to compute?
Yes! Thanks to the kernel trick.
Kernel Trick
The kernel trick enables us to work in the input space, with the results mapped to the feature space.
No work is done explicitly in the feature space.
Computations are in the input space: lower dimension, so computation is easier.
But things "happen" in the feature space: higher dimension, so it is easier to separate.
Very, very cool trick!
Kernel Trick
Unfortunately, to understand the kernel trick, we must dig a little (a lot?) deeper.
This will make all aspects of SVM clearer.
We won't cover every detail here, just enough to get the idea across. Well, maybe a little more than that…
We'll need Lagrange multipliers. But first, constrained optimization.
Constrained Optimization
The general problem (in 2 variables):
Maximize: f(x,y)
Subject to: g(x,y) = c
Here f is the objective function and g is the constraint.
For example,
Maximize: f(x,y) = 16 − (x² + y²)
Subject to: 2x − y = 4
We'll look at this example in detail.
Specific Example
Maximize: f(x,y) = 16 − (x² + y²)
Subject to: 2x − y = 4
[Figure: graph of f(x,y)]
Intersection
[Figure: intersection of f(x,y) and the constraint 2x − y = 4]
What is the solution to the problem?
Constrained Optimization
This example looks easy, but how do we solve such problems in general?
Recall, the general case (in 2 variables) is
Maximize: f(x,y)
Subject to: g(x,y) = c
How to "simplify"? We combine the objective function f(x,y) and the constraint g(x,y) = c into one equation!
Proposed Solution
Define J(x,y) = f(x,y) + I(x,y), where I(x,y) is 0 whenever g(x,y) = c and −∞ otherwise.
Recall the general problem:
Maximize: f(x,y)
Subject to: g(x,y) = c
The solution is given by max J(x,y), where the max is over all (x,y).
Proposed Solution
We know how to solve maximization problems using calculus.
So, we'll use calculus to solve max J(x,y), right?
WRONG! The function J(x,y) is not at all "nice": it is not differentiable. It's not even continuous!
Proposed Solution
Again, let J(x,y) = f(x,y) + I(x,y), where I(x,y) is 0 whenever g(x,y) = c and −∞ otherwise.
Then max J(x,y) is the solution to the problem. This is good.
But we can't solve this max problem. This is very bad.
What to do???
New-and-Improved Solution
Let’s replace I(
x,y) with “nice”
function
What are the nicest functions of all?
Linear function (in the constraint)
To maximize f(x,y
)
, subject
to
g(
x,y
) = c
we first define
the
Lagrangian
L(
x,y,λ
) = f(
x,y
) +
λ
(
g(
x,y
) – c
)
N
ice function in
λ
, so calculus applies
But, not just a max problem (next slide
…)
New-and-Improved Solution
Maximize: f(x,y), subject to: g(x,y) = c.
Again, the Lagrangian is
L(x,y,λ) = f(x,y) + λ(g(x,y) − c)
Observe that min L(x,y,λ) = J(x,y), where the min is over λ.
Recall that max J(x,y) solves the problem.
So max min L(x,y,λ) also solves the problem.
What is the advantage of this form of the problem?
Lagrange Multipliers
Maximize: f(x,y), subject to: g(x,y) = c.
Lagrangian: L(x,y,λ) = f(x,y) + λ(g(x,y) − c)
The solution is given by max min L(x,y,λ).
Note that the max is with respect to the (x,y) variables, and the min is with respect to the λ parameter.
So, the solution is at a "saddle point" of the overall function, i.e., with respect to the (x,y,λ) variables, by definition of a saddle point.
Saddle Points
[Figure: graph of L(x,λ) = 4 − x² + λ(x − 1)]
Note that f(x) = 4 − x² and the constraint is x = 1.
New-and-Improved Solution
Maximize: f(x,y), subject to: g(x,y) = c.
The Lagrangian is
L(x,y,λ) = f(x,y) + λ(g(x,y) − c)
This is solved by max min L(x,y,λ). Calculus to the rescue!
Note that setting the partial derivative with respect to λ to 0 implies g(x,y) = c, so the constraint is automatically enforced.
Lagrangian: constrained optimization converted to unconstrained optimization.
More, More, More
The Lagrangian generalizes to more variables and/or more constraints. Or, more succinctly,
L(x,λ) = f(x) + Σ λi (gi(x) − ci)
where x = (x1,x2,…,xn) and λ = (λ1,λ2,…,λm).
Another Example
There are lots of good geometric examples; first, we do a non-geometric example.
Consider a discrete probability distribution on n points: p1, p2, p3, …, pn.
What distribution has maximum entropy?
We want to maximize the entropy function, subject to the constraint that the pj form a probability distribution.
Maximize Entropy
Shannon entropy: −Σ pj log2 pj
We have a probability distribution, so we require 0 ≤ pj ≤ 1 for all j, and Σ pj = 1.
We will solve this simplified problem:
Maximize: f(p1,…,pn) = −Σ pj log2 pj
Subject to the constraint: Σ pj = 1
How should we solve this? Do you really have to ask?
Entropy Example
Recall L(x,y,λ) = f(x,y) + λ(g(x,y) − c).
Problem statement:
Maximize: f(p1,…,pn) = −Σ pj log2 pj
Subject to the constraint: Σ pj = 1
In this case, the Lagrangian is
L(p1,…,pn,λ) = −Σ pj log2 pj + λ(Σ pj − 1)
Compute partial derivatives with respect to each pj, and the partial derivative with respect to λ.
Entropy Example
We have L(p1,…,pn,λ) = −Σ pj log2 pj + λ(Σ pj − 1).
The partial derivative with respect to any pj yields
−log2 pj − 1/ln(2) + λ = 0 (#)
And the partial derivative with respect to λ yields the constraint
Σ pj − 1 = 0, or Σ pj = 1 (##)
Equation (#) implies all pj are equal; with equation (##), all pj = 1/n.
Conclusion? The uniform distribution maximizes entropy.
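As a sanity check, here is a minimal numerical sketch (assuming NumPy and SciPy are available; n = 4 is an arbitrary choice) that maximizes entropy subject to Σ pj = 1. The optimizer should land on the uniform distribution pj = 1/4.

```python
import numpy as np
from scipy.optimize import minimize

n = 4

def neg_entropy(p):
    # Negative Shannon entropy; minimizing this maximizes entropy.
    # The small epsilon avoids log(0) at the boundary.
    return np.sum(p * np.log2(p + 1e-12))

constraint = {"type": "eq", "fun": lambda p: np.sum(p) - 1}  # sum pj = 1
bounds = [(0, 1)] * n                                        # 0 <= pj <= 1

x0 = np.full(n, 1 / n) + np.random.uniform(-0.1, 0.1, n)    # perturbed start
result = minimize(neg_entropy, x0, bounds=bounds,
                  constraints=[constraint], method="SLSQP")
print(result.x)  # expected: approximately [0.25, 0.25, 0.25, 0.25]
```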
Notation
Let x = (x1,x2,…,xn) and λ = (λ1,λ2,…,λm).
Again, we write the Lagrangian as
L(x,λ) = f(x) + Σ λi (gi(x) − ci)
Note: L is a function of n+m variables.
We can view the problem as follows: the constraints gi define a feasible region, and we maximize the objective function f over this feasible region.
Lagrangian Duality
For Lagrange multipliers…
Primal problem: max min L(x,y,λ), where the max is over (x,y) and the min is over λ.
Dual problem: min max L(x,y,λ), where, as above, the max is over (x,y) and the min is over λ.
We claim it's easy to see that
min max L(x,y,λ) ≥ max min L(x,y,λ)
Why is this true? Next slide...
Dual Problem
Recall J(x,y) = f(x,y) + I(x,y), where I(x,y) is 0 whenever g(x,y) = c and −∞ otherwise, and max J(x,y) is a solution.
Then L(x,y,λ) ≥ J(x,y), and so max L(x,y,λ) ≥ max J(x,y) for all λ.
Therefore, min max L(x,y,λ) ≥ max J(x,y), which gives
min max L(x,y,λ) ≥ max min L(x,y,λ)
Dual Problem
So, we have shown that the dual problem provides an upper bound:
min max L(x,y,λ) ≥ max min L(x,y,λ)
That is, the dual solution ≥ the primal solution.
But it's even better than that: for the Lagrangian, equality holds.
Why equality? Because the Lagrangian is a convex function.
Primal Problem
Maximize: f(x,y) = 16 − (x² + y²)
Subject to: 2x − y = 4
Then L(x,y,λ) = 16 − (x² + y²) + λ(2x − y − 4)
Compute partial derivatives:
dL/dx = −2x + 2λ = 0
dL/dy = −2y − λ = 0
dL/dλ = 2x − y − 4 = 0
Result: (x,y,λ) = (8/5, −4/5, 8/5), which yields the maximum f(x,y) = 64/5.
Dual Problem
Maximize: f(x,y) = 16 − (x² + y²)
Subject to: 2x − y = 4
Then L(x,y,λ) = 16 − (x² + y²) + λ(2x − y − 4)
Recall that the dual problem is min max L(x,y,λ), where the max is over (x,y) and the min is over λ.
How can we solve this?
Dual Problem
Dual problem: min max L(x,y,λ)
So, we can first take the max of L over (x,y).
Then we are left with a function L(λ) in λ only.
To solve the problem, we then find min L(λ).
On the next slide, we illustrate this for
L(x,y,λ) = 16 − (x² + y²) + λ(2x − y − 4)
the same example as considered above.
Dual Problem
Given L(x,y,λ) = 16 − (x² + y²) + λ(2x − y − 4)
Maximize over (x,y) by computing
dL/dx = −2x + 2λ = 0
dL/dy = −2y − λ = 0
which implies x = λ and y = −λ/2.
Substitute these into L to obtain
L(λ) = (5/4)λ² − 4λ + 16
Dual Problem
The original problem:
Maximize: f(x,y) = 16 − (x² + y²)
Subject to: 2x − y = 4
The solution can be found by minimizing
L(λ) = (5/4)λ² − 4λ + 16
Then L′(λ) = (5/2)λ − 4 = 0, which gives λ = 8/5 and (x,y) = (8/5, −4/5).
The same solution as the primal problem!
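As a quick sanity check on the worked example, here is a minimal sketch (assuming SciPy is available) that solves the same constrained problem numerically; it should agree with (x,y) = (8/5, −4/5) and f(x,y) = 64/5.

```python
import numpy as np
from scipy.optimize import minimize

f = lambda v: 16 - (v[0]**2 + v[1]**2)                     # objective f(x,y)
neg_f = lambda v: -f(v)                                    # minimize -f to maximize f
con = {"type": "eq", "fun": lambda v: 2*v[0] - v[1] - 4}   # constraint 2x - y = 4

res = minimize(neg_f, x0=np.zeros(2), constraints=[con], method="SLSQP")
print(res.x, f(res.x))  # expected: approximately [1.6, -0.8] and 12.8 (= 64/5)
```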
Summary of Dual Problem
Maximize L over (x,y) to find (x,y) in terms of λ.
Then rewrite L as a function of λ only.
Finally, minimize L(λ) to solve the problem.
But why all of the fuss? The dual allows us to write the SVM problem in a much more user-friendly way.
In SVM, we'll consider the dual L(λ).
Lagrange Multipliers and SVM
Lagrange multipliers are very cool indeed, but what does this have to do with SVM?
We can view the (soft) margin computation as a constrained optimization problem.
In this form, the kernel trick becomes clear. We kill 2 birds with 1 stone:
Make the margin calculation clearer
Make the kernel trick perfectly clear
Problem Setup
Let X1,X2,…,Xn be data points (vectors).
Suppose each Xi = (xi,yi) is a point in the plane; in general, the points could be of higher dimension.
Let z1,z2,…,zn be the corresponding class labels, where each zi ∈ {−1,1}:
zi = 1 if Xi is classified as the "red" type
zi = −1 if Xi is classified as the "blue" type
Note this is a binary classification.
Geometric View
Equation of the yellow line: w1x + w2y + b = 0
Equation of the red line: w1x + w2y + b = 1
Equation of the blue line: w1x + w2y + b = −1
The margin m is the length of the green line.
[Figure: the three lines in the (x,y) plane, with margin m]
Training
Any red point X = (x,y) must satisfy w1x + w2y + b ≥ 1
Any blue point X = (x,y) must satisfy w1x + w2y + b ≤ −1
We want all of these inequalities to be true after training.
Scoring
With the lines defined, given a new data point X = (x,y) to classify:
"Red" provided that w1x + w2y + b > 0
"Blue" provided that w1x + w2y + b < 0
This is the scoring phase.
How to Train?
The real question is: how do we find the equation of the yellow line?
We are given {Xi} and {zi}, where the Xi are points in the plane and the zi are the classes.
Finding the yellow line is the SVM training phase.
Maximize the Margin
The distance from the origin to the line Ax + By + C = 0 is |C| / sqrt(A² + B²).
Origin to the red dashed line: |1 − b| / sqrt(w1² + w2²), where W = (w1,w2)
Origin to the blue dashed line: |−1 − b| / sqrt(w1² + w2²)
The margin is m = 2 / sqrt(w1² + w2²)
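To make the margin formula concrete, a tiny sketch (plain NumPy; the weight vector and bias are illustrative values of my choosing, not from the slides):

```python
import numpy as np

w = np.array([3.0, 4.0])  # hypothetical weights (w1, w2) from training
b = -2.0                  # hypothetical bias

margin = 2 / np.linalg.norm(w)  # m = 2 / sqrt(w1^2 + w2^2)
print(margin)                   # 2 / 5 = 0.4 for this choice of w
```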
Training Phase
Given {Xi} and {zi}, find the largest margin m that classifies all points correctly.
That is, find the red and blue lines in the picture.
Recall the red line is of the form w1x + w2y + b = 1, and the blue line is of the form w1x + w2y + b = −1.
And the margin is m = 2 / sqrt(w1² + w2²).
Training
Since zi ∈ {−1,1}, correct classification occurs provided that
zi(w1xi + w2yi + b) ≥ 1 for all i
The training problem to solve:
Maximize: m = 2 / sqrt(w1² + w2²)
Subject to the constraints: zi(w1xi + w2yi + b) ≥ 1 for i = 1,2,…,n
Can we determine W = (w1,w2) and b?
Training
The problem on the previous slide is equivalent to the following:
Minimize: F(W) = (w1² + w2²) / 2
Subject to the constraints: 1 − zi(w1xi + w2yi + b) ≤ 0 for all i
This should be starting to look familiar…
Lagrangian
Pretend the inequalities are equalities…
L(w1,w2,b,λ) = (w1² + w2²)/2 + Σ λi (1 − zi(w1xi + w2yi + b))
Compute
dL/dw1 = w1 − Σ λi zi xi = 0
dL/dw2 = w2 − Σ λi zi yi = 0
dL/db = Σ λi zi = 0
dL/dλi = 1 − zi(w1xi + w2yi + b) = 0
Lagrangian
The derivatives yield the constraints and
W = (w1,w2) = Σ λi zi Xi and Σ λi zi = 0
Substituting these into L yields
L(λ) = Σ λi − ½ ΣΣ λi λj zi zj (Xi · Xj)
where "·" is the dot product: Xi · Xj = xixj + yiyj
Here, L is only a function of λ. We still have the constraint Σ λi zi = 0.
Note: if we find the λi, then we know W.
New-and-Improved Problem
Maximize: L(λ) = Σ λi − ½ ΣΣ λi λj zi zj (Xi · Xj)
Subject to: Σ λi zi = 0 and all λi ≥ 0
Why maximize L(λ)? Intuitively…
The goal is to minimize F(W) = (w1² + w2²) / 2, subject to the constraints in the L(λ) function.
Maximizing L(λ) finds the "best" parameters λ, and the "best" λ will solve this min problem.
Recall, this is the dual problem.
Dual Version of Problem
Maximize: L(λ) = Σ λi − ½ ΣΣ λi λj zi zj (Xi · Xj)
Subject to: Σ λi zi = 0 and all λi ≥ 0
Again, this is the dual problem.
We can always solve it (if a solution exists), and we will find a global maximum.
It doesn't get any better than that! With an HMM (for example), there is no guarantee of a global maximum.
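To make the dual concrete, here is a minimal sketch (assuming SciPy; the toy data points are my own) that solves the dual for a tiny linearly separable set, then recovers W = Σ λi zi Xi and b from a support vector:

```python
import numpy as np
from scipy.optimize import minimize

# Toy training data: two "red" points (z = +1) and two "blue" points (z = -1)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
z = np.array([1.0, 1.0, -1.0, -1.0])
Q = (z[:, None] * z[None, :]) * (X @ X.T)  # Q_ij = z_i z_j (X_i . X_j)

neg_L = lambda lam: 0.5 * lam @ Q @ lam - lam.sum()  # minimize -L(lambda)
cons = {"type": "eq", "fun": lambda lam: lam @ z}    # sum lambda_i z_i = 0
bounds = [(0, None)] * len(z)                        # lambda_i >= 0

lam = minimize(neg_L, np.zeros(len(z)), bounds=bounds,
               constraints=[cons], method="SLSQP").x
W = (lam * z) @ X        # W = sum lambda_i z_i X_i
sv = np.argmax(lam)      # a support vector has lambda_i > 0
b = z[sv] - W @ X[sv]    # z_i (W . X_i + b) = 1 on support vectors
print(lam, W, b)
```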
All Together Now: Training
Given data points X1,X2,…,Xn, label each Xi with zi ∈ {−1,1}.
Solve the dual problem (previous slide); solving it yields the λi.
Once the λi are known, compute W = (w1,w2) and b.
This gives the equation of the line: w1x + w2y + b.
What have we accomplished?
All Together Now: Scoring
From training, we’ve found λi
Yields W=(w1
,w
2
) and
b in w
1
x + w
2
y +
b
Given new data point
X = (
x,y
)
That is,
X
not in training set
Compute
w
1
x + w
2
y +
b
If
greater than
0,
classify
X
as
“red” (
+1
)
Otherwise, classify
X
as
“blue” (
-1
)
What happened, in terms of picture?
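The whole train-then-score pipeline is a few lines with an off-the-shelf solver. A minimal sketch using scikit-learn's SVC (the toy data and parameter choices are mine, not from the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Toy labeled data: class +1 ("red") and class -1 ("blue")
X = np.array([[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]])
z = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0)  # linear kernel, soft margin parameter C
clf.fit(X, z)                      # training: solves the dual for the lambda_i

w1, w2 = clf.coef_[0]              # W = (w1, w2)
b = clf.intercept_[0]
print(w1, w2, b)                      # the separating line w1*x + w2*y + b = 0
print(2 / np.linalg.norm(clf.coef_))  # margin m = 2 / sqrt(w1^2 + w2^2)

X_new = np.array([[1, 2], [-1, -2]])
print(clf.predict(X_new))          # scoring: sign of w1*x + w2*y + b
```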
Geometric Viewpoint
Training? Find the equation of the yellow line, f(X).
Score X = (x,y)? If f(X) > 0, then X is above the yellow line (classify as red); else X is below the line (classify as blue).
Scoring Revisited
We can use the yellow line for scoring… but there is an alternative (better) way.
We have f(X) = w1x + w2y + b = W · X + b.
Recall that W = Σ λi zi Xi.
So, f(X) = Σ λi zi (Xi · X) + b.
Why is this better? There is no need to explicitly compute W.
Any better reasons why it's better?
Support Vectors
When solving for L(λ), we find that most λi = 0; specifically, λi = 0 for the Xi for which
zi(w1xi + w2yi + b) > 1
The only constraints that can matter are those with
zi(w1xi + w2yi + b) = 1
The latter Xi are the support vectors.
They are not known in advance: training determines the support vectors.
Support Vectors
A picture is worth 1000 words? Where are the support vectors?
The other vectors (training points) don't matter. Why not?
[Figure: support vectors are the training points lying on the margin lines]
Scoring Re-revisited
Score X using f(X) = Σ λi zi (Xi · X) + b.
Generally, most of the λi will be 0, so the sum is not really i = 1 to n.
Instead, the sum is i = 1 to s, where s is the number of support vectors.
Why does this matter? Typically, n is large and s is small, so scoring is fast.
This form of f(X) is useful for other reasons too.
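This sparse form of f(X) is exactly what fitted SVM libraries expose. A sketch (again assuming scikit-learn, continuing the earlier toy example) that rebuilds f(X) from only the support vectors and checks it against the library's own decision function; note that sklearn's dual_coef_ stores the products λi zi:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]])
z = np.array([1, 1, 1, -1, -1, -1])
clf = SVC(kernel="linear", C=1.0).fit(X, z)

def f(X_new):
    # f(X) = sum over support vectors of (lambda_i z_i)(X_i . X) + b
    return clf.dual_coef_[0] @ (clf.support_vectors_ @ X_new) + clf.intercept_[0]

X_test = np.array([1.0, 2.0])
print(f(X_test), clf.decision_function([X_test])[0])  # the two values should match
```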
Training: Soft Margin
Suppose we relax "linearly separable", trading errors for a bigger margin m.
More errors, but we gain a bigger margin.
Two types of errors are illustrated here…
[Figure: soft margin, with errors inside the margin and on the wrong side of the hyperplane]
Errors
To account for errors, introduce "slack variables" εi ≥ 0 into the optimization.
For a red point Xi = (xi,yi), the constraint is w1xi + w2yi + b ≥ 1 − εi
For a blue point Xi = (xi,yi), the constraint is w1xi + w2yi + b ≤ −1 + εi
Minimize: (w1² + w2²) / 2 + C Σ εi
Subject to the constraints above.
Dual Problem
Working through the details, the dual problem is…
Maximize: L(λ) = Σ λi − ½ ΣΣ λi λj zi zj (Xi · Xj)
Subject to: Σ λi zi = 0 and C ≥ λi ≥ 0
The only difference is the C ≥ λi condition; we specify C when training.
Bottom line? Allowing for errors changes the constraints.
Training and Scoring Re-re-revisited
Training
Maximize: L(λ) = Σ λi − ½ ΣΣ λi λj zi zj (Xi · Xj)
Subject to: Σ λi zi = 0 and C ≥ λi ≥ 0
where C is specified by the user.
Scoring: given X = (x,y)
Compute f(X) = Σ λi zi (Xi · X) + b, where the sum is over the support vectors.
If f(X) < 0, then X is "blue"; else it's "red".
Kernel Trick
Finally, we can make sense of the kernel trick.
Recall that X1,X2,…,Xn are the training vectors.
For training, the Xi only appear in the form Xi · Xj.
When scoring X, the Xi only appear in the form Xi · X.
We can transform the input data to a feature space, then compute dot products in the feature space.
The effect is to replace "·" with something defined in a higher dimension.
And since it is a dot product in the feature space, it is easy to compute (in terms of the input space).
Kernel Example
For example, suppose we define
ϕ(x,y) = (1, √2 x, √2 y, x², y², √2 xy)
Then ϕ maps an element of the 2-d input space to the 6-d feature space.
For Xi = (xi,yi) and Xj = (xj,yj), we have
ϕ(Xi) · ϕ(Xj) = (1 + xixj + yiyj)² = (Xi · Xj + 1)²
Define the kernel function K as
K(Xi,Xj) = (Xi · Xj + 1)²
Note: K is the composition of ϕ and "·".
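A quick numeric check of this identity (plain NumPy; the two sample points are arbitrary, and the 6-d map is the standard choice matching this kernel): the explicit feature-space dot product and the kernel shortcut should agree.

```python
import numpy as np

def phi(p):
    # Explicit 6-d feature map for the kernel (Xi . Xj + 1)^2
    x, y = p
    s = np.sqrt(2)
    return np.array([1, s*x, s*y, x*x, y*y, s*x*y])

K = lambda a, b: (a @ b + 1) ** 2  # kernel computed in the 2-d input space

Xi, Xj = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(phi(Xi) @ phi(Xj))  # dot product in the 6-d feature space
print(K(Xi, Xj))          # same value from 2-d: (1*3 + 2*(-1) + 1)^2 = 4.0
```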
Kernel Example
In the input space, Xi and Xj are 2-d.
Map Xi and Xj to the 6-d feature space; these are ϕ(Xi) and ϕ(Xj).
Perform the dot product in the feature space, that is, compute ϕ(Xi) · ϕ(Xj).
The math works whether we use the 2-d input space (Xi, Xj) or the 6-d feature space (ϕ(Xi), ϕ(Xj)).
The dot product is in the feature space!
The Big Picture
Training data lives in the input space, where the data may not be linearly separable.
Map the input space to a higher dimension feature space using the function ϕ.
Do training and scoring in the feature space, where the data may be linearly separable.
But we don't want to suffer a performance penalty due to the higher dimension.
Choose the kernel function wisely…
Training & Scoring with Kernel
Simply replace Xi · Xj with K(Xi,Xj).
Training
Maximize: L(λ) = Σ λi − ½ ΣΣ λi λj zi zj K(Xi,Xj)
Subject to: Σ λi zi = 0 and C ≥ λi ≥ 0
where C is specified by the user.
Scoring: given X = (x,y)
Compute f(X) = Σ λi zi K(Xi,X) + b
If f(X) < 0, then X is "blue"; else "red".
Kernel Trick
There is no need to explicitly map the input data to the feature space.
We don't even need to know ϕ; we only need the resulting kernel function K.
Bottom line: we obtain the benefit of working in a higher dimension space (linearly separable), with no significant performance penalty.
That's a really awesome trick!
Popular Kernels
Linear kernel: K(Xi,Xj) = Xi · Xj
Polynomial learning machine: K(Xi,Xj) = (Xi · Xj + 1)^p
Gaussian radial-basis function (RBF): K(Xi,Xj) = exp(−(Xi − Xj) · (Xi − Xj) / (2σ²))
Two-layer perceptron: K(Xi,Xj) = tanh(β0 Xi · Xj + β1)
Many other possibilities exist; selecting the "right" kernel is the real trick.
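For reference, a direct NumPy transcription of these four kernels (the parameter values p, σ, β0, and β1 are placeholders to be chosen by the user):

```python
import numpy as np

def linear(Xi, Xj):
    return Xi @ Xj                             # Xi . Xj

def polynomial(Xi, Xj, p=2):
    return (Xi @ Xj + 1) ** p                  # (Xi . Xj + 1)^p

def rbf(Xi, Xj, sigma=1.0):
    d = Xi - Xj
    return np.exp(-(d @ d) / (2 * sigma**2))   # exp(-|Xi - Xj|^2 / (2 sigma^2))

def perceptron(Xi, Xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (Xi @ Xj) + beta1)  # tanh(beta0 Xi . Xj + beta1)
```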
A Brief History of SVM
SVM was invented in 1962, but it was not actually useful until the 1990s.
Nonlinearity, that is, the kernel trick, was added in 1992.
Soft margins were developed in 1993, although not published until 1995.
SVM is a relative newcomer, and it is only useful if we can efficiently train...
Algorithm to Train SVM?
To train a (linear) SVM, we must solve the quadratic programming (QP) problem given by the dual:
Maximize: L(λ) = Σ λi − ½ ΣΣ λi λj zi zj K(Xi,Xj)
Subject to: Σ λi zi = 0 and C ≥ λi ≥ 0
Once the λi are known, we classify samples using
f(X) = Σ λi zi K(Xi,X) + b
How to Solve the QP Problem?
Many general techniques are available to solve QP problems: interior point, conjugate gradient, …
But the SVM problem is very special:
The solution is "sparse"
We have inequality constraints
The problem is LARGE (more on this later…)
So, SVM is not a standard QP case.
Peculiarities of SVM Training
Compared to "standard" QP problems, SVM does not require great precision; this is true of most ML algorithms.
The solution to the SVM problem is sparse: most of the λi will be 0, but we don't know which will be 0 in advance.
The number of training samples and the size of each training vector can both be HUGE.
QP Solvers
General QP solvers precompute Kij = K(Xi,Xj) for all i and j, and store the entire K matrix in memory.
This makes the algorithms efficient, but it is not suitable for SVM training.
What to do? Recall that the SVM solution is "sparse".
Can we take advantage of this sparseness?
Early SVM Solvers
"…all early SVM results were obtained using ad hoc algorithms," where "early" means until the mid-to-late 1990s.
One idea: combine "iterative chunking" with "simple direction search".
Chunking means that we only deal with part of the training data at a time.
Then optimize by direction search, based on gradient computations.
Iterative Chunking
We can't deal with the whole problem at once, so a possible plan of attack is…
Solve SVM on a "working set", i.e., a subset (or chunk) of the training data.
Then consider the data points inside this margin; these are candidate support vectors!
So, use them as the next working set, and repeat.
Sounds very plausible!
Direction Search
Choose a "good" direction, kind of like Newton's method, conjugate gradient, or gradient descent…
Improve on the current solution by moving in a "good" direction.
We won't discuss the details here; the idea is fairly straightforward, but the details can be somewhat complex.
SMO
Current best SVM solvers use SMO: Sequential Minimal Optimization, first proposed in 1999.
Divide into minimal QP sub-problems, with a minimal "working set" (just 2 variables).
This makes the direction computation easy.
The convergence/termination properties of SMO are good and well understood.
SMO Working Set
In SMO, the working set is always 2 variables.
How to select the pair (i,j) to speed convergence?
Maximize the gain in the objective function? That requires an exhaustive search, which is too inefficient.
Instead, use heuristics so that…
Cached kernel values are used
The (dual) objective function increases
And don't worry too much about precision.
Simplified SMO
SSMO is like SMO, but simplified!
We'll ignore the heuristics used to select the working set pair (i,j) at each step, and assume the working set is given at each step.
The homework considers simplified scenarios for generating the working set.
Otherwise, the SSMO algorithm we discuss is (almost) the same as SMO.
SSMO
Solve the dual problem, that is, find the λi in the Lagrangian (dual), given n training samples (Xi,zi). A sketch of the core update appears below.
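To give a feel for the two-variable sub-problem, here is a hedged sketch of the update for a working-set pair (i,j), following the simplified SMO in the Ng notes cited at the end (not necessarily the exact algorithm from the book): optimize λj analytically, clip it to [L,H], then set λi to preserve Σ λi zi = 0.

```python
import numpy as np

def smo_step(i, j, lam, b, z, C, K):
    """One SSMO update on the working pair (i, j). Returns updated (lam, b).
    K is the precomputed kernel matrix, K[i, j] = K(X_i, X_j)."""
    f = lambda k: (lam * z) @ K[:, k] + b  # f(X_k) = sum lam_i z_i K(X_i,X_k) + b
    Ei, Ej = f(i) - z[i], f(j) - z[j]      # prediction errors
    # Bounds [L, H] keep 0 <= lam <= C and preserve sum lam_i z_i = 0
    if z[i] != z[j]:
        L, H = max(0, lam[j] - lam[i]), min(C, C + lam[j] - lam[i])
    else:
        L, H = max(0, lam[i] + lam[j] - C), min(C, lam[i] + lam[j])
    eta = 2 * K[i, j] - K[i, i] - K[j, j]  # second derivative along the pair
    if L == H or eta >= 0:
        return lam, b                      # no progress possible on this pair
    lam_j = np.clip(lam[j] - z[j] * (Ei - Ej) / eta, L, H)
    lam_i = lam[i] + z[i] * z[j] * (lam[j] - lam_j)
    # Update b so the KKT conditions hold on the updated multipliers
    b1 = b - Ei - z[i] * (lam_i - lam[i]) * K[i, i] - z[j] * (lam_j - lam[j]) * K[i, j]
    b2 = b - Ej - z[i] * (lam_i - lam[i]) * K[i, j] - z[j] * (lam_j - lam[j]) * K[j, j]
    lam[i], lam[j] = lam_i, lam_j
    if 0 < lam_i < C:
        b = b1
    elif 0 < lam_j < C:
        b = b2
    else:
        b = (b1 + b2) / 2
    return lam, b
```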
SSMO Details
The book has the details of the SSMO algorithm.
The key is the so-called "KKT conditions": necessary and sufficient conditions for a solution of the SVM (QP) problem.
We enforce the KKT conditions by the choice of b in the SMO algorithm (and SSMO).
SMO Bottom Line
SMO is fast and efficient for the SVM training problem.
It can solve problems with a very LARGE number of training vectors, and the individual vectors can be LARGE too.
Choosing the "working set" at each step is the tricky part, that is, how to choose the indices (i,j) in a "good" way.
Heuristics for choosing (i,j) speed up convergence.
SVM +’s and –’s
Strengths
In training, we obtain a global maximum, not just a local maximum
We can trade off margin and training errors
The kernel trick is totally awesome
Weaknesses
Choosing a kernel is more art than science
Success depends heavily on the kernel (and parameter) choice(s)
References
R. Berwick, An idiot's guide to support vector machines
E. Kim, Everything you wanted to know about the kernel trick (but were too afraid to ask)
M. Law, A simple introduction to support vector machines
W.S. Noble, What is a support vector machine?, Nature Biotechnology, 24(12):1565-1567, 2006
References: Lagrange Multipliers
D. Klein, Lagrange multipliers without permanent scarring
Wikipedia, Lagrange multiplier
References: SMO Algorithm
A. Ng, Simplified SMO algorithm, http://cs229.stanford.edu/materials/smo.pdf
L. Bottou and C.-J. Lin, Support vector machine solvers, https://www.csie.ntu.edu.tw/~cjlin/papers/bottou_lin.pdf