Presentation Transcript

Slide 1: A Reassuring Introduction to Support Vector Machines

Mark Stamp

Slide 2: Supervised vs Unsupervised

- Often use supervised learning, where training relies on labeled data
- Training data must be pre-processed
- In contrast, unsupervised learning uses unlabeled data; no labels required for training
- Also semi-supervised algorithms: supervised, but not too much?

Slide 3: HMM for Supervised Learning

- Suppose we want to use an HMM for malware detection
- Train the model on a set of malware, all from one specific family, with the data labeled as malware of that type
- Test the resulting model to see how well it distinguishes malware from benign
- This is supervised learning

Slide 4: Unsupervised Learning?

- Recall the HMM for English text example
- Using N = 2, we find the hidden states correspond to consonants and vowels
- We did not specify consonants/vowels; the HMM extracted this info from the raw data
- Unsupervised or semi-supervised? It seems to depend on your definition

Slide 5: Unsupervised Learning

- Clustering is the classic example of unsupervised learning
- Other examples?
- For a mixed dataset, clustering can reveal "hidden" structure
- No pre-processing; often, no idea how to pre-process
- Usually used in "data exploration" mode

Slide 6: Supervised Learning

- SVM is one of the most popular supervised learning methods
- HMM, PHMM, PCA, MLP, etc., are also used for supervised learning
- SVM is for binary classification, i.e., 2 classes, such as malware vs benign
- SVM generalizes to multiple classes, as do LDA and many other techniques

Slide 7: Support Vector Machine

- According to another author, "SVMs are a rare example of a methodology where geometric intuition, elegant mathematics, theoretical guarantees, and practical algorithms meet"
- We have something to say about each aspect of this: geometry, math, theory, and algorithms

Slide 8: Support Vector Machine

SVM is based on four BIG ideas:
1. Separating hyperplane
2. Maximize the "margin", i.e., maximize the minimum separation between the classes
3. Work in a higher dimensional space, where there is more "room", so it is easier to separate
4. The kernel trick, which is intimately related to 3

Ideas 1 and 2 are fairly intuitive.

Slide 9: SVM

- SVMs can apply to any training data
- Note that SVM yields a classifier, not just a scoring function
- With an HMM, for example, we first train a model, then generate scores and set a threshold
- SVM directly gives a classification, skipping the intermediate (testing) step

Slide 10: Separating Classes

- Consider labeled data for a binary classifier
- The red class is type "1", the blue class is type "-1", and (x,y) are the features
- How to separate? We'll use a "hyperplane", a line in this case

Slide 11: Separating Hyperplanes

- Consider labeled data; here, it is easy to separate
- Draw a hyperplane to separate the points
- Classify new data based on the separating hyperplane
- Which hyperplane is better? Or best? Why?

Slide 12: Maximize Margin

- The margin is the minimum distance to misclassification
- Maximize the margin
- The yellow hyperplane is better than the purple one
- Seems like a good idea, but it may not be possible; see the next slide

Slide 13: Separating… NOT

- What about this case? The yellow line is not an option. Why not? It is no longer "separating"
- What to do? Allow for some errors? E.g., the hyperplane need not completely separate

Slide 14: Soft Margin

- Ideally, a large margin and no errors
- But allowing some misclassifications might increase the margin by a lot, i.e., relax the "separating" requirement
- How many errors to allow? This will be a user-defined parameter
- Tradeoff? Errors vs a larger margin
- In practice, trial and error to determine the optimal tradeoff

Slide 15: Feature Space

- Transform the data to a "feature space"
- The feature space is usually of higher dimension; but what about the curse of dimensionality?
- Q: Why increase the dimensionality? A: It is easier to separate in the feature space
- The goal is to make the data "linearly separable", i.e., to separate the classes with a hyperplane, but not pay a price for the high dimensionality

Slide 16: Input Space & Feature Space

- Why transform? Sometimes nonlinear can become linear…
- [Figure: a map ϕ from the input space to the feature space]

Slide 17: Feature Space in Higher Dimension

- An example of what can happen when transforming to a higher dimension

Slide 18: Feature Space

- Usually, higher dimension is worse, from a computational complexity point of view and from a statistical significance point of view
- But a higher dimensional feature space can make the data linearly separable
- Can we have our cake and eat it too? Linearly separable and easy to compute?
- Yes! Thanks to the kernel trick

Slide 19: Kernel Trick

- Enables us to work in the input space, with the results mapped to the feature space
- No work is done explicitly in the feature space
- Computations are in the input space: lower dimension, so computation is easier
- But things "happen" in the feature space: higher dimension, so easier to separate
- Very, very cool trick!

Slide 20: Kernel Trick

- Unfortunately, to understand the kernel trick, we must dig a little (a lot?) deeper
- This will make all aspects of SVM clearer
- We won't cover every detail here, just enough to get the idea across (well, maybe a little more than that…)
- We'll need Lagrange multipliers, but first, constrained optimization

Slide 21: Constrained Optimization

General problem (in 2 variables):
  Maximize: f(x,y)
  Subject to: g(x,y) = c
with objective function f and constraint g.

For example,
  Maximize: f(x,y) = 16 - (x^2 + y^2)
  Subject to: 2x - y = 4
We'll look at this example in detail.

Slide 22: Specific Example

Maximize: f(x,y) = 16 - (x^2 + y^2)
Subject to: 2x - y = 4
[Figure: graph of f(x,y)]

Slide 23: Intersection

Intersection of f(x,y) and 2x - y = 4.
What is the solution to the problem?
[Figure: the surface f(x,y) intersected with the constraint 2x - y = 4]

Slide 24: Constrained Optimization

- This example looks easy, but how to solve in general?
- Recall, the general case (in 2 variables) is: Maximize f(x,y), subject to g(x,y) = c
- How to "simplify"? We combine the objective function f(x,y) and the constraint g(x,y) = c into one equation!

Slide 25: Proposed Solution

Define J(x,y) = f(x,y) + I(x,y), where I(x,y) is 0 whenever g(x,y) = c, and -∞ otherwise.

Recall the general problem:
  Maximize: f(x,y)
  Subject to: g(x,y) = c

The solution is given by max J(x,y), where the max is over all (x,y).
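To see concretely why this J is awkward to optimize, here is a minimal sketch (Python with numpy; the function names are mine, chosen only for illustration) of J for the running example f(x,y) = 16 - (x^2 + y^2) with constraint 2x - y = 4. The value jumps to -∞ the moment the constraint is violated.

```python
import numpy as np

def f(x, y):
    # Objective from the running example
    return 16 - (x**2 + y**2)

def J(x, y, c=4.0):
    # J(x,y) = f(x,y) when the constraint 2x - y = c holds, and -infinity otherwise
    return np.where(np.isclose(2*x - y, c), f(x, y), -np.inf)

print(J(2.0, 0.0))   # constraint satisfied: 2*2 - 0 = 4, so J = f(2,0) = 12.0
print(J(2.0, 0.1))   # constraint violated: J = -inf
```

Because J is built from this on/off indicator, it is neither continuous nor differentiable, which is exactly the problem raised on the next slide.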

Slide 26: Proposed Solution

- We know how to solve maximization problems using calculus
- So, we'll use calculus to solve the problem max J(x,y), right? WRONG!
- The function J(x,y) is not at all "nice": it is not differentiable, and it's not even continuous!

Slide 27: Proposed Solution

- Again, let J(x,y) = f(x,y) + I(x,y), where I(x,y) is 0 whenever g(x,y) = c, and -∞ otherwise
- Then max J(x,y) is the solution to the problem; this is good
- But we can't solve this max problem; this is very bad
- What to do???

Slide 28: New-and-Improved Solution

- Let's replace I(x,y) with a "nice" function
- What are the nicest functions of all? Linear functions (in the constraint)
- To maximize f(x,y) subject to g(x,y) = c, we first define the Lagrangian
    L(x,y,λ) = f(x,y) + λ(g(x,y) - c)
- This is a nice function in λ, so calculus applies
- But it is not just a max problem (next slide…)

Slide 29: New-and-Improved Solution

- Maximize: f(x,y), subject to: g(x,y) = c
- Again, the Lagrangian is L(x,y,λ) = f(x,y) + λ(g(x,y) - c)
- Observe that min L(x,y,λ) = J(x,y), where the min is over λ
- Recall that max J(x,y) solves the problem
- So max min L(x,y,λ) also solves the problem
- Advantage of this form of the problem?

Slide 30: Lagrange Multipliers

- Maximize: f(x,y), subject to: g(x,y) = c
- Lagrangian: L(x,y,λ) = f(x,y) + λ(g(x,y) - c)
- The solution is given by max min L(x,y,λ)
- Note this is max wrt the (x,y) variables, and min is wrt the λ parameter
- So, the solution is at a "saddle point" of the overall function, i.e., in the (x,y,λ) variables, by the definition of a saddle point

Slide 31: Saddle Points

Graph of L(x,λ) = 4 - x^2 + λ(x - 1).
Note, f(x) = 4 - x^2 and the constraint is x = 1.
[Figure: the saddle-shaped surface of L(x,λ)]

Slide 32: New-and-Improved Solution

- Maximize: f(x,y), subject to: g(x,y) = c
- The Lagrangian is L(x,y,λ) = f(x,y) + λ(g(x,y) - c)
- Solved by max min L(x,y,λ)
- Calculus to the rescue! Set the partial derivatives of L to zero; the partial wrt λ implies g(x,y) = c
- Lagrangian: constrained optimization converted to unconstrained optimization

Slide 33: More, More, More

- The Lagrangian generalizes to more variables and/or more constraints
- Or, more succinctly: L(x,λ) = f(x) + Σ λi (gi(x) - ci)
- Where x = (x1, x2, …, xn) and λ = (λ1, λ2, …, λm)

Slide 34: Another Example

- Lots of good geometric examples; first, we do a non-geometric example
- Consider a discrete probability distribution on n points: p1, p2, p3, …, pn
- What distribution has max entropy?
- We want to maximize the entropy function, subject to the constraint that the pj form a probability distribution

Slide 35: Maximize Entropy

- Shannon entropy: -Σ pj log2 pj
- We have a probability distribution, so require 0 ≤ pj ≤ 1 for all j, and Σ pj = 1
- We will solve this simplified problem (dropping the minus sign does not move the stationary point):
    Maximize: f(p1,…,pn) = Σ pj log2 pj
    Subject to the constraint: Σ pj = 1
- How should we solve this? Do you really have to ask?

Slide 36: Entropy Example

- Recall L(x,y,λ) = f(x,y) + λ(g(x,y) - c)
- Problem statement:
    Maximize: f(p1,…,pn) = Σ pj log2 pj
    Subject to the constraint: Σ pj = 1
- In this case, the Lagrangian is
    L(p1,…,pn,λ) = Σ pj log2 pj + λ(Σ pj - 1)
- Compute the partial derivatives wrt each pj and the partial derivative wrt λ

Slide 37: Entropy Example

- Have L(p1,…,pn,λ) = Σ pj log2 pj + λ(Σ pj - 1)
- The partial derivative wrt any pj yields
    log2 pj + 1/ln(2) + λ = 0   (#)
- And wrt λ yields the constraint
    Σ pj - 1 = 0, or Σ pj = 1   (##)
- Equation (#) implies all pj are equal
- With equation (##), all pj = 1/n
- Conclusion?
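As a sanity check on the Lagrange-multiplier result, the sketch below (assuming numpy and scipy are available; neither appears in the original slides) maximizes the entropy numerically over the probability simplex. The optimizer lands at the uniform distribution pj = 1/n, as derived above.

```python
import numpy as np
from scipy.optimize import minimize

n = 5

def neg_entropy(p):
    # Negative Shannon entropy; minimizing this maximizes entropy
    return np.sum(p * np.log2(p))

# Constraint: the pj form a probability distribution (sum to 1, each in [0,1])
constraints = [{"type": "eq", "fun": lambda p: np.sum(p) - 1.0}]
bounds = [(1e-9, 1.0)] * n
p0 = np.array([0.5, 0.2, 0.15, 0.1, 0.05])   # a non-uniform feasible starting point

result = minimize(neg_entropy, p0, method="SLSQP", bounds=bounds, constraints=constraints)
print(np.round(result.x, 4))   # approximately [0.2, 0.2, 0.2, 0.2, 0.2], i.e., pj = 1/n
```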

Slide 38: Notation

- Let x = (x1, x2, …, xn) and λ = (λ1, λ2, …, λm)
- Again, we write the Lagrangian as
    L(x,λ) = f(x) + Σ λi (gi(x) - ci)
- Note: L is a function of n + m variables
- Can view the problem as: the constraints gi define a feasible region; maximize the objective function f over this feasible region

Slide 39: Lagrangian Duality

- For Lagrange multipliers…
- Primal problem: max min L(x,y,λ), where the max is over (x,y) and the min is over λ
- Dual problem: min max L(x,y,λ); as above, the max is over (x,y) and the min is over λ
- We claim it's easy to see that
    min max L(x,y,λ) ≥ max min L(x,y,λ)
- Why is this true? Next slide...

Slide 40: Dual Problem

- Recall J(x,y) = f(x,y) + I(x,y), where I(x,y) is 0 whenever g(x,y) = c, and -∞ otherwise
- And max J(x,y) is a solution
- Then L(x,y,λ) ≥ J(x,y)
- And max L(x,y,λ) ≥ max J(x,y) for all λ
- Therefore, min max L(x,y,λ) ≥ max J(x,y), that is,
    min max L(x,y,λ) ≥ max min L(x,y,λ)

Slide 41: Dual Problem

- So, we have shown that the dual problem provides an upper bound:
    min max L(x,y,λ) ≥ max min L(x,y,λ)
- That is, dual solution ≥ primal solution
- But it's even better than that: for the Lagrangian, equality holds
- Why equality? Because the Lagrangian is a convex function

Slide 42: Primal Problem

Maximize: f(x,y) = 16 - (x^2 + y^2)
Subject to: 2x - y = 4

Then L(x,y,λ) = 16 - (x^2 + y^2) + λ(2x - y - 4)

Compute the partial derivatives:
  dL/dx = -2x + 2λ = 0
  dL/dy = -2y - λ = 0
  dL/dλ = 2x - y - 4 = 0

Result: (x,y,λ) = (8/5, -4/5, 8/5), which yields the max of f(x,y) = 64/5.
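The same stationarity conditions can be checked symbolically. This is a small sketch using sympy (an added dependency, not part of the original slides) for the running example.

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)

f = 16 - (x**2 + y**2)               # objective
L = f + lam * (2*x - y - 4)          # Lagrangian with constraint 2x - y = 4

# Set all partial derivatives to zero and solve
solution = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)], [x, y, lam], dict=True)[0]
print(solution)                      # {x: 8/5, y: -4/5, lam: 8/5}
print(f.subs(solution))              # 64/5
```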

Slide 43: Dual Problem

Maximize: f(x,y) = 16 - (x^2 + y^2)
Subject to: 2x - y = 4

Then L(x,y,λ) = 16 - (x^2 + y^2) + λ(2x - y - 4)

Recall that the dual problem is min max L(x,y,λ), where the max is over (x,y) and the min is over λ.
How can we solve this?

Slide 44: Dual Problem

- Dual problem: min max L(x,y,λ)
- So, we can first take the max of L over (x,y)
- Then we are left with a function L only in λ
- To solve the problem, then find min L(λ)
- On the next slide, we illustrate this for
    L(x,y,λ) = 16 - (x^2 + y^2) + λ(2x - y - 4)
  the same example as considered above

Slide 45: Dual Problem

Given L(x,y,λ) = 16 - (x^2 + y^2) + λ(2x - y - 4), maximize over (x,y) by computing
  dL/dx = -2x + 2λ = 0
  dL/dy = -2y - λ = 0
which implies x = λ and y = -λ/2.

Substitute these into L to obtain
  L(λ) = (5/4)λ^2 - 4λ + 16

Slide 46: Dual Problem

Original problem:
  Maximize: f(x,y) = 16 - (x^2 + y^2)
  Subject to: 2x - y = 4

The solution can be found by minimizing
  L(λ) = (5/4)λ^2 - 4λ + 16

Then L'(λ) = (5/2)λ - 4 = 0, which gives λ = 8/5 and (x,y) = (8/5, -4/5).
Same solution as the primal problem!
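A quick numerical check of the dual (a minimal sketch with numpy, not from the slides): minimizing L(λ) over a grid gives λ ≈ 8/5 with value 64/5, matching the primal maximum.

```python
import numpy as np

def dual(lam):
    # L(lam) after maximizing over (x, y), i.e., with x = lam and y = -lam/2 substituted in
    return (5.0/4.0) * lam**2 - 4.0 * lam + 16.0

lams = np.linspace(-5, 5, 100001)
best = lams[np.argmin(dual(lams))]
print(best, dual(best))   # approximately 1.6 (= 8/5) and 12.8 (= 64/5), matching the primal
```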

Slide 47: Summary of Dual Problem

- Maximize L to find (x,y) in terms of λ
- Then rewrite L as a function of λ only
- Finally, minimize L(λ) to solve the problem
- But, why all of the fuss? The dual problem allows us to write the SVM problem in a much more user-friendly way
- In SVM, we'll consider the dual L(λ)

Slide 48: Lagrange Multipliers and SVM

- Lagrange multipliers are very cool indeed, but what does this have to do with SVM?
- We can view the (soft) margin computation as a constrained optimization problem
- In this form, the kernel trick becomes clear
- We can kill 2 birds with 1 stone: make the margin calculation clearer and make the kernel trick perfectly clear

Slide 49: Problem Setup

- Let X1, X2, …, Xn be the data points (vectors)
- Suppose each Xi = (xi, yi) is a point in the plane (in general, it could be of higher dimension)
- Let z1, z2, …, zn be the corresponding class labels, where each zi ∈ {-1, 1}
- Here zi = 1 if Xi is classified as the "red" type, and zi = -1 if classified as the "blue" type
- Note this is a binary classification

Slide 50: Geometric View

- Equation of the yellow line: w1 x + w2 y + b = 0
- Equation of the red line: w1 x + w2 y + b = 1
- Equation of the blue line: w1 x + w2 y + b = -1
- The margin m is the length of the green line
- [Figure: the three parallel lines in the (x,y) plane, with margin m]

Slide 51: Training

- Any red point X = (x,y) must satisfy w1 x + w2 y + b ≥ 1
- Any blue point X = (x,y) must satisfy w1 x + w2 y + b ≤ -1
- We want all of the inequalities to be true after training

Slide 52: Scoring

With the lines defined, given a new data point X = (x,y) to classify:
- "Red" provided that w1 x + w2 y + b > 0
- "Blue" provided that w1 x + w2 y + b < 0
This is the scoring phase.

Slide 53: How to Train?

- The real question is... how to find the equation of the yellow line?
- Given {Xi} and {zi}, where the Xi are points in the plane and the zi are the classes
- Finding the yellow line is the SVM training phase

Slide 54: Maximize the Margin

- The distance from the origin to the line Ax + By + C = 0 is |C| / sqrt(A^2 + B^2)
- Origin to the red dashed line: |1 - b| / sqrt(w1^2 + w2^2), where W = (w1, w2)
- Origin to the blue dashed line: |-1 - b| / sqrt(w1^2 + w2^2)
- The margin is m = 2 / sqrt(w1^2 + w2^2)
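A tiny numeric illustration of these distance and margin formulas, with hypothetical values w1 = 3, w2 = 4, b = -2 (numbers chosen only for the arithmetic, not taken from the slides):

```python
import numpy as np

# Hypothetical coefficients for the separating line w1*x + w2*y + b = 0
w = np.array([3.0, 4.0])
b = -2.0

# Distance from origin to the lines w.x + b = 1 and w.x + b = -1
d_red  = abs(1 - b) / np.linalg.norm(w)    # |1 - b| / sqrt(w1^2 + w2^2)
d_blue = abs(-1 - b) / np.linalg.norm(w)   # |-1 - b| / sqrt(w1^2 + w2^2)

# Margin between the two lines
m = 2.0 / np.linalg.norm(w)
print(d_red, d_blue, m)   # 0.6, 0.2, 0.4
```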

Slide 55: Training Phase

- Given {Xi} and {zi}, find the largest margin m that classifies all points correctly
- That is, find the red and blue lines in the picture
- Recall the red line is of the form w1 x + w2 y + b = 1, and the blue line is of the form w1 x + w2 y + b = -1
- And the max margin is m = 2 / sqrt(w1^2 + w2^2)

Slide 56: Training

Since zi ∈ {-1, 1}, correct classification occurs provided
  zi (w1 xi + w2 yi + b) ≥ 1 for all i

Training problem to solve:
  Maximize: m = 2 / sqrt(w1^2 + w2^2)
  Subject to the constraints: zi (w1 xi + w2 yi + b) ≥ 1 for i = 1, 2, …, n

Can we determine W = (w1, w2) and b?

Slide 57: Training

The problem on the previous slide is equivalent to the following:
  Minimize: F(W) = (w1^2 + w2^2) / 2
  Subject to the constraints: 1 - zi (w1 xi + w2 yi + b) ≤ 0 for all i

This should be starting to look familiar…

Slide 58: Lagrangian

Pretend the inequalities are equalities…
  L(w1, w2, b, λ) = (w1^2 + w2^2) / 2 + Σ λi (1 - zi (w1 xi + w2 yi + b))

Compute:
  dL/dw1 = w1 - Σ λi zi xi = 0
  dL/dw2 = w2 - Σ λi zi yi = 0
  dL/db = Σ λi zi = 0
  dL/dλi = 1 - zi (w1 xi + w2 yi + b) = 0

Slide 59: Lagrangian

- The derivatives yield the constraints and
    W = (w1 w2)^T = Σ λi zi Xi  and  Σ λi zi = 0
- Substituting these into L yields
    L(λ) = Σ λi - ½ ΣΣ λi λj zi zj (Xi ∙ Xj)
- Where "∙" is the dot product: Xi ∙ Xj = xi xj + yi yj
- Here, L is only a function of λ
- We still have the constraint Σ λi zi = 0
- Note: if we find the λi then we know W

Slide 60: New-and-Improved Problem

Max: L(λ) = Σ λi - ½ ΣΣ λi λj zi zj (Xi ∙ Xj)
Subject to: Σ λi zi = 0 and all λi ≥ 0

Why maximize L(λ)? Intuitively…
- The goal is to minimize F(W) = (w1^2 + w2^2) / 2, subject to the constraints in the L(λ) function
- Maximizing L(λ) finds the "best" parameters λ, and the "best" λ will solve this min problem
- Recall, this is the dual problem

Slide 61: Dual Version of Problem

Max: L(λ) = Σ λi - ½ ΣΣ λi λj zi zj (Xi ∙ Xj)
Subject to: Σ λi zi = 0 and all λi ≥ 0

- Again, this is the dual problem
- We can always solve it (if a solution exists), and it will find a global maximum
- It doesn't get any better than that!
- With an HMM (for example), there is no guarantee of a global maximum
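For a concrete feel for the dual, here is a hedged sketch that solves this maximization directly with scipy's SLSQP on a tiny hand-made, linearly separable dataset (the data values and the use of scipy are my additions, not from the slides). It then recovers W = Σ λi zi Xi and b from a support vector.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set: Xi = (xi, yi), labels zi in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
z = np.array([1, 1, 1, -1, -1, -1], dtype=float)
n = len(z)
K = X @ X.T                      # Gram matrix of dot products Xi . Xj

def neg_dual(lam):
    # Negative of L(lam) = sum(lam) - 1/2 sum_ij lam_i lam_j z_i z_j (Xi . Xj)
    return -(lam.sum() - 0.5 * (lam * z) @ K @ (lam * z))

constraints = [{"type": "eq", "fun": lambda lam: lam @ z}]   # sum lam_i z_i = 0
bounds = [(0.0, None)] * n                                   # lam_i >= 0 (hard margin)
res = minimize(neg_dual, np.zeros(n), method="SLSQP", bounds=bounds, constraints=constraints)
lam = res.x

W = (lam * z) @ X                 # W = sum lam_i z_i Xi
sv = np.argmax(lam)               # index of a support vector (lam_i > 0)
b = z[sv] - W @ X[sv]             # enforce z_i (W . Xi + b) = 1 at that support vector
print(np.round(lam, 3), W, b)
print(np.sign(X @ W + b))         # reproduces the labels z
```

In practice SVM libraries use specialized solvers (such as SMO, discussed later) rather than a general-purpose optimizer, but the small example makes the dual formulation tangible.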

Slide 62: All Together Now: Training

- Given data points X1, X2, …, Xn
- Label each Xi with zi ∈ {-1, 1}
- Solve the dual problem (previous slide); solving it yields the λi
- Once the λi are known, compute W = (w1, w2) and b
- Obtain the equation of the line: w1 x + w2 y + b
- What have we accomplished?

Slide 63: All Together Now: Scoring

- From training, we've found the λi
- This yields W = (w1, w2) and b in w1 x + w2 y + b
- Given a new data point X = (x,y), that is, an X not in the training set:
- Compute w1 x + w2 y + b
- If greater than 0, classify X as "red" (+1); otherwise, classify X as "blue" (-1)
- What happened, in terms of the picture?

Slide 64: Geometric Viewpoint

- Training? Find the equation of the yellow line, f(X)
- Score X = (x,y)? If f(X) > 0, then X is above the yellow line (classify as red); else X is below the line (classify as blue)

Slide 65: Scoring Revisited

- Use the yellow line for scoring… There is an alternative (better) way
- Have f(X) = w1 x + w2 y + b = W ∙ X + b
- Recall that W = Σ λi zi Xi
- So, f(X) = Σ λi zi (Xi ∙ X) + b
- Why is this better? No need to explicitly compute W
- Any better reasons why it's better?
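The identity behind this alternative scoring rule is easy to check numerically. The sketch below uses arbitrary (hypothetical) multipliers, labels, and points just to confirm that W ∙ X + b and Σ λi zi (Xi ∙ X) + b agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical multipliers, labels, and training points (just to check the identity)
lam = rng.uniform(0, 1, size=6)
z = rng.choice([-1.0, 1.0], size=6)
Xtrain = rng.normal(size=(6, 2))
b = 0.5
Xnew = np.array([1.0, -2.0])

W = (lam * z) @ Xtrain                                    # W = sum lam_i z_i Xi
direct = W @ Xnew + b                                     # f(X) = W . X + b
via_dots = np.sum(lam * z * (Xtrain @ Xnew)) + b          # f(X) = sum lam_i z_i (Xi . X) + b
print(np.isclose(direct, via_dots))                       # True
```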

Slide 66: Support Vectors

- When solving L(λ), we find that most λi = 0, specifically for the Xi for which
    zi (w1 xi + w2 yi + b) > 1
- The only constraints that can matter are those with
    zi (w1 xi + w2 yi + b) = 1
- The latter are the support vectors
- They are not known in advance; training determines the support vectors

Slide 67: Support Vectors

- A picture is worth 1000 words?
- Where are the support vectors?
- The other vectors (training points) don't matter. Why not?
- [Figure: the support vectors highlighted in the (x,y) plot]

Slide 68: Scoring Re-revisited

- Score X using f(X) = Σ λi zi (Xi ∙ X) + b
- Generally, most of the λi will be 0
- So, the sum is not really i = 1 to n; instead, the sum is i = 1 to s, where s is the number of support vectors
- Why does this matter? Typically, n is large and s is small, so scoring is fast
- This f(X) is useful for other reasons too

Slide 69: Training: Soft Margin

- Suppose we relax "linearly separable"
- Trade off errors for a bigger margin m: more errors, but gain a bigger margin
- Two types of errors are illustrated here…
- [Figure: two types of errors in the (x,y) plot, with margin m]

Slide 70: Errors

To account for errors, introduce "slack variables" εi ≥ 0 into the optimization.
- For a red point Xi = (xi, yi), the constraint is w1 xi + w2 yi + b ≥ 1 - εi
- For a blue point Xi = (xi, yi), the constraint is w1 xi + w2 yi + b ≤ -1 + εi

Minimize: (w1^2 + w2^2) / 2 + C Σ εi
Subject to the constraints above.

Slide 71: Dual Problem

Working through the details, the dual problem is…
  Max: L(λ) = Σ λi - ½ ΣΣ λi λj zi zj (Xi ∙ Xj)
  Subject to: Σ λi zi = 0 and C ≥ λi ≥ 0

- The only difference is the C ≥ λi condition
- We specify C when training
- Bottom line? Allowing for errors changes the constraints

Slide 72: Training and Scoring Re-re-revisited

Training:
  Maximize: L(λ) = Σ λi - ½ ΣΣ λi λj zi zj (Xi ∙ Xj)
  Subject to: Σ λi zi = 0 and C ≥ λi ≥ 0
  Where C is specified by the user

Scoring: Given X = (x,y)
  Compute f(X) = Σ λi zi (Xi ∙ X) + b, where the sum is over the support vectors
  If f(X) < 0, then X is "blue"; else it's "red"
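As an illustration with an off-the-shelf library (scikit-learn, which is not part of the original slides), the sketch below trains a linear soft-margin SVM and then scores a new point by hand using only the support vectors. SVC stores λi zi for the support vectors in dual_coef_, so the manual sum matches decision_function.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: Xi = (xi, yi), labels zi in {-1, +1}
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5], [1.5, 2.5],
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
z = np.array([1, 1, 1, 1, -1, -1, -1, -1])

clf = SVC(kernel="linear", C=1.0)   # C is the user-specified error/margin tradeoff
clf.fit(X, z)

# dual_coef_ holds lam_i * z_i for the support vectors only
sv = clf.support_vectors_
lam_z = clf.dual_coef_.ravel()
b = clf.intercept_[0]

Xnew = np.array([2.0, 3.0])
f_manual = np.sum(lam_z * (sv @ Xnew)) + b       # sum over support vectors of lam_i z_i (Xi . X) + b
f_sklearn = clf.decision_function([Xnew])[0]
print(f_manual, f_sklearn)                       # the two agree; f > 0 means class +1 ("red")
```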

Slide 73: Kernel Trick

- Finally, we can make sense of the kernel trick
- Recall X1, X2, …, Xn are the training vectors
- For training, the Xi only appear as Xi ∙ Xj
- When scoring X, the Xi only appear as Xi ∙ X
- We can transform the input data to a feature space, then compute the dot products in the feature space
- The effect is to replace "∙" with something defined in higher dimensions
- And since it is a dot product in the feature space, it is easy to compute (in terms of the input space)

Slide 74: Kernel Example

- For example, suppose we define ϕ(x,y) = (1, √2 x, √2 y, x^2, y^2, √2 xy)
- Then ϕ maps an element of 2-d space to 6-d space
- For Xi = (xi, yi) and Xj = (xj, yj), we have
    ϕ(Xi) ∙ ϕ(Xj) = (1 + xi xj + yi yj)^2 = (Xi ∙ Xj + 1)^2
- Define the kernel function K as K(Xi, Xj) = (Xi ∙ Xj + 1)^2
- Note: K is the composition of ϕ and "∙"
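The claimed identity ϕ(Xi) ∙ ϕ(Xj) = (Xi ∙ Xj + 1)^2 is easy to verify numerically for the feature map written above (a minimal numpy sketch, with points chosen arbitrarily):

```python
import numpy as np

def phi(p):
    # Degree-2 polynomial feature map from 2-d input space to 6-d feature space
    x, y = p
    return np.array([1.0, np.sqrt(2)*x, np.sqrt(2)*y, x*x, y*y, np.sqrt(2)*x*y])

Xi = np.array([1.0, 2.0])
Xj = np.array([3.0, -1.0])

lhs = phi(Xi) @ phi(Xj)          # dot product computed in the 6-d feature space
rhs = (Xi @ Xj + 1.0)**2         # kernel K(Xi, Xj) computed in the 2-d input space
print(lhs, rhs)                  # both equal 4.0 here
```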

Slide 75: Kernel Example

- In the input space, Xi and Xj are 2-d
- Map Xi and Xj to the 6-d feature space; these are ϕ(Xi) and ϕ(Xj)
- Perform the dot product in the feature space, that is, compute ϕ(Xi) ∙ ϕ(Xj)
- The math works whether we use the 2-d input space (Xi, Xj) or the 6-d feature space (ϕ(Xi), ϕ(Xj))
- The dot product is in the feature space!

Slide 76: The Big Picture

- Training data lives in the input space, where the data may not be linearly separable
- Map the input space to a higher dimensional feature space using the function ϕ
- Do training & scoring in the feature space, where the data may be linearly separable
- But we don't want to suffer a performance penalty due to the higher dimension
- Choose the kernel function wisely

Training & Scoring with Kernel

Simply replace Xi

X

j

with

K(Xi

,Xj

)

Training

Max:

L(λ) =

Σ

λ

i

– ½

ΣΣ

λ

i

λ

j

z

i

z

j

K(X

i

,

X

j

)

Subject to:

Σ

λ

i

z

i

= 0

and

C ≥

λ

i

≥ 0

Where C specified by userScoring: Given X=(

x,y) Compute f(X)=

Σ λi

zi K(Xi

,X)+b If f(X

) < 0, then X is “blue”; else “red”

A Reassuring Introduction to SVM77Slide78

Kernel Trick

No need to explicitly map input data to feature spaceWe don’t even need to know

ϕ Only need resulting kernel function K

Bottom line

Obtain the benefit of working in higher dimension space (linear separable)……with no significant performance penalty

That’s a really awesome trick !!!

A Reassuring Introduction to SVM

78Slide79

Slide 79: Popular Kernels

- Linear kernel: K(Xi, Xj) = Xi ∙ Xj
- Polynomial learning machine: K(Xi, Xj) = (Xi ∙ Xj + 1)^p
- Gaussian radial-basis function (RBF): K(Xi, Xj) = exp(-(Xi - Xj) ∙ (Xi - Xj) / (2σ^2))
- Two-layer perceptron: K(Xi, Xj) = tanh(β0 Xi ∙ Xj + β1)
- Many other possibilities
- Selecting the "right" kernel is the real trick
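A hedged comparison of some of these kernels using scikit-learn (an added dependency; sklearn's kernel parametrizations differ slightly from the formulas above) on data that is not linearly separable in the input space:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the 2-d input space
X, labels = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)
z = np.where(labels == 1, 1, -1)

kernels = {
    "linear":  SVC(kernel="linear", C=1.0),
    "poly":    SVC(kernel="poly", degree=2, coef0=1.0, C=1.0),  # sklearn: (gamma Xi.Xj + coef0)^degree
    "rbf":     SVC(kernel="rbf", C=1.0),                        # sklearn: exp(-gamma ||Xi - Xj||^2)
    "sigmoid": SVC(kernel="sigmoid", C=1.0),
}
for name, clf in kernels.items():
    clf.fit(X, z)
    print(name, round(clf.score(X, z), 3))
# The linear kernel cannot separate the circles; the degree-2 polynomial and RBF kernels can
```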

Slide 80: A Brief History of SVM

- SVM was invented in 1962, but not actually useful until the 1990s
- Nonlinearity was added in 1992, that is, the kernel trick
- Soft margins were developed in 1993, although not published until 1995
- SVM is a relative newcomer, and only useful if we can efficiently train...

Slide 81: Algorithm to Train SVM?

- To train a (linear) SVM we must solve the quadratic programming problem given by the dual: maximize L(λ) subject to its constraints
- Once the λi are known, we classify samples using f(X) = Σ λi zi K(Xi, X) + b

Slide 82: How to Solve QP Problem?

- Many general techniques are available to solve QP problems: interior point, conjugate gradient, …
- But the SVM problem is very special: the solution is "sparse", we have inequality constraints, and the problem is LARGE (more on this later…)
- So, SVM is not a standard QP case

Slide 83: Peculiarities of SVM Training

- Compared to "standard" QP problems, SVM does not require great precision (true of most ML algorithms)
- The solution to the SVM problem is sparse; that is, most of the λi will be 0, but we don't know which will be 0 in advance
- The number of training samples and the size of each training vector can both be HUGE

Slide 84: QP Solvers

- General QP solvers precompute Kij = K(Xi, Xj) for all i and j, and store the entire K matrix in memory
- This makes the algorithms efficient, but it is not suitable for SVM training
- What to do? Recall that the SVM solution is "sparse"
- Can we take advantage of the sparseness?

Slide 85: Early SVM Solvers

- "…all early SVM results were obtained using ad hoc algorithms"; here "early" means until the mid-to-late 1990s
- An idea: combine "iterative chunking" with "simple direction search"
- Chunking means that we only deal with part of the training data at a time
- Then optimize by direction search, based on gradient computations

Slide 86: Iterative Chunking

- We can't deal with the whole problem at once, so a possible plan of attack is…
- Solve the SVM on a "working set", i.e., a subset (or chunk) of the training data
- Then consider the data points inside this margin; these are candidate support vectors!
- So, use them as the working set, and repeat
- Sounds very plausible!

Slide 87: Direction Search

- Choose a "good" direction, kind of like Newton's method, or conjugate gradient, or gradient descent…
- Improve on the current solution by moving in a "good" direction
- We won't discuss the details here
- The idea is fairly straightforward, but the details can be somewhat complex

Slide 88: SMO

- Current best SVM solvers use SMO: sequential minimal optimization, first proposed in 1999
- Divide into minimal QP sub-problems, with a minimal "working set" (just 2 variables)
- This makes the direction computation easy
- The convergence/termination properties of SMO are good and well understood

Slide 89: SMO Working Set

- In SMO, the working set is always 2 variables
- How to select (i,j) to speed convergence?
- Maximize the gain in the objective function? That requires an exhaustive search, which is too inefficient
- Instead, use heuristics so that cached kernel values are used and the (dual) objective function increases
- And don't worry too much about precision

Slide 90: Simplified SMO

- SSMO is like SMO, but simplified!
- We'll ignore the heuristics used to select the working set pair (i,j) at each step, and assume the working set is given at each step
- The homework considers simplified scenarios for generating the working set
- Otherwise, the SSMO algorithm we discuss is (almost) the same as SMO
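The sketch below is a compact version of simplified SMO in the spirit of the Ng write-up referenced at the end of these slides: instead of SMO's working-set heuristics, the second index j is chosen at random. It is illustrative only, not the book's exact SSMO.

```python
import numpy as np

def simplified_smo(X, z, C=1.0, tol=1e-3, max_passes=20, kernel=lambda u, v: u @ v):
    """Simplified SMO: random second index instead of SMO's working-set heuristics.
    Returns the multipliers lam and the threshold b."""
    n = len(z)
    lam, b = np.zeros(n), 0.0
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    f = lambda i: np.sum(lam * z * K[:, i]) + b   # current decision value for training point i
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = f(i) - z[i]
            if (z[i] * E_i < -tol and lam[i] < C) or (z[i] * E_i > tol and lam[i] > 0):
                j = np.random.choice([k for k in range(n) if k != i])   # the "given" pair (i, j)
                E_j = f(j) - z[j]
                li, lj = lam[i], lam[j]
                if z[i] != z[j]:
                    L, H = max(0.0, lj - li), min(C, C + lj - li)
                else:
                    L, H = max(0.0, li + lj - C), min(C, li + lj)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                new_lj = np.clip(lj - z[j] * (E_i - E_j) / eta, L, H)
                if abs(new_lj - lj) < 1e-5:
                    continue
                lam[j] = new_lj
                lam[i] = li + z[i] * z[j] * (lj - new_lj)
                # Choose b so the KKT conditions hold for the updated pair
                b1 = b - E_i - z[i] * (lam[i] - li) * K[i, i] - z[j] * (lam[j] - lj) * K[i, j]
                b2 = b - E_j - z[i] * (lam[i] - li) * K[i, j] - z[j] * (lam[j] - lj) * K[j, j]
                if 0 < lam[i] < C:
                    b = b1
                elif 0 < lam[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return lam, b

# Tiny toy run
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
z = np.array([1.0, 1.0, -1.0, -1.0])
lam, b = simplified_smo(X, z, C=10.0)
scores = np.array([np.sum(lam * z * (X @ X[i])) + b for i in range(len(z))])
print(np.sign(scores))   # should reproduce z
```

The working-set choice (here random) is exactly the part that full SMO replaces with heuristics to speed up convergence.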

Slide 91: SSMO

- Solve: find the λi in the Lagrangian (dual) problem
- Given n training samples (Xi, zi)
- Let… [the slide's definitions are not captured in this transcript]

Slide 92: SSMO Details

- The book has the details of the SSMO algorithm
- The key is the so-called "KKT conditions": necessary and sufficient conditions for the SVM (QP) problem to converge
- We enforce the KKT conditions by the choice of b in the SMO algorithm (and SSMO)

Slide 93: SMO Bottom Line

- SMO is fast and efficient for the SVM training problem
- It can solve problems with a very LARGE number of training vectors, and the individual vectors can be LARGE too
- Choosing the "working set" at each step is the tricky part, that is, how to choose the indices (i,j) in a "good" way
- Heuristics to choose (i,j) speed up convergence

SVM +’s and –’s

Strengths

In training, obtain a global maximum, not just a local maximumCan tradeoff margin and training errors

Kernel trick is

totally awesome

WeaknessesChoosing kernel is more art than scienceSuccess depends heavily on kernel (and parameter) choice(s)

A Reassuring Introduction to SVM

94Slide95

Slide 95: References

- R. Berwick, An idiot's guide to support vector machines
- E. Kim, Everything you wanted to know about the kernel trick (but were too afraid to ask)
- M. Law, A simple introduction to support vector machines
- W.S. Noble, What is a support vector machine?, Nature Biotechnology, 24(12):1565-1567, 2006

Slide 96: References: Lagrange Multipliers

- D. Klein, Lagrange multipliers without permanent scarring
- Wikipedia, Lagrange multiplier

Slide 97: References: SMO Algorithm

- A. Ng, Simplified SMO algorithm, http://cs229.stanford.edu/materials/smo.pdf
- L. Bottou and C.-J. Lin, Support vector machine solvers, https://www.csie.ntu.edu.tw/~cjlin/papers/bottou_lin.pdf