Support vector machines When the data is linearly separable, which of the many possible - PowerPoint Presentation

Uploaded On 2018-10-13


Presentation Transcript

Slide1

Support vector machines

When the data is linearly separable, which of the many possible solutions should we prefer?

SVM criterion: maximize the margin, or distance between the hyperplane and the closest training example

Slide2

Support vector machines

[Figure labels: Margin, Support vectors]

Slide3

Finding the maximum margin hyperplane

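Define the margin as the distance between the hyperplane $w^T x + b = 0$ and the closest point

Distance between point $x$ and hyperplane is given by $\frac{|w^T x + b|}{\|w\|}$

Assuming the data is linearly separable (with labels $y_i \in \{-1, +1\}$), we can fix the scale of $w$ and $b$ so that $y_i (w^T x_i + b) = 1$ for the closest points and $y_i (w^T x_i + b) \ge 1$ for all other points

Then the margin is $\frac{1}{\|w\|}$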
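A minimal sketch of the definitions above, assuming the usual point-to-hyperplane distance $|w^T x + b| / \|w\|$ and a rescaling of $(w, b)$ so that the closest point has $|w^T x + b| = 1$. The function names and the toy points are made up for illustration.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    norm = math.sqrt(sum(wj * wj for wj in w))
    return abs(score) / norm

def canonical_form(w, b, points):
    """Rescale (w, b) so that the closest point has |w.x + b| = 1.

    Assumes no point lies exactly on the hyperplane.
    """
    closest = min(abs(sum(wj * xj for wj, xj in zip(w, x)) + b) for x in points)
    return [wj / closest for wj in w], b / closest

# Toy data (made up); the hyperplane is x1 + x2 - 3 = 0, written with an arbitrary scale.
points = [(0.0, 1.0), (1.0, 0.0), (3.0, 2.0), (2.0, 4.0)]
w, b = [2.0, 2.0], -6.0

w_c, b_c = canonical_form(w, b, points)
# After rescaling, the margin is 1 / ||w||.
margin = 1.0 / math.sqrt(sum(wj * wj for wj in w_c))
print(w_c, b_c, margin)
```

Rescaling does not move the hyperplane, so the value of `margin` matches the smallest distance computed directly from the unscaled `w` and `b`.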

Slide4

Finding the maximum margin hyperplane

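Find separating hyperplane that maximizes the distance to the closest training example

[Figure labels: Margin, Support vectors]

Positive examples ($y_i = 1$): $w^T x_i + b \ge 1$

Negative examples ($y_i = -1$): $w^T x_i + b \le -1$

For support vectors, $w^T x_i + b = \pm 1$

The margin is $\frac{1}{\|w\|}$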

Slide5

Finding the maximum margin hyperplane

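Maximize margin $\frac{1}{\|w\|}$ while correctly classifying all training data: $y_i (w^T x_i + b) \ge 1$

positive ($y_i = 1$): $w^T x_i + b \ge 1$

negative ($y_i = -1$): $w^T x_i + b \le -1$

Equivalent problem: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i (w^T x_i + b) \ge 1$ for all $i$

This is a quadratic objective with linear constraints: a convex optimization problem whose global optimum can be found using well-studied methods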
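A short worked step showing why the two problems share the same solution, assuming labels $y_i \in \{-1, +1\}$ and a margin of $1/\|w\|$ under the scaling where the closest points satisfy $y_i (w^T x_i + b) = 1$:

```latex
\begin{align*}
\max_{w,b} \; \frac{1}{\|w\|}
  \;\Longleftrightarrow\; \min_{w,b} \; \|w\|
  \;\Longleftrightarrow\; \min_{w,b} \; \tfrac{1}{2}\|w\|^2
  \qquad \text{subject to } y_i (w^T x_i + b) \ge 1 \text{ for all } i .
\end{align*}
```

The first step holds because $1/t$ is decreasing for $t > 0$, the second because $t^2$ is increasing for $t \ge 0$. The constraints are unchanged throughout, and the factor $\tfrac{1}{2}$ only simplifies the gradient.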

Slide6

“Soft margin” formulation

What about non-separable data?

And even for separable data, we may prefer a larger margin with a few constraints violated

Slide8

“Soft margin” formulation

Penalize margin violations using hinge loss

[Figure: hinge loss, with the incorrectly classified and correctly classified regions labeled]

Recall hinge loss used by the perceptron update algorithm!

Slide10

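“Soft margin” formulation

Penalize margin violations using hinge loss:

$\min_{w,b} \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^n \max[0, 1 - y_i (w^T x_i + b)]$

The first term maximizes the margin (a.k.a. regularization)

The second term minimizes the misclassification loss

Slide11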

SGD optimization (omitting bias)

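$L(w) = \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^n \max[0, 1 - y_i w^T x_i] = \sum_{i=1}^n \left[ \frac{\lambda}{2n}\|w\|^2 + \max[0, 1 - y_i w^T x_i] \right]$

Per-example (sub)gradient: $\nabla l(w, x_i, y_i) = \frac{\lambda}{n} w - I[y_i w^T x_i < 1] \, y_i x_i$

Recall: $\frac{\partial}{\partial a} \max(0, a) = I[a > 0]$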
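A minimal sketch of the quantities involved, assuming the bias-free soft-margin objective $\frac{\lambda}{2}\|w\|^2 + \sum_i \max[0, 1 - y_i w^T x_i]$ and its per-example subgradient. The function names, the toy data, and the value of `lam` are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def hinge_loss(w, x, y):
    """max(0, 1 - y * w.x): zero only when the example is beyond the margin."""
    return max(0.0, 1.0 - y * dot(w, x))

def objective(w, data, lam):
    """Regularizer plus the sum of hinge losses (bias omitted)."""
    return 0.5 * lam * dot(w, w) + sum(hinge_loss(w, x, y) for x, y in data)

def per_example_subgradient(w, x, y, lam, n):
    """Subgradient of lam/(2n) * ||w||^2 + max(0, 1 - y * w.x) with respect to w."""
    violated = y * dot(w, x) < 1.0
    return [lam / n * wj - (y * xj if violated else 0.0) for wj, xj in zip(w, x)]

# Toy data (made up): pairs of (feature vector, label in {-1, +1}).
data = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((-1.0, 0.0), -1)]
w = [1.0, 0.0]
print(objective(w, data, lam=0.1))
print(per_example_subgradient(w, (0.5, 0.0), 1, lam=0.1, n=len(data)))
```

The second example is classified correctly but sits inside the margin, so it still contributes a nonzero loss and a nonzero data term in the subgradient.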

Slide12

SGD optimization (omitting bias)

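SGD update, with learning rate $\eta$ and regularization weight $\lambda$ over $n$ training examples:

If $y_i w^T x_i < 1$: $w \leftarrow w + \eta \left( y_i x_i - \frac{\lambda}{n} w \right)$

Otherwise: $w \leftarrow w - \eta \frac{\lambda}{n} w$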
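A minimal sketch of an SGD training loop for the bias-free linear SVM, assuming the update $w \leftarrow w + \eta (y_i x_i - \frac{\lambda}{n} w)$ when $y_i w^T x_i < 1$ and $w \leftarrow w - \eta \frac{\lambda}{n} w$ otherwise. The fixed step size is a simplification (Pegasos itself uses a decreasing step size), and the function name, defaults, and toy data are made up for illustration.

```python
import random

def train_svm_sgd(data, lam=0.01, eta=0.1, epochs=200, seed=0):
    """Bias-free linear SVM trained by SGD on the hinge-loss objective."""
    rng = random.Random(seed)
    n = len(data)
    w = [0.0] * len(data[0][0])
    shrink = 1.0 - eta * lam / n  # effect of the regularization term on w
    for _ in range(epochs):
        for i in rng.sample(range(n), n):  # visit examples in shuffled order
            x, y = data[i]
            if y * sum(wj * xj for wj, xj in zip(w, x)) < 1.0:
                # Margin violated: shrink w and step toward y * x.
                w = [shrink * wj + eta * y * xj for wj, xj in zip(w, x)]
            else:
                # Margin satisfied: only the regularizer acts.
                w = [shrink * wj for wj in w]
    return w

# Toy data (made up), separable by a hyperplane through the origin.
data = [((2.0, 1.0), 1), ((1.0, 2.0), 1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
w = train_svm_sgd(data)
print(w)
```

Because the bias is omitted, this sketch can only learn hyperplanes through the origin; a common workaround is to append a constant feature to every input.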

S. Shalev-Shwartz et al., Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, Mathematical Programming, 2011

Slide13

SVM vs. perceptron

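SVM loss: $l(w, x_i, y_i) = \frac{\lambda}{2n}\|w\|^2 + \max[0, 1 - y_i w^T x_i]$

SVM update: if $y_i w^T x_i < 1$: $w \leftarrow w + \eta \left( y_i x_i - \frac{\lambda}{n} w \right)$

Otherwise: $w \leftarrow w - \eta \frac{\lambda}{n} w$

Perceptron loss: $l(w, x_i, y_i) = \max[0, -y_i w^T x_i]$

Perceptron update: if $y_i w^T x_i < 0$: $w \leftarrow w + \eta y_i x_i$

Otherwise: do nothing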

What are the differences?
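One way to explore the question is to apply both updates to the same example. The sketch below assumes the SVM step triggers when $y_i w^T x_i < 1$ and always shrinks $w$ by the factor $1 - \eta \lambda / n$, while the perceptron step triggers only when $y_i w^T x_i < 0$ and has no shrinkage. The function names and numbers are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def svm_step(w, x, y, eta, lam, n):
    """One SGD step on lam/(2n) * ||w||^2 + max(0, 1 - y * w.x)."""
    shrink = 1.0 - eta * lam / n
    if y * dot(w, x) < 1.0:
        return [shrink * wj + eta * y * xj for wj, xj in zip(w, x)]
    return [shrink * wj for wj in w]

def perceptron_step(w, x, y, eta):
    """One SGD step on max(0, -y * w.x)."""
    if y * dot(w, x) < 0.0:
        return [wj + eta * y * xj for wj, xj in zip(w, x)]
    return list(w)

# An example that is classified correctly but lies inside the margin: y * w.x = 0.5.
w, x, y = [0.5, 0.0], (1.0, 0.0), 1
print(perceptron_step(w, x, y, eta=0.1))            # weights unchanged
print(svm_step(w, x, y, eta=0.1, lam=0.1, n=10))    # weights still move
```

In this sketch the perceptron stops updating as soon as an example is on the correct side, whereas the SVM keeps pushing until the example clears the margin and keeps regularizing $w$ on every step.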

Slide14

Dual SVM formulation

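SVM objective: $\min_{w,b} \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^n \max[0, 1 - y_i (w^T x_i + b)]$

Directly solving for $w$ using SGD is called the primal approach

Instead, SVM optimization can be formulated to learn a classifier in the form $f(x) = \sum_i \alpha_i y_i x_i^T x + b$ by solving a dual optimization problem over $\alpha_i$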

Slide15

Dual SVM formulation

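The dual problem (given without derivation):

$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j$ subject to $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le \frac{1}{\lambda}$ for all $i$, where $\lambda$ is the regularization weight

Important properties of the dual:

At the optimum, $\alpha_i$ are nonzero only for support vectors

Feature vectors appear only inside dot products $x_i^T x_j$; this enables nonlinear SVMs via kernel functions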
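A minimal sketch of how the dual coefficients relate to the primal weights, assuming the standard relation $w = \sum_i \alpha_i y_i x_i$ and the classifier form $\sum_i \alpha_i y_i x_i^T x + b$. The three-point dataset and its coefficients were worked out by hand for the hard-margin case; the function names are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal_weights(alphas, xs, ys):
    """Recover w = sum_i alpha_i * y_i * x_i from the dual coefficients."""
    dim = len(xs[0])
    return [sum(a * y * x[j] for a, y, x in zip(alphas, ys, xs)) for j in range(dim)]

def dual_score(alphas, xs, ys, b, x_new):
    """Score a new point using dot products with support vectors only."""
    return sum(a * y * dot(x, x_new)
               for a, y, x in zip(alphas, ys, xs) if a != 0.0) + b

# Toy problem: the first two points are support vectors, the third is not.
xs = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
ys = [1, -1, 1]
alphas = [0.25, 0.25, 0.0]  # satisfies sum_i alpha_i * y_i = 0
b = 0.0

w = primal_weights(alphas, xs, ys)
x_new = (2.0, 0.0)
print(w, dot(w, x_new) + b, dual_score(alphas, xs, ys, b, x_new))
```

Both scores agree, and the third training point never enters the dual computation because its coefficient is zero.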

Slide16

Nonlinear SVMs

General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable

[Figure: lifting transformation $x \rightarrow \varphi(x)$]

Slide credit: Andrew Moore

Slide17

Nonlinear SVMs

The kernel trick: instead of explicitly computing the lifting transformation $\varphi(x)$, define a kernel function $K(x, x') = \varphi(x)^T \varphi(x')$

To be valid, the kernel function must satisfy Mercer’s condition

This gives a nonlinear decision boundary in the original feature space: $\sum_i \alpha_i y_i K(x_i, x) + b = 0$
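A minimal sketch of the kernel trick on scalar inputs, assuming a hypothetical lifting $\varphi(x) = (x, x^2)$ whose kernel is $K(x, x') = x x' + x^2 x'^2$. The four training points and their coefficients were worked out by hand for this toy set; all names are made up for illustration.

```python
def phi(x):
    """Explicit lifting of a scalar into 2D (shown only for comparison)."""
    return (x, x * x)

def kernel(x, z):
    """Same value as the dot product of phi(x) and phi(z), without building phi."""
    return x * z + (x * x) * (z * z)

def decision(alphas, ys, xs, b, x_new, k):
    """Kernelized score: sum_i alpha_i * y_i * k(x_i, x_new) + b."""
    return sum(a * y * k(xi, x_new) for a, y, xi in zip(alphas, ys, xs)) + b

# 1D data that no single threshold separates: outer points positive, inner negative.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [1, -1, -1, 1]
alphas = [1.0 / 9.0] * 4
b = -5.0 / 3.0

for x_new in (0.0, 1.0, 3.0):
    print(x_new, decision(alphas, ys, xs, b, x_new, kernel))
```

The score is linear in the lifted space but quadratic in $x$, so its sign separates the inner points from the outer ones even though the classifier only ever evaluates `kernel`.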

Slide18

Example

Non-separable data in 1D:

Apply mapping $\varphi(x) = (x, x^2)$

Slide19

Kernel example 1: Polynomial

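Polynomial kernel with degree $d$ and constant $c$: $K(x, x') = (x^T x' + c)^d$

What this looks like for $d = 2$, with $x = (u, v)$ and $x' = (u', v')$:

$K(x, x') = (u u' + v v' + c)^2 = u^2 u'^2 + v^2 v'^2 + 2 u u' v v' + 2 c u u' + 2 c v v' + c^2$, so $\varphi(x) = (u^2, v^2, \sqrt{2} u v, \sqrt{2c} u, \sqrt{2c} v, c)$

Thus, the explicit feature transformation consists of all polynomial combinations of individual dimensions of degree up to $d$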
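A quick numerical check of the degree-2 case, assuming the kernel $K(x, x') = (x^T x' + c)^2$ on 2D inputs and the explicit map $\varphi(x) = (u^2, v^2, \sqrt{2} u v, \sqrt{2c} u, \sqrt{2c} v, c)$. The function names and test points are made up for illustration.

```python
import math

def poly_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel (x.z + c)^d."""
    return (sum(a * b for a, b in zip(x, z)) + c) ** d

def phi_degree2(x, c=1.0):
    """Explicit 6-dimensional feature map for d = 2 on 2D inputs (requires c >= 0)."""
    u, v = x
    return (u * u, v * v, math.sqrt(2.0) * u * v,
            math.sqrt(2.0 * c) * u, math.sqrt(2.0 * c) * v, c)

x, z = (1.0, 2.0), (3.0, -1.0)
lifted = sum(a * b for a, b in zip(phi_degree2(x), phi_degree2(z)))
print(poly_kernel(x, z), lifted)
```

The kernel needs one 2D dot product, while the explicit route builds two 6D vectors first; the gap widens quickly as the input dimension and the degree grow.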

Slide20

Kernel example 1: Polynomial

Slide21

Kernel example 2: Gaussian

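Gaussian kernel with bandwidth $\sigma$: $K(x, x') = \exp\left( -\frac{1}{\sigma^2} \|x - x'\|^2 \right)$

Fun fact: the corresponding mapping $\varphi(x)$ is infinite-dimensional!

[Figure: $K(x, x')$ plotted against $\|x - x'\|$]

Slide22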

Kernel example 2: Gaussian

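Gaussian kernel with bandwidth $\sigma$: $K(x, x') = \exp\left( -\frac{1}{\sigma^2} \|x - x'\|^2 \right)$

The predictor $f(x) = \sum_i \alpha_i y_i \exp\left( -\frac{1}{\sigma^2} \|x - x_i\|^2 \right) + b$ is a sum of “bumps” centered on support vectors

[Figure label: SV’s]

It’s also called a Radial Basis Function (RBF) kernel

Slide23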

Kernel example 2: Gaussian

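Gaussian kernel with bandwidth $\sigma$: $K(x, x') = \exp\left( -\frac{1}{\sigma^2} \|x - x'\|^2 \right)$

The predictor $f(x) = \sum_i \alpha_i y_i \exp\left( -\frac{1}{\sigma^2} \|x - x_i\|^2 \right) + b$ is a sum of “bumps” centered on support vectors

How does the value of $\sigma$ affect the behavior of the predictor?

What if $\sigma$ is close to zero?

What if $\sigma$ is very large?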
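A small experiment for exploring these questions, assuming the convention $K(x, x') = \exp(-\|x - x'\|^2 / \sigma^2)$ (other texts put $2\sigma^2$ in the denominator). The two support vectors, their coefficients, and the function names are made up for illustration and do not come from a trained model.

```python
import math

def gaussian_kernel(x, z, sigma):
    """exp(-||x - z||^2 / sigma^2), under the convention assumed above."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / sigma ** 2)

def predictor(alphas, ys, svs, b, x, sigma):
    """Sum of signed Gaussian bumps centered on the support vectors."""
    return sum(a * y * gaussian_kernel(sv, x, sigma)
               for a, y, sv in zip(alphas, ys, svs)) + b

# One positive and one negative support vector on a line (made-up values).
svs = [(0.0,), (2.0,)]
ys = [1, -1]
alphas = [1.0, 1.0]
b = 0.0

for sigma in (0.1, 1.0, 100.0):
    print(sigma, predictor(alphas, ys, svs, b, (0.5,), sigma))
```

With a tiny $\sigma$ the bumps are so narrow that a point only slightly away from a support vector scores close to $b$. With a huge $\sigma$ the bumps are nearly flat, so the two opposite-signed terms almost cancel everywhere.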

Slide24

SVM: Pros and cons

Pros

Margin maximization and kernel trick are elegant, amenable to convex optimization and theoretical analysis

SVM loss gives very good accuracy in practice

Linear SVMs can scale to large datasets

Kernel SVMs are flexible, can be used with problem-specific kernels

Perfect “off-the-shelf” classifier, many packages are available

Cons

Kernel SVM training does not scale to large datasets: memory cost is quadratic and computation cost even worse

“Shallow” method: predictor is a “flat” combination of kernel functions of support vectors and test example, no explicit feature representations are learned