

Presentation Transcript

Slide 1

Support Vector Machines

Elegant combination of statistical learning theory and machine learning – Vapnik
Good empirical results
Non-trivial implementation
Can be slow and memory intensive
Binary classifier
Was the big wave before graphical models and then deep learning, important part of your knowledge base

Slide 2

SVM Overview

Non-linear mapping from input space into a higher dimensional feature space
Non-adaptive – user defined by the kernel function
Linear decision surface (hyper-plane) sufficient in the high dimensional feature space
Note that this is the same as we do with standard MLP/BP

Avoid complexity of high dimensional feature space with kernel functions which allow computations to take place in the input space, while giving the power of being in the feature space – “kernel trick”

Get improved generalization by placing hyper-plane at the maximum margin

Slide 3

Slide 4

SVM Comparisons

Note that MLP/deep nets follow a similar strategy: a non-linear map of the input features into a new feature space which is now linearly separable. But the MLP learns the non-linear mapping.
In order to have a natural way to compare and gain intuition on SVMs, we will first do a brief review of models which do not learn the initial non-linear feature mapping:

Quadric/Higher Order Machines

Radial Basis Function Networks

Slide 5

Maximum Margin and Support Vectors

Slide 6

Standard (Primal) Perceptron Algorithm

Assume weight vector starts at 0 and learning rate is 1

Assume R (type of adaptive LR) is also 1 for this discussion

Target minus output not needed since targets/outputs are binary

Learning is just adding (or subtracting based on target) the current training pattern (multiplied by the learning rate) to the current weight vector
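A minimal sketch of this update rule (not from the slides; the variable names and the ±1 target encoding are assumptions, and a bias could be handled by appending a constant 1 feature):

```python
import numpy as np

def primal_perceptron(X, y, epochs=10, lr=1.0):
    """Minimal primal perceptron for targets y in {-1, +1}; weight vector starts at 0."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            # Misclassified (or on the boundary): add/subtract the pattern itself
            if y[i] * np.dot(w, X[i]) <= 0:
                w += lr * y[i] * X[i]
    return w
```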

Slide 7

Dual and Primal Equivalence

Note that the final weight vector is a linear combination of the training patterns

The basic decision function is, in primal form, f(x) = sign(w · x + b) or, equivalently, in dual form, f(x) = sign(Σj αj yj (xj · x) + b)

How do we obtain the coefficients αi?

Slide 8

Dual Perceptron Training Algorithm

Assume the initial weight vector is 0 (equivalently, all αi start at 0)
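A minimal sketch of dual perceptron training (not taken from the slides; the Gram-matrix precomputation and variable names are assumptions):

```python
import numpy as np

def dual_perceptron(X, y, epochs=10):
    """Dual perceptron for targets y in {-1, +1}; one alpha per training pattern."""
    n = X.shape[0]
    alpha = np.zeros(n)
    gram = X @ X.T  # Gram matrix of all pairwise dot products, computed once
    for _ in range(epochs):
        for i in range(n):
            # Prediction uses only dot products between training patterns
            f = np.sum(alpha * y * gram[:, i])
            if y[i] * f <= 0:      # misclassified: increment this pattern's alpha
                alpha[i] += 1.0
    return alpha
```

Patterns that end up with alpha[i] > 0 correspond to the support vectors discussed on the next slide.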

Slide 9

Dual vs. Primal Form

Gram Matrix: all (xi · xj) pairs – done once and stored (can be large)

αi: one for each pattern in the training set. Incremented each time the pattern is misclassified, which would have led to a weight change in primal form

Magnitude of αi is an indicator of the effect of the pattern on the weights (embedding strength)

Note that patterns on borders have large αi while easy patterns never affect the weights

Could have trained with just the subset of patterns with αi > 0 (support vectors) and ignored the others

Can train in dual. How about execution? Either way (dual could be efficient if support vectors are few)

What if the transformed feature space is still not linearly separable? The αi would keep growing. Could do early stopping or bound the αi with some maximum C, thus allowing and bounding outliers.

Slide 10

Feature Space and Kernel Functions

Since most problems require a non-linear decision surface, we do a static non-linear map Φ(x) = (Φ1(x), Φ2(x), …, ΦN(x)) from input space to feature space

Feature space can be of very high (even infinite) dimensionality

By choosing a proper kernel function/feature space, the high dimensionality can be avoided in computation but effectively used for the decision surface to solve complex problems - "Kernel Trick"

A kernel is appropriate if the matrix of all K(xi, xj) is positive semi-definite (has non-negative eigenvalues). Even when this is not satisfied, many kernels still work in practice (sigmoid).
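As an illustration of this condition (not part of the slides), one could check the eigenvalues of a kernel's Gram matrix numerically; the kernel used here is an assumed degree-2 polynomial:

```python
import numpy as np

def is_valid_gram(X, kernel):
    """Check (numerically) that the Gram matrix of `kernel` over X is positive semi-definite."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    eigvals = np.linalg.eigvalsh(K)       # K is symmetric, so eigvalsh applies
    return np.all(eigvals >= -1e-10)      # tolerate tiny negatives from round-off

X = np.random.randn(20, 3)
poly2 = lambda a, b: (a @ b) ** 2         # degree-2 polynomial kernel
print(is_valid_gram(X, poly2))            # expected: True
```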

Slide 11

Basic Kernel Execution

Primal, dual, and kernel versions of the decision function:
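These are standard forms; written out (with an optional bias b, which the later worked examples set to 0):

$$
\begin{aligned}
\text{Primal:}\quad & f(\mathbf{x}) = \operatorname{sign}(\mathbf{w}\cdot\mathbf{x} + b) \\
\text{Dual:}\quad & f(\mathbf{x}) = \operatorname{sign}\Big(\textstyle\sum_i \alpha_i y_i\, (\mathbf{x}_i\cdot\mathbf{x}) + b\Big) \\
\text{Kernel:}\quad & f(\mathbf{x}) = \operatorname{sign}\Big(\textstyle\sum_i \alpha_i y_i\, K(\mathbf{x}_i,\mathbf{x}) + b\Big)
\end{aligned}
$$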

Now we see the real advantage of working in the dual form

Note intuition of execution: Gaussian (and other) Kernel similar to reduced weighted K-nearest neighbor (and like RBF)

Slide 12

Slide 13

Polynomial Kernels

For greater dimensionality we can do the same with a general polynomial kernel K(x, z) = (x · z)^d, which corresponds to all monomials of degree d

Slide 14

Polynomial Kernel Example

Assume a simple 2-d feature vector x: x1, x2

Note that a new instance x will be paired with training vectors xi from the training set using K(x, xi). We'll call these x and z for this example.

Note that in the input space x we are getting the 2nd order terms: x1², x2², and x1x2
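For reference, the expansion behind this degree-2 example (a standard identity, stated here for concreteness) is:

$$
K(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^2 = (x_1 z_1 + x_2 z_2)^2
= x_1^2 z_1^2 + 2 x_1 x_2\, z_1 z_2 + x_2^2 z_2^2
= \Phi(\mathbf{x})\cdot\Phi(\mathbf{z}),
\qquad \Phi(\mathbf{x}) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)
$$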

Slide 15

Polynomial Kernel Example

Following is the 3rd degree polynomial kernel
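Assuming the same two-variable input as before, the standard degree-3 expansion is:

$$
K(\mathbf{x},\mathbf{z}) = (\mathbf{x}\cdot\mathbf{z})^3 = (x_1 z_1 + x_2 z_2)^3
= x_1^3 z_1^3 + 3 x_1^2 x_2\, z_1^2 z_2 + 3 x_1 x_2^2\, z_1 z_2^2 + x_2^3 z_2^3,
$$
$$
\text{i.e.}\quad \Phi(\mathbf{x}) = (x_1^3,\ \sqrt{3}\,x_1^2 x_2,\ \sqrt{3}\,x_1 x_2^2,\ x_2^3)
$$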

Slide 16

Polynomial Kernel Example

Note that for the 2nd degree polynomial with two variables we get the 2nd order terms x1², x2², and x1x2

Compare with quadric. We also add a bias weight with SVM.

For the 2nd degree polynomial with three variables we would get the 2nd order terms x1², x2², x3², x1x2, x1x3, and x2x3

Note that we only get the dth degree terms. However, with some kernel creation/manipulation we can also include the lower order terms (see the note below)
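One common manipulation of this kind (given here as a standard example, not taken from the slides) is to add a constant inside the kernel, e.g. K(x, z) = (x · z + 1)^d, which contains all monomials of degree up to d rather than only the degree-d terms.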

Slide 17

SVM Execution

Assume novel instance x = <.4, .8>
Assume training set vectors (with bias = 0):
<.5, .3>  y = -1  α = 1
<-.2, .7>  y = 1  α = 2
What is the output for this case?
Show the kernel and higher-order computations

Slide 18

SVM Execution

Assume novel instance x = <.4, .8>
Assume training set vectors (with bias = 0):
<.5, .3>  y = -1  α = 1
<-.2, .7>  y = 1  α = 2

1·-1·(<.5,.3>·<.4,.8>)² + 2·1·(<-.2,.7>·<.4,.8>)² = -.1936 + .4608 = .2672

This is the kernel version; what about the higher-order version?

Slide 19

SVM Execution

Assume novel instance x = <.4, .8>
Assume training set vectors (with bias = 0):
<.5, .3>  y = -1  α = 1
<-.2, .7>  y = 1  α = 2

Kernel version:
1·-1·(<.5,.3>·<.4,.8>)² + 2·1·(<-.2,.7>·<.4,.8>)² = -.1936 + .4608 = .2672

Higher-order (explicit feature space) version:
1·-1·(.5²·.4² + 2·.5·.3·.4·.8 + .3²·.8²) = -.04 + -.096 + -.0576 = -.1936
2·1·((-.2)²·.4² + 2·-.2·.7·.4·.8 + .7²·.8²) = 2·(.0064 - .0896 + .3136) = .4608
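A quick sketch (not from the slides) that reproduces these numbers both ways, with the kernel and the explicit feature map written out:

```python
import numpy as np

x_new = np.array([0.4, 0.8])                 # novel instance
train = [(np.array([0.5, 0.3]), -1, 1.0),    # (x_i, y_i, alpha_i)
         (np.array([-0.2, 0.7]), 1, 2.0)]

k = lambda a, b: (a @ b) ** 2                                        # degree-2 polynomial kernel
phi = lambda a: np.array([a[0]**2, np.sqrt(2)*a[0]*a[1], a[1]**2])   # explicit feature map

kernel_out = sum(alpha * y * k(xi, x_new) for xi, y, alpha in train)
feature_out = sum(alpha * y * (phi(xi) @ phi(x_new)) for xi, y, alpha in train)

print(round(kernel_out, 4), round(feature_out, 4))   # both should print 0.2672
```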

Slide 20

SVM Homework

Assume novel instance x = <.7, .2>
Assume training set vectors (with bias = 0):
<.1, .6>  y = -1  α = 3
<.2, -.7>  y = 1  α = 2
What is the output for this case?
Show the kernel and higher-order computations

Slide 21

Kernel Trick

So are we getting the same power as the Quadric machine without having to directly calculate the 2nd order terms?

Slide 22

Kernel Trick

No. With SVM we weight the scalar result of the kernel, which has constant coefficients in the 2nd order equation!

With Quadric we can have a separate learned weight (coefficient) for each term. But we do get to individually weight each support vector.

Assume that the 3rd term above was really irrelevant. How would Quadric/SVM deal with that?

Slide 23

Kernel Trick Realities

Polynomial kernel – all monomials of degree 2: x1x3y1y3 + x3x3y3y3 + … (all 2nd order terms)

K(x, y) = <Φ(x)·Φ(y)> = … + (x1x3)(y1y3) + (x3x3)(y3y3) + …

A lot of stuff is represented with just one (x·y)²

However, in a full higher order solution we would like adaptive coefficients for each of these higher order terms (e.g. -2x1 + 3·x1x2 + …)

SVM does a weighted (embedding coefficients) sum of them all with individual constant internal coefficients

Thus, not as powerful as a higher order system with arbitrary weighting

The more desirable arbitrary weighting can be done in an MLP because learning is done in the layers between inputs and hidden nodes

SVM input to higher order feature is a fixed mapping. No learning at that level.

Of course, individual weighting requires a theoretically exponential increase in terms/hidden nodes for which we need to find weights as the polynomial degree increases. Also need learning algorithms which can actually find these most salient higher-order features.

BUT, with SVM we do get access to the higher-order terms (though not individually weighted) while working in the much more efficient kernel space which would not happen if we had to use the expanded space with individual coefficients.

Slide 24

Kernel Trick Common

Kernel trick used in lots of other models
Kernel PCA, etc.
Anytime we want to get the power of a non-linear map, but still work in the dimensions of the original space

Slide 25

Slide 26

SVM vs RBF Comparison

SVM commonly uses a Gaussian kernel
The kernel is a distance metric (à la k-nearest neighbor)
How does the SVM differ from RBF?
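For reference, the Gaussian (RBF) kernel mentioned here has the standard form, with σ as its width hyper-parameter:

$$
K(\mathbf{x},\mathbf{z}) = \exp\!\left(-\frac{\lVert \mathbf{x}-\mathbf{z}\rVert^2}{2\sigma^2}\right)
$$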

Slide 27

SVM vs RBF Comparison

SVM commonly uses a Gaussian kernel
How does the SVM differ from RBF?
SVM will automatically discover which training instances to use as prototypes (i.e. support vectors)
SVM works only in the kernel space while RBF calculates values in the potentially much larger exploded space

Both weight the different prototypes

RBF uses a perceptron style learning algorithm to create weights between prototypes and output classes

RBF supports multiple output classes and nodes have a vote for each output class, whereas SVM support vectors can only vote for their target class: 1 or -1

SVM will create a maximum margin hyperplane decision surface

Since internal feature coefficients are constants in the Gaussian distance kernel (for both SVM and RBF), SVM will suffer from fixed/irrelevant features just like RBF/k-nearest neighbor

They both have a static mapping – no learning in the kernel map from input to higher order feature space

Slide 28

Choosing a Kernel

Can start from a desired feature space and try to construct a kernel
More often one starts from a reasonable kernel and tries a few (CV)
Some kernels are a better fit for certain problems; domain knowledge can be helpful
Common kernels:

Polynomial

Gaussian

Sigmoidal

Application specific
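As a concrete illustration of "try a few with cross-validation" (this uses scikit-learn, which is not mentioned in the slides; the dataset and parameter values are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Compare a few common kernels with 5-fold cross-validation
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, C=1.0)
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{kernel:8s} mean accuracy = {scores.mean():.3f}")
```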

Slide 29

Maximum Margin

Maximum margin can lead to overfit due to noise
The problem may not be linearly separable even in the transformed feature space
Soft Margin is a common solution – allows slack variables
αi constrained to be ≥ 0 and less than C. The C allows outliers.
How to pick C? Can try different values for the particular application to see which works best.

Slide 30

Soft Margins

Slide 31

Quadratic Optimization

Optimizing the margin in the higher order feature space is convex and thus there is one guaranteed solution at the minimum (or maximum)

SVM Optimizes the dual representation (avoiding the higher order feature space) with variations on the following
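For reference, the standard soft-margin dual objective that this discussion appears to paraphrase is:

$$
\max_{\boldsymbol{\alpha}}\ \sum_i \alpha_i \;-\; \tfrac{1}{2}\sum_i\sum_j \alpha_i \alpha_j\, y_i y_j\, K(\mathbf{x}_i,\mathbf{x}_j)
\quad\text{subject to}\quad \sum_i \alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C
$$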

Maximizing Σαi tends towards larger α, subject to Σαi yi = 0 and αi ≤ C (which both tend towards smaller α)

Without this term, α = 0 could suffice

The 2nd term minimizes the number of support vectors, since:

Two positive (or negative) instances that are similar (high kernel result) would increase the size of the term. Thus both (or either) instances are usually not needed.

Two non-matched instances which are similar should have larger α since they are likely support vectors at the decision boundary (the negative term helps maximize)

The optimization is quadratic in the αi terms and linear in the constraints – can drop the C maximum for a non-soft margin

While quite solvable, this requires complex code and is usually done with a numerical methods software package – quadratic programming

Slide 32

Execution

Typically use the dual form, which can take advantage of kernel efficiency
If the number of support vectors is small then dual execution is fast
In cases of low dimensional feature spaces, could derive the weights from the αi and use normal primal execution

Can also get speed-up (and potential regularization) by dropping support vectors with embedding coefficients below some threshold

Slide 33

Standard SVM Approach

Select a 2-class training set, a kernel function (optionally calculate the Gram matrix), and choose the C value (soft margin parameter)
Pass these to a quadratic optimization package, which will return an α for each training pattern based on a variation of the dual objective discussed earlier (non-bias version)

Patterns with non-zero α are the support vectors for the maximum margin SVM classifier

Execute by using the support vectors (a sketch follows below)
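A sketch of this workflow using scikit-learn's SVC (the dataset, kernel degree, and C value are illustrative choices, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# 2-class training set
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Choose a kernel and the soft-margin parameter C, then let the optimizer find the alphas
clf = SVC(kernel="poly", degree=2, C=1.0)
clf.fit(X, y)

# Patterns with non-zero alpha are the support vectors
print("support vectors per class:", clf.n_support_)
print("first few alpha_i * y_i values:", clf.dual_coef_[0][:5])

# Execution uses only the support vectors
print("prediction for first pattern:", clf.predict(X[:1]))
```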

Slide 34

A Simple On-Line Alternative

Stochastic on-line gradient ascent
Could be effective
This version assumes no bias
Sensitive to learning rate
Stopping criteria test whether it is an appropriate solution – can just go until little change is occurring, or can test the optimization conditions directly

Can be quite slow and usually quadratic programming is used to get an exact solution

Newton and conjugate gradient techniques also used – Can work well since it is a guaranteed convex surface – bowl shaped

Slide 35

Maintains a margin of 1 (typical in standard SVM implementations), which can always be done by scaling the αi, or equivalently w and b

This is done with the (1 - actual) term below, which can update even when correct, as it tries to make the distance of support vectors to the decision surface be exactly 1

If parenthetical term < 0 (i.e. current instance is correct and beyond margin), then don’t update
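A minimal sketch of this style of update (a kernel-adatron-like rule; the exact update on the slide is not fully preserved, so the learning rate, the clipping to [0, C], and the fixed epoch count here are assumptions):

```python
import numpy as np

def online_dual_svm(X, y, kernel, C=1.0, lr=0.01, epochs=100):
    """Stochastic gradient ascent on the dual alphas (no bias term)."""
    n = X.shape[0]
    K = np.array([[kernel(a, b) for b in X] for a in X])   # Gram matrix
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            actual = y[i] * np.sum(alpha * y * K[:, i])     # y_i * f(x_i)
            # Update only if the instance is not yet correct and beyond the margin of 1
            if 1.0 - actual > 0:
                alpha[i] += lr * (1.0 - actual)
            alpha[i] = min(max(alpha[i], 0.0), C)           # keep 0 <= alpha_i <= C
    return alpha
```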

Slide 36

Large Training Sets

Big problem since the Gram matrix (all (xi·xj) pairs) is O(n²) for n data patterns
10⁶ data patterns require 10¹² memory items
Can't keep them in memory
Also makes for a huge inner loop in dual training

Key insight: most of the data patterns will not be support vectors so they are not needed

Slide 37

Chunking

Start with a reasonably sized subset of the data set (one that fits in memory and does not take too long during training)
Train on this subset and just keep the support vectors, or the m patterns with the highest αi values
Grab another subset, add the current support vectors to it, and continue training
Note that this training may allow previous support vectors to be dropped as better ones are discovered

Repeat until all data is used and no new support vectors are added, or some other stopping criterion is fulfilled (a rough sketch follows below)
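A rough sketch of chunking built on scikit-learn's SVC (the chunk size, kernel choice, and the use of SVC itself are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, chunk_size=1000, C=1.0):
    """Train an SVM in chunks, carrying support vectors forward between chunks."""
    sv_X = np.empty((0, X.shape[1]))
    sv_y = np.empty((0,), dtype=y.dtype)
    clf = None
    for start in range(0, len(X), chunk_size):
        # Current chunk plus the support vectors found so far
        chunk_X = np.vstack([X[start:start + chunk_size], sv_X])
        chunk_y = np.concatenate([y[start:start + chunk_size], sv_y])
        clf = SVC(kernel="rbf", C=C).fit(chunk_X, chunk_y)
        # Keep only the current support vectors for the next round
        sv_X = clf.support_vectors_
        sv_y = chunk_y[clf.support_]
    return clf
```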

Slide 38

SVM Notes

Excellent empirical and theoretical potential
Maximum margin is a great regularizer (even without a kernel)
Multi-class problems not handled naturally. The basic model classifies into just two classes. Can do one model for each class (class i is 1 and all else 0) and then decide between conflicting models using confidence, etc.

How to choose kernel – main learning parameter other than margin penalty C. Kernel choice may include other hyper-parameters to be defined (degree of polynomials, variance of Gaussians, etc.)

Speed and size: for both training and testing, how to handle very large training sets (millions of patterns and/or support vectors) is not yet solved

Adaptive Kernels: trained during learning?

Kernel trick common in other models
