
Presentation Transcript

Slide1

An Introduction to Support Vector Machines

In part from the slides of Jinwei Gu

Slide2

Support Vector Machine

A classifier derived from statistical learning theory by Vapnik et al. in 1992.

SVM became famous when, using images as input, it gave accuracy comparable to neural networks in a handwriting recognition task.

Currently, SVM is widely used in object detection and recognition, content-based image retrieval, text recognition, biometrics, speech recognition, etc.

It is still one of the best non-deep methods; in many tasks it gives comparable performance.

Slide3

Support Vector Machine

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane (as for neural networks).

In other words, given labeled training data (supervised learning, both for classification and regression), the algorithm outputs an optimal hyperplane which categorizes new examples.

To understand SVM we need to brush up on vector algebra a bit.

Slide4

Function learned by SVM

It can be an arbitrary function f of x, such as:
- Nonlinear functions (like feed-forward neural nets)
- Linear functions (like the perceptron)

Slide5

Linear Function (Linear Separator)

f(x) is a linear function in R^n, written in matrix/vector notation as f(x) = w^T x + b, where w is the vector of weights, x is an input vector, and T denotes the transpose. In two dimensions (x1, x2) the decision boundary w^T x + b = 0 is the line w1 x1 + w2 x2 + b = 0; points on one side satisfy w^T x + b > 0 and points on the other side satisfy w^T x + b < 0.

The normal vector of the hyperplane is the unit-length vector w/||w||: if you divide a vector by its norm, you obtain a unit vector (with norm = 1).

Slide6

Recall: Vector Representation and Dot Product

[Figure: the vector w, points x_a, x_b, x_c, and the line w^T x + b = 0.] The black line is perpendicular to the vector w, therefore the dot product between w and any vector lying along the line is 0.

Slide7

Linear Separator

[Figure: points of the two classes (labeled +1 and -1) separated by the line w^T x + b = 0; the half-plane w^T x + b > 0 contains the positive points, w^T x + b < 0 the negative ones.]

The classification function is:

f(x, w, b) = sign(w^T x + b) = y

If sign(w^T x + b) > 0, x is classified as positive; if sign(w^T x + b) < 0, x is classified as negative.
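As a quick illustration of this decision rule, here is a minimal sketch in Python/NumPy; the weight vector, bias and points are made-up values, not taken from the slides.

```python
import numpy as np

def predict(w, b, X):
    """Classify each row of X with the linear rule sign(w^T x + b)."""
    return np.sign(X @ w + b)

# Hypothetical 2-D example: boundary x1 + x2 - 3 = 0
w = np.array([1.0, 1.0])
b = -3.0
X = np.array([[4.0, 2.0],   # w^T x + b = 3  -> +1
              [1.0, 1.0]])  # w^T x + b = -1 -> -1
print(predict(w, b, X))     # [ 1. -1.]
```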

Slide8

Linear Function

[Figure: points of the two classes (+1 and -1) in the (x1, x2) plane.] How would you classify these points using a linear function? There is an infinite number of answers!

Slide9 - Slide11

Linear Discriminant Function

[Figures: the same points with several different candidate separating lines.]

How would you classify these points using a linear discriminant function in order to minimize the error rate? There is an infinite number of answers. Which one is the best?

Slide12

Linear Classifiers

h(x) = sign(w^T x + b)

If the classifier produces a correct classification, then y_i sign(w^T x_i + b) = 1, since both y_i and sign(w^T x_i + b) are either -1 or +1.

[Figure: a candidate separating line with one point misclassified to the +1 class.] How would you classify this data? In this example, data points too close to the decision boundary may lead to future wrong classifications!

Slide13

Large Margin Linear Classifier

The linear discriminant function (classifier) with the maximum margin is the best!

Why is it the best? It is robust to outliers and thus has stronger generalization ability.

The margin is defined as the width that the boundary could be increased by before hitting a data point in the learning set (the "safe zone").

Slide14

Functional Margins

The functional margin of a training example (x_i, y_i) in the dataset D is defined as:

fm_i = y_i (w^T x_i + b)

fm_i > 0 if y_i and (w^T x_i + b) have the same sign. If fm_i > 0 the prediction is CORRECT, and it has high CONFIDENCE, since fm_i is high iff |w^T x_i + b| is large, which means that x_i is "far enough" from the decision boundary w^T x + b = 0.

However, the classification function h(x) = sign(w^T x + b) is scale invariant: if we multiply w and b by any constant k, h(x) is the same (but the functional margin will be k times larger!). Therefore maximizing functional margins is not a good choice: we can make them arbitrarily large without changing anything meaningful!

Slide15

Geometric Margins

The geometric margin is defined as the distance between a data point x_i and the decision boundary:

γ_i = y_i (w^T x_i + b) / ||w||

The geometric margin is scale invariant with respect to the parameters of the decision boundary: if we multiply w and b by k, the margin is the same!
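To make the two definitions concrete, here is a small sketch (made-up w, b and point, not the ones in the figures) that computes both margins and shows that rescaling w and b changes the functional margin but not the geometric one.

```python
import numpy as np

def functional_margin(w, b, x, y):
    return y * (w @ x + b)

def geometric_margin(w, b, x, y):
    return y * (w @ x + b) / np.linalg.norm(w)

w, b = np.array([1.0, 1.0]), -3.0       # hypothetical boundary x1 + x2 - 3 = 0
x, y = np.array([4.0, 2.0]), +1          # a correctly classified positive point

print(functional_margin(w, b, x, y))         # 3.0
print(geometric_margin(w, b, x, y))          # 3/sqrt(2) ≈ 2.121
# Rescaling (w, b) by k = 10: the functional margin grows, the geometric one does not.
print(functional_margin(10*w, 10*b, x, y))   # 30.0
print(geometric_margin(10*w, 10*b, x, y))    # ≈ 2.121
```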

Slide16

Task Objective: Maximizing the Geometric Margin

Let's now consider the data points (x+, x-) in the training set that are placed "on the margins", i.e., those closest to the separating hyperplane. They are called SUPPORT VECTORS.

Since, as we said, geometric margins are scale invariant, we can apply a scale transformation to w and b and set the margin of the support vectors equal to 1:

H1:  w^T x+ + b = +1   (positive support vectors)
H2:  w^T x- + b = -1   (negative support vectors)

Slide17

Task Objective: Maximizing the Geometric Margin

Consider every instance of the training set in the form <x, y>, where y is the class of x:

if y = +1  →  w^T x + b > 0
if y = -1  →  w^T x + b < 0

Considering the margins defined by the support vectors, we obtain the equivalent formulation:

if y = +1  →  w^T x + b ≥ +1
if y = -1  →  w^T x + b ≤ -1

since, by construction, there are no points in the dataset lying within the margins. The equality applies ONLY to the support vectors.

Slide18

The Margin Width

Let H: w^T x + b = 0, H1: w^T x + b = +1 and H2: w^T x + b = -1 (the latter two pass through the support vectors). Since we know that for the support vectors |w^T x + b| = 1, the distance between H1 and H is 1/||w||, and the distance between H1 and H2 is therefore 2/||w||: this is the margin width.

Slide19

Maximizing the Margin

Remember: our objective is finding w and b such that the geometric margin is maximized.

So, we need to maximize 2/||w|| such that, for any x_i in D:

y_i = +1  →  w^T x_i + b ≥ +1
y_i = -1  →  w^T x_i + b ≤ -1

The "such that" means that there must be no data points between H1 and H2.

Equivalently, we need to minimize (1/2)||w||^2 subject to the same constraints. Notice that the two constraints can be combined into one:

y_i (w^T x_i + b) ≥ 1

Slide20

Solving the Optimization Problem

We can rewrite it all as an optimization problem:

minimize (1/2)||w||^2
s.t.  y_i (w^T x_i + b) ≥ 1 for all i

This is a quadratic problem with linear constraints, a well-known class of mathematical programming problems, which we solve with the Lagrangian method.
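As a sketch of what "solving the quadratic problem" means in practice, the hard-margin primal can be handed to a generic convex solver. This assumes the cvxpy package and a toy linearly separable dataset; both are illustrative choices, not part of the slides.

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize (1/2)||w||^2   s.t.   y_i (w^T x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
```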

Slide21

The Lagrangian

Given a function f(x) and a set of constraints c1,..,cn, a Lagrangian is a function L(f, c1,..,cn, α1,.., αn) that "incorporates" the constraints into the optimization problem.

Karush-Kuhn-Tucker (KKT) conditions: the optimum is at a point where (i) the derivative of the Lagrangian is 0, and (ii) α_i c_i(x) = 0 for every constraint. Condition (ii) is known as the complementarity condition (it means that at least one of α_i, c_i(x) must be zero).

Slide22

Solving the Optimization Problem

minimize (1/2)||w||^2   s.t.   y_i (w^T x_i + b) ≥ 1 for all i

The Lagrangian is:

L(w, b, α) = (1/2)||w||^2 - Σ_i α_i [ y_i (w^T x_i + b) - 1 ],   with α_i ≥ 0

To minimize L over w and b we also need to maximize it over the α_i (the part in the red box in the slide). Note: our variables here are w, b and the α_i.

Setting the derivatives of L with respect to w and b to zero gives (note: this is a system of equations!):

1.  w = Σ_i α_i y_i x_i
2.  Σ_i α_i y_i = 0

Slide23

Dual Problem Formulation

Since w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0, substituting these conditions back into L gives the Lagrangian dual problem:

maximize  Q(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j
s.t.  Σ_i α_i y_i = 0, and α_i ≥ 0 for all i

Slide24

All the Steps

We know from the previous steps and the KKT conditions that:

1.  w = Σ_i α_i y_i x_i
2.  Σ_i α_i y_i = 0
3.  α_i ≥ 0
4.  α_i ( y_i (w^T x_i + b) - 1 ) = 0    (complementarity)

By 3., the dual is maximized subject to α_i ≥ 0; by 1., 2. and 3., substituting into L yields Q(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j.

NOTE: because of condition 4, each non-zero α_i indicates that the corresponding x_i is a support vector, since (y_i (w^T x_i + b) - 1) is zero ONLY for support vectors.

Slide25

Why do only Support Vectors have α > 0?

The complementarity condition (condition 4 above) implies that one of the two multiplicands is zero.

For a non-support vector, y_i (w^T x_i + b) - 1 > 0, and therefore α_i must be 0.

Therefore only data points on the margin (i.e., the support vectors) will have a non-zero α, because they are the only ones for which y_i (w^T x_i + b) = 1, i.e., y_i (w^T x_i + b) - 1 = 0.

Slide26

Final Lagrangian Formulation

To summarize:

Find α1…αn such that
Q(α) = Σ_i α_i - ½ Σ_i Σ_j α_i α_j y_i y_j x_i^T x_j is maximized, and
(1) Σ_i α_i y_i = 0
(2) α_i ≥ 0 for all α_i

Q(α) can be computed since we know the y_i, y_j and x_i^T x_j (they come from the pairs <x_i, y_i> of the training set D!), and the only variables are the α_i. But remember that constraint (2) holds with a value different from zero only for the support vectors!

So, in this final formulation the only variables to be computed are the α_i of the support vectors. With respect to the original formulation, b and w have disappeared!
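To give a feel for the dual, here is a small sketch that maximizes Q(α) on a toy dataset with plain projected gradient ascent; it is a didactic stand-in for a real QP solver, and the data, learning rate and iteration count are arbitrary choices, not from the slides.

```python
import numpy as np

# Toy linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = (y[:, None] * y[None, :]) * (X @ X.T)   # G_ij = y_i y_j x_i^T x_j

alpha = np.zeros(n)
lr = 0.01
for _ in range(5000):
    grad = np.ones(n) - G @ alpha            # dQ/dalpha_i = 1 - sum_j G_ij alpha_j
    alpha += lr * grad                       # gradient ascent on Q(alpha)
    alpha = np.maximum(alpha, 0.0)           # project onto alpha_i >= 0
    alpha -= y * (alpha @ y) / n             # approximately enforce sum_i alpha_i y_i = 0
    alpha = np.maximum(alpha, 0.0)

# At (approximate) convergence, the clearly non-zero alphas mark the support vectors.
print(np.round(alpha, 3))
```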

Slide27

Summary of the Optimization Problem Solution

Quadratic optimization problems (such as our primal problem) are a well-known class of mathematical programming problems, and many algorithms exist for solving them.

The solution involves constructing a dual problem where a Lagrange multiplier α_i is associated with every constraint in the primal problem. In the dual problem, the only unknowns are the α_i.

Primal problem:
Find w and b such that (1/2)||w||^2 is minimized, s.t. for all (x_i, y_i): y_i (w^T x_i + b) ≥ 1

Lagrangian dual problem:
Find α1…αn such that Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j x_i^T x_j is maximized, and
(1) Σ α_i y_i = 0
(2) α_i ≥ 0 for all α_i

Slide28

The Optimization Problem Solution

Given a solution α1…αn to the Lagrangian dual problem Q(α), the solution to the primal (i.e., computing w and b) is:

w = Σ α_i y_i x_i
b = y_k - Σ α_i y_i x_i^T x_k    for any α_k > 0

Each non-zero α_i indicates that the corresponding x_i is a support vector.

Once we solve the problem, the classification function for a new point x is the sign of:

Φ(x) = Σ α_i y_i x_i^T x + b    (the x_i are the support vectors)

Note that we don't need to compute w explicitly: to predict the class of x we perform an inner (dot) product x_i^T x between the test point x and the support vectors x_i. Only the support vectors are needed to classify new instances (since the other α values are zero).

Also keep in mind that solving the optimization problem involved computing the inner (dot) products x_i^T x_k between all pairs of training points.
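Continuing the NumPy sketch above (reusing the hypothetical X, y and the alphas produced by the previous snippet), recovering w and b and classifying a new point looks like this:

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-6):
    """w = sum_i alpha_i y_i x_i ;  b = y_k - sum_i alpha_i y_i x_i^T x_k for any alpha_k > 0."""
    w = (alpha * y) @ X
    k = np.argmax(alpha > tol)                  # index of one support vector
    b = y[k] - (alpha * y) @ (X @ X[k])
    return w, b

def classify(alpha, X, y, b, x_new):
    """sign( sum_i alpha_i y_i x_i^T x_new + b ): only the support vectors contribute."""
    return np.sign((alpha * y) @ (X @ x_new) + b)
```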

Slide29

Kernel Functions and the Linear Kernel

The inner (dot) product belongs to a category of mathematical functions known as kernel functions, i.e., SIMILARITY functions between vectors. Specifically, the dot product is a linear kernel.

Why is the dot product (linear kernel) a similarity function? Because

x^T z = ||x|| ||z|| cos θ

where θ is the angle between the two vectors: the more two vectors of a given length point in the same direction, the larger their dot product.
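A one-line numerical check of this identity (arbitrary vectors, just for illustration):

```python
import numpy as np

x, z = np.array([1.0, 2.0]), np.array([3.0, 1.0])
cos_theta = (x @ z) / (np.linalg.norm(x) * np.linalg.norm(z))
print(x @ z, np.linalg.norm(x) * np.linalg.norm(z) * cos_theta)  # both 5.0
```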

Slide30

Example

Suppose we have the dataset shown in the figure, with positive and negative points.

Slide31

Example

By simple inspection, we can identify 3 support vectors: A, B and C.

Slide32

Example: What Do We Know?

We know that in general:

- Positive margin H1 (support vectors B, C):  w^T x + b = +1
- Negative margin H2 (support vector A):  w^T x + b = -1
- w = Σ α_i y_i x_i (first constraint from the derivative), and the α_i are all non-zero for the support vectors.
- The kernel is the inner (dot) product for a linear separator:  k(x_i, x_j) = x_i^T x_j
- From the optimization condition Σ α_i y_i = 0 we get:  -α_A + α_B + α_C = 0

Slide33

Example: First Step

Now we can write a system of equations to find the "alphas". For each of the 3 support vectors x_i:

Σ_j α_j y_j k(x_j, x_i) + b = y_i

where k(·, ·) (i.e., the kernel) is the dot product in this case (linear).

Slide34

Then we compute the inner (dot) products:

K(·,·)   A    B    C
A        1    3    3
B        3   10    8
C        3    8   10

Example: K(B, C) = B^T C = 8

Slide35

Finally, We Obtain the System of Equations

Using the kernel values from the table and y_A = -1, y_B = y_C = +1:

For A:  -1·α_A + 3·α_B + 3·α_C + b = -1
For B:  -3·α_A + 10·α_B + 8·α_C + b = +1
For C:  -3·α_A + 8·α_B + 10·α_C + b = +1

We have 3 equations and 4 unknowns (α_A, α_B, α_C and b), but..

Slide36

Finally, We Obtain the System of Equations

From the optimization conditions, we also know that Σ α_i y_i = 0, i.e.:

-α_A + α_B + α_C = 0

Solving the resulting system, the result is:

α_A = 1/2,  α_B = α_C = 1/4,  b = -2
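A quick check of this hand computation: only the kernel values from the table and the labels of A, B, C are taken from the slides, the linear-system setup itself is illustrative.

```python
import numpy as np

# Kernel (Gram) matrix for the support vectors A, B, C (values from the table above)
K = np.array([[1.0,  3.0,  3.0],
              [3.0, 10.0,  8.0],
              [3.0,  8.0, 10.0]])
y = np.array([-1.0, 1.0, 1.0])          # A is negative, B and C are positive

# Unknowns: alpha_A, alpha_B, alpha_C, b
# Rows 1-3:  sum_j alpha_j y_j K(x_j, x_i) + b = y_i   (one equation per support vector)
# Row 4:     sum_i alpha_i y_i = 0
A = np.zeros((4, 4))
A[:3, :3] = K * y[None, :]              # coefficients alpha_j * y_j * K(x_j, x_i)
A[:3, 3] = 1.0                          # the b column
A[3, :3] = y                            # sum_i alpha_i y_i = 0
rhs = np.array([-1.0, 1.0, 1.0, 0.0])

alpha_A, alpha_B, alpha_C, b = np.linalg.solve(A, rhs)
print(alpha_A, alpha_B, alpha_C, b)     # 0.5 0.25 0.25 -2.0
```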

Slide37

The Solution (Graphically)

The components w_i of w are computed using w = Σ α_i y_i x_i. [Figure: the resulting separating line and margins for the example.]

Slide38

Data with "Noise"

SVMs try to perfectly separate the training data with a separating hyperplane with maximum margins. However, data are rarely "perfectly" separable.

In some cases it is better to allow some classification errors to favor high generalization power, rather than adapting to the training examples (overfitting).

Slide39

Soft Margins: The Intuition

The red line correctly separates the data but, in a sense, it "overfits": the margin is very narrow and reduces the generalization power over unseen examples.

The green line allows us to define larger margins, but it makes one error (the green point that lies above the green line).

The idea behind soft margins is to "sacrifice" consistency over the training data but allow for larger margins.

Slide40

Soft Margins (2)

The general idea is to change the objective function. Rather than minimizing (1/2)||w||^2, we minimize

(1/2)||w||^2 + C · (# of mistakes)

Here, C is a hyperparameter that decides the trade-off between maximizing the margin and minimizing the mistakes.

When C is small, classification mistakes are given less importance and the focus is more on maximizing the margin, whereas when C is large, the focus is more on avoiding misclassification at the expense of keeping the margin small.

However, not all mistakes are equal. Data points that are far away on the wrong side of the decision boundary should incur more penalty than the ones that are closer.

Slide41

Soft Margins (3)

The idea is: for every data point x_i, we introduce a slack variable ξ_i. The value of ξ_i is the distance of x_i from the corresponding class's margin if x_i is on the wrong side of the margin, and zero otherwise. Thus the points that are far away from the margin on the wrong side get more penalty.

The constraint becomes y_i (w^T x_i + b) ≥ 1 - ξ_i. If the left side (remember: it is a functional margin) is ≥ 1, then x_i is correctly classified; if instead it has a negative sign, x_i is on the "wrong side", and we add a penalty ξ_i that is proportional to the distance of x_i from the separating line. ξ_i is called a slack variable.

Slide42

Soft Margin Classification

Soft margin: the SVM tries to find a hyperplane separating the two classes, but tolerates a few misclassified points.

Slack variables ξ_i can be added to allow misclassification of outliers or noisy examples; the resulting margins are called soft.

[Figure: two points x_i on the wrong side of their margins, with their slack values ξ_i; these are the misclassified examples.]

The "relaxed" constraint is:

y_i (w^T x_i + b) ≥ 1 - ξ_i,   with ξ_i ≥ 0

Slide43

New Formulation of the Primal Problem

minimize (1/2)||w||^2 + C Σ_i ξ_i
such that y_i (w^T x_i + b) ≥ 1 - ξ_i, with ξ_i ≥ 0

Parameter C can be viewed as a way to control over-fitting: for large C a large penalty is assigned to errors.

The slack variables allow an example to lie inside the margin (0 < ξ_i ≤ 1) while still being correctly classified, while if ξ_i > 1 the example is misclassified (an error).

When incorporating the constraints (as for the standard SVM) we obtain the corresponding Lagrangian dual; the only difference is that the multipliers are now bounded: 0 ≤ α_i ≤ C.

[Figure: comparison between the large margin (hard) linear classifier and the soft margin classifier.]
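The soft-margin objective can be evaluated directly by noting that the optimal slack is ξ_i = max(0, 1 - y_i (w^T x_i + b)), i.e., the hinge loss. A small sketch with made-up data and parameters:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i xi_i, with xi_i = max(0, 1 - y_i (w^T x_i + b))."""
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * slack.sum(), slack

# Hypothetical data and parameters
X = np.array([[2.0, 2.0], [0.5, 0.5], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])
w, b = np.array([1.0, 1.0]), -1.0

obj, slack = soft_margin_objective(w, b, X, y, C=1.0)
print(slack)   # [0. 1. 0.] -> the second point violates its margin
print(obj)     # 2.0
```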

Slide44

Soft-Margin Classification: Effect of the Parameter C

C is a hyperparameter that decides the trade-off between maximizing the margin and minimizing the mistakes. [Figure: decision boundaries obtained with different values of C.]
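In practice this trade-off is exposed directly by SVM libraries. A minimal sketch with scikit-learn (assuming it is installed; the toy data is made up) that compares a small and a large C:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data, almost linearly separable (hypothetical)
X = np.array([[2, 2], [3, 3], [2.5, 3], [0, 0], [1, 0], [0, 1], [1.8, 1.8]])
y = np.array([1, 1, 1, -1, -1, -1, -1])        # the last point is "noisy"

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors, "
          f"train accuracy = {clf.score(X, y):.2f}")
```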

Slide45

Summary: Linear SVM

The classifier is a separating hyperplane.

The most "important" training points are the support vectors; they define the position of the hyperplane.

Quadratic optimization algorithms can identify which training points x_i are support vectors, i.e., those with non-zero Lagrangian multipliers α_i.

Both in the dual formulation of the problem and in the solution, the training points appear only inside dot products.

Slack variables allow us to tolerate errors in "quasi-linearly" separable data.

Slide46

Non-linear SVMs

Datasets that are linearly separable (possibly with noise) work out great. But what are we going to do if the dataset is just too hard?

How about mapping the data to a higher-dimensional space? [Figure: 1-D data x that is not linearly separable becomes separable after mapping it to (x, x^2).]

This slide is courtesy of LINK.

Slide47

Non-linear SVMs: Feature Space

General idea: the original input space can be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

Note that φ(x) is still a vector, in a higher-dimensional space!

This slide is courtesy of LINK.

Slide48

Space Mapping Function

The original points (left side of the schematic) are mapped by φ, i.e., rearranged, using non-linear kernels.
Slide50

The "Kernel Trick" (1)

The original formulation of the discriminant function was:

g(x) = Σ_i α_i y_i x_i^T x + b

With this mapping, our discriminant function is now:

g(x) = Σ_i α_i y_i φ(x_i)^T φ(x) + b

Slide51

The Kernel Trick (2)

As we said, the dot product is a linear kernel, and it computes the similarity between 2 vectors in the feature-vector space. The linear classifier relies on the dot product between vectors: K(x_i, x_j) = x_i^T x_j.

In general, a kernel function is defined as a function that corresponds to a dot product of two feature vectors "in some expanded" feature space:

K(x_i, x_j) = φ(x_i)^T φ(x_j)

Slide52

Example

Second-degree polynomial mapping: for example, φ maps x = (x1, x2) to the vector of degree-2 monomials φ(x) = (x1^2, √2·x1·x2, x2^2).

It satisfies the definition of kernel function, since φ(x)^T φ(z) = (x^T z)^2 corresponds to a dot product of two feature vectors "in some expanded" feature space.
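A quick numerical check of this example (the mapping shown is one standard choice; the vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit second-degree polynomial feature map for 2-D input."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def poly2_kernel(x, z):
    """Same quantity computed implicitly, without building phi(x)."""
    return (x @ z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 1.0])
print(phi(x) @ phi(z))     # 25.0
print(poly2_kernel(x, z))  # 25.0 -- identical, but no expanded vectors were built
```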

Slide53

The Kernel Trick: Popular Kernel Functions

- Linear kernel: K(x_i, x_j) = x_i^T x_j
- Polynomial kernel of degree p: K(x_i, x_j) = (1 + x_i^T x_j)^p
- Gaussian kernel (Radial Basis Function, RBF): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
- Sigmoid kernel: K(x_i, x_j) = tanh(β0 x_i^T x_j + β1)

In general, functions that satisfy Mercer's condition (https://www.svms.org/mercer/) can be kernel functions.
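These four kernels are straightforward to implement directly; a small sketch whose parametrizations mirror the standard forms listed above:

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, p=2):
    return (1.0 + x @ z) ** p

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, z, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (x @ z) + beta1)

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
for k in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
    print(k.__name__, k(x, z))
```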

Slide54

Nonlinear SVM: The Optimization Problem

The Lagrangian dual problem formulation is the same, with K(x_i, x_j) = φ(x_i)^T φ(x_j) in place of x_i^T x_j:

Find α1, α2,.., αn such that
Q(α) = Σ α_i - ½ ΣΣ α_i α_j y_i y_j K(x_i, x_j) is maximized, and
Σ α_i y_i = 0,  α_i ≥ 0 for all α_i

The solution of the discriminant function (decision boundary) is:

g(x) = Σ α_i y_i K(x_i, x) + b

The optimization technique for finding the α_i's is the same.

Slide55

Nonlinear SVM: Summary

SVM locates a separating hyperplane in the feature space and classifies points in that space.

It does not need to represent the space explicitly; rather, it simply defines a kernel function.

The kernel function (a dot product in an "augmented" space) plays the role of the dot product in the feature space.

Slide56

Support Vector Machine: Algorithm

1. Choose a kernel function.
2. Choose a value for C.
3. Solve the quadratic programming problem (many software packages are available).
4. Construct the discriminant function from the support vectors.

Slide57

Summary

Maximum margin classifier: better generalization ability and less over-fitting.

The kernel trick: map data points to a higher-dimensional space in order to make them linearly separable. Since only the dot product is used, we do not need to represent the mapping explicitly.

Slide58

Issues

Choice of the kernel function:
- The Gaussian or polynomial kernel is the default.
- If they are ineffective, more elaborate kernels are needed.
- Domain experts can give assistance in formulating appropriate similarity measures.

Choice of the kernel parameters:
- e.g., σ in the Gaussian kernel; σ is the distance between the closest points with different classifications.
- In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.

Optimization criterion (hard margin vs. soft margin): typically a lengthy series of experiments in which various parameters are tested.

This slide is courtesy of LINK.

Slide59

Issues

Maximum margin classifiers are known to be sensitive to the scaling transformations applied to the features. Therefore it is essential to normalize the data! (See the slides on data transformation at the beginning of the course.)

Maximum margin classifiers are also sensitive to unbalanced data.

Hyper-parameter tuning (C, kernel): see the recommended reading.

Slide60

Properties of SVM

- Flexibility in choosing a similarity function.
- Ability to deal with large data sets: only the support vectors are used to specify the separating hyperplane.
- Ability to handle large feature spaces: the complexity of the algorithm does not depend on the dimensionality of the feature space.
- Overfitting can be controlled by the soft margin approach.
- Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution (link).

Slide61

Weakness of SVM

Sensitive to noise: a relatively small number of mislabeled examples can dramatically decrease the performance.

It only considers two classes: how do we do multi-class classification with SVM? Answer (see the sketch below):

1. With output arity m, learn m SVMs:
   - SVM 1 learns "Output == 1" vs "Output not 1"
   - SVM 2 learns "Output == 2" vs "Output not 2"
   - ...
   - SVM m learns "Output == m" vs "Output not m"
2. To predict the output for a new input, just predict with each SVM and find out which one puts the prediction the furthest into the positive region.

Slide62

Additional Resources

Recommended lectures: LINK, LINK

Additional resources: LINK

LibSVM, the best implementation for SVM: LINK

Supplementary material: LINK, LINK, LINK, LINK, LINK

Interpretation of SVM learning: LINK, LINK

SVM sheet: LINK