Supervised Learning Methods



Presentation Transcript

Slide1

Notes for HW #4, Problem 3

At each hidden unit, use the ReLU activation function
In the forward pass, use g(x) = max(0, x)
In the backward pass, use the derivative of the ReLU function: g'(x) = 1 if x > 0, and 0 otherwise
Slide2
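
To make these notes concrete, here is a minimal NumPy sketch (not part of the homework handout; the array names are illustrative) of the ReLU forward pass and its derivative as used in backpropagation:

```python
import numpy as np

def relu(x):
    # Forward pass: g(x) = max(0, x), applied elementwise
    return np.maximum(0.0, x)

def relu_derivative(x):
    # Backward pass: g'(x) = 1 where x > 0, else 0
    # (the value at exactly x = 0 is conventionally taken to be 0)
    return (x > 0).astype(float)

# Example: weighted sums arriving at three hidden units
in_j = np.array([-2.0, 0.5, 3.0])
print(relu(in_j))             # [0.  0.5 3. ]
print(relu_derivative(in_j))  # [0. 1. 1.]
```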

Multi-Class Classification with Neural Networks

Use a number of output units equal to the number of classes
Represent each class with 1 at a particular output unit and 0 at all other output units

[Diagram: one-hot output encoding for the classes Cat, Dog, and Toaster]
Slide3

Multi-Class Classification with Neural Networks

At each output unit, use the Softmax activation function:

Oi = exp(zi) / Σj exp(zj),   j = 1, …, K

where zi is the weighted sum of the inputs to the i-th output unit, and there are K output units

This means the output units all have values between 0 and 1 and sum to 1, so they can be interpreted as probabilities

Note: [3, 1, 1] does not become [.6, .2, .2] but rather [.78, .11, .11], since we are exponentiating before normalizing
Slide4
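
A small NumPy sketch (assumed, not from the slides) of the Softmax computation, reproducing the [3, 1, 1] example above:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([3.0, 1.0, 1.0])
print(softmax(z))        # approx. [0.787 0.107 0.107], the [.78, .11, .11] case above
print(softmax(z).sum())  # 1.0
```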

Multi-class Classification with Neural Networks

For the error function, instead of SSE, use "Cross-Entropy" loss:

E = - Σi Ti log(Oi)

The derivative has a nice property when used with Softmax:

∂E/∂zi = Oi - Ti

where Oi is the computed output at output unit i and Ti is the target output at unit i

Measures the distance between the target distribution and the output distribution
Slide5
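
A minimal NumPy sketch (assumed names, not from the slides) of the cross-entropy loss with a one-hot target and the convenient Softmax gradient O - T:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(T, O):
    # E = -sum_i T_i * log(O_i); eps avoids log(0)
    eps = 1e-12
    return -np.sum(T * np.log(O + eps))

z = np.array([3.0, 1.0, 1.0])   # weighted sums at the K output units
T = np.array([1.0, 0.0, 0.0])   # one-hot target
O = softmax(z)

print(cross_entropy(T, O))      # loss value
print(O - T)                    # dE/dz_i = O_i - T_i, the "nice property" above
```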

Notes for HW #4, Problem 3

At all output units, use the Softmax activation function, and define the Error (Loss) function using Cross-Entropy
Slide6

Updating Weights in a 2-Layer Neural Network

For weights between hidden and output units, using the Softmax activation function and Cross-Entropy error function:

Δwj,k = α aj (Tk - Ok) = α aj Δk

where
wj,k   weight on link from hidden unit j to output unit k
α      learning rate parameter
aj     activation (i.e., output) of hidden unit j
Tk     teacher output for output unit k
Ok     Softmax output of output unit k
Slide7
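
The update rule above, written as a NumPy sketch (illustrative names; one training example with hidden activations a and a one-hot target T):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
alpha = 0.1                        # learning rate
a = np.array([0.2, 0.7, 0.0])      # activations of the hidden units a_j
W = rng.normal(size=(3, 4))        # W[j, k] = w_{j,k}, hidden unit j -> output unit k
T = np.array([0.0, 1.0, 0.0, 0.0]) # one-hot teacher output

O = softmax(a @ W)                 # Softmax outputs O_k
delta_k = T - O                    # Delta_k = (T_k - O_k)
W += alpha * np.outer(a, delta_k)  # Delta w_{j,k} = alpha * a_j * Delta_k
```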

Updating Weights in a 2-Layer Neural Network

For weights between input and hidden units, with ReLU at the hidden units:

Δk = (Tk - Ok)
Δj = g'(inj) Σk wj,k Δk
Δwi,j = α ai Δj

where
wi,j     weight on link from input i to hidden unit j
wj,k     weight on link from hidden unit j to output unit k
K        number of output units
α        learning rate parameter
Tk       teacher output for output unit k
Ok       Softmax output of output unit k
ai       input value i
g'(inj)  derivative of the ReLU activation function at hidden unit j
Slide8
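
Putting the two update rules together, a compact NumPy sketch (assumed shapes and names, not the homework's exact variables) of one backpropagation step for a 2-layer network with ReLU hidden units and a Softmax/Cross-Entropy output layer:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(1)
alpha = 0.1
x = np.array([1.0, -0.5])             # input values a_i
T = np.array([0.0, 0.0, 1.0])         # one-hot teacher output

W1 = rng.normal(size=(2, 4))          # W1[i, j] = w_{i,j}
W2 = rng.normal(size=(4, 3))          # W2[j, k] = w_{j,k}

# Forward pass
in_j = x @ W1                         # weighted sums at the hidden units
a_j = np.maximum(0.0, in_j)           # ReLU activations
O = softmax(a_j @ W2)                 # Softmax outputs

# Backward pass
delta_k = T - O                                       # Delta_k = T_k - O_k
delta_j = (in_j > 0).astype(float) * (W2 @ delta_k)   # Delta_j = g'(in_j) * sum_k w_{j,k} Delta_k
W2 += alpha * np.outer(a_j, delta_k)                  # Delta w_{j,k} = alpha * a_j * Delta_k
W1 += alpha * np.outer(x, delta_j)                    # Delta w_{i,j} = alpha * a_i * Delta_j
```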

Updating Weights in a 2-Layer Neural Network

For weights between hidden and output units, using a Sigmoid activation function:

Δwj,k = α aj (Tk - Ok) g'(ink) = α aj (Tk - Ok) Ok (1 - Ok) = α aj Δk

where
wj,k       weight on link from hidden unit j to output unit k
α          learning rate parameter
aj         activation (i.e., output) of hidden unit j
Tk         teacher output for output unit k
Ok (= ak)  actual output of output unit k
g'         derivative of the sigmoid activation function, which is g'(x) = g(x)(1 - g(x))
Δk         modified error, Δk = Errk × g'(ink)
Slide9

For Problem 2(b) this means

∇C = ErrC × g'(inC) = (TC - OC) g'(inC)

where
g'(inC) = g(inC) (1 - g(inC))
g(inC) = 1 / (1 + e^(-inC))
inC = aA × wac + aB × wbc + wh2c
aA = g(inA)
OC = g(inC) = aC
Slide10

Updating Weights in a 2-Layer Neural Network

For weights between inputs and hidden units:

Δwi,j = α ai Δj,   where Δj = g'(inj) Σk wj,k Δk

and where
wi,j   weight on link from input i to hidden unit j
wj,k   weight on link from hidden unit j to output unit k
α      learning rate parameter
aj     activation (i.e., output) of hidden unit j
Tk     teacher output for output unit k
Ok     actual output of output unit k
ai     input value i
g'     derivative of the sigmoid activation function, which is g' = g(1 - g)
Slide11

Then ∇A = g'(inA) × wac × ∇C = aA × (1 - aA) × wac × ∇C
Slide12

Supervised Learning Methods

k-nearest-neighbors
Decision trees
Neural networks
Naïve Bayes
Support vector machines (SVM)
Slide13

Support Vector Machines

Chapter 18.9 and the optional paper “Support vector machines” by M. Hearst, ed., 1998

Acknowledgments: These slides combine and modify ones provided by Andrew Moore (CMU), Carla Gomes (Cornell), Mingyue Tan (UBC), Jerry Zhu (Wisconsin), Glenn Fung (Wisconsin), and Olvi Mangasarian (Wisconsin)
Slide14

What are Support Vector Machines (SVMs) Used For?

Classification
Regression and data-fitting
Supervised and unsupervised learning
Slide15

Lake Mendota, Madison, WI

Identify areas of land cover (land, ice, water, snow) in a scene
Two methods:
Scientist manually derived
Support Vector Machine (SVM)

[Images of Lake Mendota, Wisconsin: Visible Image; Expert Labeled; Expert Derived; Automated Ratio; SVM]

Classifier      Expert Derived   SVM
cloud           45.7%            58.5%
ice             60.1%            80.4%
land            93.6%            94.0%
snow            63.5%            71.6%
water           84.2%            89.1%
unclassified    45.7%

Courtesy of Steve Chien of NASA/JPL

Slide16

Linear Classifiers

[Scatter plot: data points labeled +1 and -1, with classifier output y = f(x)]

How would you classify this data?
Slide17

Linear Classifiers

(aka Linear Discriminant Functions)

Definition: A function that is a linear combination of the components of the input (column vector) x:

f(x) = wT x + b

where w is the weight (column vector) and b is the bias

A 2-class classifier then uses the rule: Decide class c1 if f(x) ≥ 0 and class c2 if f(x) < 0; or, equivalently, decide c1 if wT x ≥ -b and c2 otherwise
Slide18
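
A tiny NumPy sketch (illustrative names, not from the slides) of this decision rule:

```python
import numpy as np

def linear_classify(x, w, b):
    """Return class c1 if f(x) = w^T x + b >= 0, else class c2."""
    return "c1" if w @ x + b >= 0 else "c2"

w = np.array([2.0, -1.0])   # weight vector
b = -1.0                    # bias
print(linear_classify(np.array([1.0, 0.0]), w, b))  # f =  1 -> c1
print(linear_classify(np.array([0.0, 2.0]), w, b))  # f = -3 -> c2
```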

w is the plane's normal vector
b determines the plane's offset from the origin (the distance from the origin to the plane is |b| / ||w||)

A planar decision surface in d dimensions is parameterized by (w, b)

[Diagram: hyperplane with normal vector w and offset b]

Slide19
Slide20

Slide21

Linear Classifiers

[Scatter plot: data points labeled +1 and -1, with one candidate separating line]

f(x, w, b) = sign(wT x + b)

How would you classify this data?
Slide22

Linear Classifiers

[Scatter plot: the same data with another candidate separating line]

f(x, w, b) = sign(wT x + b)

How would you classify this data?
Slide23

Linear Classifiers

[Scatter plot: the same data with another candidate separating line]

f(x, w, b) = sign(wT x + b)

How would you classify this data?
Slide24

Linear Classifiers

[Scatter plot: the same data with several candidate separating lines]

f(x, w, b) = sign(wT x + b)

Any of these would be fine …
… but which is best?
Slide25

Classifier Margin

[Scatter plot: data points labeled +1 and -1]

f(x, w, b) = sign(wT x + b)

Define the margin of a linear classifier as the width that the decision boundary could be expanded before hitting a data point
Slide26

Maximum Margin

[Scatter plot: data points labeled +1 and -1, with the maximum margin separator]

f(x, w, b) = sign(wT x + b)

The maximum margin linear classifier is the linear classifier with the maximum margin
This is the simplest kind of SVM (called an LSVM)

Linear SVM
Slide27

Maximum Margin

[Scatter plot: data points labeled +1 and -1; the support vectors lie on the margin boundaries]

f(x, w, b) = sign(wT x + b)

The maximum margin linear classifier is the linear classifier with the maximum margin
This is the simplest kind of SVM (called an LSVM)
Support Vectors are those data points that the margin pushes against

Linear SVM
Slide28

Why the Maximum Margin?

[Scatter plot: data points labeled +1 and -1, with the maximum margin separator and its support vectors]

f(x, w, b) = sign(w · x - b)

The maximum margin linear classifier is the linear classifier with the maximum margin
This is the simplest kind of SVM (called an LSVM)
Support Vectors are those data points that the margin pushes against

Intuitively this feels safest
If we have made a small error in the location of the boundary (it has been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification
Robust to outliers, since the model is immune to change or removal of any non-support-vector data points
There is some theory (using "VC dimension") that is related to (but not the same as) the proposition that this is a good thing
Empirically it works very well
Slide29

Specifying a Line and Margin

How do we represent this mathematically? … in d input dimensions?
An example: x = (x1, …, xd)T

[Diagram: Classifier Boundary between a Plus-Plane and a Minus-Plane; "Predict Class = +1" zone on one side, "Predict Class = -1" zone on the other]
Slide30

Specifying a Line and Margin

Plus-plane:  wT x + b = +1
Minus-plane: wT x + b = -1

Classify as
+1  if  wT x + b ≥ 1
-1  if  wT x + b ≤ -1
?   if  -1 < wT x + b < 1

Weight vector: w = (w1, …, wd)T
Bias or threshold: b
The dot product wT x is a scalar: x's projection onto w

[Diagram: the planes wT x + b = 1, 0, -1 bounding the "Predict Class = +1" and "Predict Class = -1" zones]
Slide31

Computing the Margin

Plus-plane:  wT x + b = +1
Minus-plane: wT x + b = -1

Claim: The vector w is perpendicular to the Plus-Plane and the Minus-Plane

[Diagram: the planes wT x + b = 1, 0, -1; M = margin (width)]

How do we compute M in terms of w and b?
Slide32

Computing the Margin

Claim: The vector w is perpendicular to the Plus-Plane. Why?
Let u and v be two vectors on the Plus-Plane. What is wT (u - v)?
And so, of course, the vector w is also perpendicular to the Minus-Plane

How do we compute M in terms of w and b?
Slide33

Computing the Margin

Let x- be any point on the Minus-Plane (any location in Rd; not necessarily a data point)
Let x+ be the closest Plus-Plane point to x-

[Diagram: the planes wT x + b = 1, 0, -1; the points x- and x+; M = margin]

How do we compute M in terms of w and b?
Slide34

Computing the Margin

Let x- be any point on the Minus-Plane
Let x+ be the closest Plus-Plane point to x-
Claim: x+ = x- + λw for some value of λ

How do we compute M in terms of w and b?
Slide35

Computing the Margin

Claim: x+ = x- + λw for some value of λ. Why?
The line from x- to x+ is perpendicular to the planes, so to get from x- to x+, travel some distance in the direction of w

How do we compute M in terms of w and b?
Slide36

Computing the Margin

What we know:
wT x+ + b = +1
wT x- + b = -1
x+ = x- + λw
||x+ - x-|| = M
Slide37

Computing the Margin

It's now easy to get M in terms of w and b:

wT (x- + λw) + b = 1
wT x- + b + λ wT w = 1
-1 + λ wT w = 1
⟹ λ = 2 / (wT w)
Slide38

Computing the Margin

M = ||x+ - x-|| = ||λw|| = λ ||w|| = 2 ||w|| / (wT w) = 2 / ||w||   (the margin size)
Slide39

Learning the Maximum Margin Classifier

Given a guess of w and b, we can
Compute whether all data points are in the correct half-planes
Compute the margin
So now we just need to write a program to search the space of w's and b's to find the widest margin that correctly classifies all the data points. How?
Slide40

[Figure from Statnikov et al.]
Slide41

[Figure from Statnikov et al.]
Slide42

SVM as Constrained Optimization

Unknowns: w, b
Objective function: maximize the margin M = 2 / ||w||
This is equivalent to minimizing ||w||, or ||w||2 = wT w
N training points: (xk, yk), yk = +1 or -1
Subject to each training point being correctly classified, i.e., yk (wT xk + b) ≥ 1 for all k   (N constraints)
This is a quadratic optimization problem (QP), which can be solved efficiently; a sketch of solving it with an off-the-shelf solver follows
Slide43
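
As a sketch of what "solve the QP" looks like in practice, the snippet below uses scikit-learn's SVC with a linear kernel and a very large C to approximate the hard-margin LSVM on a toy 2D dataset (the dataset and the use of scikit-learn are assumptions, not part of the slides):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data: +1 above the line x1 + x2 = 0, -1 below it
X = np.array([[2.0, 2.0], [1.5, 2.5], [2.5, 1.0],
              [-2.0, -2.0], [-1.0, -2.5], [-2.5, -1.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

# A very large C approximates the hard-margin constraint y_k (w^T x_k + b) >= 1
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
print("w =", w, "b =", b)
print("margin M = 2/||w|| =", 2.0 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```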

Classification with SVMs

Given a new point x, we can classify it by
Computing the score: wT x + b
Deciding the class based on whether the score is < 0 or > 0

If desired, we can set a confidence threshold t:
Score > t: yes
Score < -t: no
Else: don't know

Sec. 15.1
Slide44
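
A small sketch (illustrative names, assumed NumPy) of scoring with the optional confidence threshold t:

```python
import numpy as np

def classify_with_reject(x, w, b, t=0.0):
    score = w @ x + b
    if score > t:
        return "yes"      # predict +1
    if score < -t:
        return "no"       # predict -1
    return "don't know"   # score falls inside [-t, t]

w, b = np.array([1.0, -2.0]), 0.5
print(classify_with_reject(np.array([3.0, 1.0]), w, b, t=1.0))  # score  1.5 -> yes
print(classify_with_reject(np.array([1.0, 1.0]), w, b, t=1.0))  # score -0.5 -> don't know
```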

SVMs: More than Two Classes

SVMs can only handle two-class problems
k-class problem: split the task into k binary tasks and learn k SVMs (see the sketch below):
Class 1 vs. the rest (classes 2 through k)
Class 2 vs. the rest (classes 1, 3 through k)
…
Class k vs. the rest
Then pick the class that puts the point farthest into its positive region
Slide45
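
A sketch of the one-vs-rest scheme described above; scikit-learn's LinearSVC is assumed for the k binary classifiers, and the data are made up:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Three Gaussian blobs -> a 3-class problem
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in ([0, 0], [4, 0], [0, 4])])
y = np.repeat([0, 1, 2], 20)

# Train one "class c vs. the rest" SVM per class
models = []
for c in np.unique(y):
    y_binary = np.where(y == c, 1, -1)
    models.append(LinearSVC(C=1.0).fit(X, y_binary))

# Classify a new point: pick the class whose SVM pushes it farthest into the positive region
x_new = np.array([[3.5, 0.5]])
scores = [m.decision_function(x_new)[0] for m in models]
print("scores:", scores, "-> predicted class", int(np.argmax(scores)))
```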

[Figure from Statnikov et al.: one-vs-rest decision boundaries, e.g., I vs II & III and III vs I & II]
Slide46

SVMs: Non Linearly-Separable Data

What if the data are not linearly separable?
Slide47

SVMs: Non Linearly-Separable Data

Two approaches:
Allow a few points on the wrong side (slack variables)
Map the data to a higher dimensional space, and do linear classification there (kernel trick)
Slide48

Non Linearly-Separable Data

Approach 1: Allow a few points on the wrong side (slack variables)
"Soft Margin Classification"
Slide49

[Scatter plot: points labeled +1 and -1, not linearly separable]

What Should We Do?
Slide50

What Should We Do?

Idea: Find the minimum ||w||2 while minimizing the number of training set errors
Problem: Two things to minimize makes for an ill-defined optimization
Slide51

What Should We Do?

Idea: Minimize ||w||2 + C (# train errors), where C is a tradeoff "penalty" parameter
There's a serious practical problem with this approach
Slide52

What Should We Do?

Idea: Minimize ||w||2 + C (# train errors)
There's a serious practical problem with this approach:
It can't be expressed as a Quadratic Programming problem, so solving it may be too slow
(Also, it doesn't distinguish between disastrous errors and near misses)
Slide53

What Should We Do?

Minimize ||w||2 + C (distance of all "misclassified points" to their correct place)
Slide54

Choosing C, the Penalty Factor

Critical to choose a good value for the constant penalty parameter C, because:
C too big means very similar to the LSVM, because we have a high penalty for non-separable points, and we may use many support vectors and overfit
C too small means we allow many misclassifications in the training data and we may underfit
Slide55

Choosing C

[Figure from Statnikov et al.]
Slide56

Learning Maximum Margin with Noise

Given a guess of w and b, we can
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume N examples, each (xk, yk) where yk = +1 / -1

[Diagram: the planes wT x + b = 1, 0, -1; M = margin]

What should our optimization criterion be?
Slide57

Learning Maximum Margin with Noise

Given a guess of w and b, we can
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume N examples, each (xk, yk) where yk = +1 / -1

[Diagram: the planes wT x + b = 1, 0, -1; M = margin; slack distances ε2, ε7, ε11 for points in the wrong zone]

What should our optimization criterion be?
Minimize ½ wT w + C Σk εk   (the εk are slack variables)

How many constraints will we have? N
What should they be?
yk (wT xk + b) ≥ 1 - εk for all k
Note: εk = 0 for points in the correct zone
Slide58

Learning Maximum Margin with Noise

Given a guess of w and b, we can
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume N examples, each (xk, yk) where yk = +1 / -1

What should our optimization criterion be?
Minimize ½ wT w + C Σk εk

Our original (noiseless data) QP had d+1 variables: w1, w2, …, wd, and b
Our new (noisy data) QP has d+1+N variables: w1, w2, …, wd, b, ε1, …, εN
(d = # input dimensions, N = # examples)

How many constraints will we have? N
What should they be?
wT xk + b ≥ +1 - εk  if yk = +1
wT xk + b ≤ -1 + εk  if yk = -1
Slide59

Learning Maximum Margin with Noise

How many constraints will we have? N
What should they be?
wT xk + b ≥ +1 - εk  if yk = +1
wT xk + b ≤ -1 + εk  if yk = -1

There's a bug in this QP
Slide60

Learning Maximum Margin with Noise

What should our optimization criterion be?
Minimize ½ wT w + C Σk εk

How many constraints will we have? 2N
What should they be?
wT xk + b ≥ +1 - εk  if yk = +1
wT xk + b ≤ -1 + εk  if yk = -1
εk ≥ 0  for all k

(The εk are the "slack variables"; a small sketch of computing them for a given w and b follows)
Slide61
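
A NumPy sketch (the values of w, b, and the data are made up) of how the slack variables εk and the soft-margin objective are computed for a given guess of w and b:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    # Slack: eps_k = max(0, 1 - y_k (w^T x_k + b)); zero for points in the correct zone
    eps = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * eps.sum(), eps

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [0.2, 0.1]])
y = np.array([+1, +1, -1, -1])        # the last point sits on the wrong side
w, b, C = np.array([1.0, 1.0]), 0.0, 10.0

objective, eps = soft_margin_objective(w, b, X, y, C)
print("slack variables:", eps)        # nonzero only for the violating point
print("objective:", objective)
```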

An Equivalent QP

Maximize   Σk αk - ½ Σk Σl αk αl yk yl (xkT xl)

subject to these constraints:   0 ≤ αk ≤ C for all k,   and   Σk αk yk = 0

Then define:   w = Σk αk yk xk,   with b recovered from any support vector, which lies exactly on its margin plane

Then classify with:   f(x, w, b) = sign(wT x - b)

N examples, each (xk, yk) where yk = +1 / -1
Slide62

An Equivalent QP

Maximize   Σk αk - ½ Σk Σl αk αl yk yl (xkT xl)

subject to these constraints:   0 ≤ αk ≤ C for all k,   and   Σk αk yk = 0

Then define:   w = Σk αk yk xk

Then classify with:   f(x, w, b) = sign(wT x - b)

Data points with αk > 0 will be the support vectors, so this sum only needs to be over the support vectors
Slide63

An Equivalent QP

Why did I tell you about this equivalent QP?
It's a formulation that QP packages can optimize more quickly
Because of further jaw-dropping developments you're about to learn
Slide64

Non Linearly-Separable Data

Approach 2: Map the data to a higher dimensional space, and then do linear classification there (called the kernel trick)
Slide65

Suppose we’re in 1 Dimension

What would SVMs do with this data?

[1D data on a number line, with x = 0 marked; the classes are linearly separable]
Slide66

Suppose we're in 1 Dimension

[The same 1D data, with the positive "plane" and negative "plane" marked around the separating point]
Slide67

Harder 1D Dataset: Not Linearly-Separable

What can be done about this?

[1D data on a number line with x = 0 marked; the data are not linearly separable]
Slide68

Harder 1D Dataset

The Kernel Trick: Preprocess the data, mapping x into a higher dimensional space, z = Φ(x)

For example: Φ(x) = (x, x2)

Here, Φ maps the data from 1D to 2D
In general, map from the d-dimensional input space to a k-dimensional z space
Slide69
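
A sketch of this idea using a 1D-to-2D mapping of the form Φ(x) = (x, x²): the 1D points become 2D points, where a linear SVM can separate them. The toy data and the use of scikit-learn are assumptions, not taken from the slides:

```python
import numpy as np
from sklearn.svm import SVC

# A 1D dataset that is NOT linearly separable: the negative class sits in the middle
x = np.array([-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0])
y = np.array([+1, +1, -1, -1, -1, +1, +1])

# Map each x into 2D with z = Phi(x) = (x, x^2)
Z = np.column_stack([x, x ** 2])

# In z-space the classes can be separated by a horizontal line, so a linear SVM works
clf = SVC(kernel="linear", C=1e3).fit(Z, y)
print("training accuracy:", clf.score(Z, y))   # 1.0
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
```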

Harder 1D Dataset

The Kernel Trick: Preprocess the data, mapping x into a higher dimensional space, z = Φ(x)

[Plot: the mapped 2D data with the separating plane wT Φ(x) + b = +1]

The data is linearly separable in the new space, so use a linear SVM in the new space
Slide70

Another Example

Project examples into some higher dimensional space where the data is linearly separable, defined by z = Φ(x)
Slide71

Another Example

Project examples into some higher dimensional space where the data is linearly separable, defined by z = Φ(x)

CS 540, University of Wisconsin-Madison, C. R. Dyer
Slide72

Algorithm

Pick a Φ function
Map each training example, x, into the new higher-dimensional space using z = Φ(x)
Solve the optimization problem using the nonlinearly transformed training examples, z, to obtain a Linear SVM (with or without the slack-variable formulation) defined by w and b
Classify a test example, e, by computing: sign(wT Φ(e) + b)
Slide73

Improving Efficiency

The time complexity of the original optimization formulation depends on the dimensionality, k, of z (k >> d)
We can convert the optimization problem into an equivalent form, called the Dual Form, with time complexity O(N3) that depends on N, the number of training examples, not k
The Dual Form will also allow us to rewrite the mapping functions in Φ in terms of "kernel functions" instead
Slide74

Dual Optimization Problem

Maximize   Σk αk - ½ Σk Σl αk αl yk yl (xkT xl)

subject to these constraints:   0 ≤ αk ≤ C for all k,   and   Σk αk yk = 0

Then define:   w = Σk αk yk xk

Then classify with:   f(x, w, b) = sign(wT x - b)

N examples: (xk, yk) where yk = +1 / -1
Slide75

Dual Optimization Problem

N examples: (xk, yk) where yk = +1 / -1

Maximize   Σk αk - ½ Σk Σl αk αl yk yl (xkT xl)

subject to these constraints:   0 ≤ αk ≤ C for all k,   and   Σk αk yk = 0

Then define:   w = Σk αk yk xk

Then classify with:   f(x, w, b) = sign(wT x - b)

The αk are new variables; examples with αk > 0 will be the support vectors
xkT xl is the dot product of two examples
Slide76

Algorithm

Compute the N x N matrix Q by computing yi yj (xiT xj) between all pairs of training examples
Solve the optimization problem to compute αi for i = 1, …, N
Each non-zero αi indicates that example xi is a support vector
Compute w and b
Then classify a test example x with: f(x) = sign(wT x - b)
Slide77

Example

Suppose we have 5 1D data points: x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2
That is, y1=1, y2=1, y3=-1, y4=-1, y5=1
We use a polynomial kernel of degree 2: K(xi, xj) = (xi xj + 1)2
Let C = 100
We first find αi (i = 1, …, 5) by solving the dual optimization problem above
Slide78

Example

Using a QP solver, we get α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
The support vectors are {x2=2, x4=5, x5=6}
The discriminant function is f(x) = Σk αk yk K(xk, x) + b = 0.6667 x^2 - 5.333 x + b
b is recovered by solving f(2)=1, or f(5)=-1, or f(6)=1, since x2, x4, and x5 lie on the margin planes; all give b=9
Slide79

Example

Classification function: f(x) = 0.6667 x^2 - 5.333 x + 9

[Plot of f(x) at the points 1, 2, 4, 5, 6: points 1, 2, 6 fall in class 1 and points 4, 5 in class 2]
Slide80
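
The worked example can be checked numerically. The sketch below plugs the given α values into the dual form f(x) = Σk αk yk K(xk, x) + b, recovers b from a support vector, and reclassifies the five points:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0, 6.0])
y = np.array([+1, +1, -1, -1, +1])
alpha = np.array([0.0, 2.5, 0.0, 22/3, 29/6])   # = (0, 2.5, 0, 7.333, 4.833) as given above

def K(a, b):
    return (a * b + 1.0) ** 2                   # polynomial kernel of degree 2

def f_without_b(t):
    return sum(alpha[k] * y[k] * K(x[k], t) for k in range(len(x)))

# Recover b from a support vector: x2 = 2 lies on the plus-margin, so f(2) = +1
b = 1.0 - f_without_b(2.0)
print("b =", round(b, 2))                       # 9.0

for t in x:
    score = f_without_b(t) + b
    print(t, "->", "class 1" if score > 0 else "class 2", round(score, 2))
```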

Copyright © 2001, 2003, Andrew W. Moore

Dual Optimization Problem (After Mapping)

Maximize   Σk αk - ½ Σk Σl αk αl yk yl (Φ(xk)T Φ(xl))

subject to these constraints:   0 ≤ αk ≤ C for all k,   and   Σk αk yk = 0

Then define:   w = Σk αk yk Φ(xk)

Then classify with:   f(x, w, b) = sign(wT Φ(x) - b)

N examples: (xk, yk) where yk = +1 / -1
Slide81

Copyright © 2001, 2003, Andrew W. Moore

Dual Optimization Problem (After Mapping)

Maximize   Σk αk - ½ Σk Σl αk αl yk yl (Φ(xk)T Φ(xl))

subject to these constraints:   0 ≤ αk ≤ C for all k,   and   Σk αk yk = 0

Then define:   w = Σk αk yk Φ(xk)

Then classify with:   f(x, w, b) = sign(wT Φ(x) - b)

N examples: (xk, yk) where yk = +1 / -1

N2 dot products are needed to compute this matrix
Slide82

The dual formulation of the optimization problem depends on the input data only through dot products of the form Φ(xi)T · Φ(xj), where xi and xj are two examples
We can compute these dot products efficiently for certain types of Φ's, where K(xi, xj) = Φ(xi)T · Φ(xj)
Example: Φ(xi)T · Φ(xj) = (xiT · xj)2 = K(xi, xj)
Since the data only appear as dot products, we do not need to map the data to the higher dimensional space (using Φ(x)), because we can use the kernel function K instead
Slide83
Slide84

Kernel Functions

A kernel, K(xi, xj), is a dot product in some feature space
A kernel function is a function that can be applied to pairs of input examples to evaluate dot products in some corresponding (possibly infinite dimensional) feature space
We do not need to compute Φ explicitly
Slide85

What's Special about a Kernel?

Say one example (in 2D) is: s = (s1, s2)
We decide to use a particular mapping into 6D space:
Φ(s)T = (s1^2, s2^2, √2 s1s2, √2 s1, √2 s2, 1)
Let another example be t = (t1, t2)
Then,
Φ(s)T Φ(t) = s1^2 t1^2 + s2^2 t2^2 + 2 s1s2 t1t2 + 2 s1t1 + 2 s2t2 + 1 = (s1t1 + s2t2 + 1)^2 = (sT t + 1)^2
So, define the kernel function to be K(s, t) = (sT t + 1)^2 = Φ(s)T Φ(t)
We save computation by using the kernel K
Slide86
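
A quick numeric check of this identity (NumPy; the two vectors are arbitrary):

```python
import numpy as np

def phi(v):
    # The 6D mapping used above: (v1^2, v2^2, sqrt(2) v1 v2, sqrt(2) v1, sqrt(2) v2, 1)
    v1, v2 = v
    return np.array([v1**2, v2**2, np.sqrt(2)*v1*v2, np.sqrt(2)*v1, np.sqrt(2)*v2, 1.0])

def K(s, t):
    return (s @ t + 1.0) ** 2          # quadratic kernel, computed directly in 2D

s = np.array([0.7, -1.2])
t = np.array([2.0, 0.5])
print(phi(s) @ phi(t))                 # explicit 6D dot product
print(K(s, t))                         # same value, without ever forming phi
```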

Some Commonly Used Kernels

Linear kernel: K(xi, xj) = xiT xj
Quadratic kernel: K(xi, xj) = (xiT xj + 1)2
Polynomial of degree d kernel: K(xi, xj) = (xiT xj + 1)d
Radial-Basis Function (Gaussian) kernel: K(xi, xj) = exp(-||xi - xj||2 / σ2)

Many possible kernels; picking a good one is tricky
Hacking with SVMs: create various kernels, hope their space Φ is meaningful, plug them into the SVM, and pick one with good classification accuracy
A kernel is usually combined with slack variables, because there is no guarantee of linear separability in the new space
Slide87
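
These kernels are easy to write down directly; a small NumPy sketch (parameter values are placeholders):

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def polynomial_kernel(xi, xj, d=2):
    return (xi @ xj + 1.0) ** d        # d = 2 gives the quadratic kernel

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma ** 2)

xi = np.array([1.0, 2.0])
xj = np.array([0.5, -1.0])
print(linear_kernel(xi, xj))           # -1.5
print(polynomial_kernel(xi, xj))       # 0.25
print(rbf_kernel(xi, xj))              # exp(-9.25), about 9.6e-05
```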
Slide88
Slide89

Algorithm

Compute the N x N matrix Q by computing yi yj K(xi, xj) between all pairs of training points
Solve the optimization problem to compute αi for i = 1, …, N
Each non-zero αi indicates that example xi is a support vector
Compute w and b
Classify a test example x using: f(x) = sign(wT x - b)
Slide90

Common SVM Kernel Functions

zk = (polynomial terms of xk of degree 1 to d)
For example, when d=2 and m=2:
K(x, y) = (x1y1 + x2y2 + 1)2 = 1 + 2x1y1 + 2x2y2 + 2x1x2y1y2 + x1^2 y1^2 + x2^2 y2^2
zk = (radial basis functions of xk)
zk = (sigmoid functions of xk)
Slide91

Some SVM Kernel Functions

Polynomial Kernel Function: K(xi, xj) = (xiT xj + 1)d
Beyond polynomials, there are other high dimensional basis functions that can be made practical by finding the right kernel function
Radial-Basis-style Kernel Function
Neural-Net-style Kernel Function
σ, k, and d are magic parameters that must be chosen by a model selection method
Slide92

Applications of SVMs

Bioinformatics

Machine Vision

Text Categorization

Ranking (e.g., Google searches)

Handwritten Character Recognition

Time series analysis

Lots of very successful applications!
Slide93

Handwritten Digit Recognition
Slide94

Example Application: The Federalist Papers Dispute

Written in 1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade the citizens of New York to ratify the U.S. Constitution
The papers consisted of short essays, 900 to 3500 words in length
The authorship of 12 of those papers has been in dispute (Madison or Hamilton); these papers are referred to as the disputed Federalist papers
Slide95

Description of the Data

For every paper:
Computed relative frequencies of 70 words that Mosteller-Wallace identified as good candidates for author-attribution studies
Each document is represented as a vector containing the 70 real numbers corresponding to the 70 word frequencies (a "bag of words")

The dataset consists of 118 papers:
50 Madison papers
56 Hamilton papers
12 disputed papers
Slide96

70-Word Dictionary
Slide97

Feature Selection for Classifying the Disputed Federalist Papers

Apply the SVM algorithm for feature selection to:
Train on the 106 Federalist papers with known authors
Find a classification hyperplane (LSVM) that uses as few words as possible
Use the hyperplane to classify the 12 disputed papers
Slide98

Hyperplane Classifier Using 3 Words

A hyperplane depending on three words was found:
0.537 to + 24.663 upon + 2.953 would = 66.616
All disputed papers ended up on the Madison side of the plane
Slide99

Results: 3D Plot of Hyperplane
Slide100

SVM Applet

http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
Slide101

Summary

Learning linear functions
Pick the separating hyperplane that maximizes the margin
The separating plane is defined in terms of the support vectors (a small number of training examples) only

Learning non-linear functions
Project examples into a higher dimensional space
Use kernel functions for efficiency

Generally avoids the overfitting problem
Global optimization method; no local optima
Can be expensive to apply, especially for multi-class problems
Biggest drawback: the choice of kernel function
There is no "set-in-stone" theory for choosing a kernel function for any given problem
Once a kernel function is chosen, there is only ONE modifiable parameter, the error penalty C
Slide102

Software

A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification
SVMLight is one of the earliest and most frequently used implementations of SVMs
Several Matlab toolboxes for SVMs are available
Slide103

Quiz: What is This?

"The Next Rembrandt," a computer-generated, 3D-printed painting made using data from the artist's real works, by Microsoft et al.
Slide104

"The final portrait was created through a highly detailed and complex process which took over 18 months and used 150 gigabytes of digitally rendered graphics. This started with the analysis of all 346 of Rembrandt's paintings using high resolution 3D scans and digital files, which were upscaled using machine learning. It was possible to generate typical features and, using an algorithm that detects over 60 points in a painting, determine the distance between these on the subject's face."
Slide105

Who are the "robots" appearing in fiction and myths by Mary Shelley, Rabbi Judah Loew, and Pygmalion? Answer: Frankenstein, Golem, Galatea (aka Elise, Eliza)

Which one of the following movies contains an AI agent that underwent a learning phase on screen? With criminals for mentors, it included stabbing a few police officers: Andrew (Bicentennial Man), HAL (2001), Chappie (Chappie), or Agent Smith (Matrix)? Answer: Chappie