Slide1
Notes for HW #4, Problem 3
At each hidden unit, use the ReLU activation function.
In the forward pass, use the ReLU activation function: g(x) = max(0, x)
In the backward pass, use the derivative of the ReLU function: g′(x) = 1 if x > 0, else 0
Slide2
Multi-Class Classification with Neural Networks
Use a number of output units equal to the number of classes. Represent each class with 1 at a particular output unit and 0 at all other output units, e.g.:
Cat
Dog
Toaster
Slide3
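The one-per-class output encoding above can be sketched in plain Python (the three class names are just the slide's example):

```python
# One-hot targets for a 3-class problem (cat, dog, toaster).
classes = ["cat", "dog", "toaster"]

def one_hot(label):
    """Return a target vector with 1 at the label's output unit, 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]

print(one_hot("dog"))  # [0, 1, 0]
```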
Multi-Class Classification with Neural Networks
At each output unit, use the Softmax activation function:
softmax(zi) = e^zi / Σj e^zj
where zi is the weighted sum of the inputs to the i-th output unit, and there are K output units (j = 1, …, K)
This means the output units all have values between 0 and 1 and sum to 1, so they can be interpreted as probabilities
Note: [3, 1, 1] does not become [.6, .2, .2] but rather [.78, .11, .11], since we are exponentiating before normalizing
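The softmax computation, including the slide's [3, 1, 1] example, can be sketched as:

```python
import math

def softmax(z):
    # Exponentiate each weighted-input sum, then normalize so outputs sum to 1.
    exps = [math.exp(zi) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

out = softmax([3, 1, 1])
print(out)  # ≈ [0.787, 0.107, 0.107], not [0.6, 0.2, 0.2]
```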
Slide4
Multi-Class Classification with Neural Networks
For the error function, instead of SSE, use the “Cross-Entropy” loss:
E = − Σi Ti log Oi
Its derivative has a nice property when used with Softmax:
∂E/∂zi = Oi − Ti
where Oi is the computed output at output unit i and Ti is the target output at unit i
Cross-Entropy measures the distance between the target distribution and the output distribution
Slide5
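The nice derivative property for cross-entropy with softmax (∂E/∂zi = Oi − Ti) can be checked numerically; the inputs here are made-up illustrative values:

```python
import math

def softmax(z):
    exps = [math.exp(v) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(outputs, targets):
    return -sum(t * math.log(o) for t, o in zip(targets, outputs))

# Central-difference check that dE/dz_i = O_i - T_i.
z = [2.0, -1.0, 0.5]
T = [1, 0, 0]
eps = 1e-6
O = softmax(z)
for i in range(len(z)):
    z_hi = list(z); z_hi[i] += eps
    z_lo = list(z); z_lo[i] -= eps
    numeric = (cross_entropy(softmax(z_hi), T) - cross_entropy(softmax(z_lo), T)) / (2 * eps)
    assert abs(numeric - (O[i] - T[i])) < 1e-5
```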
Notes for HW #4, Problem 3
At all output units, use the Softmax activation function, and define the Error (Loss) function using Cross-Entropy
Slide6
Updating Weights in a 2-Layer Neural Network
For weights between hidden and output units, using the Softmax activation function and Cross-Entropy error function:
Δwj,k = α aj (Tk − Ok) = α aj Δk
where
wj,k  weight on link from hidden unit j to output unit k
α  learning rate parameter
aj  activation (i.e., output) of hidden unit j
Tk  teacher output for output unit k
Ok  Softmax output of output unit k
Slide7
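The hidden-to-output update Δwj,k = α aj (Tk − Ok) can be sketched as follows (all numeric values are made up for illustration):

```python
alpha = 0.1             # learning rate
a = [0.5, 0.2]          # hidden activations a_j
T = [1, 0, 0]           # one-hot teacher outputs T_k
O = [0.7, 0.2, 0.1]     # softmax outputs O_k

# delta_w[j][k] = alpha * a_j * (T_k - O_k)
delta_w = [[alpha * aj * (Tk - Ok) for Tk, Ok in zip(T, O)] for aj in a]
print(delta_w[0])  # weight changes out of hidden unit 0
```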
Updating Weights in a 2-Layer Neural Network
For weights between input and hidden units, with ReLU at the hidden units:
Δwi,j = α ai g′(inj) Σk wj,k Δk,  where Δk = (Tk − Ok)
where
wi,j  weight on link from input i to hidden unit j
wj,k  weight on link from hidden unit j to output unit k
K  number of output units
α  learning rate parameter
Tk  teacher output for output unit k
Ok  Softmax output of output unit k
ai  input value i
g′(inj)  derivative of the ReLU activation function
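The standard backprop step this slide describes, Δwi,j = α ai g′(inj) Σk wj,k Δk, can be sketched with made-up values:

```python
def relu_grad(x):
    # Derivative of ReLU: 1 where the input was positive, else 0.
    return 1.0 if x > 0 else 0.0

alpha = 0.1
a_in = [1.0, -2.0]                 # input values a_i
in_j = [0.8, -0.3]                 # weighted sums at the two hidden units
w_jk = [[0.5, -0.4], [0.2, 0.1]]   # hidden j -> output k weights
delta_k = [0.3, -0.1]              # (T_k - O_k) at the output units

# Back-propagated error at each hidden unit: g'(in_j) * sum_k w_jk * delta_k
delta_j = [relu_grad(in_j[j]) * sum(w_jk[j][k] * delta_k[k] for k in range(2))
           for j in range(2)]
# Weight updates: alpha * a_i * delta_j
delta_w = [[alpha * ai * dj for dj in delta_j] for ai in a_in]
```

Note that hidden unit 1 had a negative weighted sum, so its ReLU derivative is 0 and no error flows through it.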
Slide8
Updating Weights in a 2-Layer Neural Network
For weights between hidden and output units, using a Sigmoid activation function:
Δwj,k = α aj (Tk − Ok) g′(ink) = α aj (Tk − Ok) Ok (1 − Ok) = α aj Δk
where
wj,k  weight on link from hidden unit j to output unit k
α  learning rate parameter
aj  activation (i.e., output) of hidden unit j
Tk  teacher output for output unit k
Ok (= ak)  actual output of output unit k
g′  derivative of the sigmoid activation function, which is g′(x) = g(x)(1 − g(x))
Δk  modified error: Δk = Errk × g′(ink), where Errk = Tk − Ok
Slide9
For Problem 2(b) this means
∆C = ErrC × g′(inC) = (TC − OC) g′(inC)
where
g′(inC) = g(inC) (1 − g(inC))
g(inC) = 1 / (1 + e^(−inC))
inC = aA wac + aB wbc + wh2c
aA = g(inA)
OC = g(inC) = aC
Slide10
Updating Weights in a 2-Layer Neural Network
For weights between inputs and hidden units:
Δwi,j = α ai g′(inj) Σk wj,k Δk
where
wi,j  weight on link from input i to hidden unit j
wj,k  weight on link from hidden unit j to output unit k
α  learning rate parameter
aj  activation (i.e., output) of hidden unit j
Tk  teacher output for output unit k
Ok  actual output of output unit k
ai  input value i
g′  derivative of the sigmoid activation function, which is g′ = g(1 − g)
Slide11
Then
∆A = g′(inA) × wac × ∆C = aA (1 − aA) × wac × ∆C
Slide12
Supervised Learning Methods
k-nearest-neighbors
Decision trees
Neural networks
Naïve Bayes
Support vector machines (SVM)
Slide13
Support Vector Machines
Chapter 18.9 and the optional paper “Support vector machines” by M. Hearst, ed., 1998
Acknowledgments: These slides combine and modify ones provided by Andrew Moore (CMU), Carla Gomes (Cornell), Mingyue Tan (UBC), Jerry Zhu (Wisconsin), Glenn Fung (Wisconsin), and Olvi Mangasarian (Wisconsin)
Slide14
What are Support Vector Machines (SVMs) Used For?
Classification
Regression and data-fitting
Supervised and unsupervised learning
Slide15
Lake Mendota, Madison, WI
Identify areas of land cover (land, ice, water, snow) in a scene. Two methods:
Scientist manually-derived classification
Support Vector Machine (SVM)
[Figure: visible image, expert-labeled and expert-derived maps, automated ratio, and SVM output for Lake Mendota, Wisconsin]
Classifier accuracy:
Class         Expert Derived   SVM
cloud         45.7%            58.5%
ice           60.1%            80.4%
land          93.6%            94.0%
snow          63.5%            71.6%
water         84.2%            89.1%
unclassified  45.7%            —
Courtesy of Steve Chien of NASA/JPL
Slide16
Linear Classifiers
[Plot of labeled points: one symbol denotes y = +1, the other denotes y = −1]
How would you classify this data?
Slide17
Linear Classifiers (aka Linear Discriminant Functions)
Definition: A function that is a linear combination of the components of the input (column vector) x:
f(x) = wT x + b
where w is the weight (column vector) and b is the bias
A 2-class classifier then uses the rule: decide class c1 if f(x) ≥ 0 and class c2 if f(x) < 0; or, equivalently, decide c1 if wT x ≥ −b and c2 otherwise
Slide18
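The 2-class decision rule can be sketched directly (the example weights and input are made up):

```python
def linear_classify(x, w, b):
    # Decide class c1 if f(x) = w^T x + b >= 0, else c2.
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "c1" if score >= 0 else "c2"

print(linear_classify([2, 1], [1, -1], -0.5))  # score = 2 - 1 - 0.5 = 0.5 -> c1
```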
A planar decision surface in d dimensions is parameterized by (w, b):
w is the plane's normal vector
b sets the plane's offset (the plane's distance from the origin is |b| / ||w||)
Slide19
Slide20
Slide21
Linear Classifiers
[Plot: one symbol denotes y = +1, the other denotes y = −1]
f(x, w, b) = sign(wT x + b)
How would you classify this data?
Slide22
Linear Classifiers
[Plot: one symbol denotes y = +1, the other denotes y = −1]
f(x, w, b) = sign(wT x + b)
How would you classify this data?
Slide23
Linear Classifiers
[Plot: one symbol denotes y = +1, the other denotes y = −1]
f(x, w, b) = sign(wT x + b)
How would you classify this data?
Slide24
Linear Classifiers
[Plot: one symbol denotes y = +1, the other denotes y = −1]
f(x, w, b) = sign(wT x + b)
Any of these would be fine …
… but which is best?
Slide25
Classifier Margin
[Plot: one symbol denotes y = +1, the other denotes y = −1]
f(x, w, b) = sign(wT x + b)
Define the margin of a linear classifier as the width that the decision boundary could be expanded before hitting a data point
Slide26
Maximum Margin
[Plot: one symbol denotes y = +1, the other denotes y = −1]
f(x, w, b) = sign(wT x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin
This is the simplest kind of SVM, called a Linear SVM (LSVM)
Slide27
Maximum Margin
[Plot: one symbol denotes y = +1, the other denotes y = −1]
f(x, w, b) = sign(wT x + b)
The maximum margin linear classifier is the linear classifier with the maximum margin
This is the simplest kind of SVM (called an LSVM)
Support Vectors are those data points that the margin pushes against
Slide28
Why the Maximum Margin?
[Plot: one symbol denotes +1, the other denotes −1; f(x, w, b) = sign(wT x + b)]
Intuitively this feels safest
If we've made a small error in the location of the boundary (it's been jolted in its perpendicular direction), this gives us the least chance of causing a misclassification
Robust to outliers, since the model is immune to change/removal of any non-support-vector data points
There's some theory (using “VC dimension”) that is related to (but not the same as) the proposition that this is a good thing
Empirically it works very well
Slide29
Specifying a Line and Margin
How do we represent this mathematically … in d input dimensions? An example: x = (x1, …, xd)T
[Diagram: Plus-Plane, Minus-Plane, Classifier Boundary, “Predict Class = +1” zone, “Predict Class = −1” zone]
Slide30
Specifying a Line and Margin
Plus-plane: wT x + b = +1
Minus-plane: wT x + b = −1
Classify as:
+1 if wT x + b ≥ 1
−1 if wT x + b ≤ −1
?  if −1 < wT x + b < 1
[Diagram: the planes wT x + b = 1, 0, −1 between the “Predict Class = +1” and “Predict Class = −1” zones]
Weight vector: w = (w1, …, wd)T
Bias or threshold: b
The dot product wT x is a scalar: (proportional to) x's projection onto w
Slide31
Computing the Margin
Plus-plane: wT x + b = +1; Minus-plane: wT x + b = −1
Claim: The vector w is perpendicular to the Plus-Plane and the Minus-Plane
[Diagram: the planes wT x + b = 1, 0, −1, with w normal to them; M = Margin (width)]
How do we compute M in terms of w and b?
Slide32
Computing the Margin
Plus-plane: wT x + b = +1; Minus-plane: wT x + b = −1
Claim: The vector w is perpendicular to the Plus-Plane. Why?
Let u and v be two vectors on the Plus-Plane. What is wT (u − v)?
(It is (1 − b) − (1 − b) = 0, so w is orthogonal to every direction lying in the plane)
And so, of course, the vector w is also perpendicular to the Minus-Plane
How do we compute M in terms of w and b?
Slide33
Computing the Margin
Plus-plane: wT x + b = +1; Minus-plane: wT x + b = −1
The vector w is perpendicular to the Plus-Plane
Let x− be any point on the Minus-Plane (any location in Rd; not necessarily a data point)
Let x+ be the closest Plus-Plane point to x−
How do we compute M in terms of w and b?
Slide34
Computing the Margin
Plus-plane: wT x + b = +1; Minus-plane: wT x + b = −1
The vector w is perpendicular to the Plus-Plane
Let x− be any point on the Minus-Plane; let x+ be the closest Plus-Plane point to x−
Claim: x+ = x− + λw for some value of λ
Slide35
Computing the Margin
Plus-plane: wT x + b = +1; Minus-plane: wT x + b = −1
Let x− be any point on the Minus-Plane; let x+ be the closest Plus-Plane point to x−
Claim: x+ = x− + λw for some value of λ. Why?
The line from x− to x+ is perpendicular to the planes, so to get from x− to x+, travel some distance in direction w
Slide36
Computing the Margin
What we know:
wT x+ + b = +1
wT x− + b = −1
x+ = x− + λw
||x+ − x−|| = M
Slide37
Computing the Margin
What we know:
wT x+ + b = +1
wT x− + b = −1
x+ = x− + λw
||x+ − x−|| = M
It's now easy to get M in terms of w and b:
wT (x− + λw) + b = 1
⇒ wT x− + b + λ wT w = 1
⇒ −1 + λ wT w = 1
⇒ λ = 2 / (wT w)
Slide38
Computing the Margin
What we know:
wT x+ + b = +1
wT x− + b = −1
x+ = x− + λw, with λ = 2 / (wT w)
So the margin size is
M = ||x+ − x−|| = ||λw|| = λ ||w|| = 2 ||w|| / (wT w) = 2 / ||w||
Slide39
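The result M = 2 / ||w|| is easy to compute directly (the weight vector here is an arbitrary example):

```python
import math

# For the plus/minus planes w^T x + b = +1 and -1, the margin is M = 2 / ||w||.
w = [3.0, 4.0]
norm_w = math.sqrt(sum(wi * wi for wi in w))  # ||w|| = 5
M = 2 / norm_w
print(M)  # 0.4
```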
Learning the Maximum Margin Classifier
Given a guess of w and b, we can:
Compute whether all data points are in the correct half-planes
Compute the margin
So now we just need to write a program to search the space of w's and b's to find the widest margin that correctly classifies all the data points. How?
[Diagram: planes wT x + b = 1, 0, −1 and margin M]
Slide40
from
Statnikov et al.Slide41
from
Statnikov et al.Slide42
SVM as Constrained Optimization
Unknowns: w, b
Objective function: maximize the margin M = 2 / ||w||, which is equivalent to minimizing ||w|| or ||w||2 = wT w
Given N training points (xk, yk), with yk = +1 or −1, subject to each training point being correctly classified, i.e., subject to yk (wT xk + b) ≥ 1 for all k (N constraints)
This is a quadratic optimization problem (QP), which can be solved efficiently
Slide43
Classification with SVMs
Given a new point x, we can classify it by:
Computing the score wT x + b
Deciding the class based on whether the score is < 0 or > 0
If desired, we can set a confidence threshold t:
Score > t: yes
Score < −t: no
Else: don't know
Sec. 15.1
Slide44
SVMs: More than Two Classes
SVMs can only handle two-class problems. For a k-class problem, split the task into k binary tasks and learn k SVMs:
Class 1 vs. the rest (classes 2 – k)
Class 2 vs. the rest (classes 1, 3 – k)
…
Class k vs. the rest
Then pick the class that puts the point farthest into its positive region
Slide45
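The one-vs-rest decision rule can be sketched as below; the three linear scorers are hypothetical, standing in for trained per-class SVMs:

```python
# Pick the class whose binary scorer pushes the point farthest into
# its positive region, i.e., the highest score f_c(x) = w_c^T x + b_c.
def predict_multiclass(x, scorers):
    scores = {c: f(x) for c, f in scorers.items()}
    return max(scores, key=scores.get)

scorers = {
    "A": lambda x: 1.0 * x[0] - 0.5,
    "B": lambda x: -1.0 * x[0] + 0.2,
    "C": lambda x: 0.1 * x[0],
}
print(predict_multiclass([2.0], scorers))  # "A" (scores: 1.5, -1.8, 0.2)
```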
from Statnikov et al.
[Plots: I vs II & III; III vs I & II]
Slide46
SVMs: Non Linearly-Separable Data
What if the data are not linearly separable?
Slide47
SVMs: Non Linearly-Separable Data
Two approaches:
Allow a few points on the wrong side (slack variables)
Map the data to a higher-dimensional space, and do linear classification there (kernel trick)
Slide48
Non Linearly-Separable Data
Approach 1: Allow a few points on the wrong side (slack variables), i.e., “Soft Margin Classification”
Slide49
[Plot: one symbol denotes +1, the other denotes −1]
What Should We Do?
Slide50
What Should We Do?
[Plot of non-separable +1 / −1 points]
Idea: Find the minimum ||w||2 while minimizing the number of training set errors
Problem: Two things to minimize makes for an ill-defined optimization
Slide51
What Should We Do?
[Plot of non-separable +1 / −1 points]
Idea: Minimize ||w||2 + C (# train errors), where C is a tradeoff “penalty” parameter
There's a serious practical problem with this approach
Slide52
What Should We Do?
Idea: Minimize ||w||2 + C (# train errors), where C is a tradeoff “penalty” parameter
There's a serious practical problem with this approach:
It can't be expressed as a Quadratic Programming problem, so solving it may be too slow
(Also, it doesn't distinguish between disastrous errors and near misses)
Slide53
What Should We Do?
[Plot of non-separable +1 / −1 points]
Minimize ||w||2 + C (distance of all “misclassified points” to their correct place)
Slide54
Choosing C, the Penalty Factor
It is critical to choose a good value for the constant penalty parameter C, because:
C too big means the result is very similar to the LSVM: there is a high penalty for non-separable points, so we may use many support vectors and overfit
C too small means we allow many misclassifications in the training data, so we may underfit
Slide55
Choosing C
from Statnikov et al.
Slide56
Learning Maximum Margin with Noise
Given a guess of w and b, we can:
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume N examples, each (xk, yk) where yk = +1 / −1
[Diagram: planes wx+b = 1, 0, −1 and margin M]
What should our optimization criterion be?
Slide57
Learning Maximum Margin with Noise
Given a guess of w and b, we can:
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume N examples, each (xk, yk) where yk = +1 / −1
What should our optimization criterion be?
Minimize ||w||2 + C Σk εk, where the εk are “slack variables” (e.g., ε2, ε7, ε11 in the figure)
How many constraints will we have? N
What should they be? yk (wT xk + b) ≥ 1 − εk for all k
Note: εk = 0 for points in the correct zone
Slide58
Learning Maximum Margin with Noise
Given a guess of w and b, we can:
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume N data points, each (xk, yk) where yk = +1 / −1
What should our optimization criterion be? Minimize ||w||2 + C Σk εk (slack variables ε2, ε7, ε11, …)
Our original (noiseless data) QP had d + 1 variables: w1, w2, …, wd, and b
Our new (noisy data) QP has d + 1 + N variables: w1, w2, …, wd, b, ε1, …, εN
(d = # input dimensions; N = # examples)
How many constraints will we have? N
What should they be?
wT xk + b ≥ 1 − εk  if yk = +1
wT xk + b ≤ −1 + εk  if yk = −1
Slide59
Learning Maximum Margin with Noise
Given a guess of w and b, we can:
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume N examples, each (xk, yk) where yk = +1 / −1
How many constraints will we have? N
What should they be?
wT xk + b ≥ 1 − εk  if yk = +1
wT xk + b ≤ −1 + εk  if yk = −1
There's a bug in this QP: nothing yet stops the εk from going negative
Slide60
Learning Maximum Margin with Noise
Given a guess of w and b, we can:
Compute the sum of distances of points to their correct zones
Compute the margin width
Assume N examples, each (xk, yk) where yk = +1 / −1
What should our optimization criterion be? Minimize ||w||2 + C Σk εk (the εk are “slack variables”)
How many constraints will we have? 2N
What should they be?
wT xk + b ≥ +1 − εk  if yk = +1
wT xk + b ≤ −1 + εk  if yk = −1
εk ≥ 0  for all k
Slide61
An Equivalent QP
Maximize
Σk αk − ½ Σk Σl αk αl yk yl (xkT xl)
subject to these constraints:
0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define:
w = Σk αk yk xk, and recover b from any support vector
Then classify with:
f(x, w, b) = sign(wT x − b)
N examples, each (xk, yk) where yk = +1 / −1
Slide62
An Equivalent QP
N examples, each (xk, yk) where yk = +1 / −1
Maximize
Σk αk − ½ Σk Σl αk αl yk yl (xkT xl)
subject to these constraints:
0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk xk and classify with f(x, w, b) = sign(wT x − b)
Data points with αk > 0 will be the support vectors, so this sum only needs to be over the support vectors
Slide63
An Equivalent QP
Maximize Σk αk − ½ Σk Σl αk αl yk yl (xkT xl), subject to 0 ≤ αk ≤ C for all k and Σk αk yk = 0
Then define w = Σk αk yk xk and classify with f(x, w, b) = sign(wT x − b)
Data points with αk > 0 will be the support vectors, so this sum only needs to be over the support vectors
Why did I tell you about this equivalent QP?
It's a formulation that QP packages can optimize more quickly
Because of further jaw-dropping developments you're about to learn
Slide64
Non Linearly-Separable Data
Approach 2: Map the data to a higher-dimensional space, and then do linear classification there (called the kernel trick)
Slide65
Suppose we’re in 1 Dimension
What would SVMs do with this data?
[Number line with x = 0 marked]
Slide66
Suppose we’re in 1 Dimension
[Number line at x = 0, with the positive “plane” and negative “plane” marked]
Slide67
Harder 1D Dataset:
Not Linearly-Separable
What can be done about this?
[Number line at x = 0, with the two classes interleaved]
Slide68
Harder 1D Dataset
The Kernel Trick: Preprocess the data, mapping x into a higher-dimensional space, z = Φ(x)
For example: Φ(x) = (x, x2)
Here, Φ maps data from 1D to 2D
In general, map from the d-dimensional input space to a k-dimensional z space
Slide69
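A 1D-to-2D lift of this kind can be sketched as follows; the mapping Φ(x) = (x, x²) and the sample points are illustrative assumptions:

```python
def phi(x):
    # Lift a 1D point into 2D: z = (x, x^2).
    return (x, x * x)

# Points at -2 and 2 (one class) vs 0 (other class) are not separable on
# the line, but after the lift the horizontal line z2 = 1 separates them.
for x, label in [(-2, +1), (0, -1), (2, +1)]:
    z = phi(x)
    assert (z[1] > 1) == (label == +1)
```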
Harder 1D Dataset
The Kernel Trick: Preprocess the data, mapping x into a higher-dimensional space, z = Φ(x)
[Plot: the lifted data with the plus-plane wT Φ(x) + b = +1]
The data is linearly separable in the new space, so use a linear SVM in the new space
Slide70
Another Example
Project examples into some higher-dimensional space where the data is linearly separable, defined by z = Φ(x)
Slide71
CS 540, University of Wisconsin-Madison, C. R. Dyer
Another Example
Project examples into some higher-dimensional space where the data is linearly separable, defined by z = Φ(x)
Slide72
Algorithm
Pick a Φ function
Map each training example, x, into the new higher-dimensional space using z = Φ(x)
Solve the optimization problem using the nonlinearly transformed training examples, z, to obtain a Linear SVM (with or without the slack-variable formulation) defined by w and b
Classify a test example, e, by computing: sign(wT Φ(e) + b)
Slide73
Improving Efficiency
The time complexity of the original optimization formulation depends on the dimensionality, k, of z (k >> d)
We can convert the optimization problem into an equivalent form, called the Dual Form, with time complexity O(N3) that depends on N, the number of training examples, not k
The Dual Form will also allow us to rewrite the mapping functions in Φ in terms of “kernel functions” instead
Slide74
Dual Optimization Problem
Maximize
Σk αk − ½ Σk Σl αk αl yk yl (xkT xl)
subject to these constraints:
0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define:
w = Σk αk yk xk
Then classify with:
f(x, w, b) = sign(wT x − b)
N examples: (xk, yk) where yk = +1 / −1
Slide75
Dual Optimization Problem
N examples: (xk, yk) where yk = +1 / −1
Maximize
Σk αk − ½ Σk Σl αk αl yk yl (xkT xl)   ← note the dot product of two examples
subject to these constraints:
0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk xk and classify with f(x, w, b) = sign(wT x − b)
The αk are new variables; examples with αk > 0 will be the support vectors
Slide76
Algorithm
Compute the N × N matrix Q by computing Qij = yi yj (xiT xj) between all pairs of training examples
Solve the optimization problem to compute αi for i = 1, …, N
Each non-zero αi indicates that example xi is a support vector
Compute w and b
Then classify a test example x with: f(x) = sign(wT x − b)
Slide77
Example
Suppose we have 5 1D data points: x1=1, x2=2, x3=4, x4=5, x5=6, with 1, 2, 6 as class 1 and 4, 5 as class 2, so
y1=1, y2=1, y3=−1, y4=−1, y5=1
And we use a polynomial kernel of degree 2: K(xi, xj) = (xi xj + 1)2
Let C = 100. We first find αi (i = 1, …, 5) by solving the dual QP
Slide78
Example
Using a QP solver, we get α1=0, α2=2.5, α3=0, α4=7.333, α5=4.833
The support vectors are {x2=2, x4=5, x5=6}
The discriminant function is f(x) = 2.5 (2x+1)2 − 7.333 (5x+1)2 + 4.833 (6x+1)2 + b
b is recovered by solving f(2)=1, or f(5)=−1, or f(6)=1 (x2, x4, x5 lie on the margin planes); all give b=9
Slide79
Example
Classification function: f(x) = 2.5 (2x+1)2 − 7.333 (5x+1)2 + 4.833 (6x+1)2 + 9
[Plot: x = 1, 2 → class 1; x = 4, 5 → class 2; x = 6 → class 1]
Slide80
Copyright © 2001, 2003, Andrew W. Moore
Dual Optimization Problem (After Mapping)
Maximize
Σk αk − ½ Σk Σl αk αl yk yl (Φ(xk)T Φ(xl))
subject to these constraints:
0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define:
w = Σk αk yk Φ(xk)
Then classify with:
f(x, w, b) = sign(wT Φ(x) − b)
N examples: (xk, yk) where yk = +1 / −1
Slide81
Dual Optimization Problem (After Mapping)
Maximize
Σk αk − ½ Σk Σl αk αl yk yl (Φ(xk)T Φ(xl))   ← N2 dot products to compute this matrix
subject to these constraints:
0 ≤ αk ≤ C for all k, and Σk αk yk = 0
Then define w = Σk αk yk Φ(xk) and classify with f(x, w, b) = sign(wT Φ(x) − b)
N examples: (xk, yk) where yk = +1 / −1
Slide82
The dual formulation of the optimization problem depends on the input data only through dot products of the form Φ(xi)T · Φ(xj), where xi and xj are two examples
We can compute these dot products efficiently for certain types of Φ, where K(xi, xj) = Φ(xi)T · Φ(xj)
Example: Φ(xi)T · Φ(xj) = (xiT · xj)2 = K(xi, xj)
Since the data only appear as dot products, we do not need to map the data to the higher-dimensional space (using Φ(x)), because we can use the kernel function K instead
Slide83
Slide84
Kernel Functions
A kernel, K(xi, xj), is a dot product in some feature space
A kernel function is a function that can be applied to pairs of input examples to evaluate dot products in some corresponding (possibly infinite-dimensional) feature space
We do not need to compute Φ explicitly
Slide85
What’s Special about a Kernel?
Say one example (in 2D) is s = (s1, s2)
We decide to use a particular mapping into 6D space:
Φ(s)T = (s12, s22, √2 s1s2, √2 s1, √2 s2, 1)
Let another example be t = (t1, t2). Then
Φ(s)T Φ(t) = s12 t12 + s22 t22 + 2 s1s2 t1t2 + 2 s1t1 + 2 s2t2 + 1 = (s1t1 + s2t2 + 1)2 = (sT t + 1)2
So, define the kernel function to be K(s, t) = (sT t + 1)2 = Φ(s)T Φ(t)
We save computation by using the kernel K
Slide86
Some Commonly Used Kernels
Linear kernel: K(xi, xj) = xiT xj
Quadratic kernel: K(xi, xj) = (xiT xj + 1)2
Polynomial kernel of degree d: K(xi, xj) = (xiT xj + 1)d
Radial-Basis Function (Gaussian) kernel: K(xi, xj) = exp(−||xi − xj||2 / σ2)
Many possible kernels; picking a good one is tricky
Hacking with SVMs: create various kernels, hope their space Φ is meaningful, plug them into the SVM, and pick the one with good classification accuracy
A kernel is usually combined with slack variables, because there is no guarantee of linear separability in the new space
Slide87
Slide88
Slide89
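The Gaussian (RBF) kernel from the list above can be sketched directly (the sample points and σ are arbitrary):

```python
import math

def rbf_kernel(xi, xj, sigma=1.0):
    # Radial-basis (Gaussian) kernel: exp(-||xi - xj||^2 / sigma^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / sigma ** 2)

print(rbf_kernel([0, 0], [0, 0]))            # 1.0 for identical points
print(rbf_kernel([0, 0], [3, 4], sigma=5.0)) # exp(-1), since ||xi - xj||^2 = 25
```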
Algorithm
Compute the N × N matrix Q by computing Qij = yi yj K(xi, xj) between all pairs of training points
Solve the optimization problem to compute αi for i = 1, …, N
Each non-zero αi indicates that example xi is a support vector
Compute w and b
Classify a test example x using: f(x) = sign(wT x − b)
Slide90
Common SVM Kernel Functions
zk = (polynomial terms of xk of degree 1 to d)
For example, when d=2 and m=2:
K(x, y) = (x1y1 + x2y2 + 1)2 = 1 + 2x1y1 + 2x2y2 + 2x1x2y1y2 + x12 y12 + x22 y22
zk = (radial basis functions of xk)
zk = (sigmoid functions of xk)
Slide91
Some SVM Kernel Functions
Polynomial Kernel Function: K(xi, xj) = (xiT xj + 1)d
Beyond polynomials, there are other high-dimensional basis functions that can be made practical by finding the right kernel function
Radial-Basis-style Kernel Function: K(xi, xj) = exp(−||xi − xj||2 / (2σ2))
Neural-Net-style Kernel Function: K(xi, xj) = tanh(k xiT xj − δ)
σ, k, and δ are magic parameters that must be chosen by a model selection method
Slide92
Applications of SVMs
Bioinformatics
Machine Vision
Text Categorization
Ranking (e.g., Google searches)
Handwritten Character Recognition
Time series analysis
Lots of very successful applications!
Slide93
Handwritten Digit Recognition
Slide94
Example Application:
The Federalist Papers Dispute
Written in 1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade the citizens of New York to ratify the U.S. Constitution
The papers consisted of short essays, 900 to 3500 words in length
The authorship of 12 of those papers has been in dispute (Madison or Hamilton); these papers are referred to as the disputed Federalist papers
Slide95
Description of the Data
For every paper:
Computed relative frequencies of 70 words that Mosteller-Wallace identified as good candidates for author-attribution studies
Each document is represented as a “bag of words” vector containing the 70 real numbers corresponding to the 70 word frequencies
The dataset consists of 118 papers: 50 Madison papers, 56 Hamilton papers, and 12 disputed papers
Slide96
70-Word Dictionary
Slide97
Feature Selection for Classifying the Disputed Federalist Papers
Apply the SVM algorithm for feature selection to:
Train on the 106 Federalist papers with known authors
Find a classification hyperplane (LSVM) that uses as few words as possible
Use the hyperplane to classify the 12 disputed papers
Slide98
Hyperplane Classifier Using 3 Words
A hyperplane depending on three words was found:
0.537 to + 24.663 upon + 2.953 would = 66.616
All disputed papers ended up on the Madison side of the plane
Slide99
Results: 3D Plot of Hyperplane
Slide100
SVM Applet
http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml
Slide101
Summary
Learning linear functions:
Pick the separating hyperplane that maximizes the margin
The separating plane is defined in terms of the support vectors (a small number of training examples) only
Learning non-linear functions:
Project examples into a higher-dimensional space
Use kernel functions for efficiency
Generally avoids the overfitting problem
Global optimization method; no local optima
Can be expensive to apply, especially for multi-class problems
Biggest drawback: the choice of kernel function
There is no “set-in-stone” theory for choosing a kernel function for any given problem
Once a kernel function is chosen, there is only ONE modifiable parameter, the error penalty C
Slide102
Software
A list of SVM implementations can be found at http://www.kernel-machines.org/software.html
Some implementations (such as LIBSVM) can handle multi-class classification
SVMLight is one of the earliest and most frequently used implementations of SVMs
Several Matlab toolboxes for SVMs are available
Slide103
Quiz: What is This?
“The Next Rembrandt,” a computer-generated, 3-D-printed painting made using data from the artist's real works, by Microsoft et al.
Slide104
“The final portrait was created through a highly detailed and complex process which took over 18 months and used 150 gigabytes of digitally rendered graphics. This started with the analysis of all 346 of Rembrandt's paintings using high-resolution 3D scans and digital files, which were upscaled using machine learning. It was possible to generate typical features and, using an algorithm that detects over 60 points in a painting, determine the distance between these on the subject's face.”
Slide105
Who are the “robots” appearing in fiction and myths by Mary Shelley, Rabbi Judah Loew, and Pygmalion? Answer: Frankenstein, Golem, Galatea (aka Elise, Eliza)
Which one of the following movies contains an AI agent that underwent a learning phase on screen? With criminals for mentors, it included stabbing a few police officers: Andrew (Bicentennial Man), HAL (2001), Chappie (Chappie), or Agent Smith (Matrix)? Answer: Chappie