Presentation Transcript

Slide1

Boosting Approach to ML

Maria-Florina Balcan

03/18/2015

Perceptron, Margins, Kernels

Slide2

Recap from last time: Boosting

Works by creating a series of challenge datasets s.t. even modest performance on these can be used to produce an overall high-accuracy predictor.

Works amazingly well in practice.

Adaboost is one of the top 10 ML algorithms.

General method for improving the accuracy of any given learning algorithm.

Backed up by solid foundations.

Slide3

Adaboost (Adaptive Boosting)

Input: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$; $x_i \in X$, $y_i \in \{-1, +1\}$;

weak learning algo A (e.g., Naïve Bayes, decision stumps)

For t=1,2, …, T

Construct $D_t$ on $\{x_1, \ldots, x_m\}$

Run A on $D_t$, producing weak classifier $h_t : X \to \{-1,+1\}$ with error $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$

Output $H_{\mathrm{final}}(x) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t h_t(x)\big)$

$D_1$ uniform on $\{x_1, \ldots, x_m\}$ [i.e., $D_1(i) = 1/m$]

Given $D_t$ and $h_t$, set

$D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, e^{-\alpha_t}$ if $y_i = h_t(x_i)$

$D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, e^{\alpha_t}$ if $y_i \neq h_t(x_i)$

[i.e., $D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, e^{-\alpha_t y_i h_t(x_i)}$]

with $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t} > 0$ and $Z_t$ a normalizer, so that $D_{t+1}$ puts half of its weight on examples where $h_t$ is incorrect and half on examples where $h_t$ is correct.

[Figure: positively and negatively labeled training examples being reweighted across rounds.]

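To make the update concrete, here is a minimal Python/NumPy sketch of the loop above (my illustration, not part of the slides). The `weak_learn` argument stands in for the weak learning algorithm A (e.g., a decision stump trainer) and its interface is an assumption of this sketch.

```python
import numpy as np

def adaboost(X, y, weak_learn, T):
    """Minimal AdaBoost sketch. `y` is an array with entries in {-1, +1};
    weak_learn(X, y, D) must return a callable h with h(X) in {-1, +1}."""
    y = np.asarray(y)
    m = len(y)
    D = np.full(m, 1.0 / m)                      # D_1: uniform over the m examples
    hs, alphas = [], []
    for t in range(T):
        h = weak_learn(X, y, D)                  # run A on D_t, producing h_t
        pred = h(X)
        eps = float(D[pred != y].sum())          # epsilon_t = Pr_{i ~ D_t}[h_t(x_i) != y_i]
        eps = min(max(eps, 1e-12), 1 - 1e-12)    # guard against log(0) / division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = (1/2) ln((1 - eps_t) / eps_t)
        D = D * np.exp(-alpha * y * pred)        # up-weight mistakes, down-weight correct ones
        D = D / D.sum()                          # normalize by Z_t
        hs.append(h)
        alphas.append(alpha)
    def H_final(X_new):
        """Weighted-majority vote sign(sum_t alpha_t h_t(x))."""
        return np.sign(sum(a * h(X_new) for a, h in zip(alphas, hs)))
    return H_final
```
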
Slide4

Nice Features of Adaboost

Very general: a meta-procedure, it can use any weak learning algorithm (e.g., Naïve Bayes, decision stumps)!!!

Very fast (single pass through data each round) & simple to code, no parameters to tune.

Grounded in rich theory.

Slide5

Analyzing Training Error

Theorem (error of $H_{\mathrm{final}}$ over $S$): writing $\epsilon_t = 1/2 - \gamma_t$,

$err_S(H_{\mathrm{final}}) \le \prod_t \big[2\sqrt{\epsilon_t(1-\epsilon_t)}\big] = \prod_t \sqrt{1 - 4\gamma_t^2} \le \exp\!\big(-2\sum_t \gamma_t^2\big)$

So, if $\gamma_t \ge \gamma$ for all t, then $err_S(H_{\mathrm{final}}) \le e^{-2\gamma^2 T}$.

Adaboost is adaptive: it does not need to know $\gamma$ or T a priori, and it can exploit rounds where $\gamma_t \gg \gamma$.

The training error drops exponentially in T!!!

To get $err_S(H_{\mathrm{final}}) \le \epsilon$, need only $T = O\!\big(\frac{1}{\gamma^2}\log\frac{1}{\epsilon}\big)$ rounds.

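As a quick numerical illustration of the exponential drop (my example, not from the slides), the bound above can be evaluated directly; the edge $\gamma = 0.1$ below is just an illustrative choice.

```python
import math

def rounds_needed(gamma, eps):
    """Smallest T with exp(-2 * gamma^2 * T) <= eps, per the training-error bound."""
    return math.ceil(math.log(1.0 / eps) / (2.0 * gamma ** 2))

# Example: weak learners only 10% better than random guessing (gamma = 0.1).
for eps in (0.1, 0.01, 0.001):
    print(eps, rounds_needed(0.1, eps))   # 116, 231, 346 rounds
```
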
Slide6

Generalization Guarantees

How about generalization guarantees? Original analysis [Freund&Schapire'97].

H = space of weak hypotheses; d = VCdim(H).

$H_{\mathrm{final}}$ is a weighted vote, so the hypothesis class is:

G = {all fns of the form $\mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t h_t(x)\big)$}

Theorem [Freund&Schapire'97]: $\forall g \in G$, $err(g) \le err_S(g) + \tilde{O}\!\Big(\sqrt{\tfrac{Td}{m}}\Big)$, where T = # of rounds.

Key reason: $VCdim(G) = \tilde{O}(dT)$, plus typical VC bounds.

Slide7

Generalization Guarantees

Theorem [Freund&Schapire'97]: $err(g) \le err_S(g) + \tilde{O}\!\Big(\sqrt{\tfrac{Td}{m}}\Big)$, where d = VCdim(H) and T = # of rounds.

generalization error ≤ train error + complexity term (which grows with T).

Slide8

Generalization Guarantees

Experiments showed that the test error of the generated classifier usually does not increase as its size becomes very large.

Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!

Slide9

Generalization Guarantees

Experiments showed that the test error of the generated classifier usually does not increase as its size becomes very large.

Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!

These results seem to contradict the FS'97 bound and Occam's razor (in order to achieve good test error, the classifier should be as simple as possible)!

Slide10

How can we explain the experiments?

Key Idea: R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee present a nice theoretical explanation in "Boosting the margin: A new explanation for the effectiveness of voting methods".

Training error does not tell the whole story. We also need to consider the classification confidence!!

Slide11

Boosting didn't seem to overfit…(!)

…because it turned out to be increasing the margin of the classifier.

[Figure: error curve and margin distribution graph, plots from [SFBL98], showing train error, test error, and the test error of the base classifier (weak learner).]

Slide12

Classification Margin

H = space of weak hypotheses. The convex hull of H:

$co(H) = \big\{ f = \sum_{t=1}^{T} \alpha_t h_t : \alpha_t \ge 0, \sum_t \alpha_t = 1, h_t \in H \big\}$

Let $f \in co(H)$. The majority vote rule $H_f$ given by $f$ (given by $H_f(x) = \mathrm{sign}(f(x))$) predicts wrongly on example $(x,y)$ iff $y f(x) \le 0$.

Definition: the margin of $f$ (or of $H_f$) on example $(x,y)$ is $y f(x)$.

The margin is positive iff $H_f(x) = y$. See $|y f(x)|$ as the strength or the confidence of the vote.

[Scale: margin $-1$ = high confidence, incorrect; margin near $0$ = low confidence; margin $+1$ = high confidence, correct.]
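A small sketch (not from the slides) of computing this vote margin for a normalized weighted vote; the `hs`/`alphas` names mirror the hypothetical AdaBoost sketch given earlier and are assumptions of this illustration.

```python
import numpy as np

def vote_margin(x, y, hs, alphas):
    """Margin y*f(x) of the normalized vote f = sum_t (alpha_t / sum(alphas)) * h_t."""
    a = np.asarray(alphas, dtype=float)
    a = a / a.sum()                        # normalize so the vote lies in co(H)
    f_x = sum(w * h(x) for w, h in zip(a, hs))
    return y * f_x                         # positive iff the vote classifies (x, y) correctly
```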

Slide13

Boosting and Margins

Theorem: if VCdim(H) = d, then with prob. $\ge 1-\delta$, $\forall f \in co(H)$, $\forall \theta > 0$,

$err(H_f) \le \Pr_S\big[y f(x) \le \theta\big] + \tilde{O}\!\Big(\sqrt{\tfrac{d}{m\,\theta^2}}\Big)$

Note: the bound does not depend on T (the # of rounds of boosting); it depends only on the complexity of the weak hypothesis space and the margin!

Slide14

Boosting and Margins

If all training examples have large margins, then we can approximate the final classifier by a much smaller classifier.

Can use this to prove that better margin ⇒ smaller test error, regardless of the number of weak classifiers.

Can also prove that boosting tends to increase the margin of training examples by concentrating on those of smallest margin.

Although the final classifier is getting larger, margins are likely to be increasing, so the final classifier is actually getting closer to a simpler classifier, driving down test error.

Theorem: if VCdim(H) = d, then with prob. $\ge 1-\delta$, $\forall f \in co(H)$, $\forall \theta > 0$,

$err(H_f) \le \Pr_S\big[y f(x) \le \theta\big] + \tilde{O}\!\Big(\sqrt{\tfrac{d}{m\,\theta^2}}\Big)$

Slide15

Boosting and Margins

Theorem: if VCdim(H) = d, then with prob. $\ge 1-\delta$, $\forall f \in co(H)$, $\forall \theta > 0$,

$err(H_f) \le \Pr_S\big[y f(x) \le \theta\big] + \tilde{O}\!\Big(\sqrt{\tfrac{d}{m\,\theta^2}}\Big)$

Note: the bound does not depend on T (the # of rounds of boosting); it depends only on the complexity of the weak hypothesis space and the margin!

Slide16

Boosting, Adaboost Summary

Shift in mindset: the goal is now just to find classifiers a bit better than random guessing.

Relevant for the big data age: quickly focuses on "core difficulties", so well-suited to distributed settings, where data must be communicated efficiently [Balcan-Blum-Fine-Mansour COLT'12].

Backed up by solid foundations.

Adaboost and its variations work well in practice with many kinds of data (one of the top 10 ML algos).

More about classic applications in Recitation.

Slide17

Interestingly, the usefulness of margin has been recognized in Machine Learning since the late 50's.

Perceptron [Rosenblatt'57] analyzed via the geometric (aka $\ell_2, \ell_2$) margin.

Original guarantee in the online learning scenario.

Slide18

The Perceptron Algorithm

Online Learning Model

Margin Analysis

Kernels

Slide19

The Online Learning Model

Mistake bound model.

Examples arrive sequentially.

We need to make a prediction. Afterwards we observe the outcome.

Online Algorithm: For i=1, 2, …:

Phase i: observe example $x_i$, make prediction $h(x_i)$, then observe the true label.

Analysis-wise, make no distributional assumptions.

Goal: Minimize the number of mistakes.
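A minimal sketch of this protocol (my illustration, not from the slides); `learner.predict` and `learner.update` are hypothetical method names for whatever online learner is plugged in.

```python
def online_protocol(stream, learner):
    """Run the mistake-bound protocol: predict, observe the label, count mistakes."""
    mistakes = 0
    for x, y in stream:              # examples arrive sequentially; no distributional assumptions
        y_hat = learner.predict(x)
        if y_hat != y:
            mistakes += 1
            learner.update(x, y)     # Perceptron-style learners update only on mistakes
    return mistakes
```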

Slide20

The Online Learning Model: Motivation

- Email classification (the distribution of both spam and regular mail changes over time, but the target function stays fixed - last year's spam still looks like spam).

- Ad placement in a new market.

- Recommendation systems: recommending movies, etc.

- Predicting whether a user will be interested in a new news article or not.

Slide21

Linear Separators

Instance space $X = R^d$.

Hypothesis class of linear decision surfaces in $R^d$:

$h(x) = w \cdot x + w_0$; if $h(x) \ge 0$, then label x as +, otherwise label it as -.

Claim: WLOG $w_0 = 0$.

Proof: Can simulate a non-zero threshold with a dummy input feature $x_0$ that is always set to 1:

$w \cdot x + w_0 \ge 0$ iff $(w_1, \ldots, w_d, w_0) \cdot (x_1, \ldots, x_d, 1) \ge 0$, where $w = (w_1, \ldots, w_d)$.

[Figure: positively (X) and negatively (O) labeled points separated by a hyperplane with normal vector w.]

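A tiny sketch (my illustration, not the slides') of this dummy-feature trick: appending a constant 1 to every instance lets the last weight coordinate play the role of the threshold.

```python
import numpy as np

def add_dummy_feature(X):
    """Append a constant-1 coordinate so sign(w.x + w0) becomes sign(w'.x') with no threshold."""
    X = np.atleast_2d(X)
    return np.hstack([X, np.ones((X.shape[0], 1))])

# sign(w.x + w0) on the original point equals sign((w, w0).x') on the augmented point x'.
```
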
Slide22

Linear Separators: Perceptron Algorithm

Set t=1, start with the all-zero vector $w_1$.

Given example $x$, predict positive iff $w_t \cdot x \ge 0$.

On a mistake, update as follows:

Mistake on positive, then update $w_{t+1} \leftarrow w_t + x$.

Mistake on negative, then update $w_{t+1} \leftarrow w_t - x$.

Note: $w_t$ is a weighted sum of the incorrectly classified examples, $w_t = a_{i_1} x_{i_1} + \dots + a_{i_k} x_{i_k}$, so $w_t \cdot x = a_{i_1}(x_{i_1} \cdot x) + \dots + a_{i_k}(x_{i_k} \cdot x)$.

Important when we talk about kernels.
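A minimal NumPy sketch of exactly this update rule on a single pass over a labeled sequence (my illustration, not part of the slides); ties go to positive, following the "predict positive iff $w \cdot x \ge 0$" rule above.

```python
import numpy as np

def perceptron(examples):
    """Online Perceptron. `examples` is a sequence of (x, y) with y in {-1, +1}.
    Returns the final weight vector and the number of mistakes made."""
    examples = [(np.asarray(x, dtype=float), y) for x, y in examples]
    w = np.zeros_like(examples[0][0])      # start with the all-zero vector
    mistakes = 0
    for x, y in examples:
        pred = 1 if w @ x >= 0 else -1     # predict positive iff w . x >= 0
        if pred != y:
            w = w + y * x                  # +x on a positive mistake, -x on a negative mistake
            mistakes += 1
    return w, mistakes
```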

Slide23

Perceptron Algorithm: Example

[Figure: a worked example of the Perceptron processing a small sequence of positively and negatively labeled points, updating the weight vector on each mistake (mistakes marked with X).]

Algorithm:

Set t=1, start with the all-zeroes weight vector $w_1$.

Given example $x$, predict positive iff $w_t \cdot x \ge 0$.

On a mistake, update as follows:

Mistake on positive, update $w_{t+1} \leftarrow w_t + x$.

Mistake on negative, update $w_{t+1} \leftarrow w_t - x$.

Slide24

Geometric Margin

Definition: The margin of example $x$ w.r.t. a linear separator $w$ is the distance from $x$ to the plane $w \cdot x = 0$ (or the negative if on the wrong side).

[Figure: the margin $\gamma_1$ of a positive example $x_1$ and the margin $\gamma_2$ of a negative example $x_2$, measured as distances to the hyperplane with normal w.]

Slide25

Geometric Margin

Definition: The margin of example $x$ w.r.t. a linear separator $w$ is the distance from $x$ to the plane $w \cdot x = 0$ (or the negative if on the wrong side).

Definition: The margin $\gamma_w$ of a set of examples $S$ w.r.t. a linear separator $w$ is the smallest margin over points $x \in S$.

[Figure: a separator w and its margin $\gamma_w$ on a set of positive and negative points.]

Slide26

Geometric Margin

Definition: The margin of example $x$ w.r.t. a linear separator $w$ is the distance from $x$ to the plane $w \cdot x = 0$ (or the negative if on the wrong side).

Definition: The margin $\gamma_w$ of a set of examples $S$ w.r.t. a linear separator $w$ is the smallest margin over points $x \in S$.

Definition: The margin $\gamma$ of a set of examples $S$ is the maximum $\gamma_w$ over all linear separators $w$.

[Figure: the maximum-margin separator w for a set of positive and negative points, with margin $\gamma$.]

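A short sketch (not from the slides) computing these quantities for a given separator through the origin, using the signed-distance formula $y \,(w \cdot x)/\|w\|$.

```python
import numpy as np

def example_margin(x, y, w):
    """Signed distance of (x, y) to the plane w.x = 0: positive iff correctly classified."""
    return y * np.dot(w, x) / np.linalg.norm(w)

def set_margin(S, w):
    """Margin of a labeled set S = [(x, y), ...] w.r.t. w: the smallest example margin."""
    return min(example_margin(x, y, w) for x, y in S)
```
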
Slide27

Perceptron: Mistake Bound

Theorem: If the data has margin $\gamma$ and all points lie inside a ball of radius $R$, then Perceptron makes $\le (R/\gamma)^2$ mistakes.

(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algo is invariant to scaling.)

[Figure: positive and negative points inside a ball of radius R, separated by the max-margin separator w* with margin $\gamma$.]

Slide28

Perceptron Algorithm: Analysis

Theorem: If the data has margin $\gamma$ and all points lie inside a ball of radius $R$, then Perceptron makes $\le (R/\gamma)^2$ mistakes.

Update rule: Mistake on positive: $w_{t+1} \leftarrow w_t + x$. Mistake on negative: $w_{t+1} \leftarrow w_t - x$.

Proof: Idea: analyze $w_t \cdot w^*$ and $\|w_t\|$, where $w^*$ is the max-margin separator with $\|w^*\| = 1$.

Claim 1: $w_{t+1} \cdot w^* \ge w_t \cdot w^* + \gamma$ (because $y\,(x \cdot w^*) \ge \gamma$).

Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + R^2$ (by the Pythagorean Theorem, since on a mistake $y\,(w_t \cdot x) \le 0$).

After M mistakes:

$w_{M+1} \cdot w^* \ge M\gamma$ (by Claim 1)

$\|w_{M+1}\| \le R\sqrt{M}$ (by Claim 2)

$w_{M+1} \cdot w^* \le \|w_{M+1}\|$ (since $w^*$ is unit length)

So, $M\gamma \le R\sqrt{M}$, so $M \le (R/\gamma)^2$.

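As an illustrative check of the bound (my sketch, not part of the lecture), one can generate linearly separable data with a known margin and compare the Perceptron's mistake count against $(R/\gamma)^2$, cycling through the data until no mistakes remain (as the consistent-separator extension on the next slide suggests); the target separator `w_star` and the values of R and $\gamma$ are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, 1.0]) / np.sqrt(2)        # unit-length target separator

# Sample points in a ball of radius R, keeping only those with margin at least gamma.
R, gamma, pts = 10.0, 0.5, []
while len(pts) < 200:
    x = rng.uniform(-R, R, size=2)
    if np.linalg.norm(x) <= R and abs(x @ w_star) >= gamma:
        pts.append((x, 1 if x @ w_star >= 0 else -1))

w = np.zeros(2)
mistakes = 0
changed = True
while changed:                                     # cycle through the data until consistent
    changed = False
    for x, y in pts:
        if (1 if w @ x >= 0 else -1) != y:
            w = w + y * x
            mistakes += 1
            changed = True

print(mistakes, "<=", (R / gamma) ** 2)            # mistake count vs. the (R/gamma)^2 bound
```
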
Slide29

Perceptron Extensions

Can use it to find a consistent separator (by cycling through the data).

One can convert the mistake bound guarantee into a distributional guarantee too (for the case where the $x_i$'s come from a fixed distribution).

Can be adapted to the case where there is no perfect separator as long as the so-called hinge loss (i.e., the total distance needed to move the points to classify them correctly with a large margin) is small.

Can be kernelized to handle non-linear decision boundaries!

Slide30

Perceptron Discussion

Simple online algorithm for learning linear separators with a nice guarantee that depends only on the geometric (aka $\ell_2, \ell_2$) margin.

Simple, but very useful in applications like branch prediction; it also has interesting extensions to structured prediction.

It can be kernelized to handle non-linear decision boundaries --- see next class!