Mistake Bounds
William W. Cohen
One simple way to look for interactions

Naïve Bayes, two-class version: keep a dense vector of g(x,y) scores, one for each word in the vocabulary. Scan through the data: whenever we see x with y, we increase g(x,y) - g(x,~y); whenever we see x with ~y, we decrease g(x,y) - g(x,~y). We do this regardless of whether it seems to help or not on the data, so if there are duplications, the weights will become arbitrarily large.

To detect interactions: increase or decrease g(x,y) - g(x,~y) only if we need to (for that example); otherwise, leave it unchanged.
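A minimal sketch of the contrast in Python (the names and the (words, label) data layout are mine, not from the slides): the counting update changes the weights on every occurrence, while the mistake-driven update touches them only when the current scores get the example wrong.

```python
from collections import defaultdict

# Counting update: weights move on every example, helpful or not, so
# duplicated examples push them toward +/- infinity.
def naive_updates(examples):
    # examples: list of (list_of_words, label) pairs with labels in {+1, -1};
    # g[w] plays the role of g(x,y) - g(x,~y) for word w
    g = defaultdict(float)
    for words, y in examples:
        for w in words:
            g[w] += y
    return g

# Mistake-driven update: change the weights only if we need to for this
# example; otherwise leave them unchanged.
def mistake_driven_updates(examples):
    g = defaultdict(float)
    for words, y in examples:
        score = sum(g[w] for w in words)
        if score * y <= 0:            # current weights misclassify this example
            for w in words:
                g[w] += y
    return g
```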
Online Learning

[Diagram: the learner B receives an instance x_i from the train data, computes the prediction ŷ_i = v_k · x_i, then sees the true label y_i ∈ {+1, -1}. If mistake: v_{k+1} = v_k + correction.]

To detect interactions: increase or decrease v_k only if we need to (for that example); otherwise, leave it unchanged. We can be sensitive to duplication by stopping updates when we get better performance.
Theory: the prediction game

Player A picks a “target concept” c, for now from a finite set of possibilities C (e.g., all decision trees of size m). Then, for t = 1, 2, ...:
Player A picks x = (x1, ..., xn) and sends it to B (for now, from a finite set of possibilities, e.g., all binary vectors of length n).
B predicts a label, ŷ, and sends it to A.
A sends B the true label y = c(x), and we record whether B made a mistake or not.

We care about the worst-case number of mistakes B will make over all possible concepts and training sequences of any length. The “mistake bound” for B, MB(C), is this bound.
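The game is easy to state as a loop. A minimal sketch (the function names are mine; predict and update stand for whatever algorithm B runs):

```python
# A fixes a target concept c from C; each round, A sends an instance,
# B predicts, A reveals the true label, and we count B's mistakes.
# The mistake bound MB(C) is the worst case of this count over all
# concepts in C and all instance sequences.
def prediction_game(concept, instances, predict, update):
    mistakes = 0
    for x in instances:
        y_hat = predict(x)     # B sends its prediction to A
        y = concept(x)         # A sends back the true label y = c(x)
        if y_hat != y:
            mistakes += 1
        update(x, y)           # B may revise its hypothesis
    return mistakes
```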
The prediction game

Are there practical algorithms where we can compute the mistake bound?
The perceptron

[Diagram: A sends an instance x_i to B; B computes the prediction ŷ_i = v_k · x_i and sends it to A; A sends back the true label y_i. If mistake: v_{k+1} = v_k + y_i x_i.]
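The whole algorithm is the diagram's update rule in a loop. A minimal dense-vector sketch (my code; labels are assumed to be in {+1, -1}):

```python
import numpy as np

def perceptron(examples, n_features):
    # examples: iterable of (x, y) with x a dense numpy vector of length
    # n_features and y in {+1, -1}
    v = np.zeros(n_features)
    for x, y in examples:
        y_hat = 1 if v.dot(x) >= 0 else -1   # ŷ_i = sign(v_k · x_i)
        if y_hat != y:                        # mistake
            v = v + y * x                     # v_{k+1} = v_k + y_i x_i
    return v
```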
[Figure: a target u with margin γ, and the perceptron's guesses after successive mistakes. (1) A target u. (2) The guess v1 after one positive example x1. (3a) The guess v2 after the two positive examples: v2 = v1 + x2. (3b) The guess v2 after the one positive and one negative example: v2 = v1 - x2. Each mistake applies v_{k+1} = v_k + y_i x_i. The same figure is repeated on the next two slides as the convergence argument is developed; one annotation marks the per-mistake gain along u as > γ.]
Summary

We have shown that if there exists a u with unit norm that has margin γ on the examples in the sequence (x1,y1), (x2,y2), ..., then the perceptron algorithm makes < R²/γ² mistakes on the sequence (where R >= ||xi||). This is independent of the dimension of the data or the classifier (!), and it doesn't follow from M(C) <= VCDim(C).

We don't know if this algorithm could be better. There are many variants that rely on similar analyses (ROMMA, Passive-Aggressive, MIRA, ...). We don't know what happens if the data's not separable (unless I explain the “Δ trick” to you). We don't know what classifier to use “after” training.
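For reference, a sketch of the standard two-step argument behind the bound (it is what the repeated figure develops geometrically; the algebra is not spelled out on these slides):

```latex
\begin{align*}
\text{On each mistake: } & v_{k+1} = v_k + y_i x_i, \text{ and } y_i\,(v_k \cdot x_i) \le 0.\\
\text{Progress along } u\text{: } & u \cdot v_{k+1} = u \cdot v_k + y_i\,(u \cdot x_i) \ge u \cdot v_k + \gamma.\\
\text{Bounded growth: } & \|v_{k+1}\|^2 = \|v_k\|^2 + 2\,y_i\,(v_k \cdot x_i) + \|x_i\|^2 \le \|v_k\|^2 + R^2.\\
\text{After } k \text{ mistakes, from } v_0 = 0\text{: } & k\gamma \le u \cdot v_k \le \|u\|\,\|v_k\| \le \sqrt{k}\,R
\;\Longrightarrow\; k \le R^2/\gamma^2.
\end{align*}
```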
On-line to batch learning

Pick a v_k at random according to m_k/m, the fraction of examples it was used for. Predict using the v_k you just picked. (Actually, use some sort of deterministic approximation to this.)
Complexity of perceptron learning

Algorithm:
    v = 0
    for each example x, y:
        if sign(v.x) != y:
            v = v + y x

Implementation: init a hashtable for v; a mistake updates only the nonzero features (for xi != 0: vi += y xi). One scan over the n examples is O(n) updates, and each update costs O(|x|) = O(|d|), the size of the example.
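A runnable sketch of the sparse version (my code; each example is assumed to be a ({feature: value}, label) pair with labels in {+1, -1}):

```python
from collections import defaultdict

def sparse_perceptron(examples):
    v = defaultdict(float)                             # init hashtable
    for x, y in examples:                              # one scan: O(n) examples
        score = sum(v[i] * xi for i, xi in x.items())  # v.x over nonzeros only
        if (1 if score >= 0 else -1) != y:             # sign(v.x) != y
            for i, xi in x.items():                    # for xi != 0:
                v[i] += y * xi                         #   vi += y*xi, O(|x|) = O(|d|)
    return v
```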
Complexity of averaged perceptron

Algorithm:
    vk = 0; va = 0
    for each example x, y:
        if sign(vk.x) != y:
            va = va + vk
            vk = vk + y x
            mk = 1
        else:
            mk++

Implementation: init hashtables for vk and va. The update vk = vk + y x is sparse (for xi != 0: vki += y xi, which is O(|x|) = O(|d|)), but va = va + vk touches every nonzero weight (for vki != 0: vai += vki, which is O(|V|) per mistake). One scan is O(n) examples, so the va updates cost O(n|V|) in the worst case.
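A runnable sketch following the pseudocode above (my code; note that common implementations weight each vk by its survival count mk before folding it into va, whereas the pseudocode adds vk in once per mistake):

```python
from collections import defaultdict

def averaged_perceptron(examples):
    vk = defaultdict(float)   # current hypothesis
    va = defaultdict(float)   # accumulated hypotheses
    mk = 0                    # examples the current vk has been used for
    for x, y in examples:
        score = sum(vk[i] * xi for i, xi in x.items())
        if (1 if score >= 0 else -1) != y:
            for i, wi in vk.items():    # va = va + vk: O(|V|) per mistake
                va[i] += wi
            for i, xi in x.items():     # vk = vk + y x: O(|x|)
                vk[i] += y * xi
            mk = 1
        else:
            mk += 1                     # mk also drives the m_k/m rule above
    return va
```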
Parallelizing perceptrons

[Figure, shown in three variants: the instances/labels are split into example subsets 1-3, and the vk's (or vk/va pairs) are computed on the subsets in parallel. The first variant asks how to combine the per-subset results (“combine somehow?”); the second combines them into a single vk/va; the third instead synchronizes the per-subset vk/va's with messages during training.]
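One concrete reading of “combine somehow” is parameter mixing: train an independent perceptron on each subset and average the weights. A minimal sketch (my naming; this is a heuristic, since updates within a shard depend on that shard's history, so mixing does not reproduce the sequential run):

```python
import numpy as np

def parameter_mix(shards, n_features, train):
    # shards: list of example subsets; train: a sequential perceptron
    # trainer, e.g. the dense-vector sketch shown earlier
    vks = [train(shard, n_features) for shard in shards]  # parallelizable
    return np.mean(vks, axis=0)                           # combine: average
```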
Review/outline

How to implement Naïve Bayes: time is linear in the size of the data (one scan!); we need to count C(X=word ^ Y=label).

Can you parallelize Naïve Bayes? Trivial solution 1: split the data up into multiple subsets, count and total each subset independently, and add up the counts. The result should be the same. This is unusual for streaming learning algorithms. Why? Because there is no interaction between feature weight updates. For the perceptron, that's not the case.
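A minimal sketch of trivial solution 1 (my code): counts commute, so per-subset counting followed by a merge gives exactly the single-scan result.

```python
from collections import Counter

def count_shard(shard):
    # shard: list of (list_of_words, label) pairs; counts C(X=word ^ Y=label)
    c = Counter()
    for words, label in shard:
        for w in words:
            c[(w, label)] += 1
    return c

def parallel_counts(shards):
    total = Counter()
    for c in map(count_shard, shards):  # the count_shard calls are independent
        total.update(c)                 # adding counts is order-independent
    return total
```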
A hidden agenda

Part of machine learning is a good grasp of theory; part of ML is a good grasp of what hacks tend to work. These are not always the same, especially in big-data situations.

Catalog of useful tricks so far:
Brute-force estimation of a joint distribution
Naive Bayes
Stream-and-sort, request-and-answer patterns
BLRT and KL-divergence (and when to use them)
TF-IDF weighting, especially IDF (it's often useful even when we don't understand why)
Perceptron/mistake-bound model: often leads to fast, competitive, easy-to-implement methods, though parallel versions are non-trivial to implement/understand