Presentation Transcript

Slide 1

An Introduction to Ensemble Methods
Boosting, Bagging, Random Forests and More

Yisong Yue

Slide 2

Supervised Learning

Goal: learn predictor h(x)
  High accuracy (low error)
  Using training data {(x_1, y_1), …, (x_n, y_n)}

Slide 3

Person   Age   Male?   Height > 55"
Alice    14    0       1
Bob      10    1       1
Carol    13    0       1
Dave     8     1       0
Erin     11    0       0
Frank    9     1       1
Gena     8     0       0

h(x) = sign(w^T x - b)
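As a concrete illustration of the linear threshold predictor h(x) = sign(w^T x - b), here is a minimal sketch (not from the slides); the weights w and threshold b below are hand-chosen for this toy table, purely for illustration.

```python
import numpy as np

# Toy table from the slide: features are (Age, Male?), label is Height > 55".
X = np.array([[14, 0], [10, 1], [13, 0], [8, 1], [11, 0], [9, 1], [8, 0]])
y = np.array([1, 1, 1, 0, 0, 1, 0])

# Hand-chosen weights and threshold, purely for illustration (not from the slides).
w = np.array([1.0, 3.0])
b = 11.5

def h(x):
    """Linear threshold predictor: 1 if w^T x - b > 0, else 0."""
    return int(x @ w - b > 0)

print([h(x) for x in X])  # happens to match y on this particular toy table
print(list(y))
```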

Slide 4

Person   Age   Male?   Height > 55"
Alice    14    0       1
Bob      10    1       1
Carol    13    0       1
Dave     8     1       0
Erin     11    0       0
Frank    9     1       1
Gena     8     0       0

(Figure: a decision tree over the same data. The root splits on Male?; one branch then splits on Age>9? and the other on Age>10?; the leaves predict 1 or 0 via Yes/No edges.)
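A decision tree like the one pictured is just nested conditionals. The sketch below is an assumption-laden reconstruction: only the split questions (Male?, Age>9?, Age>10?) are visible in the transcript, so the branch-to-threshold assignment and leaf labels are illustrative.

```python
def tree_predict(age: int, male: int) -> int:
    """A small decision tree in the spirit of the slide's figure.
    The branch-to-threshold assignment here is an assumption; only the
    split questions (Male?, Age>9?, Age>10?) appear in the transcript."""
    if male:                           # root split: Male?
        return 1 if age > 9 else 0     # assumed: the male branch splits on Age > 9?
    return 1 if age > 10 else 0        # assumed: the female branch splits on Age > 10?

# Evaluate against the toy table; a depth-2 tree need not fit every row.
data = [("Alice", 14, 0, 1), ("Bob", 10, 1, 1), ("Carol", 13, 0, 1), ("Dave", 8, 1, 0),
        ("Erin", 11, 0, 0), ("Frank", 9, 1, 1), ("Gena", 8, 0, 0)]
for name, age, male, label in data:
    print(name, tree_predict(age, male), label)
```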

Slide 5

Splitting Criterion: entropy reduction (information gain)
Complexity: number of leaves, size of leaves
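To make "entropy reduction" concrete, here is a minimal sketch (not from the slides) that computes the information gain of candidate splits on the toy table; the function names are my own.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, split):
    """Entropy reduction from partitioning the rows with a boolean split function."""
    left = [y for x, y in zip(rows, labels) if split(x)]
    right = [y for x, y in zip(rows, labels) if not split(x)]
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

# Toy table from the slide: (age, male), label = Height > 55"
rows = [(14, 0), (10, 1), (13, 0), (8, 1), (11, 0), (9, 1), (8, 0)]
labels = [1, 1, 1, 0, 0, 1, 0]

print(info_gain(rows, labels, lambda x: x[1] == 1))  # gain of splitting on Male?
print(info_gain(rows, labels, lambda x: x[0] > 9))   # gain of splitting on Age > 9?
```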

Slide 6

Outline

Bias/Variance Tradeoff
Ensemble methods that minimize variance
  Bagging
  Random Forests
Ensemble methods that minimize bias
  Functional Gradient Descent
  Boosting
  Ensemble Selection

Slide 7

Generalization Error

"True" distribution: P(x,y), unknown to us
Train: h(x) = y, using training data S = {(x_1,y_1),…,(x_n,y_n)} sampled from P(x,y)

Generalization Error:
  L(h) = E_{(x,y)~P(x,y)}[ f(h(x), y) ]
  e.g., squared loss f(a,b) = (a-b)^2
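Since P(x,y) is unknown, in practice L(h) is estimated by averaging the loss over held-out samples. A minimal sketch (assumed names, not from the slides):

```python
import numpy as np

def empirical_error(h, X_heldout, y_heldout):
    """Estimate L(h) = E_{(x,y)~P}[ (h(x) - y)^2 ] by averaging squared loss
    over held-out samples drawn from P(x,y)."""
    preds = np.array([h(x) for x in X_heldout])
    return float(np.mean((preds - y_heldout) ** 2))
```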

Slide 8

Training set S, used to learn h(x):

Person   Age   Male?   Height > 55"
Alice    14    0       1
Bob      10    1       1
Carol    13    0       1
Dave     8     1       0
Erin     11    0       0
Frank    9     1       1
Gena     8     0       0

People drawn from the "true" distribution P(x,y), against which h(x) is compared to y:

Person     Age   Male?   Height > 55"
James      11    1       1
Jessica    14    0       1
Alice      14    0       1
Amy        12    0       1
Bob        10    1       1
Xavier     9     1       0
Cathy      9     0       1
Carol      13    0       1
Eugene     13    1       0
Rafael     12    1       1
Dave       8     1       0
Peter      9     1       0
Henry      13    1       0
Erin       11    0       0
Rose       7     0       0
Iain       8     1       1
Paulo      12    1       0
Margaret   10    0       1
Frank      9     1       1
Jill       13    0       0
Leon       10    1       0
Sarah      12    0       0
Gena       8     0       0
Patrick    5     1       1

Generalization Error:  L(h) = E_{(x,y)~P(x,y)}[ f(h(x), y) ]

Slide 9

Bias/Variance Tradeoff

Treat h(x|S) as a random function; it depends on the training data S.

L = E_S[ E_{(x,y)~P(x,y)}[ f(h(x|S), y) ] ]

This is the expected generalization error, over the randomness of S.

Slide 10

Bias/Variance Tradeoff

Squared loss: f(a,b) = (a-b)^2
Consider one data point (x,y).

Notation:
  Z = h(x|S) - y
  ž = E_S[Z]
  Z - ž = h(x|S) - E_S[h(x|S)]

E_S[(Z - ž)^2] = E_S[Z^2 - 2Zž + ž^2]
               = E_S[Z^2] - 2 E_S[Z] ž + ž^2
               = E_S[Z^2] - ž^2

Expected Error:
  E_S[f(h(x|S), y)] = E_S[Z^2] = E_S[(Z - ž)^2] + ž^2
                                   (Variance)     (Bias)

Bias/Variance for all (x,y) is the expectation over P(x,y).
Can also incorporate measurement noise.
(A similar flavor of analysis applies to other loss functions.)
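A minimal simulation sketch (not from the slides) of this decomposition: repeatedly draw training sets S from a known distribution, fit a model, and check that the expected squared error at a point equals variance plus squared bias. The data-generating process and model class below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_set(n=30):
    """Draw S from a known P(x,y): y = sin(x) + Gaussian noise (illustrative choice)."""
    x = rng.uniform(0, 3, size=n)
    y = np.sin(x) + rng.normal(scale=0.3, size=n)
    return x, y

def fit_and_predict(x_query, degree=1):
    """h(x_query | S) for one random S: a least-squares polynomial fit."""
    xs, ys = sample_training_set()
    coeffs = np.polyfit(xs, ys, degree)
    return np.polyval(coeffs, x_query)

x0, y0 = 1.5, np.sin(1.5)                                        # one fixed data point (x, y)
Z = np.array([fit_and_predict(x0) - y0 for _ in range(2000)])    # Z = h(x0|S) - y0, over many S

expected_error = np.mean(Z ** 2)
variance = np.var(Z)               # E_S[(Z - ž)^2]
bias_sq = np.mean(Z) ** 2          # ž^2
print(expected_error, variance + bias_sq)   # the two should agree up to sampling noise
```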

Slide 11

Example: data points plotted as y versus x.

Slide 12

h(x|S) fit to one sampled training set S (plot).

Slide 13

h(x|S) fit to another sampled training set S (plot).

Slide 14

h(x|S) fit to a third sampled training set S (plot).

Slide 15

Expected Error:  E_S[(h(x|S) - y)^2] = E_S[(Z - ž)^2] + ž^2
                                         (Variance)     (Bias)
where Z = h(x|S) - y and ž = E_S[Z].
Slide 16

Slide 17

Slide 18

Outline

Bias/Variance Tradeoff
Ensemble methods that minimize variance
  Bagging
  Random Forests
Ensemble methods that minimize bias
  Functional Gradient Descent
  Boosting
  Ensemble Selection

Slide 19

Bagging

Goal: reduce variance
Ideal setting: many training sets S', each sampled independently from P(x,y)
  Train a model on each S'
  Average their predictions

Expected Error:  E_S[(h(x|S) - y)^2] = E_S[(Z - ž)^2] + ž^2   (Variance + Bias)
  Z = h(x|S) - y,  ž = E_S[Z]

Variance reduces linearly; bias is unchanged.

"Bagging Predictors" [Leo Breiman, 1994]
http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf

Slide 20

Bagging = Bootstrap Aggregation

Goal: reduce variance
In practice: resample S' from S with replacement
  Train a model on each S'
  Average their predictions

Expected Error:  E_S[(h(x|S) - y)^2] = E_S[(Z - ž)^2] + ž^2   (Variance + Bias)
  Z = h(x|S) - y,  ž = E_S[Z]

Variance reduces sub-linearly (because the S' are correlated); bias often increases slightly.

"Bagging Predictors" [Leo Breiman, 1994]
http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
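A minimal bagging sketch (not from the slides): bootstrap-resample S, fit a base learner on each resample, and average the predictions. Using scikit-learn's DecisionTreeRegressor as the deep, low-bias base learner is my assumption; the inputs are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor   # deep trees: low bias, high variance

def bagged_predict(X_train, y_train, X_test, n_models=100, seed=0):
    """Bagging: train one deep tree per bootstrap resample of S, then average predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)          # resample S with replacement
        tree = DecisionTreeRegressor()
        tree.fit(X_train[idx], y_train[idx])
        preds.append(tree.predict(X_test))
    return np.mean(preds, axis=0)                 # averaging reduces variance
```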

Slide 21

(Chart: bias and variance of a single decision tree (DT) vs. a bagged DT; the arrow marks the "better" direction.)

"An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants"
Eric Bauer & Ron Kohavi, Machine Learning 36, 105–139 (1999)

Slide 22

Random Forests

Goal: reduce variance
  Bagging can only do so much: resampling the training data asymptotes
Random Forests: sample data & features!
  Sample S'
  Train a DT; at each node, sample a random subset of features (of size about sqrt of the total)
  Average predictions
This further de-correlates the trees.

"Random Forests – Random Features" [Leo Breiman, 1997]
http://oz.berkeley.edu/~breiman/random-forests.pdf
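A minimal sketch (not from the slides) using scikit-learn's RandomForestRegressor; its max_features="sqrt" option corresponds to the per-node feature sampling described above. The toy data is made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Random forest = bagging + a random feature subset at each split.
forest = RandomForestRegressor(
    n_estimators=200,      # number of bootstrapped trees to average
    max_features="sqrt",   # at each node, consider only ~sqrt(#features) candidate features
    bootstrap=True,        # resample the training data with replacement
    random_state=0,
)

X = np.random.default_rng(0).normal(size=(500, 10))                # toy data, purely illustrative
y = X[:, 0] - 2 * X[:, 1] + np.random.default_rng(1).normal(size=500)
forest.fit(X, y)
print(forest.predict(X[:5]))
```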

Slide 23

Average performance over many datasets: Random Forests perform the best.

"An Empirical Evaluation of Supervised Learning in High Dimensions"
Caruana, Karampatziakis & Yessenalina, ICML 2008

Slide 24

Structured Random Forests

DTs normally train on unary labels y = 0/1. What about structured labels?
  Must define information gain for structured labels
Edge detection: e.g., the structured label is a 16x16 image patch
  Map structured labels to another space where entropy is well defined

"Structured Random Forests for Fast Edge Detection"
Dollár & Zitnick, ICCV 2013

Slide 25

(Figure: input Image, Ground Truth edges, and RF Prediction.)

Comparable accuracy vs. the state of the art, and orders of magnitude faster.

"Structured Random Forests for Fast Edge Detection"
Dollár & Zitnick, ICCV 2013

Slide 26

Outline

Bias/Variance Tradeoff
Ensemble methods that minimize variance
  Bagging
  Random Forests
Ensemble methods that minimize bias
  Functional Gradient Descent
  Boosting
  Ensemble Selection

Slide 27

Functional Gradient Descent

h(x) = h_1(x) + h_2(x) + … + h_n(x)

Train h_1(x) on S' = {(x, y)}
Train h_2(x) on S' = {(x, y - h_1(x))}
…
Train h_n(x) on S' = {(x, y - h_1(x) - … - h_{n-1}(x))}

http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
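A minimal sketch of this residual-fitting loop for squared loss (not from the slides). Using shallow scikit-learn regression trees as the h_t is my assumption; the slides do not fix the base learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def functional_gradient_descent(X, y, n_stages=100, max_depth=3):
    """Squared-loss gradient boosting: each new tree h_t is fit to the current residuals,
    i.e. S' = {(x, y - h_1(x) - ... - h_{t-1}(x))}."""
    models = []
    residual = y.astype(float)
    for _ in range(n_stages):
        h_t = DecisionTreeRegressor(max_depth=max_depth)
        h_t.fit(X, residual)
        residual = residual - h_t.predict(X)   # remaining error after adding h_t
        models.append(h_t)
    return models

def predict(models, X):
    """h(x) = h_1(x) + h_2(x) + ... + h_n(x)."""
    return np.sum([m.predict(X) for m in models], axis=0)
```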

Slide 28

Coordinate Gradient Descent

Learn w so that h(x) = w^T x.
Coordinate descent:
  Init w = 0
  Choose the dimension with the highest gain
  Set that component of w
  Repeat

(Slides 29–33 repeat these steps, animating one component of w being set per round.)
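One reasonable reading of those steps, as a minimal sketch (not from the slides): greedy coordinate descent for least squares, where each round picks the coordinate whose one-dimensional update most reduces the squared error, sets that component, and repeats.

```python
import numpy as np

def coordinate_descent(X, y, n_rounds=10):
    """Greedy coordinate descent for least squares h(x) = w^T x.
    Each round: pick the dimension with the highest gain (largest error reduction),
    then set that component of w to its 1-D optimum."""
    n, d = X.shape
    w = np.zeros(d)                   # Init w = 0
    residual = y - X @ w
    for _ in range(n_rounds):
        best_j, best_gain, best_step = None, 0.0, 0.0
        for j in range(d):
            xj = X[:, j]
            denom = xj @ xj
            if denom == 0:
                continue
            step = (xj @ residual) / denom   # 1-D least-squares update for coordinate j
            gain = step ** 2 * denom         # reduction in squared error from this update
            if gain > best_gain:
                best_j, best_gain, best_step = j, gain, step
        if best_j is None:
            break
        w[best_j] += best_step               # set that component of w
        residual -= best_step * X[:, best_j]
    return w
```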

Slide 34

Functional Gradient Descent

"Function space": all possible DTs
Coordinate descent in function space
Restrict weights to be 0, 1, 2, …

h(x) = h_1(x) + h_2(x) + … + h_n(x)

Slide 35

Boosting (AdaBoost)

h(x) = a_1 h_1(x) + a_2 h_2(x) + … + a_n h_n(x)

Train h_1(x) on S' = {(x, y, u_1)}
Train h_2(x) on S' = {(x, y, u_2)}
…
Train h_n(x) on S' = {(x, y, u_n)}

u – weighting on data points
a – weight of linear combination

Stop when validation performance plateaus (will discuss later).

https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf

Slide 36

AdaBoost loop (annotations on the pseudocode):
  Initial distribution of data
  Train model
  Error of model
  Coefficient of model
  Update distribution
  Final average

Theorem: training error drops exponentially fast.

https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf
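A minimal sketch of those steps (not from the slides), using depth-1 scikit-learn trees as the weak learners, which is an assumption; labels are taken to be in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """AdaBoost with decision stumps; y must be in {-1, +1}.
    Follows the slide's steps: init distribution, train, error, coefficient, update."""
    n = len(y)
    u = np.full(n, 1.0 / n)                # initial distribution of data
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=u)   # train model on weighted data
        pred = stump.predict(X)
        err = np.sum(u * (pred != y))      # weighted error of model
        if err == 0 or err >= 0.5:
            break
        a = 0.5 * np.log((1 - err) / err)  # coefficient of model
        u *= np.exp(-a * y * pred)         # update distribution (upweight mistakes)
        u /= u.sum()
        models.append(stump)
        alphas.append(a)
    return models, alphas

def predict(models, alphas, X):
    """Final weighted average: h(x) = sign(sum_t a_t h_t(x))."""
    scores = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(scores)
```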

Slide 37

Boosting often uses weak models, e.g. "shallow" decision trees. Weak models have lower variance.

(Chart: bias and variance of DT vs. AdaBoost vs. Bagging.)

"An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants"
Eric Bauer & Ron Kohavi, Machine Learning 36, 105–139 (1999)

Slide 38

Ensemble Selection

Split S into a training set S' and a validation set V'.
H = {2000 models trained using S'}

Maintain the ensemble model as a combination of models from H:
  h(x) = h_1(x) + h_2(x) + … + h_n(x)
Add the model from H that maximizes performance on V':
  h(x) ← h(x) + h_{n+1}(x)
Repeat.

Models are trained on S'; the ensemble is built to optimize V'.

"Ensemble Selection from Libraries of Models"
Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
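A minimal greedy forward-selection sketch of this loop (not from the slides); the library, metric (squared error here), and function names are placeholder assumptions.

```python
import numpy as np

def ensemble_selection(library_preds_val, y_val, n_rounds=20):
    """Greedy ensemble selection: repeatedly add (with replacement) the library model
    whose inclusion most improves validation performance of the averaged prediction.

    library_preds_val: dict mapping model name -> that model's predictions on V'.
    """
    chosen = []
    ensemble_sum = np.zeros_like(y_val, dtype=float)
    for _ in range(n_rounds):
        best_name, best_loss = None, np.inf
        for name, preds in library_preds_val.items():
            candidate = (ensemble_sum + preds) / (len(chosen) + 1)   # averaged ensemble
            loss = np.mean((candidate - y_val) ** 2)                 # validation loss
            if loss < best_loss:
                best_name, best_loss = name, loss
        chosen.append(best_name)
        ensemble_sum += library_preds_val[best_name]
    return chosen   # multiset of selected models; final prediction = average of their outputs
```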

Slide 39

Ensemble Selection often outperforms a more homogeneous set of models.
It reduces overfitting by building the model using a validation set.
Ensemble Selection won the KDD Cup 2009.
http://www.niculescu-mizil.org/papers/KDDCup09.pdf

"Ensemble Selection from Libraries of Models"
Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004

Slide 40

Method                         Minimize Bias?                     Minimize Variance?                                 Other Comments
Bagging                        Complex model class (deep DTs)     Bootstrap aggregation (resampling training data)   Does not work for simple models.
Random Forests                 Complex model class (deep DTs)     Bootstrap aggregation + bootstrapping features     Only for decision trees.
Gradient Boosting (AdaBoost)   Optimize training performance      Simple model class (shallow DTs)                   Determines which model to add at run-time.
Ensemble Selection             Optimize validation performance    Optimize validation performance                    Pre-specified dictionary of models learned on the training set.

State-of-the-art prediction performance: won the Netflix Challenge, won numerous KDD Cups, an industry standard.

…and many other ensemble methods as well.

Slide 41

References & Further Reading

"An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants", Bauer & Kohavi, Machine Learning 36, 105–139 (1999)
"Bagging Predictors", Leo Breiman, Tech Report #421, UC Berkeley, 1994, http://statistics.berkeley.edu/sites/default/files/tech-reports/421.pdf
"An Empirical Comparison of Supervised Learning Algorithms", Caruana & Niculescu-Mizil, ICML 2006
"An Empirical Evaluation of Supervised Learning in High Dimensions", Caruana, Karampatziakis & Yessenalina, ICML 2008
"Ensemble Methods in Machine Learning", Thomas Dietterich, Multiple Classifier Systems, 2000
"Ensemble Selection from Libraries of Models", Caruana, Niculescu-Mizil, Crew & Ksikes, ICML 2004
"Getting the Most Out of Ensemble Selection", Caruana, Munson & Niculescu-Mizil, ICDM 2006
"Explaining AdaBoost", Rob Schapire, https://www.cs.princeton.edu/~schapire/papers/explaining-adaboost.pdf
"Greedy Function Approximation: A Gradient Boosting Machine", Jerome Friedman, 2001, http://statweb.stanford.edu/~jhf/ftp/trebst.pdf
"Random Forests – Random Features", Leo Breiman, Tech Report #567, UC Berkeley, 1999
"Structured Random Forests for Fast Edge Detection", Dollár & Zitnick, ICCV 2013
"ABC-Boost: Adaptive Base Class Boost for Multi-class Classification", Ping Li, ICML 2009
"Additive Groves of Regression Trees", Sorokina, Caruana & Riedewald, ECML 2007, http://additivegroves.net/
"Winning the KDD Cup Orange Challenge with Ensemble Selection", Niculescu-Mizil et al., KDD 2009
"Lessons from the Netflix Prize Challenge", Bell & Koren, SIGKDD Explorations 9(2), 75–79, 2007