Slide1
Generating Well-Behaved Learning Curves: An Empirical Study
Gary M. Weiss
Alexander Battistin
Fordham University

Slide2
Motivation
- Classification performance is related to the amount of training data
- The relationship is visually represented by a learning curve:
  - Performance increases steeply at first
  - The slope begins to decrease once training data is adequate
  - The slope approaches 0 as more data barely helps
- Training data is often costly (cost of collecting or labeling)
- "Good" learning curves can help identify the optimal amount of data

7/22/2014, DMIN 2014

Slide3
Exploiting Learning Curves
- In practice, when deciding whether to acquire more data, we only have the learning curve up to the current number of examples
- Need to predict performance for larger sizes
  - Can do this iteratively and acquire data in batches
  - Can even use curve fitting
- Works best if the learning curves are well behaved (smooth and regular)

Slide4
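The curve-fitting idea can be sketched as follows. This is a minimal illustration that assumes a simple inverse-square-root model, acc(n) = a - b/sqrt(n), and synthetic observed points; neither the functional form nor the numbers come from the study.

```python
# Sketch: fit a two-parameter model acc(n) = a - b / sqrt(n) to the observed
# part of a learning curve, then extrapolate to a larger training set size.
# The model is linear in a and b, so ordinary least squares suffices.

def fit_inverse_sqrt(sizes, accs):
    """Least-squares fit of acc = a - b / sqrt(n); returns (a, b)."""
    xs = [n ** -0.5 for n in sizes]
    m = len(sizes)
    sx, sy = sum(xs), sum(accs)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, accs))
    slope = (m * sxy - sx * sy) / (m * sxx - sx * sx)  # slope equals -b
    a = (sy - slope * sx) / m
    return a, -slope

def predict(a, b, n):
    """Extrapolated accuracy at training set size n."""
    return a - b / n ** 0.5

# Synthetic "observed" points generated from a = 0.90, b = 1.2 (illustrative).
sizes = [500, 1000, 2000, 4000]
accs = [0.90 - 1.2 / n ** 0.5 for n in sizes]
a, b = fit_inverse_sqrt(sizes, accs)
estimate = predict(a, b, 16000)  # predicted accuracy at a larger size
```

Because the synthetic points lie exactly on the model, the fit recovers a and b; on real curves the fit is only approximate, which is why well-behaved (smooth, monotone) curves matter for this kind of extrapolation.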
Prior Work Using Learning Curves
- Provost, Jensen and Oates [1] evaluated progressive sampling schemes to identify the point where learning curves begin to plateau
- Weiss and Tian [2] examined how learning curves can be used to optimize learning when performance, acquisition costs, and CPU time are considered:
  "Because the analyses are all driven by the learning curves, any method for improving the quality of the learning curves (i.e., smoothness, monotonicity) would improve the quality of our results, especially the effectiveness of the progressive sampling strategies."

[1] Provost, F., Jensen, D., and Oates, T. 1999. Efficient progressive sampling. In Proc. 5th Int. Conference on Knowledge Discovery & Data Mining, 23-32.
[2] Weiss, G. M., and Tian, Y. 2008. Maximizing classifier utility when there are data acquisition and modeling costs. Data Mining and Knowledge Discovery, 17(2): 253-282.

Slide5
What We Do
- Generate learning curves for six data sets
  - Different classification algorithms
  - Random sampling and cross-validation
- Evaluate the curves
  - Visually, for smoothness and monotonicity
  - Via the "variance" of the learning curve

Slide6
The Data Sets

Name         # Examples   Classes   # Attributes
Adult        32,561       2         14
Coding       20,000       2         15
Blackjack    15,000       2         4
Boa1         11,000       2         68
Kr-vs-kp     3,196        2         36
Arrhythmia   452          2         279

Slide7
Experiment Methodology
- Sampling strategies:
  - 10-fold cross-validation: 90% of the data available for training
  - Random sampling: 75% of the data available for training
- Training set sizes sampled at regular 2% intervals of the available data
- Classification algorithms (from WEKA):
  - J48 decision tree
  - Random Forest
  - Naïve Bayes

Slide8
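The random-sampling procedure above can be sketched in code: hold out a test set, then train on 2%, 4%, ..., 100% of the remaining examples and record accuracy at each size. This is a minimal Python illustration rather than the WEKA setup; the majority-class "learner" is a stand-in for J48, Random Forest, and Naïve Bayes, and the toy data set is invented for the example.

```python
import random
from collections import Counter

def learning_curve(examples, train_fn, eval_fn, test_frac=0.25, step=0.02, seed=0):
    """Shuffle, hold out a test set, then evaluate models trained on
    step, 2*step, ..., 100% of the remaining (available) examples.
    Returns a list of (training_size, accuracy) pairs."""
    rng = random.Random(seed)
    data = examples[:]
    rng.shuffle(data)
    n_test = int(len(data) * test_frac)
    test, avail = data[:n_test], data[n_test:]
    curve = []
    for k in range(1, round(1 / step) + 1):
        n = max(1, round(len(avail) * step * k))
        model = train_fn(avail[:n])
        curve.append((n, eval_fn(model, test)))
    return curve

# Placeholder "classifier": always predict the most common training label.
def train_majority(train):
    return Counter(label for _, label in train).most_common(1)[0][0]

def eval_majority(majority_label, test):
    return sum(1 for _, label in test if label == majority_label) / len(test)

# Toy data set of (feature, label) pairs, invented for the example.
data = [(i, i % 3 == 0) for i in range(1000)]
curve = learning_curve(data, train_majority, eval_majority)
```

With test_frac=0.25 and step=0.02 this yields 50 points, matching the slide's setup of 75% available for training sampled at 2% intervals; the 10-fold cross-validation variant would average such curves over folds instead of using a single random split.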
Results: Accuracy

Dataset      J48    Random Forest   Naïve Bayes
Adult        86.3   84.3            83.4
Coding       72.2   79.3            71.2
Blackjack    72.3   71.7            67.8
Boa1         54.7   56.0            58.0
Kr-vs-kp     99.4   98.7            87.8
Arrhythmia   65.4   65.2            62.0
Average      75.1   75.9            71.7
Accuracy is not our focus, but a well-behaved learning curve for a method that produces poor results is not useful. These results are for the largest training set size (no reduction). J48 and Random Forest are competitive, so we will focus on them.

Slide9
Results: Variances

Dataset      J48     Random Forest   Naïve Bayes
Adult        0.51    0.32            0.01
Coding       9.78    17.08           0.19
Blackjack    0.36    2.81            0.01
Boa1         0.20    0.31            0.73
Kr-vs-kp     3.54    12.08           4.34
Arrhythmia   41.46   15.87           9.90
The variance for a curve equals the average variance in performance over the evaluated training set sizes. These results are for 10-fold cross-validation. Naïve Bayes is best, followed by J48, but Naïve Bayes had low accuracy (see the previous slide).

Slide10
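The curve "variance" metric can be sketched as follows. This is one plausible reading of the description (population variance across folds at each size, averaged over sizes), not the authors' exact computation.

```python
def curve_variance(fold_accuracies):
    """fold_accuracies: one list of per-fold accuracies for each evaluated
    training set size. Returns the average of the per-size variances
    (population variance, an assumption; the paper does not specify)."""
    def variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)
    return sum(variance(accs) for accs in fold_accuracies) / len(fold_accuracies)
```

A smooth, well-behaved curve has folds that agree at every size, driving this score toward 0; erratic curves like Arrhythmia's score high.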
J48 Learning Curves (10-fold cross-validation)

Slide11

Random Forest Learning Curves

Slide12

Naïve Bayes Learning Curves

Slide13

A Closer Look at J48 and RF (Adult)

Slide14

A Closer Look at J48 and RF (kr-vs-kp)

Slide15

A Closer Look at J48 and RF (Arrhythmia)

Slide16
Now let's compare cross-validation to random sampling, which we find generates less well-behaved curves. (The preceding curves used cross-validation.)

Slide17
J48 Learning Curves (Blackjack Data Set)

Slide18

RF Learning Curves (Blackjack Data Set)

Slide19
Conclusions
- Introduced the notion of well-behaved learning curves and methods for evaluating this property
- Naïve Bayes produced much smoother curves, but was less accurate
  - Its low variance may be because its curves consistently reach a plateau early
- J48 and Random Forest both seem reasonable; more data sets are needed to determine which is best
- Cross-validation clearly generates better-behaved curves than random sampling (less randomness?)

Slide20
Future Work
- Need a more comprehensive evaluation
  - Many more data sets
  - Compare more algorithms
- Additional metrics
  - Count the number of drops in performance as training size grows (i.e., "blips")
  - Need a better summary metric
- Vary the number of runs; more runs almost certainly yield smoother learning curves
- Evaluate in context
  - Ability to identify the optimal learning point
  - Ability to identify the plateau (based on some criterion)

Slide21
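The proposed "blip" count is straightforward to implement as a single pass over a curve; the sketch below is a hypothetical realization of the metric suggested above, not something from the paper.

```python
def count_blips(curve):
    """Count drops in performance as training size grows.
    curve: list of (training_size, accuracy) pairs, ordered by size.
    A perfectly monotone (non-decreasing) curve scores 0."""
    accs = [acc for _, acc in curve]
    return sum(1 for prev, cur in zip(accs, accs[1:]) if cur < prev)
```

For example, a curve that rises, dips once, and rises again has one blip; as a summary metric it ignores the dips' magnitudes, which is one reason the slides call for a better summary metric.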
If Interested in This Area
- Provost, F., Jensen, D., and Oates, T. 1999. Efficient progressive sampling. In Proc. 5th Int. Conference on Knowledge Discovery & Data Mining, 23-32.
- Weiss, G. M., and Tian, Y. 2008. Maximizing classifier utility when there are data acquisition and modeling costs. Data Mining and Knowledge Discovery, 17(2): 253-282.
- Contact me if you want to work on expanding this paper (gaweiss@fordham.edu)