Journal of Machine Learning Research 1 (2000) 113-141. Submitted 5/00; Published 12/00.

Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers

Erin L. Allwein, Southwest Research Institute, 6220 Culebra Road, San Antonio, TX 78228
Robert E. Schapire, AT&T Labs Research, Shannon Laboratory, 180 Park Avenue, Room A203, Florham Park, NJ 07932
Yoram Singer, School of Computer Science & Engineering, Hebrew University, Jerusalem 91904, Israel

Editor: Leslie Pack Kaelbling

(c) 2000 AT&T Corp.

Abstract

We present a unifying framework for studying the solution of multiclass categorization problems by reducing them to multiple binary problems that are then solved using a margin-based binary learning algorithm. The proposed framework unifies some of the most popular approaches in which each class is compared against all others, or in which all pairs of classes are compared to each other, or in which output codes with error-correcting properties are used. We propose a general method for combining the classifiers generated on the binary problems, and we prove a general empirical multiclass loss bound given the empirical loss of the individual binary learning algorithms. The scheme and the corresponding bounds apply to many popular classification learning algorithms including support-vector machines, AdaBoost, regression, logistic regression and decision-tree algorithms. We also give a multiclass generalization error analysis for general output codes with AdaBoost as the binary learner. Experimental results with SVM and AdaBoost show that our scheme provides a viable alternative to the most commonly used multiclass algorithms.

1. Introduction

Many supervised machine learning tasks can be cast as the problem of assigning elements to a finite set of classes or categories. For example, the goal of optical character recognition (OCR) systems is to determine the digit value (0, ..., 9) from its image. The number of applications that require multiclass categorization is immense. A few examples of such applications are text and speech categorization, natural language processing tasks such as part-of-speech tagging, and gesture and object recognition in machine vision.

In designing machine learning algorithms, it is often easier first to devise algorithms for distinguishing between only two classes. Some machine learning algorithms, such as C4.5 (Quinlan, 1993) and CART (Breiman, Friedman, Olshen, & Stone, 1984), can then be naturally extended to handle the multiclass case. For other algorithms, such as AdaBoost (Freund & Schapire, 1997;
Schapire & Singer, 1999) and the support-vector machines (SVM) algorithm (Vapnik, 1995; Cortes & Vapnik, 1995), a direct extension to the multiclass case may be problematic. Typically, in such cases, the multiclass problem is reduced to multiple binary classification problems that can be solved separately. Connectionist models (Rumelhart, Hinton, & Williams, 1986), in which each class is represented by an output neuron, are a notable example: each output neuron serves as a discriminator between the class it represents and all of the other classes. Thus, this training algorithm is based on a reduction of the multiclass problem to $k$ binary problems, where $k$ is the number of classes.

There are many ways to reduce a multiclass problem to multiple binary classification problems. In the simple approach mentioned above, each class is compared to all others. Hastie and Tibshirani (1998) suggest a different approach in which all pairs of classes are compared to each other. Dietterich and Bakiri (1995) presented a general framework in which the classes are partitioned into opposing subsets using error-correcting codes. For all of these methods, after the binary classification problems have been solved, the resulting set of binary classifiers must then be combined in some way.

In this paper, we study a general framework, which is a simple extension of Dietterich and Bakiri's framework, that unifies all of these methods of reducing a multiclass problem to a binary problem. We pay particular attention to the case in which the binary learning algorithm is one that is based on the margin of a training example. Roughly speaking, the margin of a training example is a number that is positive if and only if the example is correctly classified by a given classifier and whose magnitude is a measure of confidence in the prediction. Several well known algorithms work directly with margins. For instance, the SVM algorithm (Vapnik, 1995; Cortes & Vapnik, 1995) attempts to maximize the minimum margin of any training example. There are many more algorithms that attempt to minimize some loss function of the margin. AdaBoost (Freund & Schapire, 1997; Schapire & Singer, 1999) is one example: it can be shown that AdaBoost is a greedy procedure for minimizing an exponential loss function of the margins. In Section 2, we catalog many other algorithms that also can be viewed as margin-based learning algorithms, including regression, logistic regression and decision-tree algorithms.

The simplest method of combining the binary classifiers (which we call Hamming decoding) ignores the loss function that was used during training as well as the confidences attached to predictions made by the classifier. In Section 3, we give a new and general technique for combining classifiers that does not suffer from either of these defects. We call this method loss-based decoding.

We next prove some of the theoretical properties of these methods in Section 4. In particular, for both of the decoding methods, we prove general bounds on the training error on the multiclass problem in terms of the empirical performance on the individual binary problems. These bounds indicate that loss-based decoding is superior to Hamming decoding. Also, these bounds depend on the manner in which the multiclass problem has been reduced to binary problems. For the one-against-all approach, our bounds are linear in the number of classes, but for a reduction based on random partitions of the classes, the bounds are independent of the number of classes. These results generalize more specialized bounds proved by Schapire and Singer (1999) and by Guruswami and Sahai (1999).

In Section 5, we prove a bound on the generalization error of our method when the binary learner is AdaBoost. In particular, we generalize the analysis of Schapire et al. (1998), expressing a bound on the generalization error in terms of the training-set margins of the combined multiclass classifier, and showing that boosting, when used in this way, tends to aggressively increase the margins of the training examples.
Finally, in Section 6, we present experiments using SVM and AdaBoost with a variety of multiclass-to-binary reductions. These results show that, as predicted by our theory, loss-based decoding is almost always better than Hamming decoding. Further, the results show that the most commonly used one-against-all reduction is easy to beat, but that the best method seems to be problem dependent.

2. Margin-based Learning Algorithms

We study methods for handling multiclass problems using a general class of binary algorithms that attempt to minimize a margin-based loss function. In this section, we describe that class of learning algorithms with several examples.

A binary margin-based learning algorithm takes as input binary labeled training examples $(x_1, y_1), \ldots, (x_m, y_m)$ where the instances $x_i$ belong to some domain $\mathcal{X}$ and the labels $y_i \in \{-1, +1\}$. Such a learning algorithm uses the data to generate a real-valued function or hypothesis $f : \mathcal{X} \to \mathbb{R}$, where $f$ belongs to some hypothesis space $\mathcal{F}$. The margin of an example $(x, y)$ with respect to $f$ is $y f(x)$. Note that the margin is positive if and only if the sign of $f(x)$ agrees with $y$. Thus, if we interpret the sign of $f(x)$ as its prediction on $x$, then

\[ \frac{1}{m} \sum_{i=1}^{m} [[\, y_i f(x_i) \le 0 \,]] \]

is exactly the training error of $f$, where, in this case, we count a zero output ($f(x_i) = 0$) as a mistake. (Here and throughout this paper, $[[\pi]]$ is 1 if predicate $\pi$ holds and 0 otherwise.)

Although minimization of the training error may be a worthwhile goal, in its most general form the problem is intractable (see for instance the work of Hoeffgen and Simon (1992)). It is therefore often advantageous to instead minimize some other nonnegative loss function of the margin, that is, to minimize

\[ \frac{1}{m} \sum_{i=1}^{m} L(y_i f(x_i)) \tag{1} \]

for some loss function $L : \mathbb{R} \to [0, \infty)$. Different choices of the loss function and different algorithms for (approximately) minimizing Eq. (1) over some hypothesis space lead to various well-studied learning algorithms. Below we list several examples. In the present work, we are not particularly concerned with the method used to achieve a small empirical loss since we will use these algorithms later in the paper as "black boxes." We focus instead on the loss function itself whose properties will allow us to prove our main theorem on the effectiveness of output coding methods for multiclass problems.

Support-vector machines. For training data that may not be linearly separable, the support-vector machines (SVM) algorithm (Vapnik, 1995; Cortes & Vapnik, 1995) seeks a linear classifier $f(x) = \lambda \cdot x + b$ that minimizes the objective function

\[ \frac{1}{2}\|\lambda\|^2 + C \sum_{i=1}^{m} \xi_i \]

for some parameter $C$, subject to the linear constraints $y_i(\lambda \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$. Put another way, the SVM solution for $\lambda$ is the minimizer of the regularized empirical loss function

\[ \frac{1}{2}\|\lambda\|^2 + C \sum_{i=1}^{m} \bigl(1 - y_i(\lambda \cdot x_i + b)\bigr)_+ \]

where $(z)_+ = \max\{z, 0\}$. (For a more formal treatment see, for instance, the work of Schoelkopf et al. (1998).) Although the role of the norm of $\lambda$ in the objective function is fundamental in order for SVM to work, the analysis presented in the next section (and the corresponding multiclass algorithm) depends only on the loss function (which is a function of the margins). Thus, SVM can be viewed here as a binary margin-based learning algorithm which seeks to achieve small empirical risk for the loss function $L(z) = (1 - z)_+$.

AdaBoost. The algorithm AdaBoost (Freund & Schapire, 1997; Schapire & Singer, 1999) builds a hypothesis $f$ that is a linear combination of base hypotheses $h_t$:

\[ f(x) = \sum_{t} \alpha_t h_t(x). \]

The hypothesis $f$ is built up in a series of rounds on each of which an $h_t$ is selected by a base learning algorithm and a coefficient $\alpha_t \in \mathbb{R}$ is then chosen. It has been observed by Breiman (1997a, 1997b) and other authors (Collins, Schapire, & Singer, 2000; Friedman, Hastie, & Tibshirani, 2000; Mason, Baxter, Bartlett, & Frean, 1999; Raetsch, Onoda, & Mueller, to appear; Schapire & Singer, 1999) that the $h_t$'s and $\alpha_t$'s are effectively being greedily chosen so as to minimize

\[ \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f(x_i)}. \]

Thus, AdaBoost is a binary margin-based learning algorithm in which the loss function is $L(z) = e^{-z}$.
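As a small illustration of the margin-based view (ours, not part of the paper), the empirical loss of Eq. (1) can be computed for any of these loss functions once the margins $y_i f(x_i)$ are in hand; the classifier used below is a hypothetical placeholder for demonstration only.

```python
import numpy as np

# Margin-based loss functions L(z), where z = y * f(x) is the margin.
def hinge_loss(z):        # used by SVM: L(z) = (1 - z)_+
    return np.maximum(0.0, 1.0 - z)

def exp_loss(z):          # used by AdaBoost: L(z) = exp(-z)
    return np.exp(-z)

def empirical_margin_loss(f, X, y, loss):
    """Average loss (1/m) sum_i L(y_i f(x_i)) as in Eq. (1)."""
    margins = y * f(X)              # vector of margins y_i f(x_i)
    return np.mean(loss(margins))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = np.sign(X[:, 0] + 0.5 * X[:, 1])
    f = lambda X: X @ np.array([1.0, 0.5])   # hypothetical real-valued classifier
    print(empirical_margin_loss(f, X, y, hinge_loss))
    print(empirical_margin_loss(f, X, y, exp_loss))
```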
AdaBoost with randomized predictions. In a little studied variant of AdaBoost (Freund & Schapire, 1997), we allow AdaBoost to output randomized predictions in which the predicted label of a new example $x$ is chosen randomly to be $+1$ with probability $(1 + e^{-2f(x)})^{-1}$. The loss suffered then is the probability that the randomly chosen predicted label disagrees with the correct label $y$. Let $p = (1 + e^{-2f(x)})^{-1}$. Then the loss is $p$ if $y = -1$ and $1 - p$ if $y = +1$. Using a simple algebraic manipulation, the loss can be shown to be $(1 + e^{2 y f(x)})^{-1}$. So for this variant of AdaBoost, we set $L(z) = (1 + e^{2z})^{-1}$. However, in this case, note that the learning algorithm is not directly attempting to minimize this loss (it is instead minimizing the exponential loss described above).

Regression. There are various algorithms, such as neural networks and least squares regression, that attempt to minimize the squared error loss function $(y - f(x))^2$. When the $y$'s are in $\{-1, +1\}$, this function can be rewritten as

\[ (y - f(x))^2 = y^2 (y - f(x))^2 = (1 - y f(x))^2. \]

Thus, for binary problems, minimizing squared error fits our framework where $L(z) = (1 - z)^2$.

Logistic regression. In logistic regression and related methods such as Iterative Scaling (Csiszar & Tusnady, 1984; Della Pietra, Della Pietra, & Lafferty, 1997; Lafferty, 1999), and LogitBoost (Friedman et al., 2000), one posits a logistic model for estimating the conditional probability of a positive label:

\[ \Pr[y = +1 \mid x] = \frac{1}{1 + e^{-2f(x)}}. \]

One then attempts to maximize the likelihood of the labels in the sample, or equivalently, to minimize the log loss

\[ -\log\bigl(\Pr[y \mid x]\bigr) = \log\bigl(1 + e^{-2 y f(x)}\bigr). \]

Thus, for logistic regression and related methods, we take $L(z) = \log(1 + e^{-2z})$.

Decision trees. The most popular decision tree algorithms can also be naturally linked to loss functions. For instance, Quinlan's C4.5 (1993), in its simplest form, for binary classification problems, splits decision nodes in a manner to greedily minimize

\[ \sum_{j} n_j \left( p_{+|j} \ln\frac{1}{p_{+|j}} + p_{-|j} \ln\frac{1}{p_{-|j}} \right) \tag{2} \]

where $n_j$ is the fraction of examples reaching leaf $j$ and $p_{+|j}$, $p_{-|j}$ are the fractions of positive and negative examples reaching leaf $j$, respectively. The prediction at leaf $j$ is then the majority label, $\mathrm{sign}(p_{+|j} - p_{-|j})$. Viewed differently, imagine a decision tree that instead outputs a real number $f_j$ at each leaf with the intention of performing logistic regression as above. Then the empirical loss associated with logistic regression is

\[ \sum_{j} n_j \left( p_{+|j} \log\bigl(1 + e^{-2f_j}\bigr) + p_{-|j} \log\bigl(1 + e^{2f_j}\bigr) \right). \]

This is minimized, over choices of $f_j$, when $f_j = \frac{1}{2}\ln(p_{+|j}/p_{-|j})$. Plugging in this choice gives exactly Eq. (2), and thresholding $f_j$ gives the hard prediction rule used earlier. Thus, C4.5, in this simple form, can be viewed as a margin-based learning algorithm that is naturally linked to the loss function used in logistic regression. By similar reasoning, CART (Breiman et al., 1984), which splits using the Gini index, can be linked to the square loss function, while Kearns and Mansour's (1996) splitting rule can be linked to the exponential loss used by AdaBoost.

The analysis we present in the next section might also hold for other algorithms that tacitly employ a function of the margin. For instance, Freund's BrownBoost algorithm (1999) implicitly uses an instance potential function that satisfies the condition we impose on $L$. Therefore, it can also be combined with output coding and used to solve multiclass problems. To conclude this section, we plot in Figure 1 some of the loss functions discussed above.

3. Output Coding for Multiclass Problems

In the last section, we discussed margin-based algorithms for learning binary problems. Suppose now that we are faced with a multiclass learning problem in which each label $y$ is chosen from a set $\mathcal{Y}$ of cardinality $k > 2$. How can a binary margin-based learning algorithm be modified to handle a $k$-class problem?
Figure 1: Some of the margin-based loss functions discussed in the paper: the exponential loss used by AdaBoost (top left); the square loss used in least-squares regression (top right); the "hinge" loss used by support-vector machines (bottom left); and the logistic loss used in logistic regression (bottom right). Each panel also plots $L(-z)$, the average $\frac{1}{2}(L(z) + L(-z))$, and the constant $L(0)$.

Several solutions have been proposed for this question. Many involve reducing the multiclass problem, in one way or another, to a set of binary problems. For instance, perhaps the simplest approach is to create one binary problem for each of the $k$ classes. That is, for each $r \in \mathcal{Y}$, we apply the given margin-based learning algorithm to a binary problem in which all examples labeled $y = r$ are considered positive examples and all other examples are considered negative examples. We then end up with $k$ hypotheses that somehow must be combined. We call this the one-against-all approach.

Another approach, suggested by Hastie and Tibshirani (1998), is to use the given binary learning algorithm to distinguish each pair of classes. Thus, for each distinct pair $r_1, r_2 \in \mathcal{Y}$, we run the learning algorithm on a binary problem in which examples labeled $y = r_1$ are considered positive, and those labeled $y = r_2$ are negative. All other examples are simply ignored. Again, the hypotheses that are generated by this process must then be combined. We call this the all-pairs approach.

A more general suggestion on handling multiclass problems was given by Dietterich and Bakiri (1995). Their idea is to associate each class $r$ with a row of a "coding matrix" $M \in \{-1, +1\}^{k \times \ell}$. The binary learning algorithm is then run once for each column of the matrix on the induced binary problem in which the label of each example labeled $y$ is mapped to $M(y, s)$. This yields $\ell$ hypotheses $f_s$. Given an example $x$, we then predict the label $r$ for which row $r$ of matrix $M$ is "closest" to the predictions $(f_1(x), \ldots, f_\ell(x))$. This is the method of error-correcting output codes (ECOC).

In this section, we propose a unifying generalization of all three of these methods applicable to any margin-based learning algorithm. This generalization is closest to the ECOC approach of Dietterich and Bakiri (1995) but differs in that the coding matrix is taken from the larger set $\{-1, 0, +1\}^{k \times \ell}$. That is, some of the entries $M(r, s)$ may be zero, indicating that we don't care how hypothesis $f_s$ categorizes examples with label $r$.

Thus, our scheme for learning multiclass problems using a binary margin-based learning algorithm $A$ works as follows. We begin with a given coding matrix $M \in \{-1, 0, +1\}^{k \times \ell}$. For $s = 1, \ldots, \ell$, the learning algorithm $A$ is provided with labeled data of the form $(x_i, M(y_i, s))$ for all examples $i$ in the training set but omitting all examples for which $M(y_i, s) = 0$. The algorithm $A$ uses this data to generate a hypothesis $f_s : \mathcal{X} \to \mathbb{R}$.

For example, for the one-against-all approach, $M$ is a $k \times k$ matrix in which all diagonal elements are $+1$ and all other elements are $-1$. For the all-pairs approach, $M$ is a $k \times \binom{k}{2}$ matrix in which each column corresponds to a distinct pair $(r_1, r_2)$. For this column, $M$ has $+1$ in row $r_1$, $-1$ in row $r_2$, and zeros in all other rows.

As an alternative to calling $A$ repeatedly, in some cases, we may instead wish to add the column index $s$ as a distinguished attribute of the instances received by $A$, and then learn a single hypothesis on this larger learning problem rather than $\ell$ hypotheses on smaller problems. That is, we provide $A$ with instances of the form $((x_i, s), M(y_i, s))$ for all training examples $i$ and all columns $s$. Algorithm $A$ then produces a single hypothesis $f : \mathcal{X} \times \{1, \ldots, \ell\} \to \mathbb{R}$. However, for consistency with the preceding approach, we define $f_s(x)$ to be $f(x, s)$. We call these two approaches (in which $A$ is called repeatedly or only once) the multi-call and single-call variants, respectively.

We note in passing that there are no fundamental differences between the single and multi-call variants. Most previous work on output coding employed the multi-call variant due to its simplicity. The single-call variant becomes handy when an implementation of a classification learning algorithm that outputs a single hypothesis of the form $f : \mathcal{X} \times \{1, \ldots, \ell\} \to \mathbb{R}$ is available. We describe experiments with both variants in Section 6.
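For concreteness, the one-against-all and all-pairs coding matrices over $\{-1, 0, +1\}$ described above can be constructed as in the short sketch below (ours, not from the paper; NumPy is used only for convenience).

```python
import numpy as np
from itertools import combinations

def one_against_all_code(k):
    """k x k matrix: +1 on the diagonal, -1 everywhere else."""
    M = -np.ones((k, k), dtype=int)
    np.fill_diagonal(M, 1)
    return M

def all_pairs_code(k):
    """k x C(k,2) matrix: one column per pair (r1, r2); zeros in all other rows."""
    pairs = list(combinations(range(k), 2))
    M = np.zeros((k, len(pairs)), dtype=int)
    for s, (r1, r2) in enumerate(pairs):
        M[r1, s] = +1
        M[r2, s] = -1
    return M

print(one_against_all_code(4))
print(all_pairs_code(4))
```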
For either variant, the algorithm $A$ attempts to minimize the loss $L$ on the induced binary problem(s). Recall that $L$ is a function of the margin of an example, so the loss of $f_s$ on an example $x_i$ with induced label $M(y_i, s) \in \{-1, +1\}$ is $L(M(y_i, s) f_s(x_i))$. When $M(y_i, s) = 0$, we want to entirely ignore the hypothesis $f_s$ in computing the loss. We can define the loss to be any constant in this case, so, for convenience, we choose the loss to be $L(0)$ so that the loss associated with example $i$ and column $s$ is $L(M(y_i, s) f_s(x_i))$ in all cases. Thus, the average loss over all choices of $s$ and all examples $i$ is

\[ \frac{1}{m\ell} \sum_{i=1}^{m} \sum_{s=1}^{\ell} L\bigl(M(y_i, s) f_s(x_i)\bigr). \tag{3} \]

We call this the average binary loss of the hypotheses $f_s$ on the given training set with respect to coding matrix $M$ and loss $L$. It is the quantity that the calls to $A$ have the implicit intention of minimizing. We will see in the next section how this quantity relates to the misclassification error of the final classifier that we build on the original multiclass training set.

Let $M(r)$ denote row $r$ of $M$ and let $f(x)$ be the vector of predictions on an instance $x$: $f(x) = (f_1(x), \ldots, f_\ell(x))$.

Given the predictions of the $f_s$'s on a test point $x$, which of the $k$ labels in $\mathcal{Y}$ should be predicted? While several methods of combining the $f_s$'s can be devised, in this paper, we focus on two that are very simple to implement and for which we can analyze the empirical risk of the original multiclass problem. The basic idea of both methods is to predict with the label $r$ whose row $M(r)$ is "closest" to the predictions $f(x)$. In other words, predict the label $r$ that minimizes $d(M(r), f(x))$ for some distance $d$. This formulation begs the question, however, of how we measure distance between the two vectors.

One way of doing this is to count up the number of positions $s$ in which the sign of the prediction $f_s(x)$ differs from the matrix entry $M(r, s)$. Formally, this means our distance measure is

\[ d_H\bigl(M(r), f(x)\bigr) = \sum_{s=1}^{\ell} \frac{1 - \mathrm{sign}\bigl(M(r, s) f_s(x)\bigr)}{2} \tag{4} \]

where $\mathrm{sign}(z)$ is $+1$ if $z > 0$, $-1$ if $z < 0$, and $0$ if $z = 0$. This is essentially like computing the Hamming distance between row $M(r)$ and the signs of the $f_s(x)$'s. However, note that if either $M(r, s)$ or $f_s(x)$ is zero then that component contributes $1/2$ to the sum. For an instance $x$ and a matrix $M$, the predicted label $\hat{y} \in \{1, \ldots, k\}$ is therefore

\[ \hat{y} = \arg\min_{r} d_H\bigl(M(r), f(x)\bigr). \]

We call this method of combining the $f_s$'s Hamming decoding.

A disadvantage of this method is that it ignores entirely the magnitude of the predictions which can often be an indication of a level of "confidence." Our second method for combining predictions takes this potentially useful information into account, as well as the relevant loss function $L$ which is ignored with Hamming decoding. The idea is to choose the label $r$ that is most consistent with the predictions $f_s(x)$ in the sense that, if example $x$ were labeled $r$, the total loss on example $(x, r)$ would be minimized over choices of $r \in \mathcal{Y}$. Formally, this means that our distance measure is the total loss on a proposed example $(x, r)$:

\[ d_L\bigl(M(r), f(x)\bigr) = \sum_{s=1}^{\ell} L\bigl(M(r, s) f_s(x)\bigr). \tag{5} \]

Analogous to Hamming decoding, the predicted label $\hat{y} \in \{1, \ldots, k\}$ is

\[ \hat{y} = \arg\min_{r} d_L\bigl(M(r), f(x)\bigr). \]

We call this approach loss-based decoding. An illustration of the two decoding methods is given in Figure 2. The figure shows the decoding process for a small example problem; for clarity we denote in the figure the entries of the output code matrix by $-$, $+$ and $0$ (instead of $-1$, $+1$ and $0$). Note that in the example, the predicted class of the loss-based decoding (which, in this case, uses the exponential loss) is different than that of the Hamming decoding. We note in passing that the loss-based decoding method for log-loss is the well known and widely used maximum-likelihood decoding which was studied briefly in the context of ECOC by Guruswami and Sahai (1999).
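Both decoding rules follow directly from Eqs. (4) and (5); the sketch below (ours, with made-up prediction values) implements them for an arbitrary coding matrix and real-valued predictions, with the exponential loss as an example choice for loss-based decoding.

```python
import numpy as np

def hamming_decode(M, f_x):
    """Predict argmin_r sum_s (1 - sign(M[r,s] * f_s(x))) / 2  (Eq. 4)."""
    d = np.sum((1.0 - np.sign(M * f_x)) / 2.0, axis=1)
    return int(np.argmin(d))

def loss_based_decode(M, f_x, loss=lambda z: np.exp(-z)):
    """Predict argmin_r sum_s L(M[r,s] * f_s(x))  (Eq. 5), exponential loss by default."""
    d = np.sum(loss(M * f_x), axis=1)
    return int(np.argmin(d))

# Usage with a one-against-all code for 3 classes and hypothetical outputs f_s(x).
M = np.array([[+1, -1, -1],
              [-1, +1, -1],
              [-1, -1, +1]])
f_x = np.array([0.2, -1.5, 0.1])
print(hamming_decode(M, f_x), loss_based_decode(M, f_x))
```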
Figure 2: An illustration of the multiclass prediction procedure for Hamming decoding (top) and loss-based decoding (bottom). The exponential function was used for the loss-based decoding.

4. Analysis of the Training Error

In this section, we analyze the training error of the output coding methods described in the last section. Specifically, we upper bound the training error of the two decoding methods in terms of the average binary loss as defined in Eq. (3), as well as a measure of the minimum distance between any pair of rows of the coding matrix. Here, we use a simple generalization of the Hamming distance for vectors over the set $\{-1, 0, +1\}$. Specifically, we define the distance between two rows $u, v \in \{-1, 0, +1\}^{\ell}$ to be

\[ \Delta(u, v) = \sum_{s=1}^{\ell} \frac{1 - u_s v_s}{2} = \frac{\ell - u \cdot v}{2}. \]

Our analysis then depends on the minimum distance between pairs of distinct rows:

\[ \rho = \min\bigl\{ \Delta(M(r_1), M(r_2)) : r_1 \ne r_2 \bigr\}. \tag{6} \]

For example, for the one-against-all code, $\rho = 2$. For the all-pairs code, $\rho = \ell/2 + 1/2$, since every two rows have exactly one component with opposite signs (contributing 1 to the sum) and for the rest at least one component of the two is zero (each contributing 1/2). For a random matrix with components chosen uniformly over $\{-1, +1\}$, the expected value of $\Delta$ for any distinct pair of rows is exactly $\ell/2$.

Intuitively, the larger $\rho$, the more likely it is that decoding will "correct" for errors made by individual hypotheses. This was Dietterich and Bakiri's (1995) insight in suggesting the use of output codes with error-correcting properties. This intuition is reflected in our analysis in which a larger value of $\rho$ gives a better upper bound on the training error. In particular, Theorem 1 states that the training error is at most $\ell/\rho$ times worse than the average binary loss of the combined hypotheses (after scaling the loss by $L(0)$). For the one-against-all matrix, $\ell = k$ and $\rho = 2$, so $\ell/\rho = k/2$, which can be large if the number of classes is large. On the other hand, for the all-pairs matrix or for a random matrix, $\ell/\rho$ is close to the constant 2, independent of $k$.

We begin with an analysis of loss-based decoding. An analysis of Hamming decoding will follow as a corollary. Concerning the loss $L$, our analysis assumes only that

\[ \frac{L(z) + L(-z)}{2} \ge L(0) > 0 \tag{7} \]

for all $z \in \mathbb{R}$. Note that this property holds if $L$ is convex, although convexity is by no means a necessary condition. Note also that all of the loss functions in Section 2 satisfy this property. The property is illustrated in Figure 1 for four of the loss functions discussed in that section.

Theorem 1. Let $\bar{\varepsilon}$ be the average binary loss (as defined in Eq. (3)) of hypotheses $f_1, \ldots, f_\ell$ on a given training set $(x_1, y_1), \ldots, (x_m, y_m)$ with respect to the coding matrix $M \in \{-1, 0, +1\}^{k \times \ell}$ and loss $L$, where $k$ is the cardinality of the label set $\mathcal{Y}$. Let $\rho$ be as in Eq. (6). Assume that $L$ satisfies Eq. (7) for all $z$. Then the training error using loss-based decoding is at most

\[ \frac{\ell\, \bar{\varepsilon}}{\rho\, L(0)}. \]

Proof: Suppose that loss-based decoding incorrectly classifies an example $(x, y)$. Then there is some label $r \ne y$ for which

\[ \sum_{s=1}^{\ell} L\bigl(M(r, s) f_s(x)\bigr) \le \sum_{s=1}^{\ell} L\bigl(M(y, s) f_s(x)\bigr). \tag{8} \]

Let $S = \{s : M(y, s) = -M(r, s) \ne 0\}$ be the set of columns in which rows $y$ and $r$ differ and are both non-zero, and let $S_0 = \{s : M(y, s) = 0 \text{ or } M(r, s) = 0\}$ be the set of columns in which either row $y$ or row $r$ is zero. Let $z_s = M(y, s) f_s(x)$ and $z'_s = M(r, s) f_s(x)$; note that $z_s = z'_s$ for $s \notin S \cup S_0$. Then Eq. (8), together with the nonnegativity of $L$, implies

\[ \sum_{s=1}^{\ell} L(z_s) \ge \frac{1}{2} \sum_{s \in S \cup S_0} \bigl( L(z_s) + L(z'_s) \bigr). \tag{9} \]

For $s \in S$, $z'_s = -z_s$ and, by assumption (7), $\frac{1}{2}(L(z_s) + L(-z_s)) \ge L(0)$. Thus, the contribution of $S$ to the right-hand side of Eq. (9) is at least $|S| L(0)$. If $s \in S_0$, then either $z_s = 0$ or $z'_s = 0$; either case implies that $L(z_s) + L(z'_s) \ge L(0)$, so the contribution of $S_0$ is at least $|S_0| L(0)/2$. Therefore, Eq. (9) is at least

\[ \left( |S| + \frac{|S_0|}{2} \right) L(0) = \Delta\bigl(M(y), M(r)\bigr) L(0) \ge \rho L(0). \]

In other words, a mistake on training example $i$ implies that $\sum_{s=1}^{\ell} L(M(y_i, s) f_s(x_i)) \ge \rho L(0)$, so the number of training mistakes is at most

\[ \frac{1}{\rho L(0)} \sum_{i=1}^{m} \sum_{s=1}^{\ell} L\bigl(M(y_i, s) f_s(x_i)\bigr) = \frac{m \ell \bar{\varepsilon}}{\rho L(0)} \]

and the training error is at most $\ell \bar{\varepsilon} / (\rho L(0))$ as claimed.
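The distance $\Delta$ and the quantity $\rho$ of Eq. (6) are straightforward to compute for any coding matrix; the following sketch (ours) does so and can be used to verify, for example, that $\rho = 2$ for the one-against-all code.

```python
import numpy as np
from itertools import combinations

def row_distance(u, v):
    """Delta(u, v) = sum_s (1 - u_s v_s) / 2 for rows over {-1, 0, +1}."""
    return np.sum((1.0 - u * v) / 2.0)

def min_row_distance(M):
    """rho = minimum Delta over all pairs of distinct rows (Eq. 6)."""
    return min(row_distance(M[r1], M[r2])
               for r1, r2 in combinations(range(M.shape[0]), 2))

M = np.array([[+1, -1, -1, -1],
              [-1, +1, -1, -1],
              [-1, -1, +1, -1],
              [-1, -1, -1, +1]])   # one-against-all code for k = 4
print(min_row_distance(M))         # prints 2.0
```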
As a corollary, we can give a similar but weaker theorem for Hamming decoding. Note that we use a different assumption about the loss function, but one that also holds for all of the loss functions described in Section 2.

Corollary 2. Let $f_1, \ldots, f_\ell$ be a set of hypotheses on a training set $(x_1, y_1), \ldots, (x_m, y_m)$, and let $M \in \{-1, 0, +1\}^{k \times \ell}$ be a coding matrix where $k$ is the cardinality of the label set $\mathcal{Y}$. Let $\rho$ be as in Eq. (6). Then the training error using Hamming decoding is at most

\[ \frac{1}{\rho m} \sum_{i=1}^{m} \sum_{s=1}^{\ell} \Bigl( 1 - \mathrm{sign}\bigl(M(y_i, s) f_s(x_i)\bigr) \Bigr). \tag{10} \]

Moreover, if $L$ is a loss function satisfying $L(z) \ge L(0) > 0$ for all $z \le 0$, and $\bar{\varepsilon}$ is the average binary loss with respect to this loss function, then the training error using Hamming decoding is at most

\[ \frac{2 \ell \bar{\varepsilon}}{\rho L(0)}. \tag{11} \]

Proof: Consider the loss function $H(z) = (1 - \mathrm{sign}(z))/2$. From Eqs. (4) and (5), it is clear that Hamming decoding is equivalent to loss-based decoding using this loss function. Moreover, $H$ satisfies Eq. (7) for all $z$ with $H(0) = 1/2$, so we can apply Theorem 1 to get an upper bound on the training error of

\[ \frac{\ell \bar{\varepsilon}_H}{\rho H(0)} = \frac{2}{\rho m} \sum_{i=1}^{m} \sum_{s=1}^{\ell} \frac{1 - \mathrm{sign}\bigl(M(y_i, s) f_s(x_i)\bigr)}{2} \tag{12} \]

which equals Eq. (10). For the second part, note that $1 - \mathrm{sign}(z) \le 2 L(z)/L(0)$ if $z \le 0$ (by the stated assumption on $L$), and $1 - \mathrm{sign}(z) = 0 \le 2 L(z)/L(0)$ if $z > 0$ (since $L$ is nonnegative). This implies that Eq. (12) is bounded above by Eq. (11).

Theorem 1 and Corollary 2 are broad generalizations of similar results proved by Schapire and Singer (1999) in a much more specialized setting involving only AdaBoost. Also, Corollary 2 generalizes some of the results of Guruswami and Sahai (1999) that bound the multiclass training error in terms of the training (misclassification) error rates of the binary classifiers.

The bounds of Theorem 1 and Corollary 2 depend implicitly on the fraction of zero entries in the matrix $M$. Intuitively, the more zeros there are, the more examples that are ignored and the harder it should be to drive down the training error. At an extreme, if $M$ is all zeros, then $\rho$ is fairly large ($\rho = \ell/2$) but learning certainly should not be possible. To make this dependence explicit, let $T = \{(i, s) : M(y_i, s) = 0\}$ be the set of pairs $(i, s)$ inducing examples that are ignored during learning. Let $q = |T|/(m\ell)$ be the fraction of ignored pairs. Let $\bar{\varepsilon}'$ be the average binary loss restricted to the pairs not ignored during training:

\[ \bar{\varepsilon}' = \frac{1}{(1-q)\,m\ell} \sum_{(i,s) \notin T} L\bigl(M(y_i, s) f_s(x_i)\bigr). \]

Then the bound in Theorem 1 can be rewritten

\[ \frac{\ell}{\rho L(0)} \bigl( q\, L(0) + (1-q)\, \bar{\varepsilon}' \bigr) = \frac{\ell}{\rho} \left( q + (1-q)\, \frac{\bar{\varepsilon}'}{L(0)} \right). \]

Similarly, let $\delta$ be the fraction of misclassification errors made on the pairs not ignored:

\[ \delta = \frac{1}{(1-q)\,m\ell} \sum_{(i,s) \notin T} \bigl[\bigl[\, M(y_i, s) \ne \mathrm{sign}(f_s(x_i)) \,\bigr]\bigr]. \]

The first part of Corollary 2 then implies that the training error using Hamming decoding is bounded above by

\[ \frac{\ell}{\rho} \bigl( q + 2(1-q)\,\delta \bigr). \]

We see from these bounds that there are many trade-offs in the design of the coding matrix $M$. On the one hand, we want the rows to be far apart so that $\rho$ will be large, and we also want there to be few zero entries so that $q$ will be small. On the other hand, attempting to make $\rho$ large and $q$ small may produce binary problems that are difficult to learn, yielding large (restricted) average binary loss.
5. Analysis of Generalization Error for Boosting with Loss-based Decoding

The previous section considered only the training error using output codes. In this section, we take up the more difficult task of analyzing the generalization error. Because of the difficulty of obtaining such results, we do not have the kind of general results obtained for training error which apply to a broad class of loss functions. Instead, we focus only on the generalization error of using AdaBoost with output coding and loss-based decoding. Specifically, we show how the margin-theoretic analysis of Schapire et al. (1998) can be extended to this more complicated algorithm.

Briefly, Schapire et al.'s analysis was proposed as a means of explaining the empirically observed tendency of AdaBoost to resist overfitting. Their theory was based on the notion of an example's margin which, informally, measures the confidence in the prediction made by a classifier on that example. They then gave a two-part analysis of AdaBoost: First, they proved a bound on the generalization error in terms of the margins of the training examples, a bound that is independent of the number of base hypotheses combined, and a bound suggesting that larger margins imply lower generalization error. In the second part of their analysis, they proved that AdaBoost tends to aggressively increase the margins of the training examples.

In this section, we give counterparts of these two parts of their analysis for the combination of AdaBoost with loss-based decoding. We also assume that the single-call variant is used as described in Section 3. The result is essentially the AdaBoost.MO algorithm of Schapire and Singer (1999), specifically, what they called Variant 2.

This algorithm works as follows. We assume that a coding matrix $M$ is given. The algorithm works in rounds, repeatedly calling the base learning algorithm to obtain a base hypothesis. On each round $t = 1, \ldots, T$, the algorithm computes a distribution $D_t$ over pairs of training examples and columns of the matrix $M$, i.e., over the set $\{1, \ldots, m\} \times \{1, \ldots, \ell\}$. The base learning algorithm uses the training data (with binary labels as encoded using $M$) and the distribution $D_t$ to obtain a base hypothesis $h_t : \mathcal{X} \times \{1, \ldots, \ell\} \to \{-1, +1\}$. (In general, $h_t$'s range may be all of $\mathbb{R}$, but here, for simplicity, we assume that $h_t$ is binary valued.) The error $\varepsilon_t$ of $h_t$ is the probability with respect to $D_t$ of misclassifying one of the examples. That is,

\[ \varepsilon_t = \Pr_{(i,s) \sim D_t}\bigl[ M(y_i, s) \ne h_t(x_i, s) \bigr] = \sum_{i=1}^{m} \sum_{s=1}^{\ell} D_t(i, s)\, \bigl[\bigl[\, M(y_i, s) \ne h_t(x_i, s) \,\bigr]\bigr]. \]

The distribution is then updated using the rule

\[ D_{t+1}(i, s) = \frac{D_t(i, s)\, \exp\bigl(-\alpha_t M(y_i, s)\, h_t(x_i, s)\bigr)}{Z_t} \tag{13} \]

where $\alpha_t = \frac{1}{2}\ln\bigl((1 - \varepsilon_t)/\varepsilon_t\bigr)$ (which is nonnegative, assuming, as we do, that $\varepsilon_t \le 1/2$), and $Z_t$ is a normalization constant ensuring that $D_{t+1}$ is a distribution. It is straightforward to show that

\[ Z_t = 2\sqrt{\varepsilon_t (1 - \varepsilon_t)}. \tag{14} \]

(The initial distribution is chosen to be uniform so that $D_1(i, s) = 1/(m\ell)$.) After $T$ rounds, this procedure outputs a final classifier $H$ which, because we are using loss-based decoding, is

\[ H(x) = \arg\min_{r \in \mathcal{Y}} \sum_{s=1}^{\ell} \exp\left( -M(r, s) \sum_{t=1}^{T} \alpha_t h_t(x, s) \right). \tag{15} \]

We begin our margin-theoretic analysis with a definition of the margin of this combined multiclass classifier. First, let

\[ \Lambda_{f,\eta}(x, y) = -\frac{1}{\eta} \ln\left( \frac{1}{\ell} \sum_{s=1}^{\ell} e^{-\eta M(y, s) f(x, s)} \right). \tag{16} \]

If we let

\[ f(x, s) = \frac{\sum_{t=1}^{T} \alpha_t h_t(x, s)}{\sum_{t=1}^{T} \alpha_t} \tag{17} \]

and $\eta = \sum_{t=1}^{T} \alpha_t$, we can then rewrite Eq. (15) as

\[ H(x) = \arg\max_{r \in \mathcal{Y}} \Lambda_{f,\eta}(x, r). \tag{18} \]

Since we have transformed the argument of the minimum in Eq. (15) by a strictly decreasing function to arrive at Eq. (18), it is clear that we have not changed the definition of $H$. This rewriting has the effect of normalizing the argument of the maximum in Eq. (18) so that it is always in the range $[-1, +1]$. We can now define the margin for a labeled example $(x, y)$ to be the difference between the vote $\Lambda_{f,\eta}(x, y)$ given to the correct label, and the largest vote given to any other label. We denote the margin by $\mathrm{margin}_{f,\eta}(x, y)$. Formally,
\[ \mathrm{margin}_{f,\eta}(x, y) = \frac{1}{2} \left( \Lambda_{f,\eta}(x, y) - \max_{r \ne y} \Lambda_{f,\eta}(x, r) \right) \]

where the factor of $1/2$ simply ensures that the margin is in the range $[-1, +1]$. Note that the margin is positive if and only if $H$ correctly classifies example $(x, y)$.

Although this definition of margin is seemingly very different from the one given earlier in the paper for binary problems (which is the same as the one used by Schapire et al. in their comparatively simple context), we show next that maximizing training-example margins translates into a better bound on generalization error, independent of the number of rounds of boosting.

Let $\mathcal{H}$ be the base-hypothesis space of $\{-1,+1\}$-valued functions on $\mathcal{X} \times \{1, \ldots, \ell\}$. We let $\mathrm{co}(\mathcal{H})$ denote the convex hull of $\mathcal{H}$:

\[ \mathrm{co}(\mathcal{H}) = \left\{ f : f(x, s) = \sum_j a_j h_j(x, s) \;\middle|\; a_j \ge 0,\ \sum_j a_j = 1,\ h_j \in \mathcal{H} \right\} \]

where it is understood that each of the sums above is over a finite subset of hypotheses in $\mathcal{H}$. Thus, $f$ as defined in Eq. (17) belongs to $\mathrm{co}(\mathcal{H})$.

We assume that training examples are chosen i.i.d. from some distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. We write probability or expectation with respect to a random choice of an example $(x, y)$ according to $\mathcal{D}$ as $\Pr_{\mathcal{D}}[\cdot]$ and $\mathrm{E}_{\mathcal{D}}[\cdot]$. Similarly, probability and expectation with respect to an example chosen uniformly at random from the training set $S$ are denoted $\Pr_S[\cdot]$ and $\mathrm{E}_S[\cdot]$. We can now prove the first main theorem of this section which shows how the generalization error can be usefully bounded when most of the training examples have large margin. This is very similar to the results of Schapire et al. (1998) except for the fact that it applies to loss-based decoding for a general coding matrix $M$.

Theorem 3. Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$, and let $S$ be a sample of $m$ examples chosen independently at random according to $\mathcal{D}$. Suppose the base-classifier space $\mathcal{H}$ has VC-dimension $d$, and let $\delta > 0$. Assume that $m \ge d \ge 1$, and let $\ell$ be the number of columns in the coding matrix $M$. Then with probability at least $1 - \delta$ over the random choice of the training set $S$, every weighted average function $f \in \mathrm{co}(\mathcal{H})$ and every $\eta > 0$ satisfy the following bound for all $\theta > 0$:

\[ \Pr_{\mathcal{D}}\bigl[\mathrm{margin}_{f,\eta}(x, y) \le 0\bigr] \le \Pr_{S}\bigl[\mathrm{margin}_{f,\eta}(x, y) \le \theta\bigr] + O\!\left( \frac{1}{\sqrt{m}} \left( \frac{d \log^2(\ell m / d)}{\theta^2} + \log(1/\delta) \right)^{1/2} \right). \]

Proof: To prove the theorem, we will first need to define the notion of a sloppy cover, slightly specialized for our purposes. For a class $\mathcal{F}$ of real-valued functions over $\mathcal{X} \times \mathcal{Y}$, a training set $S$ of size $m$, and real numbers $\theta > 0$ and $\epsilon \ge 0$, we say that a function class $\hat{\mathcal{F}}$ is a $\theta$-sloppy $\epsilon$-cover of $\mathcal{F}$ with respect to $S$ if, for all $F \in \mathcal{F}$, there exists $\hat{F} \in \hat{\mathcal{F}}$ with $\Pr_S[\,|\hat{F}(x, y) - F(x, y)| > \theta\,] \le \epsilon$. Let $\mathcal{N}(\mathcal{F}, \theta, \epsilon, m)$ denote the maximum, over all training sets $S$ of size $m$, of the size of the smallest $\theta$-sloppy $\epsilon$-cover of $\mathcal{F}$ with respect to $S$.

Using techniques from Bartlett (1998), Schapire et al. (1998, Theorem 4) give a theorem which states that, for $\epsilon > 0$ and $\theta > 0$, the probability over the random choice of training set $S$ that there exists any function $F \in \mathcal{F}$ for which

\[ \Pr_{\mathcal{D}}[F(x, y) \le 0] > \Pr_{S}[F(x, y) \le \theta] + \epsilon \]

is at most

\[ 2\,\mathcal{N}(\mathcal{F}, \theta/2, \epsilon/8, 2m)\, e^{-\epsilon^2 m / 32}. \tag{19} \]

We prove Theorem 3 by applying this result to the family of functions

\[ \mathcal{F} = \bigl\{ \mathrm{margin}_{f,\eta} : f \in \mathrm{co}(\mathcal{H}),\ \eta > 0 \bigr\}. \]

To do so, we need to construct a relatively small set of functions that approximate all the functions in $\mathcal{F}$. We start with a lemma that implies that any function $\Lambda_{f,\eta}$ can be approximated by $\Lambda_{f,\hat\eta}$ for $\hat\eta$ in a small finite set.

Lemma 4. For all $\theta > 0$ there is a finite set $\Gamma_{\theta,\ell}$ of positive numbers, depending only on $\theta$ and $\ell$, such that for every $\eta > 0$ there exists $\hat\eta \in \Gamma_{\theta,\ell}$ for which, for all $f \in \mathrm{co}(\mathcal{H})$ and for all $(x, y)$,

\[ \bigl| \Lambda_{f,\hat\eta}(x, y) - \Lambda_{f,\eta}(x, y) \bigr| \le \theta. \]

Proof (sketch, following the original argument): Fix $z = (z_1, \ldots, z_\ell) \in [-1, +1]^{\ell}$ and write $\Lambda(\eta) = -\frac{1}{\eta}\ln\bigl(\frac{1}{\ell}\sum_s e^{-\eta z_s}\bigr)$, so that $\Lambda_{f,\eta}(x, y)$ has this form with $z_s = M(y, s) f(x, s)$. Let $p_s(\eta) \propto e^{-\eta z_s}$ be the associated Gibbs distribution and $H(p(\eta))$ its entropy. Differentiating,

\[ \frac{d\Lambda}{d\eta} = \frac{H(p(\eta)) - \ln\ell}{\eta^2} \qquad\text{and}\qquad \frac{d}{d\eta}\left[ \Lambda(\eta) - \frac{\ln\ell}{\eta} \right] = \frac{H(p(\eta))}{\eta^2}. \]

Since the entropy over $\ell$ symbols is between 0 and $\ln\ell$, the first derivative is nonpositive and the second is nonnegative. Hence, for $0 < \eta \le \eta'$,

\[ \Lambda(\eta') \le \Lambda(\eta) \le \Lambda(\eta') + \ln\ell \left( \frac{1}{\eta} - \frac{1}{\eta'} \right). \tag{20} \]

Thus, if the finite set is chosen so that consecutive values of $1/\hat\eta$ are at most $\theta/\ln\ell$ apart, every sufficiently large $\eta$ has a neighbor $\hat\eta$ in the set with $|\Lambda(\eta) - \Lambda(\hat\eta)| \le \theta$. It remains only to handle the case that $\eta$ is small. By Hoeffding's (1963) bound on the moment-generating function of a bounded random variable,

\[ \frac{1}{\ell}\sum_{s=1}^{\ell} e^{-\eta z_s} \le \exp\left( -\eta \bar{z} + \frac{\eta^2}{2} \right), \qquad \bar{z} = \frac{1}{\ell}\sum_{s=1}^{\ell} z_s, \]

so $\Lambda(\eta) \ge \bar{z} - \eta/2$. On the other hand, by Eq. (20) and l'Hopital's rule, $\lim_{\eta \to 0^+} \Lambda(\eta) = \bar{z}$, and since $\Lambda$ is nonincreasing, $\Lambda(\eta) \le \bar{z}$. Thus, for all $\eta$ below a threshold proportional to $\theta$, $\Lambda(\eta)$ is within $\theta$ of $\bar{z}$, so a single small value $\hat\eta$ in the set covers all such $\eta$. Since $\Lambda_{f,\eta}(x, r) \in [-1, +1]$, this completes the lemma.
Now let $S$ be a fixed subset of $\mathcal{X} \times \mathcal{Y}$ of size $m$. Because $\mathcal{H}$ has VC-dimension $d$, Sauer's lemma implies that there exists a subset $\hat{\mathcal{H}} \subseteq \mathcal{H}$ of cardinality at most $(e\ell m/d)^d$ that includes all behaviors of $\mathcal{H}$ on $S$; that is, for all $h \in \mathcal{H}$ there exists $\hat{h} \in \hat{\mathcal{H}}$ such that $\hat{h}(x, s) = h(x, s)$ for all $(x, y) \in S$ and all $s$. Now let

\[ \mathcal{C}_N = \left\{ g : g(x, s) = \frac{1}{N}\sum_{j=1}^{N} \hat{h}_j(x, s) \;\middle|\; \hat{h}_j \in \hat{\mathcal{H}} \right\} \]

be the set of unweighted averages of $N$ elements of $\hat{\mathcal{H}}$, and let

\[ \hat{\mathcal{F}} = \bigl\{ \mathrm{margin}_{g,\hat\eta} : g \in \mathcal{C}_N,\ \hat\eta \in \Gamma_{\theta,\ell} \bigr\}. \]

We will show that $\hat{\mathcal{F}}$ is a sloppy cover of $\mathcal{F}$. Given $f \in \mathrm{co}(\mathcal{H})$, we can write $f(x, s) = \sum_j a_j h_j(x, s)$; because we are only interested in the behavior of $f$ on points in $S$, we can assume without loss of generality that each $h_j \in \hat{\mathcal{H}}$.

Lemma 5. Suppose for some $g \in \mathcal{C}_N$ and some $(x, y)$, we have that $|g(x, s) - f(x, s)| \le \theta$ for all $s$. Let $\eta > 0$ and let $\hat\eta$ be as in Lemma 4. Then

\[ \bigl| \mathrm{margin}_{f,\eta}(x, y) - \mathrm{margin}_{g,\hat\eta}(x, y) \bigr| \le 2\theta. \]

Proof: For all $r$,

\[ \Lambda_{f,\hat\eta}(x, r) - \Lambda_{g,\hat\eta}(x, r) = \frac{1}{\hat\eta}\ln\frac{\sum_s \exp(-\hat\eta M(r, s) g(x, s))}{\sum_s \exp(-\hat\eta M(r, s) f(x, s))} \le \frac{1}{\hat\eta}\ln\max_s \exp\bigl(\hat\eta\, |g(x, s) - f(x, s)|\bigr) \le \theta, \]

where the first inequality uses the simple fact that $(\sum_s a_s)/(\sum_s b_s) \le \max_s a_s/b_s$ for positive $a_s, b_s$. By the symmetry of this argument, $|\Lambda_{f,\hat\eta}(x, r) - \Lambda_{g,\hat\eta}(x, r)| \le \theta$. Also, from Lemma 4, $|\Lambda_{f,\eta}(x, r) - \Lambda_{f,\hat\eta}(x, r)| \le \theta$ for all $r$. By definition of margin, this implies the lemma.

Recall that the coefficients $a_j$ are nonnegative and that they sum to one; in other words, they define a probability distribution over $\hat{\mathcal{H}}$. It will be useful to imagine sampling from this distribution. Specifically, suppose that $\hat{h}$ is chosen in this manner. Then for fixed $(x, s)$, $\hat{h}(x, s)$ is a $\{-1,+1\}$-valued random variable with expected value exactly $f(x, s)$. Consider now choosing $N$ such functions $\hat{h}_1, \ldots, \hat{h}_N$ independently at random, each according to this same distribution, and let $g = \frac{1}{N}\sum_j \hat{h}_j \in \mathcal{C}_N$. Let us denote by $Q$ the resulting distribution over functions in $\mathcal{C}_N$. Note that, for fixed $(x, s)$, $g(x, s)$ is the average of $N$ trials with expected value $f(x, s)$. For any $(x, y)$,

\[ \Pr_{g \sim Q}\bigl[\, |\mathrm{margin}_{f,\eta}(x, y) - \mathrm{margin}_{g,\hat\eta}(x, y)| > 2\theta \,\bigr] \le \Pr_{g \sim Q}\bigl[\, \exists s : |g(x, s) - f(x, s)| > \theta \,\bigr] \le 2\ell\, e^{-N\theta^2/2}. \]

These inequalities follow, respectively, from Lemma 5, and from the union bound combined with Hoeffding's inequality. Taking expectations over an example chosen uniformly from $S$ and exchanging the order of expectation, there must exist a fixed $g \in \mathcal{C}_N$ such that

\[ \Pr_{S}\bigl[\, |\mathrm{margin}_{f,\eta}(x, y) - \mathrm{margin}_{g,\hat\eta}(x, y)| > 2\theta \,\bigr] \le 2\ell\, e^{-N\theta^2/2}. \]

We have thus shown that $\hat{\mathcal{F}}$ is a $2\theta$-sloppy $(2\ell e^{-N\theta^2/2})$-cover of $\mathcal{F}$, so that

\[ \mathcal{N}\bigl(\mathcal{F}, 2\theta, 2\ell e^{-N\theta^2/2}, m\bigr) \le |\Gamma_{\theta,\ell}| \cdot |\hat{\mathcal{H}}|^{N} \le |\Gamma_{\theta,\ell}| \left( \frac{e\ell m}{d} \right)^{dN}. \]

Making the appropriate substitutions in Eq. (19), with $N$ chosen proportional to $(1/\theta^2)\log(\ell m)$ and $\epsilon$ set to the error term in the statement of the theorem, a straightforward (if tedious) calculation shows that the resulting probability of failure is at most $\delta$. We have thus proved the bound of the theorem for a single given choice of $\theta$ with high probability. To prove that the bound holds simultaneously for all $\theta > 0$, we can use exactly the same argument used by Schapire and Singer (1999) in the very last part of their Theorem 8.

Thus, we have shown that large training-set margins imply a better bound on the generalization error, independent of the number of rounds of boosting. We turn now to the second part of our analysis in which we prove that AdaBoost.MO tends to increase the margins of the training examples, assuming that the binary errors of the base hypotheses are bounded away from the trivial error rate of $1/2$ (see the discussion that follows the proof). The theorem that we prove below is a direct analog of Theorem 5 of Schapire et al. (1998) for binary AdaBoost. Note that we focus only on coding matrices that do not contain zeros. A slightly weaker result can be proved in the more general case.

Theorem 6. Suppose the base learning algorithm, when called by AdaBoost.MO using coding matrix $M \in \{-1, +1\}^{k \times \ell}$, generates hypotheses with weighted binary training errors $\varepsilon_1, \ldots, \varepsilon_T$. Let $\rho$ be as in Eq. (6). Then for any $\theta \ge 0$, we have that

\[ \Pr_{S}\bigl[\mathrm{margin}_{f,\eta}(x, y) \le \theta\bigr] \le \frac{\ell}{\rho} \prod_{t=1}^{T} 2\sqrt{\varepsilon_t^{\,1-2\theta}\,(1-\varepsilon_t)^{\,1+2\theta}} \]

where $f$ and $\eta$ are as in Eqs. (16) and (17).

Proof: Suppose, for some labeled example $(x, y)$, $\mathrm{margin}_{f,\eta}(x, y) \le \theta$. Then, by definition of margin, there exists a label $r \in \mathcal{Y} \setminus \{y\}$ for which

\[ \Lambda_{f,\eta}(x, y) - \Lambda_{f,\eta}(x, r) \le 2\theta, \tag{23} \]

that is,

\[ \sum_{s=1}^{\ell} e^{-\eta M(r, s) f(x, s)} \le e^{2\eta\theta} \sum_{s=1}^{\ell} e^{-\eta M(y, s) f(x, s)}. \]

Let $z_s = \eta M(y, s) f(x, s)$, $z'_s = \eta M(r, s) f(x, s)$, and let $S = \{s : M(y, s) \ne M(r, s)\}$. Then Eq. (23) implies

\[ \bigl(1 + e^{2\eta\theta}\bigr) \sum_{s=1}^{\ell} e^{-z_s} \ge \sum_{s=1}^{\ell} e^{-z_s} + \sum_{s=1}^{\ell} e^{-z'_s} \ge \sum_{s \in S} \bigl(e^{-z_s} + e^{z_s}\bigr) \ge 2|S| \ge 2\rho. \]

This is because, if $s \in S$, then $z'_s = -z_s$ (the matrix has no zeros), and because $e^{z} + e^{-z} \ge 2$ for all $z$. Therefore, since $1 + e^{2\eta\theta} \le 2 e^{2\eta\theta}$, a margin of at most $\theta$ on $(x, y)$ implies that $\sum_s e^{-\eta M(y, s) f(x, s)} \ge \rho\, e^{-2\eta\theta}$. Thus, the fraction of training examples for which $\mathrm{margin}_{f,\eta} \le \theta$ is at most

\[ \frac{e^{2\eta\theta}}{\rho m} \sum_{i=1}^{m} \sum_{s=1}^{\ell} \exp\left( -\sum_{t=1}^{T} \alpha_t M(y_i, s) h_t(x_i, s) \right) \tag{24} \]

\[ = \frac{\ell\, e^{2\eta\theta}}{\rho} \left( \prod_{t=1}^{T} Z_t \right) \sum_{i=1}^{m} \sum_{s=1}^{\ell} D_{T+1}(i, s) \tag{25} \]

\[ = \frac{\ell}{\rho}\, e^{2\eta\theta} \prod_{t=1}^{T} Z_t. \tag{26} \]

Here, Eq. (24) uses the definitions of $f$ and $\eta$ as in Eqs. (16) and (17) (so that $\eta f(x, s) = \sum_t \alpha_t h_t(x, s)$) together with $D_1(i, s) = 1/(m\ell)$. Eq. (25) uses the recursive definition of $D_t$ in Eq. (13), and Eq. (26) uses the fact that $D_{T+1}$ is a distribution. The theorem now follows by plugging in Eq. (14) and applying straightforward algebra, using $e^{2\eta\theta} = \prod_t e^{2\theta\alpha_t} = \prod_t \bigl((1-\varepsilon_t)/\varepsilon_t\bigr)^{\theta}$.
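The AdaBoost.MO procedure analyzed in this section (Eqs. (13)-(15)) is short to write down. The following sketch (ours, not the authors' implementation) assumes a coding matrix with no zero entries, as in Theorem 6, and treats `base_learner` as a placeholder for any weighted binary learner that returns a function $h(x, s) \in \{-1, +1\}$.

```python
import numpy as np

def adaboost_mo(X, y, M, base_learner, T=100):
    """Single-call AdaBoost.MO sketch: distribution update (Eq. 13) and
    loss-based decoding with the exponential loss (Eq. 15)."""
    m, (k, ell) = len(X), M.shape
    D = np.full((m, ell), 1.0 / (m * ell))       # D_1(i, s) = 1/(m*ell)
    hs, alphas = [], []
    for _ in range(T):
        h = base_learner(X, M, y, D)             # trained on pairs ((x_i, s), M[y_i, s])
        preds = np.array([[h(X[i], s) for s in range(ell)] for i in range(m)])
        agree = M[y] * preds                     # M(y_i, s) * h_t(x_i, s)
        eps = np.clip(np.sum(D * (agree < 0)), 1e-10, 0.5)   # weighted error eps_t
        alpha = 0.5 * np.log((1 - eps) / eps)    # alpha_t = (1/2) ln((1-eps)/eps)
        D *= np.exp(-alpha * agree)              # update of Eq. (13)
        D /= D.sum()                             # normalization Z_t
        hs.append(h); alphas.append(alpha)

    def predict(x):
        f = np.array([sum(a * h(x, s) for a, h in zip(alphas, hs)) for s in range(ell)])
        return int(np.argmin(np.sum(np.exp(-M * f), axis=1)))   # Eq. (15)
    return predict
```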
As noted by Schapire et al. (1998), this bound can be usefully understood if we assume that $\varepsilon_t \le 1/2 - \gamma$ for all $t$. Then the upper bound simplifies to

\[ \frac{\ell}{\rho} \left( \sqrt{(1 - 2\gamma)^{1-2\theta}\,(1 + 2\gamma)^{1+2\theta}} \right)^{T}. \]

If $\theta$ is sufficiently small relative to $\gamma$, then the expression inside the parentheses is smaller than one so that the fraction of training examples with margin below $\theta$ drops to zero exponentially fast in $T$.

Figure 3: Comparison of the average binary error, the multiclass error using Hamming decoding, and the multiclass error using loss-based decoding with AdaBoost on synthetic data, using a complete code (left) and the one-against-all code (right). (The horizontal axis is the number of classes, from 3 to 8; the vertical axis is the error in percent.)

6. Experiments

In this section, we describe and discuss experiments we have performed with synthetic data and with natural data from the UCI repository. We run both sets of experiments with two base-learners: AdaBoost and SVM. Two primary goals of the experiments are to compare Hamming decoding to loss-based decoding and to compare the performance of different output codes. We start with a description of an experiment with real-valued synthetic data which underscores the tradeoff between the complexity of the binary problems induced by a given output code and its error correcting properties.

In the experiments with synthetic data, we generated instances according to the normal distribution with zero mean and unit variance. To create a multiclass problem with $k$ classes, we set $k-1$ thresholds, denoted $\theta_1, \ldots, \theta_{k-1}$, where $\theta_1 < \cdots < \theta_{k-1}$. An instance $x$ is associated with class $j$ if and only if $\theta_{j-1} < x \le \theta_j$ (with $\theta_0 = -\infty$ and $\theta_k = \infty$). For each $k = 3, \ldots, 8$ we generated sets of examples, where each set was of a fixed size. The thresholds were set so that exactly the same number of examples would be associated with each class. Similarly, we generated the same number of test sets (of the same size), using the thresholds obtained for the training data to label the test data. We used AdaBoost as the base learner, set the number of rounds of boosting for each binary problem to a fixed value, and called AdaBoost repeatedly for each column (multi-call variant). For weak hypotheses, we used the set of all possible threshold functions; that is, a weak hypothesis labels an instance according to which side of its threshold the instance falls on.

In Figure 3, we plot the test error rate as a function of $k$ for two output coding matrices. The first is a complete code whose columns correspond to all the non-trivial partitions of the set $\{1, \ldots, k\}$ into two subsets; this code is a matrix of size $k \times (2^{k-1} - 1)$. The second code was the one-against-all code. For each code, we plot the average binary test error, the multiclass error using Hamming decoding, and the multiclass error using loss-based decoding. The graphs clearly show that Hamming decoding is inferior to loss-based decoding and yields much higher error rates. The multiclass errors of the two codes using loss-based decoding are comparable. While the multiclass error rate with the complete code is slightly lower than the error rate for the one-against-all code, the situation is reversed for the average binary errors. This phenomenon underscores the tradeoff between the redundancy and error-correcting properties of the output codes and the difficulty of the binary learning problems they induce.
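For concreteness, the synthetic data used in the experiment above can be generated along the following lines (our sketch; the sample sizes are placeholders, since the exact values are not restated here).

```python
import numpy as np

def make_threshold_data(k, m, rng):
    """Instances ~ N(0,1); class j iff theta_{j-1} < x <= theta_j, with thresholds
    chosen as empirical quantiles so that each class gets m/k training examples."""
    x = rng.normal(size=m)
    thresholds = np.quantile(x, np.linspace(0, 1, k + 1)[1:-1])  # theta_1..theta_{k-1}
    y = np.searchsorted(thresholds, x)            # label in {0, ..., k-1}
    return x, y, thresholds

rng = np.random.default_rng(0)
x_train, y_train, thetas = make_threshold_data(k=6, m=600, rng=rng)
# Test data are drawn the same way but labeled with the *training* thresholds.
x_test = rng.normal(size=600)
y_test = np.searchsorted(thetas, x_test)
```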
The complete code has good error correcting properties; the distance between each pair of rows is $2^{k-2}$. However, many of the binary problems that the complete code induces are difficult to learn. The distance between each pair of rows in the one-against-all code is small: $\rho = 2$. Hence, its empirical error bound according to Theorem 1 seems inferior. However, the binary problems it induces are simpler, and thus its average binary loss is lower than the average binary loss of the complete code, and the overall result is comparable performance.

Next, we describe experiments we performed with multiclass data from the UCI repository (Merz & Murphy, 1998). We used two different popular binary learners, AdaBoost and SVM. We chose the following datasets (ordered according to the number of classes): Dermatology, Satimage, Glass, Segmentation, E-coli, Pendigits, Yeast, Vowel, Soybean-large, Thyroid, Audiology, Isolet, Letter-recognition. The properties of the datasets are summarized in Table 1.

Table 1: Description of the datasets used in the experiments.

Problem        Train   Test   #Attributes  #Classes
dermatology      366     -        34          6
satimage        4435   2000       36          6
glass            214     -         9          7
segmentation    2310     -        19          7
ecoli            336     -         8          8
pendigits       7494   3498       16         10
yeast           1484     -         8         10
vowel            528    462       10         11
soybean          307    376       35         19
thyroid         9172     -        29         20
audiology        226     -        69         24
isolet          6238   1559      617         26
letter         16000   4000       16         26

In the SVM experiments, we skipped Audiology, Isolet, Letter-recognition, Segmentation, and Thyroid, because these datasets were either too big to be handled by our current implementation of SVM or contained many nominal features with missing values which are problematic for SVM. All datasets have at least six classes. When available, we used the original partition of the datasets into training and test sets. For the other datasets, we used cross-validation. For SVM, we used polynomial kernels (of a fixed degree) with the multi-call variant. For AdaBoost, we used decision stumps for base hypotheses. By modifying an existing package of AdaBoost.MH (Schapire & Singer, 1999) we were able to devise a simple implementation of the single-call variant that was described in Section 5. Summaries of the results using the different output codes described below are given in Tables 2 and 3.

We tested five different types of output codes: one-against-all, complete (in which there is one column for every possible non-trivial split of the classes), all-pairs, and two types of random codes. The first type of random code has $\lceil 10 \log_2 k \rceil$ columns for a problem with $k$ classes. Each element in the code was chosen uniformly at random from $\{-1, +1\}$. For brevity, we call these dense random codes. We generated a dense random code for each multiclass problem by examining 10,000 random codes and choosing the code that had the largest $\rho$ and did not have any identical columns. The second type of code, called a sparse code, was chosen at random from $\{-1, 0, +1\}$: each element in a sparse code is $0$ with probability $1/2$ and $-1$ or $+1$ with probability $1/4$ each. The sparse codes have $\lceil 15 \log_2 k \rceil$ columns. For each problem, we picked a code with a high value of $\rho$ by examining 10,000 random codes as before. However, now we also had to check that no code had a column or a row containing only zeros. Note that for some of the problems with many classes, we could not evaluate the complete and all-pairs codes since they were too large.
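The dense and sparse random codes described above can be generated by the kind of search sketched below (ours; the base-2 logarithm for the column counts and the helper names are our assumptions).

```python
import numpy as np
from itertools import combinations

def rho(M):
    """Minimum generalized Hamming distance between distinct rows (Eq. 6)."""
    return min(np.sum((1.0 - M[a] * M[b]) / 2.0)
               for a, b in combinations(range(M.shape[0]), 2))

def pick_random_code(k, n_cols, sparse, n_candidates=10000, seed=0):
    rng = np.random.default_rng(seed)
    best, best_rho = None, -1.0
    for _ in range(n_candidates):
        if sparse:   # entries are 0 w.p. 1/2 and -1 or +1 w.p. 1/4 each
            M = rng.choice([-1, 0, 0, 1], size=(k, n_cols))
        else:        # dense: entries uniform over {-1, +1}
            M = rng.choice([-1, 1], size=(k, n_cols))
        if len({tuple(c) for c in M.T}) < n_cols:          # identical columns
            continue
        if sparse and (np.any(np.all(M == 0, axis=0)) or np.any(np.all(M == 0, axis=1))):
            continue                                       # all-zero column or row
        r = rho(M)
        if r > best_rho:
            best, best_rho = M, r
    return best, best_rho

k = 10
dense_code, _ = pick_random_code(k, n_cols=int(np.ceil(10 * np.log2(k))), sparse=False)
sparse_code, _ = pick_random_code(k, n_cols=int(np.ceil(15 * np.log2(k))), sparse=True)
```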
Table 2: Results of experiments with output codes on datasets from the UCI repository using AdaBoost as the base binary learner. For each problem, five output codes were used and then evaluated (see text) with three decoding methods: Hamming decoding, loss-based decoding using AdaBoost with randomized predictions, and loss-based decoding using the exponential loss function (Exp.).

Hamming Decoding
Problem        One-vs-all  Complete  All-Pairs  Dense  Sparse
dermatology        5.0       4.2       3.1       3.9    3.6
satimage          14.9      12.3      11.7      12.3   13.2
glass             31.0      31.0      28.6      28.6   27.1
segmentation       0.0       0.1       0.0       0.1    0.1
ecoli             21.5      18.5      19.1      17.6   19.7
pendigits          8.9       8.6       3.0       9.3    6.2
yeast             44.7      41.9      42.5      43.9   49.5
vowel             67.3      59.3      50.2      62.6   54.5
soybean            8.2        -        9.0       5.6    8.0
thyroid            7.8        -         -       12.3    8.1
audiology         26.9        -       23.1      23.1   23.1
isolet             9.2        -         -       10.8   10.1
letter            27.7        -        7.8      30.9   27.1

Loss-based Decoding (randomized predictions)
dermatology        4.2       4.2       3.1       3.9    3.6
satimage          12.1      12.4      11.2      11.9   11.9
glass             26.7      31.0      27.1      27.1   26.2
segmentation       0.0       0.1       0.0       0.1    0.7
ecoli             17.3      17.6      18.8      18.5   19.1
pendigits          4.6       8.6       2.9       8.8    6.8
yeast             41.6      42.0      42.6      43.2   49.8
vowel             56.9      59.1      50.9      61.9   54.1
soybean            7.2        -        8.8       4.8    8.2
thyroid            6.5        -         -       12.0    8.0
audiology         19.2        -       23.1      19.2   23.1
isolet             5.3        -         -       10.1    9.8
letter            14.6        -        7.4      29.0   26.6

Loss-based Decoding (Exp.)
dermatology        4.2       3.9       3.1       4.2    3.1
satimage          12.1      12.3      11.4      12.0   12.0
glass             26.7      28.6      27.6      25.2   29.0
segmentation       0.0       0.0       0.0       0.0    0.0
ecoli             15.8      16.4      18.2      17.0   17.9
pendigits          4.6       7.2       2.9       8.1    4.8
yeast             41.6      42.1      42.3      43.0   49.3
vowel             56.9      54.1      51.7      60.0   49.8
soybean            7.2        -        8.8       4.8    5.6
thyroid            6.5        -         -       11.4    7.2
audiology         19.2        -       23.1      19.2   19.2
isolet             5.3        -         -        9.4    9.7
letter            14.6        -        7.1      28.3   22.3

Table 3: Results of experiments with output codes on datasets from the UCI repository using the support-vector machine (SVM) algorithm as the base binary learner. For each problem, five different output codes were tested and then evaluated with Hamming decoding and the appropriate loss-based decoding for SVM.

Hamming Decoding
Problem        One-vs-all  Complete  All-Pairs  Dense  Sparse
dermatology        4.2       3.6       3.1       3.6    2.5
satimage          40.9      14.3      50.4      15.0   27.4
glass             37.6      34.3      29.5      34.8   32.4
ecoli             15.8      14.2      13.9      15.2   14.2
pendigits          3.9       2.0      26.2       2.5    2.6
yeast             73.9      42.4      40.8      42.5   48.1
vowel             60.4      53.0      39.2      53.5   50.2
soybean           20.5        -        9.6       9.0    9.0

Loss-based Decoding
dermatology        3.3       3.6       3.6       3.9    3.1
satimage          40.9      13.9      27.8      14.3   13.3
glass             38.6      34.8      31.0      34.8   32.4
ecoli             16.1      13.6      13.3      14.8   14.8
pendigits          2.5       1.9       3.1       2.1    2.7
yeast             72.9      40.5      40.9      39.7   47.2
vowel             50.9      51.3      39.0      51.7   47.0
soybean           21.0        -       10.4       8.8    9.0

We compared Hamming decoding to loss-based decoding for each of the five families of codes. The results are plotted in Figures 4 and 5. Each of the tested UCI datasets is plotted as a bar in the figures where the height of the bar (possibly negative) is proportional to the test error rate of loss-based decoding minus the test error rate of Hamming decoding. The datasets are indexed by their number of classes and are plotted in the order listed above. We tested AdaBoost with two loss functions for decoding: the exponential loss (drawn in black) and the loss $(1 + e^{2z})^{-1}$ (drawn in gray) which is the result of using AdaBoost with randomized predictions.

Figure 4: Comparison of the test error using Hamming decoding and loss-based decoding when the binary learners are trained using AdaBoost, for the one-vs-all, complete, all-pairs, dense, and sparse codes. Two loss functions for decoding are plotted: the exponential loss (in black) and $(1 + e^{2z})^{-1}$, obtained when using AdaBoost with randomized predictions (in gray).
Figure 5: Comparison of the test error using Hamming decoding and loss-based decoding when the binary learner is support vector machines, for the one-vs-all, complete, all-pairs, dense, and sparse codes.

It is clear from the plots that loss-based decoding often gives better results than Hamming decoding for both SVM and AdaBoost. The difference is sometimes very significant. For instance, for the dataset Satimage with the all-pairs code, SVM achieves a 27.8% error rate with loss-based decoding while Hamming decoding results in an error rate of 50.4%. Similar results are obtained for AdaBoost. The difference is especially significant for the one-against-all and random dense codes. Note, however, that loss-based decoding based on AdaBoost with randomized predictions does not yield as good results as the straightforward use of loss-based decoding for AdaBoost with the exponential loss. This might be partially explained by the fact that AdaBoost with randomized predictions is not directly attempting to minimize the loss it uses for decoding.

To conclude the section, we discuss a set of experiments that compared the performance of the different output codes. In Figures 6 and 7, we plot the test error difference for pairs of codes using loss-based decoding with SVM and AdaBoost as the binary learners. Each plot consists of a matrix of bar graphs. The rows and columns correspond, in order, to the five coding methods, namely, one-against-all, all-pairs, complete, random dense and random sparse. The bar graph in row $i$ and column $j$ shows the difference between the test error of coding method $i$ minus the test error of coding method $j$ for the datasets tested. For SVM, it is clear that the widely used one-against-all code is inferior to all the other codes we tested. (Note that many of the bars in the top row of Figure 6 correspond to large positive values.) One-against-all often results in error rates that are much higher than the error rates of other codes. For instance, for the dataset Yeast, the one-against-all code has an error rate of 72.9% while the error rate of all the other codes is no more than 47.2% (random sparse) and can be as low as 39.7% (random dense). On the very few cases that the one-against-all code performs better than one of the other codes, the difference is not statistically significant. However, there is no clear winner among the four other output codes. For AdaBoost, none of the codes is persistently better than the others, and it seems that the best code to use is problem dependent. These results suggest that an interesting direction for research would be methods for designing problem-specific output codes. Some recent progress in this direction was made by Crammer and Singer (2000).

Figure 6: The difference between the test errors for pairs of error correcting matrices using support vector machines as the binary learners.

Acknowledgments

Most of the research on this paper was done while all three authors were at AT&T Labs. Thanks to the anonymous reviewers for their careful reading and helpful comments. Part of this research was supported by the Binational Science Foundation under grant number 1999038.
Figure 7: The difference between the test errors for pairs of error correcting matrices using AdaBoost as the binary learner.

References

Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2), 525-536.

Breiman, L. (1997a). Arcing the edge. Tech. rep. 486, Statistics Department, University of California at Berkeley.

Breiman, L. (1997b). Prediction games and arcing classifiers. Tech. rep. 504, Statistics Department, University of California at Berkeley.

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth & Brooks.

Collins, M., Schapire, R. E., & Singer, Y. (2000). Logistic regression, AdaBoost and Bregman distances. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.

Crammer, K., & Singer, Y. (2000). On the learnability and design of output codes for multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory.

Csiszar, I., & Tusnady, G. (1984). Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1, 205-237.

Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 380-393.

Dietterich, T. G., & Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263-286.

Freund, Y. (1999). An adaptive version of the boost by majority algorithm. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory.

Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.

Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2), 337-407.

Guruswami, V., & Sahai, A. (1999). Multiclass learning, boosting, and error-correcting codes. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory.

Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. The Annals of Statistics, 26(2), 451-471.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13-30.

Hoeffgen, K.-U., & Simon, H.-U. (1992). Robust trainability of single neurons. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory.

Kearns, M., & Mansour, Y. (1996). On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Twenty-Eighth Annual ACM Symposium on the Theory of Computing.

Lafferty, J. (1999). Additive models, boosting and inference for generalized divergences. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory.

Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Functional gradient techniques for combining hypotheses. In Smola, A. J., Bartlett, P. J., Schoelkopf, B., & Schuurmans, D. (Eds.), Advances in Large Margin Classifiers. MIT Press.

Merz, C. J., & Murphy, P. M. (1998). UCI repository of machine learning databases. www.ics.uci.edu/~mlearn/MLRepository.html.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

Raetsch, G., Onoda, T., & Mueller, K.-R. (to appear). Soft margins for AdaBoost. Machine Learning.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E., & McClelland, J. L. (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, chap. 8, pp. 318-362. MIT Press.

Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5), 1651-1686.

Schapire, R. E., & Singer, Y. (1999). Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3), 297-336.

Schoelkopf, B., Smola, A., Williamson, R., & Bartlett, P. (1998). New support vector algorithms. Tech. rep. NC2-TR-1998-053, NeuroCOLT2.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer.