Training Highly Multiclass Classifiers
Gupta, Bengio, and Weston

a form of curriculum learning, that is, of learning to distinguish easy classes before focusing on learning to distinguish hard classes (Bengio et al., 2009).

In this paper, we focus on classifiers that use a linear discriminant or a single prototypical feature vector to represent each class. Linear classifiers are a popular approach to highly multiclass problems because they are efficient in terms of memory and inference and can provide good performance (Perronnin et al., 2012; Lin et al., 2011; Sanchez and Perronnin, 2011). Class prototypes offer similar memory/efficiency advantages. The last layer of a deep belief network classifier is often a linear or soft-max discriminant function (Bengio, 2009), and the proposed ideas for adapting online loss functions should be applicable in that context as well.

We apply the proposed loss function adaptation to the multiclass linear classifier called Wsabie (Weston et al., 2011). We also simplify Wsabie's weighting of stochastic gradients, and employ a recent advance in automatic step-size adaptation called adagrad (Duchi et al., 2011). The resulting proposed Wsabie++ classifier almost doubles the classification accuracy on benchmark ImageNet datasets compared to Wsabie, and shows substantial gains over one-vs-all SVMs.

The rest of the article is as follows. After establishing notation in Section 2, we explain in Section 3 how different class confusabilities can distort the standard empirical loss. We then review loss functions for jointly training multiclass linear classifiers in Section 4, and stochastic gradient descent variants for large-scale learning in Section 5. In Section 6, we propose a practical online solution to adapt the empirical loss to account for the variance of class confusability. We describe our adagrad implementation in Section 7. Experiments are reported on benchmark and proprietary image classification datasets with 15,000-97,000 classes in Sections 8 and 9. We conclude with some notes about the key issues and unresolved questions.

2. Notation and Assumptions

We take as given a set of training data {(x_t, Y_t)} for t = 1, ..., n, where x_t ∈ R^d is a feature vector and Y_t ⊆ {1, 2, ..., G} is the subset of the G class labels that are known to be correct labels for x_t. For example, an image might be represented by a set of features x_t and have known labels Y_t = {dolphin, ocean, Half Moon Bay}. We assume a discriminant function f(x; θ_g) has been chosen with class-specific parameters θ_g for each class g = 1, ..., G. The class discriminant functions are used to classify a test sample x as the class label that solves

    argmax_g f(x; θ_g).    (1)

Most of this paper applies equally well to "learning to rank," in which case the output might be a top-ranked or ranked-and-thresholded list of classes for a test sample x. For simplicity, we restrict our discussion and metrics to the classification paradigm given by (1). Many of the ideas in this paper can be applied to any choice of discriminant function f(x; θ_g), but in this paper we focus on efficiency in terms of test-time and memory, and so we focus on class discriminants that are parameterized by a d-dimensional vector per class. Two such functions are: the inner product f(x; θ_g) = θ_g^T x, and the negated squared ℓ2 norm f(x; θ_g) = −(θ_g − x)^T (θ_g − x). We also refer to these as linear discriminants and Euclidean discriminants, respectively.

Figure 1: Histograms of the empirical error of 10 random samples (top) or 100 random samples (bottom), for P(classification error) = .5 (left) and P(classification error) = .01 (right). As the number of samples averaged grows, the empirical error will converge to the true probability of an error, either .5 (left) or .01 (right). But given a finite sample, the empirical error may be quite noisy, and when the true error is high (left) the empirical error can be much noisier than when the true error is low (right).
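The classification rule (1) with the two discriminants above can be sketched in a few lines. This is an illustrative sketch with names of our choosing, not the implementation used in the paper:

```python
import numpy as np

def classify(x, thetas, kind="linear"):
    """Return argmax_g f(x; theta_g) as in Eq. (1).

    thetas: (G, d) array holding one d-dimensional parameter vector per
    class. "linear" scores with theta_g^T x; "euclidean" scores with
    -(theta_g - x)^T (theta_g - x), i.e. the nearest prototype wins.
    """
    if kind == "linear":
        scores = thetas @ x
    else:
        diffs = thetas - x
        scores = -np.einsum("gd,gd->g", diffs, diffs)
    return int(np.argmax(scores))
```

For example, with prototypes at the origin and at (1, 1), a point near (1, 1) is assigned to the second class under either discriminant.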
using the approximation (7) for the empirical loss. Shamir and Dekel (2010) proposed a related but more extreme two-step approach for highly multiclass problems: first train a classifier on all classes, and then delete classes that are poorly estimated by the classifier. To be more practical, we propose continuously evolving the classifier to ignore the currently highly-confusable classes by implementing (6) in an online fashion with SGD. This simple variant can be interpreted as implementing curriculum learning (Bengio et al., 2009), a topic we discuss further in Section 6.2.1. But before detailing the proposed simple online strategy in Section 6, we need to review related work in loss functions for multiclass classifiers.

4. Related Work in Loss Functions for Multiclass Classifiers

In this section, we review loss functions for multiclass classifiers, and discuss recent work adapting such loss functions to the online setting for large-scale learning.

One of the most popular classifiers for highly multiclass learning is one-vs-all linear SVMs, which have only O(Gd) parameters to learn and store, and O(Gd) time needed for testing. A clear advantage of one-vs-all is that the G class discriminant functions {f_g(x)} can be trained independently. An alternate parallelizable approach is to train all G(G−1)/2 one-vs-one SVMs, and let them vote for the best class (also known as round-robin and all-vs-all). Binary classifiers can also be combined using error-correcting code approaches (Dietterich and Bakiri, 1995; Allwein et al., 2000; Crammer and Singer, 2002). A well-regarded experimental study of multiclass classification approaches by Rifkin and Klautau (2004) showed that one-vs-all SVMs performed "just as well" on a set of ten benchmark datasets with 4-49 classes as one-vs-one or error-correcting code approaches.

A number of researchers have independently extended the two-class SVM optimization problem to a joint multiclass optimization problem that maximizes pairwise margins subject to the training samples being correctly classified, with respect to pairwise slack variables (Vapnik, 1998; Weston and Watkins, 1998, 1999; Bredensteiner and Bennett, 1999).[1] These extensions have been shown to be essentially equivalent quadratic programming problems (Guermeur, 2002). The minimized empirical loss can be stated as the
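The one-vs-one voting scheme mentioned above can be sketched as follows; the pairwise classifiers are passed in as callables, and everything here (names, toy stubs) is illustrative rather than taken from the paper:

```python
import itertools
import numpy as np

def one_vs_one_predict(x, pairwise_clfs, G):
    """Predict by majority vote over all G-choose-2 one-vs-one classifiers.

    pairwise_clfs[(i, j)] is a callable that returns the winner (i or j)
    for input x; each would be trained independently in practice.
    """
    votes = np.zeros(G, dtype=int)
    for i, j in itertools.combinations(range(G), 2):
        votes[pairwise_clfs[(i, j)](x)] += 1  # winner gets one vote
    return int(np.argmax(votes))
```

Like one-vs-all, each of the G(G−1)/2 binary problems can be trained in parallel; only the final vote couples them.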
sum of the pairwise errors:

    L_pairwise({θ_g}) = Σ_{t=1}^{n} (1/|Y_t|) Σ_{y^+ ∈ Y_t} (1/|Y_t^C|) Σ_{y^- ∈ Y_t^C} |b − f(x_t; y^+) + f(x_t; y^-)|_+ ,    (8)

where |·|_+ is shorthand for max(0, ·), b is a margin parameter, Y_t^C is the complement set of Y_t, and we added normalizers to account for the case that a given x_t may have more than one positive label, such that |Y_t| > 1. Crammer and Singer (2001) instead suggested taking the maximum hinge loss over all the negative classes:

    L_maxloss({θ_g}) = Σ_{t=1}^{n} (1/|Y_t|) Σ_{y^+ ∈ Y_t} max_{y^- ∈ Y_t^C} |b − f(x_t; y^+) + f(x_t; y^-)|_+ .    (9)

[1] See also the work of Herbrich et al. (2000) for a related pairwise loss function for ranking rather than classification.

This maximum hinge-loss is sometimes called multiclass SVM, and can be derived from a margin bound (Mohri et al., 2012). Daniely et al. (2012) theoretically compared multiclass SVM with one-vs-all, one-vs-one, tree-based linear classifiers and error-correcting output code linear classifiers. They showed that the hypothesis class of multiclass SVM contains that of one-vs-all and tree classifiers, which strictly contain the hypothesis class of one-vs-one classifiers. Thus the potential performance with multiclass SVM is larger. However, they also showed that the approximation error of one-vs-one is smallest, with multiclass SVM next smallest.

Statnikov et al. (2005) compared eight multiclass classifiers, including those of Weston and Watkins (1999) and Crammer and Singer (2001), on nine cancer classification problems with 3 to 26 classes and fewer than 400 samples per problem. On these small-scale datasets, they found the Crammer and Singer (2001) classifier was best (or tied) on 2/3 of the datasets, and the pairwise loss given in (8) performed almost as well.

Lee et al. (2004) prove in their Lemma 2 that previous approaches to multiclass SVMs are not guaranteed to be asymptotically consistent. For more on consistency of multiclass classification loss functions, see Rifkin and Klautau (2004), Tewari and Bartlett (2007), Zhang (2004), and Mroueh et al. (2012). Lee et al. (2004) proposed a multiclass loss function that is consistent. They force the class discriminants to sum to zero, such that Σ_g f(x; θ_g) = 0 for all x, and define the loss:

    L_totalloss({θ_g}) = Σ_{t=1}^{n} Σ_{y ∈ Y_t^C} |f(x_t; y) + 1/(G−1)|_+ .    (10)

This loss function jointly trains the class discriminants so that the total sum of wrong class discriminants for each training sample is small. Minimizing this loss can be expressed as a constrained quadratic program. The experiments of Lee et al. (2004) on a few small datasets did not show much difference between the performance of (10) and (8).

5. Online Loss Functions for Training Large-scale Multiclass Classifiers

If there are a large number of training samples n, then computing the loss for each candidate set of classifier parameters becomes computationally prohibitive. The usual solution is to minimize the loss in an online fashion with stochastic gradient descent, but exactly how to sample the stochastic gradients becomes a key issue. Next, we review two stochastic gradient approaches that correspond to different loss functions: AUC sampling (Grangier and Bengio, 2008) and the WARP sampling used in the Wsabie classifier (Weston et al., 2011).

5.1 AUC Sampling

For a large number of training samples n, Grangier and Bengio (2008) proposed optimizing (8) by sequentially uniformly sampling from each of the three sums in (8):

1. draw one training sample x_t,
2. draw one correct class y^+ from Y_t,
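The remaining sampling step falls outside this excerpt, but from the discussion in Section 5.3 below, AUC sampling draws a single negative class uniformly and updates only when the pairwise hinge in (8) is violated. Under that reading, one AUC-sampling SGD step for linear discriminants might look like the following sketch (names and step size are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def auc_sgd_step(X, Y, thetas, b=1.0, lr=0.1):
    """One AUC-sampling step for f(x; theta_g) = theta_g^T x.

    X: (n, d) features; Y: list of sets of correct labels per sample.
    Returns True if the sampled negative violated the margin and the
    parameters were updated, False otherwise (zero gradient).
    """
    t = rng.integers(len(X))                # 1. draw one training sample
    x, pos = X[t], sorted(Y[t])
    y_pos = pos[rng.integers(len(pos))]     # 2. draw one correct class
    neg = [g for g in range(len(thetas)) if g not in Y[t]]
    y_neg = neg[rng.integers(len(neg))]     # 3. draw one negative class
    if b - thetas[y_pos] @ x + thetas[y_neg] @ x <= 0:
        return False                        # no violation: no update
    thetas[y_pos] += lr * x                 # hinge-gradient step
    thetas[y_neg] -= lr * x
    return True
```

With zero-initialized parameters every sampled pair violates the margin of b, so early steps always update; as training progresses, more and more draws return a zero gradient, which is the inefficiency discussed in Section 5.3.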
To sample a violating class from V_{t,y^+}, the negative classes in Y_t^C are uniformly randomly sampled until a class that satisfies the violation constraint (12) is found, or the number of allowed such trials (generally set to be G) is exhausted. The rank r(y^+) needed to calculate the weight in (13) is estimated to be (G−1) divided by the number of negative classes y^- ∈ Y_t^C that had to be tried before finding a violating class y_v from V_{t,y^+}.

5.3 Some Notes Comparing AUC and WARP Loss

We note that WARP sampling is more likely than AUC sampling to update the parameters of a training sample's positive class y^+ if y^+ has few violating classes, that is, if (x_t, y_t) is already highly-ranked by (1). Specifically, suppose a training pair (x_t, y_t) is randomly sampled, and suppose H > 0 of the G classes are violators such that their hinge-loss is non-zero with respect to (x_t, y^+). WARP sampling will draw random classes until it finds a violator and makes an update, but AUC will only make an update if the one random class it draws happens to be a violator, so only H/(G−1) of the time. By definition, the higher-ranked the correct class y_t is for x_t, the smaller the number of violating classes H, and the less likely AUC sampling will update the classifier to learn from (x_t, y_t). In this sense, WARP sampling is more focused on fixing class parameters that are almost right already, whereas AUC sampling is more focused on improving class parameters that are very wrong.

At test time, classifiers choose only the highest-ranked class discriminant as a class label, and thus the fact that AUC sampling updates more often on lower-ranked classes is likely the key reason that WARP sampling performs so much better in practice (see the experiments of Weston et al. (2011) and the experiments in this paper). Even in ranking, it is usually only the top ranked classes that are of interest. However, the WARP weight w(y^+) given in (13) partly counteracts this difference by assigning greater weights (equivalently, a larger step-size to the stochastic gradient) if the correct class has many violators, as then its rank is lower. In this paper, one of the proposals we make is to use constant weights w(y^+) = 1, so the training is even more focused on improving classes that are already highly-ranked.

AUC sampling is also inefficient because so many of the random samples result in a zero gradient. In fact, we note
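The WARP violator search and rank estimate just described can be sketched as follows; this is a sketch under our reading, with the margin check standing in for the violation constraint (12), which lies outside this excerpt:

```python
import numpy as np

rng = np.random.default_rng(1)

def warp_sample_violator(x, y_pos, thetas, Y_t, b=1.0):
    """Draw negatives until one violates the margin, or give up after G trials.

    Returns (violating_class, estimated_rank), where the rank of y_pos is
    estimated as (G - 1) // (number of negatives tried), or (None, None)
    if the trial budget is exhausted.
    """
    G = len(thetas)
    negatives = [g for g in range(G) if g not in Y_t]
    for trials in range(1, G + 1):
        y = negatives[rng.integers(len(negatives))]
        if b - thetas[y_pos] @ x + thetas[y] @ x > 0:
            return y, (G - 1) // trials     # violator found after `trials` draws
    return None, None
```

With freshly zero-initialized parameters every negative violates, so the first trial succeeds and the estimated rank is G − 1, i.e. the correct class is estimated to be ranked at the bottom.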
that the probability that AUC sampling will update the classifier decreases if there are more classes. Specifically, suppose for a given training sample pair (x_t, y_t) there are H classes that violate it, and that there are G classes in total. Then the probability that AUC sampling updates the classifier for (x_t, y_t) is H/(G−1), which linearly decreases as the number of classes G is increased.

5.4 Online Versions of Other Multiclass Losses

WARP sampling implements an online version of the pairwise loss given in (8) (Weston et al., 2011). One can also interpret the WARP loss sampling as an online approximation of the maximum hinge loss given in (9), where the maximum violating class is approximated by the sampled violating class. This interpretation does not call for a rank-based weighting w(y^+), and in fact we found that setting w(y^+) = 1 improved accuracy by roughly 20% on a large-scale image annotation task (see Table 6). A better approximation of (9) would require sampling multiple violating classes and then taking the class with the worst discriminant; we did not try this due to the expected time needed to find multiple violating classes. Further, we hypothesize that choosing the class with the largest violation as the violating class could actually perform poorly for practical highly multiclass problems like ImageNet, because the worst discriminant may belong to a class that is a missing correct label, rather than an incorrect label.

An online version of the loss proposed by Lee et al. (2004) and given in (10) would be more challenging to implement because the G class discriminants are required to be normalized; we do not know of any such experiments.

5.5 The Wsabie Classifier

Weston et al. (2011) combined the WARP sampling with online learning of a supervised linear dimensionality reduction. They learn an embedding matrix W ∈ R^{m×d} that maps a given d-dimensional feature vector x to an m-dimensional "embedded" vector Wx ∈ R^m, where m < d, and then the G class-specific discriminants of dimension m are trained to separate classes in the embedding space. Weston et al. (2011) referred to this combination as the Wsabie classifier. This changes the WARP loss given in (11) to the non-convex Wsabie loss, defined:

    L_Wsabie(W, {θ_g}) = Σ_{t=1}^{n} Σ_{y^+ ∈ Y_t} Σ_{y_v ∈ V_{t,y^+}} w(y^+) |b − f(Wx_t; y^+) + f(Wx_t; y_v)|_+ .

Adding the embedding matrix W changes the number of parameters from Gd to Gm + md. For a large number of classes G and a small embedding dimension m (the case of interest here) this reduces the overall parameters, and so the addition of the embedding matrix W acts as a regularizer, reduces memory, and reduces testing time.

6. Online Adaptation of the Empirical Loss to Reduce Impact of Highly Confusable Classes

In this section, we present a simple and memory-efficient online implementation of the empirical class-confusion loss we proposed in Section 3.3 that reduces the impact of highly confusable classes on the standard empirical loss. First, we describe the batch variant of this proposal and quantify its effect. Then in Section 6.2 we describe a sampling version. In Section 6.3, we propose a simple extension that experimentally increases the accuracy of the resulting classifier, without using additional memory. We show the proposed online strategy works well in practice in Section 9.

6.1 Reducing the Effect of Highly Confusable Classes by Ignoring Last Violators

We introduce the key idea of a last violator class with a simple example before a formal definition. Suppose during online training the hundredth training sample x_100 has label y_100 = tiger, and that the last training sample we saw labelled tiger was x_5. And suppose lion was a violating class for that training sample pair (x_5, tiger), that is, |1 − f(x_5; θ_tiger) + f(x_5; θ_lion)|_+ > 0.
6.2 Ignoring Sampled Last Violators for Online Learning

Building on the WARP sampling proposed by Weston et al. (2011) and reviewed in Section 5.2, we propose an online sampling implementation of (14), where for each class y^+ we store one sampled last violator and only update the classifier if the current violator is not the same as the last violator. Specifically:

1. draw one training sample x_t,
2. draw one correct class y^+ from Y_t,
3. if there is no last violator v_{t,y^+}, or if v_{t,y^+} exists but is not a violator for (x_t, y^+), then draw and store one violating class y_v from V_{y^+} and
   (a) compute the corresponding stochastic gradient of the loss in (8),
   (b) update the parameters.

Table 1 re-visits the same example as earlier, and illustrates, for eight sequential training examples whose training label was cat, what the last violator class is, whether the last violator is a current violator (in which case the current error is ignored), or, if not ignored, which of the current violators is randomly sampled for the classifier update. Throughout the training, the state of the sampled last violator for any class y^+ can be viewed as a Markov chain. We illustrate this for the class y^+ and two possible violating classes g and h in Figure 3.

In the experiments to follow, we couple the proposed online empirical class-confusion loss sampling strategy with an embedding matrix as in the Wsabie algorithm, for efficiency and regularization, and refer to this as Wsabie++. A complete description of Wsabie++ is given in Table 3, including the adagrad step size updates described in the next section. The memory needed to implement this discounting is O(G), because only one last violator class is stored for each of the G classes.

    Set of Cat Violators | Cat's LV | Cat's LV Violates? | New Violator Sampled?
 1: dog and pig          | none     | -                  | dog
 2: dog and pig          | dog      | yes                | ignored
 3: dog                  | dog      | yes                | ignored
 4: dog and pig          | dog      | yes                | ignored
 5: pig                  | dog      | no                 | pig
 6: no violators         | pig      | no                 | none
 7: dog                  | none     | -                  | dog
 8: dog                  | dog      | yes                | ignored

Table 1: Example of ignoring sampled last violators for eight sequential samples (one per row) whose training label is cat.

To use (15) in an online setting, each time a training sample and its positive class are drawn, we check if any q-th order last violator v^q_{t,y^+} for any q ≤ Q is a current violator, and if so, we ignore that training sample and move directly to the next training sample without updating the classifier parameters. Table 3 gives the complete proposed sampling and updating algorithm for Euclidean discriminant functions, including the adaptive adagrad step-size explained in Section 7, which follows. For Euclidean discriminant functions we did not find (experimentally) that we needed any constraints or additional regularizers on W or {θ_g}, though if desired a regularization step can be added.

7. Adagrad for Learning Rate

Convergence speed of stochastic gradient methods is sensitive to the choice of step sizes. Recently, Duchi et al. (2011) proposed a parameter-dependent learning rate for stochastic gradient methods. They proved that their approach has strong theoretical regret guarantees for convex objective functions, and experimentally it produced better results than comparable methods such as regularized dual averaging (Xiao, 2010) and the passive-aggressive method (Crammer et al., 2006). In our experiments, we applied adagrad both to the convex training of the one-vs-all SVMs and AUC sampling, as well as to the non-convex Wsabie++ training. Inspired by our preliminary results using adagrad for non-convex optimization, Dean et al. (2012) also tried adagrad for non-convex training of a deep belief network, and also found it produced substantial improvements in practice.

The main idea behind adagrad is that each parameter gets its own step size, and each time a parameter is updated its step size is decreased to be proportional to the running sum of the magnitude of all previous updates. For simplicity, we limit our description to the case where the parameters being optimized are unconstrained, which is how we implemented it. For memory and computational efficiency, Duchi et al. (2011) apply adagrad separately for each parameter (as opposed to modeling correlations between parameters).

We applied adagrad to adapt the step size for the G classifier discriminants {θ_g} and the m × d embedding matrix W. We found that we could save memory without affecting experimental performance by averaging the adagrad learning rate over the embedding dimensions, such that we keep track of one scalar adagrad weight per class. That is, let ∇_{g,τ} denote the stochastic gradient for θ_g at time τ; then we update θ_g as follows:

    θ_{g,t+1} = θ_{g,t} − η ( Σ_{τ=0}^{t} (∇_{g,τ}^T ∇_{g,τ}) / d )^{−1/2} ∇_{g,t}.    (16)

Analogously, we found it experimentally effective and more memory-efficient to keep track of one averaged scalar adagrad weight for each of the m rows of the embedding matrix W.

There are two main effects to using adagrad. First, suppose there are two classes that are updated equally often; then the class with larger stochastic gradients ∇_g will experience a faster-decaying learning rate. Second, and we believe the more relevant issue for our use, is that some classes are updated frequently, and some classes rarely. Suppose that all stochastic gradients ∇_g have the same magnitude; then the classes that are updated more rarely experience relatively larger updates. In our experiments the second effect was predominant, which we tested by setting the learning rate for each parameter proportional to the inverse square root of the number of times that parameter has been updated. This "counting adagrad" produced results that were not statistically different from using (16). (The experimental results in this paper are reported using adagrad proper as per (16).) The complete Wsabie++ algorithm description given in Table 3 tracks this running sum of gradient magnitudes explicitly.

8. Experiments

We first detail the datasets used. Then in Section 8.2 we describe the features. In Section 8.3 we describe the different classifiers compared and how the parameters and hyperparameters were set.

8.1 Data Sets

Experiments were run with four datasets, as summarized in Table 4 and detailed below.
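The paper's complete algorithm is its Table 3; as a hedged sketch only, the last-violator check of Section 6.2 and the averaged scalar adagrad step of (16) might be combined per update roughly as follows, for Euclidean discriminants and with the embedding matrix omitted (all names are ours, not from Table 3):

```python
import numpy as np

def wsabiepp_like_step(x, y_pos, thetas, last_violator, grad_sq,
                       sample_violator, b=1.0, eta=0.1, eps=1e-8):
    """One sketch of an update: skip the sample if the stored last violator
    still violates; otherwise sample a new violator and take a
    scalar-adagrad hinge step.

    f(x; theta_g) = -||theta_g - x||^2 (Euclidean discriminant).
    last_violator: dict mapping class -> last sampled violating class.
    grad_sq: per-class running sum of squared gradients averaged over
    the parameter dimensions, as in Eq. (16).
    """
    def f(g):
        diff = thetas[g] - x
        return -diff @ diff

    lv = last_violator.get(y_pos)
    if lv is not None and b - f(y_pos) + f(lv) > 0:
        return False                    # LV still violates: ignore this sample
    yv = sample_violator(x, y_pos)      # e.g. a WARP-style negative search
    if yv is None:
        return False                    # no violator found: zero gradient
    last_violator[y_pos] = yv
    d = len(x)
    for g, sign in ((y_pos, 1.0), (yv, -1.0)):
        grad = sign * 2.0 * (thetas[g] - x)   # hinge gradient wrt theta_g
        grad_sq[g] += grad @ grad / d         # averaged over dimensions
        thetas[g] -= eta * grad / np.sqrt(grad_sq[g] + eps)
    return True
```

The first update pulls θ_{y_pos} toward x and pushes the sampled violator away; an immediately repeated draw for the same class is then ignored whenever the stored violator still violates the margin, which is exactly the discounting described above.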
                       16k ImageNet | 22k ImageNet | 21k WebData | 97k WebData
    Number of Classes  15,589       | 21,841       | 21,171      | 96,812
    Number of Samples  9 million    | 14 million   | 9 million   | 40 million
    Number of Features 1024         | 479          | 1024        | 1024

Table 4: Datasets.

8.1.1 ImageNet Data Sets

ImageNet (Deng et al., 2009) is a large image dataset organized according to WordNet (Fellbaum, 1998). Concepts in WordNet, described by multiple words or word phrases, are hierarchically organized. ImageNet is a growing image dataset that attaches one of these concepts to each image using a quality-controlled, human-verified labeling process. We used the spring 2010 and fall 2011 releases of the ImageNet dataset. The spring 2010 version has around 9M images and 15,589 classes (16k ImageNet). The fall 2011 version has about 14M images and 21,841 classes (22k ImageNet). For both datasets, we separated out 10% of the examples for validation, 10% for test, and the remaining 80% was used for training.

8.1.2 Web Data Sets

We also had access to a large proprietary set of images taken from the web, together with a noisy annotation based on anonymized users' click information. We created two datasets from this corpus that we refer to as 21k WebData and 97k WebData. The 21k WebData contains about 9M images, divided into 20% for validation, 20% for test, and 60% for train, and the images are labelled with 21,171 distinct classes. The 97k WebData contains about 40M images, divided into 10% for validation, 10% for test, and 80% for train, and the images are labelled with 96,812 distinct classes.

There are six main differences between the WebData and ImageNet. First, the types of labels found in ImageNet are more academic, following the strict structure of WordNet.
In contrast, the WebData labels are taken from a set of popular queries that were the input to a general-purpose image search engine, so they include people, brands, products, and abstract concepts. Second, the number of images per label in ImageNet is artificially forced to be somewhat uniform, while the WebData distribution of the number of images per label is generated by popularity with users, and is thus more exponentially distributed. Third, because of the popular origins of the Web datasets, classes may be translations of each other, plural vs. singular concepts, or synonyms (for examples, see Table 7). Thus we expect more highly-confusable classes for the WebData than ImageNet. A fourth key difference is that ImageNet disambiguates polysemous labels whereas WebData does not; for example, an image labeled palm might look like the palm of a hand or like a palm tree. The fifth difference is that there may be multiple given positive labels for some of the Web samples; for example, the same image might be labelled mountain, mountains, Himalaya, and India. Lastly, classes may be at different and overlapping precision levels, for example the class cake and the class wedding cake.

8.2 Features

We do not focus on feature extraction in this work, although features certainly can have a big impact on performance. For example, Sanchez and Perronnin (2011) recently achieved a 160% gain in accuracy on the 10k ImageNet dataset by changing the features but not the classification method. In this paper we use features similar to those used in Weston et al. (2011). We first combined multiple spatial (Grauman and Darrell, 2007) and multiscale color and texton histograms (Leung and Malik, 1999) for a total of about 5 × 10^5 dimensions. The descriptors are somewhat sparse, with about 50,000 non-zero weights per image. Some of the constituent histograms are normalized and some are not. We then perform kernel PCA (Schoelkopf et al., 1999) on the combined feature representation using the intersection kernel (Barla et al., 2003) to produce a 1024-dimensional or 479-dimensional input vector per image (see Table 4), which is then used as the feature vectors for the classifiers.

8.3 Classifiers Compared and Hyperparameters

We experimentally compared the following linear classifiers: nearest means, one-vs-all SVMs, AUC, Wsabie, and the proposed Wsabie++ classifiers. Table 5 compares these methods as they were implemented for the experiments. The nearest means classifier is the most efficient to train of the compared methods, as it only passes over the training samples once and computes the mean of the training feature vectors for each class (and there are no hyperparameters). Like the nearest means classifier, we implemented Wsabie++ with Euclidean discriminants (as detailed in Table 3), and as such it can be considered a discriminative nearest means classifier. Testing with Euclidean discriminants can easily be made faster by applying exact or approximate fast k-NN methods, where the class prototypes {θ_g} play the role of the neighbors. Further, Euclidean discriminants lend themselves more naturally to visualization than the inner product, as each class is represented by a prototype.

One-vs-all linear SVMs are the most popular choice for large-scale classifiers due to studies showing their good performance, their parallelizable training, relatively small memory, and fast test-time (Rifkin and Klautau, 2004; Deng et al., 2010; Sanchez and Perronnin, 2011; Perronnin et al., 2012; Lin et al., 2011). Perronnin et al. (2012) highlight the importance of getting the right balance of negative to positive examples used to train the one-vs-all linear SVMs. As in their paper, we cross-validate the expected number of negative examples per positive example; the allowable choices were powers of 2. In contrast, earlier published results by Weston et al. (2011) that compared Wsabie to one-vs-all SVMs used one negative example per positive example, analogous to the AUC classifier. We included this comparison, which we labelled One-vs-all SVMs 1+:1- in the tables.

Both Wsabie and Wsabie++ jointly train an embedding matrix W as described in Section 5.5. The embedding dimension d was chosen on the validation set from the choices d = {32, 64, 96, 128, 192, 256, 384, 512, 768, 1024}. In addition, we created ensemble Wsabie and Wsabie++ classifiers by concatenating ⌊m/d⌋ such d-dimensional models to produce a classifier with a total of m parameters, to compare classifiers that require the same memory and test-time.

All hyperparameters were chosen based on the accuracy on a held-out validation set. Step-size, margin, and regularization constant hyperparameters were varied by powers of ten. The order Q of the last violators was varied by powers of 2. Chosen hyperparameters are recorded in Table 9. Both the pairwise loss and Wsabie classifiers are implemented with standard ℓ2 constraints on the class discriminants (and for Wsabie, on the rows of the embedding matrix). We did not use any regularization constraints for Wsabie++.

We initialized the Wsabie parameters and SVM parameters uniformly randomly within the constraint set. We initialize the proposed training by setting all θ_g to the origin, and all components of the embedding matrix are equally likely to be +1 or −1. Experiments with different initialization schemes for these different classifiers showed that different (reasonable) initializations gave very similar results.

With the exception of nearest means, all classifiers were trained online with stochastic gradients. We also used adagrad for the convex optimizations of both one-vs-all SVMs and the AUC sampling, which increased the speed of convergence. Recently, Perronnin et al. (2012) showed good results with one-vs-all SVM classifiers and the WARP loss, where they also cross-validated an early-stopping criterion. Adagrad reduces step sizes over time, and this removed the need to worry about early stopping. In fact, we did not see any obvious overfitting with any of the classifier training (validation set and test set errors were statistically similar). Each algorithm was allowed to train for up to 100 loops through the entire training set, or until the validation set performance had not changed in 24 hours. Even those runs that ran the entire 100 loops appeared to have essentially converged. Implemented in C++ without parallelization, all algorithms (except nearest means) took around one week to train on the 16k ImageNet dataset, around two weeks to train on the 21k and 22k datasets, and around one month to train on the 97k dataset. Also, in all cases roughly 80% of the validation accuracy was achieved in roughly the first 20% of the training time.

Because stochastic gradient descent uses random sampling of the training samples
, multiple runs will produce slightly different results. To address this randomness, we ran five runs of each classifier for each set of candidate parameters, and reported the test accuracy

Classifier                                                      | Test Accuracy
Wsabie (Weston et al., 2011)                                    | 3.7%
Wsabie + 10 last violators                                      | 5.0%
Wsabie + adagrad                                                | 5.0%
Wsabie + w(r(y^+)) = 1 in (11)                                  | 4.1%
Wsabie + adagrad + 10 last violators                            | 5.9%
Wsabie + adagrad + w(r(y^+)) = 1 in (11)                        | 6.0%
Wsabie + adagrad + w(r(y^+)) = 1 in (11) + 1 last violator      | 6.3%
Wsabie + adagrad + w(r(y^+)) = 1 in (11) + 10 last violators    | 7.1%
Wsabie + adagrad + w(r(y^+)) = 1 in (11) + 100 last violators   | 6.8%

Table 6: Effect of the proposed differences compared to Wsabie for a d = 100 dimensional embedding space on 21k WebData.

Table 7 gives examples of the classes corresponding to neighboring {θ_g} in the embedded feature space after the Wsabie++ training.

Class       | 1-NN          | 2-NN                | 3-NN      | 4-NN     | 5-NN
poodle      | caniche       | pudel               | labrador  | puppies  | cocker spaniel
dolphin     | dauphin       | delfin              | dolfijnen | delfiner | dolphins
San Diego   | Puerto Madero | Sydney              | Vancouver | Kanada   | Tripoli
mountain    | mountains     | montagne            | Everest   | Alaska   | Himalaya
router      | modem         | switch              | server    | lan      | network
calligraphy | fonts         | Islamic calligraphy | borders   | quotes   | network

Table 7: For each of the classes on the left, the table shows the five nearest (in terms of Euclidean distance) class prototypes {θ_g} in the proposed discriminatively trained embedded feature space for the 21k WebData set. Because these classes originated as web queries, some class names are translations of each other, for example dolphin and dauphin (French for dolphin). While these may seem like exact synonyms, in fact different language communities often have slightly different visual notions of the same concept. Similarly, the classes Obama and President Obama are expected to be largely overlapping, but their class distributions differ in the formality and context of the images.

Lastly, we illustrate how the Wsabie++ test accuracy depends on the number of embedding dimensions. These results are for the 21k WebData set, with the step-size and margin

                      Step size | Margin | Embedding dimension | Balance | # LVs
One-vs-all SVM 1+:1-
  16k ImageNet          .01    |   .1   |                     |         |
  22k ImageNet          .01    |   .1   |                     |         |
  21k WebData           .1     |   1    |                     |         |
  97k WebData           .01    |   .1   |                     |         |
One-vs-all SVM
  16k ImageNet          .01    |   .1   |                     |   64    |
  22k ImageNet          .01    |   .1   |                     |   64    |
  21k WebData           .1     |   1    |                     |   64    |
  97k WebData           .01    |   .1   |                     |  128    |
AUC Sampling
  16k ImageNet          .01    |   .1   |                     |         |
  22k ImageNet          .01    |   .1   |                     |         |
  21k WebData           .001   |   .01  |                     |         |
  97k WebData           .001   |   .01  |                     |         |
Wsabie
  16k ImageNet          .01    |   .1   |        128          |         |
  22k ImageNet          .001   |   .1   |        128          |         |
  21k WebData           .001   |   .1   |        256          |         |
  97k WebData           .0001  |   .1   |        256          |         |
Wsabie++
  16k ImageNet          10     | 10,000 |        192          |         |   8
  22k ImageNet          10     | 10,000 |        192          |         |   8
  21k WebData           10     | 10,000 |        256          |         |   8
  97k WebData           10     | 10,000 |        256          |         |  32

Table 9: Classifier parameters chosen using the validation set.

to consider for each positive class, rather than the WARP sampling which draws negative classes until it finds a violator. Preliminary results showed that the validation set chose the largest parameter choice, which was almost the same as the number of classes. Thus that approach required another hyperparameter, but seemed to be doing exactly what WARP sampling does, and appeared to offer no accuracy improvement over WARP sampling.

This research focused on accurate and efficient classification, and not on the issue of training time. With the exception of nearest means, the methods compared were implemented with stochastic gradient descent for efficient online training to deal with the large number of training samples n. As implemented, the methods took roughly equally long to train. However, parallel training of the G one-vs-all SVMs would have been roughly G times as fast. While not as naturally parallelizable, we have had some success in parallelizing the
t−1, O_t = 1 if and only if Z_t = 1 and Z_{t−1} = 0. Thus, E[O_t] = P(Z_t = 1, Z_{t−1} = 0) = P(Z_t = 1) P(Z_{t−1} = 0) = p(1 − p), by the independence of the Bernoulli random variables Z_t and Z_{t−1}. Then the expected number of confusions counted by (14) is E[Σ_t O_t] = Σ_t E[O_t] by linearity, which can be expanded: E[O_1] + Σ_{t=2}^{n} E[O_t] = p + (n − 1)p(1 − p).

References

E. L. Allwein, R. E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.

A. Barla, F. Odone, and A. Verri. Histogram intersection kernel for image classification. Intl. Conf. Image Processing (ICIP), 3:513-516, 2003.

S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In NIPS, 2010.

Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.

Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.

E. J. Bredensteiner and K. P. Bennett. Multicategory classification by support vector machines. Computational Optimization and Applications, 12:53-79, 1999.

L. D. Brown, T. T. Cai, and A. DasGupta. Interval estimation for a binomial proportion. Statistical Science, 16(2):101-117, 2001.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2001.

K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2):201-233, 2002.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, 2006.

A. Daniely, S. Sabato, and S. Shalev-Shwartz. Multiclass learning approaches: A theoretical comparison with implications. In NIPS, 2012.

J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In NIPS, 2012.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.
M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, Cambridge, MA, 2012.

Y. Mroueh, T. Poggio, L. Rosasco, and J.-J. E. Slotine. Multiclass learning with simplex coding. In NIPS, 2012.

D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.

A. Paton, N. Brummitt, R. Govaerts, K. Harman, S. Hinchcliffe, B. Allkin, and E. Lughadha. Target 1 of the global strategy for plant conservation: a working list of all known plant species, progress and prospects. Taxon, 57:602-611, 2008.

F. Perronnin, Z. Akata, Z. Harchaoui, and C. Schmid. Towards good practice in large-scale learning for image classification. In CVPR, 2012.

R. Rifkin and A. Klautau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004.

J. Sanchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In CVPR, 2011.

B. Schoelkopf, A. J. Smola, and K. R. Muller. Kernel principal component analysis. Advances in Kernel Methods: Support Vector Learning, pages 327-352, 1999.

O. Shamir and O. Dekel. Multiclass-multilabel classification with more classes than examples. In Proc. AISTATS, 2010.

A. Statnikov, C. F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5):631-643, 2005.

A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 2007.

N. Usunier, D. Buffoni, and P. Gallinari. Ranking with ordered weighted pairwise classification. In ICML, 2009.

V. Vapnik. Statistical Learning Theory. Wiley, 1998.

J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Dept. Computer Science, Royal Holloway, University of London, 1998.

J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proc. European Symposium on Artificial Neural Networks, 1999.

J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Intl. Joint Conf. Artificial Intelligence (IJCAI), pages 2764-2770, 2011.

J. Weston, A. Makadia, and H. Yee. Label partitioning for sublinear ranking. In ICML, 2013a.
J. Weston, R. Weiss, and H. Yee. Affinity weighted embedding. In Proc. International Conf. Learning Representations, 2013b.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Microsoft Research Technical Report MSR-TR-2010-23, 2010.

T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, 2004.