Training Highly Multiclass Classifiers

Gupta, Bengio, and Weston
[...] a form of curriculum learning, that is, of learning to distinguish easy classes before focusing on learning to distinguish hard classes (Bengio et al., 2009).

In this paper, we focus on classifiers that use a linear discriminant or a single prototypical feature vector to represent each class. Linear classifiers are a popular approach to highly multiclass problems because they are efficient in terms of memory and inference and can provide good performance (Perronnin et al., 2012; Lin et al., 2011; Sanchez and Perronnin, 2011). Class prototypes offer similar memory/efficiency advantages. The last layer of a deep belief network classifier is often linear or soft-max discriminant functions (Bengio, 2009), and the proposed ideas for adapting online loss functions should be applicable in that context as well.

We apply the proposed loss function adaptation to the multiclass linear classifier called Wsabie (Weston et al., 2011). We also simplify Wsabie's weighting of stochastic gradients, and employ a recent advance in automatic step-size adaptation called adagrad (Duchi et al., 2011). The resulting proposed Wsabie++ classifier almost doubles the classification accuracy on benchmark Imagenet datasets compared to Wsabie, and shows substantial gains over one-vs-all SVMs.

The rest of the article is as follows. After establishing notation in Section 2, we explain in Section 3 how different class confusabilities can distort the standard empirical loss. We then review loss functions for jointly training multiclass linear classifiers in Section 4, and stochastic gradient descent variants for large-scale learning in Section 5. In Section 6, we propose a practical online solution to adapt the empirical loss to account for the variance of class confusability. We describe our adagrad implementation in Section 7. Experiments are reported on benchmark and proprietary image classification datasets with 15,000-97,000 classes in Sections 8 and 9. We conclude with some notes about the key issues and unresolved questions.

2. Notation And Assumptions

We take as given a set of training data $\{(x_t, Y_t)\}$ for $t = 1, \ldots, n$, where $x_t \in \mathbb{R}^d$ is a feature vector and $Y_t \subseteq \{1, 2, \ldots, G\}$ is the subset of the $G$ class labels that are known to be correct labels for $x_t$. For example, an image might be represented by a set of features $x_t$ and have known labels $Y_t = \{$dolphin, ocean, Half Moon Bay$\}$. We assume a discriminant function $f(x; \beta_g)$ has been chosen with class-specific parameters $\beta_g$ for each class with $g = 1, \ldots, G$. The class discriminant functions are used to classify a test sample $x$ as the class label that solves

$$\arg\max_g f(x; \beta_g). \qquad (1)$$

Most of this paper applies equally well to "learning to rank," in which case the output might be a top-ranked or ranked-and-thresholded list of classes for a test sample $x$. For simplicity, we restrict our discussion and metrics to the classification paradigm given by (1).

Many of the ideas in this paper can be applied to any choice of discriminant function $f(x; \beta_g)$, but in this paper we focus on efficiency in terms of test-time and memory, and so we focus on class discriminants that are parameterized by a $d$-dimensional vector per class. Two such functions are: the inner product $f(x; \beta_g) = \beta_g^T x$, and the squared $\ell_2$ norm $f(x; \beta_g) = -(\beta_g - x)^T(\beta_g - x)$. We also refer to these as linear discriminants and [...]

[Figure 1: Histograms of the empirical error of 10 random samples (top) or 100 random samples (bottom), for true error probabilities $P(\text{classification error}) = .5$ (left) and $P(\text{classification error}) = .01$ (right). As the number of samples averaged grows, the empirical error will converge to the true probability of an error, either .5 (left) or .01 (right). But given a finite sample, the empirical error may be quite noisy, and when the true error is high (left) the empirical error can be much noisier than when the true error is low (right).]
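The sampling noise illustrated in Figure 1 is easy to reproduce. The following minimal NumPy sketch (our own illustration, not code from the paper) draws many size-$k$ samples of Bernoulli error indicators with true error $p$ and reports the spread of the empirical error, which scales as $\sqrt{p(1-p)/k}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_error(p, k, runs=10_000):
    # Each run averages k Bernoulli(p) error indicators, as in Figure 1.
    return rng.binomial(k, p, size=runs) / k

for p in (0.5, 0.01):
    for k in (10, 100):
        e = empirical_error(p, k)
        print(f"p={p:<5} k={k:<4} mean={e.mean():.3f} std={e.std():.3f}")
```

With $p = .5$ and $k = 10$ the standard deviation of the empirical error is about .16, while with $p = .01$ and $k = 10$ it is about .03: the high-error case gives a far noisier empirical error, which is the distortion the figure illustrates.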
[...] using the approximation (7) for the empirical loss. Shamir and Dekel (2010) proposed a related but more extreme two-step approach for highly multiclass problems: first train a classifier on all classes, and then delete classes that are poorly estimated by the classifier. To be more practical, we propose continuously evolving the classifier to ignore the currently highly-confusable classes by implementing (6) in an online fashion with SGD. This simple variant can be interpreted as implementing curriculum learning (Bengio et al., 2009), a topic we discuss further in Section 6.2.1. But before detailing the proposed simple online strategy in Section 6, we need to review related work in loss functions for multiclass classifiers.

4. Related Work in Loss Functions for Multiclass Classifiers

In this section, we review loss functions for multiclass classifiers, and discuss recent work adapting such loss functions to the online setting for large-scale learning.

One of the most popular classifiers for highly multiclass learning is one-vs-all linear SVMs, which have only $O(Gd)$ parameters to learn and store, and $O(Gd)$ time needed for testing. A clear advantage of one-vs-all is that the $G$ class discriminant functions $\{f_g(x)\}$ can be trained independently. An alternate parallelizable approach is to train all $\binom{G}{2}$ one-vs-one SVMs, and let them vote for the best class (also known as round-robin and all-vs-all). Binary classifiers can also be combined using error-correcting code approaches (Dietterich and Bakiri, 1995; Allwein et al., 2000; Crammer and Singer, 2002). A well-regarded experimental study of multiclass classification approaches by Rifkin and Klatau (2004) showed that one-vs-all SVMs performed "just as well" on a set of ten benchmark datasets with 4-49 classes as one-vs-one or error-correcting code approaches.

A number of researchers have independently extended the two-class SVM optimization problem to a joint multiclass optimization problem that maximizes pairwise margins subject to the training samples being correctly classified, with respect to pairwise slack variables (Vapnik, 1998; Weston and Watkins, 1998, 1999; Bredensteiner and Bennet, 1999).¹ These extensions have been shown to be essentially equivalent quadratic programming problems (Guermeur, 2002). The minimized empirical loss can be stated as the sum of the pairwise errors:

$$L_{\text{pairwise}}(\{\beta_g\}) = \sum_{t=1}^{n} \frac{1}{|Y_t|} \sum_{y^+ \in Y_t} \frac{1}{|Y_t^C|} \sum_{y^- \in Y_t^C} \left| b - f(x_t; \beta_{y^+}) + f(x_t; \beta_{y^-}) \right|_+, \qquad (8)$$

where $|\cdot|_+$ is short-hand for $\max(0, \cdot)$, $b$ is a margin parameter, $Y_t^C$ is the complement set of $Y_t$, and we added normalizers to account for the case that a given $x_t$ may have more than one positive label such that $|Y_t| > 1$. Crammer and Singer (2001) instead suggested taking the maximum hinge loss over all the negative classes:

$$L_{\text{maxloss}}(\{\beta_g\}) = \sum_{t=1}^{n} \frac{1}{|Y_t|} \sum_{y^+ \in Y_t} \max_{y^- \in Y_t^C} \left| b - f(x_t; \beta_{y^+}) + f(x_t; \beta_{y^-}) \right|_+. \qquad (9)$$

¹ See also the work of Herbrich et al. (2000) for a related pairwise loss function for ranking rather than classification.
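To make the two losses concrete, here is a minimal NumPy sketch of (8) and (9) for a generic score matrix (the dense matrix F of discriminant values and the toy labels are our own illustration, not the paper's code):

```python
import numpy as np

def pairwise_loss(F, Y, b=1.0):
    # Eq. (8): normalized sum of pairwise hinges |b - f(x_t; y+) + f(x_t; y-)|_+ .
    n, G = F.shape
    total = 0.0
    for t in range(n):
        neg = [g for g in range(G) if g not in Y[t]]
        for yp in Y[t]:
            hinges = np.maximum(0.0, b - F[t, yp] + F[t, neg])
            total += hinges.sum() / (len(Y[t]) * len(neg))
    return total

def max_hinge_loss(F, Y, b=1.0):
    # Eq. (9): per positive label, hinge against the single worst negative class.
    n, G = F.shape
    total = 0.0
    for t in range(n):
        neg = [g for g in range(G) if g not in Y[t]]
        for yp in Y[t]:
            total += max(0.0, b - F[t, yp] + F[t, neg].max()) / len(Y[t])
    return total

rng = np.random.default_rng(1)
F = rng.normal(size=(4, 6))      # scores f(x_t; beta_g): 4 samples, 6 classes
Y = [{0}, {1, 2}, {5}, {3}]      # label sets Y_t (possibly multiple positives)
print(pairwise_loss(F, Y), max_hinge_loss(F, Y))
```

Note that (9) upper-bounds each inner sum of (8) up to the normalizer over negatives, since the worst violator's hinge is at least the average hinge.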
This maximum hinge-loss is sometimes called multiclass SVM, and can be derived from a margin-bound (Mohri et al., 2012). Daniely et al. (2012) theoretically compared multiclass SVM with one-vs-all, one-vs-one, tree-based linear classifiers, and error-correcting output code linear classifiers. They showed that the hypothesis class of multiclass SVM contains that of one-vs-all and tree-classifiers, which strictly contain the hypothesis class of one-vs-one classifiers. Thus the potential performance with multiclass SVM is larger. However, they also showed that the approximation error of one-vs-one is smallest, with multiclass SVM next smallest.

Statnikov et al. (2005) compared eight multiclass classifiers, including those of Weston and Watkins (1999) and Crammer and Singer (2001), on nine cancer classification problems with 3 to 26 classes and fewer than 400 samples per problem. On these small-scale datasets, they found the Crammer and Singer (2001) classifier was best (or tied) on 2/3 of the datasets, and the pairwise loss given in (8) performed almost as well.

Lee et al. (2004) prove in their Lemma 2 that previous approaches to multiclass SVMs are not guaranteed to be asymptotically consistent. For more on consistency of multiclass classification loss functions, see Rifkin and Klatau (2004), Tewari and Bartlett (2007), Zhang (2004), and Mroueh et al. (2012). Lee et al. (2004) proposed a multiclass loss function that is consistent. They force the class discriminants to sum to zero such that $\sum_g f(x; \beta_g) = 0$ for all $x$, and define the loss:

$$L_{\text{totalloss}}(\{\beta_g\}) = \sum_{t=1}^{n} \sum_{y^- \in Y_t^C} \left| f(x_t; \beta_{y^-}) + \frac{1}{G-1} \right|_+. \qquad (10)$$

This loss function jointly trains the class discriminants so that the total sum of wrong class discriminants for each training sample is small. Minimizing this loss can be expressed as a constrained quadratic program. The experiments of Lee et al. (2004) on a few small datasets did not show much difference between the performance of (10) and (8).

5. Online Loss Functions for Training Large-scale Multiclass Classifiers

If there are a large number of training samples $n$, then computing the loss for each candidate set of classifier parameters becomes computationally prohibitive. The usual solution is to minimize the loss in an online fashion with stochastic gradient descent, but exactly how to sample the stochastic gradients becomes a key issue. Next, we review two stochastic gradient approaches that correspond to different loss functions: AUC sampling (Grangier and Bengio, 2008) and the WARP sampling used in the Wsabie classifier (Weston et al., 2011).

5.1 AUC Sampling

For a large number of training samples $n$, Grangier and Bengio (2008) proposed optimizing (8) by sequentially uniformly sampling from each of the three sums in (8):

1. draw one training sample $x_t$,
2. draw one correct class $y^+$ from $Y_t$,

[...]

To sample a violating class from $V_{t,y^+}$, the negative classes in $Y_t^C$ are uniformly randomly sampled until a class that satisfies the violation constraint (12) is found, or the number of allowed such trials (generally set to be $G$) is exhausted. The rank $r(y^+)$ needed to calculate the weight in (13) is estimated to be $(G-1)$ divided by the number of negative classes $y^- \in Y_t^C$ that had to be tried before finding a violating class $y_v$ from $V_{t,y^+}$.

5.3 Some Notes Comparing AUC and WARP Loss

We note that WARP sampling is more likely than AUC sampling to update the parameters of a training sample's positive class $y^+$ if $y^+$ has few violating classes, that is, if $(x_t, y_t)$ is already highly-ranked by (1). Specifically, suppose a training pair $(x_t, y_t)$ is randomly sampled, and suppose $H > 0$ of the $G$ classes are violators such that their hinge-loss is non-zero with respect to $(x_t, y^+)$. WARP sampling will draw random classes until it finds a violator and makes an update, but AUC will only make an update if the one random class it draws happens to be a violator, so only $H/(G-1)$ of the time. By definition, the higher-ranked the correct class $y_t$ is for $x_t$, the smaller the number of violating classes $H$, and the less likely AUC sampling will update the classifier to learn from $(x_t, y_t)$. In this sense, WARP sampling is more focused on fixing class parameters that are almost right already, whereas AUC sampling is more focused on improving class parameters that are very wrong. At test time, classifiers choose only the highest-ranked class discriminant as a class label, and thus the fact that AUC sampling updates more often on lower-ranked classes is likely the key reason that WARP sampling performs so much better in practice (see the experiments of Weston et al. (2011) and the experiments in this paper). Even in ranking, it is usually only the top-ranked classes that are of interest. However, the WARP weight $w(y^+)$ given in (13) partly counteracts this difference by assigning greater weights (equivalently, a larger step-size to the stochastic gradient) if the correct class has many violators, as then its rank is lower. In this paper, one of the proposals we make is to use constant weights $w(y^+) = 1$, so the training is even more focused on improving classes that are already highly-ranked.

AUC sampling is also inefficient because so many of the random samples result in a zero gradient. In fact, we note that the probability that AUC sampling will update the classifier decreases if there are more classes. Specifically, suppose for a given training sample pair $(x_t, y_t)$ there are $H$ classes that violate it, and that there are $G$ classes in total. Then the probability that AUC sampling updates the classifier for $(x_t, y_t)$ is $H/(G-1)$, which linearly decreases as the number of classes $G$ is increased.
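As a concrete sketch of the WARP sampling step described above (our own reconstruction from the text, since the full listing of Section 5.2 is not part of this transcript), negatives are drawn uniformly until one violates the margin, and the rank of the positive class is estimated from the number of draws:

```python
import numpy as np

rng = np.random.default_rng(0)

def warp_sample(scores, y_pos, b=1.0, max_trials=None):
    # Draw uniform random negative classes until one violates the margin
    # b - f(x; y+) + f(x; y-) > 0, or the trial budget (here G) is exhausted.
    G = len(scores)
    negatives = [g for g in range(G) if g != y_pos]
    trials_allowed = max_trials if max_trials is not None else G
    for trials in range(1, trials_allowed + 1):
        y_neg = negatives[rng.integers(len(negatives))]
        if b - scores[y_pos] + scores[y_neg] > 0:
            rank_estimate = (G - 1) // trials   # feeds the WARP weight in (13)
            return y_neg, rank_estimate
    return None, 0    # no violator found: skip the update

scores = np.array([2.0, 0.5, 1.8, -1.0])   # f(x; beta_g) for G = 4 classes
print(warp_sample(scores, y_pos=0))        # here only class 2 can violate
```

With the proposed simplification $w(y^+) = 1$, the rank estimate is simply not used to scale the step size.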
5.4 Online Versions of Other Multiclass Losses

WARP sampling implements an online version of the pairwise loss given in (8) (Weston et al., 2011). One can also interpret the WARP loss sampling as an online approximation of the maximum hinge loss given in (9), where the maximum violating class is approximated by the sampled violating class. This interpretation does not call for a rank-based weighting $w(y^+)$, and in fact we found that setting $w(y^+) = 1$ improved accuracy by roughly 20% on a large-scale image annotation task (see Table 6). A better approximation of (9) would require sampling multiple violating classes and then taking the class with the worst discriminant; we did not try this due to the expected time needed to find multiple violating classes. Further, we hypothesize that choosing the class with the largest violation as the violating class could actually perform poorly for practical highly multiclass problems like Imagenet, because the worst discriminant may belong to a class that is a missing correct label, rather than an incorrect label.

An online version of the loss proposed by Lee et al. (2004) and given in (10) would be more challenging to implement because the $G$ class discriminants are required to be normalized; we do not know of any such experiments.

5.5 The Wsabie Classifier

Weston et al. (2011) combined the WARP sampling with online learning of a supervised linear dimensionality reduction. They learn an embedding matrix $W \in \mathbb{R}^{m \times d}$ that maps a given $d$-dimensional feature vector $x$ to an $m$-dimensional "embedded" vector $Wx \in \mathbb{R}^m$, where $m \ll d$, and then the $G$ class-specific discriminants of dimension $m$ are trained to separate classes in the embedding space. Weston et al. (2011) referred to this combination as the Wsabie classifier. This changes the WARP loss given in (11) to the non-convex Wsabie loss, defined:

$$L_{\text{Wsabie}}(W, \{\beta_g\}) = \sum_{t=1}^{n} \sum_{y^+ \in Y_t} \sum_{y_v \in V_{t,y^+}} w(y^+) \left| b - f(Wx_t; \beta_{y^+}) + f(Wx_t; \beta_{y_v}) \right|_+.$$

Adding the embedding matrix $W$ changes the number of parameters from $Gd$ to $Gm + md$. For a large number of classes $G$ and a small embedding dimension $m$ (the case of interest here), this reduces the overall parameters, and so the addition of the embedding matrix $W$ acts as a regularizer, reduces memory, and reduces testing time.
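A minimal NumPy sketch of this parameterization (the random initialization scale and variable names are ours, not the paper's): a shared embedding $W$ plus one $m$-dimensional vector per class, giving $Gm + md$ parameters instead of $Gd$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, G = 1024, 128, 21171    # feature dim, embedding dim, number of classes

W = rng.normal(scale=0.01, size=(m, d))   # shared embedding matrix
B = rng.normal(scale=0.01, size=(G, m))   # one m-dim discriminant vector per class

def scores(x, euclidean=True):
    z = W @ x                              # embed the feature vector: Wx
    if euclidean:                          # f(Wx; beta_g) = -||beta_g - Wx||^2
        return -((B - z) ** 2).sum(axis=1)
    return B @ z                           # linear: f(Wx; beta_g) = beta_g^T Wx

x = rng.normal(size=d)
print(int(np.argmax(scores(x))))           # predicted class via eq. (1)
print(G * m + m * d, "parameters with W, vs", G * d, "without")
```

For these (representative) sizes, the embedded model has about 2.8M parameters versus about 21.7M without the embedding.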
6. Online Adaptation of the Empirical Loss to Reduce Impact of Highly Confusable Classes

In this section, we present a simple and memory-efficient online implementation of the empirical class-confusion loss we proposed in Section 3.3 that reduces the impact of highly confusable classes on the standard empirical loss. First, we describe the batch variant of this proposal and quantify its effect. Then in Section 6.2 we describe a sampling version. In Section 6.3, we propose a simple extension that experimentally increases the accuracy of the resulting classifier, without using additional memory. We show the proposed online strategy works well in practice in Section 9.

6.1 Reducing the Effect of Highly Confusable Classes By Ignoring Last Violators

We introduce the key idea of a last violator class with a simple example before a formal definition. Suppose during online training the hundredth training sample $x_{100}$ has label $y_{100} = $ tiger, and that the last training sample we saw labelled tiger was $x_5$. And suppose lion was a violating class for that training sample pair $(x_5, \text{tiger})$, that is, $|1 - {}$ [...]

6.2 Ignoring Sampled Last Violators for Online Learning

Building on the WARP sampling proposed by Weston et al. (2011) and reviewed in Section 5.2, we propose an online sampling implementation of (14), where for each class $y^+$ we store one sampled last violator and only update the classifier if the current violator is not the same as the last violator. Specifically:

1. draw one training sample $x_t$,
2. draw one correct class $y^+$ from $Y_t$,
3. if there is no last violator $v_{t,y^+}$, or if $v_{t,y^+}$ exists but is not a violator for $(x_t, y^+)$, then draw and store one violating class $y_v$ from $V_{y^+}$ and
   (a) compute the corresponding stochastic gradient of the loss in (8),
   (b) update the parameters.

Table 1 re-visits the same example as earlier, and illustrates for eight sequential training examples whose training label was cat what the last violator class is, whether the last violator is a current violator (in which case the current error is ignored), or if not ignored, which of the current violators is randomly sampled for the classifier update. Throughout the training, the state of the sampled last violator for any class $y^+$ can be viewed as a Markov chain. We illustrate this for the class $y^+$ and two possible violating classes $g$ and $h$ in Figure 3.

In the experiments to follow, we couple the proposed online empirical class-confusion loss sampling strategy with an embedding matrix as in the Wsabie algorithm for efficiency and regularization, and refer to this as Wsabie++. A complete description of Wsabie++ is given in Table 3, including the adagrad step-size updates described in the next section. The memory needed to implement this discounting is $O(G)$ because only one last violator class is stored for each of the $G$ classes.

| | Set of Cat Violators | Cat's LV | Cat's LV Violates? | New Violator Sampled? |
|---|---|---|---|---|
| 1 | dog and pig | none | - | dog |
| 2 | dog and pig | dog | yes | ignored |
| 3 | dog | dog | yes | ignored |
| 4 | dog and pig | dog | yes | ignored |
| 5 | pig | dog | no | pig |
| 6 | no violators | pig | no | none |
| 7 | dog | none | - | dog |
| 8 | dog | dog | yes | ignored |

Table 1: Example of ignoring sampled last violators for eight sequential samples (one per row) whose training label is cat.

[...]

To use (15) in an online setting, each time a training sample and its positive class are drawn, we check if any $q$-th order last violator $v^q_{t,y^+}$ for any $q \leq Q$ is a current violator, and if so, we ignore that training sample and move directly to the next training sample without updating the classifier parameters. Table 3 gives the complete proposed sampling and updating algorithm for Euclidean discriminant functions, including the adaptive adagrad step-size explained in Section 7, which follows. For Euclidean discriminant functions we did not find (experimentally) that we needed any constraints or additional regularizers on $W$ or $\{\beta_g\}$, though if desired a regularization step can be added.
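The following sketch (our own paraphrase of the sampling steps above, for the simplest case $Q = 1$) shows the bookkeeping: one stored last violator per positive class, and an update only when the stored violator no longer violates.

```python
import numpy as np

b = 1.0
last_violator = {}   # one stored class per positive class: O(G) memory

def violates(scores, y_pos, y_neg):
    # Margin violation test: b - f(x; y+) + f(x; y-) > 0.
    return b - scores[y_pos] + scores[y_neg] > 0

def maybe_update(scores, y_pos, sample_violator, gradient_step):
    # One step of the proposed sampling (Q = 1): skip the update entirely
    # if the stored last violator for y_pos is still violating.
    lv = last_violator.get(y_pos)
    if lv is not None and violates(scores, y_pos, lv):
        return "ignored"                      # repeat confusion: discount it
    y_v = sample_violator(scores, y_pos)      # e.g., WARP-style random search
    if y_v is None:
        last_violator.pop(y_pos, None)        # no violators: clear stored state
        return "no violator"
    last_violator[y_pos] = y_v
    gradient_step(y_pos, y_v)                 # stochastic gradient step on (8)
    return "updated"

scores = np.array([0.2, 1.5, 0.1])  # toy discriminant values; class 0 is correct
sample = lambda s, yp: 1 if violates(s, yp, 1) else None  # trivial "search"
step = lambda yp, yv: None                                # gradient update stub
print(maybe_update(scores, 0, sample, step))  # "updated": stores violator 1
print(maybe_update(scores, 0, sample, step))  # "ignored": repeat confusion
```

This reproduces the behavior of Table 1: a repeated violator is treated as a confusable pair and its errors are ignored until the stored violator stops violating.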
7. Adagrad For Learning Rate

Convergence speed of stochastic gradient methods is sensitive to the choice of step sizes. Recently, Duchi et al. (2011) proposed a parameter-dependent learning rate for stochastic gradient methods. They proved that their approach has strong theoretical regret guarantees for convex objective functions, and experimentally it produced better results than comparable methods such as regularized dual averaging (Xiao, 2010) and the passive-aggressive method (Crammer et al., 2006). In our experiments, we applied adagrad both to the convex training of the one-vs-all SVMs and AUC sampling, as well as to the non-convex Wsabie++ training. Inspired by our preliminary results using adagrad for non-convex optimization, Dean et al. (2012) also tried adagrad for non-convex training of a deep belief network, and also found it produced substantial improvements in practice.

The main idea behind adagrad is that each parameter gets its own step size, and each time a parameter is updated its step size is decreased to be proportional to the running sum of the magnitude of all previous updates. For simplicity, we limit our description to the case where the parameters being optimized are unconstrained, which is how we implemented it. For memory and computational efficiency, Duchi et al. (2011) apply adagrad separately for each parameter (as opposed to modeling correlations between parameters).

We applied adagrad to adapt the step size for the $G$ classifier discriminants $\{\beta_g\}$ and the $m \times d$ embedding matrix $W$. We found that we could save memory without affecting experimental performance by averaging the adagrad learning rate over the embedding dimensions such that we keep track of one scalar adagrad weight per class. That is, let $\gamma_{\tau,g}$ denote the stochastic gradient for $\beta_g$ at time $\tau$; then we update $\beta_g$ as follows:

$$\beta_{g,t+1} = \beta_{g,t} - \eta \left( \sum_{\tau=0}^{t} \frac{\gamma_{\tau,g}^T \gamma_{\tau,g}}{d} \right)^{-1/2} \gamma_{t,g}. \qquad (16)$$

Analogously, we found it experimentally effective and more memory-efficient to keep track of one averaged scalar adagrad weight for each of the $m$ rows of the embedding matrix $W$.

There are two main effects to using adagrad. First, suppose there are two classes that are updated equally often; then the class with larger stochastic gradients $\{\gamma_g\}$ will experience a faster-decaying learning rate. Second, and we believe the more relevant issue for our use, is that some classes are updated frequently, and some classes rarely. Suppose that all stochastic gradients $\{\gamma_g\}$ have the same magnitude; then the classes that are updated more rarely experience relatively larger updates. In our experiments the second effect was predominant, which we tested by setting the learning rate for each parameter proportional to the inverse square root of the number of times that parameter has been updated. This "counting adagrad" produced results that were not statistically different from using (16). (The experimental results in this paper are reported using adagrad proper as per (16).) We use $\lambda$ to refer to the running sum of gradient magnitudes in the complete Wsabie++ algorithm description given in Table 3.
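A minimal sketch of the per-class scalar variant of (16) follows (our own rendering: the step size $\eta$, the accumulator name, and the small numerical guard `eps`, which (16) does not include, are our assumptions). The squared gradient magnitude, averaged over the $d$ dimensions, is accumulated per class and sets that class's step size:

```python
import numpy as np

class PerClassAdagrad:
    # One scalar adagrad accumulator per class, following eq. (16):
    # step = eta * (sum_tau ||gamma_{tau,g}||^2 / d)^(-1/2).
    def __init__(self, G, d, eta=0.1, eps=1e-12):
        self.accum = np.zeros(G)   # running sums, averaged over dimensions
        self.d, self.eta, self.eps = d, eta, eps

    def update(self, beta, g, grad):
        self.accum[g] += grad @ grad / self.d
        beta[g] -= self.eta * grad / np.sqrt(self.accum[g] + self.eps)

G, d = 5, 8
beta = np.zeros((G, d))
opt = PerClassAdagrad(G, d)
rng = np.random.default_rng(0)
for _ in range(3):
    opt.update(beta, g=2, grad=rng.normal(size=d))   # only class 2's rate decays
print(np.sqrt(opt.accum))   # per-class scalar adagrad weights
```

Storing one scalar per class instead of one per parameter is exactly the memory saving described above: $O(G)$ accumulators rather than $O(Gm)$.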
8. Experiments

We first detail the datasets used. Then in Section 8.2 we describe the features. In Section 8.3 we describe the different classifiers compared and how the parameters and hyperparameters were set.

8.1 Data Sets

Experiments were run with four datasets, as summarized in Table 4 and detailed below.

| | 16k ImageNet | 22k ImageNet | 21k WebData | 97k WebData |
|---|---|---|---|---|
| Number of Classes | 15,589 | 21,841 | 21,171 | 96,812 |
| Number of Samples | 9 million | 14 million | 9 million | 40 million |
| Number of Features | 1024 | 479 | 1024 | 1024 |

Table 4: Datasets.

8.1.1 ImageNet Data Sets

ImageNet (Deng et al., 2009) is a large image dataset organized according to WordNet (Fellbaum, 1998). Concepts in WordNet, described by multiple words or word phrases, are hierarchically organized. ImageNet is a growing image dataset that attaches one of these concepts to each image using a quality-controlled, human-verified labeling process. We used the spring 2010 and fall 2011 releases of the Imagenet dataset. The spring 2010 version has around 9M images and 15,589 classes (16k ImageNet). The fall 2011 version has about 14M images and 21,841 classes (22k ImageNet). For both datasets, we separated out 10% of the examples for validation, 10% for test, and the remaining 80% was used for training.

8.1.2 Web Data Sets

We also had access to a large proprietary set of images taken from the web, together with a noisy annotation based on anonymized users' click information. We created two datasets from this corpus that we refer to as 21k WebData and 97k WebData. The 21k WebData contains about 9M images, divided into 20% for validation, 20% for test, and 60% for train, and the images are labelled with 21,171 distinct classes. The 97k WebData contains about 40M images, divided into 10% for validation, 10% for test, and 80% for train, and the images are labelled with 96,812 distinct classes.

There are five main differences between the WebData and ImageNet. First, the types of labels found in Imagenet are more academic, following the strict structure of WordNet. In contrast, the WebData labels are taken from a set of popular queries that were the input to a general-purpose image search engine, so they include people, brands, products, and abstract concepts. Second, the number of images per label in Imagenet is artificially forced to be somewhat uniform, while the WebData distribution of number of images per label is generated by popularity with users, and is thus more exponentially distributed. Third, because of the popular origins of the Web datasets, classes may be translations of each other, plural vs. singular concepts, or synonyms (for examples, see Table 7). Thus we expect more highly-confusable classes for the WebData than ImageNet. A fourth key difference is that Imagenet disambiguates polysemous labels whereas WebData does not; for example, an image labeled palm might look like the palm of a hand or like a palm tree. The fifth difference is that there may be multiple given positive labels for some of the Web samples; for example, the same image might be labelled mountain, mountains, Himalaya, and India. Lastly, classes may be at different and overlapping precision levels, for example the class cake and the class wedding cake.

8.2 Features

We do not focus on feature extraction in this work, although features certainly can have a big impact on performance. For example, Sanchez and Perronnin (2011) recently achieved a 160% gain in accuracy on the 10k ImageNet dataset by changing the features but not the classification method. In this paper we use features similar to those used in Weston et al. (2011). We first combined multiple spatial (Grauman and Darrell, 2007) and multiscale color and texton histograms (Leung and Malik, 1999) for a total of about $5 \times 10^5$ dimensions. The descriptors are somewhat sparse, with about 50,000 non-zero weights per image. Some of the constituent histograms are normalized and some are not. We then perform kernel PCA (Schoelkopf et al., 1999) on the combined feature representation using the intersection kernel (Barla et al., 2003) to produce a 1024-dimensional or 479-dimensional input vector per image (see Table 4), which is then used as the feature vector for the classifiers.
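To make the feature pipeline concrete, here is a minimal sketch of kernel PCA with the histogram intersection kernel. This is a generic textbook implementation, not the paper's code: the paper's pipeline operates on far higher-dimensional sparse histograms, and the tiny dimensions here are ours.

```python
import numpy as np

def intersection_kernel(A, B):
    # K[i, j] = sum_k min(A[i, k], B[j, k]) for rows of histograms.
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

def kernel_pca(X, out_dim):
    # Project X onto the top out_dim components in the kernel-induced space.
    n = len(X)
    K = intersection_kernel(X, X)
    J = np.eye(n) - np.ones((n, n)) / n          # center the kernel matrix
    Kc = J @ K @ J
    vals, vecs = np.linalg.eigh(Kc)              # eigenvalues in ascending order
    top = slice(n - out_dim, n)
    alphas = vecs[:, top] / np.sqrt(np.maximum(vals[top], 1e-12))
    return Kc @ alphas                           # n x out_dim embedded features

rng = np.random.default_rng(0)
hists = rng.random((50, 200))                    # 50 images, 200 histogram bins
feats = kernel_pca(hists, out_dim=16)
print(feats.shape)                               # (50, 16)
```

The resulting dense low-dimensional vectors play the role of the 1024- or 479-dimensional inputs in Table 4.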
8.3 Classifiers Compared and Hyperparameters

We experimentally compared the following linear classifiers: nearest means, one-vs-all SVMs, AUC, Wsabie, and the proposed Wsabie++ classifiers. Table 5 compares these methods as they were implemented for the experiments.

The nearest means classifier is the most efficient to train of the compared methods, as it only passes over the training samples once and computes the mean of the training feature vectors for each class (and there are no hyperparameters). Like the nearest means classifier, we implemented Wsabie++ with Euclidean discriminants (as detailed in Table 3), and as such it can be considered a discriminative nearest means classifier. Testing with Euclidean discriminants can easily be made faster by applying exact or approximate fast k-NN methods, where the class prototypes $\{\beta_g\}$ play the role of the neighbors. Further, Euclidean discriminants lend themselves more naturally to visualization than the inner product, as each class is represented by a prototype.

One-vs-all linear SVMs are the most popular choice for large-scale classifiers due to studies showing their good performance, their parallelizable training, relatively small memory, and fast test-time (Rifkin and Klatau, 2004; Deng et al., 2010; Sanchez and Perronnin, 2011; Perronnin et al., 2012; Lin et al., 2011). Perronnin et al. (2012) highlight the importance of getting the right balance of negative to positive examples used to train the one-vs-all linear SVMs. As in their paper, we cross-validate the expected number of negative examples per positive example; the allowable choices were powers of 2. In contrast, earlier published results by Weston et al. (2011) that compared Wsabie to one-vs-all SVMs used one negative example per positive example, analogous to the AUC classifier. We included this comparison, which we labelled One-vs-all SVMs 1+:1- in the tables.

Both Wsabie and Wsabie++ jointly train an embedding matrix $W$ as described in Section 5.5. The embedding dimension $d$ was chosen on the validation set from the choices $d \in \{32, 64, 96, 128, 192, 256, 384, 512, 768, 1024\}$. In addition, we created ensemble Wsabie and Wsabie++ classifiers by concatenating $\lfloor m/d \rfloor$ such $d$-dimensional models to produce a classifier with a total of $m$ parameters, to compare classifiers that require the same memory and test-time.

All hyperparameters were chosen based on the accuracy on a held-out validation set. Step-size, margin, and regularization constant hyperparameters were varied by powers of ten. The order $Q$ of the last violators was varied by powers of 2. Chosen hyperparameters are recorded in Table 9. Both the pairwise loss and Wsabie classifiers are implemented with standard $\ell_2$ constraints on the class discriminants (and for Wsabie, on the rows of the embedding matrix). We did not use any regularization constraints for Wsabie++. We initialized the Wsabie parameters and SVM parameters uniformly randomly within the constraint set. We initialize the proposed training by setting all $\beta_g$ to the origin, and all components of the embedding matrix are equally likely to be $-1$ or $1$. Experiments with different initialization schemes for these different classifiers showed that different (reasonable) initializations gave very similar results.

With the exception of nearest means, all classifiers were trained online with stochastic gradients. We also used adagrad for the convex optimizations of both one-vs-all SVMs and the AUC sampling, which increased the speed of convergence. Recently, Perronnin et al. (2012) showed good results with one-vs-all SVM classifiers and the WARP loss where they also cross-validated an early-stopping criterion. Adagrad reduces step sizes over time, and this removed the need to worry about early stopping. In fact, we did not see any obvious overfitting with any of the classifier training (validation set and test set errors were statistically similar). Each algorithm was allowed to train for up to 100 loops through the entire training set, or until the validation set performance had not changed in 24 hours. Even those runs that ran the entire 100 loops appeared to have essentially converged. Implemented in C++ without parallelization, all algorithms (except nearest means) took around one week to train on the 16k Imagenet dataset, around two weeks to train on the 21k and 22k datasets, and around one month to train on the 97k dataset. Also, in all cases roughly 80% of the validation accuracy was achieved in roughly the first 20% of the training time.

Because stochastic gradient descent uses random sampling of the training samples, multiple runs will produce slightly different results. To address this randomness, we ran five runs of each classifier for each set of candidate parameters, and reported the test accuracy [...]

| Classifier | Test Accuracy |
|---|---|
| Wsabie (Weston et al., 2011) | 3.7% |
| Wsabie + 10 last violators | 5.0% |
| Wsabie + adagrad | 5.0% |
| Wsabie + $w(r(y^+)) = 1$ in (11) | 4.1% |
| Wsabie + adagrad + 10 last violators | 5.9% |
| Wsabie + adagrad + $w(r(y^+)) = 1$ in (11) | 6.0% |
| Wsabie + adagrad + $w(r(y^+)) = 1$ in (11) + 1 last violator | 6.3% |
| Wsabie + adagrad + $w(r(y^+)) = 1$ in (11) + 10 last violators | 7.1% |
| Wsabie + adagrad + $w(r(y^+)) = 1$ in (11) + 100 last violators | 6.8% |

Table 6: Effect of the proposed differences compared to Wsabie for a $d = 100$ dimensional embedding space on 21k WebData.

Table 7 gives examples of the classes corresponding to neighboring $\{\beta_g\}$ in the embedded feature space after the Wsabie++ training.

| Class | 1-NN | 2-NN | 3-NN | 4-NN | 5-NN |
|---|---|---|---|---|---|
| poodle | caniche | pudel | labrador | puppies | cocker spaniel |
| dolphin | dauphin | delfin | dolfijnen | delfiner | dolphins |
| San Diego | Puerto Madero | Sydney | Vancouver | Kanada | Tripoli |
| mountain | mountains | montagne | Everest | Alaska | Himalaya |
| router | modem | switch | server | lan | network |
| calligraphy | fonts | Islamic calligraphy | borders | quotes | network |

Table 7: For each of the classes on the left, the table shows the five nearest (in terms of Euclidean distance) class prototypes $\{\beta_g\}$ in the proposed discriminatively trained embedded feature space for the 21k WebData set. Because these classes originated as web queries, some class names are translations of each other, for example dolphin and dauphin (French for dolphin). While these may seem like exact synonyms, in fact different language communities often have slightly different visual notions of the same concept. Similarly, the classes Obama and President Obama are expected to be largely overlapping, but their class distributions differ in the formality and context of the images.
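The neighbor lists in Table 7 come from Euclidean distances between class prototypes; a minimal sketch of that lookup follows (the toy prototypes and class names are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
names = ["poodle", "caniche", "dolphin", "dauphin", "router", "modem"]
B = rng.normal(size=(len(names), 64))       # class prototypes beta_g

def nearest_classes(query, k=3):
    q = names.index(query)
    dist = np.linalg.norm(B - B[q], axis=1)  # Euclidean distance to each prototype
    dist[q] = np.inf                         # exclude the query class itself
    return [names[i] for i in np.argsort(dist)[:k]]

print(nearest_classes("poodle"))
```

With exact or approximate fast k-NN structures in place of the brute-force distance computation, the same lookup also accelerates test-time classification, as noted in Section 8.3.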
Lastly, we illustrate how the Wsabie++ test accuracy depends on the number of embedding dimensions. These results are for the 21k WebData set, with the step-size and margin [...]

| | Step size | Margin | Embedding dimension | Balance | # LVs |
|---|---|---|---|---|---|
| One-vs-all SVM 1+:1- | | | | | |
| 16k ImageNet | .01 | .1 | | | |
| 21k ImageNet | .01 | .1 | | | |
| 21k WebData | .1 | 1 | | | |
| 97k WebData | .01 | .1 | | | |
| One-vs-all SVM | | | | | |
| 16k ImageNet | .01 | .1 | | 64 | |
| 21k ImageNet | .01 | .1 | | 64 | |
| 21k WebData | .1 | 1 | | 64 | |
| 97k WebData | .01 | .1 | | 128 | |
| AUC Sampling | | | | | |
| 16k ImageNet | .01 | .1 | | | |
| 21k ImageNet | .01 | .1 | | | |
| 21k WebData | .001 | .01 | | | |
| 97k WebData | .001 | .01 | | | |
| Wsabie | | | | | |
| 16k ImageNet | .01 | .1 | 128 | | |
| 21k ImageNet | .001 | .1 | 128 | | |
| 21k WebData | .001 | .1 | 256 | | |
| 97k WebData | .0001 | .1 | 256 | | |
| Wsabie++ | | | | | |
| 16k ImageNet | 10 | 10,000 | 192 | | 8 |
| 21k ImageNet | 10 | 10,000 | 192 | | 8 |
| 21k WebData | 10 | 10,000 | 256 | | 8 |
| 97k WebData | 10 | 10,000 | 256 | | 32 |

Table 9: Classifier parameters chosen using the validation set.

[...] to consider for each positive class, rather than the WARP sampling, which draws negative classes until it finds a violator. Preliminary results showed that the validation set chose the largest parameter choice, which was almost the same as the number of classes. Thus that approach required another hyperparameter, but seemed to be doing exactly what WARP sampling does, and appeared to offer no accuracy improvement over WARP sampling.

This research focused on accurate and efficient classification, and not on the issue of training time. With the exception of nearest means, the methods compared were implemented with stochastic gradient descent for efficient online training to deal with the large number of training samples $n$. As implemented, the methods took roughly equally long to train. However, parallel training of the $G$ one-vs-all SVMs would have been roughly $G$ times as fast. While not as naturally parallelizable, we have had some success in parallelizing the [...]

[...] $t-1$, $O_t = 1$ if and only if $Z_t = 1$ and $Z_{t-1} = 0$. Thus,

$$E[O_t] = P(Z_t = 1, Z_{t-1} = 0) = P(Z_t = 1)\,P(Z_{t-1} = 0) = p(1-p)$$

by the independence of the Bernoulli random variables $Z_t$ and $Z_{t-1}$. Then the expected number of confusions counted by (14) is $E[\sum_t O_t] = \sum_t E[O_t]$ by linearity, which can be expanded: $E[O_1] + \sum_{t=2}^{n} E[O_t] = p + (n-1)p(1-p)$.

References

E.L. Allwein, R.E. Schapire, and Y. Singer. Reducing multiclass to binary: a unifying approach for margin classifiers. Journal of Machine Learning Research, 1:113-141, 2000.

A. Barla, F. Odone, and A. Verri. Histogram intersection kernel for image classification. Intl. Conf. Image Processing (ICIP), 3:513-516, 2003.

S. Bengio, J. Weston, and D. Grangier. Label embedding trees for large multi-class tasks. In NIPS, 2010.

Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.

Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
E.J. Bredensteiner and K.P. Bennet. Multicategory classification by support vector machines. Computational Optimization and Applications, 12:53-79, 1999.

L.D. Brown, T.T. Cai, and A. DasGupta. Interval estimation for a binomial proportion. Statistical Science, 16(2):101-117, 2001.

K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2001.

K. Crammer and Y. Singer. On the learnability and design of output codes for multiclass problems. Machine Learning, 47(2):201-233, 2002.

K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551-585, 2006.

A. Daniely, S. Sabato, and S. Shalev-Shwartz. Multiclass learning approaches: A theoretical comparison with implications. In NIPS, 2012.

J. Dean, G.S. Corrado, R. Monga, K. Chen, M. Devin, Q.V. Le, M.Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In NIPS, 2012.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.

[...]

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, Cambridge, MA, 2012.

Y. Mroueh, T. Poggio, L. Rosasco, and J.-J.E. Slotine. Multiclass learning with simplex coding. In NIPS, 2012.

D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, 2006.

A. Paton, N. Brummitt, R. Govaerts, K. Harman, S. Hinchcliffe, B. Allkin, and E. Lughadha. Target 1 of the global strategy for plant conservation: a working list of all known plant species - progress and prospects. Taxon, 57:602-611, 2008.

F. Perronnin, Z. Akata, Z. Harchaoui, and C. Schmid. Towards good practice in large-scale learning for image classification. In CVPR, 2012.

R. Rifkin and A. Klatau. In defense of one-vs-all classification. Journal of Machine Learning Research, 5:101-141, 2004.

J. Sanchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In CVPR, 2011.

B. Schoelkopf, A.J. Smola, and K.R. Muller. Kernel principal component analysis. In Advances in Kernel Methods: Support Vector Learning, pages 327-352, 1999.

O. Shamir and O. Dekel. Multiclass-multilabel classification with more classes than examples. In Proc. AISTATS, 2010.

A. Statnikov, C.F. Aliferis, I. Tsamardinos, D. Hardin, and S. Levy. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21(5):631-643, 2005.

A. Tewari and P.L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 2007.

N. Usunier, D. Buffoni, and P. Gallinari. Ranking with ordered weighted pairwise classification. In ICML, 2009.

V. Vapnik. Statistical Learning Theory. Wiley, 1998.

J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, Dept. Computer Science, Royal Holloway, University of London, 1998.

J. Weston and C. Watkins. Support vector machines for multi-class pattern recognition. In Proc. European Symposium on Artificial Neural Networks, 1999.

J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Intl. Joint Conf. Artificial Intelligence (IJCAI), pages 2764-2770, 2011.

J. Weston, A. Makadia, and H. Yee. Label partitioning for sublinear ranking. In ICML, 2013a.

J. Weston, R. Weiss, and H. Yee. Affinity weighted embedding. In Proc. International Conf. Learning Representations, 2013b.

L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Microsoft Research Technical Report MSR-TR-2010-23, 2010.

T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, 2004.