Learning Features and Parts for Fine-Grained Recognition (Invited Paper)

Jonathan Krause, Timnit Gebru, Jia Deng†, Li-Jia Li‡, Li Fei-Fei
Stanford University: jkrause, tgebru, feifeili
†University of Michigan: jiadeng@umich.edu
‡Yahoo! Research: lijiali@yahoo-inc.com

Abstract: This paper addresses the problem of fine-grained recognition: recognizing subordinate categories such as bird species, car models, or dog breeds. We focus on two major challenges: learning expressive appearance descriptors and localizing discriminative parts. To this end, we propose an object representation that detects important parts and describes fine-grained appearances. The part detectors are learned in a fully unsupervised manner, based on the insight that images with similar poses can be automatically discovered for fine-grained classes in the same domain. The appearance descriptors are learned using a convolutional neural network. Our approach requires only image-level class labels, without any use of part annotations or segmentation masks, which may be costly to obtain. We show experimentally that combining these two insights is an effective strategy for fine-grained recognition.

I. INTRODUCTION

Fine-grained recognition [1]–[7] refers to the task of distinguishing subordinate categories such as bird species [8], [9], dog breeds [10], aircraft [11], or car models [12], [13]. It is one of the cornerstones of object recognition due to the potential to make computers rival human experts in visual understanding.

However, two major challenges need to be solved before fine-grained recognition can achieve this goal. First, recognizing fine-grained classes typically requires differentiating fine details in appearance. It calls for an appearance representation that retains details critical for discrimination and discards unnecessary information. The retained discriminative details can be very subtle and highly domain-specific. For example, the two cars in Fig. 1 differ at the front bumper and turning signal. Descriptors such as SIFT [14] or HOG [15], while successful in more coarse-grained recognition tasks, may not be discriminative enough to differentiate at this level of detail.

Another challenge is discovering and locating the parts that contain discriminative details. When humans describe differences between fine-grained classes, we almost always point out the location (“vertical bars on the car's grille”, “black patch on the bird's beak”). That is, we locate the relevant object parts and then check the appearance. The Beetles in Fig. 1 are much easier to differentiate when told to explicitly look at the front bumper. This brings forward the issue of part discovery: which parts are discriminative and where are they? One possibility is to annotate the location of various parts by hand. However, this approach is costly and it may be difficult to scale up to handle many different types of fine-grained classes. We hypothesize that the ultimate solution to fine-grained recognition should entail both localizing important parts with minimal supervision and effectively describing their appearances in a way that does not discard information useful for classification.

Fig. 1. The key to fine-grained recognition is localizing important parts and representing part appearance discriminatively, as global cues describing the overall shape or color often cannot capture the subtle differences. Without any information at the level of parts it is difficult to differentiate between two very similar classes. Similarly, features such as HOG [15], which are not discriminative for fine-grained classes, do not contain enough information.

In this paper we simultaneously tackle both feature learning and part discovery. Our central idea is to combine learned features and discovered parts into a unified object representation. Specifically, we use convolutional neural networks (CNNs) to learn appearance descriptors, and perform unsupervised part discovery to obtain a collection of part detectors. By learning the features appropriate to describe the object categories in question, we let the data determine which features are effective for discrimination, which helps avoid losing information useful for categorization. By keeping part discovery completely unsupervised with respect to part annotations, we aim to make our algorithm scalable to a variety of fine-grained domains, including ones for which it is not known a priori which parts are discriminative. At recognition time, we detect parts and
represent their appearances using the learned features, leading to an “Ensemble of Localized Learned Features” (ELLF), a novel representation for fine-grained recognition. To our knowledge our approach is the first to integrate feature learning and unsupervised part discovery in fine-grained recognition.

II. RELATED WORK

A. Part-based representations

Many fine-grained recognition techniques involve part-based representations inspired by work on generic object recognition [16], [17]. Some explicitly model pose [2], [18] whereas others use less structured approaches [4], [19]–[21]. Typically part detectors are learned using hand-annotated keypoints. Our approach departs from most prior work in that the part detectors are learned with zero supervision, which means that we can tackle multiple domains, including ones for which nothing more than class labels and bounding boxes are available.

B. Feature Learning

Feature learning is a promising approach that can generate powerful appearance representations. Much work has focused on encoding low-level features such as SIFT or HOG (e.g. [5], [22], [23]) or mining discriminative templates [5], [6]. The recent success of convolutional neural networks [24], [25] on large-scale classification and face recognition [26], [27] demonstrates that powerful features can be learned directly from pixels. This inspires us to adopt convolutional neural networks (CNNs) for fine-grained recognition. Note that unlike the DeCAF system [28], which trains features on ImageNet [29], we do not perform any pre-training using additional data. This requires care when choosing the network architecture and necessitates using a larger variety of data deformations to cope with the small training set and effectively increase its size. To our knowledge this is the first time deep neural networks have been used for fine-grained recognition without any form of pre-training on additional data.

C. Other approaches

In addition to the approaches outlined above, segmentation has also been found to be particularly useful [21], [30]–[32] in fine-grained recognition tasks. Another line of research focuses on putting humans in the loop [33]–[36]. These are complementary approaches that can be jointly used with our method, and we do not attempt to incorporate these additional cues in our work.

III. APPROACH

A. Overview

Our representation builds on the intuition that we need to localize parts and then compare their appearances. Fig. 2 provides an overview of the algorithm. The main idea is to have a representation that enables easy comparison of appearance features on corresponding parts. This leads to ELLF: Ensemble of Localized Learned Features.

Suppose we have a collection of $n$ object parts with associated part detectors, which we assume for now have already been trained. Given an input image (Fig. 2(a)), let $a_i$ be the appearance of part $i$, as described by a convolutional neural network (Fig. 2(b)). The ELLF representation is then simply $(a_1, a_2, \ldots, a_n)$, the concatenation of part appearances (Fig. 2(c)). Note that due to viewpoint change and occlusion, not all parts are necessarily detected. When part $i$ is not detected, the appearance $a_i$ is set to zero, preventing a classifier (Fig. 2(e)) from using any information at that part. With images represented by ELLF, we can then train classifiers such as linear SVMs to perform fine-grained classification. In our case, the collection of parts is determined in an unsupervised framework and the parts are described using features from a convolutional neural network.

One desirable property of ELLF is that it compares the appearances of each part and aggregates the similarities together. This is different from traditional approaches for generic object recognition such as spatial pyramid matching (SPM) [37], where a linear kernel compares the appearances at the same spatial location instead of the same part. SPM is thus sub-optimal for objects of different poses, because not all parts are necessarily visible or at the same location across images.

We would like to highlight one other difference between ELLF and traditional bag-of-words representations. A linear kernel defined on bag-of-words histograms or their softly quantized generalizations such as LLC [22] roughly corresponds to comparing the presence of each visual word. In this case we have already quantized the appearances into visual words and are only checking whether specific visual words occur. The subtle appearance differences for fine-grained classes might still get lost in quantization. In contrast, since we describe object parts using features trained from a CNN, our representation keeps rich appearance descriptors in the final representation.

Now that we have defined ELLF, we proceed to describe the process of generating ELLF. There are two key components: learning discriminative appearance features and discovering parts.

B. Feature Learning

A hallmark of fine-grained recognition is that it demands rich and expressive appearance descriptors, as traditional descriptors like SIFT [14] or HOG [15] may not capture the right balance between discriminativeness and invariance for fine-grained classes. To this end we adopt the philosophy of end-to-end training of feature descriptors using neural networks, allowing the descriptors to adapt to the idiosyncrasies of individual categories. To our knowledge this is the first time deep feature learning is applied to fine-grained recognition with no pre-training on additional data. We demonstrate that even on relatively small datasets, feature learning can be effective for fine-grained recognition.

In particular, we use a convolutional neural network (CNN) [24] that accepts pixels as its input and outputs probabilities of classes. We modify the architecture of Krizhevsky et al. [25] to account for our smaller-scale data, which we have found to be very important for preventing overfitting. The network consists of two convolutional layers followed by three fully connected layers with a softmax loss. Each convolutional layer performs convolutions with a bank of filters on the 3D input matrix and outputs filter responses in the
form of a 3D matrix. Since filter parameters are learned from the data, the network has the potential to generate feature descriptors tailored to specific domains. More details are given in Sec. IV-A.

After training, we remove the fully connected layers and use the two convolutional layers as a generator of pixel-level appearance descriptors. Note that it is necessary to cut off the features at this point in order to maintain spatial information: features in the fully connected layers are completely unordered. To obtain a descriptor for a region (such as a bounding box given by a part detector), we perform max-pooling of the descriptors located inside the region. Thus, one way to interpret our parts is as movable pooling regions in a CNN architecture.

Fig. 2. Overview of our Ensemble of Localized Learned Features (ELLF) representation. Given an input image (a), we detect parts using a collection of unsupervised part detectors (Sec. III-C). We also feed the image into a convolutional neural network (CNN) (b) that outputs a grid of discriminative features (Sec. III-B). The CNN is learned with class labels and then truncated, keeping the first two convolutional layers, which preserve spatial information. We describe the appearance of each detected part using the learned CNN features by pooling in the detected region of each part (c). The appearance of any undetected part is set to zero. This results in our ELLF representation, which is then used to predict fine-grained object categories (e). In comparison, a standard CNN (d) passes the output of the convolutional layers through several fully connected layers in order to make a prediction.

C. Part Discovery

The goal of part discovery is to obtain a collection of reliable part detectors. Our key contribution is a part discovery algorithm that is fully unsupervised. Previous work has relied on hand-annotated keypoints to train part detectors [4]. Here we bypass human annotations completely, which has the advantage of scaling to very large-scale datasets.

How can we train part detectors without any annotations? The key observation is that objects with the same pose can often be automatically discovered by local low-level cues. Aligning poses between images is, in general, a difficult problem, because appearance may vary wildly even within the same category. However, localizing parts primarily depends on an understanding of the overall object shape without the need to scrutinize the local details: a blurred image of a dog may prevent you from recognizing the breed but will likely hold enough information for you to localize the parts.

This intuition motivates our part discovery procedure. We first discover sets of aligned images with similar poses. Under the assumption that images within a set are well aligned, the same parts have similar locations across images. We can thus train a part detector using the patches from the same spatial location as positive examples and patches from elsewhere as negative examples. Fig. 3 (top) illustrates this intuition. We now elaborate on the individual steps.

1) Discovering Aligned Images: The first step is discovering sets of aligned images. We use a randomized algorithm. We pick a seed image at random (Fig. 3(a)) and then retrieve nearest neighbors in terms of HOG features, extracted at multiple scales. To help reduce the influence of the background, we perform GrabCut [38] before extracting HOG features, initializing the foreground model with the object's bounding box, which is typically given in fine-grained recognition. These foreground segmentations are centered for the purpose of comparing across images. We repeat this process, randomly sampling multiple sets and using each set to generate multiple part detectors. With a reasonable number of training images to choose from, this method typically results in a set of images with nearly the same pose (Fig. 3(b)).

2) Part Selection: Next we select the parts to detect. Every location within the segmented foreground is a potential part, so we randomly sample a large number of regions with various sizes as candidates (Fig. 3(c)). We then select the parts with the highest energy, as measured by the variance of HOG across images (Fig. 3(d)). This helps prevent selecting parts which lack discriminative information: a part which does not vary at all across images cannot be useful for discrimination. Each time we select a part, we remove from the candidate list any parts that overlap more than a fixed threshold with an already selected part, set to 15% in our implementation. This helps prevent learning redundant parts for a given set of aligned images.

3) Detector Learning: We then learn a detector for each selected part (Fig. 3(e)). Specifically, let $I_j$
be the aligned images and $z^+$ be the location of the selected part. Under the assumption that the images are well aligned, our learning objective for the part detector is finding a template $w$ that minimizes the hinge loss

$$\min_w \sum_j \max\bigl(0,\ 1 - w^T h(I_j, z^+)\bigr) + \sum_j \sum_{z_j^-} \max\bigl(0,\ 1 + w^T h(I_j, z_j^-)\bigr) \tag{1}$$

where $h(I_j, z)$ extracts features (HOG) on image $I_j$ at patch location $z$, here the positive patch location $z^+$. The variables $z_j^-$ are the locations of the negative patches on image $I_j$, chosen randomly such that they do not overlap with the positive patch at location $z^+$.

We now relax the assumption that the images are well aligned, to be robust to misalignment. Instead of having a fixed $z^+$, we introduce a latent variable $z_j^+$ to represent the true location of the part on image $I_j$. Our learning objective is thus

$$\min_w \sum_j \max\bigl(0,\ 1 - \max_{z_j^+} w^T h(I_j, z_j^+)\bigr) + \sum_j \sum_{z_j^-} \max\bigl(0,\ 1 + w^T h(I_j, z_j^-)\bigr) \tag{2}$$

where we search for the best match over all possible $z_j^+$. The objective can be optimized by alternating between optimizing $z_j^+$ with fixed $w$ and optimizing $w$ with fixed $z_j^+$, similar to the latent SVM optimization introduced in [17]. We initialize the latent variable $z_j^+$ with the original $z^+$. Also similar to [17], we augment the HOG features $h(I, z)$ with $(dx, dy, dx^2, dy^2)$ to include a spatial prior that penalizes patches too far away from the original $z^+$, with $dx = x_z - x_{z^+}$ and $dy = y_z - y_{z^+}$, where $(x_z, y_z)$ is the coordinate of the location $z$ and $(x_{z^+}, y_{z^+})$ is the coordinate of the original location. This effectively defines a Gaussian prior on the true location relative to the original location $z^+$, preventing part detectors from spuriously firing at regions that by chance appear similar to the part while still allowing the parts themselves to move around in order to best fit the actual part location in each image.

At detection time, we set a threshold on the detector response. If the response is below the threshold, the part is considered not visible in the image and its appearance descriptor is set to zero, preventing the classifier from receiving any information about a part that is not present.

Fig. 3. Panels: (a) seed image, (b) neighbors, (c) candidate parts (random), (d) selected parts, (e) part detectors. Top: Our fully unsupervised part discovery pipeline. We randomly sample a seed training image (a) and retrieve the nearest neighbors (b) in terms of global HOG appearance. This allows us to identify well-aligned images with similar poses. From a large sample of random parts (c) we pick the sub-windows with high energy (d) as candidate parts, which are then used to train our final part detectors (e), visualized as the average of the patches used as positive examples in training. Bottom: Examples of parts discovered by our method. Leftmost is the seed image used to generate the set of aligned images, a subset of which is shown in the middle. Shown at the right are the average of the image patches used as positives to train each part detector and the learned weights. Our method is able to discover a variety of parts from each neighborhood.

4) Ensemble of Parts: To obtain a collection of part detectors, we repeat our discovery procedure multiple times. It is worth noting that the randomization throughout our discovery procedure can help increase the robustness of the recognition algorithm. As we will demonstrate in our experiments, increasing the number of randomly sampled part detectors improves performance. See Fig. 3 (bottom) for more examples of our part discovery pipeline.

IV. EXPERIMENTS

A. Datasets and Implementation Details

We evaluate our algorithm on the Cars [13] fine-grained benchmark. We follow the standard evaluation procedure: all training and testing is performed on cropped images from the object bounding boxes. We report classification accuracy, i.e. the accuracy as averaged over the test examples. In our
experiments, no annotations except class labels and bounding boxes are used in training.

We use the implementation of CNNs from [25]. The first convolutional layer has 48 filters, with pooling regions applied with stride 2, and the second convolutional layer has 128 filters, with pooling regions applied every 6 pixels. Each of the three fully connected layers has 2048 units. All units are rectified linear units except for the fully connected layers, where we found that linear units work best. We resize each image to a fixed size and use various techniques for CNNs to prevent overfitting, including dropout, color perturbation, random rotations, and sampling subwindows as described in [25]. This architecture was determined by extensive experiments using a validation set drawn from the training set, as the network used in [25] suffers from substantial overfitting on fine-grained datasets, which typically have over 100x less training data than ILSVRC 2012 [39], the dataset used to train [25]. At test time, we average predictions over the four corners, middle patch, and their horizontal flips.

For part discovery, each aligned set consists of 1 query image and 49 nearest neighbors. To generate the candidate parts from the aligned images, we randomly sample 5000 patches and pick the top 10 with the highest HOG energy, measured across images. The threshold used for part detection is set low. While this low threshold results in part detections in nearly every image, even ones in which the part is not present, we have found that this improves performance, with the intuition that the small amount of signal we get by increasing the number of part detections is worth the corresponding increase in noise.

To train our final classifier, a linear SVM, we use as data ELLF features extracted on the original images as well as their horizontal flips. Note that this means that the classifier itself does not have access to some of the data deformations, namely the color perturbations, random rotations, and subwindow sampling. It also does not use dropout for regularization. Although in principle one could train the SVM with these deformations, part detection remains a relatively costly operation compared to extracting CNN features, which makes computing ELLF features on many deformations expensive. At test time we average predictions over each test image and its horizontal flip to produce the final classification.

B. Results and Analysis

Table I reports our results compared with prior work. ELLF beats the previous state of the art, LLC [22], as well as BB [34] and BB-3D-G [13], two works designed for fine-grained recognition. This validates the claim that learning discriminative features and using them with parts discovered automatically is an effective strategy for fine-grained recognition. In the following sections we present more analysis.

1) Feature Learning: A plain CNN (70.5%) already outperforms previous work using traditional features such as SIFT or HOG (BB, BB-3D-G, and LLC). This is without extra unlabeled data or any kind of pre-training. It suggests that feature learning is able to generate rich appearance descriptors that adapt to particular categories, even with our limited amount of data.

TABLE I. MAIN RESULTS ON CLASSIFYING CARS. BB, BB-3D-G, AND LLC RESULTS AS REPORTED IN [13].

Method             Accuracy (%)
BB [34]            63.6
BB-3D-G [13]       67.6
LLC [22]           69.5
CNN-SPM (small)    67.9
CNN-SPM (large)    69.3
CNN                70.5
ELLF (ours)        73.9

2) Usefulness of Parts: We next perform a control experiment that showcases the key benefit of our ELLF representation: the same segments of the representations from two images refer to appearances of the same part. To verify this intuition, we replace our detected parts with SPM grids, where the same segments of the representations refer to the same image location instead of the same part. CNN-SPM (small) and CNN-SPM (large) in Table I report the results of this control experiment. CNN-SPM (small) uses 1x1, 2x2, and 4x4 spatial pooling regions, and CNN-SPM (large) in addition uses 8x8 regions, both building on top of our CNN features. The ELLF representation outperforms both standard and very high-dimensional SPM representations, demonstrating that it is indeed helpful to enable comparing appearances at corresponding parts and that the gains from using ELLF are not simply due to feature dimension. Note that the performance of both CNN-SPM (small) and CNN-SPM (large) is below that of a standard CNN, which is due to the lower number of deformations the SVM classifier is trained on and the fact that it is not trained using dropout (see Sec. IV-A for details).

3) Consistency of Parts: To validate that the discovered parts generalize beyond our training data, we show a sample of parts and their top ten detections on the test set in Fig. 7. The top two rows show that discovered parts tend to fire rather consistently, even under mild changes in viewpoint. Examples of failure cases are given in the last row, in which 180-degree rotations of cars cause the part detectors to misfire on patches which, although locally very similar to the target part in appearance and position, are nonetheless different parts.

4) Number of Parts: We also investigate how much part discovery contributes to performance. Fig. 4 plots the classification accuracy versus the number of part detectors discovered. It also plots the performance of directly using full CNN models without part discovery (CNN). Performance increases with the number of parts until it saturates at around 1000 parts discovered. Remarkably, 100 part detectors are sufficient to significantly improve on the standalone CNN model. This shows that part discovery is an essential component of our representation.

Fig. 4. Accuracy vs. number of parts (x-axis: Num. Parts; y-axis: Accuracy; series: ELLF, CNN). Car classification accuracy versus the number of parts used in our ELLF representation. Results achieved using CNN alone are also included (green line).

5) ELLF vs. CNN predictions: In Fig. 5 we show a sampling of images where our method was correct and a standard CNN was incorrect, with the parts that contributed most toward the decision value of the correct class displayed. Our part detectors fire on a diverse range of parts, whereas a CNN is confined to a fixed pooling grid. In Fig. 6 we show example failure modes of ELLF, where the CNN was correct but ELLF was not. One disadvantage of ELLF's reliance on GrabCut for segmentation is that part detection suffers when the segmentation is flawed, hurting our recognition performance.

6) Confusing Classes: In Fig. 8 we show the pairs of classes most confused with one another. The confusion score between a pair of classes $C, D$ is determined by

$$\mathrm{Conf}(C, D) = \frac{M_{C,D}}{\sum_i M_{C,i}} + \frac{M_{D,C}}{\sum_i M_{D,i}} \tag{3}$$

where $M_{a,b}$ is the number of images with ground truth label $a$ predicted as class $b$. These pairs tend to consist of different models or years within the same make, except for the classes Chevrolet Express Cargo Van 2007 vs. GMC Savana Van 2012, in which case the difference in make is visually represented only by the small logo on the front of the van.

7) Most Useful Parts: Which parts, discovered in an unsupervised fashion, are most useful for discrimination across all 196 categories? To measure this, we sum the absolute value of the learned classifier weights across classes and dimensions for each part to produce an importance score for each part. The top 10 parts are shown in Fig. 9. These parts are relatively large and thus can give information about the overall shape or type of car, e.g. whether the car is a sedan, SUV, or coupe. We also observe that these parts tend to occur in the top half of the automobiles, almost never overlapping with the tires. This agrees with the intuition that tires are not useful regions to look at when discriminating cars.

C. Limitations

Although it is a step in the right direction, ELLF is far from perfect. First, performance can be improved by pre-training the CNN on ImageNet [25], [29], especially for smaller datasets like the Cars dataset [13] used in this paper. However, such constraints will be alleviated as fine-grained datasets continue to grow in size. Second, in our implementation, part detection takes significantly more time than extracting CNN features. In order to increase the practicality of ELLF, part detection needs to be sped up. Finally, instead of extracting CNN features, disjointly learning parts, and then learning a classifier (an SVM), jointly learning parts with features would likely bring improvements to both part discovery and feature learning. This method would also enable us to further take advantage of data deformations, since the cost of part detection makes it impractical to train our classifier on a significant number of deformations (see Sec. IV-A). Although the advantages of jointly learning features and parts are clear, training such a non-rigid neural network efficiently (i.e. on a GPU) poses many implementation challenges: learning movable parts and pooling regions is an open problem in the design and implementation of neural networks.

V. DISCUSSION AND FUTURE WORK

In this paper we have proposed an approach for fine-grained recognition that tackles both feature learning and part discovery. Our main results are 1) that learning discriminative features in a supervised setting can be effective for fine-grained recognition, even at the small scales present in current fine-grained datasets, and 2) that one can learn parts useful for recognition without any part-level annotations. In the future we would like to combine part discovery and feature learning into a joint model and lift our part representation into 3D, which should yield more accurate correspondences.

ACKNOWLEDGMENTS

This work was partially supported by an ONR MURI grant and the Yahoo! FREP program.

REFERENCES

[1] A. B. Hillel and D. Weinshall, “Subordinate class recognition using relational object models,” 2007.
[2] R. Farrell, O. Oza, N. Zhang, V. I. Morariu, T. Darrell, and L. S. Davis, “Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance,” 2011.
[3] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, “Cats and dogs,” 2012.
[4] T. Berg and P. N. Belhumeur, “POOF: Part-Based One-vs-One Features for fine-grained categorization, face verification, and attribute estimation,” 2013.
[5] S. Yang, L. Bo, J. Wang, and L. Shapiro, “Unsupervised template learning for fine-grained object recognition,” 2012.
[6] B. Yao, G. Bradski, and L. Fei-Fei, “A codebook-free and annotation-free approach for fine-grained image categorization,” 2012.
[7] K. Duan, D. Parikh, D. Crandall, and K. Grauman, “Discovering localized attributes for fine-grained recognition,” 2012.
[8] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 dataset,” California Institute of Technology, Tech. Rep. CNS-TR-2011-001, 2011.
[9] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, “Caltech-UCSD Birds 200,” California Institute of Technology, Tech. Rep. CNS-TR-2010-001, 2010.
[10] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, “Novel dataset for fine-grained image categorization,” 2011.
[11] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” Tech. Rep., 2013.
[12] M. Stark, J. Krause, B. Pepik, D. Meger, J. J. Little, B. Schiele, and D. Koller, “Fine-grained categorization for 3d scene understanding,” 2012.
[13] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” 2013.
[14] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” 2004.

Fig. 5. Example images where ELLF was correct and a standard CNN was incorrect. On each image the five parts for our method that contributed most to a correct classification are shown. Incorrect predictions are colored gray and in italics.
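The per-part contributions visualized in Fig. 5 can be illustrated with a short sketch. It builds an ELLF vector by concatenating part descriptors (zeros for undetected parts) and splits a linear classifier's decision value into one term per part. This is only a minimal illustration under stated assumptions: every part descriptor has the same length `part_dim`, `weights` stands for the linear SVM weight vector of the correct class, and the function names `ellf_vector`, `part_contributions`, and `top_parts` are hypothetical rather than taken from the authors' code.

```python
from typing import List, Optional, Sequence


def ellf_vector(part_descriptors: Sequence[Optional[Sequence[float]]],
                part_dim: int) -> List[float]:
    """Concatenate part descriptors; an undetected part (None) becomes zeros."""
    vec: List[float] = []
    for desc in part_descriptors:
        if desc is None:
            # Undetected part: the classifier receives no information from it.
            vec.extend([0.0] * part_dim)
        else:
            vec.extend(float(v) for v in desc)
    return vec


def part_contributions(weights: Sequence[float],
                       ellf: Sequence[float],
                       part_dim: int) -> List[float]:
    """Split the linear score w . x into one dot product per part block."""
    contributions = []
    for start in range(0, len(ellf), part_dim):
        block_w = weights[start:start + part_dim]
        block_x = ellf[start:start + part_dim]
        contributions.append(sum(w * x for w, x in zip(block_w, block_x)))
    return contributions


def top_parts(weights: Sequence[float], ellf: Sequence[float],
              part_dim: int, k: int = 5) -> List[int]:
    """Indices of the k parts that add the most to the decision value."""
    contrib = part_contributions(weights, ellf, part_dim)
    # sorted() is stable, so ties keep their original part order.
    return sorted(range(len(contrib)), key=lambda i: -contrib[i])[:k]


if __name__ == "__main__":
    # Three toy parts with 2-D descriptors; the second part was not detected.
    x = ellf_vector([[1.0, 2.0], None, [0.5, 0.5]], part_dim=2)
    w = [0.5, 0.5, 1.0, 1.0, 2.0, -1.0]  # assumed weights for the correct class
    print(part_contributions(w, x, 2))
    print(top_parts(w, x, 2, k=2))
```

Because the classifier is linear, the per-part terms sum exactly to the decision value (up to the bias), which is what makes a ranking of parts by contribution well defined.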
Fig. 6. Example failure cases of ELLF. Incorrect predictions are colored gray and in italics. Part detection for ELLF suffers when GrabCut produces an incorrect segmentation, either by segmenting out too much of the target car or by keeping too much of the background.

[15] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” 2005.
[16] L. Bourdev and J. Malik, “Poselets: Body part detectors trained using 3d human pose annotations,” 2009.
[17] P. F. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part based models,” PAMI, 2010.
[18] N. Zhang, R. Farrell, and T. Darrell, “Pose pooling kernels for sub-category recognition,” 2012.
[19] J. Liu, A. Kanazawa, D. Jacobs, and P. Belhumeur, “Dog breed classification using part localization,” 2012.
[20] E. Gavves, B. Fernando, C. Snoek, A. Smeulders, and T. Tuytelaars, “Fine-grained categorization by alignments,” 2013.
[21] Y. Chai, E. Rahtu, V. Lempitsky, L. Van Gool, and A. Zisserman, “Tricos: a tri-level class-discriminative co-segmentation method for image classification,” 2012.
[22] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, “Locality-constrained linear coding for image classification,” 2010.
[23] F. Perronnin and C. Dance, “Fisher kernels on visual vocabularies for image categorization,” 2007.
[24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[25] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” 2012.
[26] G. B. Huang, H. Lee, and E. Learned-Miller, “Learning hierarchical representations for face verification with convolutional deep belief networks,” 2012.
[27] G. B. Huang, M. Mattar, H. Lee, and E. G. Learned-Miller, “Learning to align from scratch,” 2012.
Fig. 7. A sample of parts (column “Part”) and the ten test detections with the highest response (column “Top Test Detections”). Each part is visualized on top of the seed image of the neighborhood which produced the trained part. The first two rows are success cases: the part detectors fire consistently on the same parts of the car, even in the presence of some viewpoint variation. The last row shows failure cases: since each part detector is based on local evidence, when different parts have the same appearance and occur in the same position in the image plane, as can occur when a car undergoes a 180-degree rotation, the part detectors misfire.

Fig. 8. The five most confusing pairs of classes for ELLF in the Cars dataset, in descending order of confusion as determined by Eq. 3. We observe that these pairs of classes differ only in very small details. The pairs shown are:
Chevrolet Express Cargo Van 2007 vs. Chevrolet Express Van 2007
Chevrolet Express Cargo Van 2007 vs. GMC Savana Van 2012
Chevrolet Silverado 1500 Hybrid Crew Cab 2012 vs. Chevrolet Silverado 1500 Extended Cab 2012
Audi A5 Coupe 2012 vs. Audi S5 Coupe 2012
Aston Martin V8 Vantage Coupe 2012 vs. Aston Martin V8 Vantage Convertible 2012
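The ranking of confused class pairs referred to in the Fig. 8 caption can be sketched in a few lines. The code below assumes that the confusion score of a pair is the sum of the two row-normalized off-diagonal entries of a confusion matrix, where entry `counts[a][b]` is the number of test images with ground truth label `a` predicted as class `b`; this is a reading of the score defined in Sec. IV-B, not the authors' code. The names `confusion_score` and `most_confused_pairs` are hypothetical, and every class is assumed to have at least one test image.

```python
from typing import List, Sequence, Tuple


def confusion_score(counts: Sequence[Sequence[int]], c: int, d: int) -> float:
    """Fraction of class-c images predicted as d plus fraction of d predicted as c."""
    total_c = sum(counts[c])  # number of test images with ground truth c
    total_d = sum(counts[d])  # number of test images with ground truth d
    return counts[c][d] / total_c + counts[d][c] / total_d


def most_confused_pairs(counts: Sequence[Sequence[int]],
                        k: int = 5) -> List[Tuple[int, int]]:
    """Return the k unordered class pairs with the highest confusion score."""
    n = len(counts)
    scored = [(confusion_score(counts, c, d), c, d)
              for c in range(n) for d in range(c + 1, n)]
    # Sort by descending score; the sort is stable, so ties keep index order.
    scored.sort(key=lambda item: -item[0])
    return [(c, d) for _, c, d in scored[:k]]


if __name__ == "__main__":
    # Toy 3-class confusion matrix: rows are ground truth, columns are predictions.
    counts = [
        [8, 2, 0],
        [3, 7, 0],
        [0, 1, 9],
    ]
    print(confusion_score(counts, 0, 1))
    print(most_confused_pairs(counts, k=2))
```

The score is symmetric in the two classes, so each unordered pair is scored once; normalizing by row totals keeps classes with many test images from dominating the ranking.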
Fig. 9. The most useful parts for overall car classification. These parts tend to be large, giving information about the general type of car (SUV, sedan, etc.).

[28] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” arXiv preprint arXiv:1310.1531, 2013.
[29] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” 2009.
[30] M.-E. Nilsback and A. Zisserman, “Delving into the whorl of flower segmentation,” 2007.
[31] A. Angelova and S. Zhu, “Efficient object detection and segmentation for fine-grained recognition,” 2013.
[32] Y. Chai, V. Lempitsky, and A. Zisserman, “Symbiotic segmentation and part localization for fine-grained categorization,” 2013.
[33] C. Wah, S. Branson, P. Perona, and S. Belongie, “Multiclass recognition and part localization with humans in the loop,” 2011.
[34] J. Deng, J. Krause, and L. Fei-Fei, “Fine-grained crowdsourcing for fine-grained recognition,” 2013.
[35] D. Parikh and K. Grauman, “Interactively building a discriminative vocabulary of nameable attributes,” 2011.
[36] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie, “Visual recognition with humans in the loop.”
[37] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories.”
[38] C. Rother, V. Kolmogorov, and A. Blake, “Grabcut: Interactive foreground extraction using iterated graph cuts,” in ACM Transactions on Graphics (TOG), vol. 23, no. 3. ACM, 2004, pp. 309–314.
[39] J. Deng, A. Berg, S. Satheesh, H. Su, A. Khosla, and L. Fei-Fei, “Large scale visual recognition challenge,” www.image-net.org/challenges/LSVRC/2012