Abstract. In this paper we study the problem of fine-grained image categorization. The goal of our method is to explore fine image statistics and identify the discriminative image patches for recognition. We achieve this goal by combining discriminative feature mining with randomization.
Combining Randomization and Discrimination for Fine-Grained Image Categorization
Bangpeng Yao*, Aditya Khosla*, Li Fei-Fei
Computer Science Department, Stanford University, Stanford, CA
*Bangpeng Yao and Aditya Khosla contributed equally to this paper.

In this paper, we study the problem of fine-grained image categorization. The goal of our method is to explore fine image statistics and identify the discriminative image patches for recognition. Our method significantly improves the strength of the decision trees in the random forest while still maintaining low correlation between the trees. This allows our method to achieve low generalization error according to the theory of random forests [3].

We evaluate our method on two fine-grained categorization tasks: human activity recognition in still images [7, 22] and subordinate categorization of closely related animal species [21], outperforming state-of-the-art results. Furthermore, our method identifies semantically meaningful image patches that closely match human intuition. Additionally, our method tends to automatically generate a coarse-to-fine structure of discriminative image regions, which parallels the human visual system [4].

The remaining part of this paper is organized as follows: Sec. 2 discusses related work. Sec. 3 describes our dense feature space and Sec. 4 describes our algorithm for mining this space. Experimental results are discussed in Sec. 5, and Sec. 6 concludes the paper.

2. Related Work
Image classification has been studied for many years. Most of the existing work focuses on basic-level categorization such as objects [8, 10] or scenes [9, 16]. In this paper we focus on fine-grained image categorization [11, 12], which requires an approach that captures the fine and detailed information in images.

In this work, we explore a dense feature representation to distinguish fine-grained image classes. Our previous work has shown the advantage of dense features (Grouplet features [22]) in classifying human activities. Instead of using the generative local features as in Grouplet, here we consider a richer feature space in a discriminative setting where both local and global visual information are fused together. Inspired by [6], our approach also considers pairwise interactions between image regions.

We use a random forest framework to identify discriminative image regions. Random forests have been used successfully in many vision tasks such as object detection, segmentation [17], and codebook learning [15]. Inspired by [6], we combine discriminative training and randomization to obtain an effective classifier with good generalizability. Our method differs from [6] in that for each tree node, we train an SVM classifier from one of the randomly sampled image regions, instead of using AdaBoost to combine weak features from a fixed set of regions. This allows us to explore an extremely large feature set efficiently.

A classical image classification framework [8] consists of feature extraction, coding, and pooling. Feature extraction (e.g. SIFT [14]) and better coding methods [20] have been extensively studied for object recognition. In this work, we use discriminative feature mining and randomization to propose a new feature exploration approach, and demonstrate its effectiveness on fine-grained image categorization tasks.

Figure 2. (a) Illustration of our dense sampling space. We densely sample rectangular image patches with varying widths and heights. The regions are closely located and have significant overlaps. The red dots denote the centers of the patches, and the arrows indicate the increment of the patch width or height. (The actual density of regions considered in our algorithm is significantly higher; this figure has been simplified for visual clarity.) We note that the regions considered by Spatial Pyramid Matching [13] are a very small subset lying along the diagonal of the height-width plane that we consider. (b) Illustration of some image patches that may be discriminative for playing-guitar. All those patches can be sampled from our dense sampling space.

3. Dense Sampling Space
Our algorithm aims to identify fine image statistics that are useful for fine-grained categorization. For example, in order to classify whether a human is playing a guitar or holding a guitar without playing it, we want to use the image patches below the human face that are closely related to the human-guitar interaction (Fig. 2(b)). An algorithm that can reliably locate such regions is expected to achieve high classification accuracy. We achieve this goal by searching over rectangular image patches of arbitrary width, height, and image location. We refer to this extensive set of image regions as the dense sampling space, as shown in Fig. 2(a). Furthermore, to capture more discriminative distinctions, we consider interactions between pairs of arbitrary patches. The pairwise interactions are modeled by applying concatenation, absolute of difference, or intersection between the feature representations of two image patches.

However, the dense sampling space is very large. Sampling image patches of a single fixed size every four pixels already leads to thousands of patches per image. This increases many-fold when considering regions with arbitrary widths and heights. Further considering pairwise interactions of image patches effectively leads to trillions of features for each image. In addition, there is much noise and redundancy in this feature set. On the one hand, many image patches are not discriminative for distinguishing different image classes. On the other hand, the image patches are highly overlapped in the dense sampling space, which introduces significant redundancy among these features. Therefore, it is challenging to explore this high-dimensional, noisy, and redundant feature space. In this work, we address this issue using randomization.

4. Random Forest with Discriminative Decision Trees
In order to explore the dense sampling feature space for fine-grained visual categorization, we combine two concepts: (1) discriminative training to extract the information in the image patches effectively; (2) randomization to explore the dense feature space efficiently. Specifically, we adopt a random forest [3] framework where each tree node is a discriminative classifier that is trained on one or a pair of image patches. In our setting, the discriminative training and randomization can benefit from each other. We summarize the advantages of our method below:
- The random forest framework allows us to consider a subset of the image regions at a time, which allows us to explore the dense sampling space efficiently in a principled way.
- The random forest selects the best image patch in each node, and therefore it can remove the noise-prone image patches and reduce the redundancy in the feature set.
- By using discriminative classifiers to train the tree nodes, our random forest has much stronger decision trees with small correlation. This allows our method to have low generalization error (Sec. 4.4) compared with the traditional random forest [3], which uses weak classifiers in the tree nodes.

An overview of the random forest framework we use is shown in Algorithm 1. In the following sections, we first describe this framework (Sec. 4.1). Then we elaborate on our feature sampling (Sec. 4.2) and split learning (Sec. 4.3) strategies in detail, and describe the generalization theory [3] of random forests which guarantees the effectiveness of our algorithm (Sec. 4.4).

for each tree:
  - Obtain a random set of training examples.
  - while a leaf node needs to split:
      i.   Randomly sample the candidate (pairs of) image regions (Sec. 4.2).
      ii.  Select the best region and learn a binary split of the data (Sec. 4.3).
      iii. Split the training examples into two sets for the current leaf node.
Algorithm 1. Overview of the process of growing decision trees in the random forest framework.

4.1. The Random Forest Framework
A random forest is a multi-class classifier consisting of an ensemble of decision trees where each tree is constructed via some randomization. As illustrated in Fig. 3, the leaf nodes of each tree encode a distribution over the image classes. All internal nodes contain a binary test that splits the data and sends the splits to its children nodes. The splitting is stopped when a leaf node is encountered. An image is classified by descending each tree and combining the leaf distributions from all the trees. This method allows the flexibility to explore a large feature space effectively, because it only considers a subset of features in every tree node.

Figure 3. Comparison of (a) conventional random decision trees, which use weak classifiers, with (b) our discriminative decision trees, which use strong classifiers. Solid blue arrows show binary splits of the data. Dotted lines from the shaded image regions indicate the region used at each node. Conventional decision trees use information from the entire image at each node, which encodes no spatial or structural information, while our decision trees sample single or multiple image regions from the dense sampling space (Sec. 3). The histograms below the leaf nodes illustrate the posterior probability distributions. In (b), dotted red arrows between nodes show our nested tree structure that allows information to flow in a top-down manner. Our approach uses strong classifiers in each node (Sec. 4.3), while the conventional method uses weak classifiers.

Each tree returns the posterior probability of an example belonging to the given classes. The posterior probability of a particular class at each leaf node is learned as the proportion of the training images belonging to that class at the given leaf node. The posterior probability of class c at leaf l of tree t is denoted as p_{t,l}(c). Thus, a test image can be classified by averaging the posterior probability from the leaf node of each tree:

    c* = argmax_c (1/T) * sum_{t=1..T} p_{t,l_t}(c),

where c* is the predicted class label, T is the total number of trees, and l_t is the leaf node that the image falls into in tree t. In the following sections, we describe the process of obtaining these posterior distributions using our algorithm. Readers can refer to previous works [3] for more details of the conventional decision tree learning procedure.

4.2. Sampling the Dense Feature Space
As shown in Fig. 3(b), each internal node in our decision tree corresponds to a single or a pair of rectangular image regions that are sampled from the dense sampling space (Sec. 3), where the regions can have many possible widths, heights, and image locations. In order to sample a candidate image region, we first normalize all images to unit width and height, and then randomly sample coordinates (x1, y1, x2, y2) from a uniform distribution. These coordinates specify two diagonally opposite vertices of a rectangular region. Such regions could correspond to small areas of the image (e.g. the purple bounding boxes in Fig. 3) or even the complete image. This allows our method to capture both global and local information in the image.

In our approach, each sampled image region is represented by a histogram of visual descriptors. For a pair of regions, the feature representation is formed by applying histogram operations (e.g. concatenation, intersection, etc.) to the histograms obtained from both regions. Furthermore, the features are augmented with the decision value (described in Sec. 4.3) of this image from its parent node (indicated by the dashed red lines in Fig. 3). Therefore, our feature representation combines the information of all upstream tree nodes that the corresponding image has descended from. We refer to this idea as nesting. Using feature sampling and nesting, we obtain a candidate feature set corresponding to a candidate image region of the current node.

Implementation details. Our method is flexible to use many different visual descriptors. In this work, we densely extract SIFT [14] descriptors on each image with a spacing of four pixels. The scales of the grids used to extract descriptors are 8, 12, 16, 24, and 30. Using k-means clustering, we construct a vocabulary of codewords (dictionary sizes of 1024, 256, and 256 for the PASCAL action [7], PPMI [22], and Caltech-UCSD Birds [21] datasets, respectively). Then, we use Locality-constrained Linear Coding [20] to assign the descriptors to codewords. A bag-of-words histogram representation is used if the area of the patch is smaller than 0.2, while a 2-level or 3-level spatial pyramid is used if the area is between 0.2 and 0.8 or larger than 0.8, respectively.

During sampling (step i of Algorithm 1), we consider four settings of image patches: a single image patch and three types of pairwise interactions (concatenation, intersection, and absolute of difference of the two histograms). We sample 25 and 50 image regions (or pairs of regions) in the root node and the first-level nodes respectively, and sample 100 regions (or pairs of regions) in all other nodes. Sampling a smaller number of image patches in the root can reduce the correlation between the resulting trees.

4.3. Learning the Splits
In this section, we describe the process of learning the binary splits of the data using SVM (step ii in Algorithm 1). This is achieved in two steps: (1) randomly assigning all examples from each class to a binary label; (2) using SVM to learn a binary split of the data.

Assume that we have N classes of images at a given node. We uniformly sample N binary variables b_i in {+1, -1}, and assign all examples of a particular class i the class label b_i. As each node performs a binary split of the data, this allows us to learn a simple binary SVM at each node. This improves the scalability of our method to a large number of classes and results in well-balanced trees. Using the feature f of an image region (or pair of regions) as described in Sec. 4.2, we find a binary split of the data:

    if w^T f >= 0, go to left child; otherwise, go to right child,

where w is the set of weights learned from a linear SVM.

We evaluate each binary split that corresponds to an image region or pair of regions with the information gain criteria [3], which is computed from the complete set of training images that fall at the current tree node. The splits that maximize the information gain are selected, and the splitting process (step iii in Algorithm 1) is repeated with the new splits of the data. The tree splitting stops if a pre-specified maximum tree depth has been reached, or the information gain of the current node falls below a threshold, or the number of samples in the current node is small.

4.4. Generalization Error of Random Forests
In [3], it has been shown that an upper bound for the generalization error of a random forest is given by

    GE* <= rho_bar * (1 - s^2) / s^2,

where s is the strength of the decision trees in the forest, and rho_bar is the correlation between the trees. Therefore, the generalization error of a random forest can be reduced by making the decision trees stronger or by reducing the correlation between the trees.

In our approach, we learn discriminative SVM classifiers for the tree nodes. Therefore, compared to the traditional random forests where the tree nodes are weak classifiers of randomly generated feature weights [3], our decision trees are much stronger. Furthermore, since we are considering an extremely dense feature space, each decision tree only considers a relatively small subset of image patches. This means there is little correlation between the trees. Therefore, our random forest with discriminative decision trees can achieve very good performance on fine-grained image classification, where exploring fine image statistics discriminatively is important. In Sec. 5.4 we show the strength and correlation of different settings of random forests with respect to the number of decision trees, which justifies the above arguments. Please refer to [3] for details about how to compute the strength and correlation values for a random forest.

Method       | Phoning | Playing instrument | Reading | Riding bike | Riding horse | Running | Taking photo | Using computer | Walking | Overall
CVC-BASE     | 56.2    | 56.5               | 34.7    | 75.1        | 83.6         | 86.5    | 25.4         | 60.0           | 69.2    | 60.8
CVC-SEL      | 49.8    | 52.8               | 34.3    | 74.2        | 85.5         | 85.1    | 24.9         | 64.1           | 72.5    | 60.4
SURREY-KDA   | 52.6    | 53.5               | 35.9    | 81.0        | 89.3         | 86.5    | 32.8         | 59.2           | 68.6    | 62.2
UCLEAR-DOSP  | 47.0    | 57.8               | 26.9    | 78.8        | 89.7         | 87.3    | 32.5         | 60.0           | 70.1    | 61.1
UMCO-KSVM    | 53.5    | 43.0               | 32.0    | 67.9        | 68.8         | 83.0    | 34.1         | 45.9           | 60.4    | 54.3
Our Method   | 45.0    | 57.4               | 41.5    | 81.8        | 90.5         | 89.5    | 37.9         | 65.0           | 72.7    | 64.6

Table 1. Comparison of the average precision (%) of our method with the winners of the PASCAL VOC 2010 action classification challenge [7]. Each row shows the results obtained from one method. The best results are highlighted with bold fonts.

5. Experiments
In this section, we evaluate our algorithm on three fine-grained image datasets: the action classification dataset in PASCAL VOC 2010 [7] (Sec. 5.1), actions of people-playing-musical-instrument (PPMI) [22] (Sec. 5.2), and a subordinate object categorization dataset of 200 bird species [21] (Sec. 5.3). Experimental results show that our algorithm outperforms state-of-the-art methods on these datasets. We also evaluate the strength and correlation of the decision trees in our method, and compare the result with other settings of random forests to show why our method leads to better classification performance (Sec. 5.4).

5.1. PASCAL Action Classification
The most recent PASCAL VOC challenge [7] incorporated the task of recognizing actions in still images. The images describe nine common human activities: Phoning, Playing a musical instrument, Reading, Riding a bicycle or motorcycle, Riding a horse, Running, Taking a photograph, Using a computer, and Walking. Each person that we need to classify is indicated by a bounding box and is annotated with one of the nine actions they are performing. There are 4090 training/validation images and a similar number of testing images.

We obtain a foreground image for each person by extending the bounding box of the person beyond its original size, and resizing it such that the larger dimension is 300 pixels. We also resize the original image accordingly. Therefore, for each person, we have a person image as well as a background image.
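As a concrete illustration of the preprocessing above, the box extension and rescaling can be sketched as follows. This is a minimal sketch in pure Python; the `context` extension factor is a hypothetical stand-in, since the exact amount of context added around the annotated box is not specified here.

```python
def person_and_background(image_size, person_box, context=0.5, target=300):
    """Sketch of the Sec. 5.1 preprocessing: extend the annotated person
    bounding box, clip it to the image, and compute the scale that makes
    the larger dimension of the person crop equal to `target` pixels.
    NOTE: `context` (how far the box is extended on each side, as a
    fraction of its size) is a hypothetical parameter."""
    w, h = image_size
    x1, y1, x2, y2 = person_box
    bw, bh = x2 - x1, y2 - y1
    # Extend the box on every side, clipped to the image bounds.
    ex1, ey1 = max(0, x1 - context * bw), max(0, y1 - context * bh)
    ex2, ey2 = min(w, x2 + context * bw), min(h, y2 + context * bh)
    # Resize so the larger dimension of the person crop is `target` px;
    # the full (background) image is resized by the same factor.
    scale = target / max(ex2 - ex1, ey2 - ey1)
    person_size = ((ex2 - ex1) * scale, (ey2 - ey1) * scale)
    background_size = (w * scale, h * scale)
    return person_size, background_size
```

Both the person crop and the rescaled full image are then described with the same dense features, so foreground and background information can be combined downstream.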
Figure 4. Heatmaps that show the distributions of the frequency with which an image patch is selected by our method. The heatmaps are obtained by aggregating the image regions of all the tree nodes in the random forest, weighted by the probability of the corresponding class. Red indicates high frequency and blue indicates low frequency.

We only sample regions from the foreground and concatenate the features with a 2-level spatial pyramid of the background. We use 100 decision trees in our random forest.

We compare our algorithm with the methods in the PASCAL challenge [7] that achieve the best average precision. The results are shown in Tbl. 1. Our method outperforms the others in terms of mean average precision, and achieves the best result on seven of the nine actions. Note that we achieved this accuracy based on only grayscale SIFT descriptors, without using any other features or contextual information such as object detectors.

Fig. 4 shows the frequency of an image patch being selected by our method. For each activity, the figure is obtained by considering the features selected in the tree nodes, weighted by the proportion of samples of this activity in each node. From the results, we can clearly see the difference of distributions for different activities. For example, the image patches corresponding to human-object interactions are usually highlighted, such as the patches of bikes and books. We can also see that the image patches corresponding to background are not frequently selected. This demonstrates our algorithm's ability to deal with background clutter.

5.2. People-Playing-Musical-Instrument (PPMI)
The people-playing-musical-instrument (PPMI) dataset is introduced in [22]. This dataset puts emphasis on understanding subtle interactions between humans and objects. There are twelve musical instruments; for each instrument there are images of people playing the instrument and holding the instrument but not playing it. We evaluate the performance of our method with 100 decision trees on the 24-class classification problem, and compare it with many baseline results (available from the dataset website). Tbl. 2 shows that we significantly outperform the baseline results.

Method  | BoW  | Grouplet [22] | SPM [13] | LLC [20] | Ours
mAP (%) | 22.7 | 36.7          | 39.1     | 41.8     | 47.0

Table 2. Mean Average Precision (% mAP) on the 24-class classification problem of the PPMI dataset. The best result is highlighted with bold fonts. The Grouplet uses one SIFT scale, while all the other methods use the multiple SIFT scales described in Sec. 4.2.

Instrument  | BoW  | Grouplet [22] | SPM [13] | LLC [20] | Ours
Bassoon     | 73.6 | 78.5          | 84.6     | 85.0     | 86.2
Erhu        | 82.2 | 87.6          | 88.0     | 89.5     | 89.8
Flute       | 86.3 | 95.7          | 95.3     | 97.3     | 98.6
French Horn | 79.0 | 84.0          | 93.2     | 93.6     | 97.3
Guitar      | 85.1 | 87.7          | 93.7     | 92.4     | 93.0
Saxophone   | 84.4 | 87.7          | 89.5     | 88.2     | 92.4
Violin      | 80.6 | 93.0          | 93.4     | 96.3     | 95.7
Trumpet     | 69.3 | 76.3          | 82.5     | 86.7     | 90.0
Cello       | 77.3 | 84.6          | 85.7     | 82.3     | 86.7
Clarinet    | 70.5 | 82.3          | 82.7     | 84.8     | 90.4
Harp        | 75.0 | 87.1          | 92.1     | 93.9     | 92.8
Recorder    | 73.0 | 76.5          | 78.0     | 79.1     | 92.8
Average     | 78.0 | 85.1          | 88.2     | 89.2     | 92.1

Table 3. Comparison of mean Average Precision (% mAP) of our method and the baseline results on the PPMI binary classification tasks of people playing and holding different musical instruments. Each column shows the results obtained from one method. The best results are highlighted with bold fonts.

Tbl. 3 shows the result of our method on the 12 binary classification tasks, where each task involves distinguishing the activities of playing and not playing the same instrument. Despite a high baseline of 89.2% mAP, our method outperforms it to achieve 92.1% overall. Furthermore, we outperform the baseline methods on nine of the twelve binary classification tasks. In Fig. 5, we visualize the heatmap of the features learned for this task. We observe that they show semantically meaningful locations of where we would expect the discriminative regions of people playing different instruments to occur. For example, for flute, the region around the face provides important information, while for guitar, the region to the left of the torso provides more discriminative information. It is interesting to note that despite the randomization and the algorithm having no prior information, it is able to locate the region of interest reliably.

Figure 5. (a) Heatmap of the dominant regions of interest selected by our method for playing flute, on images of playing a flute (top row) and holding a flute without playing it (bottom row). (b, c) show similar images for guitar and violin, respectively. Refer to Fig. 4 for how the heatmaps are obtained.

Figure 6. Heatmap for the playing trumpet class, with the weighted average area of the selected image regions at each tree depth.

Furthermore, we also demonstrate that the method learns a coarse-to-fine region of interest for identification. This is similar to the human visual system, which is believed to analyze raw input in order from low to high spatial frequencies, or from large global shapes to smaller local ones [4]. Fig. 6 shows the heatmap of the area selected by our classifier as we consider different depths of the decision tree.

Figure 7. Each row represents visualizations for a single class of birds (from top to bottom): boat-tailed grackle, Brewer's sparrow, and golden-winged warbler. For each class, we visualize: (a) the heatmap for the given class as described in Fig. 4; (b, c) two example images of the corresponding class and the distribution of image patches selected for the specific image. These heatmaps are obtained by descending each tree for the corresponding image and only considering the image regions of the nodes that this image falls in.
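The heatmap aggregation described above (accumulating the regions chosen at tree nodes, weighted by the probability of the class of interest) can be sketched as follows. The flattened node list is an illustrative assumption for this sketch, not the authors' data structure.

```python
def selection_heatmap(nodes, grid=10):
    """Sketch of the heatmap aggregation used in Figs. 4, 5, and 7:
    accumulate, over a grid of normalized image coordinates, how often
    image patches are selected, weighted by class probability.
    `nodes` is a hypothetical flattened view of the forest: a list of
    ((x1, y1, x2, y2), weight) pairs, one per tree node, where the box
    is the patch chosen at that node (coordinates in [0, 1]) and the
    weight is the posterior of the class of interest at that node."""
    heat = [[0.0] * grid for _ in range(grid)]
    for (x1, y1, x2, y2), w in nodes:
        for r in range(grid):
            for c in range(grid):
                # Add the node's weight to every cell whose center
                # falls inside the selected patch.
                cx, cy = (c + 0.5) / grid, (r + 0.5) / grid
                if x1 <= cx <= x2 and y1 <= cy <= y2:
                    heat[r][c] += w
    return heat
```

For the per-image heatmaps of Fig. 7, only the nodes along the root-to-leaf path that the given image descends in each tree would be included in the node list.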
We observe that our random forest follows a similar coarse-to-fine structure: the average area of the selected patches reduces as the tree depth increases. This shows that the classifier first starts with more global, low-frequency features to discriminate between multiple classes, and finally zeros in on the specific discriminative regions for some particular classes.

5.3. Caltech-UCSD Birds 200 (CUB-200)
The Caltech-UCSD Birds (CUB-200) dataset contains 6,033 annotated images of 200 different bird species [21]. This dataset has been designed for subordinate image categorization. It is a very challenging dataset, as the different species are very closely related and have similar shape/color. There are around 30 images per class, with 15 for training and the remaining for testing. The test-train splits are fixed (provided on the website).

The images are cropped to the provided bounding box annotations. These regions are resized such that the smaller image dimension is 150 pixels. As color provides important discriminative information, we extract C-SIFT descriptors [19] in the same way as described in Sec. 4.2. We use 300 decision trees in our random forest.

Method   | MKL [2] | LLC [20] | Ours
Accuracy | 19.0%   | 18.0%    | 19.2%

Table 4. Comparison of the mean classification accuracy of our method and the baseline results on the Caltech-UCSD Birds 200 dataset. The best performance is indicated with bold fonts.

Tbl. 4 compares the performance of our algorithm against the LLC baseline and the state-of-the-art result (multiple kernel learning (MKL) [2]) on this dataset. Our method outperforms LLC and achieves comparable performance with the MKL approach. We note that [2] uses multiple features, e.g. geometric blur, gray/color SIFT, full image color histograms, etc. It is expected that including these features could further improve the performance of our method. Furthermore, we show in Fig. 7 that our method is able to capture the intra-class pose variations by focusing on different image regions for different images.

5.4. Strength and Correlation of Decision Trees
We compare our method against two control settings of random forests on the PASCAL action dataset [7]:
- Dense feature, weak classifier: For each image region or pair of regions sampled from our dense sampling space, replace the SVM classifier in our method with a weak classifier as in the conventional decision tree learning approach [3], i.e. randomly generate 100 sets of feature weights and select the best one.
- SPM feature, strong classifier: Use SVM classifiers to split the tree nodes as in our method, but limit the image regions to those from a 4-level spatial pyramid.
Note that all other settings of the above two approaches remain unchanged compared to our method (as described in Sec. 4.2).

Figure 8. (a) Mean average precision (mAP); (b) strength of the decision trees; (c) correlation between the decision trees. We compare the classification performance (mAP) obtained by our method (dense feature, strong classifier) with the two control settings; please refer to Sec. 5.4 for details of these settings. We also compare the strength of the decision trees learned by these approaches and the correlation between these trees (Sec. 4.4), which are highly related to the generalization error of random forests.

Fig. 8 shows that on this dataset, a set of strong classifiers with relatively high correlation can lead to better performance than a set of weak classifiers with low correlation. We can see that the performance of random forests can be significantly improved by using strong classifiers in the nodes of the decision trees. Compared to the random forests that only sample spatial pyramid regions, using the dense sampling space obtains stronger trees without significantly increasing the correlation between different trees, thereby improving the classification performance. Furthermore, the performance of the random forests using discriminative node classifiers converges with a small number of decision trees, indicating that our method is more efficient than the conventional random forest approach. In our experiment, the two control settings and our method need a similar amount of time to train a single decision tree.

Additionally, we show the effectiveness of the random binary assignment of class labels (Sec. 4.3) when we train classifiers for each tree node. Here we ignore this step and train a one-vs-all multi-class SVM for each sampled image region or pair of regions. In this case, N sets of weights are obtained when there are N classes of images at the current node. The best set of weights is selected using information gain as before. This setting leads to deeper and significantly unbalanced trees, and the performance decreases to 58.1% with 100 trees. Furthermore, it is highly inefficient, as it does not scale well with an increasing number of classes.

6. Conclusion
In this work, we propose a random forest with discriminative decision trees algorithm to explore a dense sampling space for fine-grained image categorization. Experimental results on subordinate classification and activity classification show that our method achieves state-of-the-art performance and discovers semantically meaningful information. Future work includes jointly training all the information that is obtained from the decision trees.

Acknowledgement. L.F-F. is partially supported by an NSF CAREER grant (IIS-0845230), an ONR MURI grant, the DARPA VIRAT program and the DARPA Mind's Eye program. B.Y. is partially supported by the SAP Stanford Graduate Fellowship. We would like to thank Carsten Rother and his colleagues in Microsoft Research Cambridge for helpful discussions about random forests. We also would like to thank Hao Su and Olga Russakovsky for their comments on the paper.

References
[1] A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests and ferns. 2007.
[2] S. Branson, C. Wah, B. Babenko, F. Schroff, P. Welinder, P. Perona, and S. Belongie. Visual recognition with humans in the loop. 2010.
[3] L. Breiman. Random forests. Mach. Learn., 45:5-32, 2001.
[4] C. A. Collin and P. A. McMullen. Subordinate-level categorization relies on high spatial frequencies to a greater degree than basic-level categorization. Percept. Psychophys., 67(2):354-364, 2005.
[5] V. Delaitre, I. Laptev, and J. Sivic. Recognizing human actions in still images: A study of bag-of-features and part-based representations. 2010.
[6] G. Duan, C. Huang, H. Ai, and S. Lao. Boosting associated pairing comparison features for pedestrian detection. In ICCV Workshop on Visual Surveillance, 2009.
[7] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results.
[8] L. Fei-Fei, R. Fergus, and A. Torralba. Recognizing and learning object categories. ICCV Tutorial, 2005.
[9] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. 2005.
[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE T. Pattern Anal., 32(9):1627-1645, 2010.
[11] A. B. Hillel and D. Weinshall. Subordinate class recognition using relational object models. 2007.
[12] K. E. Johnson and A. T. Eilers. Effects of knowledge and development on subordinate level categorization. Cognitive Dev., 13(4):515-545, 1998.
[13] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. 2006.
[14] D. G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91-110, 2004.
[15] F. Moosmann, B. Triggs, and F. Jurie. Fast discriminative visual codebooks using randomized clustering forests. 2007.
[16] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vision, 42(3):145-175, 2001.
[17] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. 2008.
[18] Z. Tu. Probabilistic boosting-tree: Learning discriminative models for classification, recognition, and clustering. 2005.
[19] K. E. A. van de Sande, T. Gevers, and C. G. M. Snoek. Evaluating color descriptors for object and scene recognition. IEEE T. Pattern Anal., 32(9):1582-1596, 2010.
[20] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. 2010.
[21] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-201, Caltech, 2010.
[22] B. Yao and L. Fei-Fei. Grouplet: A structured image representation for recognizing human and object interactions. 2010.