Nonparametric Scene Parsing via Label Transfer

Ce Liu, Member, IEEE, Jenny Yuen, Student Member, IEEE, and Antonio Torralba, Member, IEEE

IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 33, No. 12, December 2011
Abstract: While there has been a lot of recent work on object recognition and image understanding, the focus has been on carefully establishing mathematical models for images, scenes, and objects. In this paper, we propose a novel, nonparametric approach for object recognition and scene parsing using a new technology we name label transfer. For an input image, our system first retrieves its nearest neighbors from a large database containing fully annotated images.


Ce Liu received the BS degree in automation and the ME degree in pattern recognition from the Department of Automation, Tsinghua University, in 1999 and 2002, respectively. After receiving the PhD degree from the Massachusetts Institute of Technology in 2009, he now holds a researcher position at Microsoft Research New England. From 2002 to 2003, he worked at Microsoft Research Asia as an assistant researcher. His research interests include computer vision, computer graphics, and machine learning. He has published more than 20 papers in the top conferences and journals in these fields. He received a Microsoft Fellowship in 2005, the Outstanding Student Paper award at the Advances in Neural Information Processing Systems (NIPS) in 2006, a Xerox Fellowship in 2007, and the Best Student Paper award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2009. He is a member of the IEEE.

Jenny Yuen received the BS degree in computer engineering from the University of Washington in 2006, followed by the MS degree in computer science from the Massachusetts Institute of Technology in 2008. She is currently working toward the PhD degree in computer science at the Massachusetts Institute of Technology. She was awarded a National Defense Science and Engineering fellowship as well as a US National Science Foundation Fellowship in 2007. She received the Best Student Paper award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2009. She is a student member of the IEEE.

Antonio Torralba received the degree in telecommunications engineering from the Universidad Politecnica de Cataluna, Spain; he received the PhD degree in signal, image, and speech processing from the Institut National Polytechnique de Grenoble, France. Thereafter, he spent postdoctoral training at the Brain and Cognitive Science Department and the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT). He is an associate professor of electrical engineering and computer science in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. He is a member of the IEEE.

Scene parsing, or recognizing and segmenting objects in an image, is one of the core problems of computer vision. Traditional approaches to object recognition begin by specifying an object model, such as template matching [8], [49], constellations [13], [15], bags of features [19], [24], [44], [45], or shape models [2], [3], [14], etc. These approaches typically work with a fixed number of object categories and require training generative or discriminative models for each category from training data. In the parsing stage, these systems try to align the learned models to the input image and associate object category labels with pixels, windows, or other image representations.
Our system generates reasonable predictions for the pixels annotated as "unlabeled." The average pixel-wise recognition rate of our system is 76.67 percent by excluding the "unlabeled" class [43]. Some failure examples from our system are shown in Fig. 11, when the system fails to retrieve images with similar object categories to the query, or when the annotation is ambiguous. Overall, our system is able to predict the right object categories in the input image with a segmentation fit to image boundaries, even though the best match may look different from the input, e.g., examples 2, 11, 12, and 17. If we divide the object categories into stuff and things [1], [22], our system generates much better results for stuff than for things. The recognition rate for the top seven object categories (all are "stuff") is 82.72 percent. This is because in our current system, we only allow one labeling for each pixel, and smaller objects tend to be overwhelmed by the labeling of larger objects. We plan to build a recursive system in our future work to further retrieve things based on the inferred stuff.

For comparison purposes, we downloaded and executed the textonboost code from [43] using the same training and test data with the Markov random field turned off. The overall pixel-wise recognition rate of their system on our data set is 51.67 percent, and the per-class rates are displayed in Fig. 12c. For fairness, we also turned off the Markov random field model as well as the spatial priors in our framework by setting the corresponding coefficients to zero, and plotted the corresponding results in Fig. 12f. Clearly, our system outperforms [43] in terms of both overall and per-class recognition rate. Similar performance to textonboost is achieved by matching color instead of matching dense SIFT descriptors in our system, as shown in Fig. 12b. The recognition rate of some classes dramatically increases through matching color because color is the salient feature for these categories. However, the performance drops for other color-variant categories. This result supports the importance of matching appearance-invariant features in our label transfer system.

We also compared the performance of our system with a classifier-based system [8]. We downloaded their code and trained a classifier for each object category using the same training data. We converted our system into a binary object detector for each class by only using the per-class likelihood term. The per-class ROC curves of our system (red) and theirs (blue) are plotted in Fig. 9. Except for five object categories, our system outperforms or equals theirs.

5.1.3 Parameter Selection

Since the SIFT flow module is essential to our system, we first test the spatial smoothness coefficient α in (4), which determines the matching results. We compute the average pixel-wise recognition rate as a function of α, shown in Fig. 13a. We first turn off the MRF model in the label transfer module and find the value of α at which the maximal recognition rate is achieved. Then, we turn on the MRF model and find that the same value leads to a good performance as well. Therefore, we fix α throughout our experiments.

We investigated the performance of our system by varying the parameters K, ε, and M. We have found that the influence of ε is smaller than that of K; ε is set such that most samples have K nearest neighbors. We vary K and M. For each combination of (K, M), coordinate descent is used to find the optimal MRF parameters by maximizing the recognition rate. We plot the recognition rate as a function of M for a variety of K in Fig. 13b. Overall, the recognition rate increases as more nearest neighbors are retrieved (larger K) and more voting candidates are used (larger M) since, obviously, multiple candidates are needed to transfer labels to the query. However, the recognition rate drops as M continues to increase, as more candidates may introduce noise to the label transfer process. In particular, the recognition rate drops when K increases, suggesting that scene retrieval does not only serve as a way to obtain neighbors for SIFT flow, but also rules out some bad images that SIFT flow would otherwise choose. The maximum performance is obtained at an intermediate setting of K and M.
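The tuning loop described here is simple enough to sketch in code. In the snippet below, the evaluate callback, which would run the full parse-and-score pipeline for one parameter setting, is hypothetical, as are the parameter names:

    def coordinate_descent(evaluate, params, grids, sweeps=3):
        """Greedy coordinate descent: repeatedly sweep the parameters,
        trying each candidate value for one parameter while holding the
        others fixed, and keep any change that raises the recognition rate.

        evaluate : dict -> float, a (hypothetical) parse-and-score run that
                   returns the average pixel-wise recognition rate.
        params   : dict of initial values, e.g., {"lam": 0.1, "mu": 0.1}.
        grids    : dict mapping each parameter name to its candidate values.
        """
        best = evaluate(params)
        for _ in range(sweeps):
            for name, candidates in grids.items():
                for value in candidates:
                    trial = dict(params, **{name: value})
                    score = evaluate(trial)
                    if score > best:
                        best, params = score, trial
        return params, best

For each combination of (K, M), one would call coordinate_descent with the MRF coefficients as the free parameters and record the best recognition rate found.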
Fig. 9. The ROC curve of each individual pixel-wise binary classifier. Red curve: our system after being converted to binary classifiers; blue curve: the system in [8]. We used the convex hull to make the ROC curves strictly concave. The numbers (n, m) underneath the name of each plot are the quantities of the object instances in the test and training sets, respectively. For example, (170, 2,124) under "sky" means that there are 170 test images containing sky and 2,124 training images containing sky (there are in total 2,488 training images and 200 test images). Our system obtains reasonable performance for objects with sufficient samples in both training and test sets. We observe truncation in the ROC curves where there are not enough test samples. The performance is poor for objects without enough training samples, e.g., crosswalk. The ROC does not exist for objects without any test samples. In comparison, our system outperforms or equals [8] for all object categories except for a few. The performance of [8] on our database is low because the objects have drastically different poses and appearances.

While, in our current evaluation framework, such pixels are considered misclassified, this parsing would be considered accurate when evaluated by a human. A more precise evaluation criterion would take synonyms into account. Another example is shown in Fig. 16 (12), where the windows are not present in the parsing result. Therefore, pixels are labeled wrong because they are classified as building, which is a more favorable label than, for example, sky, as windows tend to appear on top of buildings. A superior evaluation criterion should also consider co-occurrence and occlusion relationships. We leave these items as future work.

6.4 Nonparametric versus Parametric Approaches

In this paper, we have demonstrated promising results of our nonparametric scene parsing system using label transfer by showing how it outperforms existing recognition-based approaches. However, we do not believe that our system alone is the ultimate answer to image understanding, since it does not work well for small objects, which can be better handled using detectors. Moreover, pixel-wise classifiers such as textonboost can also be useful when good matching cannot be established or good nearest neighbors can hardly be retrieved. Therefore, a natural future step is to combine these methods for scene parsing and image understanding.

We have presented a novel, nonparametric scene parsing system to integrate and transfer the annotations from a large database to an input image via dense scene alignment. A coarse-to-fine SIFT flow matching scheme is proposed to reliably and efficiently establish dense correspondences between images across scenes. Using the dense scene correspondences, we warp the pixel labels of the existing samples to the query. Furthermore, we integrate multiple cues to segment and recognize the query image into the object categories in the database. Promising results have been achieved by our scene alignment and parsing system on challenging databases. Compared to existing approaches that require training for each object category, our nonparametric scene parsing system is easy to implement, has only a few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.

Funding for this research was provided by the Royal Dutch/Shell Group, NGA NEGI-1582-04-0004, MURI Grant N00014-06-1-0734, a US National Science Foundation (NSF) CAREER award (IIS 0747120), and a National Defense Science and Engineering Graduate Fellowship. Ce Liu wishes to thank Professor William T. Freeman and Professor Edward H. Adelson for insightful discussions.

References

[1] E.H. Adelson, "On Seeing Stuff: The Perception of Materials by Humans and Machines," Proc. SPIE, vol. 4299, pp. 1-12, 2001.
[2] S. Belongie, J. Malik, and J. Puzicha, "Shape Context: A New Descriptor for Shape Matching and Object Recognition," Proc. Advances in Neural Information Processing Systems, 2000.
[3] A. Berg, T. Berg, and J. Malik, "Shape Matching and Object Recognition Using Low Distortion Correspondence," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[4] I. Borg and P. Groenen, Modern Multidimensional Scaling: Theory and Applications, second ed. Springer-Verlag, 2005.
[5] S. Branson, C. Wah, B. Babenko, F. Schroff, P. Welinder, P. Perona, and S. Belongie, "Visual Recognition with Humans in the Loop," Proc. European Conf. Computer Vision, 2010.

Fig. 16. Some scene parsing results on the SUN database, following the same notation as in Fig. 10. Note that the LabelMe Outdoor database is a subset of the SUN database, but the latter contains a larger proportion of indoor and activity-based scenes. As in all cases, the success of the final parsing depends on the semantic similarity of the query in (a) to the warped support image (d). Examples (10) and (11) are two failure examples where many people are not correctly labeled.

...smoothness) into a robust annotation. Promising experimental results are achieved on images from the LabelMe database [39]. Our goal is to explore the performance of scene parsing through the transfer of labels from existing annotated images, rather than building a comprehensive object recognition system. We show, however, that our system outperforms existing approaches [8], [43] on our databases. Our code and databases can be downloaded at http://people.csail.mit.edu/celiu/LabelTransfer/.

This paper is organized as follows: In Section 2, we briefly survey the object recognition and detection literature. After giving a system overview in Section 3, we describe, in detail, each component of our system in Section 4. Thorough experiments are conducted in Section 5 for evaluation, and in-depth discussion is provided in Section 6. We conclude our paper in Section 7.

Object recognition is an area of research that has greatly evolved over the last decade. Many works focusing on single-class modeling, such as faces [11], [48], [49], digits, characters, and pedestrians [2], [8], [25], have been proven successful and, in some cases, the problems have been mostly deemed as solved. Recent efforts have mainly turned to the area of multiclass object recognition. In creating an object detection system, there are many basic building blocks to take into account; feature description and extraction is the first stepping stone. Examples of descriptors include gradient-based features such as SIFT [30] and HOG [8], shape context [2], and patch statistics [42]. Consequently, selected feature descriptors can be further applied to images either in a sparse manner [2], [16], [19], by selecting the top keypoints containing the highest response from the feature descriptor, or densely, by observing feature statistics across the image [40], [51].

Sparse keypoint representations are often matched among pairs of images. Since the generic problem of matching two sets of keypoints is NP-hard, approximation algorithms have been developed to efficiently compute keypoint matches minimizing error rates (e.g., the pyramid match kernel [19] and vocabulary trees [32], [33]). On the other hand, dense representations have been handled by modeling distributions of the visual features over neighborhoods in the image, or over the image as a whole [24], [40], [51]. We chose the dense representation in this paper due to recent advances in dense image matching [28], [29].

At a higher level, we can also distinguish two types of object recognition approaches: approaches that consist of learning generative/discriminative models, and approaches that rely on image retrieval and matching. In the parametric family we can find numerous template-matching methods, where classifiers are trained to discriminate between an image window containing an object or a background [8]. However, these methods assume that objects are mostly rigid and are susceptible to little or no deformation. To account for articulated objects, constellation models have been designed to model objects as ensembles of parts [13], [14], [15], [50], considering spatial information [7], depth ordering information [53], and multiresolution modes [35]. Recently, a new idea of integrating humans
in the loop via crowdsourcing for visual recognition of specialized classes such as plants and animal species has emerged [5]; this method integrates the description of an object in less than 20 discriminative questions that humans can answer after visually inspecting the image.

In the realm of nonparametric methods, we find systems such as Video Google [44], a system that allows users to specify a visual query of an object in a video and subsequently retrieve instances of the same object across the movie. Another nonparametric system is the one in [38], where a previously unknown query image is matched against a densely labeled image database; the nearest neighbors are used to build a label probability map for the query, which is further used to prune out object detectors of classes that are unlikely to take place in the image. Nonparametric methods have also been widely used on web data to retrieve similar images. For example, in [17], a customized distance function is used at a retrieval stage to compute the distance between a query image and images in the training set, which subsequently cast votes to infer the object class of the query. In the same spirit, our nonparametric label transfer system avoids modeling object appearances explicitly, as our system parses a query image using the annotations of similar images in a training database and dense image correspondences.

Recently, several works have also considered contextual information in object detection to clean and reinforce individual results. Among the contextual cues that have been used are object-level co-occurrences, spatial relationships [6], [9], [18], [31], [36], and 3D scene layout [23]. For a more detailed and comprehensive study and benchmark of contextual works, we refer to [10]. Instead of explicitly modeling context, our model incorporates context implicitly, as object co-occurrences and spatial relationships are retained in label transfer.

An earlier version of our work appeared at [27]; in this paper, we explore the label transfer framework in depth with more thorough experiments and insights. Other recent papers have also introduced similar ideas. For instance, in [46], oversegmentation is performed on the query image, and segment-based classifiers trained on the nearest neighbors are applied to recognize each segment. In [37], scene boundaries are discovered by the common edges shared by nearest neighbors.

Fig. 1. For a query image (a), our system finds the top matches (b) (three are shown here) using scene retrieval and a SIFT flow matching algorithm [28], [29]. The annotations of the top matches (c) are transferred and integrated to parse the input image, as shown in (d). For comparison, the ground-truth user annotation of (a) is shown in (e).
We retrieve the nearest neighbors for each image using their ground-truth annotations; please refer to Section 4.1 for the details. This upper bound is an 83.79 percent recognition rate.

To further understand our data-driven system, we evaluated the performance of the system as a function of the ratio of training samples while fixing the test set. For each fixed ratio, we formed a small training database by randomly drawing samples from the original database and evaluated the performance of the system under this database. This experiment was performed 15 times for each ratio to obtain a mean and standard deviation of its performance, shown in Fig. 13c. Clearly, the recognition rate depends on the size of the training database. Using the last 10 data points for extrapolation, we found that if we increase the training data by 10 times (corresponding to 1 on the horizontal axis), the recognition rate may increase to 84.16 percent.3 Note, however, that this linear extrapolation does not consider potential saturation issues, as can be observed when more than 10 percent of the training samples were used. This indicates that the training quantity is reasonable for this database.

Another aspect we evaluated is the capacity to detect good parsing results. For this purpose, we rerank the test images using three metrics: recognition rate (the ideal metric; evaluated with respect to ground truth), the parsing objective in (5) after energy minimization, and the average GIST-based distance to the nearest neighbors. After ranking, we computed the accumulated average recognition rate as a function of the ratio of testing samples, as shown in Fig. 13d. If we use the parsing objective as a metric, for example, then the average recognition rate can be greater than 80 percent when only the top 75 percent of parsing results are picked. The system can reject the remaining 25 percent with low scores.

5.2 SUN Database

We further evaluated the performance of our system on the SUN database [52], which contains 9,556 images of both

Fig. 12. We study the performance of our system in depth. (a) Our system with the parameters optimized for pixel-wise recognition rate. (b) Our system, matching RGB instead of matching dense SIFT descriptors. (c) The performance of [43] with the Markov random field component turned off, trained and tested on the same data sets as (a). In (d), (e), (f), we show the importance of SIFT flow matching and the MRF for label transfer by turning them on and off. In (g) and (h), we show the system performance affected by other scene retrieval methods. The performance in (h) shows the upper limit of our system, obtained by adopting ideal scene retrieval using the ground-truth annotation (of course, the ground-truth annotation is not available in practice). See text for more details.

Fig. 13. (a): Recognition rate as a function of the spatial smoothness coefficient under two settings of the MRF. (b): Recognition rate as a function of the number of nearest neighbors K and the number of voting candidates M. Clearly, the prior and spatial smoothness help improve the recognition rate. The fact that the curve drops down as we further increase K indicates that SIFT flow matching cannot replace scene retrieval. (c): Recognition rate as a function of the log training ratio while the test set is fixed. A subset of training samples is randomly drawn from the entire training set according to the training ratio to test how the performance depends on the size of the database. (d): Recognition rate as a function of the proportion of the top-ranked test images according to metrics including GIST, the parsing objective in (5), and the recognition rate (with ground-truth annotation). The black, dashed curve with recognition rate as the sorting metric is the ideal case, and the parsing objective is better than GIST. These curves suggest the system is somewhat capable of distinguishing good parsing results from bad ones.

3. This extrapolation is different from moving to a larger database in Section 5.2, where indoor scenes are included. This number is anticipated only when images similar to the LMO database are added.
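The reranking experiment above reduces to a short computation: sort the test images by the minimized parsing objective and accumulate the average recognition rate of the retained fraction. A minimal sketch with hypothetical per-image arrays:

    import numpy as np

    def accumulated_rate(rates, objectives):
        """Rank test images by the minimized parsing objective (lower
        energy = more confident parse) and return the accumulated average
        recognition rate of the top k images, for every k.

        rates      : per-image pixel-wise recognition rates (used only to
                     evaluate the ranking; the objective itself needs no
                     ground truth).
        objectives : per-image values of the parsing objective in (5).
        """
        order = np.argsort(objectives)                 # best parses first
        ranked = np.asarray(rates, dtype=float)[order]
        return np.cumsum(ranked) / np.arange(1, ranked.size + 1)

    # Keeping the top 75 percent of n test images:
    # accumulated_rate(rates, objectives)[int(0.75 * n) - 1]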
The spatial priors of these object categories are displayed at the bottom of Fig. 3, where white denotes zero probability and the saturation of color is directly proportional to its probability. Note that, consistent with common knowledge, sky occupies the upper part of the image grid and road the lower part. Furthermore, there are only limited samples for several of the categories.

In this section, we will describe each module of our nonparametric scene parsing system.

4.1 Scene Retrieval

The objective of scene retrieval is to retrieve a set of nearest neighbors in the database for a given query image. There exist several ways of defining a nearest neighbor set. The most common definition consists of taking the K closest points to the query (K-NN). Another model, ε-NN, widely used in texture synthesis [12], [26], considers all of the neighbors within (1 + ε) times the minimum distance from the query. We generalize these two types to ⟨K, ε⟩-NN, and define it as

N(x) = { y_i | dist(x, y_i) ≤ (1 + ε) dist(x, y_1), i ≤ K },   (1)

where y_1 = argmin_y dist(x, y). As ε → ∞, ⟨K, ε⟩-NN is reduced to K-NN. As K → ∞, ⟨K, ε⟩-NN is reduced to ε-NN. The ⟨K, ε⟩ representation gives us the flexibility to deal with the density variation of the graph, as shown in Fig. 5. We will show how K affects the performance in the experimental section. In practice, we found that a fixed small ε is a good choice, and we will use it throughout our experiments. Nevertheless, dramatic improvement of ⟨K, ε⟩-NN over K-NN is not expected, as sparse samples are few in our databases.

We have not yet defined the distance function between two images. Measuring image similarities/distances is still an active research area; a systematic study of image features for scene recognition can be found in [52]. In this paper, three distances are used: Euclidean distance of GIST [34], spatial pyramid histogram intersection of HOG visual words [24], and spatial pyramid histogram intersection of the ground-truth annotation. For the HOG distance, we use the standard pipeline of computing HOG features on a dense grid and quantizing features to visual words over a set of images using k-means clustering. The ground truth-based distance metric is used to estimate an upper bound of our system for evaluation purposes. Both the HOG and ground truth distances are computed in the same manner. The ground truth distance is computed by building histograms of pixel-wise labels. To include spatial information, the histograms are computed by dividing an image into 2x2 windows and concatenating the four histograms into a single vector. Histogram intersection is used to compute the ground truth distance. We obtain the HOG distance by replacing pixel-wise labels with HOG visual words.

In Fig. 4, we show the importance of the distance metric, as it defines the neighborhood structure of the large image database. We randomly selected 200 images from the LMO database and computed pairwise image distances using GIST (top) and the ground-truth annotation (bottom). Then, we use multidimensional scaling (MDS) [4] to map these images to points on a 2D grid for visualization. Although the GIST descriptor is able to form a reasonably meaningful image space where semantically similar images are clustered, the image space defined by the ground-truth annotation truly reveals the underlying structures of the image database. This will be further examined in the experimental section.
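Before moving on to dense alignment, note that the retrieval rule in (1) is straightforward to implement. Below is a minimal sketch, assuming a pairwise distance callback such as the Euclidean distance of GIST; the default values of K and eps are placeholders rather than the paper's tuned settings:

    import numpy as np

    def k_eps_nn(query, database, dist, K=9, eps=1.0):
        """<K, eps>-NN from (1): keep the K closest database images, but
        only those within (1 + eps) times the minimum distance. eps -> inf
        recovers K-NN; K -> inf recovers eps-NN.
        """
        d = np.array([dist(query, y) for y in database])
        order = np.argsort(d)                   # closest first
        radius = (1.0 + eps) * d[order[0]]      # adaptive radius at the query
        return [int(i) for i in order[:K] if d[i] <= radius]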
4.2 SIFT Flow for Dense Scene Alignment

As our goal is to transfer the labels of existing samples to parse an input image, it is essential to find the dense correspondence for images across scenes. In our previous work [29], we have demonstrated that SIFT flow is capable of establishing semantically meaningful correspondences between two images by matching local SIFT descriptors. We further extended SIFT flow into a hierarchical computational framework to improve the performance [27]. In this section, we will provide a brief explanation of the algorithm; for a detailed description, we refer to [28].

Similarly to optical flow, the task of SIFT flow is to find dense correspondence between two images. Let p = (x, y) contain the spatial coordinate of a pixel, and w(p) = (u(p), v(p)) be the flow vector at p. Denote s_1 and s_2 as the per-pixel SIFT descriptors [30] of the two images,2 and ε as the set of all spatial neighborhoods (a four-neighbor system is used). The energy function for SIFT flow is defined as

E(w) = Σ_p min( ||s_1(p) − s_2(p + w(p))||_1, t )   (2)
     + Σ_p η ( |u(p)| + |v(p)| )   (3)
     + Σ_{(p,q)∈ε} min( α|u(p) − u(q)|, d ) + min( α|v(p) − v(q)|, d ),   (4)

which contains a data term, a small displacement term, and a smoothness term (a.k.a. spatial regularization). The data term in (2) constrains the SIFT descriptors to be matched along with the flow vector w(p). The small displacement term in (3) constrains the flow vectors to be as small as possible when no other information is available. The smoothness term in (4) constrains the flow vectors of adjacent pixels to be similar. In this objective function, truncated L1 norms are used in both the data term and the smoothness term to account for matching outliers and flow discontinuities, with t and d as the thresholds, respectively.

While SIFT flow has demonstrated the potential for aligning images across scenes [29], the original implementation scales poorly with respect to the image size. In SIFT flow, a pixel in one image can literally match to any other pixel in another image. Suppose the image has h^2 pixels; then the time and space complexity of the belief propagation algorithm to estimate the SIFT flow is O(h^4), which quickly becomes prohibitive as images grow.

2. SIFT descriptors are computed at each pixel using a 16x16 window. The window is divided into 4x4 cells, and image gradients within each cell are quantized into an 8-bin histogram. Therefore, the pixel-wise SIFT feature is a 128D vector.

The core idea of our nonparametric scene parsing system is recognition-by-matching. To parse an input image, we match the visual objects in the input image to the images in a database. If images in the database are annotated with object category labels and if the matching is semantically meaningful, i.e., objects of one category correspond to objects of the same category, then we can simply transfer the labels of the images in the database to parse the input. Nevertheless, we need to deal with many practical issues in order to build a reliable system. Fig. 2 shows the pipeline of our system, which consists of the following three algorithmic modules:

Scene retrieval: Given a query image, use scene retrieval techniques to find a set of nearest neighbors that share similar scene configurations (including objects and their relationships) with the query.

Dense scene alignment: Establish dense scene correspondences between the query image and each of the retrieved nearest neighbors. Choose the nearest neighbors with the top matching scores as voting candidates.

Label transfer: Warp the annotations from the voting candidates to the query image according to the estimated dense correspondences. Reconcile multiple labelings and impose spatial smoothness under a Markov random field (MRF) model.

Although we are going to choose concrete algorithms for each module in this paper, any algorithm that fits the module can be plugged into our nonparametric scene parsing system. For example, we use SIFT flow for dense scene alignment, but it would also suffice to use sparse feature matching and then propagate sparse correspondences to produce dense counterparts.

A key component of our system is a large, dense, and annotated image database.1 In this paper, we use two databases, both annotated using the LabelMe online annotation tool [39], to build and evaluate our system. The first is the LabelMe Outdoor (LMO) database [27], containing 2,688 fully annotated images, most of which are outdoor scenes including street, beach, mountains, fields, and buildings. The second is the SUN database [52], containing 9,556 fully annotated images, covering both indoor and outdoor scenes; in fact, LMO is a subset of SUN. We use the LMO database to explore our system in depth, and also report the results on the SUN database.

Before jumping into the details of our system, it is helpful to look at the statistics of the LMO database. The 2,688 images in LMO are randomly split into 2,488 for training and 200 for testing. We chose the top 33 object categories with the most labeled pixels. The pixels that are not labeled, or labeled as other object categories, are treated as the 34th category: "unlabeled." The per-pixel frequency count of these object categories in the training set is shown at the top of Fig. 3. The color of each bar is the average RGB value of the corresponding object category from the training data with saturation and brightness boosted for visualization purposes. The top 10 object categories cover the vast majority of the labeled pixels.
1. Other scene parsing and image understanding systems also require such a database. We do not require more than others.

Fig. 2. System pipeline. There are three key algorithmic components (rectangles) in our system: scene retrieval, dense scene alignment, and label transfer. The ovals denote data representations.

Fig. 3. Top: The per-pixel frequency counts of the object categories in our data set (sorted in descending order). The color of each bar is the average RGB value of each object category from the training data with saturation and brightness boosted for visualization. Bottom: The spatial priors of the object categories in the database. White means zero and the saturated color means high probability.

Because the regularity of the database is the key to the success, we remove the SIFT flow matching, i.e., set the flow vector to be zero for every pixel, and obtain an average recognition rate of 61.23 percent without the MRF and 67.96 percent with the MRF, shown in Figs. 12d and 12f, respectively. This result is significant because SIFT flow is the bottleneck of the system in terms of speed. A fast implementation of our system consists of removing the dense scene alignment module and simply performing a grid-to-grid label transfer (the likelihood term in the label transfer module still comes from the SIFT descriptor distance); a sketch of this variant is given below.

How would different scene retrieval techniques affect our system? Other than the GIST distance used for retrieving nearest neighbors for the results in Fig. 12, we also use the spatial pyramid histogram intersection of HOG visual words and of the ground-truth annotation, with the corresponding per-class recognition rates displayed in Figs. 12g and 12h, respectively. For this database, GIST performs slightly better than HOG visual words. We also explore an upper bound of the label transfer framework in the ideal scenario of having access to perfect scene matching. In particular, we retrieve the nearest neighbors for each image using their ground-truth annotations.

Fig. 11. Some typical failures. Our system fails when no good matches can be retrieved in the database. In (2), for example, since the best matches do not contain river, the input image is mistakenly parsed as a scene of grass, tree, and mountain in (e). The ground-truth annotation is in (f). The failure may also come from ambiguous annotations, for instance in (3), where the system outputs field for the bottom part, whereas the ground-truth annotation is mountain.

Fig. 10. Some scene parsing results output from our system. (a): Query image; (b): the best match from the nearest neighbors; (c): the annotation of the best match; (d): the warped version of (b) according to the SIFT flow field; (e): the inferred per-pixel parsing after combining multiple voting candidates; (f): the ground-truth annotation of (a). The dark gray pixels in (f) are "unlabeled." Notice how our system generates a reasonable parsing even for these "unlabeled" pixels.
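As a concrete picture of the fast variant mentioned above, the sketch below computes the per-pixel likelihood with the flow fixed to zero (grid-to-grid transfer); the array shapes and the tau cutoff are illustrative assumptions:

    import numpy as np

    def zero_flow_likelihood(query_sift, cand_sifts, cand_labels, L, tau):
        """Grid-to-grid label transfer: with zero flow, the likelihood of
        label l at pixel p is the smallest SIFT distance among the voting
        candidates annotated l at p, or tau if no candidate votes for l.

        query_sift  : (H, W, 128) dense SIFT image of the query.
        cand_sifts  : list of (H, W, 128) SIFT images of voting candidates.
        cand_labels : list of (H, W) integer annotations of the candidates.
        """
        H, W = query_sift.shape[:2]
        psi = np.full((H, W, L), tau, dtype=np.float32)
        for s_i, c_i in zip(cand_sifts, cand_labels):
            d = np.abs(query_sift - s_i).sum(axis=2)  # per-pixel L1 distance
            for l in range(L):
                mask = c_i == l
                layer = psi[..., l]                   # view into psi
                layer[mask] = np.minimum(layer[mask], d[mask])
        return psi  # psi.argmin(axis=2) gives the fast grid-to-grid parse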
Compared to the ordinary matching algorithm, the coarse-to-fine scheme not only runs significantly faster, but also achieves lower energies most of the time. Some SIFT flow examples are shown in Fig. 8, where dense SIFT flow fields (Fig. 8f) are obtained between the query images (Fig. 8a) and the nearest neighbors (Fig. 8c). It is easy to verify that the warped SIFT images (Fig. 8h) based on the SIFT flows (Fig. 8f) look very similar to the SIFT images (Fig. 8b) of the inputs (Fig. 8a), and that the SIFT flow fields (Fig. 8f) are piecewise smooth. The essence of SIFT flow is manifested in Fig. 8g, where the same flow field is applied to warp the RGB image of the nearest neighbor to the query. SIFT flow tries to hallucinate the structure of the query image by smoothly shuffling the pixels of the nearest neighbors. Because of the intrinsic similarities within each object category, it is not surprising that, through aligning image structures, objects of the same categories are often matched. In addition, it is worth noting that one object in the nearest neighbor can correspond to multiple objects in the query since the flow is asymmetric. This allows reuse of labels to parse multiple object instances.

4.3 Scene Parsing through Label Transfer

Now that we have a large database of annotated images and a technique for establishing dense correspondences across scenes, we can transfer the existing annotations to parse a query image through dense scene alignment. For a given query image, we retrieve a set of ⟨K, ε⟩-nearest neighbors in our database using GIST matching [34]. We then compute the SIFT flow from the query to each nearest neighbor, and use the achieved minimum energy (defined in (2)-(4)) to rerank the ⟨K, ε⟩-nearest neighbors. We further select the top M reranked retrievals to create our voting candidate set. This voting set will be used to transfer its contained annotations into the query image. This procedure is illustrated in Fig. 7.

Under this setup, scene parsing can be formulated as the following label transfer problem: For a query image I with its corresponding SIFT image s, we have a set of voting candidates {s_i, c_i, w_i}, i = 1, ..., M, where s_i, c_i, and w_i are the SIFT image, annotation, and SIFT flow field (from s to s_i) of the ith voting candidate, respectively. c_i is an integer image where c_i(p) is the index of the object category for pixel p. We want to obtain the annotation c for the query image by transferring {c_i} to the query image according to the dense correspondences {w_i}.

We build a probabilistic Markov random field model to integrate multiple labels, prior information of the object categories, and spatial smoothness of the annotation to parse image I. Similarly to that of [43], the posterior probability is defined as

−log P(c | I, s, {s_i, c_i, w_i}) = Σ_p ψ(c(p)) + λ Σ_p θ(c(p)) + μ Σ_{(p,q)∈ε} φ(c(p), c(q); I) + log Z,   (5)

where Z is the normalization constant of the probability. This posterior contains three components, i.e., likelihood, prior, and spatial smoothness.

The likelihood term ψ is defined as

ψ(c(p) = l) = min_{i∈Ω_{p,l}} ||s(p) − s_i(p + w_i(p))||  if Ω_{p,l} ≠ ∅,  and  ψ(c(p) = l) = τ  if Ω_{p,l} = ∅,

where Ω_{p,l} = {i; c_i(p + w_i(p)) = l} is the index set of the voting candidates whose label is l after being warped to pixel p, and τ is set to be the value of the maximum difference of SIFT features.

The prior term θ(c(p) = l) indicates the prior probability that object category l appears at pixel p. This is obtained from counting the occurrence of each object category at each location in the training set:

θ(c(p) = l) = −log hist_l(p),

where hist_l is the spatial histogram of object category l.

The smoothness term φ is defined to bias the neighboring pixels toward having the same label in the event that no other information is available, and the probability depends on the edges of the image: the stronger the luminance edge, the more likely it is that the neighboring pixels may have different labels.
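Putting the three terms together, the sketch below assembles the negative log posterior of a candidate labeling in the spirit of (5). The coefficient names lam and mu follow the text above, and the exponential down-weighting of label-change costs across strong luminance edges is an assumed form, since the pairwise term is described here only qualitatively:

    import numpy as np

    def posterior_energy(c, psi, hist, lam, mu, edge_strength):
        """Negative log posterior of a labeling c, following (5):
        likelihood + lam * spatial prior + mu * smoothness.

        c             : (H, W) integer labeling of the query.
        psi           : (H, W, L) likelihood from warped voting candidates.
        hist          : (H, W, L) spatial histograms hist_l(p) per category.
        edge_strength : (H, W) luminance gradient magnitude.
        """
        H, W = c.shape
        ii, jj = np.mgrid[0:H, 0:W]
        data = psi[ii, jj, c].sum()                    # likelihood term
        prior = -np.log(hist[ii, jj, c] + 1e-8).sum()  # -log hist_l(p)
        smooth = 0.0
        for dy, dx in ((0, 1), (1, 0)):                # 4-neighborhood
            change = c[dy:, dx:] != c[:H - dy, :W - dx]
            w = np.exp(-edge_strength[dy:, dx:])       # cheaper across edges
            smooth += (change * w).sum()
        return data + lam * prior + mu * smooth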
Notice that the energy function is controlled by four parameters: K and M, which decide the mode of the model, and λ and μ, which control the influence of the spatial prior and smoothness. Once the parameters are fixed, we again use the BP-S algorithm to minimize the energy. The algorithm converges in two seconds on a workstation with two quad-core 2.67 GHz Intel Xeon CPUs.

A significant difference between our model and that in [43] is that we have fewer parameters because of the nonparametric nature of our approach, whereas classifiers were trained in [43]. In addition, color information is not included in our model at present, as the color distribution for each object category is diverse in our databases.

Extensive experiments were conducted to evaluate our system. We shall first report the results on a small-scale database, which we will refer to as the LabelMe Outdoor (LMO) database, in Section 5.1; this database will aid us in an in-depth exploration of our model. Furthermore, we will report results on the SUN database in Section 5.2.

Fig. 7. For a query image, we first find a ⟨K, ε⟩-nearest neighbor set in the database using GIST matching [34]. The nearest neighbors are reranked using SIFT flow matching scores, and form a top candidate set. The annotations are transferred from the voting candidates to parse the query image.

[6] M.J. Choi, J.J. Lim, A. Torralba, and A. Willsky, "Exploiting Hierarchical Context on a Large Database of Object Categories," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[7] D. Crandall, P. Felzenszwalb, and D. Huttenlocher, "Spatial Priors for Part-Based Recognition Using Statistical Models," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[8] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[9] C. Desai, D. Ramanan, and C. Fowlkes, "Discriminative Models for Multi-Class Object Layout," Proc. IEEE Int'l Conf. Computer Vision, 2009.
[10] S.K. Divvala, D. Hoiem, J.H. Hays, A.A. Efros, and M. Hebert, "An Empirical Study of Context in Object Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[11] G. Edwards, T. Cootes, and C. Taylor, "Face Recognition Using Active Appearance Models," Proc. European Conf. Computer Vision, 1998.
[12] A.A. Efros and T. Leung, "Texture Synthesis by Non-Parametric Sampling," Proc. IEEE Int'l Conf. Computer Vision, 1999.
[13] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A Discriminatively Trained, Multiscale, Deformable Part Model," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[14] P. Felzenszwalb and D. Huttenlocher, "Pictorial Structures for Object Recognition," Int'l J. Computer Vision, vol. 61, no. 1, pp. 55-79, 2005.
[15] R. Fergus, P. Perona, and A. Zisserman, "Object Class Recognition by Unsupervised Scale-Invariant Learning," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003.
[16] R. Fergus, P. Perona, and A. Zisserman, "Object Class Recognition by Unsupervised Scale-Invariant Learning," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2003.
[17] A. Frome, Y. Singer, and J. Malik, "Image Retrieval and Classification Using Local Distance Functions," Proc. Advances in Neural Information Processing Systems, 2006.
[18] C. Galleguillos, B. McFee, S. Belongie, and G.R.G. Lanckriet, "Multi-Class Object Localization by Combining Local Contextual Interactions," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[19] K. Grauman and T. Darrell, "Pyramid Match Kernels: Discriminative Classification with Sets of Image Features," Proc. IEEE Int'l Conf. Computer Vision, 2005.
[20] A. Gupta and L.S. Davis, "Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers," Proc. European Conf. Computer Vision, 2008.
[21] J. Hays and A.A. Efros, "Scene Completion Using Millions of Photographs," ACM Trans. Graphics, vol. 26, no. 3, 2007.
[22] G. Heitz and D. Koller, "Learning Spatial Context: Using Stuff to Find Things," Proc. European Conf. Computer Vision, 2008.
[23] D. Hoiem, A. Efros, and M. Hebert, "Putting Objects in Perspective," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[24] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories," Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 2, pp. 2169-2178, 2006.
[25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[26] L. Liang, C. Liu, Y.Q. Xu, B.N. Guo, and H.Y. Shum, "Real-Time Texture Synthesis by Patch-Based Sampling," ACM Trans. Graphics, vol. 20, no. 3, pp. 127-150, July 2001.
[27] C. Liu, J. Yuen, and A. Torralba, "Nonparametric Scene Parsing: Label Transfer via Dense Scene Alignment," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[28] C. Liu, J. Yuen, and A. Torralba, "SIFT Flow: Dense Correspondence across Different Scenes and Its Applications," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 978-994, May 2011.
[29] C. Liu, J. Yuen, A. Torralba, J. Sivic, and W.T. Freeman, "SIFT Flow: Dense Correspondence across Different Scenes," Proc. European Conf. Computer Vision, 2008.
[30] D.G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[31] K.P. Murphy, A. Torralba, and W.T. Freeman, "Using the Forest to See the Trees: A Graphical Model Relating Features, Objects, and Scenes," Proc. Advances in Neural Information Processing Systems, 2003.
[32] D. Nister and H. Stewenius, "Scalable Recognition with a Vocabulary Tree," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[33] S. Obdrzalek and J. Matas, "Sub-Linear Indexing for Large Scale Object Recognition," Proc. British Machine Vision Conf., 2005.
[34] A. Oliva and A. Torralba, "Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope," Int'l J. Computer Vision, vol. 42, no. 3, pp. 145-175, 2001.
[35] D. Park, D. Ramanan, and C. Fowlkes, "Multiresolution Models for Object Detection," Proc. European Conf. Computer Vision, 2010.
[36] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, "Objects in Context," Proc. IEEE Int'l Conf. Computer Vision, 2007.
[37] B.C. Russell, A.A. Efros, J. Sivic, W.T. Freeman, and A. Zisserman, "Segmenting Scenes by Matching Image Composites," Proc. Advances in Neural Information Processing Systems, 2009.
[38] B.C. Russell, A. Torralba, C. Liu, R. Fergus, and W.T. Freeman, "Object Recognition by Scene Alignment," Proc. Advances in Neural Information Processing Systems, 2007.
[39] B.C. Russell, A. Torralba, K.P. Murphy, and W.T. Freeman, "LabelMe: A Database and Web-Based Tool for Image Annotation," Int'l J. Computer Vision, vol. 77, nos. 1-3, pp. 157-173, 2008.
[40] S. Savarese, J. Winn, and A. Criminisi, "Discriminative Object Class Models of Appearance and Shape by Correlatons," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2006.
[41] G. Shakhnarovich, P. Viola, and T. Darrell, "Fast Pose Estimation with Parameter Sensitive Hashing," Proc. IEEE Int'l Conf. Computer Vision, 2003.
[42] E. Shechtman and M. Irani, "Matching Local Self-Similarities across Images and Videos," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.
[43] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "Textonboost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context," Int'l J. Computer Vision, vol. 81, no. 1, pp. 2-23, 2009.
[44] J. Sivic and A. Zisserman, "Video Google: A Text Retrieval Approach to Object Matching in Videos," Proc. IEEE Int'l Conf. Computer Vision, 2003.
[45] E. Sudderth, A. Torralba, W.T. Freeman, and A. Willsky, "Describing Visual Scenes Using Transformed Dirichlet Processes," Proc. Advances in Neural Information Processing Systems, 2005.
[46] J. Tighe and S. Lazebnik, "Superparsing: Scalable Nonparametric Image Parsing with Superpixels," Proc. European Conf. Computer Vision, 2010.
[47] A. Torralba, R. Fergus, and W.T. Freeman, "80 Million Tiny Images: A Large Dataset for Non-Parametric Object and Scene Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958-1970, Nov. 2008.
[48] M. Turk and A. Pentland, "Face Recognition Using Eigenfaces," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1991.
[49] P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2001.
[50] M. Weber, M. Welling, and P. Perona, "Unsupervised Learning of Models for Recognition," Proc. European Conf. Computer Vision, 2000.
[51] J. Winn, A. Criminisi, and T. Minka, "Object Categorization by Learned Universal Visual Dictionary," Proc. IEEE Int'l Conf. Computer Vision, 2005.
[52] J. Xiao, J. Hays, K. Ehinger, A. Oliva, and A. Torralba, "SUN Database: Large-Scale Scene Recognition from Abbey to Zoo," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[53] Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes, "Layered Object Detection for Multi-Class Segmentation," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.

...few clusters. This phenomenon is consistent with human perception, drawing a clear separation between outdoor and indoor images.

Some scene parsing results are displayed in Fig. 16 in the same format as Fig. 10. Since the SUN database is a superset of the LabelMe Outdoor database, the selection of results is slightly biased toward indoor and activity scenes. Overall, our system performs reasonably well in parsing these challenging images. We also plot the per-class performance in Fig. 15. In Fig. 15a, we show the pixel-wise frequency count of the top 100 object categories. Similarly to LMO, this prior distribution is heavily biased toward stuff-like classes, e.g., building and floor. In Fig. 15b, the performance is achieved when the ground-truth annotation is used for scene retrieval. Again, the average 64.45 percent recognition rate reveals the upper limit and the potential of our system in the idealized case of perfect nearest neighbor retrieval. In Figs. 15c and 15d, the performance using HOG and GIST features for scene retrieval is plotted, suggesting that the HOG visual word features outperform GIST for this larger database. This is consistent with the discovery that the HOG feature is the best among a set of features, including GIST, in scene recognition on the SUN database [52]. Overall, the recognition rate on the SUN database is lower than that on the LMO database. A possible explanation for this phenomenon is that indoor scenes contain less regularity compared to outdoor ones, and there are 515 object categories in SUN, whereas there are only 33 categories in LMO.

6.1 Label Transfer: An Open, Database-Driven Framework for Image Understanding

A unique characteristic of our nonparametric scene parsing system is its openness: To support more object categories, one can simply add more images of the new categories into the database without requiring additional training. This is an advantage over classical learning-based systems, where all of the classifiers have to be retrained when new object categories are inserted into the database. Although there is no parametric model (probabilistic distributions or classifiers) of object appearances in our system, the ability to recognize objects depends on reliable image matching across different scenes. When good matches are established between objects in the query and objects in the nearest neighbors in the annotated database, the known annotation naturally explains the query image as well. We chose SIFT flow [28] to establish a semantically meaningful correspondence between two different images. Nonetheless, this module can easily be substituted by other or better scene correspondence methods.

Although context is not explicitly modeled in our system, our label transfer-based scene parsing system naturally embeds contextually coherent label sets.
The nearest neighbors retrieved in the database and reranked by SIFT flow scores mostly belong to the same type of scene category, implicitly ensuring contextual coherence. Using Fig. 16 (9) as an example, we can see that even though the reflection of the mountain has been misclassified, the parsing result is context-coherent.

6.2 The Role of Scene Retrieval

Our nonparametric scene parsing system largely depends on the scene retrieval technique through which the nearest neighbors of the query image in the large database are obtained. We have tried two popular techniques, GIST and HOG visual words, and have found that GIST-based retrieval yields higher performance on the LMO database, whereas HOG visual words tend to retrieve better neighbors on the SUN database. We also show the upper bound performance of our system by using the ground-truth annotation for scene retrieval. This upper bound provides an intuition of the efficacy of our system given an ideal scene retrieval system. The recent advances in this area by combining multiple kernels [52] point out promising directions for scene retrieval.

6.3 Better Evaluation Criterion

Presently, we use a simple criterion, the pixel-wise recognition rate, to measure the performance of our scene parsing system. A pixel is correctly recognized only when the parsing result is exactly the same as the ground-truth annotation. However, human annotation can be ambiguous. For instance, in the parsing example depicted in Fig. 16 (9), the pixel-wise recognition rate is low because one object category is recognized as a synonymous one.

Fig. 15. The per-class recognition rate of running our system on the SUN database. (a): Pixel-wise frequency histogram of the top 100 object categories. (b): The per-class recognition rate when the ground-truth annotation is used for scene retrieval. This is the upper limit performance that can be achieved by our system. (c) and (d): Per-class recognition rate using HOG and GIST for scene retrieval, respectively. The HOG visual word features generate better results for this larger database.

As reported in [29], the computation time for a pair of images with a large searching neighborhood is 50 seconds. The original implementation of SIFT flow would require more than two hours to process a pair of images in our database, with a memory usage of 16 GB to store the data term. To address this performance drawback, a coarse-to-fine SIFT flow matching scheme was designed to significantly improve the performance. As illustrated in Fig. 6, the basic idea consists of estimating the flow at a coarse level of the image grid, and then gradually propagating and refining the flow from coarse to fine; please refer to [28] for details. As a result, the complexity of this coarse-to-fine algorithm is greatly reduced, a significant speedup compared to the original O(h^4). The matching between two images takes 31 seconds on a workstation with two quad-core 2.67 GHz Intel Xeon CPUs and 32 GB memory, in a C++ implementation. We also discovered that the coarse-to-fine scheme not only runs significantly faster, but also achieves lower energies most of the time.

Fig. 4. The structure of a database depends on the image distance metric. Top: The ⟨K, ε⟩-NN graph of the LabelMe Outdoor database visualized by scaled MDS using the GIST feature as distance. Bottom: The ⟨K, ε⟩-NN graph of the same database visualized using the pyramid histogram intersection of the ground-truth annotation as distance. Left: RGB images; right: annotation images. Notice how the ground-truth annotation emphasizes the underlying structure of the database. In (c) and (d), we see that the image content changes from urban, streets (right), to highways (middle), and to nature scenes (left) as we pan from right to left. Eight hundred images are randomly selected from LMO for this visualization.

Fig. 6. An illustration of our coarse-to-fine pyramid SIFT flow matching. The green square denotes the searching window at each pyramid level. For simplicity, only one image is shown here. The details of the algorithm can be found in [28].
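For concreteness, the sketch below evaluates the SIFT flow objective (2)-(4) for a given integer flow field; the out-of-bounds clamping is an implementation assumption, not a detail taken from the paper:

    import numpy as np

    def sift_flow_energy(s1, s2, u, v, eta, alpha, t, d):
        """Score a candidate flow (u, v) under the SIFT flow objective.
        s1, s2 : (H, W, 128) dense SIFT images; u, v : (H, W) integer flow.
        Truncated L1 norms (thresholds t and d) absorb matching outliers
        and flow discontinuities, as in (2) and (4).
        """
        H, W = u.shape
        yy, xx = np.mgrid[0:H, 0:W]
        y2 = np.clip(yy + v, 0, H - 1)       # clamp displaced coordinates
        x2 = np.clip(xx + u, 0, W - 1)
        data = np.minimum(np.abs(s1[yy, xx] - s2[y2, x2]).sum(-1), t).sum()
        small = eta * (np.abs(u) + np.abs(v)).sum()      # small displacement
        smooth = 0.0
        for g in (u, v):                     # 4-neighborhood differences
            smooth += np.minimum(alpha * np.abs(np.diff(g, axis=0)), d).sum()
            smooth += np.minimum(alpha * np.abs(np.diff(g, axis=1)), d).sum()
        return data + small + smooth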
Fig. 5. An image database can be nonuniform, as illustrated by some random 2D points. The green node (A) is surrounded densely by neighbors, whereas the red node (B) resides in a sparse area. If we use K-NN, then some samples (orange nodes) far away from the query (B) can be chosen as neighbors. If, instead, we use ε-NN and choose the radius as shown in the picture, then there can be too many neighbors for a sample such as (A). The combination, ⟨K, ε⟩-NN, shown as gray edges, provides a good balance for these two criteria.

...indoor and outdoor scenes. This database contains a total of 515 object categories; the pixel frequency counts for the top 100 categories are displayed in Fig. 15a. The data corpus is randomly split into 8,556 images for training and 1,000 for testing. The structure of the database is visualized in Fig. 14 using the same technique used to plot Fig. 4, where the image distance is measured by the ground-truth annotation. Notice the clear separation of indoor (left) and outdoor (right) scenes in this database. Moreover, the images are not evenly distributed in the space; they tend to reside around a

Fig. 14. Visualization of the SUN database [52] using 1,200 random images. Notice that the SUN database is larger but not necessarily denser than the LMO database. We use the spatial pyramid histogram intersection distance of the ground-truth annotation to measure the distance between the images and project them to a 2D space using scaled multidimensional scaling. Clearly, the images are clustered into indoor (left) and outdoor (right) scenes, and there is a smooth transition in between. In the outdoor cluster, we observe the change from urban to nature scenes as we move from top to bottom. Please visit http://people.csail.mit.edu/celiu/LabelTransfer/ to see the full resolution of the graphs.

We report results on the SUN database, a larger and more challenging data set, in Section 5.2.

5.1 LabelMe Outdoor Database

As mentioned in Section 3, the LMO database consists of 2,688 outdoor images, which have been randomly split into 2,488 training and 200 test images. The images are densely labeled with 33 object categories using the LabelMe online annotation tool. Our scene parsing system is illustrated in Fig. 8. The system retrieves a ⟨K, ε⟩-nearest neighbor set for the query image (Fig. 8a), and further selects the M voting candidates with the minimum SIFT matching scores. For illustration purposes, a small M is used here. The original RGB image, SIFT image, and annotation of the voting candidates are shown in Figs. 8c, 8d, and 8e, respectively. The SIFT flow field is visualized in Fig. 8f using the same visualization scheme as in [29], where hue indicates orientation and saturation indicates magnitude. After we warp the voting candidates into the query with respect to the flow field, the warped RGB (Fig. 8g) and SIFT image (Fig. 8h) are very close to the query in Fig. 8a and Fig. 8b, respectively. Combining the warped annotations in Fig. 8i, the system outputs the parsing of the query in Fig. 8j, which is close to the ground-truth annotation in Fig. 8k.

5.1.1 Evaluation Criterion

We use the average pixel-wise recognition rate (similar to precision or true positives) to evaluate the performance of our system, computed as

r = (1 / Σ_i m_i) Σ_i Σ_{p∈Λ_i} 1( o_i(p) = a_i(p), a_i(p) > 0 ),

where, for pixel p in image i, the ground-truth annotation is a_i(p) and the system output is o_i(p); for unlabeled pixels, a_i(p) = 0. Notation Λ_i is the image lattice for test image i, and m_i is the number of labeled pixels for image i (some pixels are unlabeled). We also compute the per-class average rate r_l for each object category l = 1, ..., L:

r_l = ( Σ_i Σ_{p∈Λ_i} 1( o_i(p) = a_i(p) = l ) ) / ( Σ_i Σ_{p∈Λ_i} 1( a_i(p) = l ) ).
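These two rates translate directly into code. In the minimal sketch below, the only assumption beyond the definitions above is that label 0 encodes "unlabeled":

    import numpy as np

    def recognition_rates(outputs, truths, num_classes):
        """Average pixel-wise and per-class recognition rates.
        outputs, truths : lists of (H, W) integer label maps per test image;
        label 0 marks "unlabeled" pixels, excluded from evaluation.
        """
        correct = labeled = 0
        per_class = np.zeros(num_classes)
        class_count = np.zeros(num_classes)
        for o, a in zip(outputs, truths):
            mask = a > 0                           # labeled pixels only
            correct += (o[mask] == a[mask]).sum()
            labeled += mask.sum()
            for l in range(1, num_classes + 1):
                sel = a == l
                class_count[l - 1] += sel.sum()
                per_class[l - 1] += (o[sel] == l).sum()
        rate = correct / labeled
        per_class = per_class / np.maximum(class_count, 1)
        return rate, per_class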
5.1.2 Results and Comparisons

Some label transfer results are shown in Fig. 10. The input image from the test set is displayed in Fig. 10a. We show the best match, its corresponding annotation, and the warped best match in Figs. 10b, 10c, and 10d, respectively. While the final labeling constitutes the integration of the top matches, the best match can provide the reader an intuition of the process and the final result. Notice how the warped image (Fig. 10d) looks similar to the input (Fig. 10a), indicating that SIFT flow successfully matches image structures. The scene parsing results output by our system are listed in Fig. 10e with the optimized parameter setting. The ground-truth user annotation is listed in Fig. 10f. Notice that the gray pixels in Fig. 10f are "unlabeled," but our system does not generate "unlabeled" output. For samples 1, 5, 6, 8, and 9, our system generates reasonable predictions for the pixels annotated as "unlabeled."

Fig. 8. System overview. For a query image, our system uses scene retrieval techniques such as [34] to find ⟨K, ε⟩-nearest neighbors in our database. We apply coarse-to-fine SIFT flow to align the query image to the nearest neighbors, and obtain the top M as voting candidates. (c), (d), (e): The RGB image, SIFT image, and user annotation of the voting candidates. (f): The inferred SIFT flow field, visualized using the color scheme shown on the left (hue: orientation; saturation: magnitude). (g), (h), and (i) are the warped versions of (c), (d), (e) with respect to the SIFT flow in (f). Notice the similarity between (a) and (g), and between (b) and (h). Our system combines the voting from multiple candidates and generates the scene parsing in (j) by optimizing the posterior. (k): The ground-truth annotation of (a).

C. Liu is with Microsoft Research New England, One Memorial Drive, Cambridge, MA 02142, and also with the Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, Cambridge, MA 02139. E-mail: celiu@microsoft.com.