Discriminative Learning with Latent Variables for Cluttered Indoor Scene Understanding

Huayan Wang, Stephen Gould, and Daphne Koller
Computer Science Department, Stanford University, CA, USA
Electrical Engineering Department, Stanford University, CA, USA

Abstract. We address the problem of understanding an indoor scene from a single image in terms of recovering the layouts of the faces (floor, ceiling, walls) and furniture. A major challenge of this task arises from the fact that most indoor scenes are cluttered by furniture and decorations, whose appearances vary drastically across scenes and can hardly be modeled (or even hand-labeled) consistently. In this paper we tackle this problem by introducing latent variables to account for clutter, so that the observed image is jointly explained by the face and clutter layouts. Model parameters are learned in the maximum margin formulation, which is constrained by extra prior energy terms that define the role of the latent variables. Our approach enables taking into account and inferring indoor clutter layouts without hand-labeling of the clutter in the training set. Yet it outperforms the state-of-the-art method of Hedau et al. [4], which requires clutter labels.

1 Introduction

In this paper, we focus on holistic understanding of indoor scenes in terms of recovering the layouts of the major faces (floor, ceiling, walls) and furniture (Fig. 1). The resulting representation could be useful as a strong geometric constraint in a variety of tasks such as object detection and motion planning.

Our work is in the spirit of recent work on holistic scene understanding, but focuses on indoor scenes. For parameterizing the global geometry of an indoor scene, we adopt the approach of Hedau et al. [4], which models a room as a box. Specifically, given the inferred three vanishing points, we can generate a parametric family of boxes characterizing the layouts of the floor, ceiling and walls. The problem can be formulated as picking the box that best fits the image.

However, a major challenge arises from the fact that most indoor scenes are cluttered by a lot of furniture and decorations. They often obscure the geometric structure of the scene, and also occlude the boundaries between the walls and the floor. Appearances and layouts of clutter can vary drastically across different indoor scenes, so it is extremely difficult (if not impossible) to model them consistently. Moreover, hand-labeling the furniture and decorations for training can be an extremely time-consuming (e.g., delineating a chair by hand) and ambiguous task. For example, should windows and the rug be labeled as clutter?
Hedau et al. [4] approach the problem by generating candidate boxes from the vanishing points, and using struct-SVM to pick the best box. However, they used supervised classification of surface labels [6] to identify clutter (furniture), and used the trained surface label classifier to iteratively refine the box layout estimation. Specifically, they use the estimated box layout to add features to the supervised surface label classification, and use the classification result to lower the weights of clutter image regions in estimating the box layout. Thus their method requires the user to carefully delineate the clutter in the training set. In contrast, our latent variable formulation does not require any labeling of clutter, yet still accounts for it in a principled manner during learning and inference. We also design a richer set of joint features as well as a more efficient inference method, both of which help boost our performance.

Incorporating image context to aid individual vision tasks and to achieve holistic scene understanding has been receiving increasing attention recently [3,5,6]. Our paper is another work in this direction that focuses on indoor scenes, which demonstrate some unique aspects due to the geometric and appearance constraints of the room. Latent variables have been exploited in the computer vision literature in various tasks such as object detection, recognition and segmentation. They can be used to represent visual concepts such as occlusion [11], object parts [2], and image-specific color models [9]. Introducing latent variables into struct-SVM was shown to be effective in several applications [12]. It is an interesting aspect of our work that the latent variables are in direct correspondence with a concrete visual concept (clutter in the room), and we can visualize the inference result on the latent variables via the recovered furniture and decorations in the room.

2 Model

We begin by introducing notation to formalize our problem. We use x to denote the input variable, which is an image of an indoor scene; y to denote the output variable, which is the box characterizing the major faces (floor, walls, ceiling) of the room; and h to denote the latent variables, which specify the clutter layout of the scene.

For representing the face layout variable y we adopt the idea of [4]. Most indoor scenes are characterized by three dominant vanishing points. Given the positions of these points, we can generate a parametric family of boxes. Specifically, taking a similar approach as in [4], we first detect long lines in the image, then find three dominant groups of lines corresponding to the three vanishing points. In this paper we omit the details of these preprocessing steps, which can be found in [4] and [8]. As shown in Fig. 2, we compute the average orientation of the lines corresponding to each vanishing point, and name the vanishing point corresponding to mostly horizontal lines vp0, the one corresponding to mostly vertical lines vp1, and the other one vp2.

Fig. 2. Lower-left: we have 3 groups of lines (shown in R, G, B) corresponding to the 3 vanishing points respectively. There are also outlier lines (shown in yellow) which do not belong to any group. Upper-left: a candidate box specifying the boundaries between the ceiling, walls and floor is generated. Right: candidate boxes (in yellow frames) generated in this way and the hand-labeled ground truth box layout (in green frame).

A candidate box specifying the face layout of the scene can be generated by sending two rays from vp0, two rays from vp1, and connecting the four intersections with vp2. We use 4 real parameters to specify the positions of the four rays sent from vp0 and vp1. (There could be different design choices for parameterizing the position of a ray sent from a vanishing point; we use the position of its intersection with the image central line, taking the vertical and horizontal central lines for vp0 and vp1, respectively.) Thus the positions of the vanishing points and the value of y completely determine a box hypothesis assigning each pixel a face label, which has five possible values: ceiling, left-wall, right-wall, front-wall, floor. Note that some of the face labels could be absent; for example, one might only observe right-wall, front-wall and floor in an image. In that case, some value of y would give rise to a ray that does not intersect with the extent of the image. (Note that y resides in a confined domain: for example, given the prior knowledge that the camera cannot be above the ceiling or beneath the floor, the two rays sent from vp1 must be on different sides of vp2; similar constraints also apply to vp0.) Therefore we can represent the output variable y by only 4 dimensions, thanks to the strong geometric constraint of the vanishing points. One can also think of y as the face labels for all pixels. We also define a base distribution p0(y) over the output space, estimated by fitting a multivariate Gaussian with diagonal covariance via maximum likelihood to the labeled boxes in the training set. The base distribution is used in our inference method.

To compactly represent the clutter layout variable h, we first compute an over-segmentation of the image using mean-shift [1]. Each image is segmented into a number (typically less than a hundred) of regions, and each region is assigned to either clutter or non-clutter. Thus the latent variable h is a binary vector with the same dimensionality as the number of regions in the image that resulted from the over-segmentation.

We now define the energy function that relates the image, the box and the clutter layouts:

    E(x, y, h) = wᵀΨ(x, y, h) - E₀(x, y, h),    (1)

where Ψ(x, y, h) is a joint feature mapping that contains a rich set of features measuring the compatibility between the observed image and the box-clutter layouts, taking into account image cues from various aspects including color, texture, perspective consistency, and overall layout. w contains the weights for the features, which need to be learned. E₀ is an energy term that captures our prior knowledge on the role of the latent variables. Specifically, it measures the appearance consistency of the major faces (floor and walls) when the clutter is taken out, and also takes into account the overall clutterness of each face. Intuitively, it defines the latent variables (clutter) to be things that appear inconsistently in each of the major faces. Details about Ψ and E₀ are introduced in Section 3.3.

The problem of recovering the face and clutter layouts can then be formulated as:

    (ŷ, ĥ) = argmax_{y,h} E(x, y, h).    (2)
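To make the ray parameterization concrete, the following sketch (Python with NumPy) computes the four corners of a candidate box from the three vanishing points and a parameter vector y. This is a minimal illustration, not the authors' code: the helper names and argument conventions are assumptions, while the use of the image central lines follows the description above; rasterizing the five face labels from the corners and vp2 is omitted.

    import numpy as np

    def line_through(p, q):
        # Homogeneous line through two 2-D points (cross product of the points).
        return np.cross(np.array([p[0], p[1], 1.0]), np.array([q[0], q[1], 1.0]))

    def intersection(l1, l2):
        # Intersection of two homogeneous lines, back to Cartesian coordinates.
        p = np.cross(l1, l2)
        return p[:2] / p[2]

    def box_corners(vp0, vp1, y, width, height):
        """Corners of the candidate box for y = (a, b, c, d): a, b are where
        the two rays from vp0 cross the vertical central line x = width/2;
        c, d are where the two rays from vp1 cross the horizontal central
        line y = height/2 (an assumed sign/ordering convention)."""
        a, b, c, d = y
        rays0 = [line_through(vp0, (width / 2.0, a)),
                 line_through(vp0, (width / 2.0, b))]
        rays1 = [line_through(vp1, (c, height / 2.0)),
                 line_through(vp1, (d, height / 2.0))]
        # The four ray intersections; connecting each of them to vp2 completes
        # the face boundaries, and each pixel then receives one of the five
        # face labels according to the region it falls in.
        return [intersection(r0, r1) for r0 in rays0 for r1 in rays1]

The homogeneous-coordinate formulation keeps the geometry to two cross products per corner; a full implementation would also have to handle rays that leave the image extent, which is exactly the case of absent face labels noted above.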
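Equations (1) and (2) translate directly into a scoring routine. The sketch below assumes psi(x, y, h) returns the joint feature vector Ψ and e0(x, y, h) the prior energy E₀; both are placeholders for the quantities defined in Section 3.3, and the brute-force maximization is only for exposition (the paper's actual inference is Algorithm 2 below).

    import numpy as np

    def energy(w, x, y, h, psi, e0):
        # E(x, y, h) = w^T Psi(x, y, h) - E0(x, y, h), as in Eq. (1)
        return float(np.dot(w, psi(x, y, h)) - e0(x, y, h))

    def map_inference(w, x, candidates, psi, e0):
        """Brute-force version of Eq. (2) over a finite set of (y, h) pairs;
        the continuous search over y is described in Section 3.2."""
        return max(candidates, key=lambda yh: energy(w, x, yh[0], yh[1], psi, e0))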
3 Learning and Inference

3.1 Learning

Given the training set {(xᵢ, yᵢ)} with hand-labeled box layouts, we learn the parameters w discriminatively by adapting the large margin formulation of struct-SVM [10,12]:

    min_{w,ξ} ½‖w‖² + C Σᵢ ξᵢ    s.t. ξᵢ ≥ 0 and    (3)

    max_h E(xᵢ, yᵢ, h) ≥ max_h E(xᵢ, y, h) + Δ(y, yᵢ) - ξᵢ    ∀i, ∀y,    (4)

where Δ(y, yᵢ) is the loss function that measures the difference between the candidate output y and the ground truth yᵢ. We use the pixel error rate (the percentage of pixels that are labeled differently by the two box layouts) as the loss function. E₀ encodes the prior knowledge; it is fixed to constrain the learning process of the model parameters. Without the slack variables, the constraints (4) essentially state that, for each training image, no candidate box layout can better explain the image than the ground truth layout. Maximizing the compatibility function over the latent variables gives the clutter layout that best explains the image and box layout under the current model parameters. Since the model can never fully explain the intrinsic complexity of real-world images, we have to slacken the constraints by the slack variables ξᵢ.

3.2 Inference

The inference problems (2), (5) and (6) cannot be solved analytically. In [4] there was no latent variable, and the space of y is still tractable for simple discretization, so the constraints for struct-SVM can be pre-computed for each training image before the main learning procedure. In our problem, however, we are confronting the combinatorial complexity of y and h, which makes it impossible to pre-compute all constraints.

For inferring h given x and y, we use iterated conditional modes (ICM) [13]. Namely, we iteratively visit all segments, and flip a segment (between clutter and non-clutter) if doing so increases the objective value; we stop the process if no segment is flipped in the last iteration. To avoid local optima we start from multiple random initializations.

For inferring both y and h, we use stochastic hill climbing for y. The test-time inference procedure (2) is handled similarly to the loss-augmented inference (5), but with a different objective. We can use a looser convergence criterion for (5) to speed up the process, as it has to be performed multiple times in learning. The overall inference process is shown in Algorithm 2.

Algorithm 2. Stochastic hill-climbing for inference
1: Input: x, w
2: Output: ŷ, ĥ
3: for a number of random seeds do
4:   sample y from p0(y)
5:   h ← argmax_h E(x, y, h) by ICM
6:   repeat
7:     repeat
8:       perturb a parameter of y as long as it increases the objective
9:     until convergence
10:    h ← argmax_h E(x, y, h) by ICM
11:  until convergence
12: end for

In experiments we also compare to another inference method that does not make use of the continuous parameterization of y. Specifically, we independently generate a large number of candidate boxes from p0(y), infer the latent variables for each of them, and pick the one with the largest objective value. This is similar to the inference method used in [4], in which all hypothesis boxes generated from a uniform discretization of the output space are evaluated independently.

3.3 Priors and Features

To make use of color and texture information, we assign a 21-dimensional appearance vector to each pixel, comprising HSV values (3), RGB values (3), Gaussian filters at 3 scales on all 3 Lab color channels (9), Sobel filters in 2 directions at 2 scales (4), and Laplacian filters at 2 scales (2). Each dimension is normalized within each image to have zero mean and unit variance.
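As a concrete illustration of this feature pipeline (not the authors' implementation), the sketch below builds the 21-dimensional per-pixel descriptor with SciPy and scikit-image. The paper does not specify the filter scales or which channel the Sobel and Laplacian filters act on, so the sigmas and the grayscale channel used here are assumptions.

    import numpy as np
    from scipy import ndimage
    from skimage import color

    def pixel_features(rgb):
        """21-dim per-pixel appearance vector (H x W x 21)."""
        rgb = rgb.astype(np.float64) / 255.0
        lab = color.rgb2lab(rgb)
        gray = color.rgb2gray(rgb)
        feats = [color.rgb2hsv(rgb), rgb]                 # HSV (3) + RGB (3)
        # Gaussian filters at 3 scales on each Lab channel (9)
        for sigma in (1.0, 2.0, 4.0):
            feats.append(np.stack([ndimage.gaussian_filter(lab[..., c], sigma)
                                   for c in range(3)], axis=-1))
        # Sobel filters in 2 directions at 2 scales (4); the "scale" is
        # realized here by pre-smoothing, which is an assumption
        for sigma in (1.0, 2.0):
            g = ndimage.gaussian_filter(gray, sigma)
            feats.append(np.stack([ndimage.sobel(g, axis=0),
                                   ndimage.sobel(g, axis=1)], axis=-1))
        # Laplacian (of Gaussian) at 2 scales (2)
        feats.append(np.stack([ndimage.gaussian_laplace(gray, s)
                               for s in (1.0, 2.0)], axis=-1))
        F = np.concatenate(feats, axis=-1)                # 3+3+9+4+2 = 21
        # normalize each dimension within the image: zero mean, unit variance
        F -= F.mean(axis=(0, 1))
        F /= F.std(axis=(0, 1)) + 1e-8
        return F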
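The ICM step of Section 3.2 can be written in a few lines. This is a sketch under the assumption that energy_fn scores a full (x, y, h) configuration as in Eq. (1); the number of restarts is arbitrary.

    import numpy as np

    def icm_clutter(x, y, n_segments, energy_fn, h_init=None, n_restarts=5, seed=0):
        """Infer the binary clutter vector h by iterated conditional modes:
        visit all segments, flip a segment when the flip increases
        E(x, y, h), and stop once a full sweep makes no flip. Random
        restarts reduce the risk of poor local optima; h_init, if given,
        is also tried as a starting point."""
        rng = np.random.default_rng(seed)
        starts = [] if h_init is None else [np.array(h_init, dtype=bool)]
        starts += [rng.integers(0, 2, n_segments).astype(bool)
                   for _ in range(n_restarts)]
        best_h, best_val = None, -np.inf
        for h in starts:
            val = energy_fn(x, y, h)
            improved = True
            while improved:
                improved = False
                for i in range(n_segments):
                    h[i] = not h[i]                    # tentative flip
                    new_val = energy_fn(x, y, h)
                    if new_val > val:
                        val, improved = new_val, True  # keep the flip
                    else:
                        h[i] = not h[i]                # revert
            if val > best_val:
                best_h, best_val = h.copy(), val
        return best_h, best_val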
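Algorithm 2 can then be sketched as below, reusing icm_clutter from the previous block. The helper sample_y (a draw from the base distribution p0) and the fixed perturbation step size are assumptions; the paper does not state how the perturbations of y are generated.

    import numpy as np

    def stochastic_hill_climb(x, n_segments, energy_fn, sample_y,
                              n_seeds=10, step=2.0):
        """Sketch of Algorithm 2: for each random seed, alternate between
        coordinate-wise hill climbing on the four box parameters and
        re-inferring the clutter vector by ICM, until neither improves."""
        best_y, best_h, best_val = None, None, -np.inf
        for _ in range(n_seeds):
            y = sample_y()                              # length-4 NumPy array
            h, val = icm_clutter(x, y, n_segments, energy_fn)
            while True:
                old_val = val
                improved = True
                while improved:                         # perturb parameters of y
                    improved = False
                    for k in range(4):
                        for delta in (-step, step):
                            y_new = y.copy()
                            y_new[k] += delta
                            v = energy_fn(x, y_new, h)
                            if v > val:
                                y, val, improved = y_new, v, True
                # re-infer the clutter layout for the updated box,
                # warm-started at the current h so the value cannot drop
                h, val = icm_clutter(x, y, n_segments, energy_fn, h_init=h)
                if val <= old_val:                      # outer convergence
                    break
            if val > best_val:
                best_y, best_h, best_val = y, h, val
        return best_y, best_h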
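The learning problem (3)-(4) can be attacked with standard machinery (cutting planes as in [7,10], or subgradient methods); the paper's exact procedure is not reproduced in the text above, so the following is only one plausible realization, not the authors' algorithm: a stochastic subgradient update in which the most violated constraint is found by loss-augmented inference (cf. (5)) and the ground-truth side is completed by ICM. The pixel-error loss follows the definition given for Eq. (4); label_map(y), rendering a box hypothesis into per-pixel face labels, is an assumed helper.

    import numpy as np

    def pixel_error(y_a, y_b, label_map):
        # Loss Delta: fraction of pixels labeled differently by the two boxes.
        return float(np.mean(label_map(y_a) != label_map(y_b)))

    def subgradient_epoch(w, data, psi, e0, n_segments, label_map,
                          sample_y, C=1.0, lr=1e-3):
        """One pass of a latent-SVM-style update over (x_i, y_i) pairs."""
        for x_i, y_i in data:
            energy_fn = lambda x, y, h: np.dot(w, psi(x, y, h)) - e0(x, y, h)
            # loss-augmented inference: approximately maximize
            # E(x_i, y, h) + Delta(y, y_i) over y and h (via Algorithm 2)
            aug_fn = lambda x, y, h: (energy_fn(x, y, h)
                                      + pixel_error(y, y_i, label_map))
            y_hat, h_hat = stochastic_hill_climb(x_i, n_segments, aug_fn, sample_y)
            # complete the ground-truth side: best clutter layout for y_i
            h_star, _ = icm_clutter(x_i, y_i, n_segments, energy_fn)
            margin = (energy_fn(x_i, y_i, h_star)
                      - energy_fn(x_i, y_hat, h_hat)
                      - pixel_error(y_hat, y_i, label_map))
            if margin < 0:   # constraint (4) is violated for this example
                w = w - lr * (w / len(data)
                              + C * (psi(x_i, y_hat, h_hat) - psi(x_i, y_i, h_star)))
        return w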
Table 1. Quantitative results (pixel error rate). Column 1 ([6]): Hoiem et al.'s region labeling algorithm. Column 2 ([4] w/o): Hedau et al.'s method without clutter labels. Column 3 ([4] w/): Hedau et al.'s method with clutter labels (iteratively refined by supervised surface label classification [6]). The first 3 columns are directly copied from [4]. Column 4 (Ours w/o): our method (without clutter labels). Column 5 (w/o prior): our method without the prior knowledge constraint. Column 6 (h=0): our method with latent variables fixed to zero (assuming no clutter). Column 7 (GT): our method with latent variables fixed to the hand-labeled clutter in learning. Column 8 (UB): our method with latent variables fixed to the hand-labeled clutter in both learning and inference; in this case the testing phase is actually cheating by making use of the hand-labeled clutter, so the results can only be regarded as an upper bound. The deviations in the results are due to the randomization in both learning and inference; they are estimated over multiple runs of the entire procedure.

                 [6]     [4] w/o  [4] w/   Ours w/o    w/o prior   h=0         GT          UB
    Pixel error  28.9%   26.5%    21.2%    20.1±0.5%   21.5±0.7%   22.2±0.4%   24.9±0.5%   19.2±0.6%

4 Experimental Results

For experiments we use the same dataset as in [4]. The dataset consists of 314 images, each with hand-labeled box and clutter layouts. The authors also provided the training-test split (209 for training, 105 for test) on which they reported results in [4]. (The dataset is available at https://netfiles.uiuc.edu/vhedau2/www/groundtruth.zip.) For comparison we use the same training-test split and achieve a pixel error rate of 20.1% without clutter labels, compared to 26.5% in [4] without clutter labels and 21.2% with clutter labels. Detailed comparisons are shown in Table 1 (the last four columns are explained in the following subsections).

In order to validate the effect of prior knowledge in constraining the learning process, we take out the prior knowledge constraint by adding the two terms of E₀ as ordinary features and trying to learn their weights. The performance of recovering box layouts in this case is shown in Table 1, column 5. Although the difference between columns 4 and 5 (Table 1) is small, there are many cases where recovering more reasonable clutter does help in recovering the correct box layout. Some examples are shown in Fig. 3, where the 1st and 2nd columns (from left) are the box and clutter layouts recovered by the model learned with prior constraints, and the 3rd and 4th columns are the results of learning without prior constraints. For example, in the case of the 3rd row (Fig. 3), the boundary between the floor and the front wall (the wall on the right) is correctly recovered even though it is largely occluded by the bed, which is correctly inferred as clutter.

Fig. 4. Sample results comparing the clutter recovered by our method with the hand-labeled clutter in the dataset. The 1st and 2nd columns are the box and clutter layouts recovered by our method; the 3rd column (right) shows the hand-labeled clutter layouts. Our method usually recovers more objects as clutter than people would bother to delineate by hand, for example, the rug with an appearance different from the floor in the 2nd image, the paintings on the wall in the 1st, 4th, 5th and 6th images, and the tree in the 5th image. There are also major pieces of furniture that are missing in the hand-labels but recovered by our method, such as the cabinet and TV in the 1st image, everything in the 3rd image, and the small sofa in the 5th image.
Fig. 5. Left: comparison between the inference method described in Algorithm 2 and the baseline inference method that evaluates hypotheses independently (pixel error rate versus log10 of the number of calls to Ψ). Right: empirical convergence evaluation for the learning procedure (pixel error rate versus the number of iterations in learning).

Around 6-7% (out of the 20.1%) of the pixel error is due to incorrect vanishing point detection results. (This error rate of 6-7% is estimated by assuming a perfect model that always picks the best box generated from the vanishing point detection results, and performing stochastic hill-climbing to infer the box using the perfect model.)

We compare our inference method (Algorithm 2) to the baseline method (evaluating hypotheses independently) described in Section 3.2. Fig. 5 (left) shows the average pixel error rate over the test set versus the number of calls to the joint feature mapping Ψ in log scale, which can be viewed as a measure of running time. The difference between the two curves is actually huge, as we are plotting in log scale: for example, to reach the same error rate of 0.22 the baseline method takes roughly 10 times more calls to Ψ.

As we have introduced many approximations into the learning procedure of latent struct-SVM, it is hard to theoretically guarantee the convergence of the learning algorithm. In Fig. 5 (right) we show the performance of the learned model on the test set versus the number of iterations in learning. Empirically the learning procedure approximately converges in a small number of iterations, although we do observe some fluctuation due to the randomized approximation used in the loss-augmented inference step of learning.

5 Conclusion

In this paper we addressed the problem of recovering the geometric structure as well as the clutter layout of an indoor scene from a single image. We used latent variables to account for indoor clutter, and introduced prior terms to define the role of the latent variables and constrain the learning process.
The box and clutter layouts recovered by our method can be used as a geometric constraint for subsequent tasks such as object detection and motion planning. For example, the box layout suggests relative depth information, which constrains the scale of the objects we would expect to detect in the scene. Our method (without clutter labels) outperforms the state-of-the-art method (with clutter labels) in recovering the box layout on the same dataset. We are also able to recover the clutter layouts without hand-labeling of them in the training set.

Acknowledgements. This work was supported by the National Science Foundation under Grant No. RI-0917151, the Office of Naval Research under the MURI program (N000140710747) and the Boeing Corporation.

References

1. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Transactions on PAMI 24(5) (2002)
2. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part based models. IEEE Transactions on PAMI (2010) (to appear)
3. Gould, S., Fulton, R., Koller, D.: Decomposing a scene into geometric and semantically consistent regions. In: ICCV (2009)
4. Hedau, V., Hoiem, D., Forsyth, D.: Recovering the spatial layout of cluttered rooms. In: ICCV (2009)
5. Heitz, G., Koller, D.: Learning spatial context: Using stuff to find things. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 30-43. Springer, Heidelberg (2008)
6. Hoiem, D., Efros, A., Hebert, M.: Recovering surface layout from an image. IJCV 75(1) (2007)
7. Joachims, T., Finley, T., Yu, C.-N.: Cutting-plane training of structural SVMs. Machine Learning 77(1), 27-59 (2009)
8. Rother, C.: A new approach to vanishing point detection in architectural environments. IVC 20 (2002)
9. Shotton, J., Winn, J., Rother, C., Criminisi, A.: TextonBoost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV (2007)
10. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. JMLR 6, 1453-1484 (2005)
11. Vedaldi, A., Zisserman, A.: Structured output regression for detection with partial occlusion. In: NIPS (2009)
12. Yu, C.-N., Joachims, T.: Learning structural SVMs with latent variables. In: ICML (2009)
13. Besag, J.: On the statistical analysis of dirty pictures (with discussions). Journal of the Royal Statistical Society, Series B 48, 259-302 (1986)