Discriminative Learning with Latent Variables for Cluttered Indoor Scene Understanding

Huayan Wang (Computer Science Department, Stanford University, CA, USA), Stephen Gould (Electrical Engineering Department, Stanford University, CA, USA), and Daphne Koller (Computer Science Department, Stanford University, CA, USA)

Abstract. We address the problem of understanding an indoor scene from a single image in terms of recovering the layouts of the faces (floor, ceiling, walls) and furniture. A major challenge of this task arises from the fact that most indoor scenes are cluttered by furniture and decorations, whose appearances vary drastically across scenes, and can hardly be modeled (or even hand-labeled) consistently. In this paper we tackle this problem by introducing latent variables to account for clutter, so that the observed image is jointly explained by the face and clutter layouts. Model parameters are learned in the maximum margin formulation, which is constrained by extra prior energy terms that define the role of the latent variables. Our approach enables taking into account and inferring indoor clutter layouts without hand-labeling of the clutter in the training set. Yet it outperforms the state-of-the-art method of Hedau et al. [4] that requires clutter labels.

1 Introduction

In this paper, we focus on holistic understanding of indoor scenes in terms of recovering the layouts of the major faces (floor, ceiling, walls) and furniture (Fig. 1). The resulting representation could be useful as a strong geometric constraint in a variety of tasks such as object detection and motion planning. Our work is in the spirit of recent work on holistic scene understanding, but focuses on indoor scenes.

For parameterizing the global geometry of an indoor scene, we adopt the approach of Hedau et al. [4], which models a room as a box. Specifically, given the inferred three vanishing points, we can generate a parametric family of boxes characterizing the layouts of the floor, ceiling and walls. The problem can be formulated as picking the box that best fits the image. However, a major challenge arises from the fact that most indoor scenes are cluttered by a lot of furniture and decorations. They often obscure the geometric structure of the scene, and also occlude boundaries between walls and the floor. Appearances and layouts of clutter can vary drastically across different indoor scenes, so it is extremely difficult (if not impossible) to model them consistently. Moreover, hand-labeling of the furniture and decorations for training can be an extremely time-consuming (e.g., delineating a chair by hand) and ambiguous task. For example, should windows and the rug be labeled as clutter?

K. Daniilidis, P. Maragos, N. Paragios (Eds.): ECCV 2010, Part IV, LNCS 6314, pp. 497-510, 2010. Springer-Verlag Berlin Heidelberg 2010

[...] the vanishing points, and using struct-SVM to pick the best box. However, they used supervised classification of surface labels [6] to identify clutter (furniture), and used the trained surface label classifier to iteratively refine the box layout estimation. Specifically, they use the estimated box layout to add features to supervised surface label classification, and use the classification result to lower the weights of "clutter" image regions in estimating the box layout. Thus their method requires the user to carefully delineate the clutter in the training set. In contrast, our latent variable formulation does not require any labeling of clutter, yet still accounts for it in a principled manner during learning and inference. We also design a richer set of joint features as well as a more efficient inference method, both of which help boost our performance.

Incorporating image context to aid certain vision tasks and to achieve holistic scene understanding has been receiving increasing attention and effort recently [3,5,6]. Our paper is another work in this direction that focuses on indoor scenes, which demonstrate some unique aspects due to the geometric and appearance constraints of the room.

Latent variables have been exploited in the computer vision literature in various tasks such as object detection, recognition and segmentation. They can be used to represent visual concepts such as occlusion [11], object parts [2], and image-specific color models [9]. Introducing latent variables into struct-SVM was shown to be effective in several applications [12]. It is also an interesting aspect of our work that the latent variables are used in direct correspondence with a concrete visual concept (clutter in the room), and we can visualize the inference result on the latent variables via the recovered furniture and decorations in the room.

2 Model

We begin by introducing notation to formalize our problem. We use x to denote the input variable, which is an image of an indoor scene; y to denote the output variable, which is the "box" characterizing the major faces (floor, walls, ceiling) of the room; and h to denote the latent variables, which specify the clutter layout of the scene.

For representing the face layout variable y we adopt the idea of [4]. Most indoor scenes are characterized by three dominant vanishing points. Given the positions of these points, we can generate a parametric family of "boxes". Specifically, taking a similar approach as in [4], we first detect long lines in the image, then find three dominant groups of lines corresponding to three vanishing points. In this paper we omit the details of these preprocessing steps, which can be found in [4] and [8]. As shown in Fig. 2, we compute the average orientation of the lines corresponding to each vanishing point, and distinguish the vanishing point corresponding to mostly horizontal lines, the one corresponding to mostly vertical lines, and the remaining one.

Fig. 2. Lower-left: we have 3 groups of lines (shown in R, G, B) corresponding to the 3 vanishing points respectively. There are also "outlier" lines (shown in yellow) which do not belong to any group. Upper-left: a candidate "box" specifying the boundaries between the ceiling, walls and floor is generated. Right: candidate boxes (in yellow frames) generated in this way and the hand-labeled ground-truth box layout (in green frame).

A candidate "box" specifying the face layouts of the scene can be generated by sending two rays from the first of these vanishing points, two rays from the second, and connecting the four intersections with the third. We use four real parameters to specify the positions of the four rays. Thus the positions of the vanishing points and the values of these parameters completely determine a box hypothesis, assigning each pixel a face label, which has five possible values: ceiling, left-wall, right-wall, front-wall, floor. Note that some of the face labels could be absent; for example one might only observe right-wall, front-wall and floor in an image. In that case, some parameter values would give rise to a ray that does not intersect with the extent of the image. Therefore we can represent the output variable y by only 4 dimensions thanks to the strong geometric constraint of the vanishing points. One can also think of y as the face labels for all pixels.

Footnote: There could be different design choices for parameterizing the "position" of a ray sent from a vanishing point. We use the position of its intersection with the image central line (the vertical and horizontal central lines for the two ray-casting vanishing points respectively). Note that y resides in a confined domain. For example, given the prior knowledge that the camera cannot be above the ceiling or beneath the floor, the two corresponding rays must be on different sides of it. Similar constraints also apply to [...]

We also define a base distribution p0(y) over the output space, estimated by fitting a multivariate Gaussian with diagonal covariance via maximum likelihood to the labeled boxes in the training set. The base distribution is used in our inference method.

To compactly represent the clutter layout variable h, we first compute an over-segmentation of the image using mean-shift [1]. Each image is segmented into a number (typically less than a hundred) of regions, and each region is assigned to either clutter or non-clutter. Thus the latent variable h is a binary vector with the same dimensionality as the number of regions in the image that resulted from the over-segmentation.

We now define the energy function that relates the image, the box and the clutter layouts:

    E_w(x, y, h) = w . Psi(x, y, h) + E_p(x, y, h)    (1)

Psi is a joint feature mapping that contains a rich set of features measuring the compatibility between the observed image and the box-clutter layouts, taking into account image cues from various aspects including color, texture, perspective consistency, and overall layout. w contains the weights for the features, which need to be learned. E_p is an energy term that captures our prior knowledge on the role of the latent variables. Specifically, it measures the appearance consistency of the major faces (floor and walls) when the clutter is taken out, and also takes into account the overall clutteredness of each face. Intuitively, it defines the latent variables (clutter) to be things that appear inconsistently in each of the major faces. Details about Psi and E_p are introduced in Section 3.3.

The problem of recovering the face and clutter layouts can be formulated as:

    (y*, h*) = argmax_{y,h} E_w(x, y, h)    (2)

3 Learning and Inference

3.1 Learning

Given the training set {(x_i, y_i)} with hand-labeled box layouts, we learn the parameters w discriminatively by adapting the large-margin formulation of struct-SVM [10,12]:

    min_w 1/2 ||w||^2 + C sum_i xi_i,    s.t. xi_i >= 0 and    (3)

    max_h E_w(x_i, y_i, h) >= max_h E_w(x_i, y, h) + 1 - xi_i / Delta(y, y_i)  for all y    (4)

where Delta(y, y_i) is the loss function that measures the difference between the candidate output and the ground truth. We use the pixel error rate (the percentage of pixels that are labeled differently by the two box layouts) as the loss function. E_p encodes the prior knowledge; it is fixed to constrain the learning process of the model parameters. Without the slack variables, the constraints (4) essentially state that, for each training image, any candidate box layout cannot better explain the image than the ground-truth layout. Maximizing the compatibility function over the latent variables gives the clutter layout that best explains the image and box layout under the current model parameters. Since the model can never fully explain the intrinsic complexity of real-world images, we have to slacken the constraints by the slack variables, which are scaled by the loss. [...]
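As a concrete illustration of the model above, the energy of Eq. (1) and the pixel-error loss used in constraint (4) can be sketched in a few lines. This is a minimal sketch, not the authors' code: `psi` and `prior` stand in for the joint feature mapping Psi and the prior term E_p, whose actual definitions (Section 3.3) are not reproduced here.

```python
import numpy as np

def energy(w, psi, prior, x, y, h):
    # Eq. (1): linear score of the joint features plus the fixed prior energy term.
    return float(np.dot(w, psi(x, y, h)) + prior(x, y, h))

def pixel_error_rate(face_labels_a, face_labels_b):
    # Loss Delta in (4): fraction of pixels assigned different face labels
    # by two box layouts (each array holds one face label per pixel).
    return float(np.mean(face_labels_a != face_labels_b))
```

Any callables with the right shapes can be plugged in for `psi` and `prior`, which is what makes the "prior as fixed constraint" versus "prior as learned feature" ablation in Section 4 a one-line change.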
[...] the inference problems (2), (5) and (6) cannot be solved analytically. In [4] there was no latent variable, and the space of y is still tractable for simple discretization, so the constraints for struct-SVM can be pre-computed for each training image before the main learning procedure. However, in our problem we are confronting the combinatorial complexity of y and h, which makes it impossible to pre-compute all constraints.

For inferring h given x and y, we use iterated conditional modes (ICM) [13]. Namely, we iteratively visit all segments, and flip a segment (between clutter and non-clutter) if it increases the objective value; we stop the process if no segment is flipped in the last iteration. To avoid local optima we start from multiple random initializations. For inferring both y and h, we use stochastic hill-climbing for y; the procedure is shown in Algorithm 2. The test-time inference problem (2) is handled similarly to the loss-augmented inference (5) but with a different objective. We can use a looser convergence criterion for (5) to speed up the process, as it has to be performed multiple times in learning. The overall inference process is shown in Algorithm 2.

Algorithm 2. Stochastic Hill-Climbing for Inference
 1: Input: image x
 2: Output: box layout y, clutter layout h
 3: for a number of random seeds do
 4:   sample y from p0(y); h = argmax_h E_w(x, y, h) by ICM
 5:   repeat
 6:     repeat
 7:       perturb a parameter of y as long as it increases the objective
 8:     until convergence
 9:     h = argmax_h E_w(x, y, h) by ICM
10:   until convergence
11: end for

In experiments we also compare to another inference method that does not make use of the continuous parameterization of y. Specifically, we independently generate a large number of candidate boxes from p0(y), infer the latent variables for each of them, and pick the one with the largest objective value. This is similar to the inference method used in [4], in which they independently evaluate all hypothesis boxes generated from a uniform discretization of the output space.

3.3 Priors and Features

For making use of color and texture information, we assign a 21-dimensional appearance vector to each pixel, including HSV values (3), RGB values (3), Gaussian filters at 3 scales on all 3 Lab color channels (9), Sobel filters in 2 directions at 2 scales (4), and Laplacian filters at 2 scales (2). Each dimension is normalized for each image to have zero mean and unit variance.
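The ICM step for inferring h given an image and a box hypothesis could look like the following sketch. All names are illustrative assumptions, not the authors' implementation; the `objective` callable would wrap E_w(x, y, .) with x and y held fixed.

```python
import numpy as np

def icm_clutter(objective, n_segments, rng, n_restarts=5):
    """Iterated conditional modes over binary clutter labels: flip one segment
    at a time whenever the flip increases the objective, stop when a full sweep
    makes no flip, and keep the best result over several random restarts."""
    best_h, best_val = None, -np.inf
    for _ in range(n_restarts):
        h = rng.integers(0, 2, size=n_segments)  # random clutter/non-clutter init
        val = objective(h)
        improved = True
        while improved:
            improved = False
            for i in range(n_segments):
                h[i] ^= 1                  # tentatively flip segment i
                new_val = objective(h)
                if new_val > val:
                    val, improved = new_val, True
                else:
                    h[i] ^= 1              # revert: the flip did not help
        if val > best_val:
            best_h, best_val = h.copy(), val
    return best_h, best_val
```

Because each region is revisited until no single flip helps, the result is a local optimum of the objective; the random restarts mirror the paper's multiple random initializations.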
Table 1. Quantitative results. Row 1: pixel error rate. Rows 2 & 3: the number of test images (out of 105) with pixel error rate under 20% and 10%. Column 1 ([6]): Hoiem et al.'s region labeling algorithm. Column 2 ([4] w/o): Hedau et al.'s method without clutter labels. Column 3 ([4] w/): Hedau et al.'s method with clutter labels (iteratively refined by supervised surface label classification [6]). The first 3 columns are directly copied from [4]. Column 4 (Ours w/o): our method (without clutter labels). Column 5 (w/o prior): our method without the prior knowledge constraint. Column 6 (h = 0): our method with latent variables fixed to zeros (assuming "no clutter"). Column 7 (GT): our method with latent variables fixed to the hand-labeled clutter in learning. Column 8 (UB): our method with latent variables fixed to the hand-labeled clutter in both learning and inference. In this case the testing phase is actually "cheating" by making use of the hand-labeled clutter, so the results can only be regarded as an upper bound. The deviations in the results are due to the randomization in both learning and inference. They are estimated over multiple runs of the entire procedure.
            [6]     [4] w/o  [4] w/   Ours w/o    w/o prior   h = 0       GT          UB
Pixel       28.9%   26.5%    21.2%    20.1±0.5%   21.5±0.7%   22.2±0.4%   24.9±0.5%   19.2±0.6%
under 20%   -       -        -        45          73          46          36          7
under 10%   -       -        -        22          53          20          23          7

4 Experimental Results

For experiments we use the same dataset as used in [4]. The dataset consists of 314 images, and each image has hand-labeled box and clutter layouts. They also provided the training-test split (209 for training, 105 for testing) on which they reported results in [4]. For comparison we use the same training-test split and achieve a pixel error rate of 20.1% without clutter labels, compared to 26.5% in [4] without clutter labels and 21.2% with clutter labels. Detailed comparisons are shown in Table 1 (the last four columns are explained in the following subsections).

In order to validate the effect of the prior knowledge in constraining the learning process, we take out the prior knowledge by adding its two terms as ordinary features and trying to learn their weights. The performance of recovering box layouts in this case is shown in Table 1, column 5. Although the difference between columns 4 and 5 (Table 1) is small, there are many cases where recovering more reasonable clutter does help in recovering the correct box layout. Some examples are shown in Figure 3, where the 1st and 2nd columns (from left) are the box and clutter layouts recovered by the learned model with prior constraints, and the 3rd and 4th columns are the result of learning without prior constraints. For example, in the case of the 3rd row (Fig. 3), the boundary between the floor and the front wall (the wall on the right) is correctly recovered even though it is largely occluded by the bed, which is correctly inferred as "clutter", and the [...]

Footnote: The dataset is available at https://netfiles.uiuc.edu/vhedau2/www/groundtruth.zip
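The summary statistics reported in Table 1 (mean pixel error rate over the test set, plus the number of test images falling under a 20% or 10% error threshold) could be computed as in this small sketch; the function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def summarize_errors(error_rates, thresholds=(0.20, 0.10)):
    """Mean pixel error rate plus, for each threshold,
    how many test images fall strictly below it."""
    e = np.asarray(error_rates, dtype=float)
    counts = {t: int(np.sum(e < t)) for t in thresholds}
    return float(e.mean()), counts
```

Run over the 105 per-image error rates of the test split, this yields one table column at a time.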
Fig. 4. Sample results comparing the clutter recovered by our method and the hand-labeled clutter in the dataset. The 1st and 2nd columns are the box and clutter layouts recovered by our method. The 3rd column (right) is the hand-labeled clutter layout. Our method usually recovers more objects as "clutter" than people would bother to delineate by hand. For example, the rug with a different appearance from the floor in the 2nd image, the paintings on the wall in the 1st, 4th, 5th and 6th images, and the tree in the 5th image. There are also major pieces of furniture that are missing in the hand-labels but recovered by our method, such as the cabinet and TV in the 1st image, everything in the 3rd image, and the small sofa in the 5th image.

Fig. 5. Left: comparison between the inference method described in Algorithm 2 and the baseline inference method that evaluates hypotheses independently (pixel error rate versus log10 of the number of calls to Psi). Right: empirical convergence evaluation for the learning procedure (pixel error rate versus the number of iterations in learning).

[...] around 6-7% (out of the 20.1%) of the pixel error is due to incorrect vanishing point detection results.

Footnote: The error rate of 6-7% is estimated by assuming a perfect model that always picks the best box generated from the vanishing point detection result, and performing stochastic hill-climbing to infer the box using the perfect model.

We compare our inference method (Algorithm 2) to the baseline method (evaluating hypotheses independently) described in Section 3.2. Fig. 5 (left) shows the average pixel error rate over the test set versus the number of calls to the joint feature mapping Psi in log scale, which could be viewed as a measure of running time. The difference between the two curves is actually huge, as we are plotting in log scale. For example, to reach the same error rate of 0.22 the baseline method would take roughly 10 times more calls to Psi.

As we have introduced many approximations into the learning procedure of latent struct-SVM, it is hard to theoretically guarantee the convergence of the learning algorithm. In Fig. 5 (right) we show the performance of the learned model on the test set versus the number of iterations in learning. Empirically the learning procedure approximately converges in a small number of iterations, although we do observe some fluctuation due to the randomized approximation used in the loss-augmented inference step of learning.

5 Conclusion

In this paper we addressed the problem of recovering the geometric structure as well as the clutter layout from a single image. We used latent variables to account for indoor clutter, and introduced prior terms to define the role of the latent variables and constrain the learning process. The box and clutter layouts recovered by our method can be used as a geometric constraint for subsequent tasks such as object detection and motion planning. For example, the box layout suggests relative depth information, which constrains the scale of the objects we would expect to detect in the scene. Our method (without clutter labels) outperforms the state-of-the-art method (with clutter labels) in recovering the box layout on the same dataset. And we are also able to recover the clutter layouts without hand-labeling of them in the training set.

Acknowledgements. This work was supported by the National Science Foundation under Grant No. RI-0917151, the Office of Naval Research under the MURI program (N000140710747) and the Boeing Corporation.

References

1. Comaniciu, D., Meer, P.: Mean shift: a robust approach toward feature space analysis. IEEE Transactions on PAMI 24(5) (2002)
2. Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE Transactions on PAMI (2010) (to appear)
3. Gould, S., Fulton, R., Koller, D.: Decomposing a scene into geometric and semantically consistent regions. In: ICCV (2009)
4. Hedau, V., Hoiem, D., Forsyth, D.: Recovering the spatial layout of cluttered rooms. In: ICCV (2009)
5. Heitz, G., Koller, D.: Learning spatial context: using stuff to find things. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 30-43. Springer, Heidelberg (2008)
6. Hoiem, D., Efros, A., Hebert, M.: Recovering surface layout from an image. IJCV 75(1) (2007)
7. Joachims, T., Finley, T., Yu, C.-N.: Cutting-plane training of structural SVMs. Machine Learning 77(1), 27-59 (2009)
8. Rother, C.: A new approach to vanishing point detection in architectural environments. IVC 20 (2002)
9. Shotton, J., Winn, J., Rother, C., Criminisi, A.: TextonBoost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV (2007)
10. Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y.: Large margin methods for structured and interdependent output variables. JMLR 6, 1453-1484 (2005)
11. Vedaldi, A., Zisserman, A.: Structured output regression for detection with partial truncation. In: NIPS (2009)
12. Yu, C.-N., Joachims, T.: Learning structural SVMs with latent variables. In: ICML (2009)
13. Besag, J.: On the statistical analysis of dirty pictures (with discussion). Journal of the Royal Statistical Society, Series B 48, 259-302 (1986)