IEEE International Conference on Computer Vision (ICCV), November 2011, Barcelona, Spain

IEEE International Conference on Computer Vision (ICCV), November 2011, Barcelona, Spain.

Latent Structured Models for Human Pose Estimation

Catalin Ionescu$^{1,3}$, Fuxin Li$^{2}$, Cristian Sminchisescu$^{1,3}$
$^{1}$ Faculty of Mathematics and Natural Sciences, University of Bonn
$^{2}$ Georgia Institute of Technology
$^{3}$ Institute for Mathematics of the Romanian Academy
{catalin.ionescu, fuxin.li, cristian.sminchisescu}@ins.uni-bonn.de

Abstract

We present an approach for automatic 3D human pose reconstruction from monocular images, based on a discriminative formulation with latent segmentation inputs. We advance the field of structured prediction and human pose reconstruction on several fronts. First, by working with a pool of figure-ground segment hypotheses, the prediction problem is formulated in terms of combined learning and inference over segment hypotheses and 3D human articular configurations. Besides constructing tractable formulations for the combined segment selection and pose estimation problem, we propose new augmented kernels that can better encode complex dependencies between output variables. Furthermore, we provide primal linear re-formulations based on Fourier kernel approximations, in order to scale up the non-linear latent structured prediction methodology. The proposed models are shown to be competitive in the HumanEva benchmark and are also illustrated in a clip collected from a Hollywood movie, where the model can infer human poses from monocular images captured in complex environments.

1. Introduction

Reconstructing the three-dimensional human pose and motion of people in the office, on the street, or outdoors, based on images acquired with a single (or even multiple) video camera(s), is one of the open problems in computer vision. The difficulties compound: people have many degrees of freedom, deform and articulate, and their appearance varies widely as they span a significant range of body proportions and clothing. When analyzing people in realistic imaging conditions, the backgrounds cannot be controlled and people can also be occluded by other objects or people.

The 2D to 3D imaging relations and their ambiguities in the articulated case under point-wise joint correspondences are relatively well understood both in the monocular and in the multi-camera setting. Several classes of methods exist: generative sampling strategies [8, 5], discriminative methods based on multi-valued predictors [1, 24], as well as methods that combine discriminative prediction and verification [20]. Techniques to reduce search complexity to low-dimensional non-linear manifolds have gained popularity [23, 25, 22], partly for efficiency considerations, but also as a means to handle measurement ambiguities or missing data resulting from occlusion. Output manifold assumptions are just one way to model structure. A more general option is to use structured kernels [13].

Recently there has been a trend towards operation in realistic environments where people are viewed against more complex backgrounds and in a more diverse set of poses. Handling such environments would in principle be possible by means of integrated systems that jointly combine person detection (or localization) and 3D reconstruction [13, 3, 2]. However, human detectors [10] cannot always detect general human poses, and even when they do, figure-ground needs to be resolved from the bounding box before 3D pose can be predicted reliably (it is agreed that predictors based on silhouettes generalize relatively well if the input quality is good and the training distribution matches sufficiently well the set of human poses typical of the problem domain [1, 24, 20]). Another approach would be to use more detailed 2D human part-based models for localization [19, 2, 11, 9]. But 2D pictorial models usually come with a relatively high false positive rate and encounter difficulties localizing people viewed under sharp 3D effects (foreshortening). Closest to our goal of simultaneously obtaining segmentations and 3D poses is Posecut [4], a pose estimation method which alternates between fitting a 3D skeletal model to a silhouette (standard generative fitting or alignment), and re-estimating the silhouette (solving a binary MRF) using the predicted skeleton boundary as a shape prior.

In this paper we pursue a constrained discriminative approach that jointly estimates a quality selector function over a set of latent figure/ground segmentations (obtained using parametric max-flow and ranked based on mid-level properties) and a structured predictor. We introduce several novel elements that touch upon structural modeling, efficiency,
and experimental realism.

Figure 1. Illustration of our human localization and 3D pose reconstruction framework. Given an image (left), we extract a set of putative figure-ground segmentations by applying constraints at different locations and spatial scales in (monocular) images, using the Constrained Parametric Min-Cuts (CPMC) algorithm [6]. The method is tuned towards segments that exhibit generic mid-level object regularities (convexity, boundary continuity), but uses no person priors. Different segments extracted by CPMC are shown in images 2, 3 and 5, 6 respectively. Given segment data together with 3D human pose information, we will jointly learn a latent structured model that knows both how to select relevant segments and how to predict accurate 3D human poses, conditioned on that input selection. Overall, we contribute with a latent formulation based on Kernel Dependency Estimation, with novel structured kernels that better encode correlations among the human body parts, and with Fourier kernel approximations that enable linear training for large datasets. Automatic 3D reconstructions obtained by our method are shown in images 4 and 7, with renderings based on synthetic graphics models.

First, a large set of figure-ground segments is generated for each image, at multiple locations and scales, using Constrained Parametric Min-Cuts (CPMC), a recent segmentation procedure that has proven to be effective in a number of image segmentation [6] and labeling tasks [15]. We cast the problem of automatic localization and 3D human pose estimation as combined inference over segments and human articular configurations that give optimal prediction.

Second, we give a novel yet general and tractable formulation for latent structured models (with latent kernel dependency estimation, l-KDE, as a special case), as applicable to localization and continuous state estimation. We also propose augmented kernels for human pose and show that these are quantitatively better at representing complex interdependencies among the body parts than their unstructured counterparts. The computational cost of using accurate non-linear kernels is one of the main challenges in scaling the methodology to large datasets.

Our third contribution is therefore the formulation of latent structured models like l-KDE or structured SVM [13] in the framework of random Fourier approximations [18, 26, 16]. We show that under a suitable change of variables the calculations become linear, with primal formulations, as well as gradient calculations, expressed directly in the linear space induced by the Fourier representation. We report quantitative studies in the HumanEva benchmark [21] and also show that our segmentation-based latent structured models produce promising 3D reconstruction results for people in a variety of complex poses, filmed against more complicated backgrounds than previously.

2. Segmentation-based Pose Prediction

Our goal is to investigate the continuous structured prediction problem, with human pose estimation as a special case, under multiple, imperfect input hypotheses. Input segments that partly align with the person boundaries contain important cues for prediction. However, among imperfect segments, it is not obvious which measurement would be a good indicator of the usefulness of a segment for pose prediction. Segments with the same 70% overlap (pixel intersection over union) to the ground truth can be very different: some may miss limbs of the person, some may miss the head, and some may cover the person and some background. Conceptually, pose estimation errors will differ among different imperfect segments, an observation confirmed by our experiments.

Because a clear-cut definition of segment quality remains elusive, in this paper we seek to learn a task-specific function from data, one that quantifies whether a correct pose can be predicted from a segment. Input quality is in this respect latent, and can only be indirectly inferred from the extent to which an input is effective for prediction. We cast the joint segment selection and pose prediction problem as estimating two functions, a pose predictor $f$ and a segment quality selector $g$. At test time, the quality function $g$ selects the most suitable segment, then $f$ performs output prediction on this segment. Namely,

$$y^* = f(I) = f\big(\arg\max_{h} g(r^I_h)\big) \tag{1}$$

where $y^*$ is a $D$-dimensional vector of 3D relative joint positions, and $I$ can either be an image or a region-of-interest in an image, assumed segmented into $N$ figure-ground hypotheses $S_I = \{r^I_1, \ldots, r^I_N\}$. By abuse of notation, we also use $r^I_i$ to represent the $d$-dimensional descriptor extracted on the segment; $y_I$ denotes the $D$-dimensional ground truth pose for image $I$ and $s_I$ is the ground truth segment.

Learning is performed by optimizing $f$ and $g$ jointly. A conceptual formulation can be:

$$\min_{f,g} \sum_I \sum_{h \in I} g(r^I_h)\,\|f(r^I_h) - y_I\|^2 + \lambda_g \|g\|^2 + \lambda_f \|f\|^2 \quad \text{s.t.} \quad g(r^I_h) \ge 0, \;\; \max_{h \in I} g(r^I_h) \ge 1 \;\; \forall I \tag{2}$$

where $\|g\|$ and $\|f\|$ are norms (e.g., RKHS norms [12]) to prevent over-fitting. The formulation is similar in spirit to the Multiple Instance Learning problem (e.g., [17]).

Minimization of the objective function requires that $g(r^I_h)\,\|f(r^I_h) - y_I\|^2$ is small. Therefore, the quality function $g$ should give higher scores to segments that have low prediction error. To make the formulation practical, $g$ needs to be always positive. Also, in order to avoid degeneracies, e.g., $g(r^I_h) \equiv 0$, at least one segment in each image must have a good score ($g(r^I_h) \gg 0$). We take a logistic formulation and select $g(r^I_h) = \frac{1}{1 + \exp(w_g^\top \Phi_g(r^I_h))}$ to ensure positiveness. However, it is not obvious how to set a convex constraint to avoid these degeneracies. For example, a constraint like $\sum_{h \in I} g(r^I_h) \ge 1$ is not convex. We choose instead to introduce constraints on the sum $\sum_{h \in I} w_g^\top \Phi_g(r^I_h)$. Since a smaller $w_g^\top \Phi_g(r^I_h)$ implies a larger $g(r^I_h)$, making the sum small makes some $g(r^I_h)$ large. We use the binomial log-likelihood loss $L(x) = \log(1 + 2\exp(x))$ as a soft penalty on large values of $\sum_{h \in I} w_g^\top \Phi_g(r^I_h)$. Putting it together, the optimization problem can be recast as follows:

$$\min_{f, w_g} \sum_I \sum_{h \in I} \frac{1}{1 + \exp(w_g^\top \Phi_g(r^I_h))}\,\|f(r^I_h) - y_I\|^2 + \lambda_s \sum_I \log\Big(1 + 2\exp\Big(\sum_{h \in I} w_g^\top \Phi_g(r^I_h)\Big)\Big) + \lambda_g \|w_g\|^2 + \lambda_f \|f\|^2 \tag{3}$$

where $\lambda_g$, $\lambda_f$ and $\lambda_s$ are regularization parameters.

A subtle point in the optimization is the balance between the first and the second term in the objective. As argued before, the second term tends to drive $g(r^I_h)$ large. But a large $g$ value for all segments leads to large total weights and a higher weighted prediction error. Balancing the two terms in the optimization provides an appropriate quality function $g$ which outputs large values for segments more suitable for prediction, and small values otherwise.

To solve the problem, we use alternating minimization on $f$ and $g$ (Algorithm 1). Estimation of $f$ and $g$ can then be done via any standard approach. We have experimented with both kernel models and linear models based on random Fourier features (described in Sec. 2.2). The estimation of $g$ is an unconstrained optimization which can be solved using an LBFGS algorithm, provided e.g. in the minFunc package [footnote 1]. The optimization for $f$ is a weighted regression, as the second term does not take effect. More expensive options for learning $f$, including a structured SVM model, are given in a companion technical report [14].

[footnote 1] http://people.cs.ubc.ca/schmidtm/Software/minFunc.html

Algorithm 1. Algorithm for latent structured learning.
1: {Initialization of $f$ using ground truth segments}
2: $\min_f \sum_I \|f(s_I) - y_I\|^2 + \lambda_f \|f\|^2$
3: while not converged do
4:     {Learn the segment ranker $g$}
5:     $\min_{w_g} \sum_I \sum_{h \in I} \frac{1}{1 + \exp(w_g^\top \Phi_g(r^I_h))}\,\|f(r^I_h) - y_I\|^2 + \lambda_s \sum_I \log\big(1 + 2\exp\big(\sum_{h \in I} w_g^\top \Phi_g(r^I_h)\big)\big) + \lambda_g \|w_g\|^2$
6:     $g(r^I_h) = \frac{1}{1 + \exp(w_g^\top \Phi_g(r^I_h))}$
7:     {Learn the pose predictor with the new weights $g$}
8:     $\min_{w_f} \sum_I \sum_{h \in I} g(r^I_h)\,\|w_f^\top \Phi_X(r^I_h) - \Phi_Y(y_I)\|^2 + \lambda_f \|w_f\|^2$
9:     $f(r^I_h) = \arg\min_y \|w_f^\top \Phi_X(r^I_h) - \Phi_Y(y)\|^2$
10: end while

2.1. Kernel Dependency Estimation

Complex inference problems with interdependent inputs and outputs require models that can represent correlations both in the input and in the output explicitly. An efficient solution to model dependencies is kernel dependency estimation (KDE), where multi-dimensional continuous outputs are non-linearly decorrelated by kernel PCA. Prediction is then redirected to the kernel principal component subspace. Since the dimensions of this new space are orthogonal, regression can be performed independently on each latent dimension. We use KDE as an alternative to independent-output regression for learning $f$. Based on (3) we consider a weighted KDE model with weights $g(r^I_h)$ assigned to each segment:

$$\min_{w_f} \sum_I \sum_{h \in I} g(r^I_h)\,\|w_f^\top \Phi_X(r^I_h) - \Phi_Y(y_I)\|^2 + \lambda_f \|w_f\|^2 \tag{4}$$

where $\Phi_X : \mathbb{R}^d \to \mathbb{R}^m$ is a non-linear map applied to the inputs and $\Phi_Y : \mathbb{R}^D \to \mathbb{R}^n$ is an orthogonal kernel PCA embedding of the targets. Remarkably, the learning problem (4) can be solved in closed form [7], with parameters given by

$$w_f = (M_X^\top \Gamma M_X + \lambda_f I)^{-1} M_X^\top \Gamma M_Y \tag{5}$$

where $\Gamma$ is the diagonal matrix with elements $g(r^I_h)$, $M_Y = [\Phi_Y(y_1) \ldots \Phi_Y(y_N)]^\top$ is the $N \times n$ matrix of outputs and, similarly, $M_X$ is the $N \times m$ matrix of input features.

Given an input and a model, the output is computed by solving for the pre-image of the projection in the original pose space. In this case we compute the minimum $\ell_2$ distance between the reconstruction and the output in the latent (feature) space

$$y^*_h = f(r^I_h) = \arg\min_y \|w_f^\top \Phi_X(r^I_h) - \Phi_Y(y)\|^2 \tag{6}$$

where in testing the segment is selected based on (1).

The KDE regression framework is simple and efficient in both training and testing. Since no inference is performed during training, only $n$ standard regressors (following kernel PCA on outputs) need to be estimated. The training is significantly faster than alternatives such as structural SVM (see our technical report [14]). Inference is also simpler
since the segment quality selector can be evaluated efficiently.

2.2. Random Fourier Approximations

Random Fourier approximations (RF) [18, 26, 16] provide an efficient methodology to create explicit feature transforms for non-linear kernel methods. In RF, explicit feature vectors are created for examples so that their inner products are Monte Carlo approximations of the kernel. This is important since linear methods typically scale linearly in the number of training examples, whereas kernel methods scale at least quadratically. Therefore, RF enables accurate training with large datasets in many circumstances.

The algorithm for the change of representation has two steps:
i) Generate $n$ random samples $\gamma_1, \ldots, \gamma_n$ from a distribution dependent on the Fourier transform of the kernel;
ii) For all examples, compute the random projection [16]

$$Z(u) = \big[\cos(q_{\gamma_1}(u) + b_1), \ldots, \cos(q_{\gamma_n}(u) + b_n)\big]^\top \tag{7}$$

where $q_\gamma(u)$ is an inner product function depending on the kernel and $b_i$ are uniform random samples drawn from $[0, 2\pi]$. The kernels used in this paper are the Gaussian kernel, where $q_\gamma(u) = \gamma^\top u$, for real-valued poses, and the skewed chi-square kernel for nonnegative-valued histogram descriptors of the input [16], with $q_\gamma(u) = \log(\gamma^\top u + c)$, where $c$ is the kernel parameter. For details see [18, 26, 16].

After applying the RF feature transform (7), learning is simply performed as linear least squares regression (or linear dependency estimation) on the RF feature matrix. The RF methodology can be applied to our KDE model in a straightforward manner, i.e. by using $\Phi_X = Z_X$ and $\Phi_Y = Z_Y$. Inference in the model remains non-convex, even for Fourier embeddings, and we optimize locally using a BFGS quasi-Newton method. For optimization, the gradient of the inference objective with respect to its argument $y$ needs to be computed. The gradient in direction $j$ is obtained analytically by differentiating the feature map component-wise, $\frac{\partial Z_Y^{(k)}(y)}{\partial y_j} = -\frac{\partial q_{\gamma_k}(y)}{\partial y_j} \sin(q_{\gamma_k}(y) + b_k)$, to give:

$$\nabla_j \tfrac{1}{2}\|w_f^\top \Phi_X(r) - \Phi_Y(y)\|^2 = -\Big(\frac{\partial Z_Y(y)}{\partial y_j}\Big)^{\!\top} \big(w_f^\top Z_X(r) - Z_Y(y)\big) = \sum_{k=1}^{n} \frac{\partial q_{\gamma_k}(y)}{\partial y_j} \sin(q_{\gamma_k}(y) + b_k)\,\big(w_f^{(k)\top} Z_X(r) - Z_Y^{(k)}(y)\big) \tag{8}$$

We use the superscript $(k)$ to denote the $k$-th column of a matrix (or the $k$-th component of a vector).

3. Augmented Output

In this section, our structured approach is extended to predict augmented outputs in addition to the pose vector. This helps regularize the result to valid poses. We draw from the intuition that for a valid human pose, the proportions of limb lengths are fixed. One straightforward approach would be to include these as constraints in the inference problem. However, when the pose is represented as 3D joint coordinates, one convex approach to enforce this constraint could be to use both $\ell_1$ distances and $\ell_1$ constraints, e.g., $\big|\,\|y_i - y_j\|_1 - \|y_l - y_k\|_1\big| \le L$. Inference with such non-smooth and complicated constraints is neither easy nor very elegant.

Figure 2. Illustration of the inference with the augmented kernel.

Instead, we propose to add auxiliary variables to the original output. The log limb length ratios (LLLR) $y_a = \log \frac{\|y_i - y_j\|}{V_{ijkl}\,\|y_k - y_l\|}$ of the relevant limbs are used as additional output dimensions, where $V_{ijkl}$ are empirical average limb length ratios. Then kernel dependency estimation is performed on the combined output $\tilde{y} = [y^\top \; y_a^\top]^\top$, consisting of 3D joint positions and auxiliary variables. During inference, the following problem is solved:

$$\arg\min_{\tilde{y}, h} \|w_f^\top \Phi_X(r_h) - \Phi_Y(\tilde{y})\|^2 + \|y_a\|^2 \tag{9}$$

Fig. 2 illustrates the need for regularization during inference. KDE embeds from an ambient output space to a high-dimensional space, where the original poses emerge as a manifold $\Phi_Y(y)$. However, the prediction $w_f^\top \Phi_X(r_h)$ can be any point in space. Inference boils down to finding the closest pre-image in the ambient space, i.e., mapping back to the manifold using the non-linear map $\Phi_Y(y)$. For instance, the projection of the predicted $w_f^\top \Phi_X(r_h)$ is shown in dark blue. However, since the manifold extends to regions with no training examples (extrapolation areas), the pre-image calculation may also not produce a valid pose, e.g. $w_f^\top \Phi_X(r_2)$ and $w_f^\top \Phi_X(r_3)$ in fig. 2. The correlation between the augmented outputs and the pose is captured by the nonlinear transform $\Phi_Y$ (they will be coupled with any orthogonal basis generated from kernel PCA), thus regularization on the augmented outputs directly translates into regularization of the ambient pose variables. This biases the pre-image map to regions of valid outputs.

From the training pose data, we compiled all the possible limb length ratios and plot the mean and standard deviation in fig. 3. It can be seen that many limb pairs have fixed ratios that do not change much across different subjects. We choose such consistent limb ratios, and predict a 162 output $y_a$ as constraint on the pose. The augmented output approach is not confined to a 3D representation of the pose based on joint positions. If joint angles need to be predicted, physical angle limits or other angle constraints (e.g. directional preference) can be included.

Figure 3. Empirical measures for the log of limb length ratio (LLLR) on the test set ground truth data show structure: mean (a) and standard deviation (b). In the lower row we show the $\ell_1$ error of the same measure (LLLR) between predictions and the ground truth test data with the KDEa (augmented) (c) and KDE (standard) model (d) (same color scale). KDEa predicts ratios closer to the ground truth, which partly explains its better overall performance.

4. Experiments

We run experiments in support of our three main contributions. First, we show that coupled latent segment selection and pose prediction performs better than two independent classification and pose estimation stages. Second, we show that a structured model like KDE boosts pose prediction performance, especially in conjunction with improved augmented kernels. Finally, we show that random Fourier techniques extend to a structured learning framework with nontrivial inference, like pose prediction, at no significant loss in performance and with the added benefit of scalability.

Dataset and Image Processing. We present a set of comprehensive experiments in the HumanEva-I dataset [21]. The dataset contains 5 motions (Box, Gestures, Jog, ThrowCatch and Walking) from 4 subjects. Accurate image and pose data (3D joint locations) are available for 3 subjects and 4 motions (we discard ThrowCatch since ground truth poses could not be accurately obtained). Together, the information from the 3 color cameras offers a relatively large number of images and associated 3D pose data that can be used for training and testing: in total around 34,000 examples (divided roughly equally into a train and a test set). Ground truth segments to evaluate predictor baselines are obtained from background subtraction using the original code provided with the dataset. We additionally remove shadows in order to obtain accurate bounding boxes and segments. Note that we do not exclude data where background subtraction cannot be perfectly obtained; we only discard invalid pose data. Performance on this dataset is measured, in the standard way, as mean 3D joint error, in mm.

Figure 4. Performance as a function of approximation dimensionality in input space. The output kernel of the KDE is kept fixed throughout the experiment at 2000 dimensions while the augmented kernel is approximated with 4000 dimensions. (Plot legend: KRR-init, LinKDE, KDE, LinKDEa, KDEa; horizontal axis ticks from 0 to 6000, vertical axis ticks from 45 to 65.)

Our method requires generating a pool of segments, for which we use the CPMC algorithm [6] with standard parameters. Segmentation on this dataset was quite successful, with a mean overlap (intersection over union) of the best segment at 72%. The image features used are pyramid block SIFT with 3 levels (2x2, 4x4 and 8x8 cells) with 9 gradient bin orientations (0-180 degrees, unsigned), extracted on the segment support in the image (not just its boundaries). These features are sensitive to the accuracy of the bounding box enclosing the segment, so morphological operations were applied to remove very thin, protruding regions from segments.

Pose Prediction on Ground-truth Inputs. To calibrate the baseline performance of our regressor, we first show results on both ground truth segments and CPMC's best overlapping segment (table 1). In all experiments that illustrate KDE, KDEa, LinKDE and LinKDEa for inference, initialization was performed using the same Kernel Ridge Regression model (KRR) trained to predict outputs independently. It can be seen that KDE improves over KRR, but it is also clear that it does not quite manage to get the proportions of the limbs right. Notice the boost in performance achieved by KDEa when the constraints were enforced. Figure 3 shows, in the first row, the distribution of the limb ratios and, in the second, the difference between the limb ratios obtained from predictions based on KDEa and KDE, with respect to the ground truth.

(left) Segments with best overlap to the ground truth
Motion        KRR     KDE     KDEa    LinKDE  LinKDEa
Box           91.36   82.50   75.95   82.88   76.33
Gestures      68.96   66.89   60.24   65.35   61.70
Jog           62.09   50.26   44.92   51.64   46.54
Walking       63.75   52.40   44.18   53.57   46.41
All motions   70.11   65.50   61.28   67.79   63.70

(right) Ground truth segments
Motion        KRR     KDE     KDEa    LinKDE  LinKDEa
Box           86.53   78.60   71.77   78.92   70.36
Gestures      81.69   76.05   71.56   76.81   72.17
Jog           54.17   45.73   40.16   45.93   41.13
Walking       55.27   44.48   37.73   45.66   38.79
All motions   65.24   63.93   58.48   63.03   59.90

Table 1. Mean per joint error (in mm) for different pose prediction models on HumanEva-I data, subjects 1-3, and data from all color cameras. Both activity-specific models and activity-independent models are reported. The left table represents the results on the segment with best overlap with ground truth for both training and testing. The right table shows the same results on ground truth segments for comparison.

By applying the random Fourier methodology to structured models like KDE and KDEa, with 400 and 6000 Fourier embedding dimensions on inputs, we obtain a performance comparable with the kernel counterparts (with the same parameters). Notice however that linear methods scale much better in the size of the dataset, both in training and in testing. We see that indeed, even for complex models like KDE and KDEa, the approximation holds well (fig. 4).

Combined Automatic Model. In order to assess the effectiveness of our segment selection procedure we have devised a set of baselines. The experiments were conducted on the best 5 segments based on overlap. Our aim was (i) to see if overlap is a good measure for selecting segments if the final goal is pose prediction and (ii) to assess the performance of our selection procedure. Results are shown in table 2. The first two columns show the 3D pose reconstruction error using the segment selection model described in section 2.1, with LinKDE and LinKRR for pose prediction. In this experiment we use only 2000 Fourier features for the input approximation. This explains the somewhat lower performance compared to our previous results. The initialization of the pose prediction models (1st step of the algorithm) is done using the ground truth segments. We have considered initialization using all segments, with the ground truth overlap as a segment
ranking function $g(r^I_h)$, but found that to work less well. Therefore the rest of the results shown in table 2 all use LinKDE for pose prediction.

To assess the performance of a purely detection-based approach to segment selection, we use the best overlap to the ground truth as segment selector. This is a sensible candidate for comparisons with a detection method since it is the perfect overlap-based selector. The third column in Table 2 shows the performance of the pose prediction model, tested on the segments with best overlap to the ground truth. The relatively low quality result in most cases means overlap is not necessarily a good selection criterion if one requires accurate pose prediction. In general, we usually only have segments with ground truth overlap of around 70%, even in controlled environments. The results show that the segment having the very best overlap score was not the most informative for pose. The 4th column of table 2 represents the best result we can achieve. This is computed by using the ground truth pose to select segments giving the lowest pose prediction error. The good results in this column indicate that our weighted KDE model is well trained and that a better selection procedure could improve it further. The last column shows the mean error for a model where the best 5 segments predict the output by equal voting.

Motion     LinKRR  LinKDE  Overlap  Best   Mean
Box        97.44   84.61   105.31   69.12  84.59
Gestures   65.17   60.42   101.99   53.01  61.56
Jog        64.40   52.46   61.51    42.39  55.32
Walking    71.82   56.12   67.97    44.67  58.10

Table 2. Pose prediction error results (in mm) for the combined model for segment selection and pose estimation. We also show a simple LinKRR model baseline. LinKDE is the combined model based on linear Fourier embeddings. The third column (Overlap) gives pose prediction results based on the ground truth overlap selection. 'Best' is the highest accuracy that could be achieved by the LinKDE model if the ground truth pose were known. 'Mean' gives another baseline scenario where the best 5 segments predict the output by equal voting.

Computational Efficiency. We approximate two different kernels, one over the inputs and the other over the outputs. The effect of this approximation is to transform a complexity that is quadratic in the size of the training set into a linear one. Training in this case consists of two steps: computation of the Fourier features for both inputs (of dimensionality $m$) and outputs (of dimensionality $n$), and solving a regression problem between the two. Let $N$ be the number of training examples, with $N \gg m$ and $N \gg n$. The complexity of LinKDE is $O(Nm) + O(m^2 n + m n^2)$ (we consider matrix inversion to have quadratic complexity for simplicity). Contrast this with $O(N^2)$ for the standard kernel method. For training sets beyond 10,000 examples, matrix inversion becomes difficult to perform. Moreover, for the combined segment selection and pose estimation model presented in Section 2 of the paper, the size of our problem increases 5-fold, since 5 segments are considered per image. The matrix inversion for our random Fourier formulation is independent of $N$ (though constructing the matrix is a linear operation in the number of examples).

Figure 5. Qualitative segmentation and 3D pose reconstructions on a clip collected from a Hollywood movie. We use LinKDE with the same input features and parameters as in the HumanEva evaluation, but with joint angle outputs. Using our Mocap system we have created a training set of 4000 examples, out of which 450 were qualitatively similar to the ones in the video and the remaining ones were different sitting and standing poses. The purpose of this experiment is to demonstrate the potential of the method in an un-instrumented environment and for more challenging poses.

In an experiment shown in table 3, we assess the impact of the input kernel approximation (here we fix the output approximation to 2000 dimensions) on training and testing times. Perhaps surprisingly, the testing time decreases with the dimensionality of the input approximation, an effect that can be explained by the increased gradient accuracy for higher-dimensional models. This is also true for the kernel version, but there the function evaluation is more costly, involving the input matrix, which is $O(N^2)$.

We also study the impact of the output approximation dimensionality on training and testing times. Table 4 shows our results for a dataset of walking motions, using the best segments with respect to overlap to the ground truth. Input dimensionality is fixed throughout the experiment at 4000. Training time is almost independent of the output approximation. Inference time however is affected, and roughly doubles when going from 500 to 6000 dimensions. Inference in the deterministic model is fast, but notice the small training set, which we use in order to be able to also perform exact calculations. The trend clearly reverses for larger datasets, where above a certain size exact calculations become infeasible.

5. Conclusion

We have presented a segmentation-based framework for automatic 3D human pose reconstruction in monocular images, based on a latent variable formulation for person localization in the image (by selecting over multiple figure-ground hypotheses) and 3D articular pose prediction. Learning the latent model is formulated as an alternating optimization. We give new formulations for latent kernel dependency estimation and show that such a methodology can be made scalable while preserving accuracy by means of linear Fourier approximations. We also introduce augmented structured kernels to improve the quality of output prediction. In extensive quantitative experiments we demonstrate that our model can jointly select accurate segments and provide promising automatic pose prediction results in the HumanEva benchmark. We also show results in a clip collected from a Hollywood movie, where more complex human poses were reconstructed. The augmented output provides a practical approach to incorporating pose constraints. The force one needs to exert to maintain a pose can also be computed with a physical model [5] and used as a prior for pose prediction. Such constraints have good potential to improve realism and will be investigated in future work.

Model             500     700     1000    3000    4000    6000    KDE
LinKDE-train      13      16      28      409     782     3167    109028
LinKDE-test       5739    5996    5574    5441    3831    2244    8811
LinKDE-test acc   73.56   73.03   71.99   68.28   67.32   66.58   65.57

Table 3. Running time as a function of approximation dimensionality of the input kernel (in seconds). The results are obtained on the entire dataset (all motions), with computations run on one Xeon core with 48GB RAM. Clearly training time is prohibitive using the deterministic method but manageable even for high approximation dimensions using the Fourier methodology. The observed decrease in computation time can be caused by a more accurate gradient, which makes the optimization more stable.
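The input-kernel approximation whose dimensionality is varied in Table 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it builds the random Fourier map of Eq. (7) for the Gaussian kernel only (with $q_\gamma(u) = \gamma^\top u$), then solves the weighted regression of Eq. (5) on toy data. The function names, the bandwidth `sigma`, the toy data, and the $\sqrt{2/n}$ scaling (the usual normalisation from the random features literature, not written out in Eq. (7)) are all assumptions made for illustration.

```python
import numpy as np


def gaussian_rf_map(U, n_features, sigma, rng):
    """Random Fourier map for the kernel exp(-||u - v||^2 / (2 sigma^2)).

    Implements the form of Eq. (7) with q_gamma(u) = gamma^T u.
    The sqrt(2 / n) factor is an assumed normalisation so that
    Z(u)^T Z(v) approximates the kernel value directly.
    """
    d = U.shape[1]
    # gamma_k drawn from the Fourier transform of the Gaussian kernel: N(0, I / sigma^2)
    gammas = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    # b_k drawn uniformly from [0, 2*pi]
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(U @ gammas + b)


def weighted_ridge(M_X, M_Y, g, lam):
    """Closed form of Eq. (5): (M_X^T G M_X + lam I)^{-1} M_X^T G M_Y, G = diag(g)."""
    m = M_X.shape[1]
    A = M_X.T @ (g[:, None] * M_X) + lam * np.eye(m)  # m x m; building it is linear in N
    B = M_X.T @ (g[:, None] * M_Y)
    return np.linalg.solve(A, B)  # cost of the solve does not depend on N


rng = np.random.default_rng(0)
N, d, m, sigma = 200, 5, 2000, 1.5  # toy sizes, chosen only for the demonstration
X = rng.normal(size=(N, d))

# Exact Gaussian kernel matrix, computed only to measure the approximation error.
sq = np.sum(X ** 2, axis=1)
K_exact = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / (2.0 * sigma ** 2))

Z = gaussian_rf_map(X, m, sigma, rng)
K_approx = Z @ Z.T
max_abs_err = np.max(np.abs(K_exact - K_approx))

# Weighted linear regression on the RF features, with toy targets and toy segment weights.
Y = rng.normal(size=(N, 3))
g = rng.uniform(0.1, 1.0, size=N)
lam = 1e-2
w_f = weighted_ridge(Z, Y, g, lam)
```

In this sketch the only operation whose cost grows with the number of examples is the construction of the $m \times m$ matrix, which matches the remark above that the matrix inversion itself is independent of $N$. Increasing `m` should reduce `max_abs_err` at the price of a larger linear system, which is the trade-off explored in Table 3.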
Model           500      700      1000     3000     5000     6000     KDE/KDEa
LinKDE-train    614.53   619.11   623.34   630.18   628.07   634.02   2219.12
LinKDE-test     285.22   288.13   289.37   437.95   613.03   698.71   642.36
LinKDEa-train   1524.99  1521.11  1520.84  1529.99  1528.16  1512.87  2224.12
LinKDEa-test    2409.01  2197.47  2525.41  2256.02  2782.98  3021.54  2444.29

Table 4. Computation time results (in seconds) for training and testing using our two LinKDE models with different numbers of random Fourier features approximating the output kernel. In the rightmost column we show the performance of standard KDE. Results are reported for a 4K dataset of walking motions, with calculations performed on a quad core Pentium processor. For LinKDEa, the kernel is computed as a sum of 2 approximations, where the approximation of the augmented component has 2000 Fourier dimensions. The input kernel was approximated using 4000 Fourier dimensions.

Acknowledgements: This work was supported by CNCSIS-UEFISCDI, under PNII-RU-RC-2/2009, and by the EC, MCEXT-025481. We thank Dragos Papava at IMAR for support with visualization and motion capture.

References

[1] A. Agarwal and B. Triggs. A local basis representation for estimating human pose from cluttered images. In ACCV, 2006.
[2] M. Andriluka, S. Roth, and B. Schiele. People Tracking-by-Detection and People-Detection-by-Tracking. In CVPR, 2008.
[3] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In ICCV, 2009.
[4] M. Bray, P. Kohli, and P. Torr. Posecut: Simultaneous segmentation and 3d pose estimation of humans using dynamic graph cuts. In ECCV, 2006.
[5] M. Brubaker and D. Fleet. The Kneed Walker for Human Pose Tracking. In CVPR, 2008.
[6] J. Carreira and C. Sminchisescu. Constrained Parametric Min-Cuts for Automatic Object Segmentation. In CVPR, 2010.
[7] C. Cortes, M. Mohri, and J. Weston. A general regression technique for learning transductions. In ICML, pages 153–160, New York, NY, USA, 2005. ACM.
[8] J. Deutscher, A. Blake, and I. Reid. Articulated Body Motion Capture by Annealed Particle Filtering. In CVPR, 2000.
[9] M. Eichner and V. Ferrari. We are family: Joint Pose Estimation of Multiple Persons. In ECCV, 2010.
[10] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32:1627–1645, 2010.
[11] V. Ferrari, M. Marin, and A. Zisserman. Pose Search: retrieving people using their pose. In CVPR, 2009.
[12] T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel methods in machine learning, Jan 2008.
[13] C. Ionescu, L. Bo, and C. Sminchisescu. Structural SVM for Visual Localization and Continuous State Estimation. In ICCV, 2009.
[14] C. Ionescu, F. Li, and C. Sminchisescu. Fourier structured learning. Technical report, INS, University of Bonn, 2011.
[15] F. Li, J. Carreira, and C. Sminchisescu. Object Recognition as Ranking Holistic Figure-Ground Hypotheses. In CVPR, 2010.
[16] F. Li, C. Ionescu, and C. Sminchisescu. Random Fourier approximations for skewed multiplicative histogram kernels. In LNCS (DAGM), September 2010.
[17] F. Li and C. Sminchisescu. Convex Multiple Instance Learning by Estimating Likelihood Ratio. In NIPS, 2010.
[18] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[19] B. Sapp, A. Toshev, and B. Taskar. Cascaded Models for Articulated Pose Estimation. In ECCV, 2010.
[20] L. Sigal, A. Balan, and M. J. Black. Combined discriminative and generative articulated pose and non-rigid shape estimation. In NIPS, 2007.
[21] L. Sigal and M. J. Black. Humaneva: Synchronized video and motion capture dataset for evaluation of articulated human motion. volume 1, 2006.
[22] L. Sigal, R. Memisevic, and D. Fleet. Shared Kernel Information Embedding for Discriminative Inference. In CVPR, 2010.
[23] C. Sminchisescu and A. Jepson. Generative Modeling for Continuous Non-Linearly Embedded Visual Inference. In ICML, pages 759–766, Banff, 2004.
[24] C. Sminchisescu, A. Kanaujia, and D. Metaxas. BM3E: Discriminative Density Propagation for Visual Tracking. PAMI, 2007.
[25] R. Urtasun, D. Fleet, A. Hertzmann, and P. Fua. Priors for people tracking in small training sets. In ICCV, 2005.
[26] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. In CVPR, 2010.