
PREPRINT
Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks
J. Tompson et al.

…single depth camera for real-time tracking. We believe the key technical contribution of this work is the creation of a novel pipeline for fast pose inference, which is applicable to a wide variety of articulable objects. An overview of this pipeline is shown in Figure 1. As a single example, training our system on an open-source linear-blend-skinning model of a hand with 42 degrees of freedom takes less than 10 minutes of human effort (18,000 frames at 30fps), followed by two days of autonomous computation time. Tracking and pose inference for a person's hand can then be performed in real-time using a single depth camera. Throughout our experiments, the camera is situated in front of the user at approximately eye-level height. The trained system can be readily used to puppeteer related objects such as alien hands, or real robot linkages, and as an input to 3D user interfaces [Stein et al. 2012].

2. RELATED WORK

A large body of literature is devoted to real-time recovery of pose for markerless articulable objects, such as human bodies, clothes, and man-made objects. As the primary contribution of our work is a fast pipeline for recovery of the pose of human hands in 3D, we will limit our discussion to the most relevant prior work.

Many groups have created their own dataset of ground-truth labels and images to enable real-time pose recovery of the human body. For example, Wang et al. [Wang et al. 2011] use the CyberGlove II Motion Capture system to construct a dataset of labeled hand poses from users, which are re-rendered as a colored glove with known texture. A similar colored glove is worn by the user at run-time, and the pose is inferred in real-time by matching the imaged glove in RGB to their database of templates [Wang and Popović 2009]. In later work, the CyberGlove data was repurposed for pose inference using template matching on depth images, without a colored glove. Wang et al. have recently commercialized their hand-tracking system (which is now proprietary and is managed by 3Gear Systems [3Gear 2014]) and now use a PrimeSense™ depth camera oriented above the table to recognize a large range of possible poses. This work differs from 3Gear's in a number of ways: 1) we attempt to perform continuous pose estimation rather than recognition by matching into a static and discrete database, and 2) we orient the camera facing the user, and so our system is optimized for a different set of hand gestures.

Also relevant to our work is that of Shotton et al. [Shotton et al. 2011], who used randomized decision forests to recover the pose of multiple bodies from a single frame by learning a per-pixel classification of the depth image into 38 different body parts. Their training examples were synthesized from combinations of known poses and body shapes. In similar work, Keskin et al. [Keskin et al. 2011] created a randomized decision forest classifier specialized for human hands. Lacking a dataset based on human motion capture, they synthesized a dataset from known poses in American Sign Language, and expanded the dataset by interpolating between poses. Owing to their prescribed goal of recognizing sign language signs themselves, this approach proved useful, but would not be feasible in our case as we require unrestricted hand poses to be recovered. In a follow-on work [Keskin et al. 2012], Keskin et al. presented a novel shape classification forest architecture to perform per-pixel part classification.

Several other groups have used domain knowledge and temporal coherence to construct methods that do not require any dataset for tracking the pose of complicated objects. For example, Weise et al. [Weise et al. 2009] devise a real-time facial animation system for range sensors, using salient points to deduce transformations on an underlying face model by framing it as energy minimization. In related work, Li et al. [Li et al. 2013] showed how to extend this technique to enable adaptation to the user's own facial expressions in an online fashion. Melax et al. [Melax et al. 2013] demonstrate a real-time system for tracking the full pose of a human hand by fitting convex polyhedra directly to range data using an approach inspired by constraint-based physics systems. Ballan et al. [Ballan et al. 2012] show how to fit high-polygon hand models to multiple camera views of a pair of hands interacting with a small sphere, using a combination of feature-based tracking and energy minimization. In contrast to our method, their approach relies upon inter-frame correspondences to provide optical flow and good starting poses for energy minimization.

Early work by Rehg and Kanade [Rehg and Kanade 1994] demonstrated a model-based tracking system that fits a high-degree-of-freedom articulated hand model to greyscale image data using hand-designed 2D features. Zhao et al. [Zhao et al. 2012] use a combination of IR markers and RGBD capture to infer offline (at one frame per second) the pose of an articulated hand model. Similar to this work, Oikonomidis et al. [Oikonomidis et al. 2011] demonstrate the utility of Particle Swarm Optimization (PSO) for tracking single and interacting hands, by searching for parameters of an explicit 3D model that reduce the reconstruction error of a z-buffer rendered model compared to an incoming depth image. Their work relies heavily on temporal coherence assumptions for efficient inference of the PSO optimizer, since the radius of convergence of their optimizer is finite. Unfortunately, temporal coherence cannot be relied on for robust real-time tracking, since dropped frames and fast-moving objects typically break this temporal coherency assumption. In contrast to their work, which used PSO directly for interactive tracking on the GPU at 4-15fps, our work shows that with relaxed temporal coherence assumptions in an offline setting, PSO is an invaluable offline tool for generating labeled data.

To our knowledge, there is no published prior work on using ConvNets to recover continuous 3D pose of human hands from depth data. However, several groups have shown ConvNets can recover the pose of rigid and non-rigid 3D objects such as plastic toys, faces and even human bodies. For example, LeCun et al. [LeCun et al. 2004] used ConvNets to deduce the 6DOF pose of 3D plastic toys by finding a low-dimensional embedding which maps RGB images to a six-dimensional space. Osadchy et al. [Osadchy et al. 2005] use a similar formulation to perform pose detection of faces via a non-linear mapping to a low-dimensional manifold. Taylor et al. [Taylor et al. 2011] use crowd-sourcing to build a database of similar human poses from different subjects, and then use ConvNets to perform dimensionality reduction to a low-dimensional manifold, where similarity between training examples is preserved. Lastly, Jiu et al. [Jiu et al. 2013] use ConvNets to perform per-pixel classifications of depth images (whose output is similar to [Shotton et al. 2011]) in order to infer human body pose, but they do not evaluate the performance of their approach on hand pose recognition. Couprie et al. [Couprie et al. 2013] use ConvNets to perform image segmentation of indoor scenes using RGB-D data. The significance of their work is that it shows that ConvNets can perform high-level reasoning from depth image features.

3. BINARY CLASSIFICATION

For the task of hand-background depth image segmentation we trained an RDF classifier to perform per-pixel binary segmentation on a single image. The output of this stage is shown in Figure 2. Decision forests are well-suited for discrete classification of

ACM Transactions on Graphics, Vol., No., Article, Publication date:.

Fig. 4: Algorithm Pipeline For Dataset Creation

Since this dataset creation stage is performed offline, we do not require it to be fast enough for interactive frame rates. Therefore we used a high-quality linear-blend-skinning (LBS) model [Šarić 2011] (shown in Figure 3) as an alternative to the simple ball-and-cylinder model of Oikonomidis et al. After reducing the LBS model's face count to increase render throughput, the model contains 1,722 vertices and 3,381 triangle faces, whereas the high-density source model contained 67,606 faces. While LBS fails to accurately model effects such as muscle deformation and skin folding, it represents many geometric details that ball-and-stick models cannot.

To mitigate the effects of self-occlusion we used three sensors (at viewpoints separated by approximately 45 degrees surrounding the user from the front), with attached vibration motors to reduce IR-pattern interference [Butler et al. 2012], and whose relative positions and orientations were calibrated using a variant of the Iterative Closest Point (ICP) algorithm [Horn 1987]. While we use all three camera views to fit the LBS model using the algorithm described above, we only use depth images taken from the center camera to train the ConvNet. The contributions from each camera were accumulated into an overall fitness function F(C), which includes two a priori terms (Φ(C) and P(C)) to maintain anatomically correct joint angles, as well as a data-dependent term Δ(I_s, C) from each camera's contribution. The fitness function is as follows:

    F(C) = Σ_{s=1..3} Δ(I_s, C) + Φ(C) + P(C)    (2)

where I_s is the s-th sensor's depth image and C is a 42-dimensional coefficient vector that represents the 6DOF position and orientation of the hand as well as its 36 internal joint angles (shown in Figure 3). P(C) is an interpenetration term (for a given pose) used to invalidate anatomically incorrect hand poses, and is calculated by accumulating the interpenetration distances of a series of bounding spheres attached to the bones of the 3D model. We define interpenetration distance as simply the sum of overlap between all pairs of interpenetrating bounding spheres. Φ(C) enforces a soft constraint that coefficient values stay within a predetermined range (C_min and C_max):

    Φ(C) = Σ_{k=1..n} w_k [max(C_k − C_k^max, 0) + max(C_k^min − C_k, 0)]

where w_k is a per-coefficient weighting term to normalize penalty contributions across different units (since we are including error terms for angle and distance in the same objective function). C_min and C_max were determined experimentally by fitting an unconstrained model to a discrete set of poses which represent the full range of motion for each joint. Lastly, Δ(I_s, C) of Equation (2) measures the similarity between the depth image I_s and the synthetic pose rendered from the same viewpoint:

    Δ(I_s, C) = Σ_{u,v} min(|I_s(u,v) − R_s(C, u, v)|, d_max)

where I_s(u,v) is the depth at pixel (u,v) of sensor s, R_s(C, u, v) is the synthetic depth given the pose coefficient C, and d_max is a maximum depth constant. The result of this function is a clamped L1-norm pixel-wise comparison. It should be noted that we do not include energy terms that measure silhouette similarity as proposed by Oikonomidis et al., since we found that when multiple range sensors are used these terms are not necessary.

5. FEATURE DETECTION

While Neural Networks have been used for pose detection of a limited set of discrete hand gestures (for instance discriminating between a closed fist and an open palm) [Nagi et al. 2011; Nowlan and Platt 1995], to our knowledge this is the first work that has attempted to use such networks to perform dense feature extraction of human hands in order to infer continuous pose. To do this we employ a multi-resolution, deep ConvNet architecture inspired by the work of Farabet et al. [Farabet et al. 2013], in order to perform feature extraction of 14 salient hand points from a segmented hand image. ConvNets are biologically inspired variants of multi-layered perceptrons, which exploit spatial correlation in natural images by extracting features generated by localized convolution kernels. Since depth images of hands tend to have many repeated local image features (for instance fingertips), ConvNets are well suited to perform feature extraction, since multi-layered feature banks can share common features, thereby reducing the number of required free parameters.

We recast the full hand-pose recognition problem as an intermediate collection of easier individual hand-feature recognition problems, which can be more easily learned by ConvNets. In early experiments we found that inferring mappings between depth image space and pose space directly (for instance measuring depth image geometry to extract a joint angle) yielded inferior results to learning with intermediate features. We hypothesize that one reason for this could be that learning intermediate features allows ConvNets to concentrate the capacity of the network on learning local features, and on differentiating between them. Using this framework the ConvNet is also better able to implicitly handle occlusions; by learning compound, high-level image features the ConvNet is able to infer the approximate position of an occluded and otherwise unseen feature (for instance, when making a fist, hidden fingertip locations can be inferred from the knuckle locations).

We trained the ConvNet architecture to generate an output set of heat-map feature images (Figure 5). Each feature heat-map can be viewed as a 2D Gaussian (truncated to have finite support), where the pixel intensity represents the probability of that feature occurring in that spatial location. The Gaussian UV mean is centered at one of 14 feature points of the user's hand. These features represent key joint locations in the 3D model (e.g., knuckles) and were chosen such that the inverse kinematics (IK) algorithm described in Section 6 can recover a full 3D pose.

Fig. 5: Depth image overlaid with 14 feature locations and the heat-map for one fingertip feature.
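The dataset-creation fitness function F(C) is straightforward to sketch numerically. The following is a minimal illustration (ours, not the authors' code) of the clamped L1 data term and the soft range penalty; `render_fn` and `penetration_fn` are hypothetical stand-ins for the z-buffer renderer R_s and the bounding-sphere interpenetration term P(C), which are not specified in enough detail here to reproduce.

```python
import numpy as np

def data_term(depth_img, rendered, d_max):
    # Clamped L1-norm pixel-wise comparison between a sensor depth image
    # I_s and the synthetic depth R_s rendered from the same viewpoint.
    return float(np.minimum(np.abs(depth_img - rendered), d_max).sum())

def range_penalty(C, c_min, c_max, w):
    # Soft constraint keeping each coefficient C_k inside [C_min_k, C_max_k],
    # with per-coefficient weights w_k to normalize angle vs. distance units.
    return float(np.sum(w * (np.maximum(C - c_max, 0.0)
                             + np.maximum(c_min - C, 0.0))))

def fitness(C, depth_imgs, render_fn, penetration_fn, c_min, c_max, w, d_max):
    # F(C): data terms accumulated over all sensors plus the two a priori
    # terms. `render_fn(C, s)` and `penetration_fn(C)` are hypothetical
    # stand-ins, supplied by the caller.
    data = sum(data_term(img, render_fn(C, s), d_max)
               for s, img in enumerate(depth_imgs))
    return data + range_penalty(C, c_min, c_max, w) + penetration_fn(C)
```

In an offline PSO setting as described above, `fitness` would be the objective evaluated per swarm particle; the clamping by `d_max` keeps gross depth outliers from dominating the sum.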
Fig. 6: Convolutional Network Architecture

We found that the intermediate heat-map representation not only reduces the required learning capacity, but also improves generalization performance, since failure modes are often recoverable. Cases contributing to high test-set error (where the input pose is vastly different from anything in the training set) are usually heat-maps that contain multiple hotspots. For instance, the heat-map for a fingertip feature might incorrectly contain multiple lobes corresponding to the other finger locations, as the network failed to discriminate among fingers. When this situation occurs, it is possible to recover a reasonable feature location by simple heuristics to decide which of these lobes corresponds to the desired feature (for instance, if another heat-map shows higher probability in those same lobe regions, then we can eliminate these as spurious outliers). Similarly, the intensity of the heat-map lobe gives a direct indication of the system's confidence for that feature, which is an extremely useful measure for practical applications.

Our multi-resolution ConvNet architecture is shown in Figure 6. The segmented depth image is initially pre-processed, whereby the image is cropped and scaled by a factor proportional to the mean depth value of the hand pixels, so that the hand is in the center and has a size that is depth-invariant. The depth values of each pixel are then normalized between 0 and 1 (with background pixels set to 1). The cropped and normalized image is shown in Figure 5. The preprocessed image is then filtered using local contrast normalization [Jarrett et al. 2009], which acts as a high-pass filter to emphasize geometric discontinuities. The image is then downsampled twice (each time by a factor of 2) and the same filter is applied to each image. This produces a multi-resolution band-pass image pyramid with 3 banks (shown in Figure 7), whose total spectral density approximates the spectral density of the input depth image. Since experimentally we have found that hand-pose extraction requires knowledge of both local and global features, a single-resolution ConvNet would need to examine a large image window and thus would require a large learning capacity; as such, a multi-resolution architecture is very useful for this application.

Fig. 7: Neural Network Input: Multi-Resolution Image Pyramid: (a) 96×96px, (b) 48×48px, (c) 24×24px

Fig. 8: High-Resolution Bank Feature Detector (each stage: N features × height × width)

The pyramid images are propagated through a 2-stage ConvNet architecture. The highest-resolution feature bank is shown in Figure 8. Each bank is comprised of 2 convolution modules, 2 piecewise non-linearity modules, and 2 max-pooling modules. Each convolution module uses a stack of learned convolution kernels with an additional learned output bias to create a set of output feature maps (please refer to [LeCun et al. 1998] for an in-depth discussion). The convolution window sizes range from 4x4 to 6x6 pixels. Each max-pooling [Nagi et al. 2011] module sub-samples its input image by taking the maximum in a set of non-overlapping rectangular windows. We use max-pooling since it effectively reduces computational complexity at the cost of spatial precision. The max-pooling windows range from 2x2 to 4x4 pixels. The nonlinearity is a Rectified Linear Unit (ReLU), which has been shown to improve training speed and discrimination performance in comparison to the standard sigmoid units [Krizhevsky et al. 2012]. Each ReLU activation module computes the following per-pixel non-linear function:

    f(x) = max(0, x)

Lastly, the outputs of the ConvNet banks are fed into the 2-stage neural network shown in Figure 9. This network uses the high-level convolution features to create the final 14 heat-map images; it does so by learning a mapping from localized convolution feature activations to probability maps for each of the bone features. In practice, these two large, fully-connected linear networks account for more than 80% of the total computational cost of the ConvNet. However, reducing the size of the network has a very strong impact on runtime performance. For this reason, it is important to find a good tradeoff between quality and speed. Another drawback of this method is that the neural network must implicitly learn a likelihood model for joint positions in order to infer anatomically correct output joints. Since we do not explicitly model joint connectivity in the network structure, the network requires a large amount of training data to perform this inference correctly.

Fig. 9: 2-Stage Neural Network To Create The 14 Heat Maps (with sizing of each stage shown)

ConvNet training was performed using the open-source machine learning package Torch7 [Collobert et al. 2011], which provides access to an efficient GPU implementation of the back-propagation algorithm for training neural networks. During supervised training we use stochastic gradient descent with a standard L2-norm error function, a batch size of 64, and the following learnable parameter update rule:

    Δw_i = γ Δw_{i−1} − λ w_i − η ∂L/∂w_i
    w_{i+1} = w_i + Δw_i    (3)

where w_i is a bias or weight parameter for each of the network modules for epoch i (with each epoch representing one pass over the entire training set) and ∂L/∂w_i is the partial derivative of the error function L with respect to the learnable parameter w_i, averaged over the current batch. We use a constant learning rate of η = 0.2, and a momentum term γ = 0.9 to improve the learning rate when close to the local minimum. Lastly, an L2 regularization factor of λ = 0.0005 is used to help improve generalization.

During ConvNet training the pre-processed database images were randomly rotated, scaled and translated to improve generalization performance [Farabet et al. 2013]. Not only does this technique effectively increase the size of the training set (which improves test/validation set error), it also helps improve performance for other users whose hand size is not well represented in the original training set. We perform this image manipulation in a background thread during batch training, so the impact on training time is minimal.

6. POSE RECOVERY

We formulate the problem of pose estimation from the heat-map output as an optimization problem, similar to inverse kinematics (IK). We extract 2D and 3D feature positions from the 14 heat-maps and then minimize an appropriate objective function to align 3D model features to each heat-map position.

To infer the 3D position corresponding to a heat-map image, we need to determine the most likely UV position of the feature in the heat-map. Although the ConvNet architecture is trained to output heat-map images of 2D Gaussians with low variance, in general they output multimodal grayscale heat-maps which usually do not sum to 1. In practice, it is easy to deduce a correct UV position by finding the maximal peak in the heat-map (corresponding to the location of greatest confidence). Rather than use the most likely heat-map location as the final location, we fit a Gaussian model to the maximal lobe to obtain sub-pixel accuracy. First we clamp heat-map pixels below a fixed threshold, to get rid of spurious outliers. We then normalize the resulting image so it sums to 1, fit the best 2D Gaussian using Levenberg-Marquardt, and use the mean of the resulting Gaussian as the UV position.

Once the UV position is found for each of the 14 heat-maps, we perform a lookup into the captured depth frame to obtain the depth component at the UV location. In case this UV location lies on a depth shadow, where no depth is given in the original image, we store the computed 2D position for this point in the original image space. Otherwise, we store its 3D point. We then perform unconstrained nonlinear optimization on the following objective function:

    f(m) = Σ_{i=1..n} [Δ_i(m)] + Φ(C)    (4)

    Δ_i(m) = ||(u,v,d)_i^t − (u,v,d)_i^m||_2   if d_i^t ≠ 0
    Δ_i(m) = ||(u,v)_i^t − (u,v)_i^m||_2       otherwise

where (u,v,d)_i^t is the target 3D heat-map position of feature i, and (u,v,d)_i^m is the model feature position for the current pose estimate. Equation (4) is an L2 error norm in 3D or 2D, depending on whether or not the given feature has a valid depth component associated with it. We then use a simple linear accumulation of these feature-wise error terms, as well as the same linear penalty constraint (Φ(C)) used in Section 4.

We use PrPSO to minimize Equation (4). Since function evaluations for each swarm particle can be parallelized, PrPSO is able to run in real-time at interactive frame rates for this stage. Furthermore, since a number of the 42 coefficients from Section 4 contribute only subtle behavior to the deformation of the LBS model in real time, we found that removing coefficients describing finger twist, and coupling the last two knuckles of each finger into a single angle coefficient, significantly reduces the function evaluation time of (4) without noticeable loss in pose accuracy. Therefore, we reduce the complexity of the model to 23 DOF during this final stage. Fewer than 50 PrPSO iterations are required for adequate convergence.

This IK approach has one important limitation: the UVD target position may not be a good representation of the true feature position. For instance, when a feature is directly occluded by another feature, the two features will incorrectly share the same depth value (even though one is in front of the other). However, we found that for a broad range of gestures this limitation was not noticeable. In future work we hope to augment the ConvNet output with a learned depth offset to overcome this limitation.

7. RESULTS

For the results to follow, we test our system using the same experimental setup that was used to capture the training data; the camera is in front of the user (facing them) and is at approximately eye-level height. We have not extensively evaluated the performance of our algorithm in other camera setups.

The RDF classifier described in Section 3 was trained using 6,500 images (with an additional 1,000 validation images held aside for tuning of the RDF meta-parameters) of a user performing typical one- and two-handed gestures (pinching, drawing, clapping, grasping, etc.). Training was performed on a 24-core machine for approximately 12 hours. For each node in the tree, 10,000 weak-learners were sampled. The error ratio of the number of incorrect pixel labels to the total number of hand pixels in the dataset, for varying tree counts and tree heights, is shown in Figure 10.

Fig. 10: RDF Error

We found that 4 trees with a height of 25 was a good tradeoff of classification accuracy versus speed. The validation set classification error for 4 trees of depth 25 was 4.1%. Of the classification errors, 76.3% were false positives and 23.7% were false negatives. We found that in practice, small clusters of false-positive pixel labels can be easily removed using median filtering and blob detection. The common classification failure cases occur when the hand is occluded by another body part (causing false positives), or when the elbow is much closer to the camera than the hand (causing false positives on the elbow). We believe this inaccuracy results from the training set not containing any frames with these poses. A more comprehensive dataset, containing examples of these poses, should improve performance in future.

Fig. 11: Dataset Creation: Objective Function Data (with Libhand model [Šarić 2011]): (a) Sensor Depth, (b) Synthetic Depth, (c) Per-Pixel Error

Since we do not have a ground-truth measure for the 42 DOF hand model fitting, quantitative evaluation of this stage is difficult. Qualitatively, the fitting accuracy was visually consistent with the underlying point cloud. An example of a fitted frame is shown in Figure 11. Only a very small number of poses failed to fit correctly; for these difficult poses, manual intervention was required. One limitation of this system was that the frame rate of the PrimeSense™ camera (30fps) was not enough to ensure sufficient temporal coherence for correct convergence of the PSO optimizer. To overcome this, we had each user move their hands slowly during training data capture. Using a workstation with an Nvidia GTX 580 GPU and a 4-core Intel processor, fitting each frame required 3 to 6 seconds.

Fig. 12: Sample ConvNet Test Set images

The final database consisted of 76,712 training set images, 2,421 validation set images and 2,000 test set images with their corresponding heat-maps, collected from multiple participants. A small sample of the test set images is shown in Figure 12. The ConvNet training took approximately 24 hours, where early stopping is performed after 350 epochs to prevent overfitting. ConvNet hyperparameters, such as learning rate, momentum, L2 regularization, and architectural parameters (e.g., max-pooling window size or number of stages), were chosen by coarse meta-optimization to minimize validation-set error. 2 stages of convolution (at each resolution level) and 2 fully-connected neural network stages were chosen as a tradeoff between numerous performance characteristics: generalization performance, evaluation time, and model complexity (or ability to infer complex poses).

Fig. 13: ConvNet Learning Curve

Figure 13 shows the mean squared error (MSE) after each epoch. The MSE was calculated by taking the mean of the sum-of-squared differences between the calculated 14 feature maps and the corresponding target feature maps.

The mean UV error of the ConvNet heat-map output on the test set data was 0.41px (with standard deviation of 0.35px) on the 18x18 resolution heat-map image¹. After each heat-map feature was translated to the 640x480 depth image, the mean UV error was 5.8px (with standard deviation of 4.9px). Since the heat-map downsampling ratio is depth dependent, the UV error improves as the hand approaches the sensor. For applications that require greater accuracy, the heat-map resolution can be increased, for better spatial accuracy at the cost of increased latency and reduced throughput.

¹ To calculate this error we used the technique described in Section 6 to calculate the heat-map UV feature location, and then calculated the error distance between the target and ConvNet output locations.
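The heat-map UV errors above are computed from the sub-pixel refinement of Section 6. A minimal sketch of that refinement follows (ours, not the authors' code), using SciPy's Levenberg-Marquardt solver; the clamp threshold and the isotropic single-sigma Gaussian model are our own simplifying assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

def heatmap_uv(hm, clamp_frac=0.1):
    # Sub-pixel UV extraction from one ConvNet heat-map: clamp pixels below
    # a fixed threshold (spurious outliers), normalize the image so it sums
    # to 1, then fit a 2D Gaussian to the maximal lobe with
    # Levenberg-Marquardt and return the Gaussian mean as the UV position.
    hm = np.where(hm < clamp_frac * hm.max(), 0.0, hm)
    hm = hm / hm.sum()
    v0, u0 = np.unravel_index(np.argmax(hm), hm.shape)  # integer-pixel peak
    vv, uu = np.mgrid[0:hm.shape[0], 0:hm.shape[1]]

    def residuals(p):
        amp, mu_u, mu_v, sigma = p
        g = amp * np.exp(-((uu - mu_u) ** 2 + (vv - mu_v) ** 2)
                         / (2.0 * sigma ** 2))
        return (g - hm).ravel()

    _, mu_u, mu_v, _ = least_squares(
        residuals, x0=[float(hm.max()), float(u0), float(v0), 1.0],
        method="lm").x
    return mu_u, mu_v  # sub-pixel (u, v)
```

The fitted amplitude (discarded here) could serve as the per-feature confidence measure mentioned in Section 5, and the returned UV would then be looked up in the depth frame to build the UVD targets of Equation (4).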
Table I.: Heat-Map UV Error by Feature Type

    Feature Type             Mean (px)   STD (px)
    Palm                     0.33        0.30
    Thumb Base & Knuckle     0.33        0.43
    Thumb Tip                0.39        0.55
    Finger Knuckle           0.38        0.27
    Finger Tip               0.54        0.33

Fig. 14: Real-Time Tracking Results: (a) Typical Hardware Setup, (b) Depth with Heat-Map Features, (c) ConvNet Input and Pose Output

Table I shows the UV accuracy for each feature type. Unsurprisingly, we found that the ConvNet architecture had the most difficulty learning fingertip positions, where the mean error is 61% higher than the accuracy of the palm features. The likely cause for this inaccuracy is twofold. Firstly, the fingertip positions undergo a large range of motion between various hand poses, and therefore the ConvNet must learn a more difficult mapping between local features and fingertip positions. Secondly, the PrimeSense™ Carmine 1.09 depth camera cannot always recover the depth of small surfaces such as fingertips. The ConvNet is able to learn this noise behavior, and is actually able to approximate fingertip location in the presence of missing data; however, the accuracy for these poses is low.

The computation time of the entire pipeline is 24.9ms, which is within our 30fps performance target. Within this period: decision forest evaluation takes 3.4ms, depth image preprocessing takes 4.7ms, ConvNet evaluation takes 5.6ms and pose estimation takes 11.2ms. The entire pipeline introduces approximately one frame of latency. For an example of the entire pipeline running in real-time, as well as puppeteering of the LBS hand model, please refer to the supplementary video (screenshots from this video are shown in Figure 14).

Fig. 15: Fail Cases: RGB ground-truth (top row), inferred model [Šarić 2011] pose (bottom row)

Figure 15 shows three typical fail cases of our system. In 15a), finite spatial precision of the ConvNet heat-map results in fingertip positions that are not quite touching. In 15b), no similar pose exists in the database used to train the ConvNet, and for this example the network generalization performance was poor. In 15c), the PrimeSense™ depth camera fails to detect the ring finger (the surface area of the fingertip presented to the camera is too small, and the angle of incidence in the camera plane is too shallow), and the ConvNet has difficulty inferring the fingertip position without adequate support in the depth image, which results in an incorrectly inferred position.

Fig. 16: Hand Shape/Size Tolerance: RGB ground-truth (top row), depth with annotated ConvNet output positions (bottom row)

Figure 16 shows that the ConvNet output is tolerant of hand shapes and sizes that are not well represented in the ConvNet training set. The ConvNet and RDF training sets did not include any images for user b) and user c) (only user a)). We have only evaluated the system's performance on adult subjects. We found that adding a single per-user scale parameter, to approximately adjust the size of the LBS model to a user's hand, helped the real-time IK stage better fit to the ConvNet output.

Comparison of the relative real-time performance of this work with relevant prior art, such as that of [3Gear 2014] and [Melax et al. 2013], is difficult for a number of reasons. Firstly, [Melax et al. 2013] uses a different capture device, which prevents fair comparison, as it is impossible (without degrading sensor performance by using mirrors) for multiple devices to see the hand from the same viewpoint simultaneously. Secondly, no third-party ground-truth database of poses with depth frames exists for human hands, so comparing the quantitative accuracy of numerous methods against a known baseline is not possible. More importantly, however, the technique utilized by [3Gear 2014] is optimized for an entirely different use case, and so fair comparison with their work is very difficult. [3Gear 2014] utilizes a vertically mounted camera, can track multiple hands simultaneously, and is computationally less expensive than the method presented in our work.

Figure 17 examines the performance of this work against the proprietary system of [3Gear 2014] (using the fixed-database version of the library), for 4 poses chosen to highlight the relative difference between the two techniques (images used with permission from 3Gear). We captured this data by streaming the output of both systems simultaneously (using the same RGBD camera). We mounted the camera vertically, as this is required for [3Gear 2014]; however, our training set did not include any poses from this orientation. Therefore, we expect our system to perform sub-optimally for this very different use case.

Fig. 17: Comparison with state-of-the-art commercial system: RGB ground-truth (top row), this work's inferred model [Šarić 2011] pose (middle row), [3Gear 2014] inferred model pose (bottom row) (images used with permission from 3Gear).

8. FUTURE WORK

As indicated in Figure 16, qualitatively we have found that the ConvNet generalization performance to varying hand shapes is acceptable but could be improved. We are confident we can make improvements by adding more training data from users with different hand sizes to the training set.

For this work, only the ConvNet forward-propagation stage was implemented on the GPU. We are currently working on implementing the entire pipeline on the GPU, which should improve the performance of the other pipeline stages significantly. For example, the GPU ConvNet implementation requires 5.6ms, while the same network executed on the CPU (using optimized multi-threaded C++ code) requires 139ms.

The current implementation of our system can track two hands only if they are not interacting. While we have determined that the dataset generation system can fit multiple strongly interacting hand poses with sufficient accuracy, it is future work to evaluate the neural network recognition performance on these poses. Likewise, we hope to evaluate the recognition performance on hand poses involving interactions with non-hand objects (such as pens and other man-made devices).

While the pose recovery implementation presented in this work is fast, we hope to augment this stage by including a model-based fitting step that trades convergence radius for fit quality. Specifically, we suspect that replacing our final IK stage with an energy-based local optimization method, inspired by the work of Li et al. [Li et al. 2008], could allow our method to recover second-order surface effects, such as skin folding and skin-muscle coupling, from very limited data, and still with low latency. In addition to inference, such a localized energy-minimizing stage would enable improvements to the underlying model itself. Since these localized methods typically require good registration, our method, which gives correspondence from a single image, could advance the state-of-the-art in non-rigid model capture. Finally, we hope to augment our final IK stage with some form of temporal pose prior to reduce jitter; for instance, using an extended Kalman filter as a post-processing step to clean up the ConvNet feature output.

9. CONCLUSION

We have presented a novel pipeline for tracking the instantaneous pose of articulable objects from a single depth image. As an application of this pipeline, we showed state-of-the-art results for tracking human hands in real-time using commodity hardware. This pipeline leverages the accuracy of offline model-based dataset generation routines in support of a robust real-time convolutional network architecture for feature extraction. We showed that it is possible to use intermediate heat-map features to extract accurate and reliable 3D pose information at interactive frame rates using inverse kinematics.

REFERENCES

3GEAR. 2014. 3Gear Systems hand-tracking development platform. http://www.threegear.com/.
ALLEN, B., CURLESS, B., AND POPOVIĆ, Z. 2003. The space of human body shapes: reconstruction and parameterization from range scans. ACM Trans. Graph. 22, 3, 587-594.
BALLAN, L., TANEJA, A., GALL, J., VAN GOOL, L., AND POLLEFEYS, M. 2012. Motion capture of hands in action using discriminative salient points. In Proceedings of the 12th European Conference on Computer Vision. 640-653.
BOYKOV, Y., VEKSLER, O., AND ZABIH, R. 2001. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Mach. Intell. 23, 11 (Nov.), 1222-1239.
BUTLER, D. A., IZADI, S., HILLIGES, O., MOLYNEAUX, D., HODGES, S., AND KIM, D. 2012. Shake'n'sense: reducing interference for overlapping structured light depth cameras. In Proceedings of the 2012 ACM Annual Conference on Human Factors in Computing Systems. 1933-1936.
COLLOBERT, R., KAVUKCUOGLU, K., AND FARABET, C. 2011. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop.
COUPRIE, C., FARABET, C., NAJMAN, L., AND LECUN, Y. 2013. Indoor semantic segmentation using depth information. In International Conference on Learning Representations.
EROL, A., BEBIS, G., NICOLESCU, M., BOYLE, R. D., AND TWOMBLY, X. 2007. Vision-based hand pose estimation: A review. Computer Vision and Image Understanding.
FARABET, C., COUPRIE, C., NAJMAN, L., AND LECUN, Y. 2013. Learning hierarchical features for scene labeling. Pattern Analysis and Machine Intelligence, IEEE Transactions on 35, 8, 1915-1929.
HORN, B. K. P. 1987. Closed-form solution of absolute orientation using unit quaternions. Journal of the Optical Society of America 4, 4, 629-642.
JARRETT, K., KAVUKCUOGLU, K., RANZATO, M., AND LECUN, Y. 2009. What is the best multi-stage architecture for object recognition? In Computer Vision, 2009 IEEE 12th International Conference on. 2146-2153.
JIU, M., WOLF, C., TAYLOR, G. W., AND BASKURT, A. 2013. Human body part estimation from depth images via spatially-constrained deep learning. Pattern Recognition Letters.
