J.J. Sun et al.: View-Invariant Probabilistic Embedding for Human Pose (Pr-VIPE)
(a) View-Invariant Pose Embeddings (VIPE). (b) Probabilistic View-Invariant Pose Embeddings (Pr-VIPE).

Fig. 1: We embed 2D poses such that our embeddings are (a) view-invariant (2D projections of similar 3D poses are embedded close together) and (b) probabilistic (embeddings are distributions that cover the different 3D poses projecting to the same input 2D pose).

Inspired by 2D-to-3D lifting models [32], we learn view-invariant embeddings directly from 2D pose keypoints. As illustrated in Fig. 1, we explore whether view invariance of human bodies can be achieved from 2D poses alone, without predicting 3D pose. Typically, embedding models are trained from images using deep metric learning techniques [35,14,8]. However, images with similar human poses can appear different because of changing viewpoints, subjects, backgrounds, clothing, etc. As a result, it can be difficult to attribute errors in the embedding space to a specific factor of variation. Furthermore, multi-view image datasets of human poses with 3D ground truth annotations are difficult to capture in the wild. In contrast, our method leverages existing 2D keypoint detectors: using 2D keypoints as inputs allows the embedding model to focus on learning view invariance. Our 2D keypoint embeddings can be trained using datasets captured in lab environments, while the model generalizes to in-the-wild data. Additionally, we can easily augment the training data by synthesizing multi-view 2D poses from 3D poses through perspective projection.
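This projection-based augmentation can be sketched as follows; the toy keypoints, function names, and the fixed camera distance are illustrative assumptions rather than values from the paper:

```python
import numpy as np

def rotate_about_y(pose_3d, azimuth):
    """Rotate a 3D pose (J x 3) about the vertical (y) axis by `azimuth` radians."""
    c, s = np.cos(azimuth), np.sin(azimuth)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return pose_3d @ rot.T

def project_perspective(pose_3d, camera_distance=4.0):
    """Pinhole projection: place the pose `camera_distance` units in front of
    the camera along z, then divide x and y by depth."""
    shifted = pose_3d + np.array([0.0, 0.0, camera_distance])
    return shifted[:, :2] / shifted[:, 2:3]

# Toy 3-joint pose: two projections of the same 3D pose from different views
# yield different 2D inputs that should embed close together.
pose = np.array([[0.0, 0.5, 0.0], [0.2, 0.0, 0.1], [-0.2, -0.5, 0.0]])
view_a = project_perspective(rotate_about_y(pose, 0.0))
view_b = project_perspective(rotate_about_y(pose, np.pi / 3))
```

Any number of synthetic views can be generated this way from a single 3D pose, without extra cameras or annotation.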
Another aspect we address is input uncertainty. The input to our embedding model is a 2D human pose, which has an inherent ambiguity: many valid 3D poses can project to the same or a very similar 2D pose [1]. This input uncertainty is difficult to represent using deterministic mappings to the embedding space (point embeddings) [37,24]. Our embedding space instead consists of probabilistic embeddings based on multivariate Gaussians, as shown in Fig. 1b. We show that the variance learned by our method correlates with the ambiguity of the input 2D pose. We call our approach Pr-VIPE, for Probabilistic View-Invariant Pose Embeddings; the non-probabilistic, point embedding formulation is referred to as VIPE. We show that our embedding is applicable to subsequent vision tasks such as pose retrieval [35,21], video alignment [11], and action recognition [60,18]. One direct application is pose-based image retrieval. Our embedding enables users to search images by fine-grained pose, such as jumping with hands up, riding a bike with one hand waving, and many other actions that are potentially difficult to pre-define. The importance of this application is further highlighted by works such as [35,21]. Compared with using 3D keypoints with alignment for retrieval, our embedding enables efficient similarity comparisons in Euclidean space.

Contributions Our main contribution is a method for learning an embedding space in which 2D pose embedding distances correspond to their similarities in absolute 3D pose space.
We also develop a probabilistic formulation that captures 2D pose ambiguity. We use cross-view pose retrieval to evaluate the view-invariance property: given a monocular pose image, we retrieve the same pose from different views without using camera parameters. Our results suggest that 2D poses are sufficient to achieve view invariance without image context, and that we do not have to predict 3D pose coordinates to achieve it. We also demonstrate the use of our embeddings for action recognition and video alignment.

2 Related Work

Metric Learning We aim to understand similarity in human poses across views. Most works that capture similarity between inputs apply techniques from metric learning. Objectives such as the contrastive loss (based on pair matching) [4,12,37] and the triplet loss (based on tuple ranking) [56,50,57,13] are often used to push together similar examples and pull apart dissimilar ones in embedding space. The number of possible training tuples increases exponentially with the number of samples in the tuple, and not all combinations are equally informative. To find informative training tuples, various mining strategies have been proposed [50,58,38,13]. In particular, semi-hard triplet mining has been widely used [50,58,42]. This mining method finds negative examples that are hard enough to be informative but not too hard for the model, where the hardness of a negative sample is based on its embedding distance to the anchor. Commonly, this distance is the Euclidean distance [56,57,50,13],
but any differentiable distance function could be applied [13]. [16,19] show that alternative distance metrics also work for image and object retrieval. In our work, we learn a mapping from Euclidean embedding distance to a probabilistic similarity score. This probabilistic similarity captures closeness in 3D pose space from 2D poses. Our work is inspired by the mapping used in the soft contrastive loss [37] for learning from an occluded N-digit MNIST dataset. Most of the papers discussed above deterministically map inputs to point embeddings. Other works instead map inputs to probabilistic embeddings, which have been used to model the specificity of word embeddings [55], uncertainty in graph representations [3], and input uncertainty due to occlusion [37]. We apply probabilistic embeddings to address the inherent ambiguity of 2D poses caused by 3D-to-2D projection.

Human Pose Estimation 3D human poses in a global coordinate frame are view-invariant, since images across views are mapped to the same 3D pose. However, as mentioned by [32], it is difficult to infer the 3D pose in an arbitrary global frame, since changes to the frame do not change the input data. Many approaches therefore work with poses in the camera coordinate system [32,6,43,46,62,52,48,53,7], where the pose description changes with viewpoint. While our work focuses on images with a single person, other works focus on describing the poses of multiple people [47].
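As a minimal illustration of mapping a Euclidean embedding distance to a probabilistic similarity score, in the spirit of the soft contrastive loss of [37]: the sigmoid form and the parameters `a` and `b` below are assumptions for illustration; in a trained model such parameters would be learned jointly with the embedder.

```python
import numpy as np

def matching_probability(z_i, z_j, a=1.0, b=1.0):
    """Map the Euclidean distance between two embeddings to a similarity
    probability in (0, 1): p = sigmoid(b - a * d). Scale `a` and bias `b`
    are illustrative placeholders, not learned values."""
    d = np.linalg.norm(np.asarray(z_i) - np.asarray(z_j))
    return 1.0 / (1.0 + np.exp(a * d - b))
```

The probability decreases monotonically with distance, so ranking by probability is equivalent to ranking by distance, while the calibrated value is directly interpretable as a match confidence.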
increase the matching probability of similar poses. Finally, the Gaussian prior loss (Section 3.4) helps regularize the embedding magnitude and variance.

3.1 Matching Definition The 3D pose space is continuous, and two 3D poses can be trivially different without being identical. We define two 3D poses to be matching if they are visually similar regardless of viewpoint. Given two sets of 3D keypoints (y_i, y_j), we define a matching indicator function

m_ij = 1 if NP-MPJPE(y_i, y_j) ≤ κ, and m_ij = 0 otherwise, (1)

where κ controls the visual similarity required between matching poses. Here, we use the mean per-joint position error (MPJPE) [17] between the two sets of 3D pose keypoints as a proxy for their visual similarity. Before computing MPJPE, we normalize the 3D poses and apply Procrustes alignment between them, because we want our model to be view-invariant and to disregard rotation, translation, and scale differences between 3D poses. We refer to this normalized, Procrustes-aligned MPJPE as NP-MPJPE.

3.2 Triplet Ratio Loss The triplet ratio loss aims to embed 2D poses based on the matching indicator function (1). Let n be the dimension of the input 2D pose keypoints x, and d the dimension of the output embedding. We would like to learn a mapping f: R^n → R^d such that D(z_i, z_j) < D(z_i, z_j'), for any m_ij > m_ij', where z = f(x) and D(z_i, z_j) is an embedding space distance measure. For a pair of input 2D poses (x_i, x_j), we define p(m | x_i, x_j) to be the probability that their corresponding 3D poses (y_i, y_j) match, that is, that they are visually similar.
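The NP-MPJPE and the matching indicator of Eq. (1) can be sketched as follows (a minimal NumPy version; the normalization convention, unit Frobenius norm after centering, is an illustrative assumption):

```python
import numpy as np

def normalize_pose(pose):
    """Remove translation and scale: center at the mean keypoint, then rescale
    to unit Frobenius norm (one illustrative normalization convention)."""
    centered = pose - pose.mean(axis=0, keepdims=True)
    return centered / np.linalg.norm(centered)

def procrustes_rotate(a, b):
    """Orthogonal Procrustes: rotate pose `b` onto pose `a` (both normalized)."""
    u, _, vt = np.linalg.svd(b.T @ a)
    return b @ (u @ vt)

def np_mpjpe(pose_a, pose_b):
    """Normalized, Procrustes-aligned mean per-joint position error."""
    a, b = normalize_pose(pose_a), normalize_pose(pose_b)
    b = procrustes_rotate(a, b)
    return np.linalg.norm(a - b, axis=1).mean()

def is_match(pose_a, pose_b, kappa=0.1):
    """Matching indicator of Eq. (1); kappa = 0.1 as used in the paper."""
    return float(np_mpjpe(pose_a, pose_b) <= kappa)
```

Because normalization and Procrustes alignment are applied first, a rotated, translated, and rescaled copy of a pose has NP-MPJPE near zero and is counted as a match.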
While it is difficult to define this probability directly, we propose to assign its values by estimating p(m | z_i, z_j) via metric learning. We know that if two 3D poses are identical, then p(m | x_i, x_j) = 1, and if two 3D poses are sufficiently different, p(m | x_i, x_j) should be small. For any given input triplet (x_i, x_i+, x_i-) with m_{i,i+} > m_{i,i-}, we want

p(m | z_i, z_i+) ≥ β p(m | z_i, z_i-), (2)

where β > 1 represents the ratio of the matching probability of a similar 3D pose pair to that of a dissimilar pair. Applying the negative logarithm to both sides, we have

(-log p(m | z_i, z_i+)) - (-log p(m | z_i, z_i-)) ≤ -log β. (3)

Notice that the model can be trained to satisfy this within the triplet loss framework [50]. Given batch size N and writing D_m(z_i, z_j) = -log p(m | z_i, z_j), we define the triplet ratio loss L_ratio as

L_ratio = Σ_{i=1}^{N} max(0, D_m(z_i, z_i+) - D_m(z_i, z_i-) + α), (4)

with margin α = log β.

Inference At inference time, our model takes a single 2D pose (either from detection or projection) and outputs the mean and variance of the embedding Gaussian distribution.

3.5 Camera Augmentation Our triplets can be made of detected and/or projected 2D keypoints, as shown in Fig. 2. When we train only with detected 2D keypoints, we are constrained to the camera views of the training images. To reduce overfitting to these views, we perform camera augmentation by generating triplets using detected keypoints alongside 2D keypoints projected at random views.
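The triplet ratio loss (4) can be sketched as below. The distance-to-probability mapping is an assumed sigmoid form, since the paper's learned mapping is defined in a section not included in this excerpt; only the hinge structure of Eq. (4) is taken from the text.

```python
import numpy as np

def matching_log_prob(z_a, z_b, a=1.0, b=1.0):
    """log p(m | z_a, z_b) under an assumed sigmoid distance-to-probability
    mapping: log sigmoid(b - a * d). Parameters are illustrative."""
    d = np.linalg.norm(z_a - z_b)
    return -np.log1p(np.exp(a * d - b))

def triplet_ratio_loss(anchors, positives, negatives, beta=2.0):
    """Eq. (4): hinge over D_m differences with margin alpha = log(beta),
    where D_m(z_i, z_j) = -log p(m | z_i, z_j)."""
    alpha = np.log(beta)
    total = 0.0
    for z, z_pos, z_neg in zip(anchors, positives, negatives):
        d_pos = -matching_log_prob(z, z_pos)  # small when the pair likely matches
        d_neg = -matching_log_prob(z, z_neg)
        total += max(0.0, d_pos - d_neg + alpha)
    return total
```

A triplet contributes zero loss once the positive pair's matching probability exceeds the negative pair's by at least the factor β, matching the ratio constraint (2).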
To form triplets using multi-view image pairs, we use detected 2D keypoints from different views as anchor-positive pairs. To use projected 2D keypoints, we apply two random rotations to a normalized input 3D pose to generate two 2D poses from different views for the anchor and positive. Camera augmentation is then performed using a mixture of detected and projected 2D keypoints. We find that training with camera augmentation helps our models generalize better to unseen views (Section 4.2.2).

3.6 Implementation Details We normalize 3D poses similarly to [7], and we apply instance normalization to the 2D poses. The backbone network architecture of our model is based on [32]. We use d = 16 as a good trade-off between embedding size and accuracy. To weigh the different losses, we use w_ratio = 1, w_positive = 0.005, and w_prior = 0.001. We choose β = 2 for the triplet ratio loss margin and K = 20 for the number of samples. The matching NP-MPJPE threshold is κ = 0.1 for all training and evaluation. Our approach does not rely on a particular 2D keypoint detector, and we use PersonLab [40] for our experiments. For the random rotations in camera augmentation, we uniformly sample azimuth angles between ±180°, elevation between ±30°, and roll between ±30°. Our implementation is in TensorFlow, and all models are trained with CPUs. More details and ablation studies on hyperparameters are provided in the supplementary materials.
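The instance normalization of 2D poses can be sketched as follows; the exact scheme is left to the paper's supplementary materials, so the centering and scaling convention below is an illustrative assumption:

```python
import numpy as np

def normalize_2d_instance(pose_2d):
    """Per-instance 2D pose normalization: center at the mean keypoint and
    rescale so the farthest keypoint lies at unit distance. This removes
    image location and scale (illustrative; the paper's scheme may differ)."""
    centered = pose_2d - pose_2d.mean(axis=0, keepdims=True)
    return centered / np.linalg.norm(centered, axis=1).max()
```

After this step, the same pose detected at different image positions and scales maps to an identical input, leaving only viewpoint and pose variation for the embedder to handle.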
4 Experiments

We demonstrate the performance of our model through pose retrieval across different camera views (Section 4.2). We further show that our embeddings can be directly applied to downstream tasks, such as action recognition (Section 4.3.1) and video alignment (Section 4.3.2), without any additional training.

4.1 Datasets For all the experiments in this paper, we train only on a subset of the Human3.6M [17] dataset. For the pose retrieval experiments, we validate on the Human3.6M hold-out set and test on another dataset (MPI-INF-3DHP [33]), which

Table 1: Comparison of cross-view pose retrieval results, Hit@k (%), on H3.6M and 3DHP with chest-level cameras and all cameras. * indicates that normalization and Procrustes alignment are performed on query-index pairs.

Dataset            | H3.6M           | 3DHP (Chest)    | 3DHP (All)
k                  | 1    10   20    | 1    10   20    | 1    10   20
2D keypoints*      | 28.7 47.1 50.9  | 5.2  14.0 17.2  | 9.8  21.6 25.5
3D lifting*        | 69.0 89.7 92.7  | 24.9 54.4 62.4  | 24.6 53.2 61.3
L2-VIPE            | 73.5 94.2 96.6  | 23.8 56.7 66.5  | 18.7 46.3 55.7
L2-VIPE (w/ aug.)  | 70.4 91.8 94.5  | 24.9 55.4 63.6  | 23.7 53.0 61.4
Pr-VIPE            | 76.2 95.6 97.7  | 25.4 59.3 69.3  | 19.9 49.1 58.8
Pr-VIPE (w/ aug.)  | 73.7 93.9 96.3  | 28.3 62.3 71.4  | 26.4 58.6 67.9

as retrieval confidence. We present the results for the VIPE models with and without camera augmentation. We applied similar camera augmentation to the lifting model but did not see an improvement in performance. We also show the results of pose retrieval using aligned 2D keypoints only. The poor performance of using input 2D keypoints for retrieval from different views
confirms that models must learn view invariance from their inputs for this task. We also compare with the image-based EpipolarPose model [26]; please refer to the supplementary materials for the experiment details and results.

4.2.2 Quantitative Results From Table 1, we see that Pr-VIPE (with augmentation) outperforms all baselines on H3.6M and 3DHP. The H3.6M results shown are on the hold-out set, and 3DHP is unseen during training, with more diverse poses and views. When we use all the cameras from 3DHP, we evaluate the ability of models to generalize to new poses and new views. When we evaluate using only the 5 chest-level cameras from 3DHP, where the views are more similar to those in the H3.6M training set, we mainly evaluate generalization to new poses. Our model is robust to the choice of κ and the number of samples K (analysis in the supplementary materials).

Table 1 also shows that Pr-VIPE without camera augmentation performs better than the baselines on H3.6M and 3DHP (chest-level cameras). This shows that Pr-VIPE generalizes as well as the baseline methods to new poses. However, on 3DHP (all cameras), the performance of Pr-VIPE without augmentation is worse than with chest-level cameras. This observation indicates that, when trained on chest-level cameras only, Pr-VIPE does not generalize as well to new views. The same can be observed
for L2-VIPE between chest-level and all cameras. In contrast, the 3D lifting models generalize better to new views with the help of the additional Procrustes alignment, which requires an expensive SVD computation for every index-query pair. We further apply camera augmentation to the training of the Pr-VIPE and L2-VIPE models. Note that this step does not require camera parameters or additional ground truth. The results in Table 1 on Pr-VIPE show that the aug-

poses and views unseen during training, which have their nearest neighbors slightly further away in the embedding space. We see that the model can generalize to new views, as the images are taken at camera elevations different from those in H3.6M. Interestingly, the rightmost pair in row 2 shows that the model can retrieve poses with large differences in roll angle, which is not present in the training set. The rightmost pair in row 3 shows an example of a large NP-MPJPE error due to mis-detection of the left leg in the index pose. We show qualitative results using queries from the H3.6M hold-out set to retrieve from 2DHP in the last two rows of Fig. 3. The results on these in-the-wild images indicate that, as long as the 2D keypoint detector works reliably, our model is able to retrieve poses across views and subjects. More qualitative results are provided in the supplementary materials.

4.3 Downstream Tasks We show that our pose embedding can be directly applied to pose-based downstream tasks using simple algorithms. We compare the performance
of Pr-VIPE (trained only on H3.6M, with no additional training) on the Penn Action dataset against other approaches trained specifically for each task on the target dataset. In all the following experiments in this section, we compute our Pr-VIPE embeddings on single video frames and use the negative logarithm of the matching probability (7) as the distance between two frames. We then apply temporal averaging within an atrous kernel of size 7 and rate 3 around the two center frames and use this averaged distance as the frame matching distance. Given the matching distance, we use the standard dynamic time warping (DTW) algorithm to align two action sequences by minimizing the sum of frame matching distances. We further use the average frame matching distance over the alignment as the distance between two video sequences.

4.3.1 Action Recognition We evaluate our embeddings for action recognition using nearest-neighbor search with the sequence distance described above. Given person bounding boxes in each frame, we estimate 2D pose keypoints using [41]. On Penn Action, we use the standard train/test split [36]. Using all the test videos as queries, we conduct two experiments: (1) we use all training videos as the index to evaluate overall performance and compare with state-of-the-art methods, and (2) we use training videos from only one view as the index and evaluate the effectiveness of our embeddings in terms of view invariance.
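The sequence distance described above can be sketched with plain DTW over a precomputed frame-distance matrix (the atrous temporal averaging of kernel size 7 and rate 3 is omitted for brevity; this is a minimal illustration, not the paper's exact implementation):

```python
import numpy as np

def dtw_average_distance(frame_dist):
    """Align two sequences with standard DTW over a precomputed frame-distance
    matrix and return the average matched-frame distance along the path."""
    n, m = frame_dist.shape
    acc = np.full((n + 1, m + 1), np.inf)   # accumulated alignment cost
    length = np.zeros((n + 1, m + 1), int)  # path length, for averaging
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            k = int(np.argmin([acc[p] for p in prev]))
            acc[i, j] = frame_dist[i - 1, j - 1] + acc[prev[k]]
            length[i, j] = length[prev[k]] + 1
    return acc[n, m] / length[n, m]
```

Averaging over the path length keeps the sequence distance comparable across videos of different lengths, which matters for nearest-neighbor classification.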
For this second experiment, actions with zero or only one sample under the index view are ignored, and accuracy is averaged over the different views. From Table 2, we can see that, without any training on the target domain or use of image context information, our embeddings achieve highly competitive results on pose-based action classification, outperforming the best existing baseline that uses only pose input, and even some methods that rely on image context or optical flow. As shown in the last row of Table 2, our embeddings can classify actions from different views using index samples from only a single view with relatively high accuracy, which further demonstrates the advantages of our view-invariant embeddings.

Fig. 5: Ablation study: (a) Top retrievals by 2D NP-MPJPE from the H3.6M hold-out subset for the queries with the largest and smallest variance; 2D poses are shown in the boxes. (b) Relationship between embedding variance and 2D NP-MPJPE to the top-10 nearest 2D pose neighbors from the H3.6M hold-out subset; the orange curve is the best-fitting 5th-degree polynomial. (c) Comparison of Hit@1 with different embedding dimensions; the 3D lifting baseline predicts 39 dimensions. (d) Relationship between retrieval confidence and matching accuracy.

The Hit@1 for VIPE and Pr-VIPE is 75.4% and 76.2% on H3.6M, and 19.7% and 20.0% on 3DHP, respectively. When we add camera augmentation, the Hit@1 for VIPE and Pr-VIPE is 73.8% and 73.7% on H3.6M, and 26.1% and 26.5% on 3DHP, respectively. Despite the similar retrieval accuracies,
Pr-VIPE is generally more accurate and, more importantly, has the additional desirable property that its variance can model 2D input ambiguity, as discussed next. A 2D pose is ambiguous if there are similar 2D poses that can be projected from very different 3D poses. To measure this, we compute the average 2D NP-MPJPE between a 2D pose and its top-10 nearest neighbors in terms of 2D NP-MPJPE. To ensure the 3D poses are different, we sample 1200 poses from the H3.6M hold-out set with a minimum gap of 0.1 3D NP-MPJPE. If a 2D pose has a small 2D NP-MPJPE to its neighbors, there are many similar 2D poses corresponding to different 3D poses, and so the 2D pose is ambiguous. Fig. 5a shows that the 2D pose with the largest variance is ambiguous, as it has similar 2D poses in H3.6M with different 3D poses. In contrast, the closest 2D poses corresponding to the smallest-variance pose in the first row of Fig. 5a are clearly different. Fig. 5b further shows that as the average variance increases, the 2D NP-MPJPE between similar poses generally decreases, which means that 2D poses with larger variance are more ambiguous.

Embedding Dimensions Fig. 5c shows the effect of the embedding dimension on H3.6M and 3DHP. The lifting model lifts 13 2D keypoints to 3D and therefore has a constant output dimension of 39. We see that Pr-VIPE (with augmentation) achieves a higher accuracy than lifting at 16 dimensions. Additionally, we can increase the number of embedding dimensions to 32, which increases the accuracy of Pr-VIPE from 73.7% to 75.5%.
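The ambiguity measure above can be sketched as follows, with the 2D NP-MPJPE computed analogously to the 3D case (the normalization convention is an illustrative assumption):

```python
import numpy as np

def np_mpjpe_2d(pose_a, pose_b):
    """2D NP-MPJPE: normalize both poses, Procrustes-rotate b onto a, then
    take the mean per-joint error."""
    def normalize(p):
        p = p - p.mean(axis=0, keepdims=True)
        return p / np.linalg.norm(p)
    a, b = normalize(pose_a), normalize(pose_b)
    u, _, vt = np.linalg.svd(b.T @ a)
    b = b @ (u @ vt)
    return np.linalg.norm(a - b, axis=1).mean()

def ambiguity_score(query_2d, pose_set, k=10):
    """Average 2D NP-MPJPE from a query to its k nearest 2D poses. A small
    score means many near-duplicate 2D poses exist, i.e. the query's 3D
    interpretation is less constrained (more ambiguous)."""
    dists = sorted(np_mpjpe_2d(query_2d, p) for p in pose_set)
    return float(np.mean(dists[:k]))
```

With `pose_set` pre-filtered to poses whose 3D counterparts differ (the 0.1 3D NP-MPJPE gap above), a low score indicates genuine projection ambiguity rather than duplicated 3D poses.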
Retrieval Confidence To validate the retrieval confidence values, we randomly sample 100 queries along with their top-5 retrievals (by Pr-VIPE retrieval confidence) from each query-index camera pair. This procedure forms 6000 query-retrieval sample pairs for H3.6M (4 views, 12 camera pairs) and 55000 for 3DHP (11 views, 110 camera pairs), which we bin by their retrieval confidences. Fig. 5d shows the matching accuracy for each confidence bin. The accuracy correlates positively with the confidence values, which suggests that our retrieval confidence is a valid indicator of model performance.

What if 2D keypoint detectors were perfect? We repeat our pose retrieval experiments using ground truth 2D keypoints to simulate a perfect 2D keypoint detector on H3.6M and 3DHP. All experiments use the 4 views from H3.6M for training, following the standard protocol. For the baseline lifting model in the camera frame, we achieve 89.9% Hit@1 on H3.6M, 48.2% on 3DHP (all), and 48.8% on 3DHP (chest). For Pr-VIPE, we achieve 97.5% Hit@1 on H3.6M, 44.3% on 3DHP (all), and 66.4% on 3DHP (chest). These results follow the same trend as the detected-keypoint results in Table 1. The large improvement in performance when using ground truth keypoints suggests that a considerable fraction of the error in our model is due to imperfect 2D keypoint detections. Please refer to the supplementary materials for more ablation studies and an embedding space visualization.
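The Hit@k retrieval metric reported throughout these experiments can be sketched generically as below; `match_fn` is a hypothetical stand-in for the NP-MPJPE-based match check between a query pose and an index pose:

```python
import numpy as np

def hit_at_k(query_emb, index_emb, match_fn, k=1):
    """Fraction of queries for which at least one true match appears among
    the k nearest index embeddings under Euclidean distance."""
    hits = 0
    for qi, q in enumerate(query_emb):
        dists = np.linalg.norm(index_emb - q, axis=1)
        top_k = np.argsort(dists)[:k]
        hits += any(match_fn(qi, int(j)) for j in top_k)
    return hits / len(query_emb)
```

Because retrieval runs directly in Euclidean embedding space, no per-pair Procrustes alignment is needed at query time, unlike the 3D lifting baselines.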
5 Conclusion

We introduce Pr-VIPE, an approach for learning probabilistic view-invariant embeddings from 2D pose keypoints. By working with 2D keypoints, we can use camera augmentation to improve model generalization to unseen views. We also demonstrate that our probabilistic embeddings learn to capture input ambiguity. Pr-VIPE has a simple architecture and can potentially be applied to object and hand poses. For cross-view pose retrieval, 3D pose estimation models require an expensive rigid alignment between each query-index pair, while our embeddings can be compared directly in Euclidean space. In addition, we demonstrate the effectiveness of our embeddings on the downstream tasks of action recognition and video alignment. Our embedding focuses on a single person; in future work, we will investigate extending it to multiple people, as well as models that are robust to missing input keypoints.

Acknowledgment We thank Yuxiao Wang, Debidatta Dwibedi, and Liangzhe Yuan from Google Research, Long Zhao from Rutgers University, and Xiao Zhang from the University of Chicago for helpful discussions. We appreciate the support of Pietro Perona, Yisong Yue, and the Computational Vision Lab at Caltech for making this collaboration possible. The author Jennifer J. Sun is supported by NSERC (funding number PGSD3-532647-2019) and Caltech.

24. Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? In: NeurIPS (2017)
25. Kingma,
D.P., Welling, M.: Auto-encoding variational Bayes (2014)
26. Kocabas, M., Karagoz, S., Akbas, E.: Self-supervised learning of 3D human pose using multi-view geometry (2019)
27. LeCun, Y., Huang, F.J., Bottou, L., et al.: Learning methods for generic object recognition with invariance to pose and lighting. In: CVPR (2004)
28. Li, J., Wong, Y., Zhao, Q., Kankanhalli, M.: Unsupervised learning of view-invariant action representations. In: NeurIPS (2018)
29. Liu, J., Akhtar, N., Ajmal, M.: Viewpoint invariant action recognition using RGB-D videos. IEEE Access 6, 70061-70071 (2018)
30. Liu, M., Yuan, J.: Recognizing human actions as the evolution of pose estimation maps. In: CVPR (2018)
31. Luvizon, D.C., Tabia, H., Picard, D.: Multi-task deep learning for real-time 3D human pose estimation and action recognition. arXiv:1912.08077 (2019)
32. Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: ICCV (2017)
33. Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3D human pose estimation in the wild using improved CNN supervision. In: 3DV (2017)
34. Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: Unsupervised learning using temporal order verification. In: ECCV (2016)
35. Mori, G., Pantofaru, C., Kothari, N., Leung, T., Toderici, G., Toshev, A., Yang, W.: Pose embeddings: A deep architecture for learning to match human poses. arXiv:1507.00302 (2015)
36. Nie, B.X., Xiong, C., Zhu, S.C.: Joint action recognition
and pose estimation from video. In: CVPR (2015)
37. Oh, S.J., Murphy, K., Pan, J., Roth, J., Schroff, F., Gallagher, A.: Modeling uncertainty with hedged instance embedding. In: ICLR (2019)
38. Oh Song, H., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. In: CVPR (2016)
39. Ong, E.J., Micilotta, A.S., Bowden, R., Hilton, A.: Viewpoint invariant exemplar-based 3D human tracking. CVIU (2006)
40. Papandreou, G., Zhu, T., Chen, L.C., Gidaris, S., Tompson, J., Murphy, K.: PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In: ECCV (2018)
41. Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., Murphy, K.: Towards accurate multi-person pose estimation in the wild. In: CVPR (2017)
42. Parkhi, O.M., Vedaldi, A., Zisserman, A., et al.: Deep face recognition. In: BMVC (2015)
43. Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: CVPR (2019)
44. Qiu, H., Wang, C., Wang, J., Wang, N., Zeng, W.: Cross view fusion for 3D human pose estimation. arXiv:1909.01203 (2019)
45. Rao, C., Shah, M.: View-invariance in action recognition. In: CVPR (2001)
46. Rayat Imtiaz Hossain, M., Little, J.J.: Exploiting temporal information for 3D human pose estimation. In: ECCV (2018)
47. Rhodin, H., Constantin, V., Katircioglu, I., Salzmann, M., Fua, P.: Neural scene decomposition for multi-person motion capture.