View-Invariant Probabilistic Embedding for Human Pose
J. J. Sun et al.

(a) View-Invariant Pose Embeddings (VIPE). (b) Probabilistic View-Invariant Pose Embeddings (Pr-VIPE).

Fig. 1: We embed 2D poses such that our embeddings are (a) view-invariant (2D projections of similar 3D poses are embedded close) and (b) probabilistic (embeddings are distributions that cover different 3D poses projecting to the same input 2D pose).

Inspired by 2D-to-3D lifting models [32], we learn view-invariant embeddings directly from 2D pose keypoints. As illustrated in Fig. 1, we explore whether view invariance of human bodies can be achieved from 2D poses alone, without predicting 3D pose. Typically, embedding models are trained from images using deep metric learning techniques [35,14,8]. However, images with similar human poses can appear different because of changing viewpoints, subjects, backgrounds, clothing, etc. As a result, it can be difficult to understand errors in the embedding space from a specific factor of variation. Furthermore, multi-view image datasets for human poses are difficult to capture in the wild with 3D ground truth annotations. In contrast, our method leverages existing 2D keypoint detectors: using 2D keypoints as inputs allows the embedding model to focus on learning view invariance. Our 2D keypoint embeddings can be trained using datasets in lab environments, while having the model generalize to in-the-wild data. Additionally, we can easily augment training data by synthesizing multi-view 2D poses from 3D poses through perspective projection.
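To make the augmentation concrete, below is a minimal NumPy sketch of synthesizing an anchor/positive pair of 2D poses by perspective projection of one 3D pose at two random views. The pinhole camera model, joint count, and all function and parameter names are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def project_pose(pose_3d, azimuth, elevation, distance=4.0, focal=1.0):
    """Project a 3D pose (J x 3, hip-centered) to 2D with a pinhole camera.

    The camera orbits the origin at the given azimuth/elevation (radians)
    and sits `distance` units away along its optical axis.
    """
    ca, sa = np.cos(azimuth), np.sin(azimuth)
    ce, se = np.cos(elevation), np.sin(elevation)
    rot_az = np.array([[ca, 0.0, sa], [0.0, 1.0, 0.0], [-sa, 0.0, ca]])
    rot_el = np.array([[1.0, 0.0, 0.0], [0.0, ce, se], [0.0, -se, ce]])
    cam = pose_3d @ (rot_el @ rot_az).T   # rotate world into camera orientation
    cam[:, 2] += distance                 # translate subject in front of the camera
    return focal * cam[:, :2] / cam[:, 2:3]   # perspective divide -> J x 2 keypoints

# Two random views of one 3D pose give an anchor/positive training pair.
rng = np.random.default_rng(0)
pose = rng.standard_normal((13, 3)) * 0.3     # stand-in for a real 3D pose
anchor = project_pose(pose, np.deg2rad(40.0), np.deg2rad(10.0))
positive = project_pose(pose, np.deg2rad(-70.0), np.deg2rad(-5.0))
```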

Another aspect we address is input uncertainty. The input to our embedding model is a 2D human pose, which has an inherent ambiguity: many valid 3D poses can project to the same or a very similar 2D pose [1]. This input uncertainty is difficult to represent using deterministic mappings to the embedding space (point embeddings) [37,24]. Our embedding space consists of probabilistic embeddings based on multivariate Gaussians, as shown in Fig. 1b. We show that the learned variance from our method correlates with input 2D ambiguities. We call our approach Pr-VIPE, for Probabilistic View-Invariant Pose Embeddings. The non-probabilistic, point embedding formulation will be referred to as VIPE.
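As a sketch of what such a probabilistic embedding model could look like, the snippet below builds a small Keras network that maps flattened 2D keypoints to the mean and a positive per-dimension variance of a Gaussian embedding. The layer widths and the two-branch head are our assumptions; the paper's backbone follows [32].

```python
import tensorflow as tf

def gaussian_embedding_head(num_keypoints=13, embedding_dim=16):
    """Backbone + head mapping flattened 2D keypoints to a Gaussian embedding."""
    inputs = tf.keras.Input(shape=(num_keypoints * 2,))
    h = inputs
    for _ in range(2):  # stand-in for the lifting-style backbone of [32]
        h = tf.keras.layers.Dense(1024, activation='relu')(h)
    mean = tf.keras.layers.Dense(embedding_dim, name='mean')(h)
    # softplus keeps the predicted per-dimension variance positive
    var = tf.keras.layers.Dense(embedding_dim, activation='softplus', name='var')(h)
    return tf.keras.Model(inputs, [mean, var])
```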

We show that our embedding is applicable to subsequent vision tasks such as pose retrieval [35,21], video alignment [11], and action recognition [60,18]. One direct application is pose-based image retrieval: our embedding enables users to search images by fine-grained pose, such as jumping with hands up, riding a bike with one hand waving, and many other actions that are potentially difficult to pre-define. The importance of this application is further highlighted by works such as [35,21]. Compared with using 3D keypoints with alignment for retrieval, our embedding enables efficient similarity comparisons in Euclidean space.

Contributions. Our main contribution is a method for learning an embedding space where 2D pose embedding distances correspond to their similarities in absolute 3D pose space. We also develop a probabilistic formulation that captures 2D pose ambiguity. We use cross-view pose retrieval to evaluate the view-invariant property: given a monocular pose image, we retrieve the same pose from different views without using camera parameters. Our results suggest 2D poses are sufficient to achieve view invariance without image context, and we do not have to predict 3D pose coordinates to achieve this. We also demonstrate the use of our embeddings for action recognition and video alignment.

2 Related Work

Metric Learning. We are working to understand similarity in human poses across views. Most works that aim to capture similarity between inputs apply techniques from metric learning. Objectives such as contrastive loss (based on pair matching) [4,12,37] and triplet loss (based on tuple ranking) [56,50,57,13] are often used to push together/pull apart similar/dissimilar examples in embedding space. The number of possible training tuples increases exponentially with the number of samples in the tuple, and not all combinations are equally informative. To find informative training tuples, various mining strategies have been proposed [50,58,38,13]. In particular, semi-hard triplet mining has been widely used [50,58,42]. This mining method finds negative examples that are hard enough to be informative but not too hard for the model; the hardness of a negative sample is based on its embedding distance to the anchor. Commonly, this distance is the Euclidean distance [56,57,50,13], but any differentiable distance function can be applied [13].
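A minimal sketch of semi-hard negative selection follows, assuming precomputed embedding distances; the fallback policy when no semi-hard candidate exists varies across implementations and is an assumption here.

```python
import numpy as np

def semi_hard_negative(neg_dists, pos_dist, margin):
    """Pick a semi-hard negative for an anchor, in the spirit of [50].

    `neg_dists` holds embedding distances from the anchor to candidate
    negatives. Semi-hard: farther than the positive but still within the
    margin. Falls back to the easiest (largest-distance) negative when no
    candidate qualifies.
    """
    mask = (neg_dists > pos_dist) & (neg_dists < pos_dist + margin)
    candidates = np.where(mask)[0]
    if candidates.size > 0:
        # hardest among the semi-hard candidates
        return int(candidates[np.argmin(neg_dists[candidates])])
    return int(np.argmax(neg_dists))
```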

[16,19] show that alternative distance metrics also work for image and object retrieval. In our work, we learn a mapping from Euclidean embedding distance to a probabilistic similarity score. This probabilistic similarity captures closeness in 3D pose space from 2D poses. Our work is inspired by the mapping used in the soft contrastive loss [37] for learning from an occluded N-digit MNIST dataset.

Most of the papers discussed above deterministically map inputs to point embeddings. There are works that also map inputs to probabilistic embeddings. Probabilistic embeddings have been used to model the specificity of word embeddings [55], uncertainty in graph representations [3], and input uncertainty due to occlusion [37]. We will apply probabilistic embeddings to address the inherent ambiguities in 2D pose due to 3D-to-2D projection.
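A minimal sketch of such a distance-to-probability mapping, using a sigmoid calibration in the spirit of [37]; the scalar parameters a and b would be learned jointly with the model, and this exact parametrization is our assumption rather than the paper's stated formula.

```python
import numpy as np

def matching_probability(z_i, z_j, a=1.0, b=0.0):
    """Map Euclidean embedding distance to a matching probability in (0, 1).

    Sigmoid calibration of a distance, following the soft contrastive loss
    idea of [37]; a > 0 controls the sharpness and b the offset.
    """
    d = np.linalg.norm(z_i - z_j)
    return 1.0 / (1.0 + np.exp(a * d - b))
```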

Human Pose Estimation. 3D human poses in a global coordinate frame are view-invariant, since images across views are mapped to the same 3D pose. However, as mentioned by [32], it is difficult to infer the 3D pose in an arbitrary global frame, since changes to the frame do not change the input data. Many approaches work with poses in the camera coordinate system [32,6,43,46,62,52,48,53,7], where the pose description changes based on viewpoint. While our work focuses on images with a single person, there are other works focusing on describing poses of multiple people [47].

The positive pairwise loss (Section 3.3) increases the matching probability of similar poses. Finally, the Gaussian prior loss (Section 3.4) helps regularize embedding magnitude and variance.

3.1 Matching Definition

The 3D pose space is continuous, and two 3D poses can be trivially different without being identical. We define two 3D poses to be matching if they are visually similar regardless of viewpoint. Given two sets of 3D keypoints (y_i, y_j), we define a matching indicator function

m_{ij} = \begin{cases} 1, & \text{if } \text{NP-MPJPE}(y_i, y_j) \le \kappa, \\ 0, & \text{otherwise,} \end{cases} \qquad (1)

where \kappa controls the visual similarity between matching poses. Here, we use the mean per joint position error (MPJPE) [17] between the two sets of 3D pose keypoints as a proxy to quantify their visual similarity. Before computing MPJPE, we normalize the 3D poses and apply Procrustes alignment between them. The reason is that we want our model to be view-invariant and to disregard rotation, translation, or scale differences between 3D poses. We refer to this normalized, Procrustes-aligned MPJPE as NP-MPJPE.
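A compact NumPy sketch of NP-MPJPE and the matching indicator of Eq. (1) follows. Centering and unit-norm scaling stand in for the paper's exact normalization (which follows [7]); the rotation alignment is standard orthogonal Procrustes via SVD.

```python
import numpy as np

def np_mpjpe(y_i, y_j):
    """NP-MPJPE between two J x 3 poses: normalize, Procrustes-align, MPJPE."""
    def normalize(y):
        y = y - y.mean(axis=0)        # remove translation
        return y / np.linalg.norm(y)  # remove scale
    a, b = normalize(y_i), normalize(y_j)
    u, _, vt = np.linalg.svd(a.T @ b)  # optimal rotation aligning b to a
    if np.linalg.det(u @ vt) < 0:      # guard against reflections
        u[:, -1] *= -1
    r = u @ vt
    return float(np.mean(np.linalg.norm(a - b @ r.T, axis=1)))

def matching_indicator(y_i, y_j, kappa=0.1):
    """m_ij from Eq. (1): poses match if NP-MPJPE is within kappa."""
    return 1 if np_mpjpe(y_i, y_j) <= kappa else 0
```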

3.2 Triplet Ratio Loss

The triplet ratio loss aims to embed 2D poses based on the matching indicator function (1). Let n be the dimension of the input 2D pose keypoints x, and d be the dimension of the output embedding. We would like to learn a mapping f: \mathbb{R}^n \to \mathbb{R}^d such that D(z_i, z_j) < D(z_i, z_{j'}) for all m_{ij} > m_{ij'}, where z = f(x) and D(z_i, z_j) is an embedding space distance measure.

For a pair of input 2D poses (x_i, x_j), we define p(m | x_i, x_j) to be the probability that their corresponding 3D poses (y_i, y_j) match, that is, that they are visually similar. While it is difficult to define this probability directly, we propose to assign its values by estimating p(m | z_i, z_j) via metric learning. We know that if two 3D poses are identical, then p(m | x_i, x_j) = 1, and if two 3D poses are sufficiently different, p(m | x_i, x_j) should be small. For any given input triplet (x_i, x_i^+, x_i^-) with m_{i,i^+} > m_{i,i^-}, we want

\frac{p(m | z_i, z_i^+)}{p(m | z_i, z_i^-)} \ge \beta, \qquad (2)

where \beta > 1 represents the ratio of the matching probability of a similar 3D pose pair to that of a dissimilar pair. Applying the negative logarithm to both sides, we have

(-\log p(m | z_i, z_i^+)) - (-\log p(m | z_i, z_i^-)) \le -\log \beta. \qquad (3)

Notice that the model can be trained to satisfy this within the triplet loss framework [50]. Given batch size N, we define the triplet ratio loss L_ratio as

L_{\text{ratio}} = \sum_{i=1}^{N} \max\big(0, D_m(z_i, z_i^+) - D_m(z_i, z_i^-) + \alpha\big), \qquad (4)

where D_m(z_i, z_j) = -\log p(m | z_i, z_j) and the margin \alpha = \log \beta.
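Given per-triplet matching probabilities, Eq. (4) reduces to a few lines; this sketch assumes the probabilities are supplied by the model.

```python
import numpy as np

def triplet_ratio_loss(p_pos, p_neg, beta=2.0):
    """L_ratio from Eq. (4) over a batch of triplets.

    p_pos[i] = p(m | z_i, z_i+) and p_neg[i] = p(m | z_i, z_i-); distances
    are D_m = -log p and the margin is alpha = log(beta).
    """
    d_pos = -np.log(np.asarray(p_pos))
    d_neg = -np.log(np.asarray(p_neg))
    alpha = np.log(beta)
    return float(np.sum(np.maximum(0.0, d_pos - d_neg + alpha)))
```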

Inference. At inference time, our model takes a single 2D pose (either from detection or projection) and outputs the mean and the variance of the embedding Gaussian distribution.

3.5 Camera Augmentation

Our triplets can be made of detected and/or projected 2D keypoints, as shown in Fig. 2. When we train only with detected 2D keypoints, we are constrained to the camera views in the training images. To reduce overfitting to these camera views, we perform camera augmentation by generating triplets using detected keypoints alongside 2D keypoints projected at random views. To form triplets using multi-view image pairs, we use detected 2D keypoints from different views as anchor-positive pairs. To use projected 2D keypoints, we apply two random rotations to a normalized input 3D pose to generate two 2D poses from different views for the anchor/positive. Camera augmentation is then performed by using a mixture of detected and projected 2D keypoints. We find that training with camera augmentation helps our models generalize better to unseen views (Section 4.2.2).

3.6 Implementation Details

We normalize 3D poses similarly to [7], and we apply instance normalization to the 2D poses. The backbone network architecture for our model is based on [32]. We use d = 16 as a good trade-off between embedding size and accuracy. To weigh the different losses, we use w_ratio = 1, w_positive = 0.005, and w_prior = 0.001. We choose \beta = 2 for the triplet ratio loss margin and K = 20 for the number of samples. The matching NP-MPJPE threshold is \kappa = 0.1 for all training and evaluation. Our approach does not rely on a particular 2D keypoint detector, and we use PersonLab [40] for our experiments. For the random rotations in camera augmentation, we uniformly sample the azimuth angle between ±180°, elevation between ±30°, and roll between ±30°. Our implementation is in TensorFlow, and all the models are trained with CPUs. More details and ablation studies on hyperparameters are provided in the supplementary materials.
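The random-view sampling used for camera augmentation can be packaged as below, using the angle ranges stated above; two draws per normalized 3D pose yield the projected anchor/positive pair (cf. the projection sketch in the introduction). The function name is illustrative.

```python
import numpy as np

def sample_camera_rotation(rng):
    """Sample augmentation angles per Section 3.6 (returned in radians):
    azimuth in [-180, 180] degrees, elevation and roll in [-30, 30] degrees."""
    azimuth = rng.uniform(-np.pi, np.pi)
    elevation = rng.uniform(-np.pi / 6.0, np.pi / 6.0)
    roll = rng.uniform(-np.pi / 6.0, np.pi / 6.0)
    return azimuth, elevation, roll

rng = np.random.default_rng(0)
azimuth, elevation, roll = sample_camera_rotation(rng)
```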

4 Experiments

We demonstrate the performance of our model through pose retrieval across different camera views (Section 4.2). We further show that our embeddings can be directly applied to downstream tasks, such as action recognition (Section 4.3.1) and video alignment (Section 4.3.2), without any additional training.

4.1 Datasets

For all the experiments in this paper, we train only on a subset of the Human3.6M [17] dataset. For the pose retrieval experiments, we validate on the Human3.6M hold-out set and test on another dataset (MPI-INF-3DHP [33]), which is unseen during training.

Table 1: Comparison of cross-view pose retrieval results Hit@k (%) on H3.6M and 3DHP with chest-level cameras and all cameras. * indicates that normalization and Procrustes alignment are performed on query-index pairs.

                       H3.6M               3DHP (Chest)        3DHP (All)
k                     1    10    20       1    10    20       1    10    20
2D keypoints*      28.7  47.1  50.9     5.2  14.0  17.2     9.8  21.6  25.5
3D lifting*        69.0  89.7  92.7    24.9  54.4  62.4    24.6  53.2  61.3
L2-VIPE            73.5  94.2  96.6    23.8  56.7  66.5    18.7  46.3  55.7
L2-VIPE (w/ aug.)  70.4  91.8  94.5    24.9  55.4  63.6    23.7  53.0  61.4
Pr-VIPE            76.2  95.6  97.7    25.4  59.3  69.3    19.9  49.1  58.8
Pr-VIPE (w/ aug.)  73.7  93.9  96.3    28.3  62.3  71.4    26.4  58.6  67.9

We use the matching probability as the retrieval confidence. We present results on the VIPE models with and without camera augmentation. We applied similar camera augmentation to the lifting model, but did not see an improvement in performance. We also show the results of pose retrieval using aligned 2D keypoints only. The poor performance of using input 2D keypoints for retrieval from different views confirms that models must learn view invariance from the inputs for this task.
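For reference, a sketch of the Hit@k evaluation implied by the table: a query counts as a hit if any of its k nearest index items has a matching 3D pose. Ranking by Euclidean distance between embeddings stands in for the model's actual retrieval score (matching probability for Pr-VIPE), and np_mpjpe refers to the sketch in Section 3.1.

```python
import numpy as np

def hit_at_k(query_embs, index_embs, query_poses_3d, index_poses_3d,
             k=1, kappa=0.1):
    """Cross-view pose retrieval Hit@k over a query set."""
    hits = 0
    for z_q, y_q in zip(query_embs, query_poses_3d):
        dists = np.linalg.norm(index_embs - z_q, axis=1)
        top_k = np.argsort(dists)[:k]  # k nearest index embeddings
        if any(np_mpjpe(y_q, index_poses_3d[i]) <= kappa for i in top_k):
            hits += 1
    return hits / len(query_embs)
```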

We also compare with the image-based EpipolarPose model [26]. Please refer to the supplementary materials for the experiment details and results.

4.2.2 Quantitative Results

From Table 1, we see that Pr-VIPE (with augmentation) outperforms all the baselines on H3.6M and 3DHP. The H3.6M results shown are on the hold-out set, and 3DHP is unseen during training, with more diverse poses and views. When we use all the cameras from 3DHP, we evaluate the generalization ability of the models to new poses and new views. When we evaluate using only the 5 chest-level cameras from 3DHP, where the views are more similar to the training set in H3.6M, we mainly evaluate generalization to new poses. Our model is robust to the choice of \beta and the number of samples K (analysis in the supplementary materials).

Table 1 shows that Pr-VIPE without camera augmentation performs better than the baselines on H3.6M and 3DHP (chest-level cameras). This shows that Pr-VIPE generalizes as well as the other baseline methods to new poses. However, for 3DHP (all cameras), the performance of Pr-VIPE without augmentation is worse than with chest-level cameras. This observation indicates that, when trained on chest-level cameras only, Pr-VIPE does not generalize as well to new views. The same can be observed for L2-VIPE between chest-level and all cameras.

In contrast, the 3D lifting models generalize better to new views with the help of the additional Procrustes alignment, which requires an expensive SVD computation for every index-query pair. We further apply camera augmentation to the training of the Pr-VIPE and L2-VIPE models. Note that this step does not require camera parameters or additional ground truth. The results in Table 1 on Pr-VIPE show that the augmentation improves generalization to views unseen during training.

Fig. 3 shows qualitative retrieval results, including poses and views unseen during training, for which the nearest neighbor lies slightly further away in the embedding space. We see that the model can generalize to new views, as the images are taken at camera elevations different from H3.6M. Interestingly, the rightmost pair in row 2 shows that the model can retrieve poses with large differences in roll angle, which is not present in the training set. The rightmost pair in row 3 shows an example of a large NP-MPJPE error due to mis-detection of the left leg in the index pose. We show qualitative results using queries from the H3.6M hold-out set to retrieve from 2DHP in the last two rows of Fig. 3. The results on these in-the-wild images indicate that, as long as the 2D keypoint detector works reliably, our model is able to retrieve poses across views and subjects. More qualitative results are provided in the supplementary materials.

4.3 Downstream Tasks

We show that our pose embedding can be directly applied to pose-based downstream tasks using simple algorithms.

We compare the performance of Pr-VIPE (trained only on H3.6M, with no additional training) on the Penn Action dataset against other approaches trained specifically for each task on the target dataset. In all the following experiments in this section, we compute our Pr-VIPE embeddings on single video frames and use the negative logarithm of the matching probability (7) as the distance between two frames. We then apply temporal averaging within an atrous kernel of size 7 and rate 3 around the two center frames, and use this averaged distance as the frame matching distance. Given the matching distance, we use the standard dynamic time warping (DTW) algorithm to align two action sequences by minimizing the sum of frame matching distances. We further use the average frame matching distance from the alignment as the distance between two video sequences.
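A sketch of this alignment procedure is below. Shifting both center frames jointly in the atrous averaging is one reading of the description above, and the DTW recurrence is the textbook version; both are assumptions about details the text leaves open.

```python
import numpy as np

def frame_matching_distance(neg_log_p, i, j, size=7, rate=3):
    """Temporal averaging of -log p(m | .) within an atrous kernel around (i, j)."""
    offsets = (np.arange(size) - size // 2) * rate  # e.g. -9, -6, ..., 9
    n, m = neg_log_p.shape
    vals = [neg_log_p[i + o, j + o] for o in offsets
            if 0 <= i + o < n and 0 <= j + o < m]
    return float(np.mean(vals))

def dtw_video_distance(frame_dist):
    """Standard DTW over a frame-distance matrix; returns the average frame
    matching distance along the minimum-cost alignment path."""
    n, m = frame_dist.shape
    cost = np.full((n + 1, m + 1), np.inf)
    steps = np.zeros((n + 1, m + 1))
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            parents = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
            pi, pj = min(parents, key=lambda p: cost[p])
            cost[i, j] = frame_dist[i - 1, j - 1] + cost[pi, pj]
            steps[i, j] = steps[pi, pj] + 1   # track path length for averaging
    return cost[n, m] / steps[n, m]
```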

4.3.1 Action Recognition

We evaluate our embeddings for action recognition using nearest neighbor search with the sequence distance described above. Provided with person bounding boxes in each frame, we estimate 2D pose keypoints using [41]. On Penn Action, we use the standard train/test split [36]. Using all the testing videos as queries, we conduct two experiments: (1) we use all training videos as the index to evaluate overall performance and compare with state-of-the-art methods, and (2) we use training videos under only one view as the index and evaluate the effectiveness of our embeddings in terms of view invariance. For this second experiment, actions with zero or only one sample under the index view are ignored, and accuracy is averaged over the different views.

From Table 2, we can see that, without any training on the target domain or the use of image context information, our embeddings achieve highly competitive results on pose-based action classification, outperforming the existing best baseline that uses only pose input, and even some other methods that rely on image context or optical flow. As shown in the last row of Table 2, our embeddings can classify actions from different views using index samples from only one single view with relatively high accuracy, which further demonstrates the advantages of our view-invariant embeddings.

Fig. 5: Ablation study: (a) Top retrievals by 2D NP-MPJPE from the H3.6M hold-out subset for the queries with the largest and smallest variance. 2D poses are shown in the boxes. (b) Relationship between embedding variance and 2D NP-MPJPE to the top-10 nearest 2D pose neighbors from the H3.6M hold-out subset. The orange curve represents the best-fitting 5th-degree polynomial. (c) Comparison of Hit@1 with different embedding dimensions. The 3D lifting baseline predicts 39 dimensions. (d) Relationship between retrieval confidence and matching accuracy.

In terms of Hit@1, VIPE and Pr-VIPE reach 75.4% and 76.2% on H3.6M, and 19.7% and 20.0% on 3DHP, respectively. When we add camera augmentation, the Hit@1 for VIPE and Pr-VIPE is 73.8% and 73.7% on H3.6M, and 26.1% and 26.5% on 3DHP, respectively.

Despite the similar retrieval accuracies, Pr-VIPE is generally more accurate and, more importantly, has an additional desirable property: its variance can model 2D input ambiguity, as discussed next.

A 2D pose is ambiguous if there are similar 2D poses that can be projected from very different poses in 3D. To measure this, we compute the average 2D NP-MPJPE between a 2D pose and its top-10 nearest neighbors in terms of 2D NP-MPJPE. To ensure the 3D poses are different, we sample 1200 poses from the H3.6M hold-out set with a minimum gap of 0.1 3D NP-MPJPE. If a 2D pose has a small 2D NP-MPJPE to its neighbors, there are many similar 2D poses corresponding to different 3D poses, and so the 2D pose is ambiguous. Fig. 5a shows that the 2D pose with the largest variance is ambiguous, as it has similar 2D poses in H3.6M with different 3D poses. In contrast, we see that the closest 2D poses corresponding to the smallest-variance pose in the first row of Fig. 5a are clearly different. Fig. 5b further shows that as the average variance increases, the 2D NP-MPJPE between similar poses generally decreases, which means that 2D poses with larger variances are more ambiguous.
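This ambiguity measure is a few lines on top of the earlier np_mpjpe sketch (which works unchanged on J x 2 inputs); the candidate set is assumed pre-filtered as described above.

```python
import numpy as np

def ambiguity_score(pose_2d, candidate_poses_2d, k=10):
    """Average 2D NP-MPJPE from a 2D pose to its top-k nearest 2D neighbors.

    `candidate_poses_2d` is assumed pre-filtered so the underlying 3D poses
    differ by at least 0.1 3D NP-MPJPE; a smaller score means the 2D pose
    is more ambiguous.
    """
    dists = np.sort([np_mpjpe(pose_2d, p) for p in candidate_poses_2d])
    return float(np.mean(dists[:k]))
```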

Embedding Dimensions. Fig. 5c demonstrates the effect of the embedding dimension on H3.6M and 3DHP. The lifting model lifts 13 2D keypoints to 3D, and therefore has a constant output dimension of 39. We see that Pr-VIPE (with augmentation) achieves a higher accuracy than lifting at 16 dimensions. Additionally, we can increase the number of embedding dimensions to 32, which increases the accuracy of Pr-VIPE from 73.7% to 75.5%.

Retrieval Confidence. In order to validate the retrieval confidence values, we randomly sample 100 queries along with their top-5 retrievals (using the Pr-VIPE retrieval confidence) from each query-index camera pair. This procedure forms 6000 query-retrieval sample pairs for H3.6M (4 views, 12 camera pairs) and 55000 for 3DHP (11 views, 110 camera pairs), which we bin by their retrieval confidences. Fig. 5d shows the matching accuracy for each confidence bin. We can see that the accuracy correlates positively with the confidence values, which suggests that our retrieval confidence is a valid indicator of model performance.

What if 2D keypoint detectors were perfect? We repeat our pose retrieval experiments using ground truth 2D keypoints to simulate a perfect 2D keypoint detector on H3.6M and 3DHP. All experiments use the 4 views from H3.6M for training, following the standard protocol. For the baseline lifting model in the camera frame, we achieve 89.9% Hit@1 on H3.6M, 48.2% on 3DHP (all), and 48.8% on 3DHP (chest). For Pr-VIPE, we achieve 97.5% Hit@1 on H3.6M, 44.3% on 3DHP (all), and 66.4% on 3DHP (chest). These results follow the same trend as using detected keypoint inputs in Table 1. Compared with the results using detected keypoints, the large improvement in performance using ground truth keypoints suggests that a considerable fraction of the error in our model is due to imperfect 2D keypoint detections. Please refer to the supplementary materials for more ablation studies and an embedding space visualization.
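The confidence-binning evaluation described under Retrieval Confidence can be sketched as follows, assuming a boolean match label per query-retrieval pair; the bin count and equal-width binning are our assumptions.

```python
import numpy as np

def accuracy_per_confidence_bin(confidences, is_match, num_bins=10):
    """Bin query-retrieval pairs by retrieval confidence and report matching
    accuracy per bin (the curve plotted in Fig. 5d)."""
    confidences = np.asarray(confidences)
    is_match = np.asarray(is_match, dtype=float)
    edges = np.linspace(confidences.min(), confidences.max(), num_bins + 1)
    bins = np.clip(np.digitize(confidences, edges) - 1, 0, num_bins - 1)
    return {b: float(is_match[bins == b].mean())
            for b in range(num_bins) if np.any(bins == b)}
```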

5 Conclusion

We introduce Pr-VIPE, an approach to learning probabilistic view-invariant embeddings from 2D pose keypoints. By working with 2D keypoints, we can use camera augmentation to improve model generalization to unseen views. We also demonstrate that our probabilistic embedding learns to capture input ambiguity. Pr-VIPE has a simple architecture and can potentially be applied to object and hand poses. For cross-view pose retrieval, 3D pose estimation models require an expensive rigid alignment between each query-index pair, while our embeddings support similarity comparisons in simple Euclidean space. In addition, we demonstrate the effectiveness of our embeddings on the downstream tasks of action recognition and video alignment. Our embedding focuses on a single person; for future work, we will investigate extending it to multiple people, as well as robust models that can handle missing keypoints in the input.

Acknowledgment. We thank Yuxiao Wang, Debidatta Dwibedi, and Liangzhe Yuan from Google Research, Long Zhao from Rutgers University, and Xiao Zhang from the University of Chicago for helpful discussions. We appreciate the support of Pietro Perona, Yisong Yue, and the Computational Vision Lab at Caltech for making this collaboration possible. The author Jennifer J. Sun is supported by NSERC (funding number PGSD3-532647-2019) and Caltech.

24. Kendall, A., Gal, Y.: What uncertainties do we need in Bayesian deep learning for computer vision? In: NeurIPS (2017)

25. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes (2014)
26. Kocabas, M., Karagoz, S., Akbas, E.: Self-supervised learning of 3D human pose using multi-view geometry (2019)
27. LeCun, Y., Huang, F.J., Bottou, L., et al.: Learning methods for generic object recognition with invariance to pose and lighting. In: CVPR (2004)
28. Li, J., Wong, Y., Zhao, Q., Kankanhalli, M.: Unsupervised learning of view-invariant action representations. In: NeurIPS (2018)
29. Liu, J., Akhtar, N., Ajmal, M.: Viewpoint invariant action recognition using RGB-D videos. IEEE Access 6, 70061-70071 (2018)
30. Liu, M., Yuan, J.: Recognizing human actions as the evolution of pose estimation maps. In: CVPR (2018)
31. Luvizon, D.C., Tabia, H., Picard, D.: Multi-task deep learning for real-time 3D human pose estimation and action recognition. arXiv:1912.08077 (2019)
32. Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: ICCV (2017)
33. Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3D human pose estimation in the wild using improved CNN supervision. In: 3DV (2017)
34. Misra, I., Zitnick, C.L., Hebert, M.: Shuffle and learn: Unsupervised learning using temporal order verification. In: ECCV (2016)
35. Mori, G., Pantofaru, C., Kothari, N., Leung, T., Toderici, G., Toshev, A., Yang, W.: Pose embeddings: A deep architecture for learning to match human poses. arXiv:1507.00302 (2015)
36. Nie, B.X., Xiong, C., Zhu, S.C.: Joint action recognition and pose estimation from video. In: CVPR (2015)

37. Oh, S.J., Murphy, K., Pan, J., Roth, J., Schroff, F., Gallagher, A.: Modeling uncertainty with hedged instance embedding. In: ICLR (2019)
38. Oh Song, H., Xiang, Y., Jegelka, S., Savarese, S.: Deep metric learning via lifted structured feature embedding. In: CVPR (2016)
39. Ong, E.J., Micilotta, A.S., Bowden, R., Hilton, A.: Viewpoint invariant exemplar-based 3D human tracking. CVIU (2006)
40. Papandreou, G., Zhu, T., Chen, L.C., Gidaris, S., Tompson, J., Murphy, K.: PersonLab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In: ECCV (2018)
41. Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., Murphy, K.: Towards accurate multi-person pose estimation in the wild. In: CVPR (2017)
42. Parkhi, O.M., Vedaldi, A., Zisserman, A., et al.: Deep face recognition. In: BMVC (2015)
43. Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: CVPR (2019)
44. Qiu, H., Wang, C., Wang, J., Wang, N., Zeng, W.: Cross view fusion for 3D human pose estimation. arXiv:1909.01203 (2019)
45. Rao, C., Shah, M.: View-invariance in action recognition. In: CVPR (2001)
46. Rayat Imtiaz Hossain, M., Little, J.J.: Exploiting temporal information for 3D human pose estimation. In: ECCV (2018)
47. Rhodin, H., Constantin, V., Katircioglu, I., Salzmann, M., Fua, P.: Neural scene decomposition for multi-person motion capture.