WANG, ULLAH, KLÄSER, LAPTEV, SCHMID: FEATURES FOR ACTION RECOGNITION

spatio-temporal locations and scales in video by maximizing specific saliency functions. The detectors usually differ in the type and the sparsity of selected points. Feature descriptors capture shape and motion in the neighborhoods of selected points using image measurements such as spatial or spatio-temporal image gradients and optical flow.

While specific properties of detectors and descriptors have been advocated in the literature, their justification is often insufficient due to the limited and non-comparable experimental evaluations used in current papers. For example, results are frequently presented for different datasets such as the KTH dataset [6, 10, 12, 16, 24, 26, 27], the Weizmann dataset [3, 25] or the aerobic actions dataset [22]. For the common KTH dataset [24], results are often non-comparable due to the different experimental settings used. Furthermore, most of the previous evaluations were reported for actions in controlled environments such as in KTH and Weizmann datasets. It is therefore unclear how these methods generalize to action recognition in realistic setups [16, 23].

Several evaluations of local space-time features have been reported in the past. Laptev [13] evaluated the repeatability of space-time interest points as well as the associated accuracy of action recognition under changes in spatial and temporal video resolution as well as under camera motion. Similarly, Willems et al. [26] evaluated repeatability of detected features under scale changes, in-plane rotations, video compression and camera motion. Local space-time descriptors were evaluated in Laptev et al. [15], where the comparison included families of higher-order derivatives (local jets), image gradients and optical flow. Dollár et al. [6] compared local descriptors in terms of image brightness, gradient and optical flow. Scovanner et al. [25] evaluated the 3D-SIFT descriptor and its two-dimensional variants. Jhuang et al. [10] evaluated local descriptors in terms of the magnitude and orientation of space-time gradients as well as optical flow. Kläser et al. [12] compared space-time HOG descriptor with HOG and HOF descriptors [16]. Willems et al. [26] evaluated the extended SURF descriptor. However, evaluations in these works were usually limited to a single detection or description method as well as to a single dataset.

The current paper overcomes above-mentioned limitations and provides a fair comparison for a number of local space-time detectors and descriptors. We evaluate performance of three space-time interest point detectors and six descriptors along with their combinations on three datasets with varying degree of difficulty. Moreover, we compare with dense features obtained by regular sampling of local space-time patches, as recently excellent results were obtained by dense sampling in the context of object recognition [7, 11]. We, furthermore, investigate the influence of spatial video resolution and shot boundaries on the performance. We also compare methods in terms of their sparsity as well as the speed of available implementations. All experiments are reported for the same bag-of-features SVM recognition framework. Among interesting conclusions, we demonstrate that regular sampling consistently outperforms all tested space-time detectors for human actions in realistic setups. We also demonstrate a consistent ranking for the majority of methods across datasets.

The rest of the paper is organized as follows. In Section 2, we give a detailed presentation of the local spatio-temporal features included in our comparison. Section 3 then presents the experimental setup, i.e., the datasets and the bag-of-features approach used to evaluate the results. Finally, Section 4 compares results obtained for different features and Section 5 concludes the paper with the discussion.

2 Local spatio-temporal video features

This section describes local feature detectors and descriptors used in the following evaluation. Methods were selected based on their use in the literature as well as the availability of the implementation. In all cases we use the original implementation and parameter settings provided by the authors.

2.1 Detectors

The Harris3D detector was proposed by Laptev and Lindeberg in [14], as a space-time extension of the Harris detector [9]. The authors compute a spatio-temporal second-moment matrix at each video point, $\mu(\cdot; \sigma, \tau) = g(\cdot; s\sigma, s\tau) * \left(\nabla L(\cdot; \sigma, \tau)\,(\nabla L(\cdot; \sigma, \tau))^{T}\right)$, using independent spatial and temporal scale values $\sigma$, $\tau$, a separable Gaussian smoothing function $g$, and space-time gradients $\nabla L$. The final locations of space-time interest points are given by local maxima of $H = \det(\mu) - k\,\mathrm{trace}^{3}(\mu)$, $H > 0$. The authors proposed an optional mechanism for spatio-temporal scale selection. This is not used in our experiments, but we use points extracted at multiple scales based on a regular sampling of the scale parameters $\sigma$, $\tau$. This has shown to give promising results in [16]. We use the original implementation available on-line¹ and standard parameter settings $k = 0.0005$, $\sigma^{2} = 4, 8, 16, 32, 64, 128$, $\tau^{2} = 2, 4$.

The Cuboid detector is based on temporal Gabor filters and was proposed by Dollár et al. [6]. The response function has the form $R = (I * g * h_{ev})^{2} + (I * g * h_{od})^{2}$, where $g(x, y; \sigma)$ is the 2D spatial Gaussian smoothing kernel, and $h_{ev}$ and $h_{od}$ are a quadrature pair of 1D Gabor filters which are applied temporally. The Gabor filters are defined by $h_{ev}(t; \tau, \omega) = -\cos(2\pi t \omega)\, e^{-t^{2}/\tau^{2}}$ and $h_{od}(t; \tau, \omega) = -\sin(2\pi t \omega)\, e^{-t^{2}/\tau^{2}}$ with $\omega = 4/\tau$. The two parameters $\sigma$ and $\tau$ of the response function $R$ correspond roughly to the spatial and temporal scale of the detector. Interest points are the local maxima of the response function $R$. We use the code from the authors' website² and detect features using standard scale values $\sigma = 2$, $\tau = 4$.

The Hessian detector was proposed by Willems et al. [26] as a spatio-temporal extension of the Hessian saliency measure used in [2, 18] for blob detection in images. The detector measures the saliency with the determinant of the 3D Hessian matrix. The position and scale of the interest points are simultaneously localized without any iterative procedure. In order to speed up the detector, the authors used approximative box-filter operations on an integral video structure. Each octave is divided into 5 scales, with a ratio between subsequent scales in the range 1.2-1.5 for the inner 3 scales. The determinant of the Hessian is computed over several octaves of both the spatial and temporal scales. A non-maximum suppression algorithm selects joint extrema over space, time and scales: $(x, y, t, \sigma, \tau)$. We use the executables from the authors' website³ and employ the default parameter setting.

Dense sampling extracts video blocks at regular positions and scales in space and time. There are 5 dimensions to sample from: $(x, y, t, \sigma, \tau)$, where $\sigma$ and $\tau$ are the spatial and temporal scale, respectively. In our experiments, the minimum size of a 3D patch is 18 × 18 pixels and 10 frames. (In Section 4.4, we evaluate different spatial patch sizes for dense sampling.) Spatial and temporal sampling are done with 50% overlap. Multi-scale patches

¹ http://www.irisa.fr/vista/Equipe/People/Laptev/download.html#stip
² http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html
³ http://homes.psat.kuleuven.be/~gwillems/research/Hes-STIP/

Figure 1: Sample frames from video sequences of KTH (top), UCF Sports (middle), and Hollywood2 (bottom) human action datasets. Panel labels: Walking, Jogging, Running, Boxing, Waving, Clapping (top); Diving, Kicking, Walking, Skateboarding, High-Bar-Swinging (middle); AnswerPhone, GetOutCar, HandShake, HugPerson, Kiss (bottom).

sums $v = (\sum d_x, \sum d_y, \sum d_t)$ of uniformly sampled responses of the Haar-wavelets $d_x$, $d_y$, $d_t$ along the three axes. We use the executables from the authors' website³ with the default parameters defined in the executable.

3 Experimental setup

In this section we describe the datasets used for the evaluation as well as the evaluation protocol. We evaluate the features in a bag-of-features based action classification task and employ the evaluation measures proposed by the authors of the datasets.

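To make the bag-of-features representation referred to in this section concrete, the following minimal sketch assigns each local descriptor of a video to its nearest visual word and accumulates a normalized histogram. It is an illustration only: the toy vocabulary, the Euclidean distance, the L1 normalization and all function names are assumptions made for this example and are not taken from the evaluation described here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor: Vector, vocabulary: Sequence[Vector]) -> int:
    """Index of the visual word closest to the given descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: squared_distance(descriptor, vocabulary[i]))


def bag_of_features(descriptors: Sequence[Vector],
                    vocabulary: Sequence[Vector]) -> List[float]:
    """L1-normalized histogram of visual-word occurrences for one video."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    if total == 0:
        # A video without any detected feature yields an all-zero histogram.
        return [0.0] * len(vocabulary)
    return [c / total for c in counts]


# Toy data (hypothetical): a 3-word vocabulary of 2-D descriptors.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
histogram = bag_of_features(descriptors, vocabulary)
print(histogram)
```

Under these assumptions, every video is mapped to a vector of fixed length regardless of how many features a detector returns, which is what allows sparse detectors and dense sampling to be fed to the same classifier.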
3.1 Datasets

We carry out our experiments on three different action datasets which we obtained from the authors' websites.

The KTH actions dataset [24]⁵ consists of six human action classes: walking, jogging, running, boxing, waving, and clapping (cf. Figure 1, top). Each action class is performed several times by 25 subjects. The sequences were recorded in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes and indoors. The background is homogeneous and static in most sequences. In total, the data consists of 2391 video samples. We follow the original experimental setup of the authors, i.e., divide the samples into test set (9 subjects: 2, 3, 5, 6, 7, 8, 9, 10, and 22) and training set (the remaining 16 subjects). As in the initial paper [24], we train and evaluate a multi-class classifier and report average accuracy over all classes as performance measure.

The UCF sport actions dataset [23]⁶ contains ten different types of human actions: swinging (on the pommel horse and on the floor), diving, kicking (a ball), weight-lifting, horse-riding, running, skateboarding, swinging (at the high bar), golf swinging and walking (cf. Figure 1, middle). The dataset consists of 150 video samples which show a large intra-class variability. To increase the amount of data samples, we extend the dataset by adding

⁵ Available at http://www.nada.kth.se/cvap/actions/
⁶ Available at http://www.cs.ucf.edu/vision/public_html/

[15] I. Laptev and T. Lindeberg. Local descriptors for spatio-temporal recognition. In First International Workshop on Spatial Coherence for Visual Motion Analysis, LNCS. Springer, 2004.
[16] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[17] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
[18] T. Lindeberg. Feature detection with automatic scale selection. IJCV, 30(2):79-116, 1998.
[19] J. Liu and M. Shah. Learning human actions via information maximization. In CVPR, 2008.
[20] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, 2004.
[21] M. Marszałek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.
[22] A. Oikonomopoulos, I. Patras, and M. Pantic. Spatio-temporal salient points for visual recognition of human actions. IEEE Trans. Systems, Man, and Cybernetics, Part B, 36(3):710-719, 2006.
[23] M. Rodriguez, J. Ahmed, and M. Shah. Action MACH: A spatio-temporal maximum average correlation height filter for action recognition. In CVPR, 2008.
[24] C. Schüldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, 2004.
[25] P. Scovanner, S. Ali, and M. Shah. A 3-dimensional SIFT descriptor and its application to action recognition. In ACM International Conference on Multimedia, 2007.
[26] G. Willems, T. Tuytelaars, and L. Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In ECCV, 2008.
[27] S. F. Wong and R. Cipolla. Extracting spatio-temporal interest points using global information. In ICCV, 2007.
[28] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 73(2):213-238, 2007.