Learning realistic human actions from movies
Ivan Laptev, Marcin Marszałek, Cordelia Schmid

laptev@inria.fr   marcin.marszalek@inria.fr   cordelia.schmid@inria.fr   grurgrur@gmail.com

Abstract

The aim of this paper is to address recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been [...]
[1:13:41-1:13:45] A black car pulls up. Two army officers get out.

Figure 3. Evaluation of script-based action annotation. Left: Precision of action annotation evaluated on visual ground truth. Right: Example of a visual false positive for "get out of a car".

[...] scripts and movies. To investigate this issue, we manually annotated several hundreds of actions in 12 movie scripts and verified them on the visual ground truth. Of 147 actions with correct text alignment (a=1), only 70% matched the video. The rest of the samples were either misaligned in time (10%), outside the field of view (10%) or completely missing in the video (10%). Misalignment of subtitles (a<1) further decreases the visual precision, as illustrated in figure 3 (left). Figure 3 (right) shows a typical example of a "visual false positive" for the action "get out of a car" occurring outside the field of view of the camera.

2.2. Text retrieval of human actions

Expressions for human actions in text may have a considerable within-class variability. The following examples illustrate variations in expressions for the "GetOutCar" action: "Will gets out of the Chevrolet.", "A black car pulls up. Two army officers get out.", "Erin exits her new truck.". Furthermore, false positives might be difficult to distinguish from positives, see examples for the "SitDown" action: "About to sit down, he freezes.", "Smiling, he turns to sit down. But the smile dies on his face when he finds his place occupied by Ellie.". Text-based action retrieval, hence, is a non-trivial task that might be difficult to solve by the simple keyword search commonly used for retrieving images of objects, e.g. in [14].

To cope with the variability of text describing human actions, we adopt a machine learning based text classification approach [16]. A classifier labels each scene description in scripts as containing the target action or not. The implemented approach relies on the bag-of-features model, where each scene description is represented as a sparse vector in a high-dimensional feature space. As features we use words, adjacent pairs of words, and non-adjacent pairs of words occurring within a small window of N words, where N varies between 2 and 8. Features supported by fewer than three training documents are removed. For the classification we use a regularized perceptron [21], which is equivalent to a support vector machine. The classifier is trained on a manually labeled set of scene descriptions, and the parameters (regularization constant, window size N, and the acceptance threshold) are tuned using a validation set.

Figure 4. Results of retrieving eight classes of human actions from scripts using the regularized perceptron classifier (left) and regular expression matching (right).
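To make the feature construction concrete, here is a minimal sketch of the text classification step described above: each scene description becomes a sparse bag of words, adjacent word pairs, and non-adjacent pairs within a window of N words, rare features are dropped, and a linear classifier is trained. The helper names are illustrative and scikit-learn's Perceptron stands in for the regularized perceptron of [21]; this is not the authors' implementation.

```python
# Minimal sketch of the script text classifier; helper names are illustrative,
# and sklearn's Perceptron stands in for the regularized perceptron of [21].
from collections import Counter
from itertools import combinations

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import Perceptron


def text_features(scene_description, window=4):
    """Words, adjacent word pairs, and non-adjacent word pairs within a `window`-word span."""
    words = scene_description.lower().split()
    feats = Counter(words)                                      # single words
    feats.update('_'.join(p) for p in zip(words, words[1:]))    # adjacent pairs
    for i, j in combinations(range(len(words)), 2):             # non-adjacent pairs in a window
        if 1 < j - i < window:
            feats[words[i] + '__' + words[j]] += 1
    return feats


def train_action_text_classifier(descriptions, labels, window=4, min_df=3):
    """Train a binary 'contains target action' classifier on scene descriptions."""
    counts = [text_features(d, window) for d in descriptions]
    support = Counter(f for c in counts for f in c)             # document frequency per feature
    counts = [{f: v for f, v in c.items() if support[f] >= min_df} for c in counts]
    vectorizer = DictVectorizer()                               # sparse high-dimensional vectors
    X = vectorizer.fit_transform(counts)
    classifier = Perceptron(penalty='l2', alpha=1e-4).fit(X, labels)
    return vectorizer, classifier
```

At prediction time a new scene description would be vectorized with the same vectorizer and scored by the classifier (e.g. via its decision function), and the tuned acceptance threshold mentioned above would be applied to that score.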
We evaluate text-based action retrieval on the eight classes of movie actions that we use throughout this paper: AnswerPhone, GetOutCar, HandShake, HugPerson, Kiss, SitDown, SitUp, StandUp. The text test set contains 397 action samples and over 17K non-action samples from 12 manually annotated movie scripts. The text training set was sampled from a large set of scripts different from the test set. We compare results obtained by the regularized perceptron classifier and by matching regular expressions which were manually tuned to expressions of human actions in text. The results in figure 4 clearly confirm the benefits of the text classifier. The average precision-recall values over all actions are [prec. 0.95 / rec. 0.91] for the text classifier versus [prec. 0.55 / rec. 0.88] for regular expression matching.

2.3. Video datasets for human actions

We construct two video training sets, a manual and an automatic one, as well as a video test set. They contain video clips for our eight classes of movie actions (see the top row of figure 10 for illustration). In all cases we first apply automatic script alignment as described in section 2.1. For the clean, manual dataset as well as for the test set, we manually select visually correct samples from the set of manually text-annotated actions in scripts. The automatic dataset contains training samples that have been retrieved automatically from scripts by the text classifier described in section 2.2. We limit the automatic training set to actions with an alignment score a > 0.5 and a video length of less than 1000 frames. Our manual and automatic training sets contain action video sequences from 12 movies² and the test set contains actions from 20 different movies³. Our datasets, i.e., the video clips and the corresponding annotations, are available at http://www.irisa.fr/vista/actions.

² "American Beauty", "Being John Malkovich", "Big Fish", "Casablanca", "The Crying Game", "Double Indemnity", "Forrest Gump", "The Godfather", "I Am Sam", "Independence Day", "Pulp Fiction" and "Raising Arizona".
³ "As Good As It Gets", "Big Lebowski", "Bringing Out The Dead", "The Butterfly Effect", "Dead Poets Society", "Erin Brockovich", "Fargo", "Gandhi", "The Graduate", "Indiana Jones And The Last Crusade", "It's A Wonderful Life", "Kids", "LA Confidential", "The Lord of the Rings: Fellowship of the Ring", "Lost Highway", "The Lost Weekend", "Mission To Mars", "Naked City", "The Pianist" and "Reservoir Dogs".

Table 1. The number of action labels in the automatic training set (top), clean/manual training set (middle) and test set (bottom).
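A small sketch of the selection rule for the automatic training set described above (alignment score a > 0.5 and clip length under 1000 frames); the record fields are assumptions for illustration, not the authors' data format.

```python
# Hypothetical record layout for samples retrieved by the text classifier;
# only the two thresholds (a > 0.5, fewer than 1000 frames) come from the text above.
from typing import List, NamedTuple


class RetrievedSample(NamedTuple):
    movie: str
    start_frame: int
    end_frame: int
    action: str
    alignment_score: float      # script/subtitle alignment score 'a' from section 2.1


def build_automatic_training_set(samples: List[RetrievedSample],
                                 min_alignment: float = 0.5,
                                 max_frames: int = 1000) -> List[RetrievedSample]:
    """Keep samples with alignment score a > 0.5 and a video length below 1000 frames."""
    return [s for s in samples
            if s.alignment_score > min_alignment
            and (s.end_frame - s.start_frame) < max_frames]
```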
The objective of having two training sets is to evaluate recognition of actions both in a supervised setting and with automatically generated training samples. Note that no manual annotation is performed either for scripts or for videos used in the automatic training set. The distribution of action labels for the different subsets and action classes is given in table 1. We can observe that the proportion of correctly labeled videos in the automatic set is 60%. Most of the wrong labels result from the script-video misalignment, and a few additional errors come from the text classifier. The problem of classification in the presence of wrong training labels is addressed in section 4.3.

3. Video classification for action recognition

This section presents our approach for action classification. It builds on existing bag-of-features approaches for video description [3, 13, 15] and extends recent advances in static image classification to videos [2, 9, 12]. Lazebnik et al. [9] showed that a spatial pyramid, i.e., a coarse description of the spatial layout of the scene, improves recognition. Successful extensions of this idea include the optimization of weights for the individual pyramid levels [2] and the use of more general spatial grids [12]. Here we build on these ideas and go a step further by building space-time grids. The details of our approach are described in the following.

3.1. Space-time features

Sparse space-time features have recently shown good performance for action recognition [3, 6, 13, 15]. They provide a compact video representation and tolerance to background clutter, occlusions and scale changes. Here we follow [7] and detect interest points using a space-time extension of the Harris operator. However, instead of performing scale selection as in [7], we use a multi-scale approach and extract features at multiple levels of spatio-temporal scales (σᵢ², τⱼ²) with σᵢ = 2^((1+i)/2), i = 1,…,6 and τⱼ = 2^(j/2), j = 1, 2. This choice is motivated by the reduced computational complexity, the independence from scale selection artifacts and the recent evidence of good recognition performance using dense scale sampling. We also eliminate detections due to artifacts at shot boundaries [11]. Interest points detected for two frames with human actions are illustrated in figure 5.

Figure 5. Space-time interest points detected for two video frames with the human actions hand shake (left) and get out car (right).
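The multi-scale detection described above can be sketched as follows; `spacetime_harris` is a hypothetical placeholder for a space-time Harris detector in the spirit of [7], and only the enumeration of scale pairs and the shot-boundary filtering follow the text.

```python
# Sketch of the multi-scale sampling of spatio-temporal scales; `spacetime_harris`
# is a hypothetical placeholder for a space-time Harris detector in the spirit of [7].
import numpy as np

SPATIAL_SCALES = [2 ** ((1 + i) / 2) for i in range(1, 7)]   # sigma_i, i = 1..6
TEMPORAL_SCALES = [2 ** (j / 2) for j in range(1, 3)]        # tau_j,   j = 1, 2


def detect_multiscale(video: np.ndarray, spacetime_harris, shot_boundaries=()):
    """video: (T, H, W) grayscale array; returns (x, y, t, sigma, tau) interest points."""
    points = []
    for sigma in SPATIAL_SCALES:
        for tau in TEMPORAL_SCALES:
            # placeholder detector: expected to return local maxima of the
            # space-time Harris function at scales (sigma^2, tau^2)
            for x, y, t in spacetime_harris(video, sigma2=sigma ** 2, tau2=tau ** 2):
                # discard detections caused by shot-boundary artifacts [11]
                if all(abs(t - b) > 2 * tau for b in shot_boundaries):
                    points.append((x, y, t, sigma, tau))
    return points
```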
To characterize the motion and appearance of local features, we compute histogram descriptors of space-time volumes in the neighborhood of the detected points. The size of each volume (Δx, Δy, Δt) is related to the detection scales by Δx, Δy = 2kσ, Δt = 2kτ. Each volume is subdivided into an (nx, ny, nt) grid of cuboids; for each cuboid we compute coarse histograms of oriented gradient (HoG) and optic flow (HoF). Normalized histograms are concatenated into HoG and HoF descriptor vectors, similar in spirit to the well-known SIFT descriptor. We use the parameter values k = 9, nx, ny = 3, nt = 2.

3.2. Spatio-temporal bag-of-features

Given a set of spatio-temporal features, we build a spatio-temporal bag-of-features (BoF). This requires the construction of a visual vocabulary. In our experiments we cluster a subset of 100k features sampled from the training videos with the k-means algorithm. The number of clusters is set to k = 4000, which has been shown empirically to give good results and is consistent with the values used for static image classification. The BoF representation then assigns each feature to the closest vocabulary word (we use Euclidean distance) and computes the histogram of visual word occurrences over a space-time volume corresponding either to the entire video sequence or to subsequences defined by a spatio-temporal grid. If there are several subsequences, the different histograms are concatenated into one vector and then normalized. In the spatial dimensions we use a 1x1 grid (corresponding to the standard BoF representation), a 2x2 grid (shown to give excellent results in [9]), a horizontal h3x1 grid [12] as well as a vertical v1x3 one. Moreover, we implemented a denser 3x3 grid and a center-focused o2x2 grid where neighboring cells overlap by 50% of their width and height. For the temporal dimension we subdivide the video sequence into 1 to 3 non-overlapping temporal bins, ...

  Task               HoG BoF   HoF BoF   Best channel           Best combination
  KTH multi-class    81.6%     89.7%     91.1% (hof h3x1 t3)    91.8% (hof 1 t2, hog 1 t3)
  Action AnswerPhone 13.4%     24.6%     26.7% (hof h3x1 t3)    32.1% (hof o2x2 t1, hof h3x1 t3)
  Action GetOutCar   21.9%     14.9%     22.5% (hof o2x2 1)     41.5% (hof o2x2 t1, hog h3x1 t1)
  Action HandShake   18.6%     12.1%     23.7% (hog h3x1 1)     32.3% (hog h3x1 t1, hog o2x2 t3)
  Action HugPerson   29.1%     17.4%     34.9% (hog h3x1 t2)    40.6% (hog 1 t2, hog o2x2 t2, hog h3x1 t2)
  Action Kiss        52.0%     36.5%     52.0% (hog 1 1)        53.3% (hog 1 t1, hof 1 t1, hof o2x2 t1)
  Action SitDown     29.1%     20.7%     37.8% (hog 1 t2)       38.6% (hog 1 t2, hog 1 t3)
  Action SitUp        6.5%      5.7%     15.2% (hog h3x1 t2)    18.2% (hog o2x2 t1, hog o2x2 t2, hog h3x1 t2)
  Action StandUp     45.4%     40.0%     45.4% (hog 1 1)        50.5% (hog 1 t1, hof 1 t2)

Table 2. Classification performance of different channels and their combinations. For the KTH dataset the average class accuracy is reported, whereas for our manually cleaned movie dataset the per-class average precision (AP) is given.
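The spatio-temporal bag-of-features of section 3.2 can be sketched as below: descriptors are assigned to their nearest vocabulary word, per-cell histograms are accumulated over a spatio-temporal grid, and the concatenated histogram is normalized. The grid indexing and variable names are simplifying assumptions, not the authors' code.

```python
# Sketch of the spatio-temporal bag-of-features; grid handling is a simplifying assumption.
import numpy as np
from scipy.cluster.vq import kmeans2, vq


def build_vocabulary(descriptors, k=4000, seed=0):
    """Cluster a subset of training descriptors (e.g. 100k of them) into k visual words."""
    centroids, _ = kmeans2(descriptors.astype(np.float64), k, minit='points', seed=seed)
    return centroids


def spatio_temporal_bof(descriptors, positions, video_shape, vocabulary,
                        spatial_grid=(1, 1), temporal_bins=1):
    """
    descriptors: (N, D) HoG or HoF descriptors; positions: (N, 3) rows of (x, y, t);
    video_shape: (width, height, n_frames). Returns the concatenated, normalized histogram.
    """
    words, _ = vq(descriptors.astype(np.float64), vocabulary)   # nearest word, Euclidean distance
    gx, gy = spatial_grid
    width, height, n_frames = video_shape
    cells = np.zeros((gx, gy, temporal_bins, len(vocabulary)))
    for word, (x, y, t) in zip(words, positions):
        ix = min(int(x / width * gx), gx - 1)
        iy = min(int(y / height * gy), gy - 1)
        it = min(int(t / n_frames * temporal_bins), temporal_bins - 1)
        cells[ix, iy, it, word] += 1                            # per-cell word histogram
    histogram = cells.reshape(-1)
    return histogram / max(histogram.sum(), 1.0)                # concatenate and normalize
```

A 1x1 grid with a single temporal bin reduces to the standard BoF histogram; finer spatial grids and temporal binnings simply concatenate one histogram per cell.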
[...] channels further improves the accuracy. Interestingly, HoGs perform better than HoFs for all real-world actions except answering the phone. The inverse holds for KTH actions. This shows that context and image content play a large role in realistic settings, while simple actions can be very well characterized by their motion only. Furthermore, HoG features also capture motion information to some extent through their local temporal binning.

In more detail, the optimized combinations for sitting down and standing up do not make use of spatial grids, which can be explained by the fact that these actions can occur anywhere in the scene. On the other hand, temporal binning does not help in the case of kissing, for which a high variability with respect to the temporal extent can be observed. For getting out of a car, hand shaking and hugging, a combination of an h3x1 and an o2x2 spatial grid is successful. This could be due to the fact that those actions are usually pictured either in a wide setting (where a scene-aligned grid should work) or as a close-up (where a uniform grid should perform well). The optimized combinations determined in this section, cf. table 2, are used in the remainder of the experimental section.

4.2. Comparison to the state-of-the-art

We compare our work to the state-of-the-art on the KTH actions dataset [15], see figure 8. It contains six types of human actions, namely walking, jogging, running, boxing, hand waving and hand clapping, performed several times by 25 subjects. The sequences were taken in four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes and indoors. Note that in all cases the background is homogeneous. The dataset contains a total of 2391 sequences. We follow the experimental setup of Schuldt et al. [15] with sequences divided into the training/validation set (8+8 people) and the test set (9 people). The best performing channel combination, reported in the previous section, was determined by 10-fold cross-validation on the combined training+validation set. Results are reported for this combination on the test set.

Figure 8. Sample frames from the KTH actions sequences. All six classes (columns: walking, jogging, running, boxing, waving, clapping) and scenarios (rows) are presented.

  Method     Schuldt et al. [15]   Niebles et al. [13]   Wong et al. [18]   ours
  Accuracy   71.7%                 81.5%                 86.7%              91.8%

Table 3. Average class accuracy on the KTH actions dataset.

Table 3 compares the average class accuracy of our method with the results reported by other researchers. Compared to the existing approaches, our method shows significantly better performance, outperforming the state-of-the-art in the same setup. The confusion matrix for our method is given in table 4. Interestingly, the major confusion occurs between jogging and running.

             Walking  Jogging  Running  Boxing  Waving  Clapping
  Walking    .99      .01      .00      .00     .00     .00
  Jogging    .04      .89      .07      .00     .00     .00
  Running    .01      .19      .80      .00     .00     .00
  Boxing     .00      .00      .00      .97     .00     .03
  Waving     .00      .00      .00      .00     .91     .09
  Clapping   .00      .00      .00      .05     .00     .95

Table 4. Confusion matrix for the KTH actions.
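For completeness, a generic sketch of the evaluation behind tables 3 and 4: the average class accuracy is the mean of the per-class recalls, i.e. the mean of the diagonal of the row-normalized confusion matrix. Variable names are assumed and the sketch is independent of the classification pipeline.

```python
import numpy as np


def confusion_and_accuracy(y_true, y_pred, classes):
    """Row-normalized confusion matrix (rows = true class) and average class accuracy."""
    index = {c: i for i, c in enumerate(classes)}
    confusion = np.zeros((len(classes), len(classes)))
    for true, pred in zip(y_true, y_pred):
        confusion[index[true], index[pred]] += 1
    confusion /= confusion.sum(axis=1, keepdims=True)      # each row sums to 1
    return confusion, float(np.mean(np.diag(confusion)))   # mean of per-class recalls


# Hypothetical usage with the KTH class names:
# conf, acc = confusion_and_accuracy(test_labels, predictions,
#     ["walking", "jogging", "running", "boxing", "waving", "clapping"])
```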
Note that the results obtained by Jhuang et al. [6] and Wong et al. [19] are not comparable to ours, as they are based on non-standard experimental setups: they either use more training data or decompose the problem into simpler tasks.

4.3. Robustness to noise in the training data

Training with automatically retrieved samples avoids the high cost of manual data annotation. Yet, this goes hand in hand with the problem of wrong labels in the training set. In this section we evaluate the robustness of our action classification approach to labeling errors in the training set. Figure 9 shows the recognition accuracy as a function of the probability p of a label being wrong. Training for p=0 is performed with the original labels, whereas with p=1 all training labels are wrong. The experimental results are obtained for the KTH dataset with the same setup as described in subsection 4.2. Different wrong labelings are generated and evaluated 20 times for each p; the average accuracy and its variance are reported.

The experiment shows that the performance of our method degrades gracefully in the presence of labeling errors. Up to p=0.2 the performance decreases insignificantly, i.e., by less than two percent. At p=0.4 the performance decreases by around 10%. We can, therefore, predict a very good performance for the proposed automatic training scenario, where the observed amount of wrong labels is around 40%. Note that we have observed a comparable level of resistance to labeling errors when evaluating an image classification method on the natural-scene images of the PASCAL VOC'07 challenge dataset.

Figure 9. Performance of our video classification approach in the presence of wrong labels. Results are reported for the KTH dataset.
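A minimal sketch of the label-noise experiment above, assuming a user-supplied `train_and_evaluate` function in place of the paper's classifier: for each probability p, a fraction of training labels is flipped to a different class and accuracy statistics are collected over repeated corruptions.

```python
# Sketch of the label-noise experiment; `train_and_evaluate` is a user-supplied function
# (train_x, train_y, test_x, test_y) -> accuracy, standing in for the actual classifier.
import numpy as np


def corrupt_labels(labels, p, n_classes, rng):
    """With probability p, replace an integer class label by a different class, uniformly."""
    labels = np.asarray(labels).copy()
    for i in np.flatnonzero(rng.random(len(labels)) < p):
        labels[i] = rng.choice([c for c in range(n_classes) if c != labels[i]])
    return labels


def noise_robustness_curve(train_x, train_y, test_x, test_y, n_classes,
                           train_and_evaluate, repeats=20, seed=0):
    """Mean and variance of test accuracy as a function of the wrong-label probability p."""
    rng = np.random.default_rng(seed)
    curve = {}
    for p in np.linspace(0.0, 1.0, 11):
        accuracies = [train_and_evaluate(train_x,
                                         corrupt_labels(train_y, p, n_classes, rng),
                                         test_x, test_y)
                      for _ in range(repeats)]
        curve[round(float(p), 1)] = (np.mean(accuracies), np.var(accuracies))
    return curve
```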
4.4. Action recognition in real-world videos

In this section we report action classification results for real-world videos, i.e., for our test set with 217 videos. Training is performed with the clean, manual dataset as well as with the automatically annotated one, see section 2.3 for details.

  Action        Clean    Automatic   Chance
  AnswerPhone   32.1%    16.4%       10.6%
  GetOutCar     41.5%    16.4%        6.0%
  HandShake     32.3%     9.9%        8.8%
  HugPerson     40.6%    26.8%       10.1%
  Kiss          53.3%    45.1%       23.5%
  SitDown       38.6%    24.8%       13.8%
  SitUp         18.2%    10.4%        4.6%
  StandUp       50.5%    33.6%       22.6%

Table 5. Average precision (AP) for each action class of our test set. We compare results for clean (annotated) and automatic training data. We also show results for a random classifier (chance).

We train a classifier for each action as being present or not, following the evaluation procedure of [5]. The performance is evaluated with the average precision (AP) of the precision/recall curve. We use the optimized combination of spatio-temporal grids from section 4.1. Table 5 presents the AP values for the two training sets and for a random classifier, referred to as chance AP.

The classification results are good for the manual training set and lower for the automatic one. However, for all classes except "HandShake" the automatic training obtains results significantly above chance level. This shows that an automatically trained system can successfully recognize human actions in real-world videos. For kissing, the performance loss between automatic and manual annotations is minor. This suggests that the main difficulty with our automatic approach is the low number of correctly labeled examples and not the percentage of wrong labels. This problem could be avoided by using a larger database of movies, which we plan to address in the future.

Figure 10 shows some example results obtained by our approach trained with automatically annotated data. We display key frames of test videos for which classification obtained the highest confidence values. The two top rows show true positives and true negatives. Note that despite being highly scored by our method, these samples are far from trivial: the videos show a large variability of scale, viewpoint and background. The two bottom rows show wrongly classified videos. Among the false positives, many display features not unusual for the classified action, for example the rapid getting up is typical for "GetOutCar" and the stretched hands are typical for "HugPerson". Most of the false negatives are very difficult to recognize, see for example the occluded handshake or the hardly visible person getting out of the car.

5. Conclusion

This paper has presented an approach for automatically collecting training data for human actions and has shown that this data can be used to train a classifier for action recognition. Our approach for automatic annotation [...]