Learning realistic human actions from movies

Ivan Laptev, Marcin Marszałek, Cordelia Schmid

laptev@inria.fr   marcin.marszalek@inria.fr   cordelia.schmid@inria.fr   grurgrur@gmail.com

Abstract

The aim of this paper is to address recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been …

Figure 3. Evaluation of script-based action annotation. Left: precision of action annotation evaluated on visual ground truth. Right: example of a visual false positive for "get out of a car" (script excerpt: [1:13:41-1:13:45] "A black car pulls up. Two army officers get out.").

… scripts and movies. To investigate this issue, we manually annotated several hundreds of actions in 12 movie scripts and verified these on the visual ground truth. From 147 actions with correct text alignment (a=1) only 70% did match with the video. The rest of the samples either were misaligned in time (10%), were outside the field of view (10%) or were completely missing in the video (10%). Misalignment of subtitles (a<1) further decreases the visual precision as illustrated in figure 3 (left). Figure 3 (right) shows a typical example of a "visual false positive" for the action "get out of a car" occurring outside the field of view of the camera.

2.2. Text retrieval of human actions

Expressions for human actions in text may have a considerable within-class variability. The following examples illustrate variations in expressions for the "GetOutCar" action: "Will gets out of the Chevrolet.", "A black car pulls up. Two army officers get out.", "Erin exits her new truck.". Furthermore, false positives might be difficult to distinguish from positives, see examples for the "SitDown" action: "About to sit down, he freezes.", "Smiling, he turns to sit down. But the smile dies on his face when he finds his place occupied by Ellie.". Text-based action retrieval, hence, is a non-trivial task that might be difficult to solve by a simple keyword search such as commonly used for retrieving images of objects, e.g. in [14].

To cope with the variability of text describing human actions, we adopt a machine-learning-based text classification approach [16]. A classifier labels each scene description in scripts as containing the target action or not. The implemented approach relies on the bag-of-features model, where each scene description is represented as a sparse vector in a high-dimensional feature space. As features we use words, adjacent pairs of words, and non-adjacent pairs of words occurring within a small window of N words, where N varies between 2 and 8. Features supported by less than three training documents are removed. For the classification we use a regularized perceptron [21], which is equivalent to a support vector machine. The classifier is trained on a manually labeled set of scene descriptions, and the parameters (regularization constant, window size N, and the acceptance threshold) are tuned using a validation set.

Figure 4. Results of retrieving eight classes of human actions from scripts using the regularized perceptron classifier (left) and regular expression matching (right).
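As a rough illustration of this text-classification step, the sketch below (not the authors' implementation) builds word, adjacent-pair and windowed-pair features with scikit-learn and trains an L2-regularized perceptron-style linear classifier as a stand-in for the regularized perceptron of [21]. The feature naming scheme, the window parameter and the hyperparameter values are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

def pair_features(doc, window=4):
    """Features in the spirit of the description above: single words,
    adjacent word pairs, and non-adjacent word pairs co-occurring within a
    small window (the window size plays the role of N in the text)."""
    words = doc.lower().split()
    feats = list(words)                                        # single words
    feats += [f"{a}_{b}" for a, b in zip(words, words[1:])]    # adjacent pairs
    for i, a in enumerate(words):                              # non-adjacent pairs
        for b in words[i + 2 : i + window]:
            feats.append(f"{a}__{b}")
    return feats

# min_df=3 drops features supported by fewer than three training documents;
# the perceptron loss with an L2 penalty approximates a regularized perceptron.
model = make_pipeline(
    CountVectorizer(analyzer=pair_features, min_df=3),
    SGDClassifier(loss="perceptron", penalty="l2", alpha=1e-4),
)

# Hypothetical usage: scene_descriptions is a list of script scene descriptions,
# labels is 1 if the scene contains the target action and 0 otherwise.
# model.fit(scene_descriptions, labels)
# predictions = model.predict(new_scene_descriptions)
```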
We evaluate text-based action retrieval on our eight classes of movie actions that we use throughout this paper: AnswerPhone, GetOutCar, HandShake, HugPerson, Kiss, SitDown, SitUp, StandUp. The text test set contains 397 action samples and over 17K non-action samples from 12 manually annotated movie scripts. The text training set was sampled from a large set of scripts different from the test set. We compare results obtained by the regularized perceptron classifier and by matching regular expressions which were manually tuned to expressions of human actions in text. The results in figure 4 very clearly confirm the benefits of the text classifier. The average precision-recall values for all actions are [prec. 0.95 / rec. 0.91] for the text classifier versus [prec. 0.55 / rec. 0.88] for regular expression matching.

2.3. Video datasets for human actions

We construct two video training sets, a manual and an automatic one, as well as a video test set. They contain video clips for our eight classes of movie actions (see top row of figure 10 for illustration). In all cases we first apply automatic script alignment as described in section 2.1. For the clean, manual dataset as well as the test set we manually select visually correct samples from the set of manually text-annotated actions in scripts. The automatic dataset contains training samples that have been retrieved automatically from scripts by the text classifier described in section 2.2. We limit the automatic training set to actions with an alignment score a > 0.5 and a video length of less than 1000 frames. Our manual and automatic training sets contain action video sequences from 12 movies² and the test set actions from 20 different movies³. Our datasets, i.e., the video clips and the corresponding annotations, are available at http://www.irisa.fr/vista/actions.

² "American Beauty", "Being John Malkovich", "Big Fish", "Casablanca", "The Crying Game", "Double Indemnity", "Forrest Gump", "The Godfather", "I Am Sam", "Independence Day", "Pulp Fiction" and "Raising Arizona".
³ "As Good As It Gets", "Big Lebowski", "Bringing Out The Dead", "The Butterfly Effect", "Dead Poets Society", "Erin Brockovich", "Fargo", "Gandhi", "The Graduate", "Indiana Jones And The Last Crusade", "It's A Wonderful Life", "Kids", "LA Confidential", "The Lord of the Rings: Fellowship of the Ring", "Lost Highway", "The Lost Weekend", "Mission To Mars", "Naked City", "The Pianist" and "Reservoir Dogs".

Table 1. The number of action labels in automatic training set (top), clean/manual training set (middle) and test set (bottom).
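As a small illustration of how the automatic training set could be assembled from the retrieved samples, the snippet below applies the two filters stated above; the clip records and their field layout are invented for illustration, only the thresholds (alignment score above 0.5, fewer than 1000 frames) come from the text.

```python
# Hypothetical clip records: (movie, start_frame, end_frame, alignment_score),
# as produced by script alignment (section 2.1) and the text classifier (section 2.2).
retrieved = [
    ("Forrest Gump",  1200, 1450, 0.92),
    ("Casablanca",     300, 1600, 0.41),   # alignment too uncertain
    ("Pulp Fiction",  5000, 6400, 0.77),   # 1400 frames: clip too long
]

automatic_training_set = [
    clip for clip in retrieved
    if clip[3] > 0.5 and (clip[2] - clip[1]) < 1000
]
print(automatic_training_set)   # keeps only the "Forrest Gump" sample
```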
The objective of having two training sets is to evaluate recognition of actions both in a supervised setting and with automatically generated training samples. Note that no manual annotation is performed for either the scripts or the videos used in the automatic training set. The distribution of action labels for the different subsets and action classes is given in table 1. We can observe that the proportion of correctly labeled videos in the automatic set is 60%. Most of the wrong labels result from the script-video misalignment and a few additional errors come from the text classifier. The problem of classification in the presence of wrong training labels will be addressed in section 4.3.

3. Video classification for action recognition

This section presents our approach for action classification. It builds on existing bag-of-features approaches for video description [3, 13, 15] and extends recent advances in static image classification to videos [2, 9, 12]. Lazebnik et al. [9] showed that a spatial pyramid, i.e., a coarse description of the spatial layout of the scene, improves recognition. Successful extensions of this idea include the optimization of weights for the individual pyramid levels [2] and the use of more general spatial grids [12]. Here we build on these ideas and go a step further by building space-time grids. The details of our approach are described in the following.

3.1. Space-time features

Sparse space-time features have recently shown good performance for action recognition [3, 6, 13, 15]. They provide a compact video representation and tolerance to background clutter, occlusions and scale changes. Here we follow [7] and detect interest points using a space-time extension of the Harris operator. However, instead of performing scale selection as in [7], we use a multi-scale approach and extract features at multiple levels of spatio-temporal scales (σi², τj²) with σi = 2^((1+i)/2), i = 1,...,6 and τj = 2^(j/2), j = 1, 2. This choice is motivated by the reduced computational complexity, the independence from scale selection artifacts and the recent evidence of good recognition performance using dense scale sampling. We also eliminate detections due to artifacts at shot boundaries [11]. Interest points detected for two frames with human actions are illustrated in figure 5.

Figure 5. Space-time interest points detected for two video frames with human actions handshake (left) and get out car (right).
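For concreteness, the following minimal sketch (not taken from the paper) enumerates the 6 x 2 = 12 spatio-temporal scale levels implied by the formulas for σi and τj above.

```python
# Spatial scales sigma_i = 2^((1+i)/2) for i = 1..6 and
# temporal scales tau_j = 2^(j/2) for j = 1..2, as described above.
spatial_scales = [2 ** ((1 + i) / 2) for i in range(1, 7)]   # sigma_i
temporal_scales = [2 ** (j / 2) for j in range(1, 3)]        # tau_j

# Features are extracted at every combination (sigma_i^2, tau_j^2),
# i.e. 6 x 2 = 12 spatio-temporal scale levels per video.
scale_levels = [(s ** 2, t ** 2) for s in spatial_scales for t in temporal_scales]

for s2, t2 in scale_levels:
    print(f"spatial variance {s2:7.2f}, temporal variance {t2:5.2f}")
```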
To characterize motion and appearance of local features, we compute histogram descriptors of space-time volumes in the neighborhood of detected points. The size of each volume (∆x, ∆y, ∆t) is related to the detection scales by ∆x, ∆y = 2kσ, ∆t = 2kτ. Each volume is subdivided into a (nx, ny, nt) grid of cuboids; for each cuboid we compute coarse histograms of oriented gradient (HoG) and optic flow (HoF). Normalized histograms are concatenated into HoG and HoF descriptor vectors and are similar in spirit to the well-known SIFT descriptor. We use parameter values k=9, nx, ny = 3, nt = 2.

3.2. Spatio-temporal bag-of-features

Given a set of spatio-temporal features, we build a spatio-temporal bag-of-features (BoF). This requires the construction of a visual vocabulary. In our experiments we cluster a subset of 100k features sampled from the training videos with the k-means algorithm. The number of clusters is set to k=4000, which has shown empirically to give good results and is consistent with the values used for static image classification. The BoF representation then assigns each feature to the closest (we use Euclidean distance) vocabulary word and computes the histogram of visual word occurrences over a space-time volume corresponding either to the entire video sequence or to subsequences defined by a spatio-temporal grid. If there are several subsequences, the different histograms are concatenated into one vector and then normalized. In the spatial dimensions we use a 1x1 grid (corresponding to the standard BoF representation), a 2x2 grid (shown to give excellent results in [9]), a horizontal h3x1 grid [12] as well as a vertical v1x3 one. Moreover, we implemented a denser 3x3 grid and a center-focused o2x2 grid where neighboring cells overlap by 50% of their width and height. For the temporal dimension we subdivide the video sequence into 1 to 3 non-overlapping temporal bins (denoted t1, t2 and t3 below).
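To make the grid channels concrete, here is a minimal sketch (not the authors' implementation) of how one spatio-temporal channel, for example a horizontal h3x1 spatial grid with two temporal bins, could be turned into a normalized visual-word histogram. The function name, data layout and parameter defaults are assumptions for illustration; with a 1x1 spatial grid and a single temporal bin it reduces to the standard BoF histogram described above.

```python
import numpy as np

def bof_channel_histogram(words, positions, video_shape,
                          n_words=4000, spatial_grid=(3, 1), t_bins=2):
    """Sketch of a spatio-temporal bag-of-features channel.

    words       : (N,) array of visual-word indices (each detected point already
                  assigned to its nearest of the n_words k-means centers).
    positions   : (N, 3) array of (x, y, t) coordinates of the detected points.
    video_shape : (width, height, n_frames) of the clip.
    spatial_grid: (ny, nx) cells, e.g. (3, 1) for the horizontal h3x1 grid.
    t_bins      : number of non-overlapping temporal bins (1 to 3).
    """
    w, h, n_frames = video_shape
    ny, nx = spatial_grid
    hist = np.zeros((ny, nx, t_bins, n_words))

    for word, (x, y, t) in zip(words, positions):
        cx = min(int(nx * x / w), nx - 1)                  # spatial cell along x
        cy = min(int(ny * y / h), ny - 1)                  # spatial cell along y
        ct = min(int(t_bins * t / n_frames), t_bins - 1)   # temporal bin
        hist[cy, cx, ct, word] += 1

    # Concatenate the per-cell histograms into one vector and normalize,
    # as described for multi-cell grids above.
    vec = hist.reshape(-1)
    return vec / max(vec.sum(), 1.0)
```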
Task               HoG BoF   HoF BoF   Best channel           Best combination
KTH multi-class    81.6%     89.7%     91.1% (hof h3x1 t3)    91.8% (hof 1 t2, hog 1 t3)
Action AnswerPhone 13.4%     24.6%     26.7% (hof h3x1 t3)    32.1% (hof o2x2 t1, hof h3x1 t3)
Action GetOutCar   21.9%     14.9%     22.5% (hof o2x2 1)     41.5% (hof o2x2 t1, hog h3x1 t1)
Action HandShake   18.6%     12.1%     23.7% (hog h3x1 1)     32.3% (hog h3x1 t1, hog o2x2 t3)
Action HugPerson   29.1%     17.4%     34.9% (hog h3x1 t2)    40.6% (hog 1 t2, hog o2x2 t2, hog h3x1 t2)
Action Kiss        52.0%     36.5%     52.0% (hog 1 1)        53.3% (hog 1 t1, hof 1 t1, hof o2x2 t1)
Action SitDown     29.1%     20.7%     37.8% (hog 1 t2)       38.6% (hog 1 t2, hog 1 t3)
Action SitUp        6.5%      5.7%     15.2% (hog h3x1 t2)    18.2% (hog o2x2 t1, hog o2x2 t2, hog h3x1 t2)
Action StandUp     45.4%     40.0%     45.4% (hog 1 1)        50.5% (hog 1 t1, hof 1 t2)
Table 2. Classification performance of different channels and their combinations. For the KTH dataset the average class accuracy is reported, whereas for our manually cleaned movie dataset the per-class average precision (AP) is given.

Combining channels further improves the accuracy. Interestingly, HoGs perform better than HoFs for all real-world actions except for answering the phone. The inverse holds for KTH actions. This shows that the context and the image content play a large role in realistic settings, while simple actions can be very well characterized by their motion only. Furthermore, HoG features also capture motion information up to some extent through their local temporal binning.

In more detail, the optimized combinations for sitting down and standing up do not make use of spatial grids, which can be explained by the fact that these actions can occur anywhere in the scene. On the other hand, temporal binning does not help in the case of kissing, for which a high variability with respect to the temporal extent can be observed. For getting out of a car, handshaking and hugging, a combination of a h3x1 and a o2x2 spatial grid is successful. This could be due to the fact that those actions are usually pictured either in a wide setting (where a scene-aligned grid should work) or as a closeup (where a uniform grid should perform well). The optimized combinations determined in this section, cf. table 2, are used in the remainder of the experimental section.

4.2. Comparison to the state-of-the-art

We compare our work to the state-of-the-art on the KTH actions dataset [15], see figure 8. It contains six types of human actions, namely walking, jogging, running, boxing, hand waving and hand clapping, performed several times by 25 subjects. The sequences were taken for four different scenarios: outdoors, outdoors with scale variation, outdoors with different clothes and indoors. Note that in all cases the background is homogeneous. The dataset contains a total of 2391 sequences. We follow the experimental setup of Schuldt et al. [15] with sequences divided into the training/validation set (8+8 people) and the test set (9 people). The best performing channel combination, reported in the previous section, was determined by 10-fold cross-validation on the combined training+validation set. Results are reported for this combination on the test set.

Figure 8. Sample frames from the KTH actions sequences. All six classes (columns: walking, jogging, running, boxing, waving, clapping) and scenarios (rows) are presented.

Method     Schuldt et al. [15]   Niebles et al. [13]   Wong et al. [18]   ours
Accuracy   71.7%                 81.5%                 86.7%              91.8%
Table 3. Average class accuracy on the KTH actions dataset.

Table 3 compares the average class accuracy of our method with results reported by other researchers. Compared to the existing approaches, our method shows significantly better performance, outperforming the state-of-the-art in the same setup. The confusion matrix for our method is given in table 4. Interestingly, the major confusion occurs between jogging and running.

           Walking  Jogging  Running  Boxing  Waving  Clapping
Walking      .99      .01      .00      .00     .00      .00
Jogging      .04      .89      .07      .00     .00      .00
Running      .01      .19      .80      .00     .00      .00
Boxing       .00      .00      .00      .97     .00      .03
Waving       .00      .00      .00      .00     .91      .09
Clapping     .00      .00      .00      .05     .00      .95
Table 4. Confusion matrix for the KTH actions.

Note that results obtained by Jhuang et al. [6] and Wong et al. [19] are not comparable to ours, as they are based on non-standard experimental setups: they either use more training data or the problem is decomposed into simpler tasks.
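As a small worked check (not code from the paper), the average class accuracy reported for our method in Table 3 is the mean of the per-class diagonal entries of the confusion matrix in Table 4.

```python
import numpy as np

# Diagonal of Table 4: per-class accuracy for walking, jogging, running,
# boxing, waving and clapping.
per_class_accuracy = np.array([0.99, 0.89, 0.80, 0.97, 0.91, 0.95])

# Average class accuracy as reported in Table 3.
print(f"{per_class_accuracy.mean():.1%}")   # -> 91.8%
```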
4.3. Robustness to noise in the training data

Training with automatically retrieved samples avoids the high cost of manual data annotation. Yet, this goes in hand with the problem of wrong labels in the training set. In this section we evaluate the robustness of our action classification approach to labeling errors in the training set. Figure 9 shows the recognition accuracy as a function of the probability p of a label being wrong. Training for p=0 is performed with the original labels, whereas with p=1 all training labels are wrong. The experimental results are obtained for the KTH dataset and the same setup as described in subsection 4.2. Different wrong labelings are generated and evaluated 20 times for each p; the average accuracy and its variance are reported.

The experiment shows that the performance of our method degrades gracefully in the presence of labeling errors. Up to p=0.2 the performance decreases insignificantly, i.e., by less than two percent. At p=0.4 the performance decreases by around 10%. We can, therefore, predict a very good performance for the proposed automatic training scenario, where the observed amount of wrong labels is around 40%.

Note that we have observed a comparable level of resistance to labeling errors when evaluating an image classification method on the natural-scene images of the PASCAL VOC'07 challenge dataset.

Figure 9. Performance of our video classification approach in the presence of wrong labels. Results are reported for the KTH dataset.
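The corruption protocol can be sketched as follows (a hypothetical illustration, not the authors' code); the classifier is left abstract as a pair of fit/predict callbacks rather than the method used in the paper, and the noise levels and seed are arbitrary.

```python
import numpy as np

def noisy_label_curve(train_x, train_y, test_x, test_y, fit_fn, predict_fn,
                      p_values=(0.0, 0.1, 0.2, 0.3, 0.4), n_repeats=20, seed=0):
    """For each noise level p, flip each training label to a random different
    class with probability p, retrain, and report mean/std test accuracy over
    n_repeats random corruptions (mirroring the protocol described above)."""
    rng = np.random.default_rng(seed)
    classes = np.unique(train_y)
    results = {}
    for p in p_values:
        accs = []
        for _ in range(n_repeats):
            noisy = np.array(train_y).copy()
            flip = rng.random(len(noisy)) < p
            for idx in np.flatnonzero(flip):
                wrong = classes[classes != noisy[idx]]   # any label but the true one
                noisy[idx] = rng.choice(wrong)
            model = fit_fn(train_x, noisy)
            accs.append(float(np.mean(predict_fn(model, test_x) == test_y)))
        results[p] = (float(np.mean(accs)), float(np.std(accs)))
    return results
```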
4.4. Action recognition in real-world videos

In this section we report action classification results for real-world videos, i.e., for our test set with 217 videos. Training is performed with a clean, manual dataset as well as an automatically annotated one, see section 2.3 for details. We train a classifier for each action as being present or not, following the evaluation procedure of [5]. The performance is evaluated with the average precision (AP) of the precision/recall curve. We use the optimized combination of spatio-temporal grids from section 4.1. Table 5 presents the AP values for the two training sets and for a random classifier referred to as chance AP.

Action        Clean    Automatic   Chance
AnswerPhone   32.1%    16.4%       10.6%
GetOutCar     41.5%    16.4%        6.0%
HandShake     32.3%     9.9%        8.8%
HugPerson     40.6%    26.8%       10.1%
Kiss          53.3%    45.1%       23.5%
SitDown       38.6%    24.8%       13.8%
SitUp         18.2%    10.4%        4.6%
StandUp       50.5%    33.6%       22.6%
Table 5. Average precision (AP) for each action class of our test set. We compare results for clean (annotated) and automatic training data. We also show results for a random classifier (chance).

The classification results are good for the manual training set and lower for the automatic one. However, for all classes except "HandShake" the automatic training obtains results significantly above chance level. This shows that an automatically trained system can successfully recognize human actions in real-world videos. For kissing, the performance loss between automatic and manual annotations is minor. This suggests that the main difficulty with our automatic approach is the low number of correctly labeled examples and not the percentage of wrong labels. This problem could easily be avoided by using a large database of movies, which we plan to address in the future.

Figure 10 shows some example results obtained by our approach trained with automatically annotated data. We display key frames of test videos for which classification obtained the highest confidence values. The two top rows show true positives and true negatives. Note that despite the fact that samples were highly scored by our method, they are far from trivial: the videos show a large variability of scale, viewpoint and background. The two bottom rows show wrongly classified videos. Among the false positives many display features not unusual for the classified action, for example the rapid getting up is typical for "GetOutCar" or the stretched hands are typical for "HugPerson". Most of the false negatives are very difficult to recognize, see for example the occluded handshake or the hardly visible person getting out of the car.

5. Conclusion

This paper has presented an approach for automatically collecting training data for human actions and has shown that this data can be used to train a classifier for action recognition. Our approach for automatic annotation …