Automatic Annotation of Human Actions in Video

Olivier Duchenne, Ivan Laptev, Josef Sivic, Francis Bach and Jean Ponce
INRIA / Ecole Normale Superieure, Paris, France
{duchenne,josefsivic,francisbach,jeanponce}@ens.fr   ivan.laptev@inria.fr

Abstract

This paper addresses the problem of automatic temporal annotation of realistic human actions in video using minimal manual supervision. To this end we consider two associated problems: (a) weakly-supervised learning of action models from readily available annotations, and (b) temporal localization of human actions in test videos. To avoid the prohibitive cost of manual annotation for training, we use movie scripts as a means of weak supervision. Scripts, however, provide only implicit, noisy, and imprecise information about the type and location of actions in video. We address this problem with a kernel-based discriminative clustering algorithm that locates actions in the weakly-labeled training data. Using the obtained action samples, we train temporal action detectors and apply them to locate actions in the raw video data. Our experiments demonstrate that the proposed method for weakly-supervised learning of action models leads to significant improvement in action detection. We present detection results for three action classes in four feature-length movies with challenging and realistic video data.

1. Introduction

Identifying human actions in video is a challenging computer vision problem and the key technology for many potential video mining applications. Such applications become increasingly important with the rapid growth of personal, educational, and professional video data. Action recognition has a long history of research, with significant progress reported over the last few years. Most recent works, however, address the problem of action classification, i.e., "what actions are present in the video?", in contrast to "where?" and "when?" they occur. In this paper, similar to [7, 14, 21, 24], we aim at identifying both the classes and the temporal location of actions in video.

Recent papers on action recognition report impressive results for evaluations in controlled settings such as the Weizmann [1] and KTH [19] datasets. At the same time, state-of-the-art methods only achieve limited performance in real
Figure 1. Video clips with actions provided by automatic script-based annotation. Selected frames illustrate both the variability of action samples within a class as well as the imprecise localization of actions in video clips.

scenarios such as movies and surveillance videos, as demonstrated in [13, 22]. This emphasizes the importance of realistic video data with human actions for the training and evaluation of new methods.

In this work we use realistic video data with human actions from feature-length movies. To avoid the prohibitive cost of manual annotation, we propose an automatic and scalable solution for training and use movie scripts as a means of weak supervision. Scripts enable text-based retrieval of many action samples but only provide imprecise action intervals, as illustrated in Figure 1. Our main technical contribution is to address this limitation with a new weakly-supervised discriminative clustering method which segments actions in video clips (Section 3). Using the resulting segments with human actions for training, we next turn to temporal action localization. We present action detection results on highly challenging data from movies, and demonstrate improvements achieved by our discriminative clustering method in Sections 4 and 5.
1.1. Related work

This paper is related to several recent research directions. With respect to human action recognition, similar to [6, 13, 17, 19] and others, we adopt a bag-of-features framework, represent actions by histograms of quantized local space-time features, and use an SVM to train action models. Our automatic video annotation is based on the video alignment of scripts used in [5, 8, 13]. Similar to [13], we use scripts to find coarse temporal locations of actions in the training data. Unlike [13], however, we use clustering to discover precise action boundaries from video.

Unsupervised action clustering and localization has been addressed in [24] by means of normalized cuts [16]. Whereas this direct clustering approach works in simple settings [24], we find it is not well suited for actions with large intra-class variation and propose an alternative discriminative clustering approach. [17] deals with unsupervised learning of action classes but only considers actions in simple settings and does not address temporal action localization as we do in this work. Recently and independently of our work, Buehler et al. [2] considered learning sign language from weak TV video annotations using multiple instance learning.

Our weakly supervised clustering is also related to the work on learning object models from weakly annotated images [4, 15]. The temporal localization of training samples in videos, addressed in this work, is also similar in spirit to weakly supervised learning of object part locations in the context of object detection [9].

Several previous methods address temporal localization of actions in video. Whereas most of them evaluate results in simple settings [7, 21, 24], our work is more related to [14], which detects actions in a real movie. Differently from [14], our method is weakly supervised and enables the learning of actions with minimal manual supervision.

2. Automatic supervision for action recognition

Manual annotation of many classes and many samples of human actions in video is a time-consuming process. For example, common actions such as hand shaking or kissing occur only a few times per movie on average, thus collecting a fair-sized number of action samples for training requires annotation of tens or hundreds of hours of video. In this situation, the automatic annotation of training data is an interesting and scalable alternative, which will be addressed in this paper.

Script-based retrieval of action samples. In this work, we follow [13] and automatically collect training samples with human actions using video scripts. (Note that we use scripts as a means of weak supervision at the training stage only; scripts are not available at the test stage of visual action annotation. We use movie scripts publicly available from www.dailyscript.com, www.movie-page.com and www.weeklyscript.com.) Whereas [13] uses supervised text classification for localizing human actions in scripts, here we avoid manual text annotation altogether. We use the OpenNLP toolbox [18] for natural language processing and apply part-of-speech (POS) tagging to identify instances of nouns, verbs and particles. We also use named entity recognition (NER) to identify people's names. Given the results of POS and NER, we search for patterns corresponding to particular classes of human actions, such as (*/PERSON.*opens/VERB.*door/NOUN). This procedure automatically locates instances of human actions in text, for example: "Jane jumps up and Carolyn opens the front door" and "Jane opens her bedroom door".

Temporal localization of actions in video. Scripts describe events and their order in video but usually do not provide time information. Following [5, 8, 13], we find the temporal localization of dialogues in scripts by matching script text with the corresponding subtitles using dynamic programming. The temporal localization of human actions and scene descriptions is then estimated from the time intervals of the surrounding speech, as illustrated below:

Subtitles:
  00:24:22  "Yes, Monsieur Laszlo. Right this way."
  00:24:51  "Two Cointreaux, please."

Script:
  Speech: "Monsieur Laszlo. Right this way."
  Scene description: As the head waiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The head waiter seats Ilsa...
  Speech: "Two cointreaux, please."

Automatic script alignment only provides coarse temporal localization of human actions, especially for episodes with rare dialogues. In addition, incorrect ordering of actions and speech in scripts and the errors of script/subtitle alignment often result in unknown temporal offsets. To overcome temporal misalignment, we increase the estimated time boundaries of scene descriptions. In this work we denote the parts of the video corresponding to scene descriptions as video clips. Table 1 illustrates the manually evaluated accuracy of automatic script-based annotation in video clips with increasing temporal extents.

Training an accurate action classifier requires video clips with both accurate labels and precise temporal boundaries. The high labelling accuracy in Table 1, however, is bound to the imprecise temporal localization of action samples. This trade-off between accuracies in labels and temporal localization of action samples comes from the aforementioned problems with imprecise script alignment.

Figure 2. Space-time interest points detected for multiple video resolutions and three frames of a StandUp action. The circles indicate spatial locations and scales of space-time patches used to construct the bag-of-features video representation (see text for more details).
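The pattern search over POS- and NER-tagged script text can be sketched with ordinary regular expressions. This is a minimal illustration, not the paper's pipeline: the tagged lines below are hand-written stand-ins for OpenNLP output, and the exact tag names are assumptions of this sketch.

```python
import re

# Action pattern in the spirit of the paper's
# (*/PERSON.*opens/VERB.*door/NOUN) query.
OPEN_DOOR = re.compile(r"\S+/PERSON\b.*\bopens/VERB\b.*\bdoor/NOUN\b")

# Hand-tagged stand-ins for OpenNLP POS/NER output (tag names illustrative).
tagged_lines = [
    "Carolyn/PERSON opens/VERB the/DET front/ADJ door/NOUN",
    "Jane/PERSON jumps/VERB up/PARTICLE",
    "Jane/PERSON opens/VERB her/PRON bedroom/NOUN door/NOUN",
]

# Keep the scene-description lines that match the action pattern.
hits = [line for line in tagged_lines if OPEN_DOOR.search(line)]
```

In a full pipeline, tagging would run first over whole scripts, and the matched scene descriptions would become the candidate clips for the corresponding action class.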
In this paper we target this problem and address visual learning of human actions in a weakly supervised setting, given imprecise temporal localization of the training samples. Next, we present a weakly supervised clustering algorithm to automatically localize actions in the training samples.

Clip length (frames):     100    200    400    800    1600
Label accuracy:           19%    43%    60%    75%    83%
Localization accuracy:    74%    37%    18%     9%     5%

Table 1. Accuracy of automatic script-based action annotation in video clips. Label accuracy indicates the proportion of clips containing labeled actions. Localization accuracy indicates the proportion of frames corresponding to labeled actions. The evaluation is based on the annotation of three action classes: StandUp, SitDown and OpenDoor in fifteen movies selected based on their availability.

3. Human action clustering in video

To train accurate action models we aim at localizing human actions inside the video clips provided by automatic video annotation. We assume most of the clips contain at least one instance of a target action and exploit this redundancy by clustering clip segments with consistent motion and shape. In Section 3.1 we describe our video representation. Sections 3.2 and 3.3, respectively, formalize the clustering problem and describe the discriminative clustering procedure. Finally, Section 3.4 details the baseline k-means-like algorithm and Section 3.5 experimentally evaluates and compares the two clustering algorithms.

3.1. Video representation

We use a bag-of-features representation motivated by its recent success for object, scene and action classification [6, 13, 17, 19]. We detect local space-time features using an extended Harris operator [12, 13] applied at multiple spatial and temporal video resolutions. The resulting patches correspond to local events with characteristic motion and shape in video, as illustrated in Figure 2. For each
detected patch, we compute the corresponding local motion and shape descriptors, represented by histograms of spatial gradient orientations and of optical flow, respectively. (We use a publicly available implementation of the feature detector and descriptor.) Both descriptors are concatenated into a single feature vector and quantized using k-means vector quantization with a visual vocabulary of size 1000. We represent a video segment by its L1-normalized histogram of visual words.

3.2. Joint clustering of video clips

Our goal is to jointly segment video clips containing a particular action; that is, we aim at separating what is common within the video clips (i.e., the particular action) from what is different among them (i.e., the background frames). Our setting is, however, simpler than general co-segmentation in the image domain, since we only perform temporal segmentation. That is, we look for segments that are composed of contiguous frames.

For simplicity, we further reduce the problem to separating one segment per video clip (the action segment) from a set of background video segments, taken from the same movie or other movies, which are unlikely to contain the specific action. We thus have the following learning problem: we are given M video clips c_1, ..., c_M containing the action of interest, but at an unknown position within each clip, as illustrated in Figure 3. Each clip c_i is represented by temporally overlapping segments centered at frames 1, ..., n_i and represented by histograms h_i[1], ..., h_i[n_i] in R^N. Each histogram captures the L1-normalized frequency counts of quantized space-time interest points, as described in Section 3.1, i.e., it is a positive vector in R^N whose components sum to one. We are also given P background video segments represented by histograms h̄_1, ..., h̄_P. Our goal is to find in each of the M clips one specific video segment centered at frame y_i ∈ {1, ..., n_i}, so that the set of histograms h_i[y_i], i = 1, ..., M, forms one cluster while the background histograms form another cluster, as illustrated in Figure 4.

3.3. Discriminative clustering

In this section, we formulate the above clustering problem as the minimization of a discriminative cost function [23]. First, let us assume that the correct segment locations
y_i, i = 1, ..., M, are known (i.e., we have identified the locations of the actions in video). We can now consider a support vector machine (SVM) [20] classifier aiming at separating the identified action video segments from the given background video segments, which leads to the following cost function:

  f(y, w, b) = C+ Σ_{i=1}^{M} max(0, 1 − w·Φ(h_i[y_i]) − b)
             + C− Σ_{j=1}^{P} max(0, 1 + w·Φ(h̄_j) + b)
             + (1/2) ||w||²,                                        (1)

where (w, b) ∈ F × R are the parameters of the classifier and Φ is the implicit feature map from R^N to the feature space F corresponding to the intersection kernel between histograms, defined as [10]

  k(x, x') = Σ_{n=1}^{N} min(x_n, x'_n).

Figure 3. Illustration of the temporal action clustering problem. Given a set of video clips c_1, ..., c_M containing the action of interest at an unknown position, the goal is to temporally localize a video segment in each clip containing the action.

Figure 4. In feature space, positive samples are constrained to be located on temporal feature tracks corresponding to consecutive temporal windows in video clips. Background (non-action) samples provide further constraints on the clustering.

Note that the first two terms in cost function (1) represent the hinge loss on the positive and negative training data, weighted by the factors C+ and C− respectively, and the last term is the regularizer of the classifier. Note also that training the SVM with the locations y known and fixed corresponds to minimizing f(y, w, b) with respect to the classifier parameters (w, b).

However, in the clustering setup considered in this work, where the locations of action video segments within clips are unknown, the goal is to minimize the cost function (1) both with respect to the locations y and the classifier parameters (w, b), so as to separate positive action segments from the (fixed) negative background video segments. Denoting by g(y) = min_{w,b} f(y, w, b) the associated optimal value of f(y, w, b), the cost function g(y) now characterizes the separability of a particular selection of action video segments from the (fixed) background videos. Following [11, 23], we can now optimize g(y) with respect to the assignment y.

We consider a coordinate descent algorithm, where we iteratively optimize with respect to the position y_i of the action segment in each clip, while leaving all other components of y (the positions of the other positive video segments) fixed. In our implementation, which uses the LibSVM [3] software, in order to save computing time we re-train the SVM (updating w and b) only once after an optimal y_i is found in each clip.

Note that the position y_i of an action segment within clip c_i can be parametrized using a binary indicator vector z_i with 1 at position y_i and zeros otherwise. This representation naturally leads to a continuous relaxation of the clustering problem by allowing z_i to have any (i.e., non-binary) positive values which sum to one. We use the idea of continuous relaxation for the initialization of the coordinate descent algorithm. The initial histogram for each video clip is set to the average of all segment histograms h_i[j] within the clip. Using the relaxed indicator notation, this corresponds to initializing z_i with a small fixed value for all segments, equal to one over the number of segments in the clip.

3.4. Clustering baseline: modified k-means

To illustrate the difficulty of the clustering task addressed in this paper, we consider a baseline method in terms of a modified k-means algorithm. This type of algorithm has been used previously for weakly-supervised spatial object category localization in image sets [4]. We consider the joint distortion measure of assigning some of the candidate segments to a mean μ1 ∈ F, while assigning all other segments (and the background) to a mean μ2 ∈ F. Using the indicator vector notation, we minimize the following function with respect to z, μ1 and μ2:

  D(z, μ1, μ2) = Σ_{i=1}^{M} Σ_{j=1}^{n_i} [ z_i[j] ||Φ(h_i[j]) − μ1||²
               + (1 − z_i[j]) ||Φ(h_i[j]) − μ2||² ]
               + Σ_{j=1}^{P} ||Φ(h̄_j) − μ2||².
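The coordinate-descent alternation of Section 3.3 can be sketched on synthetic data as follows. This is a toy stand-in, not the paper's implementation: a regularized least-squares classifier on simulated histograms replaces the intersection-kernel SVM, and the data, sizes and mixing weights are all invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the paper's setup: M clips, each a track of n segment
# histograms of dimension N, plus fixed background histograms.
M, n, N = 5, 12, 20
action = rng.dirichlet(np.ones(N))              # shared "action" word profile
clips, truth = [], []
for _ in range(M):
    segs = rng.dirichlet(np.ones(N), size=n)    # background-like segments
    t = int(rng.integers(n))                    # hidden action position
    segs[t] = 0.9 * action + 0.1 * segs[t]      # plant the action segment
    clips.append(segs)
    truth.append(t)
background = rng.dirichlet(np.ones(N), size=50)

def train(pos, neg, lam=1e-2):
    """Regularized least-squares classifier separating pos (+1) from neg (-1)."""
    X = np.vstack([pos, neg])
    y = np.hstack([np.ones(len(pos)), -np.ones(len(neg))])
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Relaxed initialization: each clip starts as the average of its segments.
pos = np.stack([segs.mean(axis=0) for segs in clips])
for _ in range(5):
    w = train(pos, background)                                     # fit classifier
    pos = np.stack([segs[np.argmax(segs @ w)] for segs in clips])  # re-select

found = [int(np.argmax(segs @ w)) for segs in clips]
```

The alternation (fit a discriminative classifier against the fixed background, then re-select the best-scoring segment in each clip) mirrors the optimization of g(y); the k-means baseline differs in assigning segments to the nearest of two fixed means instead.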
Figure 5. Visualization of temporal segmentation results. Ground truth action locations are shown by thick blue bars and locations automatically obtained by temporal clustering are shown as red-yellow bars. The action is approximately localized in 19 out of 21 cases. See text for details.

This cost function can be (locally) minimized alternately with respect to all z_i, i ∈ {1, ..., M}, and to μ1, μ2. In our experiments, as shown in Section 3.5, it does not lead to good solutions. Our interpretation is that this type of algorithm, like regular k-means, relies heavily on the metric without the possibility of adapting it to the data, which discriminative clustering can do.

3.5. Evaluation of clustering performance

In this section we apply the discriminative clustering algorithm described above to the temporal segmentation of "drinking" actions in the movie "Coffee and Cigarettes". The quality of the segmentation is evaluated in terms of localization accuracy. Section 4 then evaluates the benefit of automatic action segmentation for supervised temporal action detection.

The clustering algorithm is evaluated on a set of 21 drinking actions. The remaining 38 actions are left out for testing detection performance. Our test and training videos do not share the same scenes or actors. For both the training and test sets the ground truth action boundaries were obtained manually. To evaluate the algorithm in controlled settings, we simulate script-based weak supervision by extending the ground truth action segments by random amounts of between 0 and 800 frames on each side. The negative data is obtained by randomly sampling segments of a fixed size from the entire movie. The size of positive segments is kept fixed at 60 frames.

Our clustering algorithm converges in a few (3-5) iterations, both in terms of the cost given in (1) and the localization accuracy (discussed below). Automatically localized segments are shown in Figure 5 and example frames from several clips are shown in Figure 6. Note the significant variation of appearance between the different actions.

The temporal localization accuracy is measured by the percentage of clips with relative temporal overlap to ground truth action segments greater than 0.2. The best overlap score of 1 is achieved when the automatically found action video segment aligns perfectly with the ground truth action, and 0 in the
case of no temporal overlap. This relatively loose threshold of 0.2 is used in order to compensate for the fact that the temporal boundaries of actions are somewhat ambiguous and not always accurately defined. Using this performance measure, discriminative clustering correctly localizes 19 out of 21 clips, which corresponds to an accuracy of 90%. There are two missed actions (6 and 18 in Figure 5). Clip 6 contains a significant simultaneous motion of another person not performing the target action. In clip 18 drinking is mismatched for smoking (the two actions are visually quite similar).

The 90% accuracy achieved by discriminative clustering is a significant improvement over the k-means algorithm described in Section 3.4, which fails completely on this data and achieves an accuracy of only 5%. We have also implemented a variation of the k-means method minimizing the sum of distances between all positive examples [4] instead of the sum of distances to the mean, but obtained similarly low performance. This could be attributed to the fact that discriminative clustering selects relevant features within the histograms. This is important in our setting, where histograms can be polluted by background motion or other actions happening within the same frame.

4. Temporal action detection in video

In this section we experimentally evaluate, in a controlled setting, whether the action classifier trained on automatically clustered action segments can be used to improve the performance of temporal action detection in new unseen test videos with no textual annotation.

Temporal sliding window detection. Similar to object detection in the image domain, we train an SVM classifier to classify whether a short video segment contains the action of interest, and apply the classifier in a sliding window manner over the entire video. The classifier output is then processed using standard non-maximum suppression.

Evaluation. We use the same test data as in [14], formed by 35,973 frames of the movie Coffee and Cigarettes containing 38 drinking actions. In all cases we consider sliding windows with temporal scales of 60, 80 and 100 frames, and the negative training data is formed by 5,000 video segments randomly sampled from the training portion of the movie. Similar to object detection, performance is measured using precision-recall curves and average precision (AP).
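The sliding-window scoring and non-maximum suppression above can be sketched as follows. The window size, stride, synthetic score function and suppression threshold are illustrative choices for this sketch, not the paper's trained SVM or exact settings:

```python
def overlap(a, b):
    """Relative temporal overlap (intersection over union) of two intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union

def sliding_window_detect(score_fn, n_frames, window=60, stride=10):
    """Score every temporal window; score_fn maps (start, end) -> confidence."""
    return [(s, s + window, score_fn(s, s + window))
            for s in range(0, n_frames - window + 1, stride)]

def temporal_nms(dets, max_overlap=0.2):
    """Greedy non-maximum suppression on 1-D (temporal) detections."""
    keep = []
    for s, e, score in sorted(dets, key=lambda d: d[2], reverse=True):
        if all(overlap((s, e), (ks, ke)) < max_overlap for ks, ke, _ in keep):
            keep.append((s, e, score))
    return keep

# Toy demo: a synthetic confidence that peaks for windows centered near frame 200.
dets = sliding_window_detect(lambda s, e: -abs((s + e) / 2 - 200), n_frames=400)
kept = temporal_nms(dets)
```

In the actual detector the score would come from the SVM applied to a segment's visual-word histogram, run at temporal scales of 60, 80 and 100 frames.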
Figure 6. Examples of temporally localized "drinking" actions in the movie "Coffee and Cigarettes" obtained by the proposed discriminative clustering algorithm. Each row shows example frames from the entire video clip. Example frames of automatically localized actions within the clips are shown in red. Note the appearance variation between the different instances of the action.

We investigate how the detection performance changes with the increasing size of the training video segments containing the positive training examples. The goal of this experiment is to simulate inaccurate temporal localization of actions in clips obtained from text annotation. Figure 7 (top) shows precision-recall curves for training clip sizes varying between 800 frames and the precise ground truth (GT) boundaries of the positive actions. The decreasing performance with increasing training clip size clearly illustrates the importance of temporal action localization in the training data. The performance is also compared to the approach of Laptev and Perez [14], which in addition spatially localizes actions in video frames. For comparison, however, we here only consider the temporal localization performance of [14]. Measured by average precision, the spatio-temporal sliding window classifier of Laptev and Perez performs slightly better (AP of 0.49) compared to the temporal sliding window classifier considered in this work (AP of 0.40). It should be noted, however, that [14] requires much stronger supervision in the form of spatio-temporal localization of training actions in the video. Finally, Figure 7 (bottom) shows the precision-recall curve for training from localized action segments obtained automatically using our discriminative clustering method (Section 3) compared with training on entire video clips. Note the clear improvement in detection performance when using the temporally localized training samples obtained with the clustering.

5. Experiments

In this section we test our full framework for automatic learning of action detectors, including (i) automatic retrieval of training action clips by means of script mining (Section 2), (ii) temporal localization of actions inside clips using discriminative clustering (Section 3) and (iii) supervised temporal detection of actions in test videos (Section 4). To train an action classifier we use fifteen movies aligned with their scripts and choose two test action classes, OpenDoor and SitDown, based on their high frequency in our data. (Our fifteen training movies were selected based on their availability as well as the quality of script alignment. The titles of the movies are: American Beauty; Being John Malkovich; Casablanca; Forrest Gump; Get Shorty; It's a Wonderful Life; Jackie Brown; Jay and Silent Bob Strike Back; Light Sleeper; Men in Black; Mumford; Ninotchka; The Hustler; The Naked City and The Night of the Hunter.)

Figure 7. Action detection performance on the "Coffee and Cigarettes" test set for (top) increasing training clip sizes (Laptev & Perez, AP 0.49; GT+0 frames, AP 0.40; GT+100 frames, AP 0.30; GT+200 frames, AP 0.19; GT+800 frames, AP 0.07) and (bottom) automatically segmented training data (AP 0.26) compared with training from unsegmented clips (AP 0.07).

The only manual supervision provided to the system consists of text patterns for the actions, defined as (*/PERSON.*opens/VERB.*door/NOUN) and (*/PERSON.*sits/VERB.*down/PARTICLE). Matching these text patterns with scripts results in 31 and 44 clips with OpenDoor and SitDown actions respectively. We use these clips as input to the discriminative clustering algorithm and obtain segments with temporally localized action boundaries. The segments are passed as positive training samples to train an SVM action classifier. To compare the performance of the method we also train two action classifiers using positive training samples corresponding to (a) entire clips and (b) ground truth action intervals.

To test detection performance we manually annotated all 93 OpenDoor and 86 SitDown actions in three movies: Living in Oblivion, The Crying Game and The Graduate. Detection results for the three different methods and two action classes are illustrated in terms of precision-recall curves in Figure 8. The comparison of detectors trained on clips and on action segments provided by the clustering clearly indicates the improvement achieved by the discriminative clustering algorithm for both actions. Moreover, the performance of
automatically trained action detectors is comparable to the detectors trained on the ground truth data. We emphasize the large amount (450,000 frames in total) and high complexity of our test data, illustrated with a few detected action samples in Figure 9.

6. Conclusions

We described a method for training temporal action detectors using minimal manual supervision. In particular, we addressed the problem of weak action supervision and proposed a discriminative clustering method that overcomes localization errors of actions in script-based video annotation. We presented results of action detection on challenging video data and demonstrated a clear improvement of our clustering scheme compared to action detectors trained from imprecisely localized action samples. Our approach is generic and can be applied to the training of a large variety and number of action classes in an unsupervised manner.

Acknowledgments. This work was partly supported by the Quaero Programme, funded by OSEO, and by the MSR-INRIA laboratory.

References

[1] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. In ICCV, pages II:1395-1402, 2005.
[2] P. Buehler, M. Everingham, and A. Zisserman. Learning sign language by watching TV (using weakly aligned subtitles). In CVPR, 2009.
[3] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines, 2001. http://www.csie.ntu.edu.tw/
[4] O. Chum and A. Zisserman. An exemplar model for learning object classes. In CVPR, 2007.
[5] T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar. Movie/script: Alignment and parsing of video and text transcription. In ECCV, pages IV:158-171, 2008.
[6] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005.
[7] A. Efros, A. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In ICCV, pages 726-733, 2003.
[8] M. Everingham, J. Sivic, and A. Zisserman. Hello! my name is... Buffy - automatic naming of characters in TV video. In BMVC, 2006.
[9] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
[10] M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In Proc. AISTATS, 2005.
[11] A. Howard and T. Jebara. Learning monotonic transformations for classification. In NIPS, 2007.
Figure 8. Precision-recall curves corresponding to detection results for two action classes in three movies: SitDown (GT, AP 0.139; Cluster, AP 0.121; Clip, AP 0.016) and OpenDoor (GT, AP 0.144; Cluster, AP 0.141; Clip, AP 0.029). The three compared methods correspond to detectors trained on ground truth intervals (GT), on the clustering output (Cluster) and on clips obtained from script mining (Clip). Note that the axes are scaled between 0 and 0.5 for better clarity.

Figure 9. Examples of action samples detected with the automatically trained action detector in three test movies: examples of detected SitDown actions and examples of detected OpenDoor actions.

[12] I. Laptev. On space-time interest points. IJCV, 64(2/3):107-123, 2005.
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[14] I. Laptev and P. Perez. Retrieving actions in movies. In ICCV, 2007.
[15] S. Lazebnik, C. Schmid, and J. Ponce. Semi-local affine parts for object recognition. In BMVC, volume 2, pages 959-968, 2004.
[16] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.
[17] J. Niebles, H. Wang, and L. Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. In BMVC, 2006.
[18] OpenNLP. http://opennlp.sourceforge.net.
[19] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, 2004.
[20] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[21] E. Shechtman and M. Irani. Space-time behavior based correlation. In CVPR, pages I:405-412, 2005.
[22] TRECVID evaluation for surveillance event detection, National Institute of Standards and Technology (NIST), 2008. http://www-nlpir.nist.gov/projects/trecvid.
[23] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum margin clustering. In Adv. NIPS, 2004.
[24] L. Zelnik-Manor and M. Irani. Event-based analysis of video. In CVPR, pages II:123-130, 2001.