YouTubeCat: Learning to Categorize Wild Web Videos
Zheshen Wang, Baoxin Li (Arizona State University); Ming Zhao, Yang Song, Sanjiv Kumar (Google)

Abstract (excerpt): Automatic categorization of videos in a Web-scale, unconstrained collection such as YouTube is a challenging task. A key issue is how to build an effective training set in the presence …

Figure 2. Multiple data sources for YouTube videos, including a small set of manually labeled data, related (e.g., co-watched) video data, searched data collected by using a video search engine with categories as queries, and cross-domain data (e.g., webpages) which are labeled with the same taxonomy structure.

(Depth-0 is the root.) Randomly selected YouTube videos that have been viewed more than a certain number of times are labeled by professionally-trained human experts based on the established taxonomy. Each video is labeled from Depth-0 to the deepest depth it can go. For example, if a video is labeled as Pop Music, it must be associated with the labels Music & Audio and Arts & Entertainment as well. Note that this is a general taxonomy instead of being designed for YouTube videos specifically. Thus, it is not surprising that the distribution of manually-labeled videos over all categories is extremely unbalanced. For example, the Arts & Entertainment category contains close to 90% of all the labeled videos, while categories such as Agriculture & Forestry have only a few videos. In fact, such imbalance reflects the real distribution of videos in the entire YouTube corpus. In this paper, we work on 29 categories that had a reasonable amount of manually-labeled samples, i.e., more than 200 for Depth-1 categories and more than 100 for Depth-2 to Depth-4 categories. Manually-labeled samples from these 29 categories (4,345 samples in total) cover close to 80% of all the data we labeled, roughly implying that the categories we are working with cover 80% of all possible videos on YouTube. To the best of our knowledge, this is the first paper which deals with general Web video classification on such diverse categories. In our experiments, 50% randomly selected samples are used as initial seeds for training (denoted as "M") and the remaining 50% are used for testing.

3.2. Related (Co-watched) data

To increase the training samples for each category, we considered co-watched videos, i.e., the next videos that users watched after watching the current video. We empirically noticed that if a video is co-watched more than 100 times with a certain video, the two tend to have the same category. Of course, such labels can be noisy, but our tree-DRF based late fusion method is able to handle such noise robustly. So, in our experiments, co-watched videos (denoted as "R") of all the initial seed videos with co-watch counts larger than 100 (3,277 videos in total) are collected to assist training.

3.3. Searched data

Another possibility for expanding the training set is searching for videos using online video search engines, with a category label as the text query. For example, videos returned by submitting the query "soccer" may be used as training samples for the "Soccer" category. Constrained by the quality of existing search engines, searched videos may be noisy. In our work, we keep about the top 1,000 videos returned for each category. Since the categories form a hierarchical structure, the videos returned for categories at lower levels are included for their ancestors as well. Querying Google video search gave us a set of about 71,029 videos (denoted as "S").

3.4. Cross-domain labeled data

Compared to video labeling, assigning labels to other types of data (e.g., text-based webpages) is usually easier. Although such data comes from a completely different domain, it can be helpful for video classification as long as the samples are labeled using the same taxonomy. This is because we also use text-based features to describe each video, as explained in Section 4.1. We collected 73,375 manually-labeled webpages (denoted as "W") as one of the additional data sources in our experiments.

4. Learning from multiple data sources

In Section 3, in addition to the manually-labeled data, we introduced several auxiliary sources which may be useful for boosting the video classification accuracy. The main challenge is how to make use of such a diverse set of data with different properties (e.g., video content features are not available for webpages) and labeling quality (e.g., labels of searched and co-watched data are fairly noisy).

In this paper, we propose a general framework for integrating data from mixed sources. As illustrated in Figure 3, each auxiliary data source is first pairwise combined with the manually-labeled training set. Initial classifiers are trained on each such pair. For each pair, two separate classifiers are learned, one with text-based and another with content-based features. For example, in Figure 3, MSc is a content-based and MSt is a text-based model for the combination of manually-labeled data and searched data. Trained models are then fused using a tree-DRF fusion strategy. Different from traditional methods that fuse models for each category independently, the proposed tree-DRF incorporates the hierarchical taxonomy structure, exploiting the category relationships effectively.

Next we introduce the features used for training individual classifiers, followed by the description of our tree-DRF fusion method.

Figure 3. General framework of the proposed solution: additional data sources are first combined with manually-labeled data independently, and classifier models are trained based on either text or content features for each combination. Individual classifiers are further fused to form the final classifier M.

4.1. Features

It is well known that designing good features is perhaps the most critical part of any successful classification approach. To capture the attributes of wild Web videos as completely as possible, state-of-the-art text and video content features are utilized in our experiments, as briefly summarized below.

Text features: For each video, the text words from title, description and keywords are extracted. Then, all these words are weighted to generate text clusters. The text clusters are obtained from Noisy-Or Bayesian Networks [16], where all the words are leaf nodes in the network and all the clusters are internal nodes. An edge from an internal node to a leaf node means the word in the leaf node belongs to that cluster. The weight of the edge indicates how strongly the word belongs to that cluster.

Video content features: a color histogram computed using hue and saturation in HSV color space; color motion, defined as the cosine distance of color histograms between two consecutive frames; skin color features as defined in [9]; edge features using edges detected by a Canny edge detector in regions of interest; line features using lines detected by a probabilistic Hough transform; a histogram of local features using Laplacian-of-Gaussian (LoG) and SIFT [15]; a histogram of textons [13]; entropy features for each frame using a normalized intensity histogram, and entropy differences for multiple frames; face features such as the number of faces and the size and aspect ratio of the largest face region (faces are detected by an extension of the AdaBoost classifier [23]); shot-boundary-detection-based features using differences of color histograms from consecutive frames [26]; audio features such as audio volume and a 32-bin spectrogram in a fixed time frame centered at the corresponding video frame; and adult content features based on a boosting-based classifier, in addition to frame-based adult-content features [18].

We extract the audio and visual features in the same time interval. Then, a 1D Haar wavelet decomposition is applied to them at 8 scales. Instead of using the wavelet coefficients directly, we take the maximum, minimum, mean and variance of them as the features in each scale. This multi-scale feature extraction is applied to all our audio and video content features except the histogram of local features [7].

Note that features are not the main contribution of this work. Due to space limitations, we skip the details of the features and refer the reader to the respective references. For fair comparisons, all the experimental results reported in this work are obtained based on the same set of features.

4.2. CRF-based fusion strategy

Conditional Random Fields (CRFs) are graph-based models that are popularly used for labeling structured data such as text [11], and were introduced in computer vision by [10]. In this work, we use the outputs of discriminative classifiers to model the potentials in CRFs, as suggested in the Discriminative Random Field (DRF) formulation in [10]. Following the notation in [10], we denote the observations as y and the corresponding labels as x. According to CRFs, the conditional distribution over labels given the observations is defined as a Gibbs field:

p(x \mid y) = \frac{1}{Z} \exp\Big( \sum_{i \in S} A_i(x_i, y) + \sum_{i \in S} \sum_{j \in N_i} I_{ij}(x_i, x_j, y) \Big),   (1)

where S is the set of all the graph nodes, N_i is the set of neighbors of node i, and Z is a normalizing constant called the partition function. The terms A_i and I_{ij} are the unary and pairwise potentials, sometimes referred to as the association potential and interaction potential, respectively [10].

4.3. Tree-DRF

As discussed earlier, in this work we use multiple data sources that are combined by a late fusion step. We want a fusion strategy that can combine the classifier outputs from different sources while respecting the taxonomy over categories. The DRF framework described above gives a natural way of achieving that. Formally, A_i learns to fuse the outputs of independent classifiers while I_{ij} enforces the category relationships defined by the hierarchical taxonomy.

In [10], DRF is used for image classification, in which a graph is built on image entities, i.e., pixels or blocks. On the contrary, in our case, the graph is defined over the hierarchical taxonomy (i.e., a tree over categories) and a node represents a category. Each node i is associated with a binary label variable x_i, i.e., x_i \in \{-1, 1\}, implying whether the i-th category label should be assigned to the input video or not. The scores from different classifiers for the i-th category on a given video are concatenated in a feature vector, which serves as the observation y_i. Figure 4 illustrates the proposed tree-DRF.
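To make the Gibbs field of Eq. (1) concrete, the following sketch (our own toy illustration, not the authors' code) enumerates all labelings of a 3-node category tree exactly. The unary scores and the single interaction weight `v` are placeholder values; the interaction term `x_i * x_j * v` mirrors the data-independent smoothing used later in Eq. (3).

```python
import itertools
import math

# Toy 3-node category tree: node 0 is the parent of nodes 1 and 2.
# The edge list defines the neighborhoods N_i used in Eq. (1).
EDGES = [(0, 1), (0, 2)]

def log_energy(x, unary, v):
    """Sum of association and interaction terms for one labeling x in {-1,+1}^n.

    `unary[i]` stands in for a per-node association score (A_i with x_i = +1);
    the interaction is x_i * x_j * v, i.e., data-independent smoothing."""
    assoc = sum(unary[i] * xi for i, xi in enumerate(x))
    inter = sum(v * x[i] * x[j] for i, j in EDGES)
    return assoc + inter

def gibbs(unary, v):
    """Exact p(x | y) from Eq. (1) by brute-force enumeration.

    Feasible only for toy trees (2^n labelings); the paper instead uses
    Belief Propagation, which is exact on trees without enumeration."""
    states = list(itertools.product((-1, 1), repeat=len(unary)))
    weights = [math.exp(log_energy(x, unary, v)) for x in states]
    Z = sum(weights)  # partition function
    return {x: w / Z for x, w in zip(states, weights)}
```

With zero unary scores and v > 0, labelings where parent and children agree receive higher probability than those where they disagree, which is exactly the smoothing role the interaction potential plays over the taxonomy.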
Figure 4. Late fusion strategy based on tree-DRF. For each input video, a tree structure over categories is defined. The binary label at the i-th node (x_i) represents whether that video should be assigned the category label C_i. The observation vector (y_i) is simply the concatenation of classifier scores on the video for that category.

Following [10], the association potential is defined as

A_i(x_i, y) = \log \frac{1}{1 + \exp\big(-x_i \, w_i^T h_i(y)\big)},   (2)

where w_i is a parameter vector and h_i(y) is a feature vector at site i. Following [10], we define h_i(y) to include the classifier scores and their quadratic combinations.

Note that, unlike the homogeneous form used in [10], the association potential in our tree-DRF model is inhomogeneous: there is a separate association parameter w_i for each node. The reason is that, since a different set of classifiers is learned for each category (i.e., a node), forcing the weight vectors defining combinations of such disparate sets of classifiers to be the same for all the nodes is too harsh. Thus, we allow the model to choose a different weight vector for each category. Of course, this leads to more parameters in the model, but since our graph is fairly small (just 29 nodes), and the size of the observation vector, i.e., the number of classifiers, is also small, the computational overhead was negligible. Moreover, overfitting is also not a concern, since we have enough training data for such a small number of parameters.

The interaction potential in tree-DRF is defined as

I_{ij}(x_i, x_j, y) = x_i x_j \, v^T \phi_{ij}(y), \quad j \in N_i,   (3)

where v are the model parameters and \phi_{ij}(y) is a pairwise feature vector for nodes i and j. In this work, we only explored data-independent smoothing by forcing \phi_{ij}(y) to be a constant. Similarly, the parameter v was kept the same for all the node pairs. One can easily relax this to allow directional (anisotropic) interactions between parents and children, which can provide more powerful directional smoothing. We plan to explore this in the future.

We used the standard maximum likelihood method for parameter learning in tree-DRF. Since the graph structure is a tree, exact unary and pairwise marginals were computed using Belief Propagation (BP). For inference, we used site-wise Maximum Posterior Marginal (MPM), again using BP. Results of tree-DRF fusion and comparisons to a regular fusion strategy based on SVM and Co-training are presented in Section 5.

5. Experiments and results

In order to verify the effectiveness of the proposed solution, we performed extensive experiments with about 80K YouTube videos and about 70K webpages. We first introduce the experimental data and settings in the next section, followed by a brief description of the evaluation metrics.

5.1. Experimental data and setting

As described in Section 3, four different data sources and 29 major categories are used in our experiments. The categories, followed by their path in the taxonomy tree, are: "Arts & Entertainment" (1), "News" (2), "People & Society" (3), "Sports" (4), "Celebrities & Entertainment News" (1,5), "Comics & Animation" (1,6), "Events and Listings" (1,7), "Humor" (1,8), "Movies" (1,9), "Music & Audio" (1,10), "Offbeat" (1,11), "Performing Arts" (1,12), "TV & Video" (1,13), "Team Sports" (4,14), "Anime & Manga" (1,6,15), "Cartoons" (1,6,16), "Concerts & Music Festivals" (1,7,17), "Dance & Electronic Music" (1,10,18), "Music Reference" (1,10,19), "Pop Music" (1,10,20), "Rock Music" (1,10,21), "Urban & Hip-Hop" (1,10,22), "World Music" (1,10,23), "TV Programs" (1,13,24), "Soccer" (4,14,25), "Song Lyrics & Tabs" (1,10,19,26), "Rap & Hip-Hop" (1,10,22,27), "Soul & R&B" (1,10,22,28), and "TV Reality Shows" (1,13,24,29).

In our experiments, binary classifiers are trained for each category. Content features and text features are trained separately, using AdaBoost and SVM, respectively. LibLinear [6] is used to train SVMs when training samples exceed 10K. Trained models are then integrated using a regular SVM-based late fusion strategy [22]. Since webpage data has only text features (no content features), only a single model is learned for this set. The training data from two sources (i.e., manually-labeled data plus one additional data source) is combined before training the classifiers. After all the data sources are leveraged, fusion is performed for content and text features for three pairwise combinations, represented by five individual classifiers. In the training process, negative training samples for each category are randomly selected from other categories with a negative-to-positive ratio of 3:1.

5.2. Evaluation metrics

While testing, since binary classifiers are trained for each category, each test sample receives 29 classification decisions (either "yes" or "no"). Multiple labels for a single sample are allowed. As the category labels form a taxonomy structure, predicted categories/labels are also propagated to their ancestors, as done while generating ground-truth labels for the training data. For example, if a test sample has the ground-truth label "Arts & Entertainment"/"TV & Video"/"TV Programs", it is treated as a true positive sample for the "Arts & Entertainment" category if it is classified as positive by any of these three classifiers. For quantitative evaluation, we compute Precision, Recall and F-score. To perform an aggregate assessment of the classification performance, we also compute F-scores for each depth level of the taxonomy.

Table 1. Classification accuracy of each data source, including manually labeled data (M), related data (R), searched data (S) and webpage data (W). Webpage data achieved the best performance except for Depth-2.

  F-score   Depth-1   Depth-2   Depth-3   Depth-4
  M         0.80      0.60      0.45      0.41
  R         0.74      0.53      0.37      0.34
  S         0.73      0.51      0.37      0.31
  W         0.84      0.54      0.48      0.45

5.3. Results and analysis

The objective of the proposed approach is to improve video classification performance by making use of data from multiple sources of varied quality. Table 1 lists the classification accuracy of each data source (due to space limitations, we only show F-scores in all tables and figures). Performance with just the related videos (R) or the searched videos (S) is much worse than that from manually-labeled data (M). It shows that neither related videos nor searched videos are sufficient for training a reliable classifier. Webpage data (W), obtained from a completely different domain and not even containing video content, works better than manually-labeled data for most taxonomy depths. This is possible since even noisy text-based features for videos are usually more reliable than video content features.

In order to achieve better results, we combine each of the additional data sources pairwise with the manually-labeled training data. As shown in Table 2, for the related video source, the pairwise combination achieves significant improvements over just using related videos, and is even better than training on manually-labeled data. For the searched videos, performance of the pairwise combination is also better than that for just the searched data, but worse than that of the manually-labeled data. In terms of the webpage data, the pairwise combination is not always superior to the single sources. Overall, there are two observations: 1) pairwise combination with manually-labeled data can improve the classification accuracy of any single additional source in most cases; 2) introducing additional data sources by simply merging them with the manually-labeled data does not guarantee improvement in all cases over the baseline configuration, i.e., using just the manually-labeled data for training.

Table 2. Classification accuracy of each combination of manually-labeled data with one additional data source. The combination with related data achieves significant improvements over just using the related data and even outperforms using only manually-labeled data. But the latter observation is not true for the other two cases (i.e., combination with searched data or webpage data).

  F-score   Depth-1   Depth-2   Depth-3   Depth-4
  M+R       0.86      0.63      0.47      0.49
  M+S       0.78      0.57      0.43      0.37
  M+W       0.84      0.55      0.45      0.39

Next, we fuse the single classifier models trained from pairwise combinations to further boost the classification performance. The first row of Table 3 shows the results of using the regular SVM late fusion strategy.

Table 3. Classification accuracy of fusing pairwise combinations of data using different fusion strategies. The proposed tree-DRF approach outperforms any single data source or their pairwise combinations. It is also superior to the traditional SVM fusion strategy with the same features.

  F-score          Depth-1   Depth-2   Depth-3   Depth-4
  All, SVM         0.84      0.65      0.46      0.49
  All, Tree-DRF    0.87      0.72      0.57      0.52
  M+R, Tree-DRF    0.85      0.66      0.48      0.45

Compared to the best cases in Table 2, fusing all data sources does not achieve any obvious improvement (for Depth-1 and Depth-3, the results are even worse). This is because, for SVM, when the feature dimension increases but the amount of training data does not, the test performance may degenerate due to overfitting. This observation underscores our previous assertion that an inappropriate fusion strategy for adding unreliable data sources may even harm the classification accuracy.

Results of the proposed tree-DRF fusion strategy are reported in the second row of Table 3. For all taxonomy depths, tree-DRF outperforms regular SVM fusion. Especially for Depth-2 and Depth-3, in which the categories can benefit from both parent categories and child categories, it achieves 0.07 (11%) and 0.11 (24%) improvements in F-score. Compared to the baseline performance (Table 1, first row), it gains 0.07 (9%), 0.12 (20%), 0.12 (27%) and 0.11 (27%) F-score improvements for Depth-1 to Depth-4, respectively. Such significant improvements are due to the taxonomy-tree-based learning of tree-DRF. In other words, since interactions between parent and child nodes are considered, noise in the additional data sources can be largely filtered. This is because useful information is typically consistent for neighboring nodes and thus can be emphasized by the interaction potential in tree-DRF.

To analyze the effectiveness of including additional data sources, we applied tree-DRF to the pair of manually-labeled data and related data (which gave the best results among all pairwise combinations with regular fusion of content models and text models); see the third row of Table 3. Compared to tree-DRF on all data (second row in Table 3), the results are worse, which demonstrates the gain from multiple data sources by using tree-DRF. For easy comparison, the accuracies of all sources and combinations are also plotted in Figure 5.

Figure 5. Comparison of classification accuracies from different data sources and combinations. Tree-DRF with all pairwise data combinations achieved the best performance. M: manually-labeled data, R: related videos, S: searched videos, W: webpage data.
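The relative gains quoted above can be checked directly against the table values. The sketch below (our own verification, with the F-scores copied from Tables 1 and 3) recomputes the rounded percentage improvements:

```python
# F-scores for Depth-1..Depth-4, copied from Tables 1 and 3.
svm_fusion = [0.84, 0.65, 0.46, 0.49]   # Table 3, "All, SVM"
tree_drf   = [0.87, 0.72, 0.57, 0.52]   # Table 3, "All, Tree-DRF"
baseline_m = [0.80, 0.60, 0.45, 0.41]   # Table 1, "M"

def rel_gain_pct(new, old):
    """Relative improvement of `new` over `old`, as a rounded percentage."""
    return round(100.0 * (new - old) / old)

# Tree-DRF vs. SVM fusion: Depth-2 and Depth-3 gains are 11% and 24%.
gains_vs_svm = [rel_gain_pct(n, o) for n, o in zip(tree_drf, svm_fusion)]

# Tree-DRF vs. the manually-labeled baseline: 9%, 20%, 27%, 27%.
gains_vs_m = [rel_gain_pct(n, o) for n, o in zip(tree_drf, baseline_m)]
```

Both lists reproduce the percentages stated in the analysis (11% and 24% at Depth-2/3 over SVM fusion; 9%, 20%, 27%, 27% over the baseline).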
Figure 6. F-scores of the 29 categories on manually-labeled data (M), all data with SVM fusion, and all data with tree-DRF fusion. Tree-DRF performed better than the other two methods for most categories.

[20] G. Schindler, L. Zitnick, and M. Brown. Internet video category recognition. In Proc. of CVPR Workshop on Internet Vision, 2008.
[21] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In Proc. of ACM Workshop on Multimedia Information Retrieval, 2006.
[22] C. G. M. Snoek, M. Worring, and A. W. M. Smeulders. Early versus late fusion in semantic video analysis. In Proc. of ACM MM, 2005.
[23] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. of CVPR, 2001.
[24] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In Proc. of ACM MM, 2007.
[25] S. Zanetti, L. Zelnik-Manor, and P. Perona. A walk through the web's video clips. In Proc. of CVPR Workshop on Internet Vision, 2008.
[26] H. Zhang, A. Kankanhalli, and S. W. Smoliar. Automatic partitioning of full-motion video. Multimedia Systems, 1(1):10–28, 1993.
[27] X. Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison, 2008.
[28] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.