Portmanteau Vocabularies for Multi-Cue Image Representation

Fahad Shahbaz Khan¹, Joost van de Weijer¹, Andrew D. Bagdanov¹,², Maria Vanrell¹
¹ Centre de Visio per Computador, Computer Science Department, Universitat Autonoma de Barcelona, Edifici O, Campus UAB (Bellaterra), Barcelona, Spain
² Media Integration and Communication Center, University of Florence, Italy

Abstract

We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to empirically estimate the dependent, joint-cue distribution directly. We use information-theoretic vocabulary compression to find discriminative combinations of cues, and the resulting vocabulary of portmanteau¹ words is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. State-of-the-art results on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets demonstrate the effectiveness of our technique compared to other, significantly more complex approaches to multi-cue image representation.

1 Introduction

Image categorization is the task of classifying an image as containing an object from a predefined list of categories. One of the most successful approaches to this problem is the bag-of-words (BOW) model [4, 15, 11, 2]. In the bag-of-words model an image is first represented by a collection of local image features detected either sparsely or in a regular, dense grid. Each local feature is then represented by one or more cues, each describing one aspect of a small region around the corresponding feature. Typical local cues include color, shape, and texture. These cues are then quantized into visual words, and the final image representation is a histogram over these visual vocabularies. In the final stage of the BOW approach the histogram representations are sent to a classifier. The success of BOW is highly dependent on the quality of the visual vocabulary. In this paper we investigate visual vocabularies which are used to represent images whose local features are described by both shape and color.
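The BOW pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual setup: the codebook size, descriptor dimensionality, and the nearest-neighbor quantizer below are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a learned visual vocabulary of K = 8 word centers in a
# 16-dimensional descriptor space, and 50 local descriptors from one image.
K, D = 8, 16
vocabulary = rng.normal(size=(K, D))     # e.g. k-means cluster centers
descriptors = rng.normal(size=(50, D))   # densely sampled local features

# Quantize each descriptor to its nearest visual word (Euclidean distance).
dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
words = dists.argmin(axis=1)

# The image representation is the normalized histogram over visual words.
histogram = np.bincount(words, minlength=K) / len(words)
```

The resulting `histogram` is the per-image representation that would be fed to the classifier in the final BOW stage.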
To extend BOW to multiple cues, two properties are especially important: cue binding and cue weighting. A visual vocabulary is said to have the binding property when two independent cues appearing at the same location in an image remain coupled in the final image representation. For example, if every local patch in an image is independently described by a shape word and a color word, in the final image representation using compound words the binding property ensures that shape and color words coming from the same feature location are coupled in the final representation. The term binding is borrowed from the neuroscience field, where it is used to describe the way in which humans select and integrate the separate cues of objects in the correct combinations in order to accurately recognize them [17]. The property of cue weighting implies that it is possible to adapt the relevance of each cue depending on the dataset. The importance of cue weighting can be seen from the success of Multiple Kernel Learning (MKL) techniques, where weights for each cue are automatically learned [3, 13, 21, 14, 1, 20].

Traditionally, two approaches exist to handle multiple cues in BOW. When each cue has its own visual vocabulary, the result is known as a late fusion image representation, in which an image is represented as one histogram over shape words and another histogram over color words. Such a representation does not have the cue binding property, meaning that it is impossible to know exactly which color-shape events co-occurred at local features. Late fusion does, however, allow cue weighting. Another approach, called early fusion, constructs a single visual vocabulary of joint color-shape words. Representations over early fusion vocabularies have the cue binding property, meaning that the spatial co-occurrence of shape and color events is preserved. However, cue weighting in early fusion vocabularies is very cumbersome, since it must be performed before vocabulary construction, making cross-validation very expensive. Recently, Khan et al. [10] proposed a method which combines cue binding and weighting. However, their final image representation size is equal to the number of vocabulary words times the number of classes, and is therefore not feasible for the large datasets considered in this paper.

A straightforward, if combinatorially inconvenient, approach to ensuring the binding property is to create a new vocabulary that contains one word for each combination of original shape and color features. Considering that each of the original shape and color vocabularies may contain thousands of words, the resulting joint vocabulary may contain millions. Such large vocabularies are impractical, as estimating joint color-shape statistics is often infeasible due to the difficulty of sampling from limited training data. Furthermore, with so many parameters the resulting classifiers are prone to overfitting. Because of this and other problems, this type of joint feature representation has not been further pursued as a way of ensuring that image representations have the binding property.

In recent years a number of vocabulary compression techniques have appeared that derive small, discriminative vocabularies from very large ones [16, 7, 5]. Most of these techniques are based on information-theoretic clustering algorithms that attempt to combine words that are equivalently discriminative for the set of object categories being considered. Because these techniques are guided by the discriminative power of clusters of visual words, estimates of class-conditional visual word probabilities are essential. These recent developments in vocabulary compression allow us to reconsider the direct, Cartesian product approach to building compound vocabularies.

These vocabulary compression techniques have been demonstrated on single-cue vocabularies with a few tens of thousands of words. Starting from even moderately sized shape and color vocabularies results in a compound shape-color vocabulary an order of magnitude larger. In such cases, robust estimates of the underlying class-conditional joint-cue distributions may be difficult to obtain. We show that for typical datasets a strong independence assumption about the joint color-shape distribution leads to more robust estimates of the class-conditional distributions needed for vocabulary compression. In addition, our estimation technique allows flexible cue-specific weighting that cannot be easily performed with other cue combination techniques that maintain the binding property.

¹ A portmanteau is a combination of two or more words to form a neologism that communicates a concept better than any individual word (e.g. Ski resort + Konference = Skonference). We use the term to describe our vocabularies to emphasize the connotation with combining color and shape words into new, more meaningful representations.

2 Portmanteau Vocabularies

In this section we propose a new multi-cue vocabulary construction method that results in compact vocabularies which possess both the cue binding and the cue weighting properties described above. Our approach is to build portmanteau vocabularies of discriminative, compound shape and color words chosen from independently learned color and shape lexicons. The term portmanteau is used in natural language for words which are a blend of two other words and which combine their meaning. We use the term portmanteau to describe these compound terms to emphasize the fact that, similarly to the use of neologistic portmanteaux in natural language to capture complex and compound concepts, we create groups of color and shape words to describe semantic concepts inadequately described by shape or color alone.

A simple way to ensure the binding property is by considering a product vocabulary that contains a new word for every combination of shape and color terms. Assume that $S = \{s_1, s_2, \dots, s_M\}$ and $C = \{c_1, c_2, \dots, c_N\}$ represent the visual shape and color vocabularies, respectively.
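The product construction over the shape and color vocabularies $S$ and $C$ can be sketched as follows; the vocabulary sizes here are illustrative, much smaller than the thousands of words a real vocabulary may contain.

```python
from itertools import product

# Hypothetical shape and color vocabularies.
S = [f"s{i}" for i in range(1, 101)]   # M = 100 shape words
C = [f"c{j}" for j in range(1, 51)]    # N = 50 color words

# The product vocabulary W pairs every shape word with every color word,
# giving T = M * N compound words -- the combinatorial blow-up discussed above.
W = list(product(S, C))
T = len(W)   # 100 * 50 = 5000

# A local feature quantized to shape word s_i and color word c_j (0-based
# indices) maps to one compound word, preserving the binding of the two cues.
def compound_index(i, j, N=len(C)):
    return i * N + j
```

Even at these modest sizes the compound vocabulary has 5,000 words; with realistic vocabularies of a few thousand words each, `T` reaches the millions mentioned above.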
Figure 1: Comparison of two estimates of the joint-cue distribution $p(S,C|R)$ on two large datasets. The graphs plot the Jensen-Shannon divergence between each estimate and the true joint distribution as a function of the number of training images used to estimate them. The true joint distribution is estimated empirically over all images in each dataset. Estimation using the independence assumption of equation (2) yields similar or better estimates than their empirical counterparts.

The product vocabulary is given by
$$W = \{w_1, w_2, \dots, w_T\} = \{\{s_i, c_j\} \mid 1 \leq i \leq M,\ 1 \leq j \leq N\},$$
where $T = MN$. We will also use the notation $s_m$ to identify a member of the set $S$.

A disadvantage of vocabularies of compound terms constructed by considering the Cartesian product of all primitive shape and color words is that the total number of visual words is equal to the number of color words times the number of shape words, which typically results in hundreds of thousands of elements in the final vocabulary. This is impractical for two reasons. First, the high dimensionality of the representation hampers the use of complex classifiers such as SVMs. Second, insufficient training data often renders robust estimation of parameters very difficult, and the resulting classifiers tend to overfit the training set. Because of these drawbacks, compound product vocabularies have, to the best of our knowledge, not been pursued in the literature. In the next two subsections we discuss our approach to overcoming these two drawbacks.

2.1 Compact Portmanteau Vocabularies

In recent years, several algorithms for feature clustering have been proposed which compress large vocabularies into small ones [16, 7, 5]. To reduce the high dimensionality of the product vocabulary, we apply the Divisive Information-Theoretic feature Clustering (DITC) algorithm [5], which was shown to outperform AIB [16]. Furthermore, DITC has also been successfully employed to construct compact pyramid representations [6]. The DITC algorithm is designed to find a fixed number of clusters which minimize the loss in mutual information between clusters and the class labels of training samples. In our algorithm, the loss in mutual information is measured between the original product vocabulary and the resulting clusters. The algorithm joins words which have similar discriminative power over the set of classes in the image categorization problem. This is measured by the probability distributions $p(R|w_t)$, where $R = \{r_1, r_2, \dots, r_L\}$ is the set of $L$ classes. More precisely, the drop in mutual information $I$ between the vocabulary $W$ and the class labels $R$ when going from the original set of vocabulary words $W$ to the clustered representation $W^R = \{W_1, W_2, \dots, W_J\}$ (where every $W_j$ represents a cluster of words from $W$) is equal to
$$I(R; W) - I(R; W^R) = \sum_{j=1}^{J} \sum_{w_t \in W_j} p(w_t)\, KL\big(p(R|w_t)\,\|\,p(R|W_j)\big), \qquad (1)$$
where $KL$ is the Kullback-Leibler divergence between two distributions. Equation (1) states that the drop in mutual information is equal to the prior-weighted KL-divergence between a word and its assigned cluster. The DITC algorithm minimizes this objective function by alternating computation of the cluster distributions and assignment of compound visual words to their closest cluster. For more details on the DITC algorithm we refer to Dhillon et al. [5]. Here we apply the DITC algorithm to reduce the high dimensionality of the compound vocabularies. We call the compact vocabulary which is the output of the DITC algorithm the portmanteau vocabulary, and its words accordingly portmanteau words. The final image representation $p(W^R)$ is a distribution over the portmanteau words.

Figure 2: The effect of the cue-weighting parameter on DITC clusters. Each of the large boxes contains 100 image patches sampled from one portmanteau word on the Oxford Flower-102 dataset. Top row: five clusters for a weight of 0.1. Note how these clusters are relatively homogeneous in color, while shape varies considerably within each. Middle row: five clusters sampled for a weight of 0.5. The clusters show consistency over both color and shape. Bottom row: five clusters sampled for a weight of 0.9. Notice how in this case shape is instead homogeneous within each cluster.

2.2 Joint distribution estimation

In solving the problem of the high dimensionality of the compound vocabularies we have seemingly further complicated the estimation problem. As DITC is based on estimates of the class-conditional distributions $p(S,C|R) = p(W|R)$ over product vocabularies, we have increased the number of parameters to be estimated to $MNL$. This can easily reach millions of parameters for standard image datasets. To solve this problem we propose to estimate the class-conditional distributions by assuming independence of color and shape, given the class:
$$p(s_m, c_n|R) \propto p(s_m|R)\, p(c_n|R). \qquad (2)$$
Note that we do not assume independence of the cues themselves, but rather the less restrictive independence of the cues given the class. Instead of directly estimating the empirical joint distribution $p(S,C|R)$, we reduce the number of parameters to estimate to $(M+N)L$, which in the vocabulary configurations discussed in this paper represents a reduction in complexity of two orders of magnitude. As an additional advantage, we will show in section 2.3 that estimating the joint distribution $p(S,C|R)$ in this way allows us to introduce cue weighting.

To verify the quality of the empirical estimates of equation (2) we perform the following experiment. In figure 1 we plot the Jensen-Shannon (JS) divergence between the empirical joint distribution obtained from the test images and the two estimates: direct estimation of the empirical joint distribution $p(S,C|R)$ on the training set, and an approximate estimate made by assuming independence as in equation (2).
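The class-conditional estimate of equation (2) can be sketched as follows. This is a minimal illustration under stated assumptions: the vocabulary sizes, class count, and the random per-class word counts are all hypothetical stand-ins for counts gathered from real training images.

```python
import numpy as np

rng = np.random.default_rng(1)

M, N, L = 100, 50, 10   # shape words, color words, classes (illustrative)

# Hypothetical per-class counts of shape and color words observed on the
# training images of each class.
shape_counts = rng.integers(0, 20, size=(L, M)).astype(float)
color_counts = rng.integers(0, 20, size=(L, N)).astype(float)

# Marginal class-conditional estimates p(s|r) and p(c|r): only (M + N) * L
# parameters instead of the M * N * L needed for the full joint.
p_s = shape_counts / shape_counts.sum(axis=1, keepdims=True)   # (L, M)
p_c = color_counts / color_counts.sum(axis=1, keepdims=True)   # (L, N)

# Equation (2): p(s_m, c_n | r) is proportional to p(s_m | r) * p(c_n | r);
# the per-class outer product, renormalized, gives the joint estimate.
p_joint = p_s[:, :, None] * p_c[:, None, :]                    # (L, M, N)
p_joint /= p_joint.sum(axis=(1, 2), keepdims=True)

params_indep = (M + N) * L   # 1,500 parameters under independence
params_full = M * N * L      # 50,000 for the direct empirical joint
```

Even at these small sizes the independence assumption cuts the parameter count by more than an order of magnitude, which is the source of the robustness advantage reported in figure 1.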