Many visual search and matching systems represent images using sparse sets of "visual words": descriptors that have been quantized by assignment to the best-matching symbol in a discrete vocabulary. Errors in this quantization procedure propagate…
Descriptor Learning for Efficient Retrieval
James Philbin, Michael Isard, Josef Sivic, and Andrew Zisserman
Visual Geometry Group, Department of Engineering Science, University of Oxford

…a discussion. Improved performance is demonstrated over SIFT descriptors [18] on standard datasets with learnt descriptors as small as 24-D.

2 Datasets and the mAP Performance Gap

To learn and evaluate, we use two publicly available datasets with associated ground truth: (i) the Oxford Buildings dataset [19]; and (ii) the Paris Buildings dataset [20]. We show that a significant performance gap (the mAP-gap) is incurred by using quantized descriptors compared to using the original descriptors. It is this gap that we aim to reduce by learning a descriptor projection.

2.1 Datasets and Performance Measure

Both the Oxford (5.1K images) and Paris (6.3K images) datasets were obtained from Flickr by querying the associated text tags for famous landmarks, and both have an associated ground truth for 55 standard queries: 5 queries for each of 11 landmarks in each city. To evaluate retrieval performance, the Average Precision (AP) is computed as the area under the precision-recall curve for each query. As in [3], an Average Precision score is computed for each of the 5 queries for a landmark. These scores are averaged (over 55 query images in total for each dataset) to obtain an overall mean Average Precision (mAP) score.

Affine-invariant Hessian regions [21] are computed for each image, giving approximately 3,300 features per image (1024×768 pixels). Each affine region is represented by a 128-D SIFT descriptor [18].

2.2 Performance Loss Due to Quantization

To assess the performance loss due to quantization, four retrieval systems (RS) are compared:

The baseline retrieval system (RS1): In this system each image is represented as a bag of visual words. All image descriptors are clustered using the approximate k-means algorithm [3] into 500K visual words. At indexing and query time each descriptor is associated with its (approximate) nearest cluster centre to form a visual word, and a retrieval ranking score is obtained using tf-idf weighting. No spatial verification is performed. Note that each dataset has its own vocabulary.

Spatial re-ranking to depth 200 (RS2): For this system a spatial verification procedure [3] is adopted, estimating an affine homography from single image correspondences between the query image and each target image. The top 200 images returned from RS1 are re-ranked using the number of inliers found between the query and target images under the computed homography.

Spatial verification to full depth (RS3): The same method is used as in RS2, but here all dataset images are ranked using the number of inliers to the computed homography.

Table 1. The mAP performance gap between raw SIFT descriptors and visual words on the Oxford and Paris datasets. In the spatial cases, an affine homography is computed using RANSAC and the data is re-ranked by the number of inliers. Using raw SIFT descriptors coupled with Lowe's second nearest neighbour test [22] gives a 14% retrieval boost over the baseline method for Oxford. (i)-(iii) all use a K = 500,000 vocabulary trained on their respective datasets.

| Item | Method | Oxford mAP | Paris mAP |
|------|--------|------------|-----------|
| i.   | RS1: Baseline (visual words, no spatial)   | 0.613 ± 0.011 | 0.643 ± 0.002 |
| ii.  | RS2: Spatial (visual words, depth=200)     | 0.647 ± 0.011 | 0.655 ± 0.002 |
| iii. | RS3: Spatial (visual words, depth=FULL)    | 0.653 ± 0.012 | 0.663 ± 0.002 |
| iv.  | RS4: Spatial (raw descriptors, depth=FULL) | 0.755         | 0.672         |

Raw SIFT descriptors with spatial verification (RS4): Putative matches on the raw SIFT descriptors (no quantization) are found between the query and every image in the dataset using Lowe's second nearest neighbour test [18] (threshold = 0.8). Spatial verification as in RS3 is applied to the set of putative matches.

It should be noted that the methods RS3 and RS4 exhaustively match document pairs and so are infeasibly slow for real-time, large-scale retrieval. RS3 is roughly 10 times slower and RS4 roughly 100 times slower than RS2, even on the 5.1K Oxford dataset. These run-time gaps increase linearly for larger datasets.
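The second-nearest-neighbour ratio test used by RS4 is straightforward: a query descriptor's match to its nearest target descriptor is accepted only when that distance is clearly smaller than the distance to the second-nearest target. A minimal sketch (the function name and brute-force search are illustrative; the paper's threshold of 0.8 is used as the default):

```python
import numpy as np

def ratio_test_matches(query_desc, target_desc, ratio=0.8):
    """Return putative matches (query index, target index) that pass Lowe's
    second nearest neighbour test: accept the nearest target descriptor only
    if its distance is less than `ratio` times the second-nearest distance.
    Brute-force search, purely for illustration."""
    matches = []
    if len(target_desc) < 2:          # need two neighbours for the test
        return matches
    for i, q in enumerate(query_desc):
        d = np.linalg.norm(target_desc - q, axis=1)  # distances to all targets
        nn = np.argsort(d)[:2]                       # two nearest neighbours
        if d[nn[0]] < ratio * d[nn[1]]:
            matches.append((i, int(nn[0])))
    return matches
```

A distinctive match (one neighbour far closer than the other) passes, while an ambiguous one (two near-equidistant neighbours) is rejected, which is exactly what makes the surviving matches reliable inputs to spatial verification.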
The results for all four methods are shown in Table 1. For methods based on visual words, the mean and standard deviation over 3 runs of k-means with different initializations are shown. Going from baseline (i) to baseline plus spatial (ii) gives moderate improvements to both datasets, but re-ranking significantly more documents gives little appreciable further gain. In contrast, using the raw SIFT descriptors gives a large boost in retrieval performance for both datasets, demonstrating that the mAP-gap is principally due to quantization errors. This implies that a lack of visual word matches contributes substantially more to missed retrievals than re-ranking too few documents at query time. The raw-descriptor matching procedure will be used to generate point pairs for our learning algorithm, so Table 1 (iv) gives a rough upper bound to the retrieval improvement we can hope to achieve using any learning algorithm based on those training inputs.

3 Automatic Training Data Generation

In this section, we describe our method to automatically generate training data for the descriptor projection learning procedure. The training data is generated by pair-wise image matching, a much cheaper alternative to the full multi-view reconstruction used in [16,17], allowing us to generate a large number (3M+) of training pairs. In addition to positive (matched) examples, we separately collect hard and easy negative examples and show later that making this distinction can significantly improve the learnt projections.
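The AP and mAP figures quoted above can be computed in a few lines. This is a minimal sketch, assuming each ranked list covers every relevant image for its query (so averaging precision at each relevant rank equals the area under the precision-recall curve); the function names are illustrative:

```python
def average_precision(ranked_relevance):
    """AP from a ranked list of binary relevance labels (1 = relevant).
    Averages precision at each relevant rank, which matches the area
    under the precision-recall curve when all relevant items appear
    somewhere in the ranking."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """mAP: the mean of per-query AP scores (55 queries per dataset here)."""
    aps = [average_precision(r) for r in ranked_lists]
    return sum(aps) / len(aps)
```

For example, a ranking that returns relevant, irrelevant, relevant scores AP = (1/1 + 2/3)/2 = 5/6.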
We proceed as follows: (i) An image pair is chosen at random from the dataset; (ii) A set of putative matches is computed between the image pair. Each putative match consists of a pair of elliptical features, one in each image, that pass Lowe's second nearest neighbour ratio test [18] on their SIFT descriptors; (iii) RANSAC is used to estimate an affine transform between the images together with a number of inliers consistent with that transform. Point pairs are only taken from image matches with greater than 20 verified inliers. The ratio test ensures that putative matches are distinctive for that particular pair of images.

This procedure generates three sets of point pairs, shown in Figure 1, that we treat distinctly in the learning algorithm:

1. Positives: These are the point pairs found as inliers by RANSAC.
2. Nearest neighbour negatives (nnN): These are pairs marked as outliers by RANSAC; they are generally close in descriptor space, as they were found to be descriptor-space nearest neighbours between the two images, but are spatially inconsistent with the best-fitting affine transformation found between the images.
3. Random negatives (ranN): These are pairs which are not descriptor-space nearest neighbours, i.e. random sets of features generally far apart in the original descriptor space.

A histogram of SIFT distances for the three different sets of point pairs on the Oxford dataset is shown in Figure 2(b). As expected, the original SIFT descriptor easily separates the random negatives from the positive and NN negative point pairs, but strongly confuses the positives and NN negatives. Section 5 will show that the best retrieval performance arises when the positive and NN negative pairs are separated whilst simultaneously keeping the random negative pairs distant. It is important to note that, due to the potential for repeated structure and the limitations of the spatial matching method (only affine planar homographies are considered), some of the nnN point pairs might be incorrectly labelled positives;
this can lead to significant noise in the training data. We collect 3M training pairs from the Oxford dataset, split equally into positive, NN negative and random negative pairs, and we also have a separate set of 300K pairs used as a validation set to determine regularization parameters.

4 Learning the Descriptor Projection Function

Our objective here is to improve on a baseline distance measure that partially confuses some pairs of points that should be kept apart (the nearest neighbour negative pairs) with those that should be matched (the positive pairs), as shown in Figure 2(b). There is a danger in learning a projection using only these training points that are confused in the original descriptor space: although we might learn a function to bring these points closer together, the projection might (especially if it is non-linear) draw in other points so that a particular pair of points are no longer nearest neighbours. Being a nearest neighbour explicitly depends on all other points in the space, so great care must be exercised when ignoring other points. Here, we aim to overcome these problems by incorporating the distances between a large set of random point pairs directly into our cost function.
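The cost function itself (equation (2)) is not reproduced in this excerpt, but its structure follows from the text: positive pairs are pulled inside a margin, NN negative pairs are pushed beyond it, and random negative pairs are pushed beyond a larger margin controlled by the margin ratio. The sketch below captures only that structure; the hinge form, the function name, and the default parameter values are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def pair_cost(d_pos, d_nnN, d_ranN, b=200.0, margin_ratio=16.0):
    """Illustrative margin-based cost over projected-space distances.
    Positives incur cost when farther than margin b; NN negatives when
    closer than b; random negatives when closer than b * margin_ratio.
    A sketch of the structure described in the text, not equation (2)."""
    big_b = b * margin_ratio
    cost  = np.maximum(0.0, d_pos - b).sum()       # positives: want d < b
    cost += np.maximum(0.0, b - d_nnN).sum()       # nnN: want d > b
    cost += np.maximum(0.0, big_b - d_ranN).sum()  # ranN: want d > b * ratio
    return cost
```

With margin_ratio = 0 the two negative margins coincide, mimicking methods that use only two types of point pairs; a large ratio makes the random negatives dominate the cost, which matches the trade-off discussed in Section 5.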
Fig. 5. (a) Linear model: mAP performance as the final dimension is varied. (b) Non-linear model: mAP performance as the hidden layer dimension is varied. The output dimension is fixed to 32.

Choosing the margin ratio: Figure 3 examines the retrieval performance as a function of the margin ratio for a non-linear model with one hidden layer of size 384 projecting down to 32-D. This ratio controls the extent to which the random negative pairs should be separated from the positive pairs. At 0, both margins are the same, which mimics previous methods that use just two types of point pairs: if the ratio is set too low, the random negative pairs start to be clustered with the positive pairs; if it is set too high, then the learning algorithm focuses all its attention on separating the random negatives and isn't able to separate the positive and NN negative pairs. Distance histograms for different margin ratios are shown in Figure 4. As the ratio is increased, there is a peak in performance between 16 and 17. In all subsequent experiments, this ratio is set to 16, with the smaller margin set to 200. These results clearly demonstrate the value of considering both sets of negative point pairs.

Linear model: Results for the linear model are given in Table 2 and are shown in Figure 5(a). Performance increases only up to 64-D and then plateaus. At 64-D, the performance without spatial re-ranking improves by 4% over RS1, and with spatial re-ranking by 1.8% over RS2 (Table 2 gives the exact mAP values). Therefore, a learned linear projection leads to a slight but significant performance improvement, and we can reduce the dimensionality of the original descriptors by using this linear projection with no degradation in performance.

We compare to the linear discriminant method of Hua et al. [16], using a local implementation of their algorithm on our data. For this method, we used the ranN pairs as the negatives for training (performance was worse when nnN pairs were used as the negatives). Using 1M positive and 1M random negative pairs, reducing the output dimension to 32-D, gives a performance of 0.585 without spatial re-ranking and 0.625 with spatial re-ranking. This is slightly worse than our linear results, which give an mAP of 0.600 and 0.634 respectively. The difference in performance can be explained by our use of a different margin-based cost function and the consideration of both the nnN and ranN point pairs.
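One reason the linear model is attractive in practice is that, once learned, the projection is a single matrix applied to every descriptor before quantization, so it adds essentially no cost at index or query time while shrinking the descriptors. A minimal sketch, using a random matrix as a stand-in for a learned 64×128 projection (the shapes follow the 128-D SIFT input and 64-D output discussed above; the values are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))        # stand-in for the learned projection
sift = rng.standard_normal((3300, 128))   # ~3,300 SIFT descriptors per image

# Project all of an image's descriptors with one matrix multiply,
# then quantize the 64-D outputs against the vocabulary as usual.
projected = sift @ W.T
assert projected.shape == (3300, 64)
```

The same deployment path applies to the non-linear model, at the cost of one extra matrix multiply and non-linearity per hidden layer.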
We have illustrated the method for SIFT and for two types of projection functions, but clearly the framework of automatically generating training data and learning the projection function through optimization of (2) could be applied to other descriptors, e.g. the DAISY descriptor of [29], or even directly to image patches.

Acknowledgements. We are grateful for financial support from the EPSRC, the Royal Academy of Engineering, Microsoft, ERC grant VisRec no. 228180, ANR project HFIBMR (ANR-07-BLAN-0331-01) and the MSR-INRIA laboratory.

References

1. Jegou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometric consistency for large scale image search. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 304-317. Springer, Heidelberg (2008)
2. Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: Proc. CVPR (2006)
3. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: Proc. CVPR (2007)
4. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: Proc. ICCV (2003)
5. Boiman, O., Shechtman, E., Irani, M.: In defence of nearest-neighbor based image classification. In: Proc. CVPR (2008)
6. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. In: Proc. CVPR (2008)
7. van Gemert, J., Geusebroek, J.M., Veenman, C., Smeulders, A.: Kernel codebooks for scene categorization. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part III. LNCS, vol. 5304, pp. 696-709. Springer, Heidelberg (2008)
8. Chum, O., Philbin, J., Sivic, J., Isard, M., Zisserman, A.: Total recall: Automatic query expansion with a generative feature model for object retrieval. In: Proc. ICCV (2007)
9. Schultz, M., Joachims, T.: Learning a distance metric from relative comparisons. In: NIPS (2003)
10. Weinberger, K., Blitzer, J., Saul, L.: Distance metric learning for large margin nearest neighbor classification. In: NIPS (2005)
11. Kumar, P., Torr, P., Zisserman, A.: An invariant large margin nearest neighbour classifier. In: Proc. ICCV (2007)
12. Frome, A., Singer, Y., Sha, F., Malik, J.: Learning globally-consistent local distance functions for shape-based image retrieval and classification. In: Proc. ICCV (2007)
13. Salakhutdinov, R., Hinton, G.: Learning a nonlinear embedding by preserving class neighbourhood structure. In: AI and Statistics (2007)
14. Mikolajczyk, K., Matas, J.: Improving descriptors for fast tree matching by optimal linear projection. In: Proc. ICCV (2007)
15. Ramanan, D., Baker, S.: Local distance functions: A taxonomy, new algorithms, and an evaluation. In: Proc. ICCV (2009)
16. Hua, G., Brown, M., Winder, S.: Discriminant embedding for local image descriptors. In: Proc. ICCV (2007)