Scalable Nearest Neighbor Algorithms for High Dimensional Data

Marius Muja, Member, IEEE, and David G. Lowe, Member, IEEE

Abstract: For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to [...]



[...] referred to as nearest neighbor matching. Having an efficient algorithm for performing fast nearest neighbor matching in large datasets can bring speed improvements [...]

[...] P such that the operation NN(q, P) can be performed efficiently. We are often interested in finding not just the first closest neighbor, but several closest neighbors. In this case, the search can be performed in several ways, depending on the number of neighbors returned and their distance to the query point: K-nearest neighbor (KNN) search, where the goal is to find the closest K points from the query point, and radius nearest neighbor search (RNN), where the goal is to find all the points located closer than some distance R from [...]

Nearest-neighbor search is a fundamental part of many computer vision algorithms and of significant importance in many other fields, so it has been widely studied. This section presents a review of previous work in this area.

2.1 Nearest Neighbor Matching Algorithms

We review the most widely used nearest neighbor techniques, classified in three categories: partitioning trees, hashing techniques and neighboring graph techniques.

2.1.1 Partitioning Trees

[...] paper we describe a modified k-means tree algorithm that we have found to give the best results for some datasets, while randomized k-d trees are best for others. Jégou et al. [27] propose the product quantization approach in which they decompose the space into low dimensional subspaces and represent the dataset points by compact codes computed as quantization indices in these subspaces. The compact codes are efficiently compared to the query points using an asymmetric approximate distance. Babenko and Lempitsky [28] propose the inverted multi-index, obtained by replacing the standard quantization in an inverted index with product quantization, obtaining [...]
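The two query types defined above can be illustrated with a naive exact search over the point set. This is only a minimal sketch of the problem statement (function and variable names are ours), not the approximate algorithms the paper proposes:

```python
import math

def knn_search(points, q, k):
    """Exact K-nearest neighbor search: indices of the k points closest to q.
    Brute force, O(n*d) distance computations per query."""
    return sorted(range(len(points)), key=lambda i: math.dist(points[i], q))[:k]

def radius_search(points, q, r):
    """Radius nearest neighbor search: indices of all points within distance r of q."""
    return [i for i, p in enumerate(points) if math.dist(p, q) <= r]
```

The approximate algorithms studied in the paper aim to avoid exactly this linear scan, which becomes the dominant cost for large, high dimensional datasets.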
[...] is important for obtaining good search performance. In Section 3.4 we propose an algorithm for finding the optimum algorithm parameters, including the optimum branching factor. Fig. 3 contains a visualisation of several hierarchical k-means decompositions with different branching factors.

Another parameter of the priority search k-means tree is Imax, the maximum number of iterations to perform in the k-means clustering loop. Performing fewer iterations can substantially reduce the tree build time and results in a slightly less than optimal clustering (if we consider the sum of squared errors from the points to the cluster centres as the measure of optimality). However, we have observed that even when using a small number of iterations, the nearest neighbor search performance is similar to that of the tree constructed by running the clustering until convergence, as illustrated by Fig. 4. It can be seen that using as few as seven iterations we get more than 90 percent of the nearest-neighbor performance of the tree constructed using full convergence, but requiring less than 10 percent of the build time.

The algorithm to use when picking the initial centres in the k-means clustering can be controlled by the Calg parameter. In our experiments (and in the FLANN library) we have [...]

Fig. 3. Projections of priority search k-means trees constructed using different branching factors: 4, 32, 128. The projections are constructed using the same technique as in [26], gray values indicating the ratio between the distances to the nearest and the second-nearest cluster centre at each tree level, so that the darkest values (ratio close to 1) fall near the boundaries between k-means regions.

Fig. 4. The influence that the number of k-means iterations has on the search speed of the k-means tree. The figure shows the relative search time compared to the case of using full convergence.
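The role of the Imax parameter can be seen in a plain Lloyd's k-means loop whose iteration count is capped. This is a generic sketch under our own naming, not FLANN's tree-building code; it only shows the build-time versus clustering-quality trade-off described above:

```python
import math
import random

def capped_kmeans(points, k, max_iterations=7, seed=0):
    """Lloyd's k-means stopped after max_iterations (the role of I_max above):
    fewer iterations mean a faster build and a slightly less optimal
    clustering by the sum-of-squared-errors measure."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # random initial centres
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:
                centres[i] = tuple(sum(coord) / len(members)
                                   for coord in zip(*members))
    return centres
```

Stopping well before convergence leaves the centres slightly off their final positions, which matches the observation above that search performance is nearly unaffected while build time drops sharply.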
[...] distances to all the cluster centres of the child nodes, an O(Kd) operation. The unexplored branches are added to a priority queue, which can be accomplished in O(K) amortized cost when using binomial heaps. For the leaf node the distance between the query and all the points in the leaf needs to be computed, which takes O(Kd) time. In summary, the overall search cost is O(Ld(log n / log K)).

3.3 The Hierarchical Clustering Tree

Matching binary features is of increasing interest in the computer vision community, with many binary visual descriptors being recently proposed: BRIEF [49], ORB [50], BRISK [51]. Many algorithms suitable for matching vector based features, such as the randomized kd-tree and priority search k-means tree, are either not efficient or not suitable for matching binary features (for example, the priority search k-means tree requires the points to be in a vector space where their dimensions can be independently averaged).

Binary descriptors are typically compared using the Hamming distance, which for binary data can be computed as a bitwise XOR operation followed by a bit count on the result (very efficient on computers with hardware support for counting the number of bits set in a word¹).

This section briefly presents a new data structure and algorithm, called the hierarchical clustering tree, which we found to be very effective at matching binary features. For a more detailed description of this algorithm the reader is encouraged to consult [47] and [52]. The hierarchical clustering tree performs a decomposition of the search space by recursively clustering the input dataset using random data points as the cluster centers of the non-leaf nodes (see Algorithm 3).

In contrast to the priority search k-means tree presented above, for which using more than one tree did not bring significant improvements, we have found that building multiple hierarchical clustering trees and searching them in parallel using a common priority queue (the same approach that has been found to work well for randomized kd-trees [13]) resulted in significant improvements in the search performance.

1. The POPCNT instruction for modern x86_64 architectures.

3.4 Automatic Selection of the Optimal Algorithm

Our experiments have revealed that the optimal algorithm for approximate nearest neighbor search is highly dependent on several factors such as the data dimensionality, size and structure of the dataset (whether there is any correlation between the features in the dataset) and the desired search precision. Additionally, each algorithm has a set of parameters that have significant influence on the search performance (e.g., number of randomized trees, branching factor, number of k-means iterations). As we already mentioned in Section 2.2, the optimum parameters for a nearest neighbor algorithm are typically chosen manually, using various heuristics. In this section we propose a method for automatic selection of the best nearest neighbor algorithm to use for a particular dataset and for choosing its optimum parameters. [...] being a cost function indicating how well the search algorithm A, configured with the parameters θ, performs on the given input data.

[...] the Nelder-Mead downhill simplex method [43] to further locally explore the parameter space and fine-tune the best solution obtained in the first step. Although this does not guarantee a global minimum, our experiments have shown that the parameter values obtained are close to optimum in practice. We use random sub-sampling cross-validation to generate the data and the query points when we run the optimization. In FLANN the optimization can be run on the full dataset for the most accurate results or using just a fraction of the dataset to have a faster auto-tuning process. The parameter selection needs to only be performed once for each type of dataset, and the optimum parameter values can be saved and applied to all future datasets of the same type.

4 EXPERIMENTS

For the experiments presented in this section we used a selection of datasets with a wide range of sizes and data dimensionality. Among the datasets used are the Winder/Brown patch dataset [53], datasets of randomly sampled data of different dimensionality, datasets of SIFT features [...] the memory usage is not a concern, [...]
[...] the data in the memory of a single machine for very large datasets. Storing the data on the disk involves significant performance penalties due to the performance gap between memory and disk access times. In FLANN we used the approach of performing distributed nearest neighbor search across multiple machines.

5.1 Searching on a Compute Cluster

In order to scale to very large datasets, we use the approach of distributing the data to multiple machines in a compute cluster and perform the nearest neighbor search using all the machines in parallel. The data is distributed equally between the machines, such that for a cluster of N machines each of them will only have to index and search 1/N of the whole dataset (although the ratios can be changed to have more data on some machines than others). The final result of the nearest neighbor search is obtained by merging the partial results from all the machines in the cluster once they have completed the search. In order to distribute the nearest neighbor matching on a compute cluster we implemented a Map-Reduce like algorithm using the Message Passing Interface (MPI) specification.

[...] the experiment on a single machine. Fig. 15 shows the performance obtained by using eight parallel processes on one, two or three machines. Even though the same number of parallel processes are used, it can be seen that the performance increases when those processes are distributed on more machines. This can also be explained by the memory access overhead, since when more machines are used, fewer processes are running on each machine, requiring fewer memory accesses.
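The merge step of Section 5.1, combining per-machine partial results into a final answer, can be sketched as follows. This is a simplified single-process illustration with our own names; the paper's actual implementation is a Map-Reduce like algorithm over MPI:

```python
import heapq

def merge_partial_results(partial_results, k):
    """Combine per-machine k-NN candidate lists into the global k nearest.
    Each candidate is a (distance, point_id) pair; each machine has already
    searched its own 1/N shard of the dataset."""
    return heapq.nsmallest(k, (c for shard in partial_results for c in shard))
```

For example, merging `[(0.5, "a1"), (2.0, "a2")]` from one machine with `[(0.1, "b1"), (3.0, "b2")]` from another at k = 2 yields `[(0.1, "b1"), (0.5, "a1")]`: the global result is correct as long as each shard returns at least its own k best candidates.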
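Section 3.4's cost functions aside, the cheapest primitive in the paper is the binary-feature comparison of Section 3.3: Hamming distance as a bitwise XOR followed by a bit count. A minimal sketch, with descriptors modeled as Python integers for illustration (the hardware POPCNT path mentioned in the footnote is what makes this fast in practice):

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary descriptors packed into ints:
    XOR leaves a 1 bit wherever the descriptors differ, then count the 1s."""
    return bin(a ^ b).count("1")
```

For example, `0b1011` and `0b0010` differ in two bit positions, so their Hamming distance is 2.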
REFERENCES

[...] 2005, vol. 1, pp. 26-33.
[7] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny [...]
[...] 2009, pp. 248-255.
[9] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, no. 9, pp. 509-517, 1975.
[10] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An algorithm for finding best matches in logarithmic expected time," ACM Trans. Math. Softw., vol. 3, no. 3, pp. 209-226, 1977.
[11] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions," J. ACM, vol. 45, no. 6, pp. 891-923, 1998.
[12] J. S. Beis and D. G. Lowe, "Shape indexing using approximate nearest-neighbour search in high-dimensional spaces," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 1997, pp. 1000-1006.
[13] C. Silpa-Anan and R. Hartley, "Optimised KD-trees for fast image [...] 2008, pp. 537-546.
[17] Y. Jia, J. Wang, G. Zeng, H. Zha, and X. S. Hua, "Optimizing kd-trees for scalable visual descriptor indexing," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2010, pp. 3392-3399.
[18] K. Fukunaga and P. M. Narendra, "A branch and bound algo- [...]
[23] T. Liu, A. Moore, A. Gray, and K. Yang, "An investigation of practical approximate nearest neighbor algorithms," presented at the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 2004.
[24] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2006, pp. 2161-2168.
[25] B. Leibe, K. Mikolajczyk, and B. Schiele, "Efficient clustering and matching for object class recognition," in Proc. British Mach. Vis. Conf., 2006, pp. 789-798.
[26] G. Schindler, M. Brown, and R. Szeliski, "City-scale location recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2007, pp. 1-7.
[27] H. Jégou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 1-15, Jan. 2010.
[28] A. Babenko and V. Lempitsky, "The inverted multi-index," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2012, pp. 3069-3076.
[29] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," Commun. ACM, vol. 51, no. 1, pp. 117-122, 2008.
[30] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-probe LSH: Efficient indexing for high-dimensional similarity search," in Proc. Int. Conf. Very Large Data Bases [...]
[...] Comput. Vis. Pattern Recog., 2012, pp. 1106-1113.
[43] J. A. Nelder and R. Mead, "A simplex method for function minimization," Comput. J., vol. 7, no. 4, pp. 308-313, 1965.
[44] F. Hutter, "Automated configuration of algorithms for solving hard computational problems," Ph.D. dissertation, Comput. Sci. Dept., Univ. British Columbia, Vancouver, BC, Canada, 2009.
[45] F. Hutter, H. H. Hoos, and K. Leyton-Brown, "ParamILS: An automatic algorithm configuration framework," J. Artif. Intell. Res. [...]
[47] M. Muja, "Scalable nearest neighbour methods for high dimensional data," Ph.D. dissertation, Comput. Sci. Dept., Univ. British Columbia, Vancouver, BC, Canada, 2013.
[48] D. Arthur and S. Vassilvitskii, "K-means++: The advantages of careful seeding," in Proc. Symp. Discrete Algorithms, 2007, pp. 1027-1035.
[49] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary [...]
[...] 2011, pp. 2548-2555.
[52] M. Muja and D. G. Lowe, "Fast matching of binary features," in Proc. 9th Conf. Comput. Robot Vis., 2012, pp. 404-410.
[53] S. Winder and M. Brown, "Learning local image descriptors," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2007, pp. 1-8.
[54] K. Mikolajczyk and J. Matas, "Improving descriptors for fast tree [...]