Scalable Nearest Neighbor Algorithms for High Dimensional Data
Marius Muja, Member, IEEE, and David G. Lowe, Member, IEEE


Abstract

For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to …

… referred to as nearest neighbor matching. Having an efficient algorithm for performing fast nearest neighbor matching in large datasets can bring speed improvements …

… P such that the operation NN(q, P) can be performed efficiently. We are often interested in finding not just the first closest neighbor, but several closest neighbors. In this case, the search can be performed in several ways, depending on the number of neighbors returned and their distance to the query point: K-nearest neighbor (KNN) search, where the goal is to find the closest K points from the query point, and radius nearest neighbor search (RNN), where the goal is to find all the points located closer than some distance R from the query point.

Nearest-neighbor search is a fundamental part of many computer vision algorithms and of significant importance in many other fields, so it has been widely studied. This section presents a review of previous work in this area.

2.1 Nearest Neighbor Matching Algorithms

We review the most widely used nearest neighbor techniques, classified in three categories: partitioning trees, hashing techniques and neighboring graph techniques.

2.1.1 Partitioning Trees

… paper we describe a modified k-means tree algorithm that we have found to give the best results for some datasets, while randomized k-d trees are best for others. Jégou et al. [27] propose the product quantization approach in which they decompose the space into low dimensional subspaces and represent the dataset points by compact codes computed as quantization indices in these subspaces. The compact codes are efficiently compared to the query points using an asymmetric approximate distance. Babenko and Lempitsky [28] propose the inverted multi-index, obtained by replacing the standard quantization in an inverted index with product quantization, obtaining …
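The two search modes just defined, KNN search and radius search, can be illustrated with a minimal brute-force sketch (a plain linear scan for clarity, not one of the tree-based algorithms this paper proposes; the function names are ours):

```python
import heapq
import math

def knn_search(query, points, k):
    """K-nearest neighbor search: return the k points closest to the
    query under Euclidean distance, nearest first."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(query, p))

def radius_search(query, points, r):
    """Radius nearest neighbor search: return all points located
    closer than distance r from the query."""
    return [p for p in points if math.dist(query, p) <= r]
```

Both operations scan every point, so each query costs O(nd) for n points of dimension d; the partitioning structures discussed below exist precisely to avoid this linear scan.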
… is important for obtaining good search performance. In Section 3.4 we propose an algorithm for finding the optimum algorithm parameters, including the optimum branching factor. Fig. 3 contains a visualisation of several hierarchical k-means decompositions with different branching factors.

Another parameter of the priority search k-means tree is Imax, the maximum number of iterations to perform in the k-means clustering loop. Performing fewer iterations can substantially reduce the tree build time and results in a slightly less than optimal clustering (if we consider the sum of squared errors from the points to the cluster centres as the measure of optimality). However, we have observed that even when using a small number of iterations, the nearest neighbor search performance is similar to that of the tree constructed by running the clustering until convergence, as illustrated by Fig. 4. It can be seen that using as few as seven iterations we get more than 90 percent of the nearest-neighbor performance of the tree constructed using full convergence, but requiring less than 10 percent of the build time.

The algorithm to use when picking the initial centres in the k-means clustering can be controlled by the Calg parameter. In our experiments (and in the FLANN library) we have …

Fig. 3. Projections of priority search k-means trees constructed using different branching factors: 4, 32, 128. The projections are constructed using the same technique as in [26], gray values indicating the ratio between the distances to the nearest and the second-nearest cluster centre at each tree level, so that the darkest values (ratio close to 1) fall near the boundaries between k-means regions.

Fig. 4. The influence that the number of k-means iterations has on the search speed of the k-means tree. Figure shows the relative search time compared to the case of using full convergence.
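The trade-off controlled by Imax can be sketched with a plain Lloyd's k-means loop whose iteration count is capped (a generic illustration of the idea, not FLANN's implementation; the function and parameter names are ours):

```python
import random
import math

def kmeans(points, k, max_iterations):
    """Lloyd's k-means with a capped iteration count, mirroring the role
    of the Imax parameter: stop after max_iterations even if the
    clustering has not fully converged."""
    centres = random.sample(points, k)
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centre as its cluster's mean;
        # keep the old centre if the cluster is empty.
        new_centres = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centres[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centres == centres:
            break  # converged before hitting the iteration cap
        centres = new_centres
    return centres
```

Stopping early leaves the sum of squared errors slightly above the converged optimum, but, as the paper observes, the resulting tree's search performance is close to that of a fully converged clustering at a fraction of the build time.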
… distances to all the cluster centres of the child nodes, an O(Kd) operation. The unexplored branches are added to a priority queue, which can be accomplished in O(K) amortized cost when using binomial heaps. For the leaf node, the distance between the query and all the points in the leaf needs to be computed, which takes O(Kd) time. In summary, the overall search cost is O(Ld(log n / log K)).

3.3 The Hierarchical Clustering Tree

Matching binary features is of increasing interest in the computer vision community, with many binary visual descriptors being recently proposed: BRIEF [49], ORB [50], BRISK [51]. Many algorithms suitable for matching vector based features, such as the randomized kd-tree and priority search k-means tree, are either not efficient or not suitable for matching binary features (for example, the priority search k-means tree requires the points to be in a vector space where their dimensions can be independently averaged).

Binary descriptors are typically compared using the Hamming distance, which for binary data can be computed as a bitwise XOR operation followed by a bit count on the result (very efficient on computers with hardware support for counting the number of bits set in a word¹).

This section briefly presents a new data structure and algorithm, called the hierarchical clustering tree, which we found to be very effective at matching binary features. For a more detailed description of this algorithm the reader is encouraged to consult [47] and [52]. The hierarchical clustering tree performs a decomposition of the search space by recursively clustering the input dataset using random data points as the cluster centers of the non-leaf nodes (see Algorithm 3).

In contrast to the priority search k-means tree presented above, for which using more than one tree did not bring significant improvements, we have found that building multiple hierarchical clustering trees and searching them in parallel using a common priority queue (the same approach that has been found to work well for randomized kd-trees [13]) resulted in significant improvements in the search performance.

3.4 Automatic Selection of the Optimal Algorithm

Our experiments have revealed that the optimal algorithm for approximate nearest neighbor search is highly dependent on several factors such as the data dimensionality, the size and structure of the dataset (whether there is any correlation between the features in the dataset) and the desired search precision. Additionally, each algorithm has a set of parameters that have significant influence on the search performance (e.g., number of randomized trees, branching factor, number of k-means iterations). As we already mentioned in Section 2.2, the optimum parameters for a nearest neighbor algorithm are typically chosen manually, using various heuristics. In this section we propose a method for automatic selection of the best nearest neighbor algorithm to use for a particular dataset and for choosing its optimum parameters.

… being a cost function indicating how well the search algorithm A, configured with the parameters θ, performs on the given input data.

1. The POPCNT instruction for modern x86_64 architectures.

… Nelder-Mead downhill simplex method [43] to further locally explore the parameter space and fine-tune the best solution obtained in the first step. Although this does not guarantee a global minimum, our experiments have shown that the parameter values obtained are close to optimum in practice. We use random sub-sampling cross-validation to generate the data and the query points when we run the optimization. In FLANN the optimization can be run on the full dataset for the most accurate results or using just a fraction of the dataset to have a faster auto-tuning process. The parameter selection needs to only be performed once for each type of dataset, and the optimum parameter values can be saved and applied to all future datasets of the same type.

4 EXPERIMENTS

For the experiments presented in this section we used a selection of datasets with a wide range of sizes and data dimensionality. Among the datasets used are the Winder/Brown patch dataset [53], datasets of randomly sampled data of different dimensionality, datasets of SIFT features …

… the memory usage is not a concern, …
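The XOR-plus-bit-count computation of the Hamming distance described in Section 3.3 can be sketched as follows (descriptors assumed packed into Python integers; `bin(...).count("1")` is a portable software stand-in for the hardware POPCNT path, and on Python 3.10+ `int.bit_count()` maps to it more directly):

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary descriptors packed into
    integers: XOR the words so that differing bits become 1, then
    count the set bits in the result."""
    return bin(a ^ b).count("1")
```

For real descriptors such as BRIEF or ORB, each 256- or 512-bit descriptor would be packed into machine words and the per-word counts summed, which is why hardware bit-count support makes this comparison so cheap.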
… the data in the memory of a single machine for very large datasets. Storing the data on the disk involves significant performance penalties due to the performance gap between memory and disk access times. In FLANN we used the approach of performing distributed nearest neighbor search across multiple machines.

5.1 Searching on a Compute Cluster

In order to scale to very large datasets, we use the approach of distributing the data to multiple machines in a compute cluster and performing the nearest neighbor search using all the machines in parallel. The data is distributed equally between the machines, such that for a cluster of N machines each of them will only have to index and search 1/N of the whole dataset (although the ratios can be changed to have more data on some machines than others). The final result of the nearest neighbor search is obtained by merging the partial results from all the machines in the cluster once they have completed the search. In order to distribute the nearest neighbor matching on a compute cluster we implemented a Map-Reduce like algorithm using the message passing interface (MPI) specification.

… the experiment on a single machine. Fig. 15 shows the performance obtained by using eight parallel processes on one, two or three machines. Even though the same number of parallel processes are used, it can be seen that the performance increases when those processes are distributed on more machines. This can also be explained by the memory access overhead, since when more machines are used, fewer processes are running on each machine, requiring fewer memory accesses.
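The reduce step described above, combining per-machine partial results into the global k nearest neighbors, can be sketched as follows (a generic reduction, not FLANN's MPI implementation; the function name and the sorted (distance, point) tuple format are our assumptions):

```python
import heapq

def merge_partial_results(partial_results, k):
    """Merge the partial results of a distributed KNN search.

    Each machine searches its own shard and returns its k best
    (distance, point) pairs sorted by distance; the global k nearest
    neighbors are simply the k smallest entries across all lists.
    heapq.merge combines the sorted lists lazily, so only the first
    k entries of the merged stream are ever materialised."""
    return heapq.nsmallest(k, heapq.merge(*partial_results))
```

Because each machine already pruned its shard down to k candidates, the merge cost depends only on the number of machines and k, not on the dataset size, which keeps the reduction step cheap relative to the parallel search itself.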
…, 2005, vol. 1, pp. 26–33.
[7] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny … , 2009, pp. 248–255.
[9] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, no. 9, pp. 509–517, 1975.
[10] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An algorithm for finding best matches in logarithmic expected time," ACM Trans. Math. Softw., vol. 3, no. 3, pp. 209–226, 1977.
[11] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu, "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions," J. ACM, vol. 45, no. 6, pp. 891–923, 1998.
[12] J. S. Beis and D. G. Lowe, "Shape indexing using approximate nearest-neighbour search in high-dimensional spaces," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 1997, pp. 1000–1006.
[13] C. Silpa-Anan and R. Hartley, "Optimised KD-trees for fast image … , 2008, pp. 537–546.
[17] Y. Jia, J. Wang, G. Zeng, H. Zha, and X. S. Hua, "Optimizing kd-trees for scalable visual descriptor indexing," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2010, pp. 3392–3399.
[18] K. Fukunaga and P. M. Narendra, "A branch and bound algo…
[23] T. Liu, A. Moore, A. Gray, and K. Yang, "An investigation of practical approximate nearest neighbor algorithms," presented at the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 2004.
[24] D. Nister and H. Stewenius, "Scalable recognition with a vocabulary tree," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2006, pp. 2161–2168.
[25] B. Leibe, K. Mikolajczyk, and B. Schiele, "Efficient clustering and matching for object class recognition," in Proc. British Mach. Vis. Conf., 2006, pp. 789–798.
[26] G. Schindler, M. Brown, and R. Szeliski, "City-scale location recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2007, pp. 1–7.
[27] H. Jégou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 1, pp. 1–15, Jan. 2010.
[28] A. Babenko and V. Lempitsky, "The inverted multi-index," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2012, pp. 3069–3076.
[29] A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," Commun. ACM, vol. 51, no. 1, pp. 117–122, 2008.
[30] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, "Multi-probe LSH: Efficient indexing for high-dimensional similarity search," in Proc. Int. Conf. Very Large Data Bases …
… Comput. Vis. Pattern Recog., 2012, pp. 1106–1113.
[43] J. A. Nelder and R. Mead, "A simplex method for function minimization," Comput. J., vol. 7, no. 4, pp. 308–313, 1965.
[44] F. Hutter, "Automated configuration of algorithms for solving hard computational problems," Ph.D. dissertation, Comput. Sci. Dept., Univ. British Columbia, Vancouver, BC, Canada, 2009.
[45] F. Hutter, H. H. Hoos, and K. Leyton-Brown, "ParamILS: An automatic algorithm configuration framework," J. Artif. Intell. Res. …
[47] M. Muja, "Scalable nearest neighbour methods for high dimensional data," Ph.D. dissertation, Comput. Sci. Dept., Univ. British Columbia, Vancouver, BC, Canada, 2013.
[48] D. Arthur and S. Vassilvitskii, "K-means++: The advantages of careful seeding," in Proc. Symp. Discrete Algorithms, 2007, pp. 1027–1035.
[49] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary … 2011, pp. 2548–2555.
[52] M. Muja and D. G. Lowe, "Fast matching of binary features," in Proc. 9th Conf. Comput. Robot Vis., 2012, pp. 404–410.
[53] S. Winder and M. Brown, "Learning local image descriptors," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2007, pp. 1–8.
[54] K. Mikolajczyk and J. Matas, "Improving descriptors for fast tree …