Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions

Alexandr Andoni, MIT, andoni@mit.edu
Piotr Indyk, MIT, indyk@mit.edu

This work was supported in part by NSF CAREER award CCR-0133849, David and Lucille Packard Fellowship and Alfred P. Sloan Fellowship.

Abstract

We present an algorithm for the c-approximate nearest neighbor problem in a d-dimensional Euclidean space, achieving query time of $O(dn^{1/c^2+o(1)})$ and space $O(dn + n^{1+1/c^2+o(1)})$. This almost matches the lower bound for hashing-based algorithms recently obtained in [27]. We also obtain a space-efficient version of the algorithm, which uses $dn + n\log^{O(1)} n$ space, with a query time of $dn^{O(1/c^2)}$. Finally, we discuss practical variants of the algorithms that utilize fast bounded-distance decoders for the Leech Lattice.

1. Introduction

The nearest neighbor problem is defined as follows: given a collection of n points, build a data structure which, given any query point, reports the data point that is closest to the query. A particularly interesting and well-studied instance is where the data points live in a d-dimensional Euclidean space. This problem is of major importance in several areas; some examples are: data compression, databases and data mining, information retrieval, image and video databases, machine learning, pattern recognition, statistics and data analysis. Typically, the features of each object of interest (document, image, etc.) are represented as a point in $\mathbb{R}^d$ and the distance metric is used to measure similarity of objects. The basic problem then is to perform indexing or similarity searching for query objects. The number of features (i.e., the dimensionality) ranges anywhere from tens to thousands.

There are several efficient algorithms known for the case when the dimension d is "low" (see [30] for an overview). Therefore the main issue is that of dealing with a large number of dimensions. Despite decades of intensive effort, the current solutions suffer from either space or query time that is exponential in d. In fact, for large enough d, in theory or in practice, they often provide little improvement over a linear algorithm that compares a query to each point from the database. This phenomenon is often called "the curse of dimensionality".

In recent years, several researchers proposed methods for overcoming the running time bottleneck by using approximation (e.g., [7, 23, 21, 25, 16, 24, 17, 12, 8, 28, 1], see also [31]). In that formulation, the algorithm is allowed to return a point whose distance from the query is at most c times the distance from the query to its nearest point; $c = 1 + \epsilon > 1$ is called the approximation factor. The appeal of this approach is that, in many cases, an approximate nearest neighbor is almost as good as the exact one. In particular, if the distance measure accurately captures the notion of user quality, then small differences in the distance should not matter. Moreover, an efficient approximation algorithm can be used to solve the exact nearest neighbor problem, by enumerating all approximate nearest neighbors and choosing the closest point. For many data sets this approach results in very efficient algorithms (see e.g., [4]).

In [25, 21, 8, 1], the authors constructed data structures for the $(1+\epsilon)$-approximate nearest neighbor problem which avoided the curse of dimensionality. Specifically, for any constant $\epsilon > 0$, the data structures support queries in time $O(d \log n)$, and use space which is polynomial in n. Unfortunately, the exponent in the space bounds is roughly $C/\epsilon^2$ (for $\epsilon < 1$), where C is a "non-negligible" constant. Thus, even for, say, $\epsilon = 1$, the space used by the data structure is large enough so that the algorithm becomes impractical even for relatively small data sets. In fact, in a recent paper [6] we show that, in order to solve the decision version of the approximate nearest neighbor problem, the exponent in the space bound must be at least $\Omega(1/\epsilon^2)$, as long as the search algorithm performs only a constant number of "probes" to the data structure. The algorithms of [25, 21] use only one probe.

In [21, 15], the authors introduced an alternative approach, which uses much smaller space while preserving
sub-linear query time (see Figure 1 for the running times). It relies on a concept of locality-sensitive hashing (LSH). The key idea is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects which are close to each other than for those which are far apart. Then, one can determine near neighbors by hashing the query point and retrieving elements stored in buckets containing that point. In [21, 15] the authors provided such locality-sensitive hash functions for the case when the points live in binary Hamming space (see footnote 1) $\{0,1\}^d$. In a followup work [12], the authors introduced LSH functions that work directly in Euclidean space and result in a (slightly) faster running time. The latter algorithm forms the basis of the E2LSH package [5] for high-dimensional similarity search, which has been used in several applied scenarios. Recently, [28] proposed a different method of utilizing locality-sensitive hash functions, which results in near-linear space, at the cost of somewhat higher query time.

Footnote 1. The algorithm can be extended to other norms, such as $l_2$, by using embeddings. However, this extension adds additional complexity to the algorithm.

The natural question raised by this line of research is: what is the smallest exponent $\rho$ achievable via the locality-sensitive hashing approach? It was conjectured by the second author (e.g., in [19]) that $\rho \le 1/c^2$. The conjecture was motivated by the fact that an algorithm with such exponent exists for the closely related problem of finding the furthest neighbor [20].

Our results. In this paper we essentially resolve the issue by providing an algorithm with query time $dn^{\rho(c)}$ using space $dn + n^{1+\rho(c)}$, where

$$\rho(c) = 1/c^2 + O(\log\log n / \log^{1/3} n).$$

This significantly improves over the earlier running time of [12]. In particular, for c = 2, our exponent tends to 0.25, while the exponent in [12] was around 0.45. Moreover, a recent paper [27] shows that hashing-based algorithms (as described in Section 2.3) cannot achieve $\rho < 0.462/c^2$. Thus, the running time exponent of our algorithm is essentially optimal, up to a small constant factor.

Our result immediately implies improved algorithms for several approximate problems in high dimensional spaces. For example, it is known [21, 18] that the c-approximate minimum spanning tree (MST) problem for n points in $l_2^d$ can be computed by using $O(n \log n)$ calls to the c-approximate near neighbor oracle for that space. Thus, our result implies a $dn^{1+1/c^2+o(1)}$-time algorithm for the c-approximate MST problem. Other problems for which similar improvement is obtained include dynamic closest pair and facility location [18].

Paper     Metric     Space                 Query time            Comments
[21, 15]  Hamming    $n^{1+1/c}$           $dn^{1/c}$
[12]      Euclidean  $n^{1+\rho'(c)}$      $dn^{\rho'(c)}$       $\rho'(c) < 1/c$
[28]      Euclidean  $n$                   $dn^{\rho''(c)}$      $\rho''(c)/(1/c) \to 2.09$
here      Euclidean  $n^{1+1/c^2+o(1)}$    $dn^{1/c^2+o(1)}$
here      Euclidean  $n$                   $dn^{O(1/c^2)}$

Figure 1. Space and time bounds for LSH-based data structures. Factors polynomial in $\log n$ and $1/\epsilon$, as well as an additive term of $dn$ in the storage bound, are omitted for clarity.

Unfortunately, the convergence of the exponent to the $1/c^2$ limit is rather slow. To be more precise: the running time of the algorithm is bounded by the formula

$$t^{O(t)} \, n^{1/c^2 + O(\log t)/t^{1/2}}$$

where t is a parameter chosen to minimize the expression. The $t^{O(t)}$ factor appears due to the fact that our algorithm exploits certain configurations of points in a t-dimensional space; the "quality" of the configurations increases with t. One can observe that the parameter t needs to be somewhat large for the exponent to be competitive against the earlier bounds. But then the factor $t^{O(t)}$ becomes very large, erasing the speedup gained from the improved exponent (unless n is really large).

To overcome this difficulty, we modify the algorithm to make it efficient for more moderate values of n. Specifically, we replace the aforementioned configurations of points by known constructions of "nice" point-sets in specific dimensions. In particular, by utilizing Leech Lattice [26] in 24 dimensions, we obtain an algorithm with exponent $\rho(c)$ such that $\rho(2) \le 0.37$, while the leading term in the running time is reduced to only a few hundred. Moreover, if the dimension d does not exceed 24, the exponent is reduced further (see footnote 2), and we achieve $\rho(2) \le 0.27$. The leading term in the running time remains the same.

Footnote 2. An astute reader will observe that if both the dimension d and approximation factor c are fixed constants, one can obtain a data structure with constant query time, essentially via table lookup. However, this approach leads to "big-Oh" constants that are exponential in the dimension, which defeats our goal of achieving a practical algorithm.

Finally, we show that the LSH functions can be used as in [28] to design a data structure with nearly-linear space of $O(dn + n\log^{O(1)} n)$ and query time $dn^{O(1/c^2)}$. This improves over the earlier bound of $dn^{O(1/c)}$ due to [28].

1.1 Techniques

We obtain our result by carefully designing a family of locality-sensitive hash functions in $l_2$. The starting point of our construction is the method of [12]. There, a point p was mapped into $\mathbb{R}^1$ by using random projection. Then, the line $\mathbb{R}^1$ was partitioned into equal-length intervals of length w, where w is a parameter. The hash function for p returned the index of the interval containing the projection of p.

An analysis in [12] showed that the query time exponent has an interesting dependence on the parameter w. If w tends to infinity, the exponent tends to $1/c$, which yields no improvement over [21, 15]. However, for small values of w, the exponent lies slightly below $1/c$. In fact, the unique minimum exists for each c.

In this paper we utilize a "multi-dimensional version" of the aforementioned approach. Specifically, we first perform random projection into $\mathbb{R}^t$, where t is super-constant, but relatively small (i.e., $t = o(\log n)$). Then we partition the space $\mathbb{R}^t$ into cells. The hash function returns the index of the cell which contains the projected point p.

The partitioning of the space $\mathbb{R}^t$ is somewhat more involved than its one-dimensional counterpart. First, observe that the natural idea of partitioning using a grid does not work. This is because this process roughly corresponds to hashing using concatenation of several one-dimensional functions (as in [12]). Since the LSH algorithm performs such concatenation anyway (see Preliminaries), grid partitioning does not result in any improvement. Instead, we use the method of "ball partitioning", introduced in [9] in the context of embeddings into tree metrics (a similar technique was also used in the SDP-based approximation algorithm for graph coloring [22]). Its idea is as follows. Create a sequence of balls $B_1, B_2, \ldots$, each of radius w, with centers chosen independently "at random". Each ball $B_i$ then defines a cell, containing points $B_i \setminus \bigcup_{j<i} B_j$.

In order to apply this method in our context, we need to take care of a few issues. First, we cannot use the method as given, since locating a cell containing a given point could take a long time. Instead, we show that one can simulate the above procedure by replacing each ball by a "grid of balls". It is not difficult then to observe that a finite (albeit exponential in t) number of such grids suffices to cover all points in $\mathbb{R}^t$.

The second and the main issue is the choice of w. Again, it turns out that for large w, the method yields only the exponent of $1/c$. Specifically, it was shown in [9] that for any two points $p, q \in \mathbb{R}^t$, the probability that the partitioning separates p and q is at most $O(\sqrt{t} \cdot \|p-q\|/w)$. This formula can be shown to be tight for the range of w where it makes sense as a lower bound, that is, for $w = \Omega(\sqrt{t} \cdot \|p-q\|)$. However, as long as the separation probability depends linearly on the distance between p and q, the exponent $\rho$ is still equal to $1/c$. Fortunately, a more careful analysis shows that, as in the one-dimensional case, the minimum is achieved for finite w. For that value of w, the exponent tends to $1/c^2$ as t tends to infinity.

Leech Lattice LSH. In order to obtain a more practical algorithm, we introduce a different partitioning method that avoids the $t^{O(t)}$ factor. Specifically, we use tessellations induced by (randomly shifted) Voronoi diagrams of fixed t-dimensional point constellations which have the following two nice properties:

- The closest constellation point to a given point can be found efficiently, and
- The exponent $\rho$ induced by the constellation is as close to $1/c^2$ as possible.

The partitioning is then implemented by randomly projecting the points into $\mathbb{R}^t$, and using the Voronoi diagram. We discovered that a constellation in 24 dimensions known as Leech Lattice [26] satisfies the above properties quite well. First, the nearest point in the lattice can be found by using a (bounded) decoder of [2] which performs only 519 floating point operations per decoded point. Second, the exponent $\rho(c)$ guaranteed by that decoder is quite attractive: for c = 2 the exponent $\rho(2)$ is less than 0.37. The intuitive reason for that is that Leech Lattice is a "very symmetric" constellation, and thus its Voronoi cells are very "round". Moreover, if the dimension d does not exceed 24, then we can skip the dimensionality reduction part. In that case we get $\rho(2) \le 0.27$, while the leading term in the running time remains the same.

Near-linear space algorithm. This result is achieved by plugging our new LSH function into the algorithm of [28]. Unlike the algorithm of [21] (which used $n^{\rho}$ independent hash tables), his algorithm uses only one hash table to
store the data set P. The hash table is then probed by hashing not just the query point q (as in [21]) but by hashing several points chosen randomly from the neighborhood of q. The intuition behind this approach is as follows. Let $p \in P$ be a point within distance 1 from q. If a random LSH function causes collision between p and q with probability $1/n^{\rho}$, then it is plausible that, with constant probability (see footnote 3), a random hash function causes collision between q and a "non-negligible" (say, $1/n^{\rho}$) fraction of the points in the unit ball around q. Since q and p are "close", it follows that, with constant probability, a random hash function causes collision between p and a "non-negligible" (although slightly smaller) fraction of the unit ball around q, which is exactly what the algorithm of [28] needs.

Footnote 3. In the actual proof, the probability is $1/\log^{O(1)} n$.

Converting this intuition into a formal proof is somewhat technical. This is mostly due to the fact that the new LSH functions are more complex than the ones from [12] (used in [28]), and thus we had to extend his framework to a more general setting. We defer the proofs to the full version of this paper.

2. Preliminaries

2.1. Notation

In this paper, we work in the Euclidean space. For a point $p \in \mathbb{R}^d$, we denote by $B(p, r)$ the ball centered at p with radius r, and we call $\partial B(p, r)$ its surface. For a ball with radius r in $\mathbb{R}^d$, we call its surface area $\mathrm{Sur}_d(r)$ and its volume $\mathrm{Vol}_d(r)$. We note that $\mathrm{Sur}_d(r) = S_d \cdot r^{d-1}$ and $\mathrm{Vol}_d(r) = S_d \cdot r^d / d$, where $S_d$ is the surface area of a ball of radius one (see, for example, [29], page 11).

We also need a (standard) bound on the volume of the cap of a ball $B(p, r)$. Let $C(u, r)$ be the volume of the cap at distance u from the center of the ball. Alternatively, $C(u, r)$ is half of the volume of the intersection of two balls of radius r with centers at distance 2u. Furthermore, let $I(u, r) = C(u, r) / \mathrm{Vol}_d(r)$ be the cap volume relative to the volume of the entire ball. We can bound $I(u, r)$ as follows.

Lemma 2.1. For any $d \ge 2$ and $0 \le u \le r$,

$$\frac{A_l}{\sqrt{d}} \left(1 - \left(\frac{u}{r}\right)^2\right)^{d/2} \le I(u, r) \le \left(1 - \left(\frac{u}{r}\right)^2\right)^{d/2}.$$

The proof is deferred to the appendix.

We also use the following standard facts about random projections in Euclidean spaces (for proofs see, e.g., [21]). Let $A \in M_{t,d}$ be a random projection from $\mathbb{R}^d$ to $\mathbb{R}^t$; specifically each element of A is chosen from the normal distribution $N(0, 1)$, multiplied by a scaling factor $1/\sqrt{t}$.

Fact 2.2. For any vector $v \in \mathbb{R}^d$, the value $\|Av\|^2 / \|v\|^2$ is distributed with probability density $t \cdot P_{\chi^2}(xt)$, where $P_{\chi^2}(x) = \frac{x^{t/2-1} e^{-x/2}}{\Gamma(t/2)\, 2^{t/2}}$ is the chi-squared distribution with t degrees of freedom. The expectation of $\|Av\|^2 / \|v\|^2$ is equal to 1.

Fact 2.3. For any vector $v \in \mathbb{R}^d$, $\Pr_A[\|Av\| > 2\|v\|] \le \exp[-\Omega(t)]$.

Fact 2.4. For any vector $v \in \mathbb{R}^d$ and any constant $\alpha > 10$, $\Pr_A[\|Av\| > \alpha\|v\|] \le \exp[-\Omega(t\sqrt{\alpha})]$.

2.2. Problem definition

In this paper, we solve the c-approximate near neighbor problem in $l_2$, the Euclidean space.

Definition 2.5 (c-approximate near neighbor, or c-NN). Given a set P of points in a d-dimensional Euclidean space $\mathbb{R}^d$, and parameters $R > 0$, $\delta > 0$, construct a data structure which, given any query point q, does the following with probability $1 - \delta$: if there exists an R-near neighbor of q in P, it reports some cR-near neighbor of q in P.

In the following, we will assume that $\delta$ is an absolute constant bounded away from 1. Note that the probability of success can be amplified by building and querying several instances of the data structure.

Formally, an R-near neighbor of q is a point p such that $\|p - q\|_2 \le R$. Note that we can scale down the coordinates of all points by R, in which case we need only to solve the c-NN problem for R = 1. Thus, we will consider that R = 1 for the rest of the paper.

2.3. Locality-Sensitive Hashing

To solve the c-approximate near neighbor problem, we use the locality-sensitive hashing scheme (LSH). Below we describe the general LSH scheme, as it was first proposed in [21]. In this paper we reuse the same LSH scheme, but we introduce a new family of locality-sensitive hash functions.

The LSH scheme relies on the existence of locality-sensitive hash functions. Consider a family H of hash functions mapping $\mathbb{R}^d$ to some universe U.

Definition 2.6 (Locality-sensitive hashing). A family H is called $(R, cR, p_1, p_2)$-sensitive if for any $p, q \in \mathbb{R}^d$

- if $\|p - q\| \le R$ then $\Pr_H[h(q) = h(p)] \ge p_1$,
- if $\|p - q\| \ge cR$ then $\Pr_H[h(q) = h(p)] \le p_2$.

The LSH functions can be used to solve the c-NN problem, as per the following theorem of [21]. Let $\rho = \frac{\log(1/p_1)}{\log(1/p_2)}$.

Fact 2.7. Given a family of $(1, c, p_1, p_2)$-sensitive hash functions for $\mathbb{R}^d$, where each function can be evaluated in time $\tau$, one can construct a data structure for c-NN with $O((d + \tau)\, n^{\rho} \log_{1/p_2} n)$ query time and space $O(dn + n^{1+\rho})$.

3. Main algorithm

Our new algorithm for c-NN uses a new family of LSH functions for $l_2$, while reusing the LSH scheme of section 2.3. This new family is presented below. Once we describe the new family of LSH functions, we prove that the query time is $O(n^{1/c^2+o(1)})$ by showing that $L = n^{\rho} = O(n^{1/c^2+o(1)})$, $k = O(\log n)$, and that $\tau = O(dn^{o(1)})$.

3.1. LSH Family for $l_2$

We first describe an "ideal" LSH family for $l_2$. Although this approach has some deficiencies, we show how to overcome them, and obtain a good family of LSH functions. The final description of the LSH family is presented in Figure 2.

Ideal LSH family. Construct a hash function $\tilde{h}$ as follows. Consider $G^d$, a regular infinite grid of balls in $\mathbb{R}^d$: each ball has radius w and has its center at $4w \cdot \mathbb{Z}^d$. Let $G^d_u$, for positive integer u, be the grid $G^d$ shifted uniformly at random; in other words, $G^d_u = G^d + s_u$, where $s_u \in [0, 4w]^d$. Now we choose as many $G^d_u$'s as are needed to cover the entire space $\mathbb{R}^d$ (i.e., until each point from $\mathbb{R}^d$ belongs to at least one of the balls). Suppose we need U such grids to cover the entire space with high probability.

We define $\tilde{h}$ on a point p as a tuple $(u, x_1, x_2, \ldots, x_d)$, $u \in [1, U]$ and $(x_1, \ldots, x_d) \in G^d_u$. The tuple $(u, x_1, x_2, \ldots, x_d)$ specifies the ball which contains the point p: $p \in B((x_1, x_2, \ldots, x_d), w)$. If there are several balls that contain p, then we take the one with the smallest value u. Computing $\tilde{h}(p)$ can be done in $\tau = O(U)$ time: we iterate through all $G^d_1, G^d_2, \ldots, G^d_U$, and find the first $G^d_u$ such that p is inside a ball with the center from $G^d_u$.

Intuitively, this family satisfies our locality-sensitive definition: the closer the points p, q are, the higher is the probability that p, q belong to the same ball. Indeed, if we choose a suitable radius $w \ge 1/2$, then we will get $L = n^{\rho} = O(n^{1/c^2+o(1)})$.

However, the deficiency of this family is that the time to compute $\tilde{h}(p)$ might be too large if $d = \Omega(\log n)$ since we need to set $U = \Omega(2^d)$ (see lemma 3.1). We show how to circumvent this deficiency next.

Actual LSH family. Our actual construction utilizes the "ideal" family described above, while introducing an additional step, necessary to reduce U, the number of grids covering the space. The algorithm is given in Figure 2.

To reduce U,
we project $\mathbb{R}^d$ to a lower-dimensional space $\mathbb{R}^t$ via a random dimensionality reduction. The parameter t is $o(\log n)$, such that factors exponential in t are $o(n)$. After performing the projection, we choose the grids $G^t_1, G^t_2, \ldots, G^t_U$ in the lower-dimensional space $\mathbb{R}^t$. Now, to compute $h(p)$, we compute the projection of p onto the lower dimensional space $\mathbb{R}^t$, and process the projected point as described earlier. In short, the actual hash function is $h(p) = \tilde{h}(Ap)$, where A is a random matrix representing the dimensionality reduction mapping, and $\tilde{h}$ works in the t-dimensional space. Note that $\tau$ becomes $\tau = O(dt) + O(Ut)$, corresponding to the projection and the bucket-computation stages respectively.

Initialization of a hash function $h \in H$
1. For u = 1 to U, choose a random shift $s_u \in [0, 4w]^t$, which specifies the grid $G^t_u = G^t + s_u$ in the t-dimensional Euclidean space.
2. Choose a matrix $A \in M_{t,d}$, where each element $A_{ij}$ is distributed according to the normal distribution $N(0, 1)$ times a scaling factor, $1/\sqrt{t}$. The matrix A represents a random projection from $\mathbb{R}^d$ to $\mathbb{R}^t$.

Computing $h(\cdot)$ on a point $p \in \mathbb{R}^d$
1. Let $p' = Ap$ be the projection of the point p onto the t-dimensional subspace given by A.
2. For each u = 1, 2, ..., U
3. Check whether $B(p', w) \cap G^t_u \ne \emptyset$, i.e., whether there exists some $(x_1, x_2, \ldots, x_t) \in G^t_u$ such that $p' \in B((x_1, x_2, \ldots, x_t), w)$.
4. Once we find such $(x_1, x_2, \ldots, x_t)$, set $h(p) = (u, x_1, x_2, \ldots, x_t)$, and stop.
5. Return $0^{t+1}$ if we do not find any such ball.

Figure 2. Algorithms for initializing a hash function h from the LSH hash family, and for computing $h(p)$ for a point $p \in \mathbb{R}^d$.

3.2. Analysis of the LSH family

We start by bounding the number of grids $G^d$ needed to cover the entire space $\mathbb{R}^d$, for any dimension d.

Lemma 3.1. Consider a d-dimensional space $\mathbb{R}^d$. Let $G^d$ be a regular infinite grid of balls of radius w placed at coordinates $\sigma w \cdot \mathbb{Z}^d$, where $2 \le \sigma \le d^{O(1)}$. Define $G^d_u$, for positive integer u, as $G^d_u = G^d + s_u$, where $s_u \in [0, \sigma w]^d$ is a random shift of the grid $G^d$. If $U_d = 2^{O(d \log d)} \log n$, then the grids $G^d_1, G^d_2, \ldots, G^d_{U_d}$ cover the entire space $\mathbb{R}^d$, w.h.p.

Proof. First, observe that the entire space is covered if and only if the hypercube $[0, \sigma w]^d$ is covered by grids $G^d_u$ (due to the regularity of the grids).

To prove that $[0, \sigma w]^d$ is covered, we partition the hypercube $[0, \sigma w]^d$ into smaller "micro-cubes" and prove that each of them is covered with a high enough probability. Specifically, we partition the hypercube $[0, \sigma w]^d$ into smaller micro-cubes, each of size $\frac{w}{\sqrt{d}} \times \frac{w}{\sqrt{d}} \times \cdots \times \frac{w}{\sqrt{d}}$. There are $N = \frac{(\sigma w)^d}{(w/\sqrt{d})^d} = (\sigma\sqrt{d})^d$ such micro-cubes in total. Let x be the probability that a micro-cube is covered by one grid $G^d_u$. Then $x \ge \frac{(w/\sqrt{d})^d}{(\sigma w)^d} = 1/N$ because, for a micro-cube to be covered, it suffices that the center of the ball $B(0^d + s_u, w)$ falls inside the micro-cube, which happens with probability $1/N$. Furthermore, if $x_U$ is the probability that a micro-cube is covered by any of the $U_d$ grids $G^d_u$, then $x_U \ge 1 - (1 - x)^{U_d}$.

Thus, we can bound the probability that there exists at least one uncovered micro-cube, which is also the probability that the hypercube $[0, \sigma w]^d$ is not fully covered. Set $U_d = aN(\log n + \log N)$ for a suitable constant a. Using the union bound, we obtain that the probability that the entire hypercube is not covered is at most

$$N(1 - x)^{U_d} \le N(1 - 1/N)^{U_d} \le N(1 - 1/N)^{aN(\log n + \log N)} \le N \cdot 2^{-\log n - \log N} \le 1/n.$$

Concluding: with probability at least $1 - 1/n$ we cover the entire space with the grids $G^d_1, \ldots, G^d_{U_d}$, if we choose $U_d = O(N(\log n + \log N)) = 2^{O(d \log d)} \log n$.
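The two procedures of Figure 2 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the class name `BallGridHash` and the arguments `num_grids` and `seed` are hypothetical, the number of grids is supplied by the caller rather than derived from Lemma 3.1, and the returned tuple holds the integer grid indices of the ball center instead of its real coordinates (the two identify the same ball). Because the centers of one grid are spaced 4w apart and each ball has radius w, only the grid center nearest to the projected point can contain it, so a coordinate-wise rounding is enough to locate the candidate ball.

```python
import math
import random


class BallGridHash:
    """Sketch of h(p) = h~(Ap): random projection, then first covering grid ball."""

    def __init__(self, d, t, w, num_grids, seed=0):
        rng = random.Random(seed)
        self.t = t
        self.w = w
        self.spacing = 4.0 * w  # ball centers of one grid lie on 4w * Z^t + s_u
        # Projection matrix A (t x d): N(0, 1) entries scaled by 1/sqrt(t).
        self.A = [[rng.gauss(0.0, 1.0) / math.sqrt(t) for _ in range(d)]
                  for _ in range(t)]
        # One uniformly random shift s_u in [0, 4w]^t per grid.
        self.shifts = [[rng.uniform(0.0, self.spacing) for _ in range(t)]
                       for _ in range(num_grids)]

    def project(self, p):
        """Return p' = A p."""
        return [sum(a * x for a, x in zip(row, p)) for row in self.A]

    def __call__(self, p):
        q = self.project(p)
        for u, shift in enumerate(self.shifts, start=1):
            # Index of the grid center closest to q in grid u.
            idx = [round((qi - si) / self.spacing) for qi, si in zip(q, shift)]
            center = [k * self.spacing + si for k, si in zip(idx, shift)]
            if math.dist(q, center) <= self.w:
                return (u, *idx)  # first grid whose ball contains q
        return (0,) * (self.t + 1)  # no ball found among the grids tried
```

For example, `h = BallGridHash(d=5, t=2, w=1.0, num_grids=200)` followed by `h([0.1, -0.3, 0.5, 0.2, 0.0])` returns a tuple of length t + 1. Two points whose projections are more than 2w apart can never share a ball, so they always receive different hash values.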
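The next lemma states the main technical result of this paper.

Lemma 3.2. Consider the hash function h described in Figure 2, and let p, q be some points in $\mathbb{R}^d$. Let $p_1$ be the probability that $h(p) = h(q)$ given that $\|p - q\| \le 1$, and let $p_2$ be the probability that $h(p) = h(q)$ given that $\|p - q\| \ge c$. Then, for $w = t^{1/4}$, we obtain

$$\rho = \frac{\log 1/p_1}{\log 1/p_2} = 1/c^2 + O\left(\frac{\log t}{t^{1/2}}\right).$$

Proof. The proof proceeds in three stages. Fix some points $p, q \in \mathbb{R}^d$ at distance $\Delta$. First we compute the probability of collision of p and q given that their distance after the projection is equal to some fixed value $\Delta'$. Next, for each of the cases when $\Delta \le 1$ and $\Delta \ge c$, we compute the collision probabilities ($p_1$ and $p_2$, resp.) by integrating over the range of the possible (distorted) distances $\Delta'$. Finally, given the bounds on $p_1$ and $p_2$, we compute the value of $\rho = \frac{\log 1/p_1}{\log 1/p_2}$.

Suppose points p and q are projected under the dimensionality reduction into points $p' = Ap$ and $q' = Aq$, $p', q' \in \mathbb{R}^t$, with $\Delta' = \|p' - q'\|$; the probability of collision of p and q can be deduced as follows. Consider the sequence of grids $G^t_1, G^t_2, \ldots, G^t_U$, and let $G^t_u$ be the first grid such that $p'$ or $q'$ are inside a ball $B(x, w)$ with center in $G^t_u$. Note that the position of this ball defines whether $h(p) = h(q)$ or not. In particular, if $p', q' \in B(x, w)$ then $h(p) = h(q)$ and, otherwise, if exactly one of $p', q'$ is inside $B(x, w)$ then $h(p) \ne h(q)$. Thus, we can conclude that the probability of collision of points p, q is

$$\Pr[h(p) = h(q) \mid \|p' - q'\| = \Delta'] = \Pr[p', q' \in B(x, w) \mid p' \in B(x, w) \lor q' \in B(x, w)]$$
$$= \frac{|B(p', w) \cap B(q', w)|}{|B(p', w) \cup B(q', w)|} = \frac{2C(\Delta'/2, w)}{2\mathrm{Vol}_t(w) - 2C(\Delta'/2, w)} = \frac{I(\Delta'/2, w)}{1 - I(\Delta'/2, w)} \qquad (1)$$

where $C(\Delta'/2, w)$ and $I(\Delta'/2, w)$ are respectively the cap volume and the relative cap volume, as defined in the preliminaries.

In the next step we bound $p_2$. This is done by integrating the collision probability over all possible values of $\Delta'$, i.e., over all distortions of $\|p - q\|$. As noted in fact 2.2, the distortion of $\|p - q\|^2$ is distributed with probability density $P_{\chi^2}$.

$$p_2 = \int_0^{\infty} \Pr\left[h(p) = h(q) \,\Big|\, \|p' - q'\| = \sqrt{\tfrac{x}{t}}\, c\right] P_{\chi^2}(x)\, dx$$
$$\le \int_{0 \le \sqrt{x/t}\, c \le 2w} P_{\chi^2}(x)\, \frac{I\left(\tfrac{1}{2}\sqrt{\tfrac{x}{t}}\, c, w\right)}{1 - I\left(\tfrac{1}{2}\sqrt{\tfrac{x}{t}}\, c, w\right)}\, dx$$
$$\le \int_0^{\infty} P_{\chi^2}(x) \cdot 2 I\left(\tfrac{1}{2}\sqrt{\tfrac{x}{t}}\, c, w\right) dx$$
$$\le 2 \int_0^{\infty} P_{\chi^2}(x) \left(1 - \left(\frac{\tfrac{1}{2}\sqrt{\tfrac{x}{t}}\, c}{w}\right)^2\right)^{t/2} dx$$
$$\le 2 \int_0^{\infty} P_{\chi^2}(x) \exp\left[-\frac{t}{2} \cdot \frac{x c^2}{4 w^2 t}\right] dx$$
$$= 2 \int_0^{\infty} P_{\chi^2}(x) \exp\left[-\frac{1}{2} \cdot \frac{x c^2}{4 w^2}\right] dx$$

where, for the third inequality, we used lemma 2.1. Setting $\epsilon = \frac{1}{4w^2}$, and replacing the expression for $P_{\chi^2}$, we obtain

$$p_2 \le 2 \int_0^{\infty} \frac{x^{t/2-1}}{\Gamma(t/2)\, 2^{t/2}} \exp\left[-\frac{x}{2}\right] \exp\left[-\frac{x \epsilon c^2}{2}\right] dx$$
$$= 2 \int_0^{\infty} \frac{(x(1 + \epsilon c^2))^{t/2-1}}{(1 + \epsilon c^2)^{t/2-1}\, \Gamma(t/2)\, 2^{t/2}} \exp\left[-\frac{x(1 + \epsilon c^2)}{2}\right] dx$$
$$= \frac{2}{(1 + \epsilon c^2)^{t/2}} \int_0^{\infty} P_{\chi^2}(x(1 + \epsilon c^2))\, d(x(1 + \epsilon c^2))$$
$$= \frac{2}{(1 + \epsilon c^2)^{t/2}} \qquad (2)$$

We bound $p_1$ from below in a similar way:

$$p_1 \ge \int_{0 \le \sqrt{x/t} \le 2w} P_{\chi^2}(x)\, \frac{I\left(\tfrac{1}{2}\sqrt{\tfrac{x}{t}}, w\right)}{1 - I\left(\tfrac{1}{2}\sqrt{\tfrac{x}{t}}, w\right)}\, dx$$
$$\ge \int_0^{4w^2 t} P_{\chi^2}(x)\, I\left(\tfrac{1}{2}\sqrt{\tfrac{x}{t}}, w\right) dx$$
$$\ge \int_0^{t/\epsilon} P_{\chi^2}(x)\, \frac{A_l}{\sqrt{t}} \left(1 - \left(\frac{\tfrac{1}{2}\sqrt{\tfrac{x}{t}}}{w}\right)^2\right)^{t/2} dx$$
$$= \frac{A_l}{\sqrt{t}} \int_0^{t/\epsilon} P_{\chi^2}(x) \left(1 - \frac{\epsilon x}{t}\right)^{t/2} dx$$
$$= \frac{A_l}{\sqrt{t}} \int_0^{t/\epsilon} P_{\chi^2}(x) \left(\frac{1}{1 + \frac{\epsilon x/t}{1 - \epsilon x/t}}\right)^{t/2} dx$$
$$\ge \frac{A_l}{\sqrt{t}} \int_0^{t/\epsilon} P_{\chi^2}(x) \exp\left[-\frac{t}{2} \cdot \frac{\epsilon x/t}{1 - \epsilon x/t}\right] dx$$
$$\ge \frac{A_l}{\sqrt{t}} \int_0^{4t} P_{\chi^2}(x) \exp\left[-\frac{t}{2} \cdot \frac{\epsilon x}{t} (1 + 8\epsilon)\right] dx$$
$$\ge \frac{A_l}{\sqrt{t}} \left(\int_0^{\infty} P_{\chi^2}(x) \exp\left[-\frac{\epsilon x}{2} (1 + 8\epsilon)\right] dx - \int_{4t}^{\infty} P_{\chi^2}(x)\, dx\right) \qquad (3)$$

Note that the term $\int_{4t}^{\infty} P_{\chi^2}(x)\, dx$ represents the probability of expansion by more than a factor of 2, which is at most $\exp[-\Omega(t)]$ by fact 2.3. Furthermore, replacing the expression for $P_{\chi^2}$, we obtain

$$p_1 \ge \frac{A_l}{\sqrt{t}} \left(\int_0^{\infty} P_{\chi^2}(x) \exp\left[-\frac{\epsilon x}{2} (1 + 8\epsilon)\right] dx - e^{-\Omega(t)}\right)$$
$$= \frac{A_l}{\sqrt{t}} \left(\int_0^{\infty} \frac{x^{t/2-1}}{\Gamma(t/2)\, 2^{t/2}}\, e^{-x/2} \exp\left[-\frac{\epsilon x}{2} (1 + 8\epsilon)\right] dx - e^{-\Omega(t)}\right)$$
$$= \frac{A_l}{\sqrt{t}} \left(\int_0^{\infty} \frac{(x(1 + \epsilon(1 + 8\epsilon)))^{t/2-1}}{(1 + \epsilon(1 + 8\epsilon))^{t/2-1}\, \Gamma(t/2)\, 2^{t/2}} \exp\left[-\frac{x(1 + \epsilon(1 + 8\epsilon))}{2}\right] dx - e^{-\Omega(t)}\right)$$
$$= \frac{A_l}{\sqrt{t}} \left(\frac{1}{(1 + \epsilon(1 + 8\epsilon))^{t/2}} - e^{-\Omega(t)}\right) \qquad (4)$$

For $\epsilon = 1/4w^2 = o(1)$, we obtain $p_1 \ge \frac{A_l}{2\sqrt{t}} \cdot \frac{1}{(1 + \epsilon + 8\epsilon^2)^{t/2}}$.

Finally, we can bound $\rho$ as follows:

$$\rho = \frac{\log 1/p_1}{\log 1/p_2} \le \frac{\log\left[\frac{2\sqrt{t}}{A_l} (1 + \epsilon(1 + 8\epsilon))^{t/2}\right]}{\log\left[\frac{1}{2} (1 + \epsilon c^2)^{t/2}\right]}$$
$$= \frac{\log (1 + \epsilon(1 + 8\epsilon))^{t/2} + \log \frac{2\sqrt{t}}{A_l}}{\log (1 + \epsilon c^2)^{t/2} - \log 2}$$
$$= \frac{\log (1 + \epsilon(1 + 8\epsilon)) + \frac{2 \log (2\sqrt{t}/A_l)}{t}}{\log (1 + \epsilon c^2) - \frac{2 \log 2}{t}}$$
$$\le \frac{\log (1 + \epsilon(1 + 8\epsilon))}{\log (1 + \epsilon c^2)} \left(1 + \frac{2 \log (2\sqrt{t}/A_l)}{t \log (1 + \epsilon(1 + 8\epsilon))}\right) \left(1 + O\left(\frac{2 \log 2}{t \log (1 + \epsilon c^2)}\right)\right)$$
$$\le \frac{\epsilon(1 + 8\epsilon)}{\epsilon c^2 - (\epsilon c^2)^2/2} \left(1 + O\left(\frac{\log t}{t \log (1 + \epsilon)}\right)\right)$$
$$\le \frac{1}{c^2} (1 + 8\epsilon) \left(1 + O(\epsilon c^2/2)\right) \left(1 + O\left(\frac{\log t}{t \epsilon}\right)\right)$$
$$\le \frac{1}{c^2} \left(1 + O\left(\epsilon + \frac{w^2 \log t}{t}\right)\right)$$
$$\le \frac{1}{c^2} \left(1 + O\left(\frac{\log t}{t^{1/2}}\right)\right) \qquad (5)$$

for $w = t^{1/4}$, which also implies $\epsilon = O(t^{-1/2}) = o(1)$.

Theorem 3.3. There exists an algorithm solving the c-NN problem in $l_2^d$ that achieves $O(dn^{1/c^2+o(1)})$ query time and $O(dn^{1+1/c^2+o(1)})$ space and preprocessing.

Proof. The result follows by using the LSH family in Figure 2 with the general LSH scheme described in section 2.3. By lemma 3.2, for $t = \log^{2/3} n$, we have $\rho = 1/c^2 + O\left(\frac{\log\log n}{\log^{1/3} n}\right)$. Furthermore, k can be bounded as (using eqn. (2))

$$k = \frac{\log n}{\log 1/p_2} \le \frac{\log n}{\log\left[(1 + c^2/4w^2)^{t/2}/2\right]} = O\left(\frac{\log n}{t/w^2 - O(1)}\right) = O(\log n).$$

Finally, by lemma 3.1 for $\sigma = 4$,

$$\tau = O(dt) + O(Ut) = O(dt) + 2^{O(t \log t)} \log n = O(dt) + 2^{O(\log^{2/3} n \cdot \log\log n)} \log n = O(dn^{o(1)}).$$

The theorem follows.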
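To get a feel for how slowly the bound of Lemma 3.2 approaches $1/c^2$, one can evaluate the leading ratio $\log(1 + \epsilon(1 + 8\epsilon)) / \log(1 + \epsilon c^2)$ from the proof numerically, with $w = t^{1/4}$ and $\epsilon = 1/(4w^2)$. The sketch below makes two simplifying assumptions that are not in the paper: it drops the lower-order correction factors (which involve the unspecified constant $A_l$), and it reads the logarithm in $t = \log^{2/3} n$ as the natural logarithm. The function names are hypothetical.

```python
import math


def leading_exponent(c, t):
    """Leading term of the bound on rho for w = t**(1/4), eps = 1/(4 w^2).

    The lower-order factors that depend on the constant A_l are ignored.
    """
    w = t ** 0.25
    eps = 1.0 / (4.0 * w * w)
    return math.log1p(eps * (1.0 + 8.0 * eps)) / math.log1p(eps * c * c)


def theorem_dimension(n):
    """Projection dimension t = log^{2/3} n (natural logarithm assumed)."""
    return math.log(n) ** (2.0 / 3.0)


if __name__ == "__main__":
    c = 2.0
    for t in (16.0, 1e3, 1e6, 1e12):
        print(f"t = {t:g}: leading exponent = {leading_exponent(c, t):.6f}")
    t = theorem_dimension(10 ** 9)
    print(f"n = 1e9 gives t = {t:.2f}, "
          f"leading exponent = {leading_exponent(c, t):.4f}")
```

For c = 2 the values decrease towards 0.25 only as t becomes very large, while the choice $t = \log^{2/3} n$ stays small for any realistic n. This matches the remark in the introduction that the convergence of the exponent to $1/c^2$ is rather slow.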
4. Acknowledgments

The authors would like to thank Assaf Naor for stimulating discussions about the ball partitioning method. Also, they thank Ofer Amrani for supplying the bounded distance decoder code, and Alex Vardy and Erik Agrell for answering numerous questions about Leech Lattice decoders.

References

[1] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. Proceedings of the Symposium on Theory of Computing, 2006.
[2] O. Amrani and Y. Be'ery. Efficient bounded-distance decoding of the hexacode and associated decoders for the Leech lattice and the Golay code. IEEE Transactions on Communications, 44:534–537, May 1996.
[3] O. Amrani, Y. Be'ery, A. Vardy, F.-W. Sun, and H. C. A. van Tilborg. The Leech lattice and the Golay code: Bounded-distance decoding and multilevel constructions. IEEE Transactions on Information Theory, 40:1030–1043, July 1994.
[4] A. Andoni, M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. Nearest Neighbor Methods for Learning and Vision, Neural Processing Information Series, MIT Press, 2005.
[5] A. Andoni and P. Indyk. E2LSH: Exact Euclidean locality-sensitive hashing. Implementation available at http://web.mit.edu/andoni/www/LSH/index.html, 2004.
[6] A. Andoni, P. Indyk, and M. Pătraşcu. On the optimality of the dimensionality reduction method. Manuscript, 2006.
[7] S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching. Proceedings of the Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 573–582, 1994.
[8] A. Chakrabarti and O. Regev. An optimal randomised cell probe lower bound for approximate nearest neighbor searching. Proceedings of the Symposium on Foundations of Computer Science, 2004.
[9] M. Charikar, C. Chekuri, A. Goel, S. Guha, and S. Plotkin. Approximating a finite metric by a small number of tree metrics. Proceedings of the Symposium on Foundations of Computer Science, 1998.
[10] J. H. Conway and N. J. A. Sloane. Soft decoding techniques for codes and lattices, including the Golay code and the Leech lattice. IEEE Trans. Inf. Theor., 32(1):41–50, 1986.
[11] J. H. Conway and N. J. A. Sloane. Sphere Packings, Lattices, and Groups. Springer-Verlag, New York, 1993.
[12] M. Datar, N. Immorlica, P. Indyk, and V. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. Proceedings of the ACM Symposium on Computational Geometry, 2004.
[13] U. Feige and G. Schechtman. On the optimality of the random hyperplane rounding technique for MAX CUT. Random Struct. Algorithms, 20(3):403–440, 2002.
[14] G. Forney and A. Vardy. Generalized minimum distance decoding of Euclidean-space codes and lattices. IEEE Transactions on Information Theory, 42:1992–2026, November 1996.
[15] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), 1999.
[16] S. Har-Peled. A replacement for Voronoi diagrams of near linear size. Proceedings of the Symposium on Foundations of Computer Science, 2001.
[17] S. Har-Peled and S. Mazumdar. Coresets for k-means and k-medians and their applications. Proceedings of the Symposium on Theory of Computing, 2004.
[18] P. Indyk. High-dimensional computational geometry. Department of Computer Science, Stanford University, 2001.
[19] P. Indyk. Approximate algorithms for high-dimensional geometric problems. Invited talk at DIMACS Workshop on Computational Geometry. Available at http://theory.csail.mit.edu/~indyk/high.ps, 2002.
[20] P. Indyk. Better algorithms for high-dimensional proximity problems via asymmetric embeddings. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2003.
[21] P. Indyk and R. Motwani. Approximate nearest neighbor: towards removing the curse of dimensionality. Proceedings of the Symposium on Theory of Computing, 1998.
[22] D. Karger, R. Motwani, and M. Sudan. Approximate graph coloring by semidefinite programming. Proceedings of the 35th IEEE Symposium on Foundations of Computer Science, pages 2–13, 1994.
[23] J. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, 1997.
[24] R. Krauthgamer and J. R. Lee. Navigating nets: Simple algorithms for proximity search. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2004.
[25] E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. Proceedings of the Thirtieth ACM Symposium on Theory of Computing, pages 614–623, 1998.
[26] J. Leech. Notes on sphere packings. Canadian Journal of Mathematics, 1967.
[27] R. Motwani, A. Naor, and R. Panigrahy. Lower bounds on locality sensitive hashing. Proceedings of the ACM Symposium on Computational Geometry, 2006.
[28] R. Panigrahy. Entropy-based nearest neighbor algorithm in high dimensions. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2006.
[29] G. Pisier. The volume of convex bodies and Banach space geometry. Cambridge University Press, 1989.
[30] H. Samet. Foundations of Multidimensional and Metric Data Structures. Elsevier, 2006.
[31] G. Shakhnarovich, T. Darrell, and P. Indyk, editors. Nearest Neighbor Methods in Learning and Vision. Neural Processing Information Series, MIT Press, 2006.
[32] A. Vardy and Y. Be'ery. Maximum-likelihood decoding of the Leech lattice. IEEE Transactions on Information Theory, 39:1435–1444, July 1993.

A. Cap of the sphere

Proof. The result follows immediately from lemma 9 of [13], which gives bounds on the ratio of the surface area of the cap to that of the ball. Specifically, note that $I(u, r)$ has the following form:

$$I(u, r) = C(u, r) \cdot (\mathrm{Vol}_d(r))^{-1} = \frac{\int_u^r \frac{S_{d-1}}{d-1} (r^2 - y^2)^{\frac{d-1}{2}}\, dy}{\frac{S_d}{d}\, r^d} = \frac{d}{d-1} \cdot \frac{\int_u^r S_{d-1} (r^2 - y^2)^{\frac{d-1}{2}}\, dy}{S_d\, r^d}$$

The quantity $\frac{\int_u^r S_{d-1} (r^2 - y^2)^{\frac{d-1}{2}}\, dy}{S_d\, r^d}$ represents precisely the ratio of the surface area of the cap $C(u, r)$ (excluding the base) to the surface area of a ball of radius r in the (d+1)-dimensional space. This ratio is bounded [13] as

$$\frac{A_l}{\sqrt{d+1}} \left(1 - \left(\frac{u}{r}\right)^2\right)^{d/2} \le \frac{\int_u^r S_{d-1} (r^2 - y^2)^{\frac{d-1}{2}}\, dy}{S_d\, r^d} \le \frac{1}{2} \left(1 - \left(\frac{u}{r}\right)^2\right)^{d/2}$$

Thus, multiplying the above bounds by $\frac{d}{d-1}$, we obtain that

$$\frac{d}{d-1} \cdot \frac{A_l}{\sqrt{d+1}} \left(1 - \left(\frac{u}{r}\right)^2\right)^{d/2} \le I(u, r) \le \frac{d}{d-1} \cdot \frac{1}{2} \left(1 - \left(\frac{u}{r}\right)^2\right)^{d/2}$$

which implies the lemma.
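The upper bound of Lemma 2.1 is easy to sanity-check numerically. The sketch below (function names are hypothetical) computes the relative cap volume $I(u, r)$ by integrating the $(d-1)$-dimensional cross-sections $(r^2 - y^2)^{(d-1)/2}$ with a midpoint rule; the constants $S_{d-1}$ and $S_d$ cancel because the same integrand, taken over $[-r, r]$, gives the volume of the whole ball. The lower bound is not checked, since the constant $A_l$ is not specified in the text.

```python
def relative_cap_volume(u, r, d, steps=20000):
    """I(u, r): volume of the cap at distance u, divided by the ball volume."""

    def cross_section_integral(a, b):
        # Midpoint rule for the integral of (r^2 - y^2)^((d-1)/2) over [a, b].
        h = (b - a) / steps
        return h * sum(
            (r * r - (a + (i + 0.5) * h) ** 2) ** ((d - 1) / 2)
            for i in range(steps)
        )

    # The ball is symmetric in y, so its volume integral is twice that of [0, r].
    return cross_section_integral(u, r) / (2.0 * cross_section_integral(0.0, r))


def cap_upper_bound(u, r, d):
    """Upper bound of Lemma 2.1: (1 - (u/r)^2)^(d/2)."""
    return (1.0 - (u / r) ** 2) ** (d / 2)


if __name__ == "__main__":
    for d in (2, 5, 24):
        for u in (0.0, 0.3, 0.7):
            cap = relative_cap_volume(u, 1.0, d)
            print(f"d={d:2d} u={u:.1f}  I={cap:.6f}  "
                  f"bound={cap_upper_bound(u, 1.0, d):.6f}")
```

At u = 0 the cap is half of the ball, so the computed value is 1/2 in every dimension, and the gap to the upper bound shows that the bound is loose by a dimension-dependent factor, which is what the $A_l/\sqrt{d}$ term in the lower bound accounts for.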
B. Lattice-based LSH family

In this section we describe a practical variant of the LSH family based on lattices in Euclidean spaces. Although, theoretically, these families are not asymptotically better than the ones described earlier, they are likely to perform better in practice, due to much lower "big-Oh" constants.

We start by presenting the general lattice-based approach. Then, we give an algorithm based on a concrete 24-dimensional lattice, called Leech Lattice [26]. For the Leech lattice-based algorithm, we include the actual values of the resulting exponent $\rho$, the main indicator of the performance.

B.1. Lattices in arbitrary dimension

The algorithm in this section uses an arbitrary lattice in some t-dimensional space. An example of a t-dimensional lattice is the regular grid of points in $\mathbb{R}^t$, although, as mentioned in the introduction, it does not serve our purposes well. For a given lattice, we need an efficient lattice decoding function, to which we refer as LATTICEDECODE(x). The function LATTICEDECODE(x) takes as input a point $x \in \mathbb{R}^t$ and returns the lattice point that is the closest to x.

Given a specific lattice with a decoding function LATTICEDECODE(x), an LSH function is constructed as follows (formally presented in Figure 3). First, if $d > t$, we choose a random projection from d-dimensional space to t-dimensional space, which we represent as a matrix A of dimension $t \times d$. If $d \le t$, then, instead, we choose a random rotation in the t-dimensional space, which we also represent as a matrix A of dimension $t \times d$ (here, A is equal to the first d columns of a random orthonormal matrix of dimension $t \times t$). Finally, we choose a random translation in the t-dimensional space, which we represent as a vector T of dimension $t \times 1$. The values of A, T identify an LSH function.

For an LSH function h specified by values of A and T, we define $h(p)$ as being $h(p) = \mathrm{LATTICEDECODE}(A \cdot p + T)$. Or, in words, for $p \in \mathbb{R}^d$, we first project p into $\mathbb{R}^t$ using A (or rotate it if $d \le t$); then, we translate the projection using T; and, finally, we find the closest point in the lattice using LATTICEDECODE. The output of LATTICEDECODE gives the value of $h(p)$.
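The construction $h(p) = \mathrm{LATTICEDECODE}(A \cdot p + T)$ can be sketched as follows. Several things here are assumptions made only for illustration. The Leech Lattice decoder is replaced by a much simpler stand-in, the checkerboard lattice $D_t$ (integer vectors with an even coordinate sum), decoded with the standard round-and-fix-parity rule. Only the projection case $d > t$ is covered, the random-rotation case being omitted for brevity. The names `decode_dn` and `LatticeLSH` and the `scale` parameter, which sets the size of a lattice cell relative to the data, are hypothetical.

```python
import math
import random


def decode_dn(x):
    """Closest point of D_n (integer vectors with even coordinate sum) to x."""
    f = [round(v) for v in x]
    if sum(f) % 2 == 0:
        return tuple(f)
    # Odd parity: move the coordinate with the largest rounding error
    # to its second-nearest integer, which restores an even sum.
    k = max(range(len(x)), key=lambda i: abs(x[i] - f[i]))
    f[k] += 1 if x[k] > f[k] else -1
    return tuple(f)


class LatticeLSH:
    """Sketch of h(p) = LATTICEDECODE(A p + T) with D_t as a stand-in lattice."""

    def __init__(self, d, t, scale=1.0, seed=0):
        if d <= t:
            raise ValueError("this sketch covers only the projection case d > t")
        rng = random.Random(seed)
        self.scale = scale
        # Random projection: N(0, 1) entries times the scaling factor 1/sqrt(t).
        self.A = [[rng.gauss(0.0, 1.0) / math.sqrt(t) for _ in range(d)]
                  for _ in range(t)]
        # D_t contains 2 * Z^t, so a uniform shift over [0, 2)^t is a
        # uniformly random translation modulo the lattice.
        self.T = [rng.uniform(0.0, 2.0) for _ in range(t)]

    def __call__(self, p):
        x = [sum(a * v for a, v in zip(row, p)) / self.scale + s
             for row, s in zip(self.A, self.T)]
        return decode_dn(x)
```

For instance, `decode_dn([0.1, 0.2, 0.9])` first rounds to (0, 0, 1), whose coordinate sum is odd, and then moves the second coordinate (the one with the largest rounding error) to obtain (0, 1, 1). Replacing `decode_dn` with a Leech Lattice decoder and fixing t = 24 would give the variant discussed in Section B.2; no such decoder is sketched here.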
Initialization of a hash function $h \in \mathcal{H}$
1. If $d > t$, choose a random projection from $d$-dimensional space to $t$-dimensional space. The projection is represented by a matrix $A \in M_{t,d}$, where each element $A_{ij}$ is distributed according to the normal distribution $N(0,1)$ times a scaling factor, $\frac{1}{\sqrt{t}}$.
2. If $d \le t$, choose a random rotation in the $t$-dimensional space. The rotation is represented by the matrix $A$, which is equal to the first $d$ columns of a $t \times t$ orthonormal matrix.
3. Choose a random translation in the $t$-dimensional space. The translation is represented by a vector $T \in M_{t,1}$.

Computing $h(\cdot)$ on a point $p \in \mathbb{R}^d$
1. Let $x = A \cdot p + T$.
2. Return LATTICEDECODE($x$).

Figure 3. Algorithms for initializing an LSH function $h$ and for computing $h(p)$ for a point $p \in \mathbb{R}^d$.

The performance of the resulting LSH scheme depends heavily on the choice of the lattice. Intuitively, we would like a lattice that lives in $\mathbb{R}^t$ for high $t$, is “dense” (see footnote 4), and has a fast decoding function LATTICEDECODE. With a higher $t$, the dimensionality reduction is more accurate. A “denser” lattice gives a sharper difference in collision probabilities of close and far points.

Footnote 4: A measure of “density” is, for example, the density of the hypersphere packing induced by the lattice. The density of a hypersphere packing is the percent of the space that is covered by non-overlapping balls centered at lattice points.

B.2. Leech Lattice

In this section, we focus on a particular lattice in 24-dimensional space, the Leech Lattice [26]. We give numerical values for $\rho$ when we use the Leech Lattice in algorithm B.1 with a specific decoder described below.

The Leech Lattice has been studied extensively (see, e.g., [11,10,2,32,14]) and is known to be the lattice that gives the densest (lattice) hypersphere packing in 24 dimensions. Below, we denote the Leech Lattice by $\Lambda_{24}$ and call the corresponding decoding function LATTICEDECODE$_{\Lambda_{24}}$($x$).

Several efficient decoders for the Leech Lattice are known; the best of them [32] requires 3595 floating point operations to decode one point. However, even faster decoders are known (e.g., see [3,2,14]) for the bounded-distance decoding problem. A bounded-distance decoder guarantees to return the correct result only when the query point $x$ is sufficiently close to one of the lattice points; otherwise the decoder gives no guarantees. Note that a bounded-distance decoder yields an LSH function, albeit not necessarily as good as the perfect decoder.

We have investigated the bounded-distance decoder of [2], which we call LATTICEDECODE$^{B}_{\Lambda_{24}}$($x$). Their implementation uses at most 519 real operations per decoded point. For that decoder, we computed the values of the resulting collision probabilities (for the case $d > 24$). The results are depicted in Table 1. The probabilities are computed using Monte-Carlo simulation with $10^7$ trials. Specifically, in a trial, we generate a random point $p$ and some other point $q$, such that $p - q$ is drawn from a 24-dimensional Gaussian distribution, scaled by $\frac{1}{\sqrt{24}}$ times the radius. The points $p$ and $q$ collide iff LATTICEDECODE$^{B}_{\Lambda_{24}}$($p$) = LATTICEDECODE$^{B}_{\Lambda_{24}}$($q$). Table 1 summarizes the estimated probabilities of collision for different values of radii (the confidence intervals are computed at the 95% level). These probabilities yield values for $\rho$ that are summarized in Table 2. The table shows the maximum likelihood $\rho$ and the conservative $\rho$. The maximum likelihood $\rho$ is the ratio of the logarithms of the corresponding maximum likelihood values of $p_1$ and $p_2$ (from the middle column). The conservative $\rho$ is the same ratio computed from the lowest estimate of $p_1$ in its confidence interval and the highest estimate of $p_2$ in its confidence interval.

For the case when $d \le 24$, the collision probabilities are summarized in Table 3. The method for computing the probabilities is as before, except for the generation of the point $q$. In this case, the vector $q - p$ is a random vector of fixed length. The resulting values of $\rho$ are summarized in Table 4.

Radius   Est. collision prob.   Confidence interval
0.7      0.0853465              [0.0853409, 0.0853521]
0.8      0.0525858              [0.0525813, 0.0525903]
0.9      0.0311720              [0.0311685, 0.0311755]
1.0      0.0177896              [0.0177869, 0.0177923]
1.1      0.0097459              [0.0097439, 0.0097479]
1.2      0.0051508              [0.0051493, 0.0051523]
1.3      0.0026622              [0.0026611, 0.0026633]
1.4      0.0013332              [0.0013324, 0.0013340]
1.5      0.0006675              [0.0006670, 0.0006681]
1.6      0.0003269              [0.0003265, 0.0003273]
1.7      0.0001550              [0.0001547, 0.0001553]
1.8      0.0000771              [0.0000769, 0.0000773]
1.9      0.0000368              [0.0000366, 0.0000370]
2.0      0.0000156              [0.0000155, 0.0000157]

Table 1. Probabilities of collision of two points, for $d > 24$, under the hash function described in Figure B.1 with the bounded-distance Leech Lattice decoder. The values were obtained through Monte-Carlo simulation with $10^7$ trials. The confidence interval corresponds to the 95% level.

c     Max likelihood of ρ   Conservative ρ   Radius R
1.5   0.5563                0.5565           1.2
2.0   0.3641                0.3643           1.0

Table 2. The values of $\rho = \frac{\log p_1}{\log p_2}$ corresponding to the collision probabilities in Table 1 ($d > 24$). Probabilities $p_1$ and $p_2$ are the collision probabilities corresponding to radii $R$ and $cR$, respectively.

Radius   Est. collision prob.   Confidence interval
0.7      0.0744600              [0.0744548, 0.0744653]
0.8      0.0424745              [0.0424705, 0.0424786]
0.9      0.0223114              [0.0223084, 0.0223144]
1.0      0.0107606              [0.0107585, 0.0107627]
1.1      0.0046653              [0.0046639, 0.0046667]
1.2      0.0017847              [0.0017838, 0.0017856]
1.3      0.0005885              [0.0005880, 0.0005890]
1.4      0.0001602              [0.0001599, 0.0001605]
1.5      0.0000338              [0.0000337, 0.0000340]
1.6      0.0000073              [0.0000072, 0.0000074]
1.7      0.0000009              [0.0000008, 0.0000010]
1.8      0.0000000              [0.0000000, 0.0000001]

Table 3. Probabilities of collision of two points, for $d \le 24$, under the hash function described in Figure B.1 with the bounded-distance Leech Lattice decoder. The values were obtained through Monte-Carlo simulation with $10^7$ trials. The confidence interval corresponds to the 95% level.

c     Max likelihood of ρ   Conservative ρ   Radius R
1.5   0.4402                0.4405           1.0
2.0   0.2671                0.2674           0.8

Table 4. The values of $\rho = \frac{\log p_1}{\log p_2}$ corresponding to the collision probabilities in Table 3 ($d \le 24$). Probabilities $p_1$ and $p_2$ are the collision probabilities corresponding to radii $R$ and $cR$, respectively.
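As a worked check of how the exponent is derived from the collision probabilities, the following Python sketch recomputes the maximum likelihood and conservative values of $\rho = \log p_1 / \log p_2$ from a few rows of Table 1 (the case $d > 24$). The helper names `rho` and `rho_pair` are hypothetical, and because the tabulated probabilities are rounded, the results should agree with Table 2 only up to small rounding differences.

```python
import math

# Rows of Table 1 (d > 24): radius -> (estimate, interval low, interval high).
TABLE1 = {
    1.0: (0.0177896, 0.0177869, 0.0177923),
    1.2: (0.0051508, 0.0051493, 0.0051523),
    1.8: (0.0000771, 0.0000769, 0.0000773),
    2.0: (0.0000156, 0.0000155, 0.0000157),
}


def rho(p1, p2):
    """Exponent rho = log(p1) / log(p2) for collision probabilities p1 > p2."""
    return math.log(p1) / math.log(p2)


def rho_pair(table, radius, c):
    """Return (maximum likelihood rho, conservative rho) for radii R and c*R."""
    est1, low1, _ = table[radius]
    # Round c*R to one decimal so it matches the tabulated radius keys.
    est2, _, high2 = table[round(c * radius, 1)]
    # Conservative: lowest estimate of p1 against highest estimate of p2.
    return rho(est1, est2), rho(low1, high2)


if __name__ == "__main__":
    for c, radius in [(1.5, 1.2), (2.0, 1.0)]:
        ml, cons = rho_pair(TABLE1, radius, c)
        print(f"c={c}, R={radius}: max likelihood {ml:.4f}, conservative {cons:.4f}")
```

The conservative value is larger than the maximum likelihood one because lowering $p_1$ and raising $p_2$ both push the ratio of logarithms upward, which makes it the pessimistic estimate of the exponent.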