ucsdedu
Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093

Kaushik Sinha kaushiksinhawichitaedu
Department of Electrical Engineering and Computer Science, Wichita State University, 1845
JMLR Workshop and Conference Proceedings
Dasgupta Sinha

tree, which takes $O(n)$ space and answers queries with an approximation factor $c = 1 + \epsilon$ in $O((6/\epsilon)^d \log n)$ time (Arya et al., 1998).

In some of these results, an exponential dependence on dimension is evident, and indeed this is a familiar blot on the nearest neighbor landscape. One way to mitigate the curse of dimensionality is to consider situations in which data have low intrinsic dimension $d_o$, even if they happen to lie in $\mathbb{R}^d$ for $d \gg d_o$ or in a general metric space. A common assumption is that the data are drawn from a doubling measure of dimension $d_o$ (or equivalently, have expansion rate $2^{d_o}$); this is defined in Section 4.1 below. Under this condition, Karger and Ruhl (2002) have a scheme that gives exact answers to nearest neighbor queries in time $O(2^{3d_o} \log n)$, using a data structure of size $O(2^{3d_o} n)$. The more recent cover tree algorithm (Beygelzimer et al., 2006), which has been used quite widely, creates a data structure in space $O(n)$ and answers queries in time $O(2^{d_o} \log n)$.

There is also work that combines intrinsic dimension and approximate search. The navigating net (Krauthgamer and Lee, 2004), given data from a metric space of doubling dimension $d_o$, has size $O(2^{O(d_o)} n)$ and gives a $(1+\epsilon)$-approximate answer to queries in time $O(2^{O(d_o)} \log n + (1/\epsilon)^{O(d_o)})$; the crucial advantage here is that doubling dimension is a more general and robust notion than doubling measure.

Despite these and many other results, there are two significant deficiencies in the nearest neighbor literature that have motivated the present paper. First, existing analyses have succeeded at identifying, for a given data structure, highly specific families of data for which efficient exact NN search is possible (for instance, data from doubling measures) but have failed to provide a more general characterization. Second, there remains a class of nearest neighbor data structures that are popular and successful in practice, but that have not been analyzed thoroughly. These structures combine classical k-d tree partitioning with randomization and overlapping cells, and are the subject of this paper.

1.1. Three randomized tree structures for exact NN search

The k-d tree is a partition of $\mathbb{R}^d$ into hyper-rectangular cells, based on a set of data points (Bentley, 1975). The root of the tree is a single cell corresponding to the entire space. A coordinate direction is chosen, and the cell is split at the median of the data along this direction (Figure 1, left). The process is then recursed on the two newly created cells, and continues until all leaf cells contain at most some predetermined number $n_o$ of points. When there are $n$ data points, the depth of the tree is at most about $\log(n/n_o)$.

Given a k-d tree built from data points $S$, there are several ways to answer a nearest neighbor query $q$. The quickest and dirtiest of these is to move $q$ down the tree to its appropriate leaf cell, and then return the nearest neighbor in that cell. This defeatist search takes time just $O(n_o + \log(n/n_o))$, which is $O(\log n)$ for constant $n_o$. The problem is that $q$'s nearest neighbor may well lie in a different cell, for instance when the data happen to be concentrated near cell boundaries. Consequently, the failure probability of this scheme can be unacceptably high.

Over the years, some simple tricks have emerged, from various sources, for reducing the failure probability. These are nicely laid out by Liu et al. (2004), who show experimentally that the resulting algorithms are effective in practice.

Figure 3: Three types of split. The fractions refer to probability mass. $\alpha$ is some constant, while $\beta$ is chosen uniformly at random from $[1/4, 3/4]$.

unit sphere. But now, three split points are noted: the median $m(C)$ of the data along direction $U$, the $(1/2) - \alpha$ fractile value $l(C)$, and the $(1/2) + \alpha$ fractile value $r(C)$. Here $\alpha$ is a small constant, like 0.05 or 0.1. The idea is to simultaneously entertain a median split

$$\text{left} = \{x : x \cdot U < m(C)\}, \qquad \text{right} = \{x : x \cdot U \ge m(C)\}$$

and an overlapping split (with the middle $2\alpha$ fraction of the data falling on both sides)

$$\text{left} = \{x : x \cdot U < r(C)\}, \qquad \text{right} = \{x : x \cdot U \ge l(C)\}.$$

In the spill tree (Liu et al., 2004), each data point in $S$ is stored in multiple leaves, by following the overlapping splits. A query is then answered defeatist-style, by routing it to a single leaf using median splits.

Both the RP tree and the spill tree have query times of $O(n_o + \log(n/n_o))$, but the latter can be expected to have a lower failure probability, and we will see this in the bounds we obtain. On the other hand, the RP tree requires just linear space, while the size of the spill tree is $O(n^{1/(1 - \lg(1+2\alpha))})$. When $\alpha = 0.05$, for instance, the size is $O(n^{1.159})$.

In view of these tradeoffs, we consider a further variant, which we call the virtual spill tree. It stores each data point in a single leaf, following median splits, and hence has linear size. However, each query is routed to multiple leaves, using overlapping splits, and the return value is its nearest neighbor in the union of these leaves.

The various splits are summarized in Figure 3, and the three trees use them as follows:

                     Routing data        Routing queries
RP tree              Perturbed split     Perturbed split
Spill tree           Overlapping split   Median split
Virtual spill tree   Median split        Overlapping split

One small technicality: if, for instance, there are duplicates among the data points, it might not be possible to achieve a median split, or a split at a desired fractile. We will ignore these discretization problems.

2. A potential function for point configurations

To motivate the potential function, we start by considering what happens when there are just two data points and one query point.

2.1. How random projection affects the relative placement of three points

Consider any $q, x, y \in \mathbb{R}^d$, such that $x$ is closer to $q$ than is $y$; that is, $\|q - x\| \le \|q - y\|$. Now suppose that a random direction $U$ is chosen from the unit sphere $S^{d-1}$, and that the points are projected onto this direction. What is the probability that $y$ falls between $q$ and $x$ on this line? The following lemma answers this question exactly. An approximate solution, with a different proof method, was given earlier by Kleinberg (1997).

Lemma 1. Pick any $q, x, y \in \mathbb{R}^d$ with $\|q - x\| \le \|q - y\|$. Pick a random unit direction $U$. The probability, over $U$, that $y \cdot U$ falls (strictly) between $q \cdot U$ and $x \cdot U$ is

$$\frac{1}{\pi} \arcsin\left( \frac{\|q - x\|}{\|q - y\|} \sqrt{1 - \left( \frac{(q - x) \cdot (y - x)}{\|q - x\| \, \|y - x\|} \right)^2} \right).$$

Proof. We may assume $U$ is drawn from $N(0, I_d)$, the $d$-dimensional Gaussian with mean zero and unit covariance. This gives the right distribution if we scale $U$ to unit length, but we can skip this last step since it has no effect on the question at hand.

We can also assume, without loss of generality, that $q$ lies at the origin and that $x$ lies along the (positive) $x_1$-axis: that is, $q = 0$ and $x = \|x\| e_1$. It will then be helpful to split the direction $U$ into two pieces, its component $U_1$ in the $x_1$-direction, and the remaining $d-1$ coordinates $U_R$. Likewise, we will write $y = (y_1, y_R)$.

If $y_R = 0$ then $x$, $y$, and $q$ are collinear, and the projection of $y$ cannot possibly fall between those of $x$ and $q$. Henceforth assume $y_R \ne 0$.

Let $E$ denote the event of interest:

$$E \equiv y \cdot U \text{ falls between } q \cdot U \text{ (that is, } 0\text{) and } x \cdot U \text{ (that is, } \|x\| U_1\text{)}$$
$$\equiv y_R \cdot U_R \text{ falls between } -y_1 U_1 \text{ and } (\|x\| - y_1) U_1.$$

The interval of interest is either $(-y_1 |U_1|, (\|x\| - y_1)|U_1|)$, if $U_1 \ge 0$, or $(-(\|x\| - y_1)|U_1|, y_1 |U_1|)$, if $U_1 < 0$.

Now $y_R \cdot U_R$ is independent of $U_1$ and is distributed as $N(0, \|y_R\|^2)$, which is symmetric and thus assigns the same probability mass to the two intervals. Therefore

$$\Pr_U(E) = \Pr_{U_1} \Pr_{U_R}\left( -y_1 |U_1| < y_R \cdot U_R < (\|x\| - y_1)|U_1| \right).$$

Let $Z$ and $Z'$ be independent standard normals $N(0, 1)$. Since $U_1$ is distributed as $Z$ and $y_R \cdot U_R$ is distributed as $\|y_R\| Z'$,

$$\Pr_U(E) = \Pr\left( -y_1 |Z| < \|y_R\| Z' < (\|x\| - y_1)|Z| \right) = \Pr\left( \frac{Z'}{|Z|} \in \left( -\frac{y_1}{\|y_R\|}, \frac{\|x\| - y_1}{\|y_R\|} \right) \right).$$

In the tree data structures we analyze, most cells contain only a subset of the data $\{x_1, \ldots, x_n\}$. For a cell that contains $m$ of these points, the appropriate variant of $\Phi$ is

$$\Phi_m(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=2}^{m} \frac{\|q - x_{(1)}\|}{\|q - x_{(i)}\|}.$$

Corollary 4. Pick any points $q, x_1, \ldots, x_n$ and let $S$ denote any subset of the $x_i$ that includes $x_{(1)}$. If $q$ and the points in $S$ are projected to a direction $U$ chosen at random from the unit sphere, then for any $0 < \alpha < 1$, the probability (over $U$) that at least an $\alpha$ fraction of the projected $S$ falls between $q$ and $x_{(1)}$ is upper-bounded by $(1/2\alpha)\, \Phi_{|S|}(q, \{x_1, \ldots, x_n\})$.

Proof. Apply Theorem 3 to $S$, noting that the corresponding value of $\Phi$ is maximized when $S$ consists of the points closest to $q$; and then apply Markov's inequality.
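The two quantities above are easy to evaluate numerically. The following is a minimal sketch in plain Python, assuming points are given as lists of floats; the function names `between_probability` and `potential` are illustrative choices and do not come from the paper. The first function evaluates the closed form of Lemma 1, and the second evaluates $\Phi_m$ from the sorted distances to $q$.

```python
import math

def _sub(a, b):
    return [ai - bi for ai, bi in zip(a, b)]

def _dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def _norm(a):
    return math.sqrt(_dot(a, a))

def between_probability(q, x, y):
    """Closed form of Lemma 1: Pr over a random direction U that y.U falls
    strictly between q.U and x.U.

    Assumes ||q - x|| <= ||q - y||, and that x differs from both q and y.
    """
    qx, qy, yx = _sub(q, x), _sub(q, y), _sub(y, x)
    cos_at_x = _dot(qx, yx) / (_norm(qx) * _norm(yx))
    # Clamp against tiny negative values caused by floating-point rounding.
    sin_at_x = math.sqrt(max(0.0, 1.0 - cos_at_x ** 2))
    ratio = _norm(qx) / _norm(qy)
    return math.asin(min(1.0, ratio * sin_at_x)) / math.pi

def potential(q, points, m):
    """Phi_m(q, points) = (1/m) * sum_{i=2..m} ||q - x_(1)|| / ||q - x_(i)||,
    where x_(1), x_(2), ... are the points sorted by distance from q."""
    dists = sorted(_norm(_sub(q, p)) for p in points)
    return sum(dists[0] / d for d in dists[1:m]) / m

if __name__ == "__main__":
    print(between_probability([0.0, 0.0], [1.0, 0.0], [0.0, 2.0]))
    print(potential([0.0], [[1.0], [2.0], [4.0]], 3))
```

One way to sanity-check the closed form: the event in Lemma 1 occurs exactly when $(y - q) \cdot U$ and $(y - x) \cdot U$ have opposite signs, so the probability equals the angle at $y$ in the triangle $q, x, y$ divided by $\pi$.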
2.3. Extension to k nearest neighbors

If we are interested in finding the $k$ nearest neighbors, a suitable generalization of $\Phi_m$ is

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=k+1}^{m} \frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|}.$$

Theorem 5. Pick any points $q, x_1, \ldots, x_n$ and let $S$ denote a subset of the $x_i$ that includes $x_{(1)}, \ldots, x_{(k)}$. Suppose $q$ and the points in $S$ are projected to a random unit direction $U$. Then, for any $(k-1)/|S| < \alpha \le 1$, the probability (over $U$) that in the projection, there is some $1 \le j \le k$ for which $\ge \alpha m$ points lie between $x_{(j)}$ and $q$ (with $m = |S|$) is at most

$$\frac{k}{2(\alpha - (k-1)/|S|)} \, \Phi_{k,|S|}(q, \{x_1, \ldots, x_n\}).$$

This theorem, and many of the others that follow, are proved in the appendix.

3. Randomized partition trees

We'll now see that the failure probability of the random projection tree is proportional to $\Phi \ln(1/\Phi)$, while that of the two spill trees is proportional to $\Phi$. We start with the second result, since it is the more straightforward of the two.

3.1. Randomized spill trees

In a randomized spill tree, each cell is split along a direction chosen uniformly at random from the unit sphere. Two kinds of splits are simultaneously considered: (1) a split at the median (along the random direction), and (2) an overlapping split with one part containing the bottom $1/2 + \alpha$ fraction of the cell's points, and the other part containing the top $1/2 + \alpha$ fraction, where $0 < \alpha < 1/2$ (recall Figure 3).
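The two splits of a single cell can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the names `random_unit_direction` and `spill_splits` are hypothetical, points are assumed to be lists of floats, and fractile positions are rounded to integer ranks (the text sets such discretization issues aside).

```python
import math
import random

def random_unit_direction(d, rng):
    """Uniform direction on the unit sphere, obtained by normalizing a Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]

def spill_splits(points, alpha, rng):
    """Return (direction, median_split, overlapping_split) for one cell.

    Each split is a (left, right) pair of lists of point indices.
    The overlapping split puts the bottom (1/2 + alpha) fraction on the left
    and the top (1/2 + alpha) fraction on the right, so the middle 2*alpha
    fraction appears on both sides.
    """
    u = random_unit_direction(len(points[0]), rng)
    # Sort point indices by their projection onto the random direction.
    order = sorted(range(len(points)),
                   key=lambda i: sum(a * b for a, b in zip(points[i], u)))
    n = len(order)
    half = n // 2
    spill = round(alpha * n)  # assumption: fractiles rounded to whole ranks
    lo, hi = max(0, half - spill), min(n, half + spill)
    median_split = (order[:half], order[half:])
    overlapping_split = (order[:hi], order[lo:])
    return u, median_split, overlapping_split

if __name__ == "__main__":
    rng = random.Random(0)
    pts = [[rng.random() for _ in range(5)] for _ in range(100)]
    _, (ml, mr), (ol, orr) = spill_splits(pts, 0.1, rng)
    print(len(ml), len(mr), len(ol), len(orr))
```

In a spill tree, data would be routed with `overlapping_split` and queries with `median_split`; a virtual spill tree swaps the two roles.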
Randomized trees for NN search

3.3. Could coordinate directions be used?

The tree data structures we have studied make crucial use of random projection for splitting cells. It would not suffice to use coordinate directions, as in k-d trees.

To see this, consider a simple example. Let $q$, the query point, be the origin, and suppose the data points $x_1, \ldots, x_n \in \mathbb{R}^d$ are chosen as follows:

x_1 is the all-ones vector.

Each $x_i$, $i > 1$, is chosen by picking a coordinate at random, setting its value to $M$, and then setting all remaining coordinates to uniform-random numbers in the range $(0, 1)$. Here $M$ is some very large constant.

For large enough $M$, the nearest neighbor of $q$ is $x_1$. By letting $M$ grow further, we can let $\Phi(q, \{x_1, \ldots, x_n\})$ get arbitrarily close to zero, which means that the random projection methods will work well. However, any coordinate projection will create a disastrously large separation between $q$ and $x_1$: on average, a $(1 - 1/d)$ fraction of the data points will fall between them.

4. Bounding $\Phi$

The exact nearest neighbor schemes we analyze have error probabilities related to $\Phi$, which lies in the range $[0, 1]$. The worst case is when all points are equidistant, in which case $\Phi$ is exactly 1, but this is a pathological situation. Is it possible to bound $\Phi$ under simple assumptions on the data?

In this section we study two such assumptions. In each case, query points are arbitrary, but the data are assumed to have been drawn i.i.d. from an underlying distribution.

4.1. Data drawn from a doubling measure

Suppose the data points are drawn from a distribution $\mu$ on $\mathbb{R}^d$ which is a doubling measure: that is, there exist a constant $C > 0$ and a subset $X \subseteq \mathbb{R}^d$ such that

$$\mu(B(x, 2r)) \le C \cdot \mu(B(x, r)) \quad \text{for all } x \in X \text{ and all } r > 0.$$

Here $B(x, r)$ is the closed Euclidean ball of radius $r$ centered at $x$. To understand this condition, it is helpful to also look at an alternative formulation that is essentially equivalent: there exist a constant $d_o > 0$ and a subset $X \subseteq \mathbb{R}^d$ such that for all $x \in X$, all $r > 0$, and all $\alpha \ge 1$, we have $\mu(B(x, \alpha r)) \le \alpha^{d_o} \cdot \mu(B(x, r))$. In other words, the probability mass of a ball grows polynomially in the radius. Comparing this to the standard formula for the volume of a ball, we see that the degree of this polynomial, $d_o$ ($= \log_2 C$), can reasonably be thought of as the "dimension" of measure $\mu$.

Theorem 8. Suppose $\mu$ is continuous on $\mathbb{R}^d$ and is a doubling measure of dimension $d_o \ge 2$. Pick any $q \in X$ and draw $x_1, \ldots, x_n \sim \mu$. Pick any $0 < \delta < 1/2$. With probability $\ge 1 - 3\delta$
The overall distribution is thus a mixture $\mu = w_1 \mu_1 + \cdots + w_t \mu_t$ whose $j$th component is a Bernoulli product distribution $\mu_j = B(p_1^{(j)}) \times \cdots \times B(p_N^{(j)})$. Here $B(p)$ is a shorthand for the distribution on $\{0, 1\}$ with expected value $p$. It will simplify things to assume that $0 < p_i^{(j)} < 1/2$; this is not a huge assumption if, say, stop words have been removed.

For the purposes of bounding $\Phi$, we are interested in the distribution of $d_H(q, X)$, where $X$ is chosen from $\mu$ and $d_H$ denotes Hamming distance. This is a sum of small independent quantities, and it is customary to approximate such sums by a Poisson distribution. In the current context, however, this approximation is rather poor, and we instead use counting arguments to directly bound how rapidly the distribution grows. The results stand in stark contrast to those we obtained for doubling measures, and reveal this to be a substantially more difficult setting for nearest neighbor search. For a doubling measure, the probability mass of a ball $B(q, r)$ doubles whenever $r$ is multiplied by a constant. In our present setting, it doubles whenever $r$ is increased by an additive constant:

Theorem 10. Suppose that all $p_i^{(j)} \in (0, 1/2)$. Let $L_j = \sum_i p_i^{(j)}$ denote the expected number of words in a document from topic $j$, and let $L = \min(L_1, \ldots, L_t)$. Pick any query $q \in \{0, 1\}^N$, and draw $X \sim \mu$. For any $\ell \ge 0$,

$$\frac{\Pr(d_H(q, X) = \ell + 1)}{\Pr(d_H(q, X) = \ell)} \ge \frac{L - \ell/2}{\ell + 1}.$$

Now, fix a particular query $q \in \{0, 1\}^N$, and draw $x_1, \ldots, x_n$ from distribution $\mu$.

Lemma 11. There is an absolute constant $c_o$ for which the following holds. Pick any $0 < \delta < 1$ and any $k \ge 1$, and let $v$ denote the smallest integer for which $\Pr_X(d_H(q, X) \le v) \ge (8/n) \max(k, \ln 1/\delta)$. Then with probability at least $1 - 3\delta$ over the choice of $x_1, \ldots, x_n$, for any $m \le n$,

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) \le 4 \sqrt{\frac{v}{c_o L \log_2(n/m)}}.$$

The implication of this lemma is that for any of the three tree data structures, the failure probability at a single level is roughly $\sqrt{v/L}$. This means that the tree can only be grown to depth $O(\sqrt{L/v})$, and thus the query time is dominated by $n_o = n \cdot 2^{-O(\sqrt{L/v})}$. When $n$ is large, we expect $v$ to be small, and thus the query time improves over exhaustive search by a factor of roughly $2^{-\sqrt{L}}$.

Acknowledgments

We thank the National Science Foundation for support under grant IIS-1162581.

References

N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39:302-322, 2009.
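Appendix B. Proof of Theorem 7

Consider any internal node of the tree that contains $q$ as well as $m$ of the data points, including $x_{(1)}$. What is the probability that the split at that node separates $q$ from $x_{(1)}$? To analyze this, let $F$ denote the fraction of the $m$ points that fall between $q$ and $x_{(1)}$ along the randomly-chosen split direction. Since the split point is chosen at random from an interval of mass $1/2$, the probability that it separates $q$ from $x_{(1)}$ is at most $F/(1/2)$. Integrating out $F$, we get

$$\Pr(q \text{ is separated from } x_{(1)}) \le \int_0^1 \Pr(F = f) \frac{f}{1/2} \, df = 2 \int_0^1 \Pr(F > f) \, df \le 2 \int_0^1 \min\left(1, \frac{\Phi_m}{2f}\right) df$$
$$= 2 \int_0^{\Phi_m/2} df + 2 \int_{\Phi_m/2}^1 \frac{\Phi_m}{2f} \, df = \Phi_m \ln \frac{2e}{\Phi_m},$$

where the second inequality uses Corollary 4.

The lemma follows by taking a union bound over the path that conveys $q$ from root to leaf, in which the number of data points per level shrinks geometrically, by a factor of $3/4$ or better.

The same reasoning generalizes to $k$ nearest neighbors. This time, $F$ is defined to be the fraction of the $m$ points that lie between $q$ and the furthest of $x_{(1)}, \ldots, x_{(k)}$ along the random splitting direction. Then $q$ is separated from one of these neighbors only if the split point lies in an interval of mass $F$ on either side of $q$, an event that occurs with probability at most $2F/(1/2)$. Using Theorem 5,

$$\Pr(q \text{ is separated from some } x_{(j)},\ 1 \le j \le k) \le \int_0^1 \Pr(F = f) \frac{2f}{1/2} \, df = 4 \int_0^1 \Pr(F > f) \, df$$
$$\le 4 \int_0^1 \min\left(1, \frac{k \Phi_{k,m}}{2(f - (k-1)/m)}\right) df$$
$$\le 4 \int_0^{(k\Phi_{k,m}/2) + (k-1)/m} df + 4 \int_{(k\Phi_{k,m}/2) + (k-1)/m}^1 \frac{k \Phi_{k,m}}{2(f - (k-1)/m)} \, df$$
$$\le 2k\Phi_{k,m} \ln \frac{2e}{k\Phi_{k,m}} + \frac{4(k-1)}{m},$$

and as before, we sum this over a root-to-leaf path in the tree.

Appendix C. Proof of Theorem 8

C.1. The k = 1 case

We will consider a collection of balls $B_o, B_1, B_2, \ldots$ centered at $q$, with geometrically increasing radii $r_o, r_1, r_2, \ldots$, respectively. For $i \ge 1$, we will take $r_i = 2^i r_o$. Thus by the doubling condition, $\mu(B_i) \le C^i \mu(B_o)$, where $C = 2^{d_o} \ge 4$.

where the last inequality comes from (*). To lower-bound $2^\ell$, we again use (*) to get $C^\ell \ge m/(2n\mu(B_o))$, whereupon

$$2^\ell \ge \left(\frac{m}{2n\mu(B_o)}\right)^{1/\log_2 C} = \left(\frac{m}{2\ln(1/\delta)}\right)^{1/\log_2 C}$$

and we're done.

C.2. The k > 1 case

The only big change is in the definition of $r_o$; it is now the radius for which

$$\mu(B_o) = \frac{4}{n} \max\left(k, \ln \frac{1}{\delta}\right).$$

Thus, when $x_1, \ldots, x_n$ are drawn independently at random from $\mu$, the expected number of them that fall in $B_o$ is at least $4k$, and by a multiplicative Chernoff bound is at least $k$ with probability $\ge 1 - \delta$.

The balls $B_1, B_2, \ldots$ are defined as before, and once again, we can conclude that with probability $\ge 1 - 2\delta$, each $B_i$ contains at most $2nC^i\mu(B_o)$ of the data points.

Any point $x_{(i)} \notin B_o$ lies in some annulus $B_j \setminus B_{j-1}$, and its contribution to the summation in $\Phi_{k,m}$ is

$$\frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \le \frac{1}{2^{j-1}}.$$

The relationship (*) and the remainder of the argument are exactly as before.

Appendix D. A useful technical lemma

Lemma 12. Suppose that for some constants $A, B > 0$ and $d_o \ge 1$,

$$F(m) \le A \left(\frac{B}{m}\right)^{1/d_o}$$

for all $m \ge n_o$. Pick any $0 < \beta < 1$ and define $\ell = \log_{1/\beta}(n/n_o)$. Then:

$$\sum_{i=0}^{\ell} F(\beta^i n) \le \frac{A d_o}{1 - \beta} \left(\frac{B}{n_o}\right)^{1/d_o}$$

and, if $n_o \ge B(A/2)^{d_o}$,

$$\sum_{i=0}^{\ell} F(\beta^i n) \ln \frac{2e}{F(\beta^i n)} \le \frac{A d_o}{1 - \beta} \left(\frac{B}{n_o}\right)^{1/d_o} \left( \frac{1}{1 - \beta} \ln \frac{1}{\beta} + \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} \right).$$

To understand this distribution, we start with a general result about sums of Bernoulli random variables. Notice that the result is exactly correct in the situation where all $p_i = 1/2$.

Lemma 13. Suppose $Z_1, \ldots, Z_N$ are independent, where $Z_i \in \{0, 1\}$ is a Bernoulli random variable with mean $0 < a_i < 1$, and $a_1 \ge a_2 \ge \cdots \ge a_N$. Let $Z = Z_1 + \cdots + Z_N$. Then for any $\ell \ge 0$,

$$\frac{\Pr(Z = \ell + 1)}{\Pr(Z = \ell)} \ge \frac{1}{\ell + 1} \sum_{i=\ell+1}^{N} \frac{a_i}{1 - a_i}.$$

Proof. Define $r_i = a_i/(1 - a_i) \in (0, \infty)$; then $r_1 \ge r_2 \ge \cdots \ge r_N$. Now, for any $\ell \ge 0$,

$$\Pr(Z = \ell) = \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} a_{i_1} a_{i_2} \cdots a_{i_\ell} \prod_{j \notin \{i_1, \ldots, i_\ell\}} (1 - a_j)$$
$$= \prod_{i=1}^{N} (1 - a_i) \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} \frac{a_{i_1}}{1 - a_{i_1}} \frac{a_{i_2}}{1 - a_{i_2}} \cdots \frac{a_{i_\ell}}{1 - a_{i_\ell}}$$
$$= \prod_{i=1}^{N} (1 - a_i) \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} r_{i_1} r_{i_2} \cdots r_{i_\ell}$$

where the summations are over subsets $\{i_1, \ldots, i_\ell\}$ of $\ell$ distinct elements of $[N]$. In the final line, the product of the $(1 - a_i)$ does not depend upon $\ell$ and can be ignored. Let's focus on the summation; call it $S_\ell$. We would like to compare it to $S_{\ell+1}$.

$S_{\ell+1}$ is the sum of $\binom{N}{\ell+1}$ distinct terms, each the product of $\ell + 1$ $r_i$'s. These terms also appear in the quantity $S_\ell (r_1 + \cdots + r_N)$; in fact, each term of $S_{\ell+1}$ appears multiple times, $\ell + 1$ times to be precise. The remaining terms in $S_\ell (r_1 + \cdots + r_N)$ each contain $\ell - 1$ unique elements and one duplicated element. By accounting in this way, we get

$$S_\ell (r_1 + \cdots + r_N) = (\ell + 1) S_{\ell+1} + \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} r_{i_1} r_{i_2} \cdots r_{i_\ell} (r_{i_1} + \cdots + r_{i_\ell})$$
$$\le (\ell + 1) S_{\ell+1} + S_\ell (r_1 + \cdots + r_\ell)$$

since the $r_i$'s are arranged in decreasing order. Hence

$$\frac{\Pr(Z = \ell + 1)}{\Pr(Z = \ell)} = \frac{S_{\ell+1}}{S_\ell} \ge \frac{1}{\ell + 1} (r_{\ell+1} + \cdots + r_N),$$

as claimed.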
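The ratio bound of Lemma 13 can be checked numerically on small instances. The sketch below is illustrative and assumes nothing beyond the lemma statement; the helper names `poisson_binomial_pmf` and `lemma13_lower_bound` are hypothetical. It computes the exact distribution of $Z$ by dynamic programming and evaluates the right-hand side of the lemma.

```python
def poisson_binomial_pmf(a):
    """Exact pmf of Z = Z_1 + ... + Z_N for independent Bernoulli(a_i) variables.

    Returns a list pmf with pmf[z] = Pr(Z = z) for z = 0..N.
    """
    pmf = [1.0]
    for ai in a:
        nxt = [0.0] * (len(pmf) + 1)
        for z, p in enumerate(pmf):
            nxt[z] += p * (1.0 - ai)   # Z_i = 0
            nxt[z + 1] += p * ai       # Z_i = 1
        pmf = nxt
    return pmf

def lemma13_lower_bound(a, ell):
    """Right-hand side of Lemma 13: (1/(ell+1)) * sum_{i > ell} a_i / (1 - a_i),
    with the a_i arranged in decreasing order (1-indexed in the lemma)."""
    a_sorted = sorted(a, reverse=True)
    return sum(ai / (1.0 - ai) for ai in a_sorted[ell:]) / (ell + 1)

if __name__ == "__main__":
    a = [0.4, 0.3, 0.2, 0.1]
    pmf = poisson_binomial_pmf(a)
    for ell in range(len(a)):
        print(ell, pmf[ell + 1] / pmf[ell], lemma13_lower_bound(a, ell))
```

When every $a_i = 1/2$, $Z$ is binomial and both sides equal $(N - \ell)/(\ell + 1)$, which matches the remark that the result is exact in that case.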
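We now apply this result directly to the sum of Bernoulli variables $Z = d_H(q, X)$.

Lemma 14. Suppose that $p_1, \ldots, p_N \in (0, 1/2)$. Pick any query $q \in \{0, 1\}^N$, and draw $X$ from distribution $\mu = B(p_1) \times \cdots \times B(p_N)$. Then for any $\ell \ge 0$,

$$\frac{\Pr(d_H(q, X) = \ell + 1)}{\Pr(d_H(q, X) = \ell)} \ge \frac{L - \ell/2}{\ell + 1},$$

where $L = \sum_i p_i$ is the expected number of words in $X$.

Suppose that for some $i > k$, point $x_{(i)}$ is at Hamming distance $\ell$ from $q$, that is, $x_{(i)} \in S_\ell$. Then

$$\frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \le \sqrt{\frac{v}{\ell}}$$

since Euclidean distance is the square root of Hamming distance.

In bounding $\Phi_{k,m}$, we need to gauge the range of Hamming distances spanned by $x_{(k+1)}, \ldots, x_{(m)}$. The geometric growth rate of part (c) implies that most points lie at Hamming distance $c_o L$ or greater from $q$. It also means that $d_H(q, x_{(m)}) \ge c_o L \log_2(n/m)$. Thus,

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i > k} \frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|}$$
$$\le \frac{1}{m} \sum_{\ell \ge v} \left| S_\ell \cap \{x_{(1)}, \ldots, x_{(m)}\} \right| \sqrt{\frac{v}{\ell}} \le 4 \sqrt{\frac{v}{c_o L \log_2(n/m)}}$$

where the last step follows by lower-bounding $|S_\ell|$ by an increasing geometric series.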
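As a small illustration of the quantities in this argument, the sketch below computes Hamming and Euclidean distances between binary vectors and evaluates $\Phi_{k,m}$ directly from its definition. It is a minimal sketch under the assumption that documents are lists of 0/1 integers; the names `hamming`, `euclidean`, and `potential_k` are hypothetical and not taken from the paper.

```python
import math

def hamming(a, b):
    """Number of coordinates in which two binary vectors differ."""
    return sum(1 for ai, bi in zip(a, b) if ai != bi)

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def potential_k(q, points, k, m):
    """Phi_{k,m}: (1/m) * sum_{i=k+1..m} avg_k / ||q - x_(i)||, where avg_k is
    the mean distance from q to its k nearest points and x_(i) is the i-th
    nearest point to q."""
    dists = sorted(euclidean(q, p) for p in points)
    avg_k = sum(dists[:k]) / k
    return sum(avg_k / d for d in dists[k:m]) / m

if __name__ == "__main__":
    q = [0, 0, 0, 0]
    docs = [[1, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1]]
    for x in docs:
        # For 0/1 vectors, squared Euclidean distance equals Hamming distance.
        print(hamming(q, x), euclidean(q, x) ** 2)
    print(potential_k(q, docs, 1, 4))
```

For 0/1 vectors each differing coordinate contributes exactly 1 to the squared Euclidean distance, which is why the ratio of distances above can be bounded by $\sqrt{v/\ell}$.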