JMLR: Workshop and Conference Proceedings vol

Randomized partition trees for exact nearest neighbor search

Sanjoy Dasgupta (dasgupta@cs.ucsd.edu)
Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093

Kaushik Sinha (kaushik.sinha@wichita.edu)
Department of Electrical Engineering and Computer Science, Wichita State University, 1845


[...] tree, which takes $O(n)$ space and answers queries with an approximation factor $c = 1 + \epsilon$ in $O((6/\epsilon)^d \log n)$ time (Arya et al., 1998).

In some of these results, an exponential dependence on dimension is evident, and indeed this is a familiar blot on the nearest neighbor landscape. One way to mitigate the curse of dimensionality is to consider situations in which data have low intrinsic dimension $d_o$, even if they happen to lie in $\mathbb{R}^d$ for $d \gg d_o$ or in a general metric space.

A common assumption is that the data are drawn from a doubling measure of dimension $d_o$ (or equivalently, have expansion rate $2^{d_o}$); this is defined in Section 4.1 below. Under this condition, Karger and Ruhl (2002) have a scheme that gives exact answers to nearest neighbor queries in time $O(2^{3 d_o} \log n)$, using a data structure of size $O(2^{3 d_o} n)$. The more recent cover tree algorithm (Beygelzimer et al., 2006), which has been used quite widely, creates a data structure in space $O(n)$ and answers queries in time $O(2^{d_o} \log n)$.

There is also work that combines intrinsic dimension and approximate search. The navigating net (Krauthgamer and Lee, 2004), given data from a metric space of doubling dimension $d_o$, has size $O(2^{O(d_o)} n)$ and gives a $(1+\epsilon)$-approximate answer to queries in time $O(2^{O(d_o)} \log n + (1/\epsilon)^{O(d_o)})$; the crucial advantage here is that doubling dimension is a more general and robust notion than doubling measure.

Despite these and many other results, there are two significant deficiencies in the nearest neighbor literature that have motivated the present paper. First, existing analyses have succeeded at identifying, for a given data structure, highly specific families of data for which efficient exact NN search is possible (for instance, data from doubling measures) but have failed to provide a more general characterization. Second, there remains a class of nearest neighbor data structures that are popular and successful in practice, but that have not been analyzed thoroughly. These structures combine classical k-d tree partitioning with randomization and overlapping cells, and are the subject of this paper.

1.1. Three randomized tree structures for exact NN search

The k-d tree is a partition of $\mathbb{R}^d$ into hyper-rectangular cells, based on a set of data points (Bentley, 1975). The root of the tree is a single cell corresponding to the entire space. A coordinate direction is chosen, and the cell is split at the median of the data along this direction (Figure 1, left). The process is then recursed on the two newly created cells, and continues until all leaf cells contain at most some predetermined number $n_o$ of points. When there are $n$ data points, the depth of the tree is at most about $\log(n/n_o)$.

Given a k-d tree built from data points $S$, there are several ways to answer a nearest neighbor query $q$. The quickest and dirtiest of these is to move $q$ down the tree to its appropriate leaf cell, and then return the nearest neighbor in that cell. This defeatist search takes time just $O(n_o + \log(n/n_o))$, which is $O(\log n)$ for constant $n_o$. The problem is that $q$'s nearest neighbor may well lie in a different cell, for instance when the data happen to be concentrated near cell boundaries. Consequently, the failure probability of this scheme can be unacceptably high.

Over the years, some simple tricks have emerged, from various sources, for reducing the failure probability. These are nicely laid out by Liu et al. (2004), who show experimentally that the resulting algorithms are effective in practice.

Figure 3: Three types of split. The fractions refer to probability mass.
$\alpha$ is some constant, while $\beta$ is chosen uniformly at random from $[1/4, 3/4]$.

[...] unit sphere. But now, three split points are noted: the median $m(C)$ of the data along direction $U$, the $(1/2) - \alpha$ fractile value $l(C)$, and the $(1/2) + \alpha$ fractile value $r(C)$. Here $\alpha$ is a small constant, like $0.05$ or $0.1$. The idea is to simultaneously entertain a median split

$$\text{left} = \{x : x \cdot U < m(C)\}, \qquad \text{right} = \{x : x \cdot U \geq m(C)\}$$

and an overlapping split (with the middle $2\alpha$ fraction of the data falling on both sides)

$$\text{left} = \{x : x \cdot U < r(C)\}, \qquad \text{right} = \{x : x \cdot U \geq l(C)\}.$$

In the spill tree (Liu et al., 2004), each data point in $S$ is stored in multiple leaves, by following the overlapping splits. A query is then answered defeatist-style, by routing it to a single leaf using median splits.

Both the RP tree and the spill tree have query times of $O(n_o + \log(n/n_o))$, but the latter can be expected to have a lower failure probability, and we will see this in the bounds we obtain. On the other hand, the RP tree requires just linear space, while the size of the spill tree is $O(n^{1/(1 - \lg(1 + 2\alpha))})$. When $\alpha = 0.05$, for instance, the size is $O(n^{1.159})$.

In view of these tradeoffs, we consider a further variant, which we call the virtual spill tree. It stores each data point in a single leaf, following median splits, and hence has linear size. However, each query is routed to multiple leaves, using overlapping splits, and the return value is its nearest neighbor in the union of these leaves.

The various splits are summarized in Figure 3, and the three trees use them as follows:

                      Routing data         Routing queries
RP tree               Perturbed split      Perturbed split
Spill tree            Overlapping split    Median split
Virtual spill tree    Median split         Overlapping split

One small technicality: if, for instance, there are duplicates among the data points, it might not be possible to achieve a median split, or a split at a desired fractile. We will ignore these discretization problems.

2. A potential function for point configurations

To motivate the potential function, we start by considering what happens when there are just two data points and one query point.

2.1. How random projection affects the relative placement of three points

Consider any $q, x, y \in \mathbb{R}^d$, such that $x$ is closer to $q$ than is $y$; that is, $\|q - x\| \leq \|q - y\|$. Now suppose that a random direction $U$ is chosen from the unit sphere $S^{d-1}$, and that the points are projected onto this direction. What is the probability that $y$ falls between $q$ and $x$ on this line? The following lemma answers this question exactly. An approximate solution, with different proof method, was given earlier by Kleinberg (1997).

Lemma 1 Pick any $q, x, y \in \mathbb{R}^d$ with $\|q - x\| \leq \|q - y\|$. Pick a random unit direction $U$. The probability, over $U$, that $y \cdot U$ falls (strictly) between $q \cdot U$ and $x \cdot U$ is

$$\frac{1}{\pi} \arcsin\left( \frac{\|q - x\|}{\|q - y\|} \sqrt{1 - \left( \frac{(q - x) \cdot (y - x)}{\|q - x\| \, \|y - x\|} \right)^2} \right).$$

Proof We may assume $U$ is drawn from $N(0, I_d)$, the $d$-dimensional Gaussian with mean zero and unit covariance. This gives the right distribution if we scale $U$ to unit length, but we can skip this last step since it has no effect on the question at hand.

We can also assume, without loss of generality, that $q$ lies at the origin and that $x$ lies along the (positive) $x_1$-axis: that is, $q = 0$ and $x = \|x\| e_1$. It will then be helpful to split the direction $U$ into two pieces, its component $U_1$ in the $x_1$-direction, and the remaining $d - 1$ coordinates $U_R$. Likewise, we will write $y = (y_1, y_R)$.

If $y_R = 0$ then $x$, $y$, and $q$ are collinear, and the projection of $y$ cannot possibly fall between those of $x$ and $q$. Henceforth assume $y_R \neq 0$.

Let $E$ denote the event of interest:

$$E \equiv y \cdot U \text{ falls between } q \cdot U \text{ (that is, } 0\text{) and } x \cdot U \text{ (that is, } \|x\| U_1\text{)}$$
$$\equiv y_R \cdot U_R \text{ falls between } -y_1 U_1 \text{ and } (\|x\| - y_1) U_1.$$

The interval of interest is either $(-y_1 |U_1|, (\|x\| - y_1) |U_1|)$, if $U_1 \geq 0$, or $(-(\|x\| - y_1) |U_1|, y_1 |U_1|)$, if $U_1 < 0$. Now $y_R \cdot U_R$ is independent of $U_1$ and is distributed as $N(0, \|y_R\|^2)$, which is symmetric and thus assigns the same probability mass to the two intervals. Therefore

$$\Pr_U(E) = \Pr_{U_1} \Pr_{U_R}\left( -y_1 |U_1| < y_R \cdot U_R < (\|x\| - y_1) |U_1| \right).$$

Let $Z$ and $Z'$ be independent standard normals $N(0, 1)$. Since $U_1$ is distributed as $Z$ and $y_R \cdot U_R$ is distributed as $\|y_R\| Z'$,

$$\Pr_U(E) = \Pr\left( -y_1 |Z| < \|y_R\| Z' < (\|x\| - y_1) |Z| \right) = \Pr\left( \frac{Z'}{|Z|} \in \left( -\frac{y_1}{\|y_R\|}, \frac{\|x\| - y_1}{\|y_R\|} \right) \right).$$

[...]

In the tree data structures we analyze, most cells contain only a subset of the data $\{x_1, \ldots, x_n\}$. For a cell that contains $m$ of these points, the appropriate variant of $\Phi$ is

$$\Phi_m(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=2}^{m} \frac{\|q - x_{(1)}\|}{\|q - x_{(i)}\|}.$$

Corollary 4 Pick any points $q, x_1, \ldots, x_n$ and let $S$ denote any subset of the $x_i$ that includes $x_{(1)}$. If $q$ and the points in $S$ are projected to a direction $U$ chosen at random from the unit sphere, then for any $0 < \alpha < 1$, the probability (over $U$) that at least an $\alpha$ fraction of the projected $S$ falls between $q$ and $x_{(1)}$ is upper-bounded by $(1/2\alpha) \, \Phi_{|S|}(q, \{x_1, \ldots, x_n\})$.

Proof Apply Theorem 3 to $S$, noting that the corresponding value of $\Phi$ is maximized when $S$ consists of the points closest to $q$; and then apply Markov's inequality.

2.3. Extension to k nearest neighbors

If we are interested in finding the $k$ nearest neighbors, a suitable generalization of $\Phi_m$ is

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=k+1}^{m} \frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|}.$$

Theorem 5 Pick any points $q, x_1, \ldots, x_n$ and let $S$ denote a subset of the $x_i$ that includes $x_{(1)}, \ldots, x_{(k)}$. Suppose $q$ and the points in $S$ are projected to a random unit direction $U$. Then, for any $(k-1)/|S| < \alpha \leq 1$, the probability (over $U$) that in the projection, there is some $1 \leq j \leq k$ for which $\geq \alpha m$ points lie between $x_{(j)}$ and $q$ is at most

$$\frac{k}{2(\alpha - (k-1)/|S|)} \, \Phi_{k,|S|}(q, \{x_1, \ldots, x_n\}).$$

This theorem, and many of the others that follow, are proved in the appendix.

3. Randomized partition trees

We'll now see that the failure probability of the random projection tree is proportional to $\Phi \ln(1/\Phi)$, while that of the two spill trees is proportional to $\Phi$. We start with the second result, since it is the more straightforward of the two.

3.1. Randomized spill trees

In a randomized spill tree, each cell is split along a direction chosen uniformly at random from the unit sphere. Two kinds of splits are simultaneously considered: (1) a split at the median (along the random direction), and (2) an overlapping split with one part containing the bottom $1/2 + \alpha$ fraction of the cell's points, and the other part containing the top $1/2 + \alpha$ fraction, where $0 < \alpha < 1/2$ (recall Figure 3).
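The routing rules of Section 1.1 and the two splits just described can be sketched in a few lines. The following Python is a minimal illustration, not the authors' implementation: the function names (`build`, `candidates`, `nearest`), the dictionary-based node layout, and the index convention inside `fractile` are all assumptions made for this example. It builds a tree by median splits along random directions, then answers a query either defeatist-style (median routing) or as a virtual spill tree (overlapping routing).

```python
import math
import random

def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere via a normalized Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fractile(sorted_vals, frac):
    """Value at the given fractile of a sorted list (assumed index convention)."""
    idx = min(len(sorted_vals) - 1, max(0, int(frac * len(sorted_vals))))
    return sorted_vals[idx]

def build(points, n_o, alpha, rng):
    """Route data by median splits; store l(C) and r(C) for query routing."""
    if len(points) <= n_o:
        return {"leaf": points}
    u = random_unit_vector(len(points[0]), rng)
    proj = sorted(dot(p, u) for p in points)
    m = fractile(proj, 0.5)
    left = [p for p in points if dot(p, u) < m]
    right = [p for p in points if dot(p, u) >= m]
    if not left or not right:  # duplicates can defeat the split; stop here
        return {"leaf": points}
    return {"u": u, "m": m,
            "l": fractile(proj, 0.5 - alpha),
            "r": fractile(proj, 0.5 + alpha),
            "left": build(left, n_o, alpha, rng),
            "right": build(right, n_o, alpha, rng)}

def candidates(node, q, overlapping):
    """Collect the leaf points reached by q under median or overlapping routing."""
    if "leaf" in node:
        return list(node["leaf"])
    t = dot(q, node["u"])
    if overlapping:
        go_left, go_right = t < node["r"], t >= node["l"]
    else:
        go_left = t < node["m"]
        go_right = not go_left
    out = []
    if go_left:
        out += candidates(node["left"], q, overlapping)
    if go_right:
        out += candidates(node["right"], q, overlapping)
    return out

def nearest(cands, q):
    return min(cands, key=lambda p: math.dist(p, q))

rng = random.Random(0)
data = [tuple(rng.gauss(0, 1) for _ in range(5)) for _ in range(2000)]
tree = build(data, n_o=20, alpha=0.1, rng=rng)
q = tuple(rng.gauss(0, 1) for _ in range(5))
defeatist = candidates(tree, q, overlapping=False)  # single leaf
virtual = candidates(tree, q, overlapping=True)     # union of several leaves
print(len(defeatist), len(virtual))
```

Because $l(C) \leq m(C) \leq r(C)$, overlapping routing always visits the leaf that median routing would reach, so in this sketch the virtual spill answer is never farther from $q$ than the defeatist one.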
3.3. Could coordinate directions be used?

The tree data structures we have studied make crucial use of random projection for splitting cells. It would not suffice to use coordinate directions, as in k-d trees.

To see this, consider a simple example. Let $q$, the query point, be the origin, and suppose the data points $x_1, \ldots, x_n \in \mathbb{R}^d$ are chosen as follows: $x_1$ is the all-ones vector. Each $x_i$, $i > 1$, is chosen by picking a coordinate at random, setting its value to $M$, and then setting all remaining coordinates to uniform-random numbers in the range $(0, 1)$. Here $M$ is some very large constant.

For large enough $M$, the nearest neighbor of $q$ is $x_1$. By letting $M$ grow further, we can let $\Phi(q, \{x_1, \ldots, x_n\})$ get arbitrarily close to zero, which means that the random projection methods will work well. However, any coordinate projection will create a disastrously large separation between $q$ and $x_1$: on average, a $(1 - 1/d)$ fraction of the data points will fall between them.

4. Bounding $\Phi$

The exact nearest neighbor schemes we analyze have error probabilities related to $\Phi$, which lies in the range $[0, 1]$. The worst case is when all points are equidistant, in which case $\Phi$ is exactly 1, but this is a pathological situation. Is it possible to bound $\Phi$ under simple assumptions on the data?

In this section we study two such assumptions. In each case, query points are arbitrary, but the data are assumed to have been drawn i.i.d. from an underlying distribution.

4.1. Data drawn from a doubling measure

Suppose the data points are drawn from a distribution $\mu$ on $\mathbb{R}^d$ which is a doubling measure: that is, there exist a constant $C > 0$ and a subset $X \subseteq \mathbb{R}^d$ such that

$$\mu(B(x, 2r)) \leq C \cdot \mu(B(x, r)) \quad \text{for all } x \in X \text{ and all } r > 0.$$

Here $B(x, r)$ is the closed Euclidean ball of radius $r$ centered at $x$. To understand this condition, it is helpful to also look at an alternative formulation that is essentially equivalent: there exist a constant $d_o > 0$ and a subset $X \subseteq \mathbb{R}^d$ such that for all $x \in X$, all $r > 0$, and all $\alpha \geq 1$, we have $\mu(B(x, \alpha r)) \leq \alpha^{d_o} \cdot \mu(B(x, r))$. In other words, the probability mass of a ball grows polynomially in the radius. Comparing this to the standard formula for the volume of a ball, we see that the degree of this polynomial, $d_o$ ($= \log_2 C$), can reasonably be thought of as the "dimension" of measure $\mu$.

Theorem 8 Suppose $\mu$ is continuous on $\mathbb{R}^d$ and is a doubling measure of dimension $d_o \geq 2$. Pick any $q \in X$ and draw $x_1, \ldots, x_n \sim \mu$. Pick any $0 < \delta < 1/2$. With probability $\geq 1 - 3\delta$ [...]

The overall distribution is thus a mixture $\mu = w_1 \mu_1 + \cdots + w_t \mu_t$ whose $j$th component is a Bernoulli product distribution $\mu_j = B(p^{(j)}_1) \times \cdots \times B(p^{(j)}_N)$. Here $B(p)$ is a shorthand for the distribution on $\{0, 1\}$ with expected value $p$. It will simplify things to assume that $0 < p^{(j)}_i < 1/2$; this is not a huge assumption if, say, stopwords have been removed.

For the purposes of bounding $\Phi$, we are interested in the distribution of $d_H(q, X)$, where $X$ is chosen from $\mu$ and $d_H$ denotes Hamming distance. This is a sum of small independent quantities, and it is customary to approximate such sums by a Poisson distribution. In the current context, however, this approximation is rather poor, and we instead use counting arguments to directly bound how rapidly the distribution grows. The results stand in stark contrast to those we obtained for doubling measures, and reveal this to be a substantially more difficult setting for nearest neighbor search. For a doubling measure, the probability mass of a ball $B(q, r)$ doubles whenever $r$ is multiplied by a constant. In our present setting, it doubles whenever $r$ is increased by an additive constant:

Theorem 10 Suppose that all $p^{(j)}_i \in (0, 1/2)$. Let $L_j = \sum_i p^{(j)}_i$ denote the expected number of words in a document from topic $j$, and let $L = \min(L_1, \ldots, L_t)$. Pick any query $q \in \{0, 1\}^N$, and draw $X \sim \mu$. For any $\ell \geq 0$,

$$\frac{\Pr(d_H(q, X) = \ell + 1)}{\Pr(d_H(q, X) = \ell)} \geq \frac{L - \ell/2}{\ell + 1}.$$

Now, fix a particular query $q \in \{0, 1\}^N$, and draw $x_1, \ldots, x_n$ from distribution $\mu$.

Lemma 11 There is an absolute constant $c_o$ for which the following holds. Pick any $0 < \delta < 1$ and any $k \geq 1$, and let $v$ denote the smallest integer for which $\Pr_X(d_H(q, X) \leq v) \geq (8/n) \max(k, \ln 1/\delta)$. Then with probability at least $1 - 3\delta$ over the choice of $x_1, \ldots, x_n$, for any $m \leq n$,

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) \leq 4 \sqrt{\frac{v}{c_o L - \log_2(n/m)}}.$$

The implication of this lemma is that for any of the three tree data structures, the failure probability at a single level is roughly $\sqrt{v/L}$. This means that the tree can only be grown to depth $O(\sqrt{L/v})$, and thus the query time is dominated by $n_o = n \cdot 2^{-O(\sqrt{L/v})}$. When $n$ is large, we expect $v$ to be small, and thus the query time improves over exhaustive search by a factor of roughly $2^{-\sqrt{L}}$.

Acknowledgments

We thank the National Science Foundation for support under grant IIS-1162581.

References

N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39:302-322, 2009.

Appendix B. Proof of Theorem 7

Consider any internal node of the tree that contains $q$ as well as $m$ of the data points, including $x_{(1)}$. What is the probability that the split at that node separates $q$ from $x_{(1)}$? To analyze this, let $F$ denote the fraction of the $m$ points that fall between $q$ and $x_{(1)}$ along the randomly-chosen split direction. Since the split point is chosen at random from an interval of mass $1/2$, the probability that it separates $q$ from $x_{(1)}$ is at most $F/(1/2)$. Integrating out $F$, we get

$$\Pr(q \text{ is separated from } x_{(1)}) \leq \int_0^1 \Pr(F = f) \, \frac{f}{1/2} \, df = 2 \int_0^1 \Pr(F > f) \, df \leq 2 \int_0^1 \min\left(1, \frac{\Phi_m}{2f}\right) df$$
$$= 2 \int_0^{\Phi_m/2} df + 2 \int_{\Phi_m/2}^1 \frac{\Phi_m}{2f} \, df = \Phi_m \ln \frac{2e}{\Phi_m},$$

where the second inequality uses Corollary 4.

The lemma follows by taking a union bound over the path that conveys $q$ from root to leaf, in which the number of data points per level shrinks geometrically, by a factor of $3/4$ or better.

The same reasoning generalizes to $k$ nearest neighbors. This time, $F$ is defined to be the fraction of the $m$ points that lie between $q$ and the furthest of $x_{(1)}, \ldots, x_{(k)}$ along the random splitting direction. Then $q$ is separated from one of these neighbors only if the split point lies in an interval of mass $F$ on either side of $q$, an event that occurs with probability at most $2F/(1/2)$. Using Theorem 5,

$$\Pr(q \text{ is separated from some } x_{(j)}, 1 \leq j \leq k) \leq \int_0^1 \Pr(F = f) \, \frac{2f}{1/2} \, df = 4 \int_0^1 \Pr(F > f) \, df$$
$$\leq 4 \int_0^1 \min\left(1, \frac{k \Phi_{k,m}}{2(f - (k-1)/m)}\right) df$$
$$\leq 4 \int_0^{(k \Phi_{k,m}/2) + (k-1)/m} df + 4 \int_{(k \Phi_{k,m}/2) + (k-1)/m}^1 \frac{k \Phi_{k,m}}{2(f - (k-1)/m)} \, df$$
$$\leq 2 k \Phi_{k,m} \ln \frac{2e}{k \Phi_{k,m}} + \frac{4(k-1)}{m},$$

and as before, we sum this over a root-to-leaf path in the tree.

Appendix C. Proof of Theorem 8

C.1. The $k = 1$ case

We will consider a collection of balls $B_o, B_1, B_2, \ldots$ centered at $q$, with geometrically increasing radii $r_o, r_1, r_2, \ldots$, respectively. For $i \geq 1$, we will take $r_i = 2^i r_o$. Thus by the doubling condition, $\mu(B_i) \leq C^i \mu(B_o)$, where $C = 2^{d_o} \geq 4$.
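The doubling relation $\mu(B_i) \leq C^i \mu(B_o)$ can be observed empirically. The sketch below is an illustration under assumptions that are not taken from the text: the data are uniform on $[-1, 1]^2$, the query sits at the origin, and `ball_counts` is a hypothetical helper. In that setting each doubling of the radius should multiply the number of points in the ball by roughly $2^2 = 4$, as long as the ball stays inside the square.

```python
import math
import random

def ball_counts(points, q, r0, levels):
    """Count points in the closed balls B_i = B(q, 2^i * r0), i = 0..levels."""
    dists = [math.dist(p, q) for p in points]
    return [sum(1 for t in dists if t <= r0 * 2 ** i) for i in range(levels + 1)]

rng = random.Random(1)
d = 2  # for the uniform density assumed here, C = 2^d = 4
points = [tuple(rng.uniform(-1, 1) for _ in range(d)) for _ in range(200_000)]
q = (0.0,) * d

# Radii 0.1, 0.2, 0.4, 0.8: every ball lies inside the square [-1, 1]^2.
counts = ball_counts(points, q, r0=0.1, levels=3)
ratios = [counts[i + 1] / counts[i] for i in range(len(counts) - 1)]
print(counts)
print(ratios)  # each entry estimates mu(B_{i+1}) / mu(B_i)
```

The ratios are sample estimates, so they fluctuate around 4; the fluctuation is largest for the innermost ball, which contains the fewest points.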
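[...] where the last inequality comes from (*). To lower-bound $2^\ell$, we again use (*) to get $C^\ell \geq m/(2n\mu(B_o))$, whereupon

$$2^\ell \geq \left( \frac{m}{2n\mu(B_o)} \right)^{1/\log_2 C} = \left( \frac{m}{2 \ln(1/\delta)} \right)^{1/\log_2 C}$$

and we're done.

C.2. The $k > 1$ case

The only big change is in the definition of $r_o$; it is now the radius for which

$$\mu(B_o) = \frac{4}{n} \max\left( k, \ln \frac{1}{\delta} \right).$$

Thus, when $x_1, \ldots, x_n$ are drawn independently at random from $\mu$, the expected number of them that fall in $B_o$ is at least $4k$, and by a multiplicative Chernoff bound is at least $k$ with probability $\geq 1 - \delta$.

The balls $B_1, B_2, \ldots$ are defined as before, and once again, we can conclude that with probability $\geq 1 - 2\delta$, each $B_i$ contains at most $2nC^i \mu(B_o)$ of the data points.

Any point $x_{(i)} \notin B_o$ lies in some annulus $B_j \setminus B_{j-1}$, and its contribution to the summation in $\Phi_{k,m}$ is

$$\frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \leq \frac{1}{2^{j-1}}.$$

The relationship (*) and the remainder of the argument are exactly as before.

Appendix D. A useful technical lemma

Lemma 12 Suppose that for some constants $A, B > 0$ and $d_o \geq 1$,

$$F(m) \leq A \left( \frac{B}{m} \right)^{1/d_o}$$

for all $m \geq n_o$. Pick any $0 < \beta < 1$ and define $\ell = \log_{1/\beta}(n/n_o)$. Then:

$$\sum_{i=0}^{\ell} F(\beta^i n) \leq \frac{A d_o}{1 - \beta} \left( \frac{B}{n_o} \right)^{1/d_o}$$

and, if $n_o \geq B(A/2)^{d_o}$,

$$\sum_{i=0}^{\ell} F(\beta^i n) \ln \frac{2e}{F(\beta^i n)} \leq \frac{A d_o}{1 - \beta} \left( \frac{B}{n_o} \right)^{1/d_o} \left( \frac{1}{1 - \beta} \ln \frac{1}{\beta} + \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} \right).$$

To understand this distribution, we start with a general result about sums of Bernoulli random variables. Notice that the result is exactly correct in the situation where all $p_i = 1/2$.

Lemma 13 Suppose $Z_1, \ldots, Z_N$ are independent, where $Z_i \in \{0, 1\}$ is a Bernoulli random variable with mean $0 < a_i < 1$, and $a_1 \geq a_2 \geq \cdots \geq a_N$. Let $Z = Z_1 + \cdots + Z_N$. Then for any $\ell \geq 0$,

$$\frac{\Pr(Z = \ell + 1)}{\Pr(Z = \ell)} \geq \frac{1}{\ell + 1} \sum_{i=\ell+1}^{N} \frac{a_i}{1 - a_i}.$$

Proof Define $r_i = a_i/(1 - a_i) \in (0, \infty)$; then $r_1 \geq r_2 \geq \cdots \geq r_N$. Now, for any $\ell \geq 0$,

$$\begin{aligned}
\Pr(Z = \ell) &= \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} a_{i_1} a_{i_2} \cdots a_{i_\ell} \prod_{j \notin \{i_1, \ldots, i_\ell\}} (1 - a_j) \\
&= \prod_{i=1}^{N} (1 - a_i) \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} \frac{a_{i_1}}{1 - a_{i_1}} \, \frac{a_{i_2}}{1 - a_{i_2}} \cdots \frac{a_{i_\ell}}{1 - a_{i_\ell}} \\
&= \prod_{i=1}^{N} (1 - a_i) \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} r_{i_1} r_{i_2} \cdots r_{i_\ell}
\end{aligned}$$

where the summations are over subsets $\{i_1, \ldots, i_\ell\}$ of $\ell$ distinct elements of $[N]$. In the final line, the product of the $(1 - a_i)$ does not depend upon $\ell$ and can be ignored. Let's focus on the summation; call it $S_\ell$. We would like to compare it to $S_{\ell+1}$.

$S_{\ell+1}$ is the sum of $\binom{N}{\ell+1}$ distinct terms, each the product of $\ell + 1$ $r_i$'s. These terms also appear in the quantity $S_\ell (r_1 + \cdots + r_N)$; in fact, each term of $S_{\ell+1}$ appears multiple times, $\ell + 1$ times to be precise. The remaining terms in $S_\ell (r_1 + \cdots + r_N)$ each contain $\ell - 1$ unique elements and one duplicated element. By accounting in this way, we get

$$\begin{aligned}
S_\ell (r_1 + \cdots + r_N) &= (\ell + 1) S_{\ell+1} + \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} r_{i_1} r_{i_2} \cdots r_{i_\ell} (r_{i_1} + \cdots + r_{i_\ell}) \\
&\leq (\ell + 1) S_{\ell+1} + S_\ell (r_1 + \cdots + r_\ell)
\end{aligned}$$

since the $r_i$'s are arranged in decreasing order. Hence

$$\frac{\Pr(Z = \ell + 1)}{\Pr(Z = \ell)} = \frac{S_{\ell+1}}{S_\ell} \geq \frac{1}{\ell + 1} (r_{\ell+1} + \cdots + r_N),$$

as claimed.

We now apply this result directly to the sum of Bernoulli variables $Z = d_H(q, X)$.

Lemma 14 Suppose that $p_1, \ldots, p_N \in (0, 1/2)$. Pick any query $q \in \{0, 1\}^N$, and draw $X$ from distribution $\mu = B(p_1) \times \cdots \times B(p_N)$. Then for any $\ell \geq 0$,

$$\frac{\Pr(d_H(q, X) = \ell + 1)}{\Pr(d_H(q, X) = \ell)} \geq \frac{L - \ell/2}{\ell + 1},$$

where $L = \sum_i p_i$ is the expected number of words in $X$.

Suppose that for some $i > k$, point $x_{(i)}$ is at Hamming distance $\ell$ from $q$, that is, $x_{(i)} \in S_\ell$. Then

$$\frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \leq \sqrt{\frac{v}{\ell}}$$

since Euclidean distance is the square root of Hamming distance. In bounding $\Phi_{k,m}$, we need to gauge the range of Hamming distances spanned by $x_{(k+1)}, \ldots, x_{(m)}$. The geometric growth rate of part (c) implies that most points lie at Hamming distance $c_o L$ or greater from $q$. It also means that $d_H(q, x_{(m)}) > c_o L - \log_2(n/m)$. Thus,

$$\begin{aligned}
\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) &= \frac{1}{m} \sum_{i > k} \frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \\
&\leq \frac{1}{m} \sum_{\ell \geq v} |S_\ell \cap \{x_{(1)}, \ldots, x_{(m)}\}| \sqrt{\frac{v}{\ell}} \\
&\leq 4 \sqrt{\frac{v}{c_o L - \log_2(n/m)}}
\end{aligned}$$

where the last step follows by lower-bounding $|S_\ell|$ by an increasing geometric series.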
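Lemma 13 is easy to check numerically on small instances. The sketch below is an illustration only: `sum_pmf` and `lemma13_pairs` are hypothetical helper names, and the means are arbitrary test values. It computes the exact distribution of $Z$ by dynamic programming and pairs each ratio $\Pr(Z = \ell + 1)/\Pr(Z = \ell)$ with the lower bound stated in the lemma.

```python
import random

def sum_pmf(means):
    """Exact pmf of a sum of independent Bernoulli variables (dynamic programming)."""
    pmf = [1.0]
    for a in means:
        nxt = [0.0] * (len(pmf) + 1)
        for z, p in enumerate(pmf):
            nxt[z] += p * (1 - a)   # this variable is 0
            nxt[z + 1] += p * a     # this variable is 1
        pmf = nxt
    return pmf

def lemma13_pairs(means):
    """Return (ratio, bound) for l = 0..N-1, with means sorted in decreasing order."""
    a = sorted(means, reverse=True)
    pmf = sum_pmf(a)
    r = [x / (1 - x) for x in a]
    pairs = []
    for l in range(len(a)):
        ratio = pmf[l + 1] / pmf[l]
        bound = sum(r[l:]) / (l + 1)  # r[l:] holds r_{l+1}, ..., r_N (1-based)
        pairs.append((ratio, bound))
    return pairs

rng = random.Random(0)
means = [rng.uniform(0.05, 0.45) for _ in range(12)]
pairs = lemma13_pairs(means)        # ratio should be at least the bound
half = lemma13_pairs([0.5] * 12)    # all means 1/2: ratio should equal the bound
for ratio, bound in pairs:
    print(f"{ratio:.4f} >= {bound:.4f}")
```

At $\ell = 0$ the two quantities coincide for any choice of means, since $S_1/S_0 = r_1 + \cdots + r_N$, so a comparison in floating point needs a small tolerance.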