JMLR Workshop and Conference Proceedings vol
Randomized partition trees for exact nearest neighbor search

Sanjoy Dasgupta (dasgupta@cs.ucsd.edu)
Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093

Kaushik Sinha (kaushik.sinha@wichita.edu)
Department of Electrical Engineering and Computer Science, Wichita State University, 1845


[...] tree, which takes $O(n)$ space and answers queries with an approximation factor $c = 1 + \epsilon$ in $O((6/\epsilon)^d \log n)$ time (Arya et al., 1998).

In some of these results, an exponential dependence on dimension is evident, and indeed this is a familiar blot on the nearest neighbor landscape. One way to mitigate the curse of dimensionality is to consider situations in which data have low intrinsic dimension $d_o$, even if they happen to lie in $\mathbb{R}^d$ for $d \gg d_o$ or in a general metric space.

A common assumption is that the data are drawn from a doubling measure of dimension $d_o$ (or equivalently, have expansion rate $2^{d_o}$); this is defined in Section 4.1 below. Under this condition, Karger and Ruhl (2002) have a scheme that gives exact answers to nearest neighbor queries in time $O(2^{3d_o} \log n)$, using a data structure of size $O(2^{3d_o} n)$. The more recent cover tree algorithm (Beygelzimer et al., 2006), which has been used quite widely, creates a data structure in space $O(n)$ and answers queries in time $O(2^{d_o} \log n)$.

There is also work that combines intrinsic dimension and approximate search. The navigating net (Krauthgamer and Lee, 2004), given data from a metric space of doubling dimension $d_o$, has size $O(2^{O(d_o)} n)$ and gives a $(1+\epsilon)$-approximate answer to queries in time $O(2^{O(d_o)} \log n + (1/\epsilon)^{O(d_o)})$; the crucial advantage here is that doubling dimension is a more general and robust notion than doubling measure.

Despite these and many other results, there are two significant deficiencies in the nearest neighbor literature that have motivated the present paper. First, existing analyses have succeeded at identifying, for a given data structure, highly specific families of data for which efficient exact NN search is possible (for instance, data from doubling measures), but have failed to provide a more general characterization. Second, there remains a class of nearest neighbor data structures that are popular and successful in practice, but that have not been analyzed thoroughly. These structures combine classical k-d tree partitioning with randomization and overlapping cells, and are the subject of this paper.

1.1. Three randomized tree structures for exact NN search

The k-d tree is a partition of $\mathbb{R}^d$ into hyper-rectangular cells, based on a set of data points (Bentley, 1975). The root of the tree is a single cell corresponding to the entire space. A coordinate direction is chosen, and the cell is split at the median of the data along this direction (Figure 1, left). The process is then recursed on the two newly created cells, and continues until all leaf cells contain at most some predetermined number $n_o$ of points. When there are $n$ data points, the depth of the tree is at most about $\log(n/n_o)$.

Given a k-d tree built from data points $S$, there are several ways to answer a nearest neighbor query $q$. The quickest and dirtiest of these is to move $q$ down the tree to its appropriate leaf cell, and then return the nearest neighbor in that cell. This defeatist search takes time just $O(n_o + \log(n/n_o))$, which is $O(\log n)$ for constant $n_o$. The problem is that $q$'s nearest neighbor may well lie in a different cell, for instance when the data happen to be concentrated near cell boundaries. Consequently, the failure probability of this scheme can be unacceptably high.

Over the years, some simple tricks have emerged, from various sources, for reducing the failure probability. These are nicely laid out by Liu et al. (2004), who show experimentally that the resulting algorithms are effective in practice.

[...]

Figure 3: Three types of split. The fractions refer to probability mass.
$\alpha$ is some constant, while $\beta$ is chosen uniformly at random from $[1/4, 3/4]$.

[...] unit sphere. But now, three split points are noted: the median $m(C)$ of the data along direction $U$, the $(1/2) - \alpha$ fractile value $l(C)$, and the $(1/2) + \alpha$ fractile value $r(C)$. Here $\alpha$ is a small constant, like 0.05 or 0.1. The idea is to simultaneously entertain a median split

$$\text{left} = \{x : x \cdot U < m(C)\}, \qquad \text{right} = \{x : x \cdot U \geq m(C)\}$$

and an overlapping split (with the middle $2\alpha$ fraction of the data falling on both sides)

$$\text{left} = \{x : x \cdot U < r(C)\}, \qquad \text{right} = \{x : x \cdot U \geq l(C)\}.$$

In the spill tree (Liu et al., 2004), each data point in $S$ is stored in multiple leaves, by following the overlapping splits. A query is then answered defeatist-style, by routing it to a single leaf using median splits.

Both the RP tree and the spill tree have query times of $O(n_o + \log(n/n_o))$, but the latter can be expected to have a lower failure probability, and we will see this in the bounds we obtain. On the other hand, the RP tree requires just linear space, while the size of the spill tree is $O(n^{1/(1 - \lg(1+2\alpha))})$. When $\alpha = 0.05$, for instance, the size is $O(n^{1.159})$.

In view of these tradeoffs, we consider a further variant, which we call the virtual spill tree. It stores each data point in a single leaf, following median splits, and hence has linear size. However, each query is routed to multiple leaves, using overlapping splits, and the return value is its nearest neighbor in the union of these leaves.

The various splits are summarized in Figure 3, and the three trees use them as follows:

                     Routing data        Routing queries
RP tree              Perturbed split     Perturbed split
Spill tree           Overlapping split   Median split
Virtual spill tree   Median split        Overlapping split

One small technicality: if, for instance, there are duplicates among the data points, it might not be possible to achieve a median split, or a split at a desired fractile. We will ignore these discretization problems.

2. A potential function for point configurations

To motivate the potential function, we start by considering what happens when there are just two data points and one query point.

2.1. How random projection affects the relative placement of three points

Consider any $q, x, y \in \mathbb{R}^d$, such that $x$ is closer to $q$ than is $y$; that is, $\|q - x\| \leq \|q - y\|$. Now suppose that a random direction $U$ is chosen from the unit sphere $S^{d-1}$, and that the points are projected onto this direction. What is the probability that $y$ falls between $q$ and $x$ on this line? The following lemma answers this question exactly. An approximate solution, with a different proof method, was given earlier by Kleinberg (1997).

Lemma 1. Pick any $q, x, y \in \mathbb{R}^d$ with $\|q - x\| \leq \|q - y\|$. Pick a random unit direction $U$. The probability, over $U$, that $y \cdot U$ falls (strictly) between $q \cdot U$ and $x \cdot U$ is

$$\frac{1}{\pi} \arcsin\left( \frac{\|q - x\|}{\|q - y\|} \sqrt{1 - \left( \frac{(q - x) \cdot (y - x)}{\|q - x\| \, \|y - x\|} \right)^2} \right).$$

Proof. We may assume $U$ is drawn from $N(0, I_d)$, the $d$-dimensional Gaussian with mean zero and unit covariance. This gives the right distribution if we scale $U$ to unit length, but we can skip this last step since it has no effect on the question at hand.

We can also assume, without loss of generality, that $q$ lies at the origin and that $x$ lies along the (positive) $x_1$-axis: that is, $q = 0$ and $x = \|x\| e_1$. It will then be helpful to split the direction $U$ into two pieces, its component $U_1$ in the $x_1$-direction, and the remaining $d - 1$ coordinates $U_R$. Likewise, we will write $y = (y_1, y_R)$.

If $y_R = 0$ then $x$, $y$, and $q$ are collinear, and the projection of $y$ cannot possibly fall between those of $x$ and $q$. Henceforth assume $y_R \neq 0$.

Let $E$ denote the event of interest:

$E \equiv$ $y \cdot U$ falls between $q \cdot U$ (that is, $0$) and $x \cdot U$ (that is, $\|x\| U_1$)
$\;\;\equiv$ $y_R \cdot U_R$ falls between $-y_1 U_1$ and $(\|x\| - y_1) U_1$.

The interval of interest is either $(-y_1 |U_1|, (\|x\| - y_1)|U_1|)$, if $U_1 \geq 0$, or $(-(\|x\| - y_1)|U_1|, y_1 |U_1|)$, if $U_1 < 0$. Now $y_R \cdot U_R$ is independent of $U_1$ and is distributed as $N(0, \|y_R\|^2)$, which is symmetric and thus assigns the same probability mass to the two intervals. Therefore

$$\Pr_U(E) = \Pr_{U_1} \Pr_{U_R}\left( -y_1 |U_1| < y_R \cdot U_R < (\|x\| - y_1)|U_1| \right).$$

Let $Z$ and $Z'$ be independent standard normals $N(0, 1)$. Since $U_1$ is distributed as $Z$ and $y_R \cdot U_R$ is distributed as $\|y_R\| Z'$,

$$\Pr_U(E) = \Pr\left( -y_1 |Z| < \|y_R\| Z' < (\|x\| - y_1)|Z| \right) = \Pr\left( \frac{Z'}{|Z|} \in \left( -\frac{y_1}{\|y_R\|}, \frac{\|x\| - y_1}{\|y_R\|} \right) \right).$$

[...]

In the tree data structures we analyze, most cells contain only a subset of the data $\{x_1, \ldots, x_n\}$. For a cell that contains $m$ of these points, the appropriate variant of $\Phi$ is

$$\Phi_m(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=2}^{m} \frac{\|q - x_{(1)}\|}{\|q - x_{(i)}\|}.$$

Corollary 4. Pick any points $q, x_1, \ldots, x_n$ and let $S$ denote any subset of the $x_i$ that includes $x_{(1)}$. If $q$ and the points in $S$ are projected to a direction $U$ chosen at random from the unit sphere, then for any $0 < \alpha < 1$, the probability (over $U$) that at least an $\alpha$ fraction of the projected $S$ falls between $q$ and $x_{(1)}$ is upper-bounded by $(1/2\alpha)\, \Phi_{|S|}(q, \{x_1, \ldots, x_n\})$.

Proof. Apply Theorem 3 to $S$, noting that the corresponding value of $\Phi$ is maximized when $S$ consists of the points closest to $q$; and then apply Markov's inequality.

2.3. Extension to k nearest neighbors

If we are interested in finding the $k$ nearest neighbors, a suitable generalization of $\Phi_m$ is

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=k+1}^{m} \frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|}.$$

Theorem 5. Pick any points $q, x_1, \ldots, x_n$ and let $S$ denote a subset of the $x_i$ that includes $x_{(1)}, \ldots, x_{(k)}$. Suppose $q$ and the points in $S$ are projected to a random unit direction $U$. Then, for any $(k-1)/|S| < \alpha < 1$, the probability (over $U$) that in the projection, there is some $1 \leq j \leq k$ for which at least $\alpha m$ points lie between $x_{(j)}$ and $q$ is at most

$$\frac{k}{2(\alpha - (k-1)/|S|)}\, \Phi_{k,|S|}(q, \{x_1, \ldots, x_n\}).$$

This theorem, and many of the others that follow, are proved in the appendix.

3. Randomized partition trees

We'll now see that the failure probability of the random projection tree is proportional to $\Phi \ln(1/\Phi)$, while that of the two spill trees is proportional to $\Phi$. We start with the second result, since it is the more straightforward of the two.

3.1. Randomized spill trees

In a randomized spill tree, each cell is split along a direction chosen uniformly at random from the unit sphere. Two kinds of splits are simultaneously considered: (1) a split at the median (along the random direction), and (2) an overlapping split with one part containing the bottom $1/2 + \alpha$ fraction of the cell's points, and the other part containing the top $1/2 + \alpha$ fraction, where $0 < \alpha < 1/2$ (recall Figure 3).

[...]
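As an illustration of the randomized spill tree just described, here is a minimal Python sketch, not the authors' implementation. It stores data with overlapping splits along random directions and answers a query defeatist-style with median splits. The names `build_spill_tree`, `defeatist_query`, and `leaf_entries`, the dictionary layout, and the clamp on the spill width (added only so the recursion always terminates) are assumptions made for this example; `leaf_size` plays the role of $n_o$ and is assumed to be at least 1.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_direction(d, rng):
    # A normalized Gaussian vector is uniformly distributed on the unit sphere.
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(g, g))
    return [c / norm for c in g]

def build_spill_tree(points, alpha, leaf_size, rng):
    """Store data by overlapping splits; keep the median for routing queries."""
    n = len(points)
    if n <= leaf_size:
        return {"leaf": list(points)}
    u = random_direction(len(points[0]), rng)
    order = sorted(points, key=lambda p: dot(p, u))
    mid = n // 2
    # Assumption of this sketch: clamp the spill so each child is smaller than its parent.
    spill = min(int(alpha * n), mid - 1)
    return {
        "u": u,
        "median": dot(order[mid], u),  # projection where the right half begins
        "left": build_spill_tree(order[: mid + spill], alpha, leaf_size, rng),
        "right": build_spill_tree(order[mid - spill:], alpha, leaf_size, rng),
    }

def defeatist_query(tree, q):
    """Route q to a single leaf using median splits, then scan only that leaf."""
    node = tree
    while "leaf" not in node:
        go_left = dot(q, node["u"]) < node["median"]
        node = node["left"] if go_left else node["right"]
    return min(node["leaf"], key=lambda p: math.dist(p, q))

def leaf_entries(tree):
    """Total number of stored point copies, counting duplicates across leaves."""
    if "leaf" in tree:
        return len(tree["leaf"])
    return leaf_entries(tree["left"]) + leaf_entries(tree["right"])

rng = random.Random(0)
data = [tuple(rng.gauss(0.0, 1.0) for _ in range(5)) for _ in range(400)]
tree = build_spill_tree(data, alpha=0.1, leaf_size=10, rng=rng)
query = tuple(rng.gauss(0.0, 1.0) for _ in range(5))
answer = defeatist_query(tree, query)
```

The answer is always a stored data point, but it need not be the true nearest neighbor; comparing it against a brute-force scan over many queries gives an empirical failure rate to set beside the bounds in the text. `leaf_entries(tree)` exceeds `len(data)` whenever points are duplicated across leaves, which is the space cost that the virtual spill tree avoids.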
3.3. Could coordinate directions be used?

The tree data structures we have studied make crucial use of random projection for splitting cells. It would not suffice to use coordinate directions, as in k-d trees.

To see this, consider a simple example. Let $q$, the query point, be the origin, and suppose the data points $x_1, \ldots, x_n \in \mathbb{R}^d$ are chosen as follows:

- $x_1$ is the all-ones vector.
- Each $x_i$, $i > 1$, is chosen by picking a coordinate at random, setting its value to $M$, and then setting all remaining coordinates to uniform-random numbers in the range $(0, 1)$. Here $M$ is some very large constant.

For large enough $M$, the nearest neighbor of $q$ is $x_1$. By letting $M$ grow further, we can let $\Phi(q, \{x_1, \ldots, x_n\})$ get arbitrarily close to zero, which means that the random projection methods will work well. However, any coordinate projection will create a disastrously large separation between $q$ and $x_1$: on average, a $(1 - 1/d)$ fraction of the data points will fall between them.

4. Bounding $\Phi$

The exact nearest neighbor schemes we analyze have error probabilities related to $\Phi$, which lies in the range $[0, 1]$. The worst case is when all points are equidistant, in which case $\Phi$ is exactly 1, but this is a pathological situation. Is it possible to bound $\Phi$ under simple assumptions on the data? In this section we study two such assumptions. In each case, query points are arbitrary, but the data are assumed to have been drawn i.i.d. from an underlying distribution.

4.1. Data drawn from a doubling measure

Suppose the data points are drawn from a distribution $\mu$ on $\mathbb{R}^d$ which is a doubling measure: that is, there exist a constant $C > 0$ and a subset $X \subseteq \mathbb{R}^d$ such that

$$\mu(B(x, 2r)) \leq C \cdot \mu(B(x, r)) \quad \text{for all } x \in X \text{ and all } r > 0.$$

Here $B(x, r)$ is the closed Euclidean ball of radius $r$ centered at $x$. To understand this condition, it is helpful to also look at an alternative formulation that is essentially equivalent: there exist a constant $d_o > 0$ and a subset $X \subseteq \mathbb{R}^d$ such that for all $x \in X$, all $r > 0$, and all $\alpha \geq 1$, we have $\mu(B(x, \alpha r)) \leq \alpha^{d_o} \cdot \mu(B(x, r))$. In other words, the probability mass of a ball grows polynomially in the radius. Comparing this to the standard formula for the volume of a ball, we see that the degree of this polynomial, $d_o$ ($= \log_2 C$), can reasonably be thought of as the "dimension" of measure $\mu$.

Theorem 8. Suppose $\mu$ is continuous on $\mathbb{R}^d$ and is a doubling measure of dimension $d_o \geq 2$. Pick any $q \in X$ and draw $x_1, \ldots, x_n \sim \mu$. Pick any $0 < \delta < 1/2$. With probability at least $1 - 3\delta$ [...]

[...]

The overall distribution is thus a mixture $\mu = w_1 \mu_1 + \cdots + w_t \mu_t$ whose $j$th component is a Bernoulli product distribution $\mu_j = B(p^{(j)}_1) \times \cdots \times B(p^{(j)}_N)$. Here $B(p)$ is a shorthand for the distribution on $\{0, 1\}$ with expected value $p$. It will simplify things to assume that $0 < p^{(j)}_i < 1/2$; this is not a huge assumption if, say, stop words have been removed.

For the purposes of bounding $\Phi$, we are interested in the distribution of $d_H(q, X)$, where $X$ is chosen from $\mu$ and $d_H$ denotes Hamming distance. This is a sum of small independent quantities, and it is customary to approximate such sums by a Poisson distribution. In the current context, however, this approximation is rather poor, and we instead use counting arguments to directly bound how rapidly the distribution grows. The results stand in stark contrast to those we obtained for doubling measures, and reveal this to be a substantially more difficult setting for nearest neighbor search. For a doubling measure, the probability mass of a ball $B(q, r)$ doubles whenever $r$ is multiplied by a constant. In our present setting, it doubles whenever $r$ is increased by an additive constant:

Theorem 10. Suppose that all $p^{(j)}_i \in (0, 1/2)$. Let $L_j = \sum_i p^{(j)}_i$ denote the expected number of words in a document from topic $j$, and let $L = \min(L_1, \ldots, L_t)$. Pick any query $q \in \{0, 1\}^N$, and draw $X \sim \mu$. For any $\ell \geq 0$,

$$\frac{\Pr(d_H(q, X) = \ell + 1)}{\Pr(d_H(q, X) = \ell)} \geq \frac{L - \ell/2}{\ell + 1}.$$

Now, fix a particular query $q \in \{0, 1\}^N$, and draw $x_1, \ldots, x_n$ from distribution $\mu$.

Lemma 11. There is an absolute constant $c_o$ for which the following holds. Pick any $0 < \delta < 1$ and any $k \geq 1$, and let $v$ denote the smallest integer for which $\Pr_X(d_H(q, X) \leq v) \geq (8/n) \max(k, \ln 1/\delta)$. Then with probability at least $1 - 3\delta$ over the choice of $x_1, \ldots, x_n$, for any $m \leq n$,

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) \leq 4 \sqrt{\frac{v}{c_o L - \log_2(n/m)}}.$$

The implication of this lemma is that for any of the three tree data structures, the failure probability at a single level is roughly $\sqrt{v/L}$. This means that the tree can only be grown to depth $O(\sqrt{L/v})$, and thus the query time is dominated by $n_o = n \cdot 2^{-O(\sqrt{L/v})}$. When $n$ is large, we expect $v$ to be small, and thus the query time improves over exhaustive search by a factor of roughly $2^{-\sqrt{L}}$.

Acknowledgments

We thank the National Science Foundation for support under grant IIS-1162581.

References

N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39:302-322, 2009.

[...]

Appendix B. Proof of Theorem 7

Consider any internal node of the tree that contains $q$ as well as $m$ of the data points, including $x_{(1)}$. What is the probability that the split at that node separates $q$ from $x_{(1)}$? To analyze this, let $F$ denote the fraction of the $m$ points that fall between $q$ and $x_{(1)}$ along the randomly-chosen split direction. Since the split point is chosen at random from an interval of mass $1/2$, the probability that it separates $q$ from $x_{(1)}$ is at most $F/(1/2)$. Integrating out $F$, we get

$$\begin{aligned}
\Pr(q \text{ is separated from } x_{(1)}) &\leq \int_0^1 \Pr(F = f)\, \frac{f}{1/2}\, df = 2 \int_0^1 \Pr(F > f)\, df \\
&\leq 2 \int_0^1 \min\left(1, \frac{\Phi_m}{2f}\right) df = 2 \int_0^{\Phi_m/2} df + 2 \int_{\Phi_m/2}^1 \frac{\Phi_m}{2f}\, df = \Phi_m \ln \frac{2e}{\Phi_m},
\end{aligned}$$

where the second inequality uses Corollary 4.

The lemma follows by taking a union bound over the path that conveys $q$ from root to leaf, in which the number of data points per level shrinks geometrically, by a factor of $3/4$ or better.

The same reasoning generalizes to $k$ nearest neighbors. This time, $F$ is defined to be the fraction of the $m$ points that lie between $q$ and the furthest of $x_{(1)}, \ldots, x_{(k)}$ along the random splitting direction. Then $q$ is separated from one of these neighbors only if the split point lies in an interval of mass $F$ on either side of $q$, an event that occurs with probability at most $2F/(1/2)$. Using Theorem 5,

$$\begin{aligned}
\Pr(q \text{ is separated from some } x_{(j)},\ 1 \leq j \leq k) &\leq \int_0^1 \Pr(F = f)\, \frac{2f}{1/2}\, df = 4 \int_0^1 \Pr(F > f)\, df \\
&\leq 4 \int_0^1 \min\left(1, \frac{k \Phi_{k,m}}{2(f - (k-1)/m)}\right) df \\
&\leq 4 \int_0^{(k \Phi_{k,m}/2) + (k-1)/m} df + 4 \int_{(k \Phi_{k,m}/2) + (k-1)/m}^1 \frac{k \Phi_{k,m}}{2(f - (k-1)/m)}\, df \\
&\leq 2 k \Phi_{k,m} \ln \frac{2e}{k \Phi_{k,m}} + \frac{4(k-1)}{m},
\end{aligned}$$

and as before, we sum this over a root-to-leaf path in the tree.

Appendix C. Proof of Theorem 8

C.1. The k = 1 case

We will consider a collection of balls $B_o, B_1, B_2, \ldots$ centered at $q$, with geometrically increasing radii $r_o, r_1, r_2, \ldots$, respectively. For $i \geq 1$, we will take $r_i = 2^i r_o$. Thus by the doubling condition, $\mu(B_i) \leq C^i \mu(B_o)$, where $C = 2^{d_o} \geq 4$.

[...]
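The per-node bound $\Phi_m \ln(2e/\Phi_m)$ from the proof of Theorem 7 in Appendix B can be sanity-checked numerically. The Python sketch below is illustrative only: it computes $\Phi_m$ directly from its definition and compares the closed form with a midpoint-rule evaluation of $2\int_0^1 \min(1, \Phi_m/(2f))\,df$. The function names are hypothetical, and the code assumes no data point coincides with the query.

```python
import math

def potential(q, points, m):
    """Phi_m = (1/m) * sum over i = 2..m of ||q - x_(1)|| / ||q - x_(i)||."""
    dists = sorted(math.dist(q, x) for x in points)[:m]
    return sum(dists[0] / d for d in dists[1:]) / m

def separation_bound(phi):
    """Closed form Phi * ln(2e / Phi) for the single-node separation probability."""
    return phi * math.log(2.0 * math.e / phi)

def separation_integral(phi, steps=200_000):
    """Midpoint-rule value of 2 * integral over [0, 1] of min(1, phi / (2 f)) df."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        f = (i + 0.5) * h
        total += min(1.0, phi / (2.0 * f))
    return 2.0 * total * h

# Toy configuration: the query at the origin, data at distances 1, 2, and 4.
phi = potential((0.0, 0.0), [(1.0, 0.0), (2.0, 0.0), (4.0, 0.0)], m=3)
closed_form = separation_bound(phi)
numeric = separation_integral(phi)
```

For this toy configuration $\Phi_3 = (1/2 + 1/4)/3 = 0.25$, and the two evaluations of the bound should agree up to discretization error.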
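[...] where the last inequality comes from (*). To lower-bound $2^\ell$, we again use (*) to get $C^\ell \geq m/(2n\mu(B_o))$, whereupon

$$2^\ell \geq \left( \frac{m}{2n\mu(B_o)} \right)^{1/\log_2 C} = \left( \frac{m}{2 \ln(1/\delta)} \right)^{1/\log_2 C}$$

and we're done.

C.2. The k > 1 case

The only big change is in the definition of $r_o$; it is now the radius for which

$$\mu(B_o) = \frac{4}{n} \max\left(k, \ln \frac{1}{\delta}\right).$$

Thus, when $x_1, \ldots, x_n$ are drawn independently at random from $\mu$, the expected number of them that fall in $B_o$ is at least $4k$, and by a multiplicative Chernoff bound is at least $k$ with probability at least $1 - \delta$.

The balls $B_1, B_2, \ldots$ are defined as before, and once again, we can conclude that with probability at least $1 - 2\delta$, each $B_i$ contains at most $2nC^i\mu(B_o)$ of the data points.

Any point $x_{(i)} \notin B_o$ lies in some annulus $B_j \setminus B_{j-1}$, and its contribution to the summation in $\Phi_{k,m}$ is

$$\frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \leq \frac{1}{2^{j-1}}.$$

The relationship (*) and the remainder of the argument are exactly as before.

Appendix D. A useful technical lemma

Lemma 12. Suppose that for some constants $A, B > 0$ and $d_o \geq 1$,

$$F(m) \leq A \left( \frac{B}{m} \right)^{1/d_o}$$

for all $m \geq n_o$. Pick any $0 < \beta < 1$ and define $\ell = \log_{1/\beta}(n/n_o)$. Then:

$$\sum_{i=0}^{\ell} F(\beta^i n) \leq \frac{A d_o}{1 - \beta} \left( \frac{B}{n_o} \right)^{1/d_o}$$

and, if $n_o \geq B(A/2)^{d_o}$,

$$\sum_{i=0}^{\ell} F(\beta^i n) \ln \frac{2e}{F(\beta^i n)} \leq \frac{A d_o}{1 - \beta} \left( \frac{B}{n_o} \right)^{1/d_o} \left( \frac{1}{1 - \beta} \ln \frac{1}{\beta} + \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} \right).$$

[...]

To understand this distribution, we start with a general result about sums of Bernoulli random variables. Notice that the result is exactly correct in the situation where all $p_i = 1/2$.

Lemma 13. Suppose $Z_1, \ldots, Z_N$ are independent, where $Z_i \in \{0, 1\}$ is a Bernoulli random variable with mean $0 < a_i < 1$, and $a_1 \geq a_2 \geq \cdots \geq a_N$. Let $Z = Z_1 + \cdots + Z_N$. Then for any $\ell \geq 0$,

$$\frac{\Pr(Z = \ell + 1)}{\Pr(Z = \ell)} \geq \frac{1}{\ell + 1} \sum_{i=\ell+1}^{N} \frac{a_i}{1 - a_i}.$$

Proof. Define $r_i = a_i/(1 - a_i) \in (0, \infty)$; then $r_1 \geq r_2 \geq \cdots \geq r_N$. Now, for any $\ell \geq 0$,

$$\begin{aligned}
\Pr(Z = \ell) &= \sum_{\{i_1, \ldots, i_\ell\} \subseteq [N]} a_{i_1} a_{i_2} \cdots a_{i_\ell} \prod_{j \notin \{i_1, \ldots, i_\ell\}} (1 - a_j) \\
&= \prod_{i=1}^{N} (1 - a_i) \sum_{\{i_1, \ldots, i_\ell\} \subseteq [N]} \frac{a_{i_1}}{1 - a_{i_1}} \frac{a_{i_2}}{1 - a_{i_2}} \cdots \frac{a_{i_\ell}}{1 - a_{i_\ell}} \\
&= \prod_{i=1}^{N} (1 - a_i) \sum_{\{i_1, \ldots, i_\ell\} \subseteq [N]} r_{i_1} r_{i_2} \cdots r_{i_\ell}
\end{aligned}$$

where the summations are over subsets $\{i_1, \ldots, i_\ell\}$ of $\ell$ distinct elements of $[N]$. In the final line, the product of the $(1 - a_i)$ does not depend upon $\ell$ and can be ignored. Let's focus on the summation; call it $S_\ell$. We would like to compare it to $S_{\ell+1}$.

$S_{\ell+1}$ is the sum of $\binom{N}{\ell+1}$ distinct terms, each the product of $\ell + 1$ $r_i$'s. These terms also appear in the quantity $S_\ell (r_1 + \cdots + r_N)$; in fact, each term of $S_{\ell+1}$ appears multiple times, $\ell + 1$ times to be precise. The remaining terms in $S_\ell (r_1 + \cdots + r_N)$ each contain $\ell - 1$ unique elements and one duplicated element. By accounting in this way, we get

$$\begin{aligned}
S_\ell (r_1 + \cdots + r_N) &= (\ell + 1) S_{\ell+1} + \sum_{\{i_1, \ldots, i_\ell\} \subseteq [N]} r_{i_1} r_{i_2} \cdots r_{i_\ell} (r_{i_1} + \cdots + r_{i_\ell}) \\
&\leq (\ell + 1) S_{\ell+1} + S_\ell (r_1 + \cdots + r_\ell)
\end{aligned}$$

since the $r_i$'s are arranged in decreasing order. Hence

$$\frac{\Pr(Z = \ell + 1)}{\Pr(Z = \ell)} = \frac{S_{\ell+1}}{S_\ell} \geq \frac{1}{\ell + 1} (r_{\ell+1} + \cdots + r_N),$$

as claimed.

We now apply this result directly to the sum of Bernoulli variables $Z = d_H(q, X)$.

Lemma 14. Suppose that $p_1, \ldots, p_N \in (0, 1/2)$. Pick any query $q \in \{0, 1\}^N$, and draw $X$ from distribution $\mu = B(p_1) \times \cdots \times B(p_N)$. Then for any $\ell \geq 0$,

$$\frac{\Pr(d_H(q, X) = \ell + 1)}{\Pr(d_H(q, X) = \ell)} \geq \frac{L - \ell/2}{\ell + 1},$$

where $L = \sum_i p_i$ is the expected number of words in $X$.

[...]

Suppose that for some $i > k$, point $x_{(i)}$ is at Hamming distance $\ell$ from $q$, that is, $x_{(i)} \in S_\ell$. Then

$$\frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \leq \sqrt{\frac{v}{\ell}}$$

since Euclidean distance is the square root of Hamming distance. In bounding $\Phi_{k,m}$, we need to gauge the range of Hamming distances spanned by $x_{(k+1)}, \ldots, x_{(m)}$. The geometric growth rate of part (c) implies that most points lie at Hamming distance $c_o L$ or greater from $q$. It also means that $d_H(q, x_{(m)}) > c_o L - \log_2(n/m)$. Thus,

$$\begin{aligned}
\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) &= \frac{1}{m} \sum_{i > k} \frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \\
&\leq \frac{1}{m} \sum_{\ell \geq v} |S_\ell \cap \{x_{(1)}, \ldots, x_{(m)}\}| \sqrt{\frac{v}{\ell}} \leq 4 \sqrt{\frac{v}{c_o L - \log_2(n/m)}}
\end{aligned}$$

where the last step follows by lower-bounding $|S_\ell|$ by an increasing geometric series.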
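The ratio bound of Lemma 13 can be checked exactly on small examples. The following Python sketch (function names are hypothetical, chosen for this illustration) computes the distribution of a sum of independent Bernoulli variables by dynamic programming and compares consecutive probability ratios with the right-hand side of the lemma. It also checks the remark that the bound is tight when every mean is 1/2.

```python
def sum_pmf(means):
    """Exact pmf of a sum of independent Bernoulli variables with the given means."""
    pmf = [1.0]
    for a in means:
        nxt = [0.0] * (len(pmf) + 1)
        for z, p in enumerate(pmf):
            nxt[z] += p * (1.0 - a)   # this variable equals 0
            nxt[z + 1] += p * a       # this variable equals 1
        pmf = nxt
    return pmf

def ratio_lower_bound(means, ell):
    """Right-hand side of Lemma 13; `means` must be sorted in decreasing order."""
    tail = means[ell:]  # a_{ell+1}, ..., a_N in the lemma's 1-based indexing
    return sum(a / (1.0 - a) for a in tail) / (ell + 1)

means = sorted([0.45, 0.3, 0.3, 0.2, 0.1, 0.05], reverse=True)
pmf = sum_pmf(means)
# gap >= 0 at every ell means Pr(Z = ell+1) / Pr(Z = ell) respects the bound.
gaps = [pmf[ell + 1] / pmf[ell] - ratio_lower_bound(means, ell)
        for ell in range(len(means))]

# Equality case noted in the text: all means equal to 1/2.
half = [0.5] * 8
half_pmf = sum_pmf(half)
half_gaps = [half_pmf[ell + 1] / half_pmf[ell] - ratio_lower_bound(half, ell)
             for ell in range(len(half))]
```

Every entry of `gaps` should be nonnegative up to floating-point rounding (the entry for $\ell = 0$ is zero in exact arithmetic), and every entry of `half_gaps` should vanish.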
