JMLR Workshop and Conference Proceedings vol
Randomized partition trees for exact

ucsdedu, Department of Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093
Kaushik Sinha, kaushiksinhawichitaedu, Department of Electrical Engineering and Computer Science, Wichita State University, 1845

Dasgupta Sinha

[...] tree, which takes $O(n)$ space and answers queries with an approximation factor $c = 1 + \epsilon$ in $O((6/\epsilon)^d \log n)$ time (Arya et al., 1998).

In some of these results, an exponential dependence on dimension is evident, and indeed this is a familiar blot on the nearest neighbor landscape. One way to mitigate the curse of dimensionality is to consider situations in which data have low intrinsic dimension $d_o$, even if they happen to lie in $\mathbb{R}^d$ for $d \gg d_o$ or in a general metric space.

A common assumption is that the data are drawn from a doubling measure of dimension $d_o$ (or equivalently, have expansion rate $2^{d_o}$); this is defined in Section 4.1 below. Under this condition, Karger and Ruhl (2002) have a scheme that gives exact answers to nearest neighbor queries in time $O(2^{3 d_o} \log n)$, using a data structure of size $O(2^{3 d_o} n)$. The more recent cover tree algorithm (Beygelzimer et al., 2006), which has been used quite widely, creates a data structure in space $O(n)$ and answers queries in time $O(2^{d_o} \log n)$.

There is also work that combines intrinsic dimension and approximate search. The navigating net (Krauthgamer and Lee, 2004), given data from a metric space of doubling dimension $d_o$, has size $O(2^{O(d_o)} n)$ and gives a $(1+\epsilon)$-approximate answer to queries in time $O(2^{O(d_o)} \log n + (1/\epsilon)^{O(d_o)})$; the crucial advantage here is that doubling dimension is a more general and robust notion than doubling measure.

Despite these and many other results, there are two significant deficiencies in the nearest neighbor literature that have motivated the present paper. First, existing analyses have succeeded at identifying, for a given data structure, highly specific families of data for which efficient exact NN search is possible (for instance, data from doubling measures) but have failed to provide a more general characterization. Second, there remains a class of nearest neighbor data structures that are popular and successful in practice, but that have not been analyzed thoroughly. These structures combine classical k-d tree partitioning with randomization and overlapping cells, and are the subject of this paper.

1.1. Three randomized tree structures for exact NN search

The k-d tree is a partition of $\mathbb{R}^d$ into hyper-rectangular cells, based on a set of data points (Bentley, 1975). The root of the tree is a single cell corresponding to the entire space. A coordinate direction is chosen, and the cell is split at the median of the data along this direction (Figure 1, left). The process is then recursed on the two newly created cells, and continues until all leaf cells contain at most some predetermined number $n_o$ of points. When there are $n$ data points, the depth of the tree is at most about $\log(n/n_o)$.

Given a k-d tree built from data points $S$, there are several ways to answer a nearest neighbor query $q$. The quickest and dirtiest of these is to move $q$ down the tree to its appropriate leaf cell, and then return the nearest neighbor in that cell. This defeatist search takes time just $O(n_o + \log(n/n_o))$, which is $O(\log n)$ for constant $n_o$. The problem is that $q$'s nearest neighbor may well lie in a different cell, for instance when the data happen to be concentrated near cell boundaries. Consequently, the failure probability of this scheme can be unacceptably high.

Over the years, some simple tricks have emerged, from various sources, for reducing the failure probability. These are nicely laid out by Liu et al. (2004), who show experimentally that the resulting algorithms are effective in practice.

Figure 3: Three types of split. The fractions refer to probability mass.
$\alpha$ is some constant, while $\beta$ is chosen uniformly at random from $[1/4, 3/4]$.

[...] unit sphere. But now, three split points are noted: the median $m(C)$ of the data along direction $U$, the $(1/2) - \alpha$ fractile value $l(C)$, and the $(1/2) + \alpha$ fractile value $r(C)$. Here $\alpha$ is a small constant, like 0.05 or 0.1. The idea is to simultaneously entertain a median split

$$\text{left} = \{x : x \cdot U < m(C)\}, \qquad \text{right} = \{x : x \cdot U \geq m(C)\}$$

and an overlapping split (with the middle $2\alpha$ fraction of the data falling on both sides)

$$\text{left} = \{x : x \cdot U < r(C)\}, \qquad \text{right} = \{x : x \cdot U \geq l(C)\}.$$

In the spill tree (Liu et al., 2004), each data point in $S$ is stored in multiple leaves, by following the overlapping splits. A query is then answered defeatist-style, by routing it to a single leaf using median splits.

Both the RP tree and the spill tree have query times of $O(n_o + \log(n/n_o))$, but the latter can be expected to have a lower failure probability, and we will see this in the bounds we obtain. On the other hand, the RP tree requires just linear space, while the size of the spill tree is $O(n^{1/(1 - \lg(1 + 2\alpha))})$. When $\alpha = 0.05$, for instance, the size is $O(n^{1.159})$.

In view of these tradeoffs, we consider a further variant, which we call the virtual spill tree. It stores each data point in a single leaf, following median splits, and hence has linear size. However, each query is routed to multiple leaves, using overlapping splits, and the return value is its nearest neighbor in the union of these leaves.

The various splits are summarized in Figure 3, and the three trees use them as follows:

                      Routing data         Routing queries
RP tree               Perturbed split      Perturbed split
Spill tree            Overlapping split    Median split
Virtual spill tree    Median split         Overlapping split

One small technicality: if, for instance, there are duplicates among the data points, it might not be possible to achieve a median split, or a split at a desired fractile. We will ignore these discretization problems.

2. A potential function for point configurations

To motivate the potential function, we start by considering what happens when there are just two data points and one query point.

2.1. How random projection affects the relative placement of three points

Consider any $q, x, y \in \mathbb{R}^d$, such that $x$ is closer to $q$ than is $y$; that is, $\|q - x\| \leq \|q - y\|$. Now suppose that a random direction $U$ is chosen from the unit sphere $S^{d-1}$, and that the points are projected onto this direction. What is the probability that $y$ falls between $q$ and $x$ on this line? The following lemma answers this question exactly. An approximate solution, with a different proof method, was given earlier by Kleinberg (1997).

Lemma 1. Pick any $q, x, y \in \mathbb{R}^d$ with $\|q - x\| \leq \|q - y\|$. Pick a random unit direction $U$. The probability, over $U$, that $y \cdot U$ falls (strictly) between $q \cdot U$ and $x \cdot U$ is

$$\frac{1}{\pi} \arcsin\left( \frac{\|q - x\|}{\|q - y\|} \sqrt{1 - \left( \frac{(q - x) \cdot (y - x)}{\|q - x\| \, \|y - x\|} \right)^2} \right).$$

Proof. We may assume $U$ is drawn from $N(0, I_d)$, the $d$-dimensional Gaussian with mean zero and unit covariance. This gives the right distribution if we scale $U$ to unit length, but we can skip this last step since it has no effect on the question at hand.

We can also assume, without loss of generality, that $q$ lies at the origin and that $x$ lies along the (positive) $x_1$-axis: that is, $q = 0$ and $x = \|x\| e_1$. It will then be helpful to split the direction $U$ into two pieces, its component $U_1$ in the $x_1$-direction, and the remaining $d - 1$ coordinates $U_R$. Likewise, we will write $y = (y_1, y_R)$.

If $y_R = 0$ then $x$, $y$, and $q$ are collinear, and the projection of $y$ cannot possibly fall between those of $x$ and $q$. Henceforth assume $y_R \neq 0$.

Let $E$ denote the event of interest:

$$E \equiv y \cdot U \text{ falls between } q \cdot U \text{ (that is, 0) and } x \cdot U \text{ (that is, } \|x\| U_1)$$
$$\equiv y_R \cdot U_R \text{ falls between } -y_1 U_1 \text{ and } (\|x\| - y_1) U_1.$$

The interval of interest is either $(-y_1 |U_1|, (\|x\| - y_1)|U_1|)$, if $U_1 \geq 0$, or $(-(\|x\| - y_1)|U_1|, y_1 |U_1|)$, if $U_1 < 0$.

Now $y_R \cdot U_R$ is independent of $U_1$ and is distributed as $N(0, \|y_R\|^2)$, which is symmetric and thus assigns the same probability mass to the two intervals. Therefore

$$\Pr_U(E) = \Pr_{U_1} \Pr_{U_R}\left( -y_1 |U_1| < y_R \cdot U_R < (\|x\| - y_1)|U_1| \right).$$

Let $Z$ and $Z'$ be independent standard normals $N(0, 1)$. Since $U_1$ is distributed as $Z$ and $y_R \cdot U_R$ is distributed as $\|y_R\| Z'$,

$$\Pr_U(E) = \Pr\left( -y_1 |Z| < \|y_R\| Z' < (\|x\| - y_1)|Z| \right) = \Pr\left( \frac{Z'}{|Z|} \in \left( -\frac{y_1}{\|y_R\|}, \frac{\|x\| - y_1}{\|y_R\|} \right) \right).$$

[...]

In the tree data structures we analyze, most cells contain only a subset of the data $\{x_1, \ldots, x_n\}$. For a cell that contains $m$ of these points, the appropriate variant of $\Phi$ is

$$\Phi_m(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=2}^{m} \frac{\|q - x_{(1)}\|}{\|q - x_{(i)}\|}.$$

Corollary 4. Pick any points $q, x_1, \ldots, x_n$ and let $S$ denote any subset of the $x_i$ that includes $x_{(1)}$. If $q$ and the points in $S$ are projected to a direction $U$ chosen at random from the unit sphere, then for any $0 < \alpha < 1$, the probability (over $U$) that at least an $\alpha$ fraction of the projected $S$ falls between $q$ and $x_{(1)}$ is upper-bounded by $(1/(2\alpha)) \, \Phi_{|S|}(q, \{x_1, \ldots, x_n\})$.

Proof. Apply Theorem 3 to $S$, noting that the corresponding value of $\Phi$ is maximized when $S$ consists of the points closest to $q$; and then apply Markov's inequality.

2.3. Extension to k nearest neighbors

If we are interested in finding the $k$ nearest neighbors, a suitable generalization of $\Phi_m$ is

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i=k+1}^{m} \frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|}.$$

Theorem 5. Pick any points $q, x_1, \ldots, x_n$ and let $S$ denote a subset of the $x_i$ that includes $x_{(1)}, \ldots, x_{(k)}$. Suppose $q$ and the points in $S$ are projected to a random unit direction $U$. Then, for any $(k-1)/|S| < \alpha < 1$, the probability (over $U$) that in the projection, there is some $1 \leq j \leq k$ for which $\geq \alpha m$ points lie between $x_{(j)}$ and $q$ is at most

$$\frac{k}{2(\alpha - (k-1)/|S|)} \, \Phi_{k,|S|}(q, \{x_1, \ldots, x_n\}).$$

This theorem, and many of the others that follow, are proved in the appendix.

3. Randomized partition trees

We'll now see that the failure probability of the random projection tree is proportional to $\Phi \ln(1/\Phi)$, while that of the two spill trees is proportional to $\Phi$. We start with the second result, since it is the more straightforward of the two.

3.1. Randomized spill trees

In a randomized spill tree, each cell is split along a direction chosen uniformly at random from the unit sphere. Two kinds of splits are simultaneously considered: (1) a split at the median (along the random direction), and (2) an overlapping split with one part containing the bottom $1/2 + \alpha$ fraction of the cell's points, and the other part containing the top $1/2 + \alpha$ fraction, where $0 < \alpha < 1/2$ (recall Figure 3).
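The two kinds of split considered at a single cell can be illustrated with a short sketch. This is not the authors' implementation: the function names (`random_unit_direction`, `fractile`, `spill_splits`) and the index convention used to pick a fractile value are assumptions made for illustration, and duplicate projections are ignored, as in the text.

```python
import math
import random


def random_unit_direction(d, rng):
    """Draw a direction uniformly from the unit sphere by normalizing a Gaussian vector."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:
            return [v / norm for v in g]


def fractile(sorted_vals, frac):
    """Value at the given fractile of an ascending list (index convention is an assumption)."""
    idx = int(math.floor(frac * len(sorted_vals)))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, idx))]


def spill_splits(points, alpha, rng):
    """Return index lists for the median split and the overlapping split of one cell."""
    u = random_unit_direction(len(points[0]), rng)
    proj = [sum(a * b for a, b in zip(x, u)) for x in points]
    s = sorted(proj)
    m = fractile(s, 0.5)            # median m(C)
    lo = fractile(s, 0.5 - alpha)   # l(C), the (1/2 - alpha) fractile
    hi = fractile(s, 0.5 + alpha)   # r(C), the (1/2 + alpha) fractile
    idx = range(len(points))
    return {
        "median_left": [i for i in idx if proj[i] < m],
        "median_right": [i for i in idx if proj[i] >= m],
        "overlap_left": [i for i in idx if proj[i] < hi],
        "overlap_right": [i for i in idx if proj[i] >= lo],
    }
```

In this sketch a spill tree would route data with the `overlap_*` lists and queries with the `median_*` lists, while a virtual spill tree would do the reverse; the points appearing in both `overlap_left` and `overlap_right` are the middle $2\alpha$ fraction.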
Randomized trees for NN search

3.3. Could coordinate directions be used?

The tree data structures we have studied make crucial use of random projection for splitting cells. It would not suffice to use coordinate directions, as in k-d trees.

To see this, consider a simple example. Let $q$, the query point, be the origin, and suppose the data points $x_1, \ldots, x_n \in \mathbb{R}^d$ are chosen as follows: $x_1$ is the all-ones vector. Each $x_i$, $i > 1$, is chosen by picking a coordinate at random, setting its value to $M$, and then setting all remaining coordinates to uniform-random numbers in the range $(0, 1)$. Here $M$ is some very large constant.

For large enough $M$, the nearest neighbor of $q$ is $x_1$. By letting $M$ grow further, we can let $\Phi(q, \{x_1, \ldots, x_n\})$ get arbitrarily close to zero, which means that the random projection methods will work well. However, any coordinate projection will create a disastrously large separation between $q$ and $x_1$: on average, a $(1 - 1/d)$ fraction of the data points will fall between them.

4. Bounding $\Phi$

The exact nearest neighbor schemes we analyze have error probabilities related to $\Phi$, which lies in the range $[0, 1]$. The worst case is when all points are equidistant, in which case $\Phi$ is exactly 1, but this is a pathological situation. Is it possible to bound $\Phi$ under simple assumptions on the data?

In this section we study two such assumptions. In each case, query points are arbitrary, but the data are assumed to have been drawn i.i.d. from an underlying distribution.

4.1. Data drawn from a doubling measure

Suppose the data points are drawn from a distribution $\mu$ on $\mathbb{R}^d$ which is a doubling measure: that is, there exist a constant $C > 0$ and a subset $X \subseteq \mathbb{R}^d$ such that

$$\mu(B(x, 2r)) \leq C \cdot \mu(B(x, r)) \quad \text{for all } x \in X \text{ and all } r > 0.$$

Here $B(x, r)$ is the closed Euclidean ball of radius $r$ centered at $x$. To understand this condition, it is helpful to also look at an alternative formulation that is essentially equivalent: there exist a constant $d_o > 0$ and a subset $X \subseteq \mathbb{R}^d$ such that for all $x \in X$, all $r > 0$, and all $\alpha \geq 1$, we have $\mu(B(x, \alpha r)) \leq \alpha^{d_o} \cdot \mu(B(x, r))$. In other words, the probability mass of a ball grows polynomially in the radius. Comparing this to the standard formula for the volume of a ball, we see that the degree of this polynomial, $d_o$ ($= \log_2 C$), can reasonably be thought of as the "dimension" of measure $\mu$.

Theorem 8. Suppose $\mu$ is continuous on $\mathbb{R}^d$ and is a doubling measure of dimension $d_o \geq 2$. Pick any $q \in X$ and draw $x_1, \ldots, x_n \sim \mu$. Pick any $0 < \delta < 1/2$. With probability $\geq 1 - 3\delta$ [...]

[...] The overall distribution is thus a mixture $\mu = w_1 \mu_1 + \cdots + w_t \mu_t$ whose $j$th component is a Bernoulli product distribution $\mu_j = B(p^{(j)}_1) \times \cdots \times B(p^{(j)}_N)$. Here $B(p)$ is a shorthand for the distribution on $\{0, 1\}$ with expected value $p$. It will simplify things to assume that $0 < p^{(j)}_i < 1/2$; this is not a huge assumption if, say, stopwords have been removed.

For the purposes of bounding $\Phi$, we are interested in the distribution of $d_H(q, X)$, where $X$ is chosen from $\mu$ and $d_H$ denotes Hamming distance. This is a sum of small independent quantities, and it is customary to approximate such sums by a Poisson distribution. In the current context, however, this approximation is rather poor, and we instead use counting arguments to directly bound how rapidly the distribution grows.

The results stand in stark contrast to those we obtained for doubling measures, and reveal this to be a substantially more difficult setting for nearest neighbor search. For a doubling measure, the probability mass of a ball $B(q, r)$ doubles whenever $r$ is multiplied by a constant. In our present setting, it doubles whenever $r$ is increased by an additive constant:

Theorem 10. Suppose that all $p^{(j)}_i \in (0, 1/2)$. Let $L_j = \sum_i p^{(j)}_i$ denote the expected number of words in a document from topic $j$, and let $L = \min(L_1, \ldots, L_t)$. Pick any query $q \in \{0, 1\}^N$, and draw $X \sim \mu$. For any $\ell \geq 0$,

$$\frac{\Pr(d_H(q, X) = \ell + 1)}{\Pr(d_H(q, X) = \ell)} \geq \frac{L - \ell/2}{\ell + 1}.$$

Now, fix a particular query $q \in \{0, 1\}^N$, and draw $x_1, \ldots, x_n$ from distribution $\mu$.

Lemma 11. There is an absolute constant $c_o$ for which the following holds. Pick any $0 < \delta < 1$ and any $k \geq 1$, and let $v$ denote the smallest integer for which $\Pr_X(d_H(q, X) \leq v) \geq (8/n) \max(k, \ln 1/\delta)$. Then with probability at least $1 - 3\delta$ over the choice of $x_1, \ldots, x_n$, for any $m \leq n$,

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) \leq 4 \sqrt{\frac{v}{c_o L - \log_2(n/m)}}.$$

The implication of this lemma is that for any of the three tree data structures, the failure probability at a single level is roughly $\sqrt{v/L}$. This means that the tree can only be grown to depth $O(\sqrt{L/v})$, and thus the query time is dominated by $n_o = n \cdot 2^{-O(\sqrt{L/v})}$. When $n$ is large, we expect $v$ to be small, and thus the query time improves over exhaustive search by a factor of roughly $2^{-\sqrt{L}}$.

Acknowledgments

We thank the National Science Foundation for support under grant IIS-1162581.

References

N. Ailon and B. Chazelle. The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM Journal on Computing, 39:302-322, 2009.

Appendix B. Proof of Theorem 7

Consider any internal node of the tree that contains $q$ as well as $m$ of the data points, including $x_{(1)}$. What is the probability that the split at that node separates $q$ from $x_{(1)}$? To analyze this, let $F$ denote the fraction of the $m$ points that fall between $q$ and $x_{(1)}$ along the randomly-chosen split direction. Since the split point is chosen at random from an interval of mass $1/2$, the probability that it separates $q$ from $x_{(1)}$ is at most $F/(1/2)$. Integrating out $F$, we get

$$\Pr(q \text{ is separated from } x_{(1)}) \leq \int_0^1 \Pr(F = f) \frac{f}{1/2} \, df = 2 \int_0^1 \Pr(F > f) \, df \leq 2 \int_0^1 \min\left(1, \frac{\Phi_m}{2f}\right) df$$
$$= 2 \int_0^{\Phi_m/2} df + 2 \int_{\Phi_m/2}^1 \frac{\Phi_m}{2f} \, df = \Phi_m \ln \frac{2e}{\Phi_m},$$

where the second inequality uses Corollary 4.

The lemma follows by taking a union bound over the path that conveys $q$ from root to leaf, in which the number of data points per level shrinks geometrically, by a factor of $3/4$ or better.

The same reasoning generalizes to $k$ nearest neighbors. This time, $F$ is defined to be the fraction of the $m$ points that lie between $q$ and the furthest of $x_{(1)}, \ldots, x_{(k)}$ along the random splitting direction. Then $q$ is separated from one of these neighbors only if the split point lies in an interval of mass $F$ on either side of $q$, an event that occurs with probability at most $2F/(1/2)$. Using Theorem 5,

$$\Pr(q \text{ is separated from some } x_{(j)}, 1 \leq j \leq k) \leq \int_0^1 \Pr(F = f) \frac{2f}{1/2} \, df = 4 \int_0^1 \Pr(F > f) \, df$$
$$\leq 4 \int_0^1 \min\left(1, \frac{k \Phi_{k,m}}{2(f - (k-1)/m)}\right) df$$
$$\leq 4 \int_0^{(k \Phi_{k,m}/2) + (k-1)/m} df + 4 \int_{(k \Phi_{k,m}/2) + (k-1)/m}^1 \frac{k \Phi_{k,m}}{2(f - (k-1)/m)} \, df$$
$$\leq 2k \Phi_{k,m} \ln \frac{2e}{k \Phi_{k,m}} + \frac{4(k-1)}{m},$$

and as before, we sum this over a root-to-leaf path in the tree.

Appendix C. Proof of Theorem 8

C.1. The k = 1 case

We will consider a collection of balls $B_o, B_1, B_2, \ldots$ centered at $q$, with geometrically increasing radii $r_o, r_1, r_2, \ldots$, respectively. For $i \geq 1$, we will take $r_i = 2^i r_o$. Thus by the doubling condition, $\mu(B_i) \leq C^i \mu(B_o)$, where $C = 2^{d_o} \geq 4$.
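As a numerical sanity check on the single-neighbor calculation in Appendix B, the sketch below compares a midpoint-rule approximation of $2 \int_0^1 \min(1, \Phi_m/(2f)) \, df$ with the closed form $\Phi_m \ln(2e/\Phi_m)$. The function names and the choice of quadrature are assumptions made for illustration; nothing here is part of the paper's argument.

```python
import math


def separation_bound_numeric(phi, steps=100000):
    """Midpoint-rule approximation of 2 * integral over [0, 1] of min(1, phi / (2 f)) df."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        f = (i + 0.5) * h          # midpoint of the i-th subinterval, never zero
        total += min(1.0, phi / (2.0 * f))
    return 2.0 * total * h


def separation_bound_closed(phi):
    """Closed form phi * ln(2e / phi), valid for 0 < phi <= 1."""
    return phi * math.log(2.0 * math.e / phi)


if __name__ == "__main__":
    for phi in (0.05, 0.3, 1.0):
        print(phi, separation_bound_numeric(phi), separation_bound_closed(phi))
```

The two values should agree to several decimal places for any potential value in $(0, 1]$, which matches the split of the integral at $f = \Phi_m/2$ used in the derivation.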
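[...] where the last inequality comes from (*). To lower-bound $2^\ell$, we again use (*) to get $C^\ell \geq m/(2n\mu(B_o))$, whereupon

$$2^\ell \geq \left( \frac{m}{2n\mu(B_o)} \right)^{1/\log_2 C} = \left( \frac{m}{2 \ln(1/\delta)} \right)^{1/\log_2 C}$$

and we're done.

C.2. The k > 1 case

The only big change is in the definition of $r_o$; it is now the radius for which

$$\mu(B_o) = \frac{4}{n} \max\left(k, \ln \frac{1}{\delta}\right).$$

Thus, when $x_1, \ldots, x_n$ are drawn independently at random from $\mu$, the expected number of them that fall in $B_o$ is at least $4k$, and by a multiplicative Chernoff bound is at least $k$ with probability $\geq 1 - \delta$.

The balls $B_1, B_2, \ldots$ are defined as before, and once again, we can conclude that with probability $\geq 1 - 2\delta$, each $B_i$ contains at most $2nC^i \mu(B_o)$ of the data points.

Any point $x_{(i)} \notin B_o$ lies in some annulus $B_j \setminus B_{j-1}$, and its contribution to the summation in $\Phi_{k,m}$ is

$$\frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \leq \frac{1}{2^{j-1}}.$$

The relationship (*) and the remainder of the argument are exactly as before.

Appendix D. A useful technical lemma

Lemma 12. Suppose that for some constants $A, B > 0$ and $d_o \geq 1$,

$$F(m) \leq A \left( \frac{B}{m} \right)^{1/d_o}$$

for all $m \geq n_o$. Pick any $0 < \beta < 1$ and define $\ell = \log_{1/\beta}(n/n_o)$. Then:

$$\sum_{i=0}^{\ell} F(\beta^i n) \leq \frac{A d_o}{1 - \beta} \left( \frac{B}{n_o} \right)^{1/d_o}$$

and, if $n_o \geq B(A/2)^{d_o}$,

$$\sum_{i=0}^{\ell} F(\beta^i n) \ln \frac{2e}{F(\beta^i n)} \leq \frac{A d_o}{1 - \beta} \left( \frac{B}{n_o} \right)^{1/d_o} \left( \frac{1}{1 - \beta} \ln \frac{1}{\beta} + \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} \right).$$

[...] To understand this distribution, we start with a general result about sums of Bernoulli random variables. Notice that the result is exactly correct in the situation where all $p_i = 1/2$.

Lemma 13. Suppose $Z_1, \ldots, Z_N$ are independent, where $Z_i \in \{0, 1\}$ is a Bernoulli random variable with mean $0 < a_i < 1$, and $a_1 \geq a_2 \geq \cdots \geq a_N$. Let $Z = Z_1 + \cdots + Z_N$. Then for any $\ell \geq 0$,

$$\frac{\Pr(Z = \ell + 1)}{\Pr(Z = \ell)} \geq \frac{1}{\ell + 1} \sum_{i=\ell+1}^{N} \frac{a_i}{1 - a_i}.$$

Proof. Define $r_i = a_i/(1 - a_i) \in (0, \infty)$; then $r_1 \geq r_2 \geq \cdots \geq r_N$. Now, for any $\ell \geq 0$,

$$\Pr(Z = \ell) = \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} a_{i_1} a_{i_2} \cdots a_{i_\ell} \prod_{j \notin \{i_1, \ldots, i_\ell\}} (1 - a_j)$$
$$= \prod_{i=1}^{N} (1 - a_i) \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} \frac{a_{i_1}}{1 - a_{i_1}} \frac{a_{i_2}}{1 - a_{i_2}} \cdots \frac{a_{i_\ell}}{1 - a_{i_\ell}}$$
$$= \prod_{i=1}^{N} (1 - a_i) \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} r_{i_1} r_{i_2} \cdots r_{i_\ell},$$

where the summations are over subsets $\{i_1, \ldots, i_\ell\}$ of $\ell$ distinct elements of $[N]$. In the final line, the product of the $(1 - a_i)$ does not depend upon $\ell$ and can be ignored. Let's focus on the summation; call it $S_\ell$. We would like to compare it to $S_{\ell+1}$.

$S_{\ell+1}$ is the sum of $\binom{N}{\ell+1}$ distinct terms, each the product of $\ell + 1$ $r_i$'s. These terms also appear in the quantity $S_\ell (r_1 + \cdots + r_N)$; in fact, each term of $S_{\ell+1}$ appears multiple times, $\ell + 1$ times to be precise. The remaining terms in $S_\ell (r_1 + \cdots + r_N)$ each contain $\ell - 1$ unique elements and one duplicated element. By accounting in this way, we get

$$S_\ell (r_1 + \cdots + r_N) = (\ell + 1) S_{\ell+1} + \sum_{\{i_1, \ldots, i_\ell\} \subset [N]} r_{i_1} r_{i_2} \cdots r_{i_\ell} (r_{i_1} + \cdots + r_{i_\ell})$$
$$\leq (\ell + 1) S_{\ell+1} + S_\ell (r_1 + \cdots + r_\ell)$$

since the $r_i$'s are arranged in decreasing order. Hence

$$\frac{\Pr(Z = \ell + 1)}{\Pr(Z = \ell)} = \frac{S_{\ell+1}}{S_\ell} \geq \frac{1}{\ell + 1} (r_{\ell+1} + \cdots + r_N),$$

as claimed.

We now apply this result directly to the sum of Bernoulli variables $Z = d_H(q, X)$.

Lemma 14. Suppose that $p_1, \ldots, p_N \in (0, 1/2)$. Pick any query $q \in \{0, 1\}^N$, and draw $X$ from distribution $\mu = B(p_1) \times \cdots \times B(p_N)$. Then for any $\ell \geq 0$,

$$\frac{\Pr(d_H(q, X) = \ell + 1)}{\Pr(d_H(q, X) = \ell)} \geq \frac{L - \ell/2}{\ell + 1},$$

where $L = \sum_i p_i$ is the expected number of words in $X$.

[...] Suppose that for some $i > k$, point $x_{(i)}$ is at Hamming distance $\ell$ from $q$, that is, $x_{(i)} \in S_\ell$. Then

$$\frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \leq \sqrt{\frac{v}{\ell}}$$

since Euclidean distance is the square root of Hamming distance. In bounding $\Phi_{k,m}$, we need to gauge the range of Hamming distances spanned by $x_{(k+1)}, \ldots, x_{(m)}$. The geometric growth rate of part (c) implies that most points lie at Hamming distance $c_o L$ or greater from $q$. It also means that $d_H(q, x_{(m)}) > c_o L - \log_2(n/m)$. Thus,

$$\Phi_{k,m}(q, \{x_1, \ldots, x_n\}) = \frac{1}{m} \sum_{i > k} \frac{(\|q - x_{(1)}\| + \cdots + \|q - x_{(k)}\|)/k}{\|q - x_{(i)}\|} \leq \frac{1}{m} \sum_{\ell \geq v} |S_\ell \cap \{x_{(1)}, \ldots, x_{(m)}\}| \sqrt{\frac{v}{\ell}} \leq 4 \sqrt{\frac{v}{c_o L - \log_2(n/m)}},$$

where the last step follows by lower-bounding $|S_\ell|$ by an increasing geometric series.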
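The ratio bound of Lemma 13 can be checked exactly on small instances. The sketch below computes the distribution of a sum of independent Bernoulli variables by dynamic programming and compares each consecutive ratio with the lemma's lower bound. The function names and the small numerical tolerance are assumptions made for illustration.

```python
import random


def bernoulli_sum_pmf(probs):
    """Exact pmf of Z = Z_1 + ... + Z_N for independent Bernoulli(a_i), by dynamic programming."""
    pmf = [1.0]
    for a in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for j, p in enumerate(pmf):
            nxt[j] += p * (1.0 - a)    # Z_i = 0
            nxt[j + 1] += p * a        # Z_i = 1
        pmf = nxt
    return pmf


def lemma13_holds(probs, rel_tol=1e-9):
    """Check Pr(Z = l+1) / Pr(Z = l) >= (1/(l+1)) * sum_{i > l} a_i / (1 - a_i) for all l."""
    a = sorted(probs, reverse=True)        # the lemma assumes a_1 >= a_2 >= ... >= a_N
    r = [x / (1.0 - x) for x in a]
    pmf = bernoulli_sum_pmf(a)
    for l in range(len(a)):
        ratio = pmf[l + 1] / pmf[l]
        bound = sum(r[l:]) / (l + 1)       # r[l:] holds r_{l+1}, ..., r_N in 1-based indexing
        if ratio < bound * (1.0 - rel_tol):
            return False
    return True


if __name__ == "__main__":
    rng = random.Random(1)
    sample = [rng.uniform(0.05, 0.95) for _ in range(12)]
    print(lemma13_holds(sample))
```

When every mean equals 1/2, all $r_i$ equal 1 and the bound becomes $(N - \ell)/(\ell + 1)$, which is the exact ratio of consecutive binomial coefficients; this is the case the text describes as exactly correct.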