1 Introduction

In recent years, the data-stream model has enjoyed significant attention because of the need to process massive data sets (e.g. Henzinger et al., 1999; Alon et al., 1999; Feigenbaum et al., 2002). A streaming computation is a sublinear-space algorithm that reads the input in sequential order; any item not explicitly remembered is inaccessible. A fundamental problem in the model is the estimation of distances between two objects that are determined by the stream, e.g., the network traffic matrices at two routers. Estimation of distances allows us to construct approximate representations, e.g., histograms, wavelets, Fourier summaries, or equivalently, to find models of the input stream, since this problem reduces to finding the "closest" representation in a suitable class. In this paper, the objects of interest are empirical probability distributions defined by a stream of updates as follows.

Definition 1. For a data stream $S = \langle a_1, \ldots, a_m \rangle$ where $a_i \in \{p, q\} \times [n]$ we define empirical distributions $p$ and $q$ as follows. Let $m(p)_i = |\{j : a_j = \langle p, i \rangle\}|$, $m(p) = |\{j : a_j = \langle p, \cdot \rangle\}|$ and $p_i = m(p)_i / m(p)$. Similarly for $q$.

One of the cornerstones in the theory of data stream algorithms has been the result of Alon et al. (1999). They showed that it is possible to estimate $\ell_2(p, q) := \|p - q\|_2$ (the Euclidean distance) up to a $(1 + \epsilon)$ factor using only $\mathrm{poly}(\epsilon^{-1}, \log n)$ space. The algorithm can, in retrospect, be viewed in terms of the famous embedding result of Johnson & Lindenstrauss (1984). This result implies that for any two vectors $p$ and $q$ and a $k \times n$ matrix $A$ whose entries are independent Normal(0, 1) random variables (scaled appropriately),

\[ (1 + \epsilon)^{-1} \ell_2(p, q) \le \ell_2(Ap, Aq) \le (1 + \epsilon)\, \ell_2(p, q) \]

with high probability for some $k = \mathrm{poly}(\epsilon^{-1}, \log n)$. Alon, Matias, and Szegedy demonstrated that an "effective" $A$ can be stored in small space and can be used to maintain a small-space, updateable summary, or sketch, of $p$ and $q$. The $\ell_2$ distance between $p$ and $q$ can then be estimated using only the sketches of $p$ and $q$. While Brinkman & Charikar (2003) proved that there was no analog of the Johnson-Lindenstrauss result for $\ell_1$, Indyk (2000) demonstrated that $\ell_1(p, q)$ could also be estimated in $\mathrm{poly}(\epsilon^{-1}, \log n)$ space by using Cauchy(0, 1) random variables rather than Normal(0, 1) random variables. The results extended to all $\ell_p$-measures with $0 < p \le 2$ using stable distributions. Over a sequence of papers (Saks & Sun, 2002; Chakrabarti et al., 2003; Cormode et al., 2003; Bar-Yossef et al., 2004; Indyk & Woodruff, 2005; Bhuvanagiri et al., 2006; Cormode & Ganguly, 2007) $\ell_p$ and Hamming distances have become well understood. Concurrently, several methods of creating summary representations of streams have been proposed (Broder et al., 2000; Charikar et al., 2002; Cormode & Muthukrishnan, 2005) for a variety of applications; in terms of distances they can be adapted to compute the Jaccard coefficient (symmetric difference over union) for two sets. One of the principal motivations of this work is to characterize the distances that can be sketched.

[...]

between boosting algorithms and logistic loss, and subsequently over a series of papers (Lafferty et al., 1997; Lafferty, 1999; Kivinen & Warmuth, 1999; Collins et al., 2002), the study of Bregman divergences and information geometry has become the method of choice for studying exponential loss functions. The connection between loss functions and $f$-divergences was investigated more recently by Nguyen et al. (2005).

Definition 3 (Decomposable Bregman Divergences). Let $p$ and $q$ be two $n$-point distributions. A strictly convex function $F : (0, 1] \to \mathbb{R}$ gives rise to a Bregman divergence,

\[ B_F(p, q) = \sum_{i \in [n]} \left[ F(p_i) - F(q_i) - (p_i - q_i) F'(q_i) \right]. \]

Perhaps the most familiar Bregman divergence is $\ell_2^2$ with $F(z) = z^2$. The Kullback–Leibler divergence is also a Bregman divergence with $F(z) = z \log z$, and the Itakura–Saito divergence has $F(z) = -\log z$. Lafferty et al. (1997) suggest $F(z) = -z^{\alpha} + \alpha z - \alpha + 1$ for $\alpha \in (0, 1)$, and $F(z) = z^{\alpha} - \alpha z + \alpha - 1$ for $\alpha < 0$.

The principal use of Bregman divergences is in finding optimal models. Given a distribution $q$ we are interested in finding a $p$ that best matches the data, and this is posed as the convex optimization problem $\min_p B_F(p, q)$. It is easy to verify that any positive linear combination of Bregman divergences is a Bregman divergence and that the Bregman balls are convex in the first argument but often not in the second. This is the particular appeal of the technique: the divergence depends on the data naturally, and the divergences have come to be known as Information Geometry techniques. Furthermore, there is a natural convex duality between the optimum representation $p$ under $B_F$, and the divergence $B_F$. This connection to convex optimization is one of the many reasons for the emerging heavy use of Bregman divergences in the learning literature.

Given that we can estimate $\ell_1$ and $\ell_2$ distances between two streams in small space, it is natural to ask which other $f$-divergences and Bregman divergences are sketchable.

Our Contributions: In this paper we take several steps towards a characterization of the distances that can be sketched. Our first results, in Section 3, are negative and help us understand why the $\ell_1$ and $\ell_2$ distances are special among the $f$- and Bregman divergences. We prove the Shift Invariant Theorem that characterizes a large family of distances that cannot be approximated multiplicatively in the data-stream model. This theorem pertains to decomposable distances, i.e., distances $d : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^+$ for which there exists a $\phi : \mathbb{R} \times \mathbb{R} \to \mathbb{R}^+$ such that $d(x, y) = \sum_{i \in [n]} \phi(x_i, y_i)$. The theorem suggests that unless $\phi(x_i, y_i)$ is a function of $x_i - y_i$ the measure $d$ cannot be sketched. For all $f$-divergences for which $f$ is twice differentiable and $f''$ is strictly positive, no polynomial

[...]

Lemma 4. Let $f$ be a real-valued function that is convex on $(0, \infty)$ and satisfies $f(1) = 0$. Then there exists a real-valued function $g$ that is convex on $(0, \infty)$ and satisfies $g(1) = 0$ such that

1. $D_f = D_g$.
2. $g$ is positive and if $f$ is differentiable at 1 then $g'(1) = 0$.
3. If $D_f$ is bounded then $g(0) = \lim_{u \to 0} g(u)$ and $g^*(0) = \lim_{u \to 0} g^*(u)$ exist.

Proof. For $p = (1/2, 1/2)$ and $q = (0, 1)$,

\[ D_f(p, q) = (f(0) + f(2))/2 \quad \text{and} \quad D_f(q, p) = 0.5 \lim_{u \to 0} u f(1/u) + f(0.5). \]

Hence, if $D_f$ is bounded then $f(0) = \lim_{u \to 0} f(u)$ and $f^*(0) = \lim_{u \to 0} f^*(u) = \lim_{u \to 0} u f(1/u)$ exist. Let $c = \lim_{u \to 1^-} \frac{f(1) - f(u)}{1 - u}$. If $f$ is differentiable then $c = f'(1)$. Otherwise, this limit still exists because $f$ is convex and defined on $(0, \infty)$. Then $g(u) = f(u) - c(u - 1)$ satisfies the necessary conditions.
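The normalization in Lemma 4 can be checked numerically. The following is a minimal sketch, not part of the original text: it assumes the form $D_f(p, q) = \sum_i p_i f(q_i / p_i)$ (inferred from the values computed in the proof above), restricts to strictly positive $p_i$, and uses the illustrative choice $f(u) = -\log u$ with $c = f'(1) = -1$. The helper names `f_divergence` and `normalize` are hypothetical.

```python
import math

def f_divergence(f, p, q):
    """D_f(p, q) = sum_i p_i * f(q_i / p_i); assumes every p_i > 0 (illustrative form)."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

def normalize(f, c):
    """Return g(u) = f(u) - c * (u - 1), the function built in the proof of Lemma 4."""
    def g(u):
        return f(u) - c * (u - 1.0)
    return g

def f_kl(u):
    """Illustrative convex f with f(1) = 0 and f'(1) = -1."""
    return -math.log(u)

# g(u) = u - 1 - log(u): non-negative, with g(1) = 0 and g'(1) = 0.
g_kl = normalize(f_kl, -1.0)

p = [0.5, 0.3, 0.2]
q = [0.1, 0.6, 0.3]
d_f = f_divergence(f_kl, p, q)
d_g = f_divergence(g_kl, p, q)
print(d_f, d_g)  # the two values agree up to floating-point rounding
```

The two divergences agree because the added term contributes $c \sum_i (q_i - p_i)$, which vanishes when both $p$ and $q$ sum to one.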
For example, the Hellinger divergence can be realized by either $f(u) = (\sqrt{u} - 1)^2$ or $f(u) = 2 - 2\sqrt{u}$. Henceforth, we assume $f$ is non-increasing in the range $[0, 1]$ and non-decreasing in the range $[1, \infty)$.

The next lemma shows that, if we are willing to tolerate an additive approximation, we may make certain assumptions about the derivative of $f$. This is achieved by approximating $f$ by a straight line for very small and very large values.

Lemma 5. Given a bounded $D_f$ with $f$ differentiable (w.l.o.g., $f$ is unimodal and minimized at 1) and $\epsilon \in (0, 1)$, let

\[ u_0(\epsilon) = \max\{u \in (0, 1] : f(u)/f(0) \ge 1 - \epsilon,\ f^*(u)/f^*(0) \ge 1 - \epsilon\} \]

and define $g$:

\[ g(u) = \begin{cases} f(u) & \text{for } u \in (u_0, 1/u_0) \\ f(0) - u\,(f(0) - f(u_0))/u_0 & \text{for } u \in [0, u_0] \\ u f^*(0) - (f^*(0) - f^*(u_0))/u_0 & \text{for } u \in [1/u_0, \infty) \end{cases} \]

Then,

\[ D_g(p, q)(1 - \epsilon) \le D_f(p, q) \le D_g(p, q) \]

and

\[ \max_u |g'(u)| \le \max(f(0)/u_0,\ f^*(0)) \quad \text{and} \quad \max_u |g^{*\prime}(u)| \le \max(f^*(0)/u_0,\ f(0)). \]

Proof. Because $f, f^*, g, g^*$ are non-increasing in the range $[0, 1]$, for all $u \in [0, u_0]$,

\[ 1 \le \frac{g(u)}{f(u)} \le \frac{f(0)}{f(u)} \le \frac{f(0)}{f(u_0)} \quad \text{and} \quad 1 \le \frac{g^*(u)}{f^*(u)} \le \frac{f^*(0)}{f^*(u)} \le \frac{f^*(0)}{f^*(u_0)}. \tag{1} \]

[...]

$O((a + b + c)n)$ over $[5n/4]$ requires $\Omega(n)$ space. This remains true even if the algorithm may take a constant number of passes over the stream.

The factor $n$ on the right-hand side of Eqn. 2 is only necessary if we wish to prove an $\Omega(n)$ space lower bound and thereby rule out sub-linear space algorithms. In particular, if $n$ is replaced by some $w \le n$ then the lower bound would become $\Omega(w)$. However, the above formulation will be sufficient for the purposes of proving results on the estimation of information divergences.

Proof. The proof is by a reduction from the communication complexity of the Set-Disjointness problem. An instance of this problem consists of two binary strings, $x, y \in \{0, 1\}^n$ such that $\sum_i x_i = \sum_i y_i = n/4$. We consider two players, Alice and Bob, such that Alice knows the string $x$ and Bob knows the string $y$. Alice and Bob take turns to send messages to each other with the goal of determining if $x$ and $y$ are disjoint, i.e., $x \cdot y = 0$ (where the inner product is taken over the reals). It is known that determining if $x \cdot y = 0$ with probability at least 3/4 requires $\Omega(n)$ bits to be communicated (Razborov, 1992). However, suppose that there exists a streaming algorithm $\mathcal{A}$ that takes $P$ passes over a stream and uses $W$ working memory to $\alpha$-approximate $d(p, q)$ with probability 3/4. We will show that this algorithm gives rise to a $(2P - 1)$-round protocol for Set-Disjointness that only requires $O(PW)$ bits to be communicated and therefore $W = \Omega(n/P)$.

We will assume that $\phi(a/t, (a + c)/t) \ge \phi((a + c)/t, a/t)$. If $\phi(a/t, (a + c)/t) \le \phi((a + c)/t, a/t)$ then the proof follows by reversing the roles of the $p$ and $q$ that we now define. Consider the multi-sets,

\[ S_A(x) = \bigcup_{i \in [n]} \{a x_i + b(1 - x_i) \text{ copies of } \langle p, i \rangle, \langle q, i \rangle\} \cup \bigcup_{i \in [n/4]} \{b \text{ copies of } \langle p, i + n \rangle, \langle q, i + n \rangle\} \]

\[ S_B(y) = \bigcup_{i \in [n]} \{c y_i \text{ copies of } \langle q, i \rangle\} \cup \bigcup_{i \in [n/4]} \{c \text{ copies of } \langle p, i + n \rangle\}. \]

This defines the following frequencies:

\[ m(p)_i = \begin{cases} a & \text{if } x_i = 1 \text{ and } i \in [n] \\ b & \text{if } x_i = 0 \text{ and } i \in [n] \\ b + c & \text{if } n < i \le 5n/4 \end{cases} \]

and

\[ m(q)_i = \begin{cases} a & \text{if } (x_i, y_i) = (1, 0) \text{ and } i \in [n] \\ b & \text{if } (x_i, y_i) = (0, 0) \text{ and } i \in [n] \\ a + c & \text{if } (x_i, y_i) = (1, 1) \text{ and } i \in [n] \\ b + c & \text{if } (x_i, y_i) = (0, 1) \text{ and } i \in [n] \\ b & \text{if } n < i \le 5n/4. \end{cases} \]

Consequently,

\[ d(p, q) = (x \cdot y)\, \phi(a/t, (a + c)/t) + (n/4 - x \cdot y)\, \phi(b/t, (b + c)/t) + (n/4)\, \phi((b + c)/t, b/t), \]

[...]

for some $\gamma \in [0, 1/b]$ by Taylor's Theorem. Since $f(1) = f'(1) = 0$ and $f''(t)$ is continuous at $t = 1$ this implies that for sufficiently large $n$, $f''(1 + \gamma) \le f''(1) + 1$ and so,

\[ \phi(b/t, (b + c)/t) \le \frac{f''(1) + 1}{2tb} = \frac{f''(1) + 1}{2 f(2) b}\, t^{-1} f(2) \le \frac{8}{\alpha^2 n}\, \phi(a/t, (a + c)/t). \]

Similarly we can show that for sufficiently large $n$,

\[ \phi((b + c)/t, b/t) \le 8\, \phi(a/t, (a + c)/t) / (\alpha^2 n). \]

Then, appealing to Theorem 7 we get the required result.
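The frequency construction used in the Set-Disjointness reduction above can be simulated directly. The sketch below is illustrative only: it builds the frequency vectors $m(p)$ and $m(q)$ from $x$, $y$, $a$, $b$, $c$ as listed in the proof, and compares $d(p, q)$ computed coordinate by coordinate with the three-term expression in terms of $x \cdot y$. It assumes $\phi(z, z) = 0$, takes $t$ to be the common stream length $(a + 4b + c)n/4$ implied by those frequencies, and uses a Hellinger-type $\phi$ purely as an example. All function names are hypothetical.

```python
import math

def reduction_frequencies(x, y, a, b, c):
    """Frequency vectors m(p), m(q) over [5n/4] induced by S_A(x) and S_B(y)."""
    n = len(x)
    m_p = [a if xi else b for xi in x] + [b + c] * (n // 4)
    m_q = [(a if xi else b) + c * yi for xi, yi in zip(x, y)] + [b] * (n // 4)
    return m_p, m_q

def direct_distance(phi, m_p, m_q):
    """d(p, q) = sum_i phi(p_i, q_i) for the empirical distributions."""
    t_p, t_q = sum(m_p), sum(m_q)
    return sum(phi(u / t_p, v / t_q) for u, v in zip(m_p, m_q))

def closed_form(phi, x, y, a, b, c):
    """Three-term expression for d(p, q) in terms of k = x . y (needs phi(z, z) = 0)."""
    n = len(x)
    k = sum(xi * yi for xi, yi in zip(x, y))
    t = (a + 4 * b + c) * n / 4  # common length of the p- and q-streams
    return (k * phi(a / t, (a + c) / t)
            + (n / 4 - k) * phi(b / t, (b + c) / t)
            + (n / 4) * phi((b + c) / t, b / t))

def phi_hellinger(u, v):
    """Example decomposable term with phi(z, z) = 0 (an illustrative choice)."""
    return (math.sqrt(u) - math.sqrt(v)) ** 2

x = [1, 1, 0, 0, 0, 0, 0, 0]           # n = 8, exactly n/4 ones
y_meet = [1, 0, 1, 0, 0, 0, 0, 0]      # x . y = 1
y_disjoint = [0, 0, 1, 1, 0, 0, 0, 0]  # x . y = 0
a, b, c = 5, 2, 3

for y in (y_meet, y_disjoint):
    m_p, m_q = reduction_frequencies(x, y, a, b, c)
    print(direct_distance(phi_hellinger, m_p, m_q),
          closed_form(phi_hellinger, x, y, a, b, c))
```

When the first term dominates the other two, the value of $d(p, q)$ separates the intersecting case from the disjoint case, which is what a multiplicative approximation would have to detect.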
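Corollary 9 (Bregman Divergences). Given a Bregman divergence $B_F$, if $F$ is twice differentiable and there exist $\rho, z_0 > 0$ such that,

\[ \forall\, 0 \le z_2 \le z_1 \le z_0, \quad \frac{F''(z_1)}{F''(z_2)} \ge \left(\frac{z_1}{z_2}\right)^{\rho} \qquad \text{or} \qquad \forall\, 0 \le z_2 \le z_1 \le z_0, \quad \frac{F''(z_1)}{F''(z_2)} \le \left(\frac{z_2}{z_1}\right)^{\rho} \]

then no polynomial factor approximation of $B_F$ is possible in $o(n)$ bits of space.

This condition effectively states that $F''(z)$ vanishes or diverges monotonically, and polynomially fast, as $z \to 0$. Note that for $\ell_2^2$, which can be sketched, $F(z) = z^2$ and therefore $F''$ is constant everywhere.

Proof. By the Mean-Value Theorem, for any $t, r \in \mathbb{N}$, there exists $\gamma(r) \in [0, 1]$ such that,

\[ \phi(r/t, (r + 1)/t) + \phi(r/t + 1/t, r/t) = t^{-1}\left(F'(r/t + 1/t) - F'(r/t)\right) = t^{-2} F''((r + \gamma(r))/t). \]

Therefore, for any $a, b \in \mathbb{N}$, $c = 1$ and $t = an/4 + bn + n/2$,

\[ \frac{\max\left\{\phi\left(\frac{a}{t}, \frac{a + c}{t}\right), \phi\left(\frac{a + c}{t}, \frac{a}{t}\right)\right\}}{\phi\left(\frac{b + c}{t}, \frac{b}{t}\right) + \phi\left(\frac{b}{t}, \frac{b + c}{t}\right)} \ge \frac{1}{2} \cdot \frac{F''((a + \gamma(a))/t)}{F''((b + \gamma(b))/t)}. \]

If $\forall\, 0 \le z_2 \le z_1 \le z_0$, $F''(z_1)/F''(z_2) \ge (z_1/z_2)^{\rho}$ then set $a = (\alpha^2 n)^{1/\rho}$ and $b = 1$ where $\alpha$ is an arbitrary polynomial in $n$. If $\forall\, 0 \le z_2 \le z_1 \le z_0$, $F''(z_1)/F''(z_2) \le (z_2/z_1)^{\rho}$ then set $a = 1$ and $b = (\alpha n)^{1/\rho}$. In both cases we deduce that the RHS of Eqn. 3 is greater than $\alpha^2 n/4$. Hence, appealing to Theorem 7, we get the required result.

4 Additive Approximations

In this section we focus on additive approximations. As mentioned earlier, the probability of misclassification using ratio tests is often bounded by $2^{-D_f}$, for certain $D_f$. Hence, an additive $\epsilon$ approximation translates to a multiplicative $2^{\epsilon}$ factor for computing the error probability. Our goal is the characterization of divergences that can be approximated additively. We first present a general algorithmic result based on an extension of a technique first used by Alon

[...]

To see this consider the sub-stream consisting of the elements of the form $\langle \cdot, i \rangle$, e.g.,

\[ \langle p, i \rangle, \langle q, i \rangle, \langle q, i \rangle, \langle p, i \rangle, \langle q, i \rangle, \langle p, i \rangle \]

and expand $E[X(r, s) \mid k = i]$ as follows:

\[ \begin{aligned} E[X(r, s) \mid k = i] = \gamma \Big( & \sum_{j \in [m(q)]} X(2m(q) + j, 3m(p)) + \sum_{j \in [m(p)]} X(2m(q), 2m(p) + j) \\ & + \sum_{j \in [m(p)]} X(2m(q), m(p) + j) + \sum_{j \in [m(q)]} X(m(q) + j, m(p)) \\ & + \sum_{j \in [m(p)]} X(m(q), j) + \sum_{j \in [m(q)]} X(j, 0) \Big) \end{aligned} \]

\[ = 2 m^* \gamma \left( \begin{array}{l} \phi(3/m(p), 3/m(q)) - \phi(2/m(p), 3/m(q)) \\ {} + \phi(2/m(p), 3/m(q)) - \phi(2/m(p), 2/m(q)) \\ {} + \phi(2/m(p), 2/m(q)) - \phi(2/m(p), 1/m(q)) \\ {} + \phi(2/m(p), 1/m(q)) - \phi(1/m(p), 1/m(q)) \\ {} + \phi(1/m(p), 1/m(q)) - \phi(1/m(p), 0/m(q)) \\ {} + \phi(1/m(p), 0/m(q)) - \phi(0/m(p), 0/m(q)) \end{array} \right) = \frac{2 \phi(p_i, q_i)}{p_i + q_i}, \]

where $\gamma = \frac{1}{m(p)_i\, m(q) + m(p)\, m(q)_i}$. Therefore $E[X(r, s)] = \sum_i \phi(p_i, q_i)$ as required. Furthermore,

\[ |X(r, s)| \le 2 \max\left( \max_{x \in \left[\frac{r - 1}{m^*}, \frac{r}{m^*}\right]} \left| \frac{\partial}{\partial x} \phi(x, s/m^*) \right|,\ \max_{y \in \left[\frac{s - 1}{m^*}, \frac{s}{m^*}\right]} \left| \frac{\partial}{\partial y} \phi(r/m^*, y) \right| \right). \]

Hence, by an application of the Chernoff bound, averaging $O(\epsilon^{-2} \log \delta^{-1})$ independent basic estimators gives an $(\epsilon, \delta)$-additive-approx.

We next prove a lower bound on the space required for additive approximation by any single-pass algorithm. The proof uses a reduction from the one-way communication complexity of the Gap-Hamming problem (Woodruff, 2004). It is widely believed that a similar lower bound exists for multi-round communication (e.g. McGregor, 2007, Question 10 (R. Kumar)) and, if this is the case, it would imply that the lower bound below also applies to algorithms that take a constant number of passes over the data.

Theorem 11. Any $(\epsilon, 1/4)$-additive-approximation of $d(p, q)$ requires $\Omega(\epsilon^{-2})$ bits of space if,

\[ \exists\, a, b > 0,\ \forall x, \quad \phi(x, 0) = ax, \quad \phi(0, x) = bx, \quad \text{and} \quad \phi(x, x) = 0. \]

Proof. The proof is by a reduction from the communication complexity of the Gap-Hamming problem. An instance of this problem consists of two binary strings, $x, y \in \{0, 1\}^n$ such that $\sum_i x_i =$

[...]

For example, $\tau(\epsilon) = O(1)$ for Triangle and $\tau(\epsilon) = O(\epsilon^{-1})$ for Hellinger. The algorithm does not need to know $m(p)$ or $m(q)$ in advance.

Proof. We appeal to Theorem 10 and note that,

\[ \max_{x, y \in [0, 1]} \left| \frac{\partial}{\partial x} \phi(x, y) \right| + \left| \frac{\partial}{\partial y} \phi(x, y) \right| = \max_{x, y \in [0, 1]} \left| f(y/x) - (y/x) f'(y/x) \right| + \left| f'(y/x) \right| \le 2 \max_{u \ge 0} \left| f^{*\prime}(u) \right| + \left| f'(u) \right|. \]

By Lemma 5, we may bound the derivatives of $f$ and $f^*$ in terms of the additive approximation error. This gives the required result.

We complement Theorem 13 with the following result which follows from Theorems 11 and 12.

Theorem 14. Any $(\epsilon, 1/4)$-additive-approximation of an unbounded $D_f$ requires $\Omega(n)$ bits of space. This applies even if one of the distributions is known to be uniform. Any $(\epsilon, 1/4)$-additive-approximation of a bounded $D_f$ requires $\Omega(\epsilon^{-2})$ bits of space.

4.2 Additive Approximation for Bregman divergences

In this section we prove a partial characterization of the Bregman divergences that can be additively approximated.

Theorem 15. There exists a one-pass, $O(\epsilon^{-2} \log \delta^{-1} (\log n + \log m))$-space, $(\epsilon, \delta)$-additive-approx. of a Bregman divergence if $F$ and $F''$ are bounded in the range $[0, 1]$. The algorithm does not need to know $m(p)$ or $m(q)$ in advance.

Proof. We appeal to Theorem 10 and note that,

\[ \max_{x, y \in [0, 1]} \left| \frac{\partial}{\partial x} \phi(x, y) \right| + \left| \frac{\partial}{\partial y} \phi(x, y) \right| = \max_{x, y \in [0, 1]} \left| F'(x) - F'(y) \right| + |x - y|\, F''(y). \]

We may assume this is constant by convexity of $F$ and the assumptions of the theorem. The result follows.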
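The telescoping argument behind the basic estimator of Theorem 10 can be illustrated with a small offline simulation. This is a simplified sketch under stated assumptions, not the algorithm of the paper: it assumes $m(p) = m(q) = m$ is known in advance, samples a uniformly random stream position, requires $\phi(0, 0) = 0$, and defines $X$ as a scaled one-step difference of $\phi$ (the exact definition of $X(r, s)$ is not part of this excerpt). The function names are hypothetical.

```python
from collections import Counter

def basic_estimates(stream, phi):
    """Value of the basic estimator X for every possible sampled position.

    Assumes both sub-streams have the same length m. For a sampled element
    <z, i>, r and s count the occurrences of <p, i> and <q, i> from that
    position to the end of the stream.
    """
    m = sum(1 for z, _ in stream if z == "p")
    assert len(stream) == 2 * m, "sketch assumes m(p) == m(q)"
    suffix = Counter()  # occurrences of (z, i) at or after the current position
    values = []
    for z, i in reversed(stream):
        suffix[(z, i)] += 1
        r, s = suffix[("p", i)], suffix[("q", i)]
        if z == "p":
            x = 2 * m * (phi(r / m, s / m) - phi((r - 1) / m, s / m))
        else:
            x = 2 * m * (phi(r / m, s / m) - phi(r / m, (s - 1) / m))
        values.append(x)
    return values

def exact_distance(stream, phi):
    """d(p, q) = sum_i phi(p_i, q_i) computed from the full frequency counts."""
    m_p = Counter(i for z, i in stream if z == "p")
    m_q = Counter(i for z, i in stream if z == "q")
    t_p, t_q = sum(m_p.values()), sum(m_q.values())
    return sum(phi(m_p[i] / t_p, m_q[i] / t_q) for i in set(m_p) | set(m_q))

def phi_hellinger(u, v):
    """Example term with phi(0, 0) = 0 (an illustrative choice)."""
    return (u ** 0.5 - v ** 0.5) ** 2

stream = [("p", 0), ("q", 0), ("q", 0), ("p", 0),
          ("q", 0), ("p", 0), ("p", 1), ("q", 2)]
values = basic_estimates(stream, phi_hellinger)
expectation = sum(values) / len(values)  # uniform choice of position
print(expectation, exact_distance(stream, phi_hellinger))
```

Averaging $X$ over all positions gives its exact expectation; within each sub-stream of elements $\langle \cdot, i \rangle$ the differences telescope to $\phi(p_i, q_i) - \phi(0, 0)$, so the two printed values coincide up to rounding. A streaming implementation would instead sample the position on the fly and average independent copies, as in the Chernoff-bound step above.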
The next theorem follows immediately from Theorem 12.

Theorem 16. If $F(0)$ or $F'(0)$ is unbounded then an $(\epsilon, 1/4)$-additive-approx. of $B_F$ requires $\Omega(n)$ bits of space even if one of the distributions is known to be uniform.

5 Conclusions and Open Questions

We presented a partial characterization of the information divergences that can be multiplicatively approximated in the data stream model. This characterization was based on a general result that

[...]

Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation, 11(7), 1493–1517.

Brinkman, B., & Charikar, M. (2003). On the impossibility of dimension reduction in $\ell_1$. In IEEE Symposium on Foundations of Computer Science, (pp. 514–523).

Broder, A. Z., Charikar, M., Frieze, A. M., & Mitzenmacher, M. (2000). Min-wise independent permutations. J. Comput. Syst. Sci., 60(3), 630–659.

Chakrabarti, A., Cormode, G., & McGregor, A. (2007). A near-optimal algorithm for computing the entropy of a stream. In ACM-SIAM Symposium on Discrete Algorithms, (pp. 328–335).

Chakrabarti, A., Khot, S., & Sun, X. (2003). Near-optimal lower bounds on the multi-party communication complexity of set disjointness. In IEEE Conference on Computational Complexity, (pp. 107–117).

Charikar, M., Chen, K., & Farach-Colton, M. (2002). Finding frequent items in data streams. In International Colloquium on Automata, Languages and Programming, (pp. 693–703).

Collins, M., Schapire, R. E., & Singer, Y. (2002). Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3), 253–285.

Cormode, G., Datar, M., Indyk, P., & Muthukrishnan, S. (2003). Comparing data streams using hamming norms (how to zero in). IEEE Trans. Knowl. Data Eng., 15(3), 529–540.

Cormode, G., & Ganguly, S. (2007). On estimating frequency moments of data streams. In International Workshop on Randomization and Approximation Techniques in Computer Science.

Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1), 58–75.

Cover, T. M., & Thomas, J. A. (1991). Elements of Information Theory. Wiley Series in Telecommunications. New York, NY, USA: John Wiley & Sons.

Csiszár, I. (1991). Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Statist., (pp. 2032–2056).

Feigenbaum, J., Kannan, S., Strauss, M., & Viswanathan, M. (2002). An approximate L1 difference algorithm for massive data streams. SIAM Journal on Computing, 32(1), 131–151.

Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28, 337–407.

Guha, S., McGregor, A., & Venkatasubramanian, S. (2006). Streaming and sublinear approximation of entropy and information distances. In ACM-SIAM Symposium on Discrete Algorithms, (pp. 733–742).