How Slow is the k-Means Method?

David Arthur*, Stanford University, Stanford, CA, darthur@cs.stanford.edu
Sergei Vassilvitskii†, Stanford University, Stanford, CA, sergei@cs.stanford.edu

ABSTRACT

The k-means method is an old but popular clustering algorithm known for its observed speed and its simplicity. Until recently, however, no meaningful theoretical bounds were known on its running time. In this paper, we demonstrate that the worst-case running time of k-means is superpolynomial by improving the best known lower bound from Ω(n) iterations to 2^Ω(√n).

Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems

General Terms: Algorithms, Theory.

Keywords: k-means, local search, lower bounds.

1. INTRODUCTION

The k-means method is a well known geometric clustering algorithm based on work by Lloyd in 1982 [12]. Given a set of n data points, the algorithm uses a local search approach to partition the points into k clusters. A set of k initial cluster centers is chosen arbitrarily. Each point is then assigned to the center closest to it, and the centers are recomputed as centers of mass of their assigned points. This is repeated until the process stabilizes. It can be shown that no partition occurs twice during the course of the algorithm, and so the algorithm is guaranteed to terminate.

The k-means method is still very popular today, and it has been applied in a wide variety of areas ranging from computational biology to computer graphics (see [1, 6, 8] for some recent applications). The main attraction of the algorithm lies in its simplicity and its observed speed. Indeed, the running time of k-means is well studied experimentally (see, for example, [7]). In their text on pattern classification, Duda et al. remark that, "In practice the number of iterations is generally much less than the number of points" [5]. However, few meaningful theoretical bounds on the worst-case running time of k-means are known.

*Supported in part by an NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC.
†Supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SCG'06, June 5-7, 2006, Sedona, Arizona, USA. Copyright 2006 ACM 1-59593-340-9/06/0006 ...$5.00.

1.1 Related Work

There is a trivial upper bound of O(k^n) iterations since no partition of points into clusters is ever repeated during the course of the algorithm. In d-dimensional space, this bound was slightly improved by Inaba et al. [9] to O(n^{kd}) by counting the number of distinct Voronoi partitions on n points.

More recently, Dasgupta [4] presented some tighter results for a few special cases. He demonstrated a worst-case lower bound of Ω(n) iterations, and an upper bound of O(n) for k ≤ 5 and d = 1. This work was extended by Har-Peled and Sadri [7] in 2005. Again restricting to d = 1, the authors show an upper bound of O(n∆²), where ∆ is the spread of the point set (defined as the ratio between the largest pairwise distance and the smallest pairwise distance). They are unable to bound the running time of k-means in general, but they suggest a few modifications that are easier to analyze. For example, if one reclassifies exactly one point per iteration, then k-means is guaranteed to converge after O(kn²∆²) iterations in any dimension.

1.2 Our Results

Our main result is a lower bound construction for which the running time of k-means is superpolynomial. In particular, we present a set of n data points and a set of adversarially chosen cluster centers for which the algorithm requires 2^Ω(√n) iterations. We then expand this to show that even if the initial cluster centers are chosen uniformly at random from the data points, the running time is still superpolynomial with high probability. We also show our construction can be modified to have constant spread, thereby disproving a recent conjecture of Har-Peled and Sadri [7].

Explaining the running times observed in practice remains an open problem. As a first step, we show that if the data points are selected from independent normal distributions in Ω(n/log n) dimensions, then k-means will terminate in a polynomial number of steps with high probability. We also briefly discuss several other ways in which one might hope to circumvent the worst-case lower bound.

Figure 1: An idealized "reset widget" that can be used to reset the center of some cluster C after k-means has finished executing: (a) The configuration right before the signaling begins. (b) P switches to cluster C, and the center of A moves away from Q and R. (c) Q switches to cluster C, thereby resetting the center of C. In addition, R switches to cluster B, and the center of B moves towards P and Q. (d) P and Q switch to cluster B. Now C is completely reset.

2. PRELIMINARIES

The k-means algorithm [12] is a method for partitioning data points into clusters. Let X = {x_1, x_2, ..., x_n} be a set of points in R^d. After being seeded with a set of k centers c_1, c_2, ..., c_k in R^d, the algorithm partitions these points into clusters as follows.

1. For each i ∈ {1, ..., k}, set the cluster C_i to be the set of points in X that are closer to c_i than they are to c_j for all j ≠ i.
2. For each i ∈ {1, ..., k}, set c_i to be the center of mass of all points in C_i: c_i = (1/|C_i|) Σ_{x_j ∈ C_i} x_j.
3. Repeat steps 1 and 2 until c_i and C_i no longer change, at which point return the clusters C_i.

If there are two centers equally close to a point in X, we break the tie arbitrarily. If a cluster has no data points at the end of step 2, we eliminate the cluster and continue as before. Our lower bound construction will not rely on either of these degeneracies.

During the analysis it will be useful to talk about a means configuration.

Definition 2.1. A means configuration M = (X, C) is a set of data points X and a set of cluster centers C = {c_i}_{i=1,...,k}.

Note that a means configuration M defines an intermediate point in the execution of the algorithm. Given a means configuration M, let T(M) denote the number of iterations required by k-means to converge starting at M. We say that M is non-degenerate if, as the algorithm is run to completion, (a) no point is ever equidistant from the two closest cluster centers and (b) no cluster ever has 0 data points.

3. LOWER BOUNDS

In this section, we demonstrate a means configuration which requires 2^Ω(√n) iterations. As the construction is rather involved, we begin with some intuition, and then proceed with a formal proof. We have also implemented the construction in C++ [3].

At the end of the section, we consider a couple of modifications to the main construction. First, we show that even if the starting centers are chosen uniformly at random from the data points, there exist examples where a superpolynomial number of iterations is still required with high probability. Finally, we show how to reduce the spread of any construction at the cost of increasing the dimensionality of the instance.

3.1 Intuition

The main idea behind the lower bound construction is that of a "reset widget". The role of the widget is to recognize when k-means has run to completion and to then reset it back into its initial state. We require the widget to not interfere with the original data points before or after the reset operation, thereby ensuring that the new k-means configuration takes twice as long to run to completion. Our lower bound is obtained by recursively adding reset widgets. By ensuring each widget has O(k) new points and O(1) new clusters, we get the bound of 2^Ω(√n) iterations.

We begin with an idealized description of a reset widget, illustrated in Figure 1. We then briefly mention a few issues that this idealized discussion omits.

3.1.1 A Reset Widget

Suppose we are given a means configuration in R² and we are promised that the final center (c_x, c_y) of some cluster C never appears as a cluster center in any previous iteration. We call this a "signaling" means configuration. We can detect when k-means has run to completion by lifting the original configuration to R³, and adding a point P = (c_x, c_y, D − ε) in a new cluster A with center at (c_x, c_y, 2D). If D is large and ε is small, then P will switch to C after k-means finishes executing on the original data set, but no earlier. This creates a widget that triggers at the right time.

The next step is to make the widget actually reset C. We do this by augmenting A to also include a point Q = (d_x, d_y, D(1 + ε')) while maintaining the center of A at (c_x, c_y, 2D). Switching P from A to C causes the center of C to move towards Q, and the center of A to move away from Q. As long as D is sufficiently large, Q will follow P into C on the next iteration, regardless of the values of d_x and d_y. In particular, this means we can choose d_x and d_y so as to reset the center of C to its initial position (at least in the x and y coordinates). To avoid changing C's z-coordinate, we make two symmetric reset widgets, one above the xy-plane, and one below. See Figure 1, parts (a) through (c).

Unfortunately, this method is not quite sufficient. We have reset the center of C by adding points to the cluster. As k-means progresses the second time through, these points will linger with C and provide a constant drag back to its original position. To actually make the reset configuration proceed as the original did, C must lose these new points immediately after the reset occurs. To ensure this happens, we add a third point R near the center of A, and a new cluster B near R. Now B will acquire R during the reset process, which moves it into position to recapture the points P and Q. The whole process is illustrated in Figure 1.

We have now fully reset C. Applying this technique simultaneously to each cluster, we can hope to double the running time of k-means.

3.1.2 Pitfalls

A few additional considerations come into play when formalizing this intuition.

1. We can only add a reset widget to a signaling configuration. To recursively add reset widgets, we need to ensure that adding a reset widget maintains the signaling property.
2. We can only reset a cluster if that specific cluster signaled on the final iteration. Thus, we need to be able to take a signaling configuration and augment it so that each cluster simultaneously signals.
3. We cannot afford to double the number of clusters by adding a different reset widget for each cluster. Instead, we must have one widget reset all clusters at the same time. To accomplish this, it is convenient to have the reset widget cluster centered equally far from each signal, which requires placing certain cluster centers on a hypersphere.

3.2 The Formal Construction

We now formally present the reset widget. This requires a careful placement of points and cluster centers, but the intuition is exactly the one described above. We first state our main results.

Definition 3.1. A means configuration is said to be signaling if at least one final cluster center is distinct from every cluster center arising in previous iterations.

Theorem 3.1. Let M be a signaling, non-degenerate means configuration on n data points with k clusters. Then there exists a signaling, non-degenerate means configuration N on n + O(k) data points with k + O(1) clusters such that T(N) ≥ 2T(M).

Starting with an arbitrary configuration, we can apply this construction t times to obtain a means configuration with O(t²) points and O(t) clusters for which T(M) ≥ 2^t. The superpolynomial complexity of k-means follows immediately.

Corollary 3.2. The worst-case complexity of k-means on n data points is 2^Ω(√n).

We prove Theorem 3.1 in two parts. First, we show that particular types of means configurations, called super-signaling configurations, can be slightly enlarged to create non-degenerate, signaling means configurations with twice the complexity. We then show how to slightly enlarge non-degenerate, signaling means configurations to obtain super-signaling configurations, thereby establishing the recursion.

Definition 3.2. A means configuration M is said to be super-signaling if it has the following properties.

1. The final positions of all cluster centers lie on a hypersphere.
2. The final positions of all cluster centers are distinct from all cluster centers arising in previous iterations.
3. There exists a means configuration M' with the same set of data points as M and with the same number of clusters as M. Furthermore, T(M') = T(M) and at least one final cluster center in M' is distinct from any other cluster center arising in all iterations starting from M and M'.

Lemma 3.3. Let M be a super-signaling, non-degenerate means configuration on n data points with k clusters. Then there exists a signaling, non-degenerate means configuration N on n + O(k) data points with k + O(1) clusters such that T(N) ≥ 2T(M).

Proof. We begin with a formal definition of our construction, and then trace the execution of k-means in Table 1 and Figures 3-7.

Let M' be given as in Definition 3.2. Label the clusters in M and M' with 1 through k, and let x_{i,t} (respectively y_{i,t}) denote the center of cluster i in M (respectively M') after t iterations. Also let x̃_i denote the final center of cluster i in M and let n_i denote the final number of data points in cluster i. Since M is super-signaling, we may assume without loss of generality that ||x̃_i|| is independent of i (e.g. the center of the hypersphere passing through the x̃_i's lies at the origin). Finally, let z_i = (1/2)((n_i + 4)y_{i,0} − (n_i + 2)x̃_i).

Let V(M) denote the data points in M and let ℓ denote the diameter of {0} ∪ {z_1, ..., z_k} ∪ V(M). Let d, r and ε be such that d ≫ r ≫ ℓ ≫ ε > 0, and let d' be such that (d')² = d² + ||x̃_i||² − ε. Finally, let u_1, u_2, ..., u_k and v_1, v_2, ..., v_k be vectors in R² such that (a) ||u_i|| = (n_i + 2)/2, (b) v_i = u_i/||u_i||, and (c) v_i ≠ v_j for all i ≠ j.

Now consider the following points in Span(V(M)) × R × R² × R:

P_i = (x̃_i, d', r u_i, 0) for i ≤ k;
P'_i = (−x̃_i, d' + 2d, −r u_i, 0) for i ≤ k;
Q_i = (z_i, d' + 0.001d, r v_i, 0) for i ≤ k;
Q'_i = (−z_i, d' + 1.999d, −r v_i, 0) for i ≤ k;
A = (0, d' + 0.99d, 0, 0);
A' = (0, d' + 1.01d, 0, 0);
X = (0, d' + 0.99d, 0, 0.2d);
X' = (0, d' + 1.01d, 0, 0.2d).

For each such point Z, we define Z̄ to be the reflection of Z about the hyperplane Span(V(M)) × {0} × R² × R — i.e. P̄_i has coordinates (x̃_i, −d', r u_i, 0). Let V(N) denote the set of all these points along with the natural embedding of V(M) in Span(V(M)) × {0} × {(0,0)} × {0}. This setup is illustrated in Figure 2.

Figure 2: The data points constructed in Lemma 3.3 (figure not to scale). Note d ≫ r ≫ ℓ.

t = 0, ..., T(M):
  C_i = M_{i,t} with center (x_{i,t}, 0, 0, 0)
  G = {P_i, P'_i, Q_i, Q'_i, A, A'} with center (0, d' + d, 0, 0)
  H = {X} with center (0, d' + 0.99d, 0, 0.2d)
  H' = {X'} with center (0, d' + 1.01d, 0, 0.2d)

t = T(M) + 1:
  C_i = M̃_i ∪ {P_i, P̄_i} with center (x̃_i, 0, r v_i, 0)
  G = {P'_i, Q_i, Q'_i, A, A'} with center (O(ℓ), d' + αd, O(rn), 0), where 1.25 ≤ α ≤ 4/3
  H = {X} with center (0, d' + 0.99d, 0, 0.2d)
  H' = {X'} with center (0, d' + 1.01d, 0, 0.2d)

t = T(M) + 2:
  C_i = M̃_i ∪ {P_i, Q_i, P̄_i, Q̄_i} with center (y_{i,0}, 0, r v_i, 0)
  G = {P'_i, Q'_i} with center (O(ℓ), d' + 1.9995d, O(rn), 0)
  H = {A, X} with center (0, d' + 0.99d, 0, 0.1d)
  H' = {A', X'} with center (0, d' + 1.01d, 0, 0.1d)

t = T(M) + 3:
  C_i = M'_{i,1} with center (y_{i,1}, 0, 0, 0)
  G = {P'_i, Q'_i} with center (O(ℓ), d' + 1.9995d, O(rn), 0)
  H = {A, X, P_i, Q_i} with center (O(ℓ), d' + 0.0005d + (0.9895/(k+1))d, O(rn), 0.2d/(2k+2))
  H' = {A', X'} with center (0, d' + 1.01d, 0, 0.1d)

t = T(M) + 4, ..., 2T(M) + 2:
  C_i = M'_{i,t−T(M)−2} with center (y_{i,t−T(M)−2}, 0, 0, 0)
  G = {P'_i, Q'_i} with center (O(ℓ), d' + 1.9995d, O(rn), 0)
  H = {P_i, Q_i} with center (O(ℓ), d' + 0.0005d, O(rn), 0)
  H' = {A, A', X, X'} with center (0, d' + d, 0, 0.1d)

Table 1: The clusters of N after t iterations of k-means (see Lemma 3.3). M_{i,t} (respectively M'_{i,t}) denotes the points in cluster i of M (respectively M') after t iterations, and M̃_i denotes the final points in cluster i of M. All table entries describe clusters immediately after the centers are recomputed. Rather than going through every calculation, we discuss the key elements in the following figures.
Figure 3: Clustering at 0 ≤ t ≤ T(M) (see Lemma 3.3). The clusters contained within V(M) proceed independently of the other points. The remaining clusters are precarious but temporarily stable. For example, to see that P_i does not switch from cluster G to C_j, note that the distance squared from P_i to the center of C_j minus the distance squared from P_i to the center of G is (||x̃_i − x_{j,t}||² + (d')² + ||r u_i||²) − (||x̃_i||² + d² + ||r u_i||²) = ||x̃_i − x_{j,t}||² − ε > 0. The last inequality follows from the fact that ε ≪ ℓ and that, since M is super-signaling, x̃_i ≠ x_{j,t}.

Figure 4: Clustering at t = T(M) + 1 (see Lemma 3.3). We now have x_{i,t} = x̃_i for all i, and thus by the calculation in the previous step, each P_i switches to cluster C_i. Clearly, this will result in a substantial shift of the center of G (and similarly of Ḡ). Furthermore, the u_i's have been chosen so that the center of C_i becomes (x̃_i, 0, r v_i, 0).

Figure 5: Clustering at t = T(M) + 2 (see Lemma 3.3). First consider V(M). These points continue to be closer to the C_i's than to other clusters. Each C_i center has moved since the previous iteration, but they have all moved by a constant amount (namely r||v_i|| = r) in a direction orthogonal to Span(V(M)). Therefore, the closest center to each point in V(M) has not changed, and thus these points remain in their current clusters. On the other hand, since the center of G moved away, A, A' and Q_i all switch to different clusters. The first two clearly switch to H and H', but Q_i could reasonably switch to either H or any C_j. The distance squared from Q_i to the center of C_j is (1.001d)² + r²||v_i − v_j||² + O(ℓ²), which is minimized when i = j. The distance squared from Q_i to the center of H is (0.989d)² + (0.2d)² + O(r²). Since 0.989² + 0.2² > 1.001² and d ≫ r, ℓ, it follows that Q_i will in fact switch to C_i. Note that the analysis so far does not depend on the V(M)-coordinate of any Q_i, so we may choose those to make the V(M)-coordinate of each C_i equal to y_{i,0} at the end of this step.
Figure 6: Clustering at t = T(M) + 3 (see Lemma 3.3). By acquiring A, cluster H has moved closer to the other points. In fact, the distance squared from P_i to the center of H is now (0.99d)² + (0.1d)² + O(r²) < d². Thus, each P_i switches to H, and a similar calculation shows each Q_i also switches to H. Now consider V(M). As in the previous step, we may ignore the r v_i component of each C_i center. The V(M) component of each C_i center is now y_{i,0}, which means the clustering proceeds according to M', and the points in V(M) associated with C_i at the end of this step are M'_{i,1}.

Figure 7: Clustering at T(M) + 4 ≤ t ≤ 2T(M) + 2 (see Lemma 3.3). The center of H moves because P_i and Q_i have been absorbed into H. Also A and X switch to H'. Beyond that, the configuration is now very stable, and the clustering on V(M) will proceed normally according to M'.

We also define clusters with initial centers in Span(V(M)) × R × R² × R as follows.

C_i with center (x_{i,0}, 0, 0, 0) for i ≤ k;
G with center (0, d' + d, 0, 0);
H with center (0, d' + 0.99d, 0, 0.2d);
H' with center (0, d' + 1.01d, 0, 0.2d).

For each such cluster C other than the C_i's, we define C̄ to be a cluster whose initial center is obtained by reflecting the initial center of C about the hyperplane Span(V(M)) × {0} × R² × R.

Let N denote the means configuration with all these cluster centers and with data points V(N). We trace the evolution of k-means on N via Table 1 and Figures 3-7. Based upon this, we see that T(N) ≥ T(M) + T(M') = 2T(M), and that N is non-degenerate and signaling. Since N has n + O(k) data points and k + O(1) clusters, the result follows.
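The doubling recursion behind Theorem 3.1 and Corollary 3.2 is easy to tally. The following bookkeeping sketch is our own illustration (the constant c standing in for the O(k) and O(1) terms is an assumption for illustration, not a value from the construction):

```python
def doubling_bound(t, c=1):
    """Bookkeeping for t applications of Theorem 3.1: each application
    adds O(k) data points (modeled as c*k) and O(1) clusters (modeled
    as exactly one), and at least doubles the iteration count T."""
    n, k, iters = 1, 1, 1  # a constant-size starting configuration
    for _ in range(t):
        n += c * k    # n + O(k) data points
        k += 1        # k + O(1) clusters
        iters *= 2    # T(N) >= 2 T(M)
    return n, k, iters
```

With c = 1 this gives n = 1 + t(t+1)/2 = O(t²) points and 2^t iterations, so the iteration count is 2^Ω(√n), as in Corollary 3.2.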
This completes the first half of our construction, in which we transform a super-signaling configuration into a signaling configuration with twice the complexity. We now show how to transform a signaling configuration into a super-signaling configuration with equal complexity.

Lemma 3.4. Let N be a signaling, non-degenerate means configuration on n data points with k clusters. Then there exists a super-signaling, non-degenerate means configuration M on n + O(k) data points with k + O(1) clusters such that T(M) ≥ T(N).

Proof. Let x_{i,t} denote the center of cluster i in N after t iterations and let x̃_i denote the final center of cluster i in N. Since N is signaling, we may assume without loss of generality that x̃_1 is distinct from all other x_{i,t}.

Let V(N) denote the set of data points in N and let ℓ denote the diameter of V(N). Let d and ε be such that d ≫ ℓ ≫ ε > 0 and let d' be such that (d')² = d² − ε. Also, let a, b and c be points in Span(V(N)) such that b = (a + c)/2 and such that the distance from a to V(N) is much larger than both ℓ and ||c − a||. Now, take δ = 1/(3k + 9), and consider the following points in Span(V(N)) × R:

P = (x̃_1, d');
X_i = (x̃_i, d' + δd) for i ≤ k;
A, B, C = (a, 0), (b, 0), (c, 0);
A', B', C' = (a, d' + δd), (b, d' + δd), (c, d' + δd);
Q = ((k + 4)x̃_1 − Σ_i x̃_i − 3b, d' + (k + 14/3)d).

For each such point Z ∉ {A, B, C}, we also define Z̄ to be the reflection of Z about the hyperplane Span(V(N)) × {0}. Let V(M) denote the set of all these points as well as the natural embedding of V(N) in Span(V(N)) × {0}. This is illustrated in Figure 8.

We also define clusters with centers in Span(V(N)) × R as follows.

C_i with center (x_{i,0}, 0) for i ≤ k;
H with center ((a + b)/2, 0);
H' with center (c, 0);
J with center (x_{1,0}, d' + d);
J̄ with center (x_{1,0}, −d' − d).

Let M denote the means configuration with all these cluster centers and with data points V(M). We trace the evolution of k-means on M via Table 2 and Figures 9-11. Based upon this, we see that T(M) ≥ T(N), that M is non-degenerate, and also that the final cluster sets of M are distinct from all cluster sets arising in previous configurations.

Also let M' denote the means configuration with data points V(M) and with cluster centers as above except with H centered at (a, 0) and H' centered at ((b + c)/2, 0). Then, the same calculation shows that T(M') = T(M) and that the final cluster set for H is distinct from all other cluster sets arising in M or M'.

Finally, since M and M' are non-degenerate, there exists a δ₀ > 0 such that we may move each data point by up to δ₀ without altering the k-means execution. Taking advantage of this, we can ensure that the centers of distinct cluster sets are distinct, and that the final cluster centers of M lie on a hypersphere. This makes M super-signaling, and the result follows.

Theorem 3.1 follows immediately from Lemma 3.3 and Lemma 3.4.

Figure 8: The data points constructed in Lemma 3.4. Note d ≫ ℓ.

t = 0, ..., T(N):
  C_i = N_{i,t} with center (x_{i,t}, 0), for 1 ≤ i ≤ k
  H = {A, B} with center ((a + b)/2, 0)
  H' = {C} with center (c, 0)
  J = {P, X_i, A', B', C', Q} with center (x̃_1, d' + d)

t = T(N) + 1:
  C_1 = Ñ_1 ∪ {P, P̄} with center (x̃_1, 0)
  C_i = Ñ_i with center (x̃_i, 0), for 2 ≤ i ≤ k
  H = {A, B} with center ((a + b)/2, 0)
  H' = {C} with center (c, 0)
  J = {X_i, A', B', C', Q} with center (x̃_1, d' + d + d/(k + 4))

t = T(N) + 2:
  C_1 = Ñ_1 ∪ {P, X_1, P̄, X̄_1} with center (x̃_1, 0)
  C_i = Ñ_i ∪ {X_i, X̄_i} with center (x̃_i, 0), for 2 ≤ i ≤ k
  H = {A, B, A', B', Ā', B̄'} with center ((a + b)/2, 0)
  H' = {C, C', C̄'} with center (c, 0)
  J = {Q} with center ((k + 4)x̃_1 − Σ_i x̃_i − 3b, d' + (k + 14/3)d)

Table 2: The clusters of M after t iterations of k-means (see Lemma 3.4). N_{i,t} denotes the points in cluster i of N after t iterations, and Ñ_i denotes the final points in cluster i of N. All table entries describe clusters immediately after the centers are recomputed. Rather than going through every calculation, we discuss the key elements in the following pictures.
Figure 9: Clustering at 0 ≤ t ≤ T(N) (see Lemma 3.4). As with the first part of the construction for Lemma 3.3, the clusters contained within V(N) proceed independently of the other points. The remaining clusters are precarious but temporarily stable. For example, to see that P does not switch from cluster J to C_i, note that the distance squared from P to the center of C_i minus the distance squared from P to the center of J is ||x̃_1 − x_{i,t}||² + (d')² − d² = ||x̃_1 − x_{i,t}||² − ε > 0. The last inequality follows from the fact that ε ≪ ℓ and that, since N is signaling, x̃_1 ≠ x_{i,t}.

Figure 10: Clustering at t = T(N) + 1 (see Lemma 3.4). We now have x_{1,t} = x̃_1, and thus by the calculation in the previous step, P switches to cluster C_1. Since P̄ also switches to cluster C_1, the center of C_1 does not change. However, the centers of J and J̄ both move slightly further away from the other points.

Figure 11: Clustering at t = T(N) + 2 (see Lemma 3.4). The points X_i, A', B', C' were all chosen to be only barely stable within J. Thus, after the center of J moves, these points switch to the closest clusters in V(N). For example, the distance from X_i to the center of C_i is approximately d + d/(3k + 9), and the distance from X_i to the center of J is approximately d + d/(k + 4) − d/(3k + 9) > d + d/(3k + 9). Again, only the centers of J and J̄ move as a result of this, and it is easy to check the new configuration is stable.
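Before turning to randomly chosen centers, it may help to have the basic iteration from Section 2 written out. The following is a minimal sketch of the method in Python (our own illustration, not the C++ implementation of [3]), with the k initial centers seeded uniformly at random from the data points:

```python
import random

def kmeans(points, k, seed=0):
    """Lloyd's iteration as in Section 2: seed k centers from the data,
    then alternate the assignment and center-of-mass steps until the
    partition stops changing.  Returns (assignment, iterations)."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(points, k)]

    def nearest(x):
        # Step 1: index of the closest center (ties broken arbitrarily).
        return min(range(len(centers)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))

    assignment, iterations = None, 0
    while True:
        new_assignment = [nearest(x) for x in points]
        if new_assignment == assignment:
            # No partition of the points ever repeats, so this terminates.
            return assignment, iterations
        assignment = new_assignment
        iterations += 1
        # Step 2: move each center to the center of mass of its cluster.
        for i in range(len(centers)):
            members = [x for x, a in zip(points, assignment) if a == i]
            if members:  # the paper eliminates empty clusters; here we keep the stale center
                centers[i] = [sum(cs) / len(members) for cs in zip(*members)]
```

On well-separated inputs this converges after a handful of iterations; the constructions above are engineered so that `iterations` instead grows as 2^Ω(√n).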
3.3 Probability Boosting

The construction used to prove Theorem 3.1 requires both a specific set of data points and a specific set of cluster centers. In practice, however, only the data points are specified and the initial cluster centers are chosen by the algorithm. Typically, the initial centers are chosen uniformly at random from the data points. Given this, one might ask if the superpolynomial lower bound can actually arise with non-vanishing probability.

In this section, we show how to modify our lower bound construction to apply with high probability even if the cluster centers are chosen randomly from the existing data points. It follows that k-means can still be very slow for certain sets of data points, even accounting for the random choice of cluster centers.

Proposition 3.5. Let M be a means configuration on n points. Then, there exists a set of O(n³ log n) points such that if a means configuration N is constructed with these data points and with 4n log n cluster centers chosen randomly from the set of data points, then T(N) ≥ T(M) with probability 1 − O(1/n).

Proof. Let k be the number of clusters in M. For i ≤ k and j ≤ m, let u_{i,j} denote orthogonal unit vectors in R^{mk}. Let V(M) denote the set of data points in M and let ℓ denote the diameter of V(M). Let d, r and ε be such that d ≫ r ≫ ℓ ≫ ε. Also, let n_i denote the number of points in cluster i in M after one iteration. Replacing M with two identical overlapping copies if necessary, we may assume that n_i ≠ 1. Finally, let x_{i,t} denote the center of cluster i in M after t iterations.

Let m be a positive integer to be fixed later and consider the point set in Span(V(M)) × R^{km} × R obtained by first embedding two copies of V(M) at Span(V(M)) × {0} × {0} and then adding the following points.

1. P_{i,j} = (x_{i,0}, Σ_{(i',j') ≠ (i,j)} r u_{i',j'}, d + jε) for i ≤ k, j ≤ m.
2. Q_{i,ℓ} = ((n_i x_{i,1} − x_{i,0})/(n_i − 1), Σ_{i' ≠ i} Σ_{j'} r u_{i',j'}, d − ℓε) for i ≤ k and ℓ ≤ n_i − 1.
3. O_j = (0, Σ_{i'} Σ_{j'} r u_{i',j'}, d + jε) for j ≤ m.

Consider a means configuration N with these data points and with 4n log n cluster centers chosen from these points at random. Let Ω_0 = {O_1, O_2, ..., O_m} and for i > 0, let Ω_i = {P_{i,1}, P_{i,2}, ..., P_{i,m}}. Suppose that N begins with all of its cluster centers in Ω = ∪_i Ω_i and that each Ω_i has at least one cluster center. One can check that T(N) ≥ T(M) in this case.

Now, let m = (n³ log n)/k. Then, each cluster center will be in some Ω_i with probability 1 − O(1/(n² log n)). Since there are 4n log n clusters, all clusters will be in Ω with probability 1 − O(1/n). Furthermore, the probability that no cluster center is chosen in a fixed Ω_i is at most (1 − 1/(2k))^{4n log n} ≤ 1/n². Thus, each Ω_i has at least one cluster center with probability 1 − O(1/n). The result now follows.

3.4 Low Spread

Recall the spread of a point set is the ratio of the largest pairwise distance to the smallest pairwise distance. Har-Peled and Sadri [7] conjectured that k-means might run in time polynomial in n and ∆. In this section, however, we show that the spread can be reduced to O(1) without decreasing the number of iterations required.

Proposition 3.6. Let M be a means configuration on n points. Then, there exists a means configuration N on 2n points such that N has O(1) spread and such that T(N) = T(M).

Proof. Let V(M) denote the points in M, and choose an arbitrary set of vectors, u_1, u_2, ..., u_n. For each v_i ∈ V(M), we replace v_i with x_i = (v_i, u_i) and y_i = (v_i, −u_i) in Span(V(M)) × Span(u_1, u_2, ..., u_n). Let N denote the means configuration with these data points and with centers (c_j, 0) for each center c_j in M. It is easy to check that cluster C in N contains x_i and y_i after t iterations if and only if cluster C in M contains v_i after t iterations. It follows that T(N) = T(M). Taking the u_i to be orthogonal and of equal, sufficiently large length, we can make N have spread arbitrarily close to √2.
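The embedding in this proof is easy to check numerically. Below is a small sketch (our own illustration; `spread` and `reduce_spread` are hypothetical helper names), taking each u_i to be d times the i-th coordinate vector of the added dimensions, so the u_i are orthogonal and of a common large length d:

```python
import itertools
import math

def spread(points):
    """Ratio of the largest to the smallest pairwise distance."""
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return max(dists) / min(dists)

def reduce_spread(points, d):
    """The embedding from Proposition 3.6: replace each point v_i by
    (v_i, u_i) and (v_i, -u_i), taking u_i = d * e_i with the e_i
    orthogonal coordinate vectors in len(points) new dimensions."""
    n = len(points)
    doubled = []
    for i, v in enumerate(points):
        u = [0.0] * n
        u[i] = d
        doubled.append(tuple(v) + tuple(u))
        doubled.append(tuple(v) + tuple(-x for x in u))
    return doubled
```

As d grows, the largest pairwise distance is 2d (between the two copies of a single v_i) while the smallest is roughly √2·d (between copies of distinct points), so the spread of the doubled set tends to √2, matching the proof.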
More generally, there is a tradeoff between the extra dimensionality and the reduction of ∆. For example, by adding one extra dimension, and taking u_i = di, we can make the spread linear in n.

4. DISCUSSION

4.1 Smoothed Analysis

We have shown k-means can have a superpolynomial running time in the worst case. However, we know the algorithm runs efficiently in practice. It is natural to ask how this discrepancy can be formalized. One natural approach is that of smoothed analysis, which was used by Spielman and Teng [13] to explain the running time of the simplex algorithm. Towards that end, assume that the data points are chosen from independent normal distributions with variance σ². Letting D denote the diameter of the resulting point set, we ask whether k-means is likely to run in time polynomial in n and D/σ.

4.1.1 High Dimension

This question appears to be difficult in general, but a positive result is relatively easy to prove in high dimensions. In this section, we sketch a proof of this fact.

Proposition 4.1. Given data points chosen from independent normal distributions with variance σ² and with dimension d = Ω(n/log n), k-means will execute in polynomial time with high probability.

We analyze the standard k-means potential function. For a means configuration M = (X, C), let Φ(M) = Σ_{i=1}^{n} ||x_i − c_i||², where c_i ∈ C is the cluster center closest to x_i. Clearly, 0 ≤ Φ ≤ nD², and one can also check that Φ is non-increasing throughout an execution of k-means. Therefore, it suffices to show that the potential decreases by a non-trivial amount during each iteration.

On the one hand, it is known that if a cluster center moves by a distance δ during a k-means step and if the cluster has m points at the end of the step, then Φ decreases by at least δ²m (see [7] and [11]). On the other hand, if our data points are random, no two possible centers can be too close. This can be formalized as follows.
Definition 4.1. We say a set of data points X is "ε-separated" if for any non-identical subsets S and T, the centers of mass c(S) and c(T) satisfy ||c(S) − c(T)|| ≥ ε/(2 min(|S|, |T|)).

Lemma 4.2. If X is a set of n data points chosen from independent normal distributions with variance σ², then X is ε-separated with probability at least 1 − 2^{2n}(ε/σ)^d.

We omit the proof of the lemma. Proposition 4.1 follows by choosing ε = σ n^{−1/d} 2^{−2n/d}.

4.1.2 The General Case

Proposition 4.1 shows that k-means runs in polynomial time with high probability in smoothed high-dimensional settings. A similar result holds when d = 1 based on the spread analysis of [7] and the fact that a smoothed point set is likely to have polynomial spread. A much more subtle analysis seems to be required for other values of d. We have recently proven an upper bound of n^{O(k)} · poly(n, D/σ) [2], but it remains a major open problem to find a bound polynomial in n, k and D/σ in the general case.

4.2 Variants

Smoothed analysis provides one very explicit way of circumventing the worst-case performance of k-means. Namely, given an arbitrary data set, we can perturb each point according to an independent normal distribution and then run k-means. Even our simple analysis in high dimensions can be harnessed here by first lifting to n-dimensional space, and then perturbing.

Two other methods also suggest themselves. First of all, k-means is often run in a relatively small number of dimensions, and regardless, one can always reduce d to O(log n) with small distortion [10]. Thus, it is natural to ask how k-means performs for small d. We conjecture that k-means is worst-case superpolynomial if d > 1. Even when d = 1, no strongly polynomial upper bounds are known.

Finally, Har-Peled and Sadri [7] suggested a simple variant of k-means where only one data point is reassigned each iteration. This variant has running time polynomial in n and the spread ∆, which we know is not true for standard k-means. Given this qualitative improvement, a further study of this variant could prove fruitful.

5. REFERENCES

[1] Pankaj K. Agarwal and Nabil H. Mustafa. k-means projective clustering. In PODS '04: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 155-165, New York, NY, USA, 2004. ACM Press.
[2] David Arthur and Sergei Vassilvitskii. Improved smoothed analysis for the k-means method. Manuscript, 2006.
[3] David Arthur and Sergei Vassilvitskii. k-means lower bound implementation. http://www.stanford.edu/~darthur/kMeansLbTest.zip, 2006.
[4] Sanjoy Dasgupta. How fast is k-means? In COLT: Computational Learning Theory, volume 2777, page 735, 2003.
[5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience Publication, 2000.
[6] Frederic Gibou and Ronald Fedkiw. A fast hybrid k-means level set algorithm for segmentation. In 4th Annual Hawaii International Conference on Statistics and Mathematics, pages 281-291, 2005.
[7] Sariel Har-Peled and Bardia Sadri. How fast is the k-means method? Algorithmica, 41(3):185-202, 2005.
[8] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien. Large-scale clustering of cDNA-fingerprinting data. Genome Research, 9:1093-1105, 1999.
[9] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In SCG '94: Proceedings of the Tenth Annual Symposium on Computational Geometry, pages 332-339, New York, NY, USA, 1994. ACM Press.
[10] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.
[11] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. In SCG '02: Proceedings of the Eighteenth Annual Symposium on Computational Geometry, pages 10-18, New York, NY, USA, 2002. ACM Press.
[12] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-136, 1982.
[13] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. J. ACM, 51(3):385-463, 2004.