How Slow is the k-Means Method?

David Arthur*
Stanford University
Stanford, CA
darthur@cs.stanford.edu

Sergei Vassilvitskii†
Stanford University
Stanford, CA
sergei@cs.stanford.edu

ABSTRACT

The k-means method is an old but popular clustering algorithm known for its observed speed and its simplicity. Until recently, however, no meaningful theoretical bounds were known on its running time. In this paper, we demonstrate that the worst-case running time of k-means is superpolynomial by improving the best known lower bound from Ω(n) iterations to 2^Ω(√n).

Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems

General Terms: Algorithms, Theory.

Keywords: k-means, Local Search, Lower Bounds.

1. INTRODUCTION

The k-means method is a well known geometric clustering algorithm based on work by Lloyd in 1982 [12]. Given a set of n data points, the algorithm uses a local search approach to partition the points into k clusters. A set of k initial cluster centers is chosen arbitrarily. Each point is then assigned to the center closest to it, and the centers are recomputed as centers of mass of their assigned points. This is repeated until the process stabilizes. It can be shown that no partition occurs twice during the course of the algorithm, and so the algorithm is guaranteed to terminate.

The k-means method is still very popular today, and it has been applied in a wide variety of areas ranging from computational biology to computer graphics (see [1, 6, 8] for some recent applications). The main attraction of the algorithm lies in its simplicity and its observed speed. Indeed, the running time of k-means is well studied experimentally (see, for example, [7]). In their text on pattern classification, Duda et al. remark that, "In practice the number of iterations is generally much less than the number of points" [5]. However, few meaningful theoretical bounds on the worst-case running time of k-means are known.

*Supported in part by an NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC.
†Supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SCG'06, June 5-7, 2006, Sedona, Arizona, USA.
Copyright 2006 ACM 1-59593-340-9/06/0006 ...$5.00.

1.1 Related Work

There is a trivial upper bound of O(k^n) iterations since no partition of points into clusters is ever repeated during the course of the algorithm. In d-dimensional space, this bound was slightly improved by Inaba et al. [9] to O(n^(kd)) by counting the number of distinct Voronoi partitions on n points.

More recently, Dasgupta [4] presented some tighter results for a few special cases. He demonstrated a worst-case lower bound of Ω(n) iterations, and an upper bound of O(n) for k ≤ 5 and d = 1.

This work was extended by Har-Peled and Sadri [7] in 2005. Again restricting to d = 1, the authors show an upper bound of O(nΔ²), where Δ is the spread of the point set (defined as the ratio between the largest pairwise distance and the smallest pairwise distance). They are unable to bound the running time of k-means in general, but they suggest a few modifications that are easier to analyze. For example, if one reclassifies exactly one point per iteration, then k-means is guaranteed to converge after O(kn²Δ²) iterations in any dimension.
1.2 Our Results

Our main result is a lower bound construction for which the running time of k-means is superpolynomial. In particular, we present a set of n data points and a set of adversarially chosen cluster centers for which the algorithm requires 2^Ω(√n) iterations. We then expand this to show that even if the initial cluster centers are chosen uniformly at random from the data points, the running time is still superpolynomial with high probability. We also show our construction can be modified to have constant spread, thereby disproving a recent conjecture of Har-Peled and Sadri [7].

Explaining the running times observed in practice remains an open problem. As a first step, we show that if the data points are selected from independent normal distributions in Ω(n/log n) dimensions, then k-means will terminate in a polynomial number of steps with high probability. We also briefly discuss several other ways in which one might hope to circumvent the worst-case lower bound.

Figure 1: An idealized "reset widget" that can be used to reset the center of some cluster C after k-means has finished executing: (a) The configuration right before the signaling begins. (b) P switches to cluster C, and the center of A moves away from Q and R. (c) Q switches to cluster C, thereby resetting the center of C. In addition, R switches to cluster B, and the center of B moves towards P and Q. (d) P and Q switch to cluster B. Now C is completely reset.

2. PRELIMINARIES

The k-means algorithm [12] is a method for partitioning data points into clusters. Let X = {x_1, x_2, ..., x_n} be a set of points in R^d. After being seeded with a set of k centers c_1, c_2, ..., c_k in R^d, the algorithm partitions these points into clusters as follows.

1. For each i ∈ {1, ..., k}, set the cluster C_i to be the set of points in X that are closer to c_i than they are to c_j for all j ≠ i.
2. For each i ∈ {1, ..., k}, set c_i to be the center of mass of all points in C_i: c_i = (1/|C_i|) Σ_{x_j ∈ C_i} x_j.
3. Repeat steps 1 and 2 until c_i and C_i no longer change, at which point return the clusters C_i.

If there are two centers equally close to a point in X, we break the tie arbitrarily. If a cluster has no data points at the end of step 2, we eliminate the cluster and continue as before. Our lower bound construction will not rely on either of these degeneracies.

During the analysis it will be useful to talk about a k-means configuration.

Definition 2.1. A k-means configuration M = (X, C) is a set of data points X and a set of cluster centers C = {c_i}_{i=1,...,k}.

Note that a k-means configuration M defines an intermediate point in the execution of the algorithm. Given a k-means configuration M, let T(M) denote the number of iterations required by k-means to converge starting at M. We say that M is non-degenerate if, as the algorithm is run to completion, (a) no point is ever equidistant from the two closest cluster centers and (b) no cluster ever has 0 data points.

3. LOWER BOUNDS

In this section, we demonstrate a k-means configuration which requires 2^Ω(√n) iterations. As the construction is rather involved, we begin with some intuition, and then proceed with a formal proof. We have also implemented the construction in C++ [3].

At the end of the section, we consider a couple of modifications to the main construction. First, we show that even if the starting centers are chosen uniformly at random from the data points, there exist examples where a superpolynomial number of iterations is still required with high probability. Finally, we show how to reduce the spread of any construction at the cost of increasing the dimensionality of the instance.

3.1 Intuition

The main idea behind the lower bound construction is that of a "reset widget". The role of the widget is to recognize when k-means has run to completion and to then reset it back into its initial state. We require the widget to not interfere with the original data points before or after the reset operation, thereby ensuring that the new k-means configuration takes twice as long to run to completion. Our lower bound is obtained by recursively adding reset widgets. By ensuring each widget has O(k) new points and O(1) new clusters, we get the bound of 2^Ω(√n) iterations.

We begin with an idealized description of a reset widget, illustrated in Figure 1. We then briefly mention a few issues that this idealized discussion omits.

3.1.1 A Reset Widget

Suppose we are given a k-means configuration in R² and we are promised that the final center (c_x, c_y) of some cluster C never appears as a cluster center in any previous iteration. We call this a "signaling" k-means configuration. We can detect when k-means has run to completion by lifting the original configuration to R³, and adding a point P = (c_x, c_y, D − ε) in a new cluster A with center at (c_x, c_y, 2D). If D is large and ε is small, then P will switch to C after k-means finishes executing on the original data set, but no earlier. This creates a widget that triggers at the right time.

The next step is to make the widget actually reset C. We do this by augmenting A to also include a point Q = (d_x, d_y, D(1 + ε′)) while maintaining the center of A at (c_x, c_y, 2D). Switching P from A to C causes the center of C to move towards Q, and the center of A to move away from Q. As long as D is sufficiently large, Q will follow P into C on the next iteration, regardless of the values of d_x and d_y. In particular, this means we can choose d_x and d_y so as to reset the center of C to its initial position (at least in the x and y
coordinates).ToavoidchangingC'szcoordinate,wemaketwosymmetricresetwidgets,oneabovethexy-plane,andonebelow.SeeFigure1,parts(a)through(c).Unfortunately,thismethodisnotquitesucient.WehaveresetthecenterofCbyaddingpointstothecluster.Ask-meansprogressesthesecondtimethrough,thesepointswilllingerwithCandprovideaconstantdragbacktoitsoriginalposition.Toactuallymaketheresetcongurationproceedastheoriginaldid,Cmustlosethesenewpointsimmediatelyaftertheresetoccurs.Toensurethishappens,weaddathirdpointRnearthecenterofA,andanewclusterBnearR.NowBwillacquireRduringtheresetprocess,whichmovesitintopositiontorecapturethepointsPandQ.ThewholeprocessisillustratedinFigure1.WehavenowfullyresetC.Applyingthistechniquesimul-taneouslytoeachcluster,wecanhopetodoubletherunningtimeofk-means.3.1.2PitfallsAfewadditionalconsiderationscomeintoplaywhenfor-malizingthisintuition.1.Wecanonlyaddaresetwidgettoasignalingcong-uration.Torecursivelyaddresetwidgets,weneedtoensurethataddingaresetwidgetmaintainsthesig-nalingproperty.2.Wecanonlyresetaclusterifthatspecicclustersig-naledonthenaliteration.Thus,weneedtobeabletotakeasignalingcongurationandaugmentitsothateachclustersimultaneouslysignals.3.Wecannotaordtodoublethenumberofclustersbyaddingadierentresetwidgetforeachcluster.In-stead,wemusthaveonewidgetresetallclustersatthesametime.Toaccomplishthis,itisconvenienttohavetheresetwidgetclustercenteredequallyfarfromeachsignal,whichrequiresplacingcertainclustercen-tersonahypersphere.3.2TheFormalConstructionWenowformallypresenttheresetwidget.Thisrequiresacarefulplacementofpointsandclustercenters,buttheintuitionisexactlytheonedescribedabove.Werststateourmainresults.Definition3.1.Ameanscongurationissaidtobesig-nalingifatleastonenalclustercenterisdistinctfromeveryclustercenterarisinginpreviousiterations.Theorem3.1.LetMbeasignaling,non-degeneratemeanscongurationonndatapointswithkclusters.Thenthereexistsasignaling,non-degeneratemeanscongurationNonn+O(k)datapointswithk+O(1)clusterssuchthatT(N)2T(M).Startingwithanarbitrarycongu
ration,wecanapplythisconstructionttimestoobtainameanscongurationwithO(t2)pointsandO(t)clustersforwhichT(M)2t.Thesuperpolynomialcomplexityofk-meansfollowsimme-diately.Corollary3.2.Theworst-casecomplexityofk-meansonndatapointsis2\n(p n).WeproveTheorem3.1intwoparts.First,weshowthatparticulartypesofmeanscongurations,calledsuper-signa-lingcongurations,canbeslightlyenlargedtocreatenon-degenerate,signalingmeanscongurationswithtwicethecomplexity.Wethenshowhowtoslightlyenlargenon-de-generate,signalingmeanscongurationstoobtainsuper-signalingcongurations,therebyestablishingtherecursion.Definition3.2.AmeanscongurationsMissaidtobesuper-signalingifithasthefollowingproperties.1.Thenalpositionsofallclustercenterslieonahy-persphere.2.Thenalpositionsofallclustercentersaredistinctfromallclustercentersarisinginpreviousiterations.3.ThereexistsameanscongurationM0withthesamesetofdatapointsasMandwiththesamenumberofclustersasM.Furthermore,T(M0)=T(M)andatleastonenalclustercenterinM0isdistinctfromanyotherclustercenterarisinginalliterationsstartingfromMandM0.Lemma3.3.LetMbeasuper-signaling,non-degeneratemeanscongurationonndatapointswithkclusters.Thenthereexistsasignaling,non-degeneratemeanscongurationNonn+O(k)datapointswithk+O(1)clusterssuchthatT(N)2T(M).Proof.Webeginwithaformaldenitionofourcon-struction,andthentracetheexecutionofk-meansinTable1andFigures3-7.LetM0begivenasinDenition3.2.LabeltheclustersinMandM0with1throughk,andletxi;t(respectivelyyi;t)denotethecenterofclusteriinM(respectivelyM0)aftertiterations.Alsolet~xidenotethenalcenterofclusteriinMandletnidenotethenalnumberofdatapointsinclusteri.SinceMissuper-signaling,wemayassumewithoutlossofgeneralitythatk~xikisindependentofi(e.g.thecenterofthehyperspherepassingthroughthexi'sliesattheorigin).Finally,letzi=1 2((ni+4)yi;0 (ni+2)~xi).LetV(M)denotethedatapointsinMandlet`denotethediameteroff0;zi;V(M)g.Letd,randbesuchthatdr`0andletd0besuchthat(d0)2=d2+k~xik2 .Finally,letu1;u2;:::;ukandv1;v2;:::;vkbevectorsinR2suchthat(a)kuik=ni+2 2,(b)vi=ui 
kuik,and(c)vi=vjforalli;j.NowconsiderthefollowingpointsinSpan(V(M))RR2R,Pi=(~xi;d0;rui;0)forik;P0i=( ~xi;d0+2d; rui;0)forik;Qi=(zi;d0+0:001d;rvi;0)forik;Q0i=( zi;d0+1:999d; rvi;0)forik;A=(0;d0+0:99d;0;0);A0=(0;d0+1:01d;0;0);X=(0;d0+0:99d;0;0:2d);X0=(0;d0+1:01d;0;0:2d):ForeachsuchpointZ,wedene Ztobethere\rectionofPaboutthehyperplaneSpan(V(M))f0gR2R|i.e. Pihascoordinates(~xi; d0;rui;0).LetV(N)denotethesetofallthesepointsalongwiththenaturalembeddingofV(M) V(M)(`)0:02d0:02d0:99d0:989d0:989d(r)0:001d0:02d0:001dd0dd0d0:001d0:02d0:001d0:99d A0AQi X0 P0i Q0iPiX A Qi PiA0 XX0Q0iP0i Figure2:ThedatapointsconstructedinLemma3.3(Figurenottoscale).Notedr`. t ClustersofN 0,...,T(M) Ci=Mi;twithcenter=(xi;t;0;0;0) G=fPi;P0i;Qi;Q0i;A;A0gwithcenter=(0;d0+d;0;0) H=fXgwithcenter=(0;d0+0:99d;0;0:2d) H0=fX0gwithcenter=(0;d0+1:01d;0;0:2d) T(M)+1 Ci=~Mi[fPi; Pigwithcenter=(~xi;0;rvi;0) G=fP0i;Qi;Q0i;A;A0gwithcenter(O(`);d0+d;O(rn);0)with1:254=3 H=fXgwithcenter=(0;d0+0:99d;0;0:2d) H0=fX0gwithcenter=(0;d0+1:01d;0;0:2d) T(M)+2 Ci=~Mi[fPi;Qi; Pi; Qigwithcenter=(yi;0;0;rvi;0) G=fP0i;Q0igwithcenter=(O(`);d0+1:9995d;O(rn);0) H=fA;Xgwithcenter=(0;d0+0:99d;0;0:1d) H0=fA0;X0gwithcenter=(0;d0+1:01d;0;0:1d) T(M)+3 Ci=M0i;1withcenter=(yi;1;0;0;0) G=fP0i;Q0igwithcenter=(O(`);d0+1:9995d;O(rn);0) H=fA;X;Pi;Qigwithcenter=(O(`);d0+0:0005d+0:9895 k+1d;O(rn); 2k+2) H0=fA0;X0gwithcenter=(0;d0+1:01d;0;0:1d) T(M)+4,..., Ci=M0i;t T(M) 2withcenter=(yi;t T(M) 2;0;0;0) 2T(M)+2 G=fP0i;Q0igwithcenter=(O(`);d0+1:9995d;O(rn);0) H=fPi;Qigwithcenter=(O(`);d0+0:0005d;O(rn);0) H0=fA;A0;X;X0gwithcenter=(0;d0+d;0;0:1d) Table1:TheclustersofNaftertiterationsofk-means(seeLemma3.3).Mi;t(respectivelyM0i;t)denotesthepointsinclusterofiofM(respectivelyM0)aftertiterations,and~MidenotesthenalpointsinclusteriofM.Alltableentriesdescribeclustersimmediatelyafterthecentersarerecomputed.Ratherthangoingthougheverycalculation,wediscussthekeyelementsinthefollowinggures. 
Figure 3: Clustering at 0 ≤ t ≤ T(M) (see Lemma 3.3). The clusters contained within V(M) proceed independently of the other points. The remaining clusters are precarious but temporarily stable. For example, to see that P_i does not switch from cluster G to C_j, note that the distance squared from P_i to the center of C_j minus the distance squared from P_i to the center of G is (‖x̃_i − x_{j,t}‖² + (d′)² + ‖r·u_i‖²) − (‖x̃_i‖² + d² + ‖r·u_i‖²) = ‖x̃_i − x_{j,t}‖² − ε ≥ 0. The last inequality follows from the fact that ε ≪ ℓ and that, since M is super-signaling, x̃_i ≠ x_{j,t}.

Figure 4: Clustering at t = T(M) + 1 (see Lemma 3.3). We now have x_{i,t} = x̃_i for all i, and thus by the calculation in the previous step, each P_i switches to cluster C_i. Clearly, this will result in a substantial shift of the center of G (and similarly of Ḡ). Furthermore, the u_i's have been chosen so that the center of C_i becomes (x̃_i, 0, r·v_i, 0).

Figure 5: Clustering at t = T(M) + 2 (see Lemma 3.3). First consider V(M). These points continue to be closer to the C_i's than to other clusters. Each C_i center has moved since the previous iteration, but they have all moved by a constant amount (namely r‖v_i‖ = r) in a direction orthogonal to Span(V(M)). Therefore, the closest center to each point in V(M) has not changed, and thus these points remain in their current clusters. On the other hand, since the center of G moved away, A, A′, and Q_i all switch to different clusters. The first two clearly switch to H and H′, but Q_i could reasonably switch to either H or any C_j. The distance squared from Q_i to the center of C_j is (1.001d)² + r²‖v_i − v_j‖² + O(ℓ²), which is minimized when i = j. The distance squared from Q_i to the center of H is (0.989d)² + (0.2d)² + O(r²). Since 0.989² + 0.2² ≥ 1.001² and d ≫ r, ℓ, it follows that Q_i will in fact switch to C_i. Note that the analysis so far does not depend on the V(M)-coordinate of any Q_i, so we may choose those to make the V(M)-coordinate of each C_i equal to y_{i,0} at the end of this step.
Figure 6: Clustering at t = T(M) + 3 (see Lemma 3.3). By acquiring A, cluster H has moved closer to the other points. In fact, the distance squared from P_i to the center of H is now (0.99d)² + (0.1d)² + O(r²) ≤ d². Thus, each P_i switches to H, and a similar calculation shows each Q_i also switches to H. Now consider V(M). As in the previous step, we may ignore the r·v_i component of each C_i center. The V(M) component of each C_i center is now y_{i,0}, which means the clustering proceeds according to M′, and the points in V(M) associated with C_i at the end of this step are M′_{i,1}.

Figure 7: Clustering at T(M) + 4 ≤ t ≤ 2T(M) + 2 (see Lemma 3.3). The center of H moves because P_i and Q_i have been absorbed into H. Also A and X switch to H′. Beyond that, the configuration is now very stable, and the clustering on V(M) will proceed normally according to M′.

in Span(V(M)) × {0} × {(0, 0)} × {0}. This setup is illustrated in Figure 2.

We also define clusters with initial centers in Span(V(M)) × R × R² × R as follows:

C_i with center = (x_{i,0}, 0, 0, 0) for i ≤ k;
G with center = (0, d′ + d, 0, 0);
H with center = (0, d′ + 0.99d, 0, 0.2d);
H′ with center = (0, d′ + 1.01d, 0, 0.2d).

For each such cluster C other than the C_i's, we define C̄ to be a cluster whose initial center is obtained by reflecting the initial center of C about the hyperplane Span(V(M)) × {0} × R² × R.

Let N denote the k-means configuration with all these cluster centers and with data points V(N). We trace the evolution of k-means on N via Table 1 and Figures 3-7. Based upon this, we see that T(N) ≥ T(M) + T(M′) = 2T(M), and that N is non-degenerate and signaling. Since N has n + O(k) data points and k + O(1) clusters, the result follows. □
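Every step of the trace above is just a repetition of steps 1 and 2 from Section 2, and T(·) counts those repetitions. For reference, that loop can be sketched as follows. This is an illustrative sketch, not the authors' C++ implementation [3]; the function name and the tuple representation of points are ours, ties are broken by lowest center index, and emptied clusters are dropped, per the conventions of Section 2.

```python
def kmeans(points, centers):
    """Run the k-means method to convergence.

    points, centers: lists of equal-dimension tuples.
    Returns (clusters, iterations), where iterations plays the role of T(M).
    """
    def d2(p, q):  # squared Euclidean distance
        return sum((a - b) ** 2 for a, b in zip(p, q))

    iterations = 0
    while True:
        # Step 1: assign each point to its closest center (ties -> lowest index).
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: d2(p, centers[j]))
            clusters[i].append(p)
        # Drop clusters that received no points (the degeneracy of Section 2).
        pairs = [(c, cl) for c, cl in zip(centers, clusters) if cl]
        clusters = [cl for _, cl in pairs]
        # Step 2: move each surviving center to the mean of its points.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) for cl in clusters
        ]
        iterations += 1
        if new_centers == [c for c, _ in pairs]:  # step 3: stop when stable
            return clusters, iterations
        centers = new_centers
```

On a well-separated two-cluster input this stabilizes in a couple of iterations; the constructions of this section are designed precisely to prevent such fast stabilization.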
This completes the first half of our construction, in which we transform a super-signaling configuration into a signaling configuration with twice the complexity. We now show how to transform a signaling configuration into a super-signaling configuration with equal complexity.

Lemma 3.4. Let N be a signaling, non-degenerate k-means configuration on n data points with k clusters. Then there exists a super-signaling, non-degenerate k-means configuration M on n + O(k) data points with k + O(1) clusters such that T(M) ≥ T(N).

Proof. Let x_{i,t} denote the center of cluster i in N after t iterations and let x̃_i denote the final center of cluster i in N. Since N is signaling, we may assume without loss of generality that x̃_1 is distinct from all other x_{i,t}. Let V(N) denote the set of data points in N and let ℓ denote the diameter of V(N). Let d and ε be such that d ≫ ℓ ≫ ε, and let d′ be such that (d′)² = d² − ε. Also, let a, b and c be points in Span(V(N)) such that b = (a + c)/2 and such that the distance from a to V(N) is much larger than both ℓ and ‖c − a‖.

Now, take δ = 1/(3k + 9), and consider the following points in Span(V(N)) × R:

P = (x̃_1, d′);
X_i = (x̃_i, d′ + δd) for i ≤ k;
A, B, C = (a, 0), (b, 0), (c, 0);
A′, B′, C′ = (a, d′ + δd), (b, d′ + δd), (c, d′ + δd);
Q = ((k + 4)x̃_1 − Σ_i x̃_i − 3b, d′ + (k + 14/3)d).

For each such point Z ∉ {A, B, C}, we also define Z̄ to be the reflection of Z about the hyperplane Span(V(N)) × {0}. Let V(M) denote the set of all these points as well as the natural embedding of V(N) in Span(V(N)) × {0}. This is illustrated in Figure 8.

We also define clusters with centers in Span(V(N)) × R as follows:

C_i with center = (x_{i,0}, 0) for i ≤ k;
H with center = ((a + b)/2, 0);
H′ with center = (c, 0);
J with center = (x_{1,0}, d′ + d);
J̄ with center = (x_{1,0}, −d′ − d).

Let M denote the k-means configuration with all these cluster centers and with data points V(M). We trace the evolution of k-means on M via Table 2 and Figures 9-11. Based upon this, we see that T(M) ≥ T(N), that M is non-degenerate, and also that the final cluster sets of M are distinct from all cluster sets arising in previous configurations.

Also let M′ denote the k-means configuration with data points V(M) and with cluster centers as above except with H centered at (a, 0) and H′ centered at ((b + c)/2, 0). Then, the same calculation shows that T(M′) = T(M) and that the final cluster set for H is distinct from all other cluster sets arising in M or M′.

Finally, since M and M′ are non-degenerate, there exists an ε′ > 0 such that we may move each data point by up to ε′ without altering the k-means execution. Taking advantage of this, we can ensure that the centers of distinct cluster sets are distinct, and that the final cluster centers of M′ lie on a hypersphere. This makes M super-signaling, and the result follows. □

Theorem 3.1 follows immediately from Lemma 3.3 and Lemma 3.4.

Figure 8: The data points constructed in Lemma 3.4. Note d ≫ ℓ.

Table 2: The clusters of M after t iterations of k-means (see Lemma 3.4). N_{i,t} denotes the points in cluster i of N after t iterations, and Ñ_i denotes the final points in cluster i of N. All table entries describe clusters immediately after the centers are recomputed.

t = 0, ..., T(N):
  C_i = N_{i,t} with center = (x_{i,t}, 0) for 1 ≤ i ≤ k
  H = {A, B} with center = ((a + b)/2, 0)
  H′ = {C} with center = (c, 0)
  J = {P, X_i, A′, B′, C′, Q} with center = (x̃_1, d′ + d)

t = T(N) + 1:
  C_1 = Ñ_1 ∪ {P, P̄} with center = (x̃_1, 0)
  C_i = Ñ_i with center = (x̃_i, 0) for 2 ≤ i ≤ k
  H = {A, B} with center = ((a + b)/2, 0)
  H′ = {C} with center = (c, 0)
  J = {X_i, A′, B′, C′, Q} with center = (x̃_1, d′ + d + d/(k + 4))

t = T(N) + 2:
  C_1 = Ñ_1 ∪ {P, X_1, P̄, X̄_1} with center = (x̃_1, 0)
  C_i = Ñ_i ∪ {X_i, X̄_i} with center = (x̃_i, 0) for 2 ≤ i ≤ k
  H = {A, B, A′, B′, Ā′, B̄′} with center = ((a + b)/2, 0)
  H′ = {C, C′, C̄′} with center = (c, 0)
  J = {Q} with center = ((k + 4)x̃_1 − Σ_i x̃_i − 3b, d′ + (k + 14/3)d)

Rather than going through every calculation, we discuss the key elements in the following figures.
Figure 9: Clustering at 0 ≤ t ≤ T(N) (see Lemma 3.4). As with the first part of the construction for Lemma 3.3, the clusters contained within V(N) proceed independently of the other points. The remaining clusters are precarious but temporarily stable. For example, to see that P does not switch from cluster J to C_i, note that the distance squared from P to the center of C_i minus the distance squared from P to the center of J is ‖x̃_1 − x_{i,t}‖² + (d′)² − d² = ‖x̃_1 − x_{i,t}‖² − ε ≥ 0. The last inequality follows from the fact that ε ≪ ℓ and that, since N is signaling, x̃_1 ≠ x_{i,t}.

Figure 10: Clustering at t = T(N) + 1 (see Lemma 3.4). We now have x_{1,t} = x̃_1, and thus by the calculation in the previous step, P switches to cluster C_1. Since P̄ also switches to cluster C_1, the center of C_1 does not change. However, the centers of J and J̄ both move slightly further away from the other points.

Figure 11: Clustering at t = T(N) + 2 (see Lemma 3.4). The points X_i, A′, B′, C′ were all chosen to be only barely stable within J. Thus, after the center of J moves, these points switch to the closest clusters in V(N). For example, the distance from X_i to the center of C_i is approximately d + d/(3k + 9), and the distance from X_i to the center of J is approximately d + d/(k + 4) − d/(3k + 9) ≥ d + d/(3k + 9). Again, only the centers of J and J̄ move as a result of this, and it is easy to check the new configuration is stable.
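With both lemmas traced, it may help to tabulate how the recursion behind Theorem 3.1 compounds into Corollary 3.2. The sketch below uses hypothetical unit constants in place of the O(k) new points and O(1) new clusters added per application; only the growth rates matter.

```python
def recursion_growth(t, n0=1, k0=1):
    """Apply the Theorem 3.1 recursion t times.

    Unit constants stand in for the O(k) points / O(1) clusters each
    widget adds; returns the resulting (points, clusters, iterations).
    """
    n, k, T = n0, k0, 1
    for _ in range(t):
        n += k   # each application adds O(k) new data points
        k += 1   # and O(1) new clusters
        T *= 2   # and at least doubles the running time: T(N) >= 2 T(M)
    return n, k, T
```

Since n grows quadratically in t while T doubles each time, t applications give O(t²) points with T ≥ 2^t, and solving t = Ω(√n) yields the 2^Ω(√n) bound of Corollary 3.2.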
3.3 Probability Boosting

The construction used to prove Theorem 3.1 requires both a specific set of data points and a specific set of cluster centers. In practice, however, only the data points are specified and the initial cluster centers are chosen by the algorithm. Typically, the initial centers are chosen uniformly at random from the data points. Given this, one might ask if the superpolynomial lower bound can actually arise with non-vanishing probability.

In this section, we show how to modify our lower bound construction to apply with high probability even if the cluster centers are chosen randomly from the existing data points. It follows that k-means can still be very slow for certain sets of data points, even accounting for the random choice of cluster centers.

Proposition 3.5. Let M be a k-means configuration on n points. Then, there exists a set of O(n³ log n) points such that if a k-means configuration N is constructed with these data points and with 4n log n cluster centers chosen randomly from the set of data points, then T(N) ≥ T(M) with probability 1 − O(1/n).

Proof. Let k be the number of clusters in M. For i ≤ k and j ≤ m, let u_{i,j} denote orthogonal unit vectors in R^(mk). Let V(M) denote the set of data points in M and let ℓ denote the diameter of V(M). Let d, r and ε be such that d ≫ r ≫ ℓ ≫ ε. Also, let n_i denote the number of points in cluster i in M after one iteration. Replacing M with two identical overlapping copies if necessary, we may assume that n_i > 1. Finally, let x_{i,t} denote the center of cluster i in M after t iterations.

Let m be a positive integer to be fixed later and consider the point set in Span(V(M)) × R^(km) × R obtained by first embedding two copies of V(M) at Span(V(M)) × {0} × {0} and then adding the following points.

1. P_{i,j} = (x_{i,0}, Σ_{(i′,j′) ≠ (i,j)} r·u_{i′,j′}, d + jε) for i ≤ k, j ≤ m.
2. Q_{i,ℓ} = ((n_i·x_{i,1} − x_{i,0})/(n_i − 1), Σ_{i′ ≠ i} Σ_{j′} r·u_{i′,j′}, d − ℓε) for i ≤ k and ℓ ≤ n_i − 1.
3. O_j = (0, Σ_{i′} Σ_{j′} r·u_{i′,j′}, d + jε) for j ≤ m.

Consider a k-means configuration N with these data points and with 4n log n cluster centers chosen from these points at random. Let Ω_0 = {O_1, O_2, ..., O_m} and, for i > 0, let Ω_i = {P_{i,1}, P_{i,2}, ..., P_{i,m}}. Suppose that N begins with all of its cluster centers in Ω = ∪_i Ω_i and that each Ω_i has at least one cluster center. One can check that T(N) ≥ T(M) in this case.

Now, let m = (n³ log n)/k. Then, each cluster center will be in some Ω_i with probability 1 − O(1/(n² log n)). Since there are 4n log n clusters, all clusters will be in Ω with probability 1 − O(1/n). Furthermore, the probability that no cluster center is chosen in a fixed Ω_i is at most (1 − 1/(2k))^(4n log n) ≤ 1/n². Thus, each Ω_i has at least one cluster center with probability 1 − O(1/n). The result now follows. □

3.4 Low Spread

Recall the spread Δ of a point set is the ratio of the largest pairwise distance to the smallest pairwise distance. Har-Peled and Sadri [7] conjectured that k-means might run in time polynomial in n and Δ. In this section, however, we show that the spread can be reduced to O(1) without decreasing the number of iterations required.

Proposition 3.6. Let M be a k-means configuration on n points. Then, there exists a k-means configuration N on 2n points such that N has O(1) spread and such that T(N) = T(M).

Proof. Let V(M) denote the points in M, and choose an arbitrary set of vectors u_1, u_2, ..., u_n. For each v_i ∈ V(M), we replace v_i with x_i = (v_i, u_i) and y_i = (v_i, −u_i) in Span(V(M)) × Span(u_1, u_2, ..., u_n). Let N denote the k-means configuration with these data points and with centers (c_j, 0) for each center c_j in M. It is easy to check that cluster C in N contains x_i and y_i after t iterations if and only if cluster C in M contains v_i after t iterations. It follows that T(N) = T(M). Taking the u_i to be orthogonal and of common length d ≫ ℓ, we can make N have spread arbitrarily close to √2. □
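The lifting used in this proof is easy to check numerically. In the sketch below, `spread` and `lift` are our names (not from the paper), and `math.dist` requires Python 3.8 or later. Each v_i receives its own orthogonal coordinate ±d, so all cross distances land near √2·d while each x_i, y_i pair sits at exactly 2d.

```python
import itertools
import math

def spread(points):
    """Ratio of the largest to the smallest pairwise distance."""
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return max(dists) / min(dists)

def lift(points, d):
    """Proposition 3.6 lift (sketch): give each point its own orthogonal
    dimension and replace v_i by x_i = (v_i, d*e_i) and y_i = (v_i, -d*e_i),
    producing 2n points whose spread approaches sqrt(2) as d grows."""
    n = len(points)
    lifted = []
    for i, v in enumerate(points):
        u = [0.0] * n
        u[i] = d
        lifted.append(tuple(v) + tuple(u))                # x_i = (v_i,  u_i)
        lifted.append(tuple(v) + tuple(-c for c in u))    # y_i = (v_i, -u_i)
    return lifted
```

As d grows relative to the original diameter, the largest pairwise distance tends to 2d (the x_i, y_i pairs) and the smallest to √2·d, so the spread of the lifted set tends to 2d/(√2·d) = √2, matching the proposition.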
More generally, there is a tradeoff between the extra dimensionality and the reduction of Δ. For example, by adding one extra dimension, and taking u_i = d·i, we can make the spread linear in n.

4. DISCUSSION

4.1 Smoothed Analysis

We have shown k-means can have a superpolynomial running time in the worst case. However, we know the algorithm runs efficiently in practice. It is natural to ask how this discrepancy can be formalized. One natural approach is that of smoothed analysis, which was used by Spielman and Teng [13] to explain the running time of the simplex algorithm. Towards that end, assume that the data points are chosen from independent normal distributions with variance σ². Letting D denote the diameter of the resulting point set, we ask whether k-means is likely to run in time polynomial in n and D/σ.

4.1.1 High Dimension

This question appears to be difficult in general, but a positive result is relatively easy to prove in high dimensions. In this section, we sketch a proof of this fact.

Proposition 4.1. Given data points chosen from independent normal distributions with variance σ² and with dimension d = Ω(n/log n), k-means will execute in polynomial time with high probability.

We analyze the standard k-means potential function. For a k-means configuration M = (X, C), let Φ(M) = Σ_{i=1}^n ‖x_i − c_i‖², where c_i ∈ C is the cluster center closest to x_i. Clearly, 0 ≤ Φ ≤ nD², and one can also check that Φ is non-increasing throughout an execution of k-means. Therefore, it suffices to show that the potential decreases by a non-trivial amount during each iteration.

On the one hand, it is known that if a cluster center moves by a distance ε during a k-means step and if the cluster has m points at the end of the step, then Φ decreases by at least ε²m (see [7] and [11]). On the other hand, if our data points are random, no two possible centers can be too close. This can be formalized as follows.
Definition 4.1. We say a set of data points X is "ε-separated" if for any non-identical subsets S and T, the centers of mass c(S) and c(T) satisfy ‖c(S) − c(T)‖ ≥ ε/(2·min(|S|, |T|)).

Lemma 4.2. If X is a set of n data points chosen from independent normal distributions with variance σ², then X is ε-separated with probability at least 1 − 2^(2n)·(ε/σ)^d.

We omit the proof of the Lemma. Proposition 4.1 follows by choosing ε = σ·n^(−1/d)·2^(−2n/d).

4.1.2 The General Case

Proposition 4.1 shows that k-means runs in polynomial time with high probability in smoothed high-dimensional settings. A similar result holds when d = 1 based on the spread analysis of [7] and the fact that a smoothed point set is likely to have polynomial spread. A much more subtle analysis seems to be required for other values of d. We have recently proven an upper bound of n^(O(k))·poly(n, D/σ) [2], but it remains a major open problem to find a bound polynomial in n, k and D/σ in the general case.

4.2 Variants

Smoothed analysis provides one very explicit way of circumventing the worst-case performance of k-means. Namely, given an arbitrary data set, we can perturb each point according to an independent normal distribution and then run k-means. Even our simple analysis in high dimensions can be harnessed here by first lifting to n-dimensional space, and then perturbing.

Two other methods also suggest themselves. First of all, k-means is often run in a relatively small number of dimensions, and regardless, one can always reduce d to O(log n) with small distortion [10]. Thus, it is natural to ask how k-means performs for small d. We conjecture that k-means is worst-case superpolynomial if d > 1. Even when d = 1, no strongly polynomial upper bounds are known.

Finally, Har-Peled and Sadri [7] suggested a simple variant of k-means where only one data point is reassigned each iteration. This variant has running time polynomial in n and the spread Δ, which we know is not true for standard k-means. Given this qualitative improvement, a further study of this variant could prove fruitful.

5. REFERENCES

[1] Pankaj K. Agarwal and Nabil H. Mustafa. k-means projective clustering. In PODS '04: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 155-165, New York, NY, USA, 2004. ACM Press.
[2] David Arthur and Sergei Vassilvitskii. Improved smoothed analysis for the k-means method. Manuscript, 2006.
[3] David Arthur and Sergei Vassilvitskii. k-means lower bound implementation. http://www.stanford.edu/~darthur/kMeansLbTest.zip, 2006.
[4] Sanjoy Dasgupta. How fast is k-means? In COLT: Computational Learning Theory, volume 2777, page 735, 2003.
[5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience Publication, 2000.
[6] Frederic Gibou and Ronald Fedkiw. A fast hybrid k-means level set algorithm for segmentation. In 4th Annual Hawaii International Conference on Statistics and Mathematics, pages 281-291, 2005.
[7] Sariel Har-Peled and Bardia Sadri. How fast is the k-means method? Algorithmica, 41(3):185-202, 2005.
[8] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien. Large-scale clustering of cDNA-fingerprinting data. Genome Research, 9:1093-1105, 1999.
[9] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In SCG '94: Proceedings of the Tenth Annual Symposium on Computational Geometry, pages 332-339, New York, NY, USA, 1994. ACM Press.
[10] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.
[11] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. In SCG '02: Proceedings of the Eighteenth Annual Symposium on Computational Geometry, pages 10-18, New York, NY, USA, 2002. ACM Press.
[12] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-136, 1982.
[13] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. J. ACM, 51(3):385-463, 2004.