
How Slow is the k-Means Method?

David Arthur*, Stanford University, Stanford, CA, darthur@cs.stanford.edu
Sergei Vassilvitskii†, Stanford University, Stanford, CA, sergei@cs.stanford.edu

* Supported in part by an NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC.
† Supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC.

ABSTRACT
The k-means method is an old but popular clustering algorithm known for its observed speed and its simplicity. Until recently, however, no meaningful theoretical bounds were known on its running time. In this paper, we demonstrate that the worst-case running time of k-means is superpolynomial by improving the best known lower bound from Ω(n) iterations to 2^Ω(√n).

Categories and Subject Descriptors: F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems
General Terms: Algorithms, Theory.
Keywords: k-means, Local Search, Lower Bounds.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SCG'06, June 5-7, 2006, Sedona, Arizona, USA.
Copyright 2006 ACM 1-59593-340-9/06/0006 ...$5.00.

1. INTRODUCTION
The k-means method is a well known geometric clustering algorithm based on work by Lloyd in 1982 [12]. Given a set of n data points, the algorithm uses a local search approach to partition the points into k clusters. A set of k initial cluster centers is chosen arbitrarily. Each point is then assigned to the center closest to it, and the centers are recomputed as centers of mass of their assigned points. This is repeated until the process stabilizes. It can be shown that no partition occurs twice during the course of the algorithm, and so the algorithm is guaranteed to terminate.

The k-means method is still very popular today, and it has been applied in a wide variety of areas ranging from computational biology to computer graphics (see [1, 6, 8] for some recent applications). The main attraction of the algorithm lies in its simplicity and its observed speed.

Indeed, the running time of k-means is well studied experimentally (see, for example, [7]). In their text on pattern classification, Duda et al. remark that, "In practice the number of iterations is generally much less than the number of points" [5]. However, few meaningful theoretical bounds on the worst-case running time of k-means are known.

1.1 Related Work
There is a trivial upper bound of O(k^n) iterations since no partition of points into clusters is ever repeated during the course of the algorithm. In d-dimensional space, this bound was slightly improved by Inaba et al. [9] to O(n^{kd}) by counting the number of distinct Voronoi partitions on n points.

More recently, Dasgupta [4] presented some tighter results for a few special cases. He demonstrated a worst-case lower bound of Ω(n) iterations, and an upper bound of O(n) for k ≤ 5 and d = 1. This work was extended by Har-Peled and Sadri [7] in 2005. Again restricting to d = 1, the authors show an upper bound of O(nΔ²), where Δ is the spread of the point set (defined as the ratio between the largest pairwise distance and the smallest pairwise distance). They are unable to bound the running time of k-means in general, but they suggest a few modifications that are easier to analyze. For example, if one reclassifies exactly one point per iteration, then k-means is guaranteed to converge after O(kn²Δ²) iterations in any dimension.

1.2 Our Results
Our main result is a lower bound construction for which the running time of k-means is superpolynomial. In particular, we present a set of n data points and a set of adversarially chosen cluster centers for which the algorithm requires 2^Ω(√n) iterations. We then expand this to show that even if the initial cluster centers are chosen uniformly at random from the data points, the running time is still superpolynomial with high probability. We also show our construction can be modified to have constant spread, thereby disproving a recent conjecture of Har-Peled and Sadri [7].

Explaining the running times observed in practice remains an open problem. As a first step, we show that if the data points are selected from independent normal distributions in Ω(n/log n) dimensions, then k-means will terminate in a polynomial number of steps with high probability. We also briefly discuss several other ways in which one might hope to circumvent the worst-case lower bound.

Figure 1: An idealized "reset widget" that can be used to reset the center of some cluster C after k-means has finished executing: (a) The configuration right before the signaling begins. (b) P switches to cluster C, and the center of A moves away from Q and R. (c) Q switches to cluster C, thereby resetting the center of C. In addition, R switches to cluster B, and the center of B moves towards P and Q. (d) P and Q switch to cluster B. Now C is completely reset.

2. PRELIMINARIES
The k-means algorithm [12] is a method for partitioning data points into clusters. Let X = {x_1, x_2, ..., x_n} be a set of points in R^d. After being seeded with a set of k centers c_1, c_2, ..., c_k in R^d, the algorithm partitions these points into clusters as follows.

1. For each i ∈ {1, ..., k}, set the cluster C_i to be the set of points in X that are closer to c_i than they are to c_j for all j ≠ i.
2. For each i ∈ {1, ..., k}, set c_i to be the center of mass of all points in C_i: c_i = (1/|C_i|) Σ_{x_j ∈ C_i} x_j.
3. Repeat steps 1 and 2 until c_i and C_i no longer change, at which point return the clusters C_i.

If there are two centers equally close to a point in X, we break the tie arbitrarily. If a cluster has no data points at the end of step 2, we eliminate the cluster and continue as before. Our lower bound construction will not rely on either of these degeneracies.

During the analysis it will be useful to talk about a k-means configuration.

Definition 2.1. A k-means configuration M = (X, C) is a set of data points X and a set of cluster centers C = {c_i}_{i=1,...,k}.

Note that a k-means configuration M defines an intermediate point in the execution of the algorithm. Given a k-means configuration M, let T(M) denote the number of iterations required by k-means to converge starting at M. We say that M is non-degenerate if, as the algorithm is run to completion, (a) no point is ever equidistant from the two closest cluster centers and (b) no cluster ever has 0 data points.

3. LOWER BOUNDS
In this section, we demonstrate a k-means configuration which requires 2^Ω(√n) iterations. As the construction is rather involved, we begin with some intuition, and then proceed with a formal proof. We have also implemented the construction in C++ [3].

At the end of the section, we consider a couple of modifications to the main construction. First, we show that even if the starting centers are chosen uniformly at random from the data points, there exist examples where a superpolynomial number of iterations is still required with high probability. Finally, we show how to reduce the spread of any construction at the cost of increasing the dimensionality of the instance.

3.1 Intuition
The main idea behind the lower bound construction is that of a "reset widget". The role of the widget is to recognize when k-means has run to completion and to then reset it back into its initial state. We require the widget to not interfere with the original data points before or after the reset operation, thereby ensuring that the new k-means configuration takes twice as long to run to completion. Our lower bound is obtained by recursively adding reset widgets. By ensuring each widget has O(k) new points and O(1) new clusters, we get the bound of 2^Ω(√n) iterations.

We begin with an idealized description of a reset widget, illustrated in Figure 1. We then briefly mention a few issues that this idealized discussion omits.

3.1.1 A Reset Widget
Suppose we are given a k-means configuration in R² and we are promised that the final center (c_x, c_y) of some cluster C never appears as a cluster center in any previous iteration. We call this a "signaling" k-means configuration. We can detect when k-means has run to completion by lifting the original configuration to R³, and adding a point P = (c_x, c_y, D) in a new cluster A with center at (c_x, c_y, 2D). If D is large and ε is small, then P will switch to C after k-means finishes executing on the original data set, but no earlier. This creates a widget that triggers at the right time.

The next step is to make the widget actually reset C. We do this by augmenting A to also include a point Q = (d_x, d_y, D(1 + ε′)) while maintaining the center of A at (c_x, c_y, 2D). Switching P from A to C causes the center of C to move towards Q, and the center of A to move away from Q. As long as D is sufficiently large, Q will follow P into C on the next iteration, regardless of the values of d_x and d_y. In particular, this means we can choose d_x and d_y so as to reset the center of C to its initial position (at least in the x and y coordinates). To avoid changing C's z coordinate, we make two symmetric reset widgets, one above the xy-plane, and one below. See Figure 1, parts (a) through (c).

Unfortunately, this method is not quite sufficient. We have reset the center of C by adding points to the cluster. As k-means progresses the second time through, these points will linger with C and provide a constant drag back to its original position. To actually make the reset configuration proceed as the original did, C must lose these new points immediately after the reset occurs. To ensure this happens, we add a third point R near the center of A, and a new cluster B near R. Now B will acquire R during the reset process, which moves it into position to recapture the points P and Q. The whole process is illustrated in Figure 1.

We have now fully reset C. Applying this technique simultaneously to each cluster, we can hope to double the running time of k-means.

3.1.2 Pitfalls
A few additional considerations come into play when formalizing this intuition.

1. We can only add a reset widget to a signaling configuration. To recursively add reset widgets, we need to ensure that adding a reset widget maintains the signaling property.
2. We can only reset a cluster if that specific cluster signaled on the final iteration. Thus, we need to be able to take a signaling configuration and augment it so that each cluster simultaneously signals.
3. We cannot afford to double the number of clusters by adding a different reset widget for each cluster. Instead, we must have one widget reset all clusters at the same time. To accomplish this, it is convenient to have the reset widget cluster centered equally far from each signal, which requires placing certain cluster centers on a hypersphere.

3.2 The Formal Construction
We now formally present the reset widget. This requires a careful placement of points and cluster centers, but the intuition is exactly the one described above. We first state our main results.

Definition 3.1. A k-means configuration is said to be signaling if at least one final cluster center is distinct from every cluster center arising in previous iterations.

Theorem 3.1. Let M be a signaling, non-degenerate k-means configuration on n data points with k clusters. Then there exists a signaling, non-degenerate k-means configuration N on n + O(k) data points with k + O(1) clusters such that T(N) ≥ 2T(M).

Starting with an arbitrary configuration, we can apply this construction t times to obtain a k-means configuration with O(t²) points and O(t) clusters for which T(M) ≥ 2^t. The superpolynomial complexity of k-means follows immediately.

Corollary 3.2. The worst-case complexity of k-means on n data points is 2^Ω(√n).

We prove Theorem 3.1 in two parts. First, we show that particular types of k-means configurations, called super-signaling configurations, can be slightly enlarged to create non-degenerate, signaling k-means configurations with twice the complexity. We then show how to slightly enlarge non-degenerate, signaling k-means configurations to obtain super-signaling configurations, thereby establishing the recursion.

Definition 3.2. A k-means configuration M is said to be super-signaling if it has the following properties.
1. The final positions of all cluster centers lie on a hypersphere.
2. The final positions of all cluster centers are distinct from all cluster centers arising in previous iterations.
3. There exists a k-means configuration M′ with the same set of data points as M and with the same number of clusters as M. Furthermore, T(M′) = T(M) and at least one final cluster center in M′ is distinct from any other cluster center arising in all iterations starting from M and M′.

Lemma 3.3. Let M be a super-signaling, non-degenerate k-means configuration on n data points with k clusters. Then there exists a signaling, non-degenerate k-means configuration N on n + O(k) data points with k + O(1) clusters such that T(N) ≥ 2T(M).

Proof. We begin with a formal definition of our construction, and then trace the execution of k-means in Table 1 and Figures 3-7.

Let M′ be given as in Definition 3.2. Label the clusters in M and M′ with 1 through k, and let x_{i,t} (respectively y_{i,t}) denote the center of cluster i in M (respectively M′) after t iterations. Also let x̃_i denote the final center of cluster i in M and let n_i denote the final number of data points in cluster i. Since M is super-signaling, we may assume without loss of generality that ‖x̃_i‖ is independent of i (e.g. the center of the hypersphere passing through the x̃_i's lies at the origin). Finally, let z_i = (1/2)((n_i + 4)·y_{i,0} − (n_i + 2)·x̃_i).

Let V(M) denote the data points in M and let ℓ denote the diameter of {0} ∪ {z_i} ∪ V(M). Let d, r and ε be such that d ≫ r ≫ ℓ ≫ ε > 0 and let d′ be such that (d′)² = d² + ‖x̃_i‖² − ε. Finally, let u_1, u_2, ..., u_k and v_1, v_2, ..., v_k be vectors in R² such that (a) ‖u_i‖ = (n_i + 2)/2, (b) v_i = u_i/‖u_i‖, and (c) v_i ≠ v_j for all i ≠ j. Now consider the following points in Span(V(M)) × R × R² × R:

P_i = (x̃_i, d′, r·u_i, 0) for i ≤ k;
P′_i = (x̃_i, d′ + 2d, r·u_i, 0) for i ≤ k;
Q_i = (z_i, d′ + 0.001d, r·v_i, 0) for i ≤ k;
Q′_i = (z_i, d′ + 1.999d, r·v_i, 0) for i ≤ k;
A = (0, d′ + 0.99d, 0, 0);
A′ = (0, d′ + 1.01d, 0, 0);
X = (0, d′ + 0.99d, 0, 0.2d);
X′ = (0, d′ + 1.01d, 0, 0.2d).

For each such point Z, we define Z̄ to be the reflection of Z about the hyperplane Span(V(M)) × {0} × R² × R — i.e. P̄_i has coordinates (x̃_i, −d′, r·u_i, 0). Let V(N) denote the set of all these points along with the natural embedding of V(M)

Figure 2: The data points constructed in Lemma 3.3 (figure not to scale). Note d ≫ r ≫ ℓ.
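The choice ‖u_i‖ = (n_i + 2)/2 is what makes the center of C_i land at (x̃_i, 0, r·v_i, 0) once P_i and its reflection P̄_i join the n_i final points of the cluster. The following small numeric check is only illustrative (the values of n_i, r, v_i and d′ below are made up, not part of the construction), restricted to the second coordinate and the v-plane:

```python
def center_of_mass(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

# Work only in the (second coordinate, v-plane) part of the construction.
n_i = 6                      # hypothetical final size of cluster i in M
r = 0.01
v_i = (0.6, 0.8)             # a unit vector
u_i = ((n_i + 2) / 2 * v_i[0], (n_i + 2) / 2 * v_i[1])  # ||u_i|| = (n_i+2)/2
d_prime = 100.0

# n_i data points of cluster i (second coordinate 0, v-component (0, 0)),
# plus P_i = (d', r*u_i) and its reflection (-d', r*u_i).
cluster = [(0.0, 0.0, 0.0)] * n_i
cluster += [(d_prime, r * u_i[0], r * u_i[1]),
            (-d_prime, r * u_i[0], r * u_i[1])]

c = center_of_mass(cluster)
# The second coordinate cancels, and the v-component is exactly r * v_i
# (here (0.006, 0.008)) up to floating point.
print(c)
```

The second coordinates of P_i and P̄_i cancel, while the two copies of r·u_i = r·((n_i+2)/2)·v_i average out over the n_i + 2 points to exactly r·v_i, matching the center claimed in Table 1 at time T(M) + 1.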
Table 1: The clusters of N after t iterations of k-means (see Lemma 3.3). M_{i,t} (respectively M′_{i,t}) denotes the points in cluster i of M (respectively M′) after t iterations, and M̃_i denotes the final points in cluster i of M. All table entries describe clusters immediately after the centers are recomputed. Rather than going through every calculation, we discuss the key elements in the following figures.

t = 0, ..., T(M):
  C_i = M_{i,t} with center (x_{i,t}, 0, 0, 0)
  G = {P_i, P′_i, Q_i, Q′_i, A, A′} with center (0, d′ + d, 0, 0)
  H = {X} with center (0, d′ + 0.99d, 0, 0.2d)
  H′ = {X′} with center (0, d′ + 1.01d, 0, 0.2d)

t = T(M) + 1:
  C_i = M̃_i ∪ {P_i, P̄_i} with center (x̃_i, 0, r·v_i, 0)
  G = {P′_i, Q_i, Q′_i, A, A′} with center (O(ℓ), d′ + αd, O(rn), 0) with 1.25 ≤ α ≤ 4/3
  H = {X} with center (0, d′ + 0.99d, 0, 0.2d)
  H′ = {X′} with center (0, d′ + 1.01d, 0, 0.2d)

t = T(M) + 2:
  C_i = M̃_i ∪ {P_i, Q_i, P̄_i, Q̄_i} with center (y_{i,0}, 0, r·v_i, 0)
  G = {P′_i, Q′_i} with center (O(ℓ), d′ + 1.9995d, O(rn), 0)
  H = {A, X} with center (0, d′ + 0.99d, 0, 0.1d)
  H′ = {A′, X′} with center (0, d′ + 1.01d, 0, 0.1d)

t = T(M) + 3:
  C_i = M′_{i,1} with center (y_{i,1}, 0, 0, 0)
  G = {P′_i, Q′_i} with center (O(ℓ), d′ + 1.9995d, O(rn), 0)
  H = {A, X, P_i, Q_i} with center (O(ℓ), d′ + 0.0005d + (0.9895/(k+1))d, O(rn), 0.2d/(2k+2))
  H′ = {A′, X′} with center (0, d′ + 1.01d, 0, 0.1d)

t = T(M) + 4, ..., 2T(M) + 2:
  C_i = M′_{i,t−T(M)−2} with center (y_{i,t−T(M)−2}, 0, 0, 0)
  G = {P′_i, Q′_i} with center (O(ℓ), d′ + 1.9995d, O(rn), 0)
  H = {P_i, Q_i} with center (O(ℓ), d′ + 0.0005d, O(rn), 0)
  H′ = {A, A′, X, X′} with center (0, d′ + d, 0, 0.1d)

Figure 3: Clustering at 0 ≤ t ≤ T(M) (see Lemma 3.3). The clusters contained within V(M) proceed independently of the other points. The remaining clusters are precarious but temporarily stable. For example, to see that P_i does not switch from cluster G to C_j, note that the distance squared from P_i to the center of C_j minus the distance squared from P_i to the center of G is (‖x̃_i − x_{j,t}‖² + (d′)² + ‖r·u_i‖²) − (‖x̃_i‖² + d² + ‖r·u_i‖²) = ‖x̃_i − x_{j,t}‖² − ε > 0. The last inequality follows from the fact that ε ≪ ℓ and that, since M is super-signaling, x̃_i ≠ x_{j,t}.
Figure 4: Clustering at t = T(M) + 1 (see Lemma 3.3). We now have x_{i,t} = x̃_i for all i, and thus by the calculation in the previous step, each P_i switches to cluster C_i. Clearly, this will result in a substantial shift of the center of G (and similarly of Ḡ). Furthermore, the u_i's have been chosen so that the center of C_i becomes (x̃_i, 0, r·v_i, 0).

Figure 5: Clustering at t = T(M) + 2 (see Lemma 3.3). First consider V(M). These points continue to be closer to the C_i's than to other clusters. Each C_i center has moved since the previous iteration, but they have all moved by a constant amount (namely r·‖v_i‖ = r) in a direction orthogonal to Span(V(M)). Therefore, the closest center to each point in V(M) has not changed, and thus these points remain in their current clusters. On the other hand, since the center of G moved away, A, A′, and Q_i all switch to different clusters. The first two clearly switch to H and H′, but Q_i could reasonably switch to either H or any C_j. The distance squared from Q_i to the center of C_j is (1.001d)² + r²‖v_i − v_j‖² + O(ℓ²), which is minimized when i = j. The distance squared from Q_i to the center of H is (0.989d)² + (0.2d)² + O(r²). Since 0.989² + 0.2² > 1.001² and d ≫ r, ℓ, it follows that Q_i will in fact switch to C_i. Note that the analysis so far does not depend on the V(M)-coordinate of any Q_i, so we may choose those to make the V(M)-coordinate of each C_i equal to y_{i,0} at the end of this step.

Figure 6: Clustering at t = T(M) + 3 (see Lemma 3.3). By acquiring A, cluster H has moved closer to the other points. In fact, the distance squared from P_i to the center of H is now (0.99d)² + (0.1d)² + O(r²) < d². Thus, each P_i switches to H, and a similar calculation shows each Q_i also switches to H. Now consider V(M). As in the previous step, we may ignore the r·v_i component of each C_i center. The V(M) component of each C_i center is now y_{i,0}, which means the clustering proceeds according to M′, and the points in V(M) associated with C_i at the end of this step are M′_{i,1}.
Figure 7: Clustering at T(M) + 4 ≤ t ≤ 2T(M) + 2 (see Lemma 3.3). The center of H moves because P_i and Q_i have been absorbed into H. Also A and X switch to H′. Beyond that, the configuration is now very stable, and the clustering on V(M) will proceed normally according to M′.

in Span(V(M)) × {0} × {(0, 0)} × {0}. This setup is illustrated in Figure 2. We also define clusters with initial centers in Span(V(M)) × R × R² × R as follows.

C_i with center (x_{i,0}, 0, 0, 0) for i ≤ k;
G with center (0, d′ + d, 0, 0);
H with center (0, d′ + 0.99d, 0, 0.2d);
H′ with center (0, d′ + 1.01d, 0, 0.2d).

For each such cluster C other than the C_i's, we define C̄ to be a cluster whose initial center is obtained by reflecting the initial center of C about the hyperplane Span(V(M)) × {0} × R² × R.

Let N denote the k-means configuration with all these cluster centers and with data points V(N). We trace the evolution of k-means on N via Table 1 and Figures 3-7. Based upon this, we see that T(N) ≥ T(M) + T(M′) = 2T(M), and that N is non-degenerate and signaling. Since N has n + O(k) data points and k + O(1) clusters, the result follows. ∎

This completes the first half of our construction, in which we transform a super-signaling configuration into a signaling configuration with twice the complexity. We now show how to transform a signaling configuration into a super-signaling configuration with equal complexity.

Lemma 3.4. Let N be a signaling, non-degenerate k-means configuration on n data points with k clusters. Then there exists a super-signaling, non-degenerate k-means configuration M on n + O(k) data points with k + O(1) clusters such that T(M) ≥ T(N).

Proof. Let x_{i,t} denote the center of cluster i in N after t iterations and let x̃_i denote the final center of cluster i in N. Since N is signaling, we may assume without loss of generality that x̃_1 is distinct from all other x_{i,t}. Let V(N) denote the set of data points in N and let ℓ denote the diameter of V(N). Let d and ε be such that d ≫ ℓ ≫ ε and let d′ be such that (d′)² = d² − ε. Also, let a, b and c be points in Span(V(N)) such that b = (a + c)/2 and such that the distance from a to V(N) is much larger than both ℓ and ‖c − a‖. Now, take γ = 1/(3k + 9), and consider the following points in Span(V(N)) × R:

P = (x̃_1, d′);
X_i = (x̃_i, d′ + γd) for i ≤ k;
A, B, C = (a, 0), (b, 0), (c, 0);
A′, B′, C′ = (a, d′ + γd), (b, d′ + γd), (c, d′ + γd);
Q = ((k + 4)·x̃_1 − Σ_i x̃_i − 3b, d′ + (k + 14/3)d).

For each such point Z ∉ {A, B, C}, we also define Z̄ to be the reflection of Z about the hyperplane Span(V(N)) × {0}. Let V(M) denote the set of all these points as well as the natural embedding of V(N) in Span(V(N)) × {0}. This is illustrated in Figure 8.

We also define clusters with centers in Span(V(N)) × R as follows.

C_i with center (x_{i,0}, 0) for i ≤ k;
H with center ((a + b)/2, 0);
H′ with center (c, 0);
J with center (x̃_1, d′ + d);
J̄ with center (x̃_1, −d′ − d).

Let M denote the k-means configuration with all these cluster centers and with data points V(M). We trace the evolution of k-means on M via Table 2 and Figures 9-11. Based upon this, we see that T(M) ≥ T(N), that M is non-degenerate, and also that the final cluster sets of M are distinct from all cluster sets arising in previous configurations.

Also let M′ denote the k-means configuration with data points V(M) and with cluster centers as above except with H centered at (a, 0) and H′ centered at ((b + c)/2, 0). Then, the same calculation shows that T(M′) = T(M) and that the final cluster set for H is distinct from all other cluster sets arising in M or M′.

Finally, since M and M′ are non-degenerate, there exists a δ > 0 such that we may move each data point by up to δ without altering the k-means execution. Taking advantage of this, we can ensure that the centers of distinct cluster sets are distinct, and that the final cluster centers of M′ lie on a hypersphere. This makes M super-signaling, and the result follows. ∎

Theorem 3.1 follows immediately from Lemma 3.3 and Lemma 3.4.

Figure 8: The data points constructed in Lemma 3.4. Note d ≫ ℓ.
Table 2: The clusters of M after t iterations of k-means (see Lemma 3.4). N_{i,t} denotes the points in cluster i of N after t iterations, and Ñ_i denotes the final points in cluster i of N. All table entries describe clusters immediately after the centers are recomputed. Rather than going through every calculation, we discuss the key elements in the following figures.

t = 0, ..., T(N):
  C_i = N_{i,t} with center (x_{i,t}, 0) for 1 ≤ i ≤ k
  H = {A, B} with center ((a + b)/2, 0)
  H′ = {C} with center (c, 0)
  J = {P, X_i, A′, B′, C′, Q} with center (x̃_1, d′ + d)

t = T(N) + 1:
  C_1 = Ñ_1 ∪ {P, P̄} with center (x̃_1, 0)
  C_i = Ñ_i with center (x̃_i, 0) for 2 ≤ i ≤ k
  H = {A, B} with center ((a + b)/2, 0)
  H′ = {C} with center (c, 0)
  J = {X_i, A′, B′, C′, Q} with center (x̃_1, d′ + d + d/(k + 4))

t = T(N) + 2:
  C_1 = Ñ_1 ∪ {P, X_1, P̄, X̄_1} with center (x̃_1, 0)
  C_i = Ñ_i ∪ {X_i, X̄_i} with center (x̃_i, 0) for 2 ≤ i ≤ k
  H = {A, B, A′, B′, Ā′, B̄′} with center ((a + b)/2, 0)
  H′ = {C, C′, C̄′} with center (c, 0)
  J = {Q} with center ((k + 4)·x̃_1 − Σ_i x̃_i − 3b, d′ + (k + 14/3)d)

Figure 9: Clustering at 0 ≤ t ≤ T(N) (see Lemma 3.4). As with the first part of the construction for Lemma 3.3, the clusters contained within V(N) proceed independently of the other points. The remaining clusters are precarious but temporarily stable. For example, to see that P does not switch from cluster J to C_i, note that the distance squared from P to the center of C_i minus the distance squared from P to the center of J is ‖x̃_1 − x_{i,t}‖² + (d′)² − d² = ‖x̃_1 − x_{i,t}‖² − ε > 0. The last inequality follows from the fact that ε ≪ ℓ and that, since N is signaling, x̃_1 ≠ x_{i,t}.

Figure 10: Clustering at t = T(N) + 1 (see Lemma 3.4). We now have x_{1,t} = x̃_1, and thus by the calculation in the previous step, P switches to cluster C_1. Since P̄ also switches to cluster C_1, the center of C_1 does not change. However, the centers of J and J̄ both move slightly further away from the other points.

Figure 11: Clustering at t = T(N) + 2 (see Lemma 3.4). The points X_i, A′, B′, C′ were all chosen to be only barely stable within J. Thus, after the center of J moves, these points switch to the closest clusters in V(N). For example, the distance from X_i to the center of C_i is approximately d + d/(3k + 9), and the distance from X_i to the center of J is approximately d + d/(k + 4) − d/(3k + 9) > d + d/(3k + 9). Again, only the centers of J and J̄ move as a result of this, and it is easy to check the new configuration is stable.
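Both lemmas reason about T(·), the number of iterations of the plain k-means loop from Section 2. For reference, that loop can be sketched in a few lines of Python (a minimal sketch over plain tuples; the function name `kmeans_iterations` is illustrative, not from the paper):

```python
import math

def kmeans_iterations(points, centers, max_iter=1_000_000):
    """Run the k-means loop of Section 2; return (iterations, clusters).

    points, centers: lists of equal-dimension tuples. Ties are broken in
    favor of the lowest-index center, and clusters left with no points
    are eliminated, as described in Section 2.
    """
    dim = len(points[0])
    centers = list(centers)
    for t in range(1, max_iter + 1):
        # Step 1: assign every point to its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # Eliminate clusters that received no points.
        kept = [(c, pts) for c, pts in zip(centers, clusters) if pts]
        # Step 2: recompute each center as its cluster's center of mass.
        new_centers = [tuple(sum(x[k] for x in pts) / len(pts)
                             for k in range(dim)) for _, pts in kept]
        # Step 3: stop once the centers (hence the clusters) are unchanged.
        if new_centers == [c for c, _ in kept]:
            return t, [pts for _, pts in kept]
        centers = new_centers
    raise RuntimeError("no convergence within max_iter")

# Example: four collinear points, seeded at the two extremes.
t, cl = kmeans_iterations([(0.0,), (2.0,), (5.0,), (6.0,)],
                          [(0.0,), (6.0,)])
print(t, cl)   # 2 [[(0.0,), (2.0,)], [(5.0,), (6.0,)]]
```

A lower bound configuration is exactly an adversarial choice of the `points` and `centers` arguments; the constructions above are engineered so that the loop's counter t reaches 2^Ω(√n) before the stopping condition in step 3 fires.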
3.3 Probability Boosting
The construction used to prove Theorem 3.1 requires both a specific set of data points and a specific set of cluster centers. In practice, however, only the data points are specified and the initial cluster centers are chosen by the algorithm. Typically, the initial centers are chosen uniformly at random from the data points. Given this, one might ask if the superpolynomial lower bound can actually arise with non-vanishing probability.

In this section, we show how to modify our lower bound construction to apply with high probability even if the cluster centers are chosen randomly from the existing data points. It follows that k-means can still be very slow for certain sets of data points, even accounting for the random choice of cluster centers.

Proposition 3.5. Let M be a k-means configuration on n points. Then, there exists a set of O(n³ log n) points such that if a k-means configuration N is constructed with these data points and with 4n log n cluster centers chosen randomly from the set of data points, then T(N) ≥ T(M) with probability 1 − O(1/n).

Proof. Let k be the number of clusters in M. For i ≤ k and j ≤ m, let u_{i,j} denote orthogonal unit vectors in R^{mk}. Let V(M) denote the set of data points in M and let ℓ denote the diameter of V(M). Let d, r and ε be such that d ≫ r ≫ ℓ ≫ ε. Also, let n_i denote the number of points in cluster i in M after one iteration. Replacing M with two identical overlapping copies if necessary, we may assume that n_i > 1. Finally, let x_{i,t} denote the center of cluster i in M after t iterations.

Let m be a positive integer to be fixed later and consider the point set in Span(V(M)) × R^{km} × R obtained by first embedding two copies of V(M) at Span(V(M)) × {0} × {0} and then adding the following points.

1. P_{i,j} = (x_{i,0}, Σ_{(i′,j′) ≠ (i,j)} r·u_{i′,j′}, d + jε) for i ≤ k, j ≤ m.
2. Q_{i,ℓ} = ((n_i·x_{i,1} − x_{i,0})/(n_i − 1), Σ_{i′ ≠ i} Σ_{j′} r·u_{i′,j′}, d − ℓε) for i ≤ k and ℓ ≤ n_i − 1.
3. O_j = (0, Σ_{i′} Σ_{j′} r·u_{i′,j′}, d + jε) for j ≤ m.

Consider a k-means configuration N with these data points and with 4n log n cluster centers chosen from these points at random. Let Π_0 = {O_1, O_2, ..., O_m} and for i > 0, let Π_i = {P_{i,1}, P_{i,2}, ..., P_{i,m}}. Suppose that N begins with all of its cluster centers in Π = ∪_i Π_i and that each Π_i has at least one cluster center. One can check that T(N) ≥ T(M) in this case.

Now, let m = (n³ log n)/k. Then, each cluster center will be in some Π_i with probability 1 − O(1/(n² log n)). Since there are 4n log n clusters, all clusters will be in Π with probability 1 − O(1/n). Furthermore, the probability that no cluster center is chosen in a fixed Π_i is at most (1 − 1/(2k))^{4n log n} ≤ 1/n². Thus, each Π_i has at least one cluster center with probability 1 − O(1/n). The result now follows. ∎

3.4 Low Spread
Recall the spread Δ of a point set is the ratio of the largest pairwise distance to the smallest pairwise distance. Har-Peled and Sadri [7] conjectured that k-means might run in time polynomial in n and Δ. In this section, however, we show that the spread can be reduced to O(1) without decreasing the number of iterations required.

Proposition 3.6. Let M be a k-means configuration on n points. Then, there exists a k-means configuration N on 2n points such that N has O(1) spread and such that T(N) = T(M).

Proof. Let V(M) denote the points in M, and choose an arbitrary set of vectors u_1, u_2, ..., u_n. For each v_i ∈ V(M), we replace v_i with x_i = (v_i, u_i) and y_i = (v_i, −u_i) in Span(V(M)) × Span(u_1, u_2, ..., u_n). Let N denote the k-means configuration with these data points and with centers (c_j, 0) for each center c_j in M. It is easy to check that cluster C in N contains x_i and y_i after t iterations if and only if cluster C in M contains v_i after t iterations. It follows that T(N) = T(M). Taking the u_i to be orthogonal and of a common large length d, we can make N have spread arbitrarily close to √2. ∎
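The lifting in Proposition 3.6 is easy to verify numerically. The sketch below is illustrative only (it fixes the u_i to be d times the standard basis vectors of R^n, one admissible orthogonal choice): it lifts a 1-dimensional point set of large spread and reports the spread of the lifted set, which approaches √2 as d grows.

```python
import itertools
import math

def spread(points):
    """Largest pairwise distance divided by smallest pairwise distance."""
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return max(dists) / min(dists)

def lift(values, d):
    """Replace each v_i with x_i = (v_i, u_i) and y_i = (v_i, -u_i),
    where u_i = d * e_i are orthogonal vectors of common length d."""
    n = len(values)
    out = []
    for i, v in enumerate(values):
        u = [0.0] * n
        u[i] = d
        out.append(tuple([v] + u))          # x_i = (v_i, +u_i)
        out.append(tuple([v] + [-w for w in u]))  # y_i = (v_i, -u_i)
    return out

original = [0.0, 1.0, 10.0, 50.0]
print(spread([(v,) for v in original]))    # 50.0
print(spread(lift(original, d=10**6)))     # ~1.4142, approaching sqrt(2)
```

The largest distance in the lifted set is 2d (between each x_i and y_i), while every other distance is √(‖v_i − v_j‖² + 2d²) ≈ d√2, so the ratio tends to 2d/(d√2) = √2.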
More generally, there is a tradeoff between the extra dimensionality and the reduction of Δ. For example, by adding one extra dimension, and taking u_i = d·i, we can make the spread linear in n.

4. DISCUSSION
4.1 Smoothed Analysis
We have shown k-means can have a superpolynomial running time in the worst case. However, we know the algorithm runs efficiently in practice. It is natural to ask how this discrepancy can be formalized. One natural approach is that of smoothed analysis, which was used by Spielman and Teng [13] to explain the running time of the simplex algorithm. Towards that end, assume that the data points are chosen from independent normal distributions with variance σ². Letting D denote the diameter of the resulting point set, we ask whether k-means is likely to run in time polynomial in n and D/σ.

4.1.1 High Dimension
This question appears to be difficult in general, but a positive result is relatively easy to prove in high dimensions. In this section, we sketch a proof of this fact.

Proposition 4.1. Given data points chosen from independent normal distributions with variance σ² and with dimension d = Ω(n/log n), k-means will execute in polynomial time with high probability.

We analyze the standard k-means potential function. For a k-means configuration M = (X, C), let Φ(M) = Σ_{i=1}^n ‖x_i − c_i‖², where c_i ∈ C is the cluster center closest to x_i. Clearly, 0 ≤ Φ ≤ nD², and one can also check that Φ is non-increasing throughout an execution of k-means. Therefore, it suffices to show that the potential decreases by a non-trivial amount during each iteration.

On the one hand, it is known that if a cluster center moves by a distance δ during a k-means step and if the cluster has m points at the end of the step, then Φ decreases by at least δ²m (see [7] and [11]). On the other hand, if our data points are random, no two possible centers can be too close. This can be formalized as follows.
Definition 4.1. We say a set of data points X is "ε-separated" if for any non-identical subsets S and T, the centers of mass c(S) and c(T) satisfy ‖c(S) − c(T)‖ ≥ ε/(2·min(|S|, |T|)).

Lemma 4.2. If X is a set of n data points chosen from independent normal distributions with variance σ², then X is ε-separated with probability at least 1 − 2^{2n}·(ε/σ)^d.

We omit the proof of the Lemma. Proposition 4.1 follows by choosing ε = σ·n^{−1/d}·2^{−2n/d}.

4.1.2 The General Case
Proposition 4.1 shows that k-means runs in polynomial time with high probability in smoothed high-dimensional settings. A similar result holds when d = 1 based on the spread analysis of [7] and the fact that a smoothed point set is likely to have polynomial spread. A much more subtle analysis seems to be required for other values of d. We have recently proven an upper bound of n^{O(k)}·poly(n, D/σ) [2], but it remains a major open problem to find a bound polynomial in n, k and D/σ in the general case.

4.2 Variants
Smoothed analysis provides one very explicit way of circumventing the worst-case performance of k-means. Namely, given an arbitrary data set, we can perturb each point according to an independent normal distribution and then run k-means. Even our simple analysis in high dimensions can be harnessed here by first lifting to n-dimensional space, and then perturbing.

Two other methods also suggest themselves. First of all, k-means is often run in a relatively small number of dimensions, and regardless, one can always reduce d to O(log n) with small distortion [10]. Thus, it is natural to ask how k-means performs for small d. We conjecture that k-means is worst-case superpolynomial iff d > 1. Even when d = 1, no strongly polynomial upper bounds are known.

Finally, Har-Peled and Sadri [7] suggested a simple variant of k-means where only one data point is reassigned each iteration. This variant has running time polynomial in n and the spread Δ, which we know is not true for standard k-means. Given this qualitative improvement, a further study of this variant could prove fruitful.

5. REFERENCES
[1] Pankaj K. Agarwal and Nabil H. Mustafa. k-means projective clustering. In PODS '04: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 155-165, New York, NY, USA, 2004. ACM Press.
[2] David Arthur and Sergei Vassilvitskii. Improved smoothed analysis for the k-means method. Manuscript, 2006.
[3] David Arthur and Sergei Vassilvitskii. k-means lower bound implementation. http://www.stanford.edu/~darthur/kMeansLbTest.zip, 2006.
[4] Sanjoy Dasgupta. How fast is k-means? In COLT: Computational Learning Theory, volume 2777, page 735, 2003.
[5] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience Publication, 2000.
[6] Frederic Gibou and Ronald Fedkiw. A fast hybrid k-means level set algorithm for segmentation. In 4th Annual Hawaii International Conference on Statistics and Mathematics, pages 281-291, 2005.
[7] Sariel Har-Peled and Bardia Sadri. How fast is the k-means method? Algorithmica, 41(3):185-202, 2005.
[8] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien. Large-scale clustering of cDNA-fingerprinting data. Genome Research, 9:1093-1105, 1999.
[9] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In SCG '94: Proceedings of the Tenth Annual Symposium on Computational Geometry, pages 332-339, New York, NY, USA, 1994. ACM Press.
[10] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz maps into a Hilbert space. Contemporary Mathematics, 26:189-206, 1984.
[11] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. In SCG '02: Proceedings of the Eighteenth Annual Symposium on Computational Geometry, pages 10-18, New York, NY, USA, 2002. ACM Press.
[12] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-136, 1982.
[13] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. J. ACM, 51(3):385-463, 2004.