k-means++: The Advantages of Careful Seeding

David Arthur∗    Sergei Vassilvitskii†

Abstract

The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is $\Theta(\log k)$-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.

1 Introduction

Clustering is one of the classic problems in machine learning and computational geometry. In the popular k-means formulation, one is given an integer $k$ and a set of $n$ data points in $\mathbb{R}^d$. The goal is to choose $k$ centers so as to minimize $\phi$, the sum of the squared distances between each point and its closest center.

Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it "is by far the most popular clustering algorithm used in scientific and industrial applications" [5].

Usually referred to simply as k-means, Lloyd's algorithm begins with $k$ arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These two steps (assignment and center calculation) are repeated until the process stabilizes.

One can check that the total error $\phi$ is monotonically decreasing, which ensures that no clustering is repeated during the course of the algorithm. Since there are at most $k^n$ possible clusterings, the process will always terminate. In practice, very few iterations are usually required, which makes the algorithm much faster than most of its competitors.
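The assignment and center-of-mass steps just described are easy to state in code. The following is a minimal illustrative sketch in Python (all names are ours; the paper's own experiments use a separate C++ implementation), not an optimized version:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def potential(points, centers):
    """The k-means potential: total squared distance from each point
    to its nearest center."""
    return sum(min(dist2(p, c) for c in centers) for p in points)

def lloyd(points, k, seed=0):
    """Lloyd's local search: start from k centers chosen uniformly at
    random from the data, then alternate assignment and center-of-mass
    updates until the centers stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    while True:
        # Assignment step: index of the nearest center for each point.
        labels = [min(range(k), key=lambda j: dist2(p, centers[j]))
                  for p in points]
        # Update step: move each center to its cluster's center of mass.
        new_centers = []
        for j in range(k):
            cluster = [p for p, lab in zip(points, labels) if lab == j]
            if cluster:
                new_centers.append(tuple(sum(xs) / len(cluster)
                                         for xs in zip(*cluster)))
            else:  # empty cluster: keep the old center
                new_centers.append(centers[j])
        if new_centers == centers:
            return centers
        centers = new_centers
```

Because the potential never increases across the two steps, the loop terminates; on tiny inputs it converges in a handful of iterations.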
∗ Stanford University. Supported in part by an NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC.
† Stanford University. Supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC.

Unfortunately, the empirical speed and simplicity of the k-means algorithm come at the price of accuracy. There are many natural examples for which the algorithm generates arbitrarily bad clusterings (i.e., $\phi/\phi_{OPT}$ is unbounded even when $n$ and $k$ are fixed). Furthermore, these examples do not rely on an adversarial placement of the starting centers, and the ratio can be unbounded with high probability even with the standard randomized seeding technique.

In this paper, we propose a way of initializing k-means by choosing random starting centers with very specific probabilities. Specifically, we choose a point $p$ as a center with probability proportional to $p$'s contribution to the overall potential. Letting $\phi$ denote the potential after choosing centers in this way, we show the following.

Theorem 1.1. For any set of data points, $E[\phi] \le 8(\ln k + 2)\phi_{OPT}$.

This sampling is both fast and simple, and it already achieves approximation guarantees that k-means cannot. We propose using it to seed the initial centers for k-means, leading to a combined algorithm we call k-means++.

This complements a very recent result of Ostrovsky et al. [24], who independently proposed much the same algorithm. Whereas they showed this randomized seeding is O(1)-competitive on data sets following a certain separation condition, we show it is O(log k)-competitive on all data sets.

We also show that the analysis for Theorem 1.1 is tight up to a constant factor, and that it can be easily extended to various potential functions in arbitrary metric spaces. In particular, we can also get a simple O(log k) approximation algorithm for the k-median objective. Furthermore, we provide preliminary experimental data showing that in practice, k-means++ really does outperform k-means in terms of both accuracy and speed, often by a substantial margin.

1.1 Related work

As a fundamental problem in machine learning, k-means has a rich history. Because of its simplicity and its observed speed, Lloyd's method [20] remains the most popular approach in practice, despite its limited accuracy.
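Concretely, the seeding rule described in the introduction (sample each new center with probability proportional to its current squared distance from the nearest already-chosen center) can be sketched as follows. This is an illustrative Python sketch under our own naming, not the authors' implementation:

```python
import random

def d2_seed(points, k, rng=None):
    """Careful-seeding sketch: the first center is uniform at random;
    each subsequent center is a data point x chosen with probability
    D(x)^2 / sum_y D(y)^2, where D(x) is the distance from x to the
    nearest center chosen so far."""
    rng = rng or random.Random()

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centers = [rng.choice(points)]
    d2 = [dist2(p, centers[0]) for p in points]
    while len(centers) < k:
        # Sample an index with probability proportional to d2[i].
        r = rng.random() * sum(d2)
        acc = 0.0
        for i, w in enumerate(d2):
            acc += w
            if acc > r:   # strict comparison: never picks a zero-weight point
                break
        centers.append(points[i])
        # Update each point's D(x)^2 against the newly added center.
        d2 = [min(old, dist2(p, points[i])) for old, p in zip(d2, points)]
    return centers
```

The returned centers would then be handed to the standard local search as its starting configuration.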
The convergence time of Lloyd's method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy.

In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of $O(n^{kd})$. Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in $k$, and are unfortunately impractical even for relatively small $n$, $k$ and $d$.

Kanungo et al. [17] proposed an $O(n^3\varepsilon^{-d})$ algorithm that is $(9+\varepsilon)$-competitive. However, $n^3$ compares unfavorably with the almost linear running time of Lloyd's method, and the exponential dependence on $d$ can also be problematic. For these reasons, Kanungo et al. also suggested a way of combining their techniques with Lloyd's algorithm, but in order to avoid the exponential dependence on $d$, their approach sacrifices all approximation guarantees.

Mettu and Plaxton [22] also achieved a constant-probability O(1) approximation using a technique called successive sampling. They match our running time of $O(nkd)$, but only if $k$ is sufficiently large and the spread is sufficiently small. In practice, our approach is simpler, and our experimental results seem to be better in terms of both speed and accuracy.

Very recently, Ostrovsky et al. [24] independently proposed an algorithm that is essentially identical to ours, although their analysis is quite different. Letting $\phi_{OPT,k}$ denote the optimal potential for a k-clustering on a given data set, they prove k-means++ is O(1)-competitive in the case where $\phi_{OPT,k}/\phi_{OPT,k-1} \le \varepsilon^2$.
The intuition here is that if this condition does not hold, then the data is not well suited for clustering with the given value for $k$.

Combining this result with ours gives a strong characterization of the algorithm's performance. In particular, k-means++ is never worse than O(log k)-competitive, and on very well formed data sets, it improves to being O(1)-competitive.

Overall, the seeding technique we propose is similar in spirit to that used by Meyerson [23] for online facility location, and Mishra et al. [12] and Charikar et al. [6] in the context of k-median clustering. However, our analysis is quite different from those works.

2 Preliminaries

In this section, we formally define the k-means problem, as well as the k-means and k-means++ algorithms. For the k-means problem, we are given an integer $k$ and a set of $n$ data points $X \subset \mathbb{R}^d$. We wish to choose $k$ centers $C$ so as to minimize the potential function,

$$\phi = \sum_{x \in X} \min_{c \in C} \|x - c\|^2.$$

Choosing these centers implicitly defines a clustering: for each center, we set one cluster to be the set of data points that are closer to that center than to any other. As noted above, finding an exact solution to the k-means problem is NP-hard.

Throughout the paper, we will let $C_{OPT}$ denote the optimal clustering for a given instance of the k-means problem, and we will let $\phi_{OPT}$ denote the corresponding potential. Given a clustering $C$ with potential $\phi$, we also let $\phi(A)$ denote the contribution of $A \subset X$ to the potential (i.e., $\phi(A) = \sum_{x \in A} \min_{c \in C} \|x - c\|^2$).

2.1 The k-means algorithm

The k-means method is a simple and fast algorithm that attempts to locally improve an arbitrary k-means clustering. It works as follows.

1. Arbitrarily choose $k$ initial centers $C = \{c_1, \ldots, c_k\}$.
2. For each $i \in \{1, \ldots, k\}$, set the cluster $C_i$ to be the set of points in $X$ that are closer to $c_i$ than they are to $c_j$ for all $j \ne i$.
3. For each $i \in \{1, \ldots, k\}$, set $c_i$ to be the center of mass of all points in $C_i$: $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$.
4. Repeat Steps 2 and 3 until $C$ no longer changes.

It is standard practice to choose the initial centers uniformly at random from $X$. For Step 2, ties may be broken arbitrarily, as long as the method is consistent.

Steps 2 and 3 are both guaranteed to decrease $\phi$, so the algorithm makes local improvements to an arbitrary clustering until it is no longer possible to do so. To see that Step 3 does in fact decrease $\phi$, it is helpful to recall a standard result from linear algebra (see [14]).

Lemma 2.1. Let $S$ be a set of points with center of mass $c(S)$, and let $z$ be an arbitrary point. Then, $\sum_{x \in S} \|x - z\|^2 - \sum_{x \in S} \|x - c(S)\|^2 = |S| \cdot \|c(S) - z\|^2$.

Monotonicity for Step 3 follows from taking $S$ to be a single cluster and $z$ to be its initial center.

As discussed above, the k-means algorithm is attractive in practice because it is simple and it is generally fast. Unfortunately, it is guaranteed only to find a local optimum, which can often be quite poor.

2.2 The k-means++ algorithm

The k-means algorithm begins with an arbitrary set of cluster centers. We propose a specific way of choosing these centers. At any given time, let $D(x)$ denote the shortest distance from a data point $x$ to the closest center we have already chosen. Then, we define the following algorithm, which we call k-means++.

1a. Choose an initial center $c_1$ uniformly at random from $X$.
1b. Choose the next center $c_i$, selecting $c_i = x' \in X$ with probability $\frac{D(x')^2}{\sum_{x \in X} D(x)^2}$.
1c. Repeat Step 1b until we have chosen a total of $k$ centers.
2-4. Proceed as with the standard k-means algorithm.

We call the weighting used in Step 1b simply "$D^2$ weighting".

3 k-means++ is O(log k)-competitive

In this section, we prove our main result.

Theorem 3.1. If $C$ is constructed with k-means++, then the corresponding potential function $\phi$ satisfies $E[\phi] \le 8(\ln k + 2)\phi_{OPT}$.

In fact, we prove this holds after only Step 1 of the algorithm above. Steps 2 through 4 can then only decrease $\phi$. Not surprisingly, our experiments show this local optimization is important in practice, although it is difficult to quantify this theoretically.

Our analysis consists of two parts. First, we show that k-means++ is competitive in those clusters of $C_{OPT}$ from which it chooses a center. This is easiest in the case of our first center, which is chosen uniformly at random.

Lemma 3.1. Let $A$ be an arbitrary cluster in $C_{OPT}$, and let $C$ be the clustering with just one center, which is chosen uniformly at random from $A$. Then, $E[\phi(A)] = 2\phi_{OPT}(A)$.

Proof. Let $c(A)$ denote the center of mass of the data points in $A$. By Lemma 2.1, we know that since $C_{OPT}$ is optimal, it must be using $c(A)$ as the center corresponding to the cluster $A$. Using the same lemma again, we see $E[\phi(A)]$ is given by,

$$\sum_{a' \in A} \frac{1}{|A|} \left( \sum_{a \in A} \|a - a'\|^2 \right) = \frac{1}{|A|} \sum_{a' \in A} \left( \sum_{a \in A} \|a - c(A)\|^2 + |A| \cdot \|a' - c(A)\|^2 \right) = 2 \sum_{a \in A} \|a - c(A)\|^2,$$

and the result follows.

Our next step is to prove an analog of Lemma 3.1 for the remaining centers, which are chosen with $D^2$ weighting.

Lemma 3.2. Let $A$ be an arbitrary cluster in $C_{OPT}$, and let $C$ be an arbitrary clustering. If we add a random center to $C$ from $A$, chosen with $D^2$ weighting, then $E[\phi(A)] \le 8\phi_{OPT}(A)$.

Proof. The probability that we choose some fixed $a'$ as our center, given that we are choosing our center from $A$, is precisely $\frac{D(a')^2}{\sum_{a \in A} D(a)^2}$. Furthermore, after choosing the center $a'$, a point $a$ will contribute precisely $\min(D(a), \|a - a'\|)^2$ to the potential. Therefore,

$$E[\phi(A)] = \sum_{a' \in A} \frac{D(a')^2}{\sum_{a \in A} D(a)^2} \sum_{a \in A} \min(D(a), \|a - a'\|)^2.$$

Note by the triangle inequality that $D(a') \le D(a) + \|a - a'\|$ for all $a, a'$. From this, the power-mean inequality¹ implies that $D(a')^2 \le 2D(a)^2 + 2\|a - a'\|^2$. Summing over all $a$, we then have $D(a')^2 \le \frac{2}{|A|} \sum_{a \in A} D(a)^2 + \frac{2}{|A|} \sum_{a \in A} \|a - a'\|^2$, and hence, $E[\phi(A)]$ is at most,

$$\frac{2}{|A|} \sum_{a' \in A} \frac{\sum_{a \in A} D(a)^2}{\sum_{a \in A} D(a)^2} \cdot \sum_{a \in A} \min(D(a), \|a - a'\|)^2 \;+\; \frac{2}{|A|} \sum_{a' \in A} \frac{\sum_{a \in A} \|a - a'\|^2}{\sum_{a \in A} D(a)^2} \cdot \sum_{a \in A} \min(D(a), \|a - a'\|)^2.$$
In the first expression, we substitute $\min(D(a), \|a - a'\|)^2 \le \|a - a'\|^2$, and in the second expression, we substitute $\min(D(a), \|a - a'\|)^2 \le D(a)^2$. Simplifying, we then have,

$$E[\phi(A)] \le \frac{4}{|A|} \sum_{a' \in A} \sum_{a \in A} \|a - a'\|^2 = 8\phi_{OPT}(A).$$

The last step here follows from Lemma 3.1.

We have now shown that seeding by $D^2$ weighting is competitive as long as it chooses centers from each cluster of $C_{OPT}$, which completes the first half of our argument. We now use induction to show the total error in general is at most $O(\log k)$.

¹ The power-mean inequality states for any real numbers $a_1, \ldots, a_m$ that $\sum a_i^2 \ge \frac{1}{m}\left(\sum a_i\right)^2$. It follows from the Cauchy-Schwarz inequality. We are only using the case $m = 2$ here, but we will need the general case for Lemma 3.3.

Lemma 3.3. Let $C$ be an arbitrary clustering. Choose $u > 0$ "uncovered" clusters from $C_{OPT}$, and let $X_u$ denote the set of points in these clusters. Also let $X_c = X - X_u$. Now suppose we add $t \le u$ random centers to $C$, chosen with $D^2$ weighting. Let $C'$ denote the resulting clustering, and let $\phi'$ denote the corresponding potential. Then, $E[\phi']$ is at most,

$$\left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)(1 + H_t) + \frac{u - t}{u} \cdot \phi(X_u).$$

Here, $H_t$ denotes the harmonic sum, $1 + \frac{1}{2} + \cdots + \frac{1}{t}$.

Proof. We prove this by induction, showing that if the result holds for $(t-1, u)$ and $(t-1, u-1)$, then it also holds for $(t, u)$. Therefore, it suffices to check $t = 0, u > 0$ and $t = u = 1$ as our base cases.

If $t = 0$ and $u > 0$, the result follows from the fact that $1 + H_t = \frac{u-t}{u} = 1$.

Next, suppose $t = u = 1$. We choose our one new center from the one uncovered cluster with probability exactly $\frac{\phi(X_u)}{\phi}$. In this case, Lemma 3.2 guarantees that $E[\phi'] \le \phi(X_c) + 8\phi_{OPT}(X_u)$. Since $\phi' \le \phi$ even if we choose a center from a covered cluster, we have,

$$E[\phi'] \le \frac{\phi(X_u)}{\phi} \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right) + \frac{\phi(X_c)}{\phi} \cdot \phi \le 2\phi(X_c) + 8\phi_{OPT}(X_u).$$

Since $1 + H_t = 2$ here, we have shown the result holds for both base cases.

We now proceed to prove the inductive step. It is convenient here to consider two cases. First suppose we choose our first center from a covered cluster. As above, this happens with probability exactly $\frac{\phi(X_c)}{\phi}$. Note that this new center can only decrease $\phi$. Bearing this in mind, apply the inductive hypothesis with the same choice of covered clusters, but with $t$ decreased by one. It follows that our contribution to $E[\phi']$ in this case is at most,

$$\frac{\phi(X_c)}{\phi} \left( \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)(1 + H_{t-1}) + \frac{u - t + 1}{u} \cdot \phi(X_u) \right).$$

On the other hand, suppose we choose our first center from some uncovered cluster $A$. This happens with probability $\frac{\phi(A)}{\phi}$.
Let $p_a$ denote the probability that we choose $a \in A$ as our center, given the center is somewhere in $A$, and let $\phi_a$ denote $\phi(A)$ after we choose $a$ as our center. Once again, we apply our inductive hypothesis, this time adding $A$ to the set of covered clusters, as well as decreasing both $t$ and $u$ by 1. It follows that our contribution to $E[\phi']$ in this case is at most,

$$\frac{\phi(A)}{\phi} \sum_{a \in A} p_a \left( \left( \phi(X_c) + \phi_a + 8\phi_{OPT}(X_u) - 8\phi_{OPT}(A) \right)(1 + H_{t-1}) + \frac{u - t}{u - 1} \left( \phi(X_u) - \phi(A) \right) \right)$$
$$\le \frac{\phi(A)}{\phi} \left( \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)(1 + H_{t-1}) + \frac{u - t}{u - 1} \left( \phi(X_u) - \phi(A) \right) \right).$$

The last step here follows from the fact that $\sum_{a \in A} p_a \phi_a \le 8\phi_{OPT}(A)$, which is implied by Lemma 3.2.

Now, the power-mean inequality implies that $\sum_{A \subset X_u} \phi(A)^2 \ge \frac{1}{u} \phi(X_u)^2$. Therefore, if we sum over all uncovered clusters $A$, we obtain a potential contribution of at most,

$$\frac{\phi(X_u)}{\phi} \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)(1 + H_{t-1}) + \frac{1}{\phi} \cdot \frac{u - t}{u - 1} \left( \phi(X_u)^2 - \frac{1}{u}\phi(X_u)^2 \right)$$
$$= \frac{\phi(X_u)}{\phi} \left( \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)(1 + H_{t-1}) + \frac{u - t}{u} \cdot \phi(X_u) \right).$$

Combining the potential contribution to $E[\phi']$ from both cases, we now obtain the desired bound:

$$E[\phi'] \le \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)(1 + H_{t-1}) + \frac{u - t}{u} \cdot \phi(X_u) + \frac{\phi(X_c)}{\phi} \cdot \frac{\phi(X_u)}{u}$$
$$\le \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)\left( 1 + H_{t-1} + \frac{1}{u} \right) + \frac{u - t}{u} \cdot \phi(X_u).$$

The inductive step now follows from the fact that $\frac{1}{u} \le \frac{1}{t}$.

We specialize Lemma 3.3 to obtain our main result.

Theorem 3.1. If $C$ is constructed with k-means++, then the corresponding potential function $\phi$ satisfies $E[\phi] \le 8(\ln k + 2)\phi_{OPT}$.

Proof. Consider the clustering $C$ after we have completed Step 1. Let $A$ denote the $C_{OPT}$ cluster in which we chose the first center. Applying Lemma 3.3 with $t = u = k - 1$ and with $A$ being the only covered cluster, we have,

$$E[\phi] \le \left( \phi(A) + 8\phi_{OPT} - 8\phi_{OPT}(A) \right)(1 + H_{k-1}).$$

The result now follows from Lemma 3.1, and from the fact that $H_{k-1} \le 1 + \ln k$.

4 A matching lower bound

In this section, we show that the $D^2$ seeding used by k-means++ is no better than $\Omega(\log k)$-competitive in expectation, thereby proving Theorem 3.1 is tight within a constant factor.

Fix $k$, and then choose $n$, $\Delta$, $\delta$ with $n \ge k$ and $\Delta \gg \delta$. We construct $X$ with $n$ points. First choose $k$ centers $c_1, c_2, \ldots, c_k$ such that $\|c_i - c_j\|^2 = \Delta^2 - \frac{n-k}{n}\delta^2$ for all $i \ne j$. Now, for each $c_i$, add data points $x_{i,1}, x_{i,2}, \ldots, x_{i,n/k}$ arranged in a regular simplex with center $c_i$, side length $\delta$, and radius $\sqrt{\frac{n-k}{2n}} \cdot \delta$. If we do this in orthogonal dimensions for each $i$, we then have,

$$\|x_{i,i'} - x_{j,j'}\| = \delta \text{ if } i = j, \text{ or } \Delta \text{ otherwise}.$$

We prove our seeding technique is in expectation $\Omega(\log k)$ worse than the optimal clustering in this case. Clearly, the optimal clustering has centers $\{c_i\}$, which leads to an optimal potential of $\phi_{OPT} = \frac{n-k}{2}\delta^2$.
Conversely, using an induction similar to that of Lemma 3.3, we show $D^2$ seeding cannot match this bound. As before, we bound the expected potential in terms of the number of centers left to choose and the number of uncovered clusters (those clusters of $C_{OPT}$ from which we have not chosen a center).

Lemma 4.1. Let $C$ be an arbitrary clustering on $X$ with $k - t \ge 1$ centers, but with $u$ clusters from $C_{OPT}$ uncovered. Now suppose we add $t$ random centers to $C$, chosen with $D^2$ weighting. Let $C'$ denote the resulting clustering, and let $\phi'$ denote the corresponding potential. Furthermore, let $\alpha = \frac{n-k^2}{n}$, $\beta = \frac{\Delta^2 - 2k\delta^2}{\Delta^2}$ and $H'_u = \sum_{i=1}^{u} \frac{k-i}{ki}$. Then, $E[\phi']$ is at least,

$$\beta^{t+1} \cdot \alpha n\delta^2 (1 + H'_u) + \frac{n-k^2}{2n} \cdot \Delta^2 (u - t).$$

Proof. We prove this by induction on $t$. If $t = 0$, note that,

$$\phi' = \phi = \left( n - \frac{un}{k} - k \right)\delta^2 + \frac{un}{k} \cdot \Delta^2.$$

Since $n - \frac{un}{k} \ge \frac{n}{k}$, we have $n - \frac{un}{k} - k \ge \left( n - \frac{un}{k} \right) \cdot \frac{n-k^2}{n} = \alpha\left( n - \frac{un}{k} \right)$. Also, $\alpha, \beta \le 1$. Therefore,

$$\phi' \ge \alpha\left( n - \frac{un}{k} \right)\delta^2 + \frac{un}{k} \cdot \Delta^2.$$

Finally, since $n\delta^2 u \le u \cdot \frac{n}{k} \cdot \frac{\Delta^2}{2}$ and $n\delta^2 u \ge n\delta^2 H'_u$, we have,

$$\phi' \ge \alpha n\delta^2 (1 + H'_u) + \frac{n-k^2}{2n} \cdot \Delta^2 u.$$

This completes the base case. We now proceed to prove the inductive step. As with Lemma 3.3, we consider two cases. The probability that our first center is chosen from an uncovered cluster is,

$$\frac{u \cdot \frac{n}{k} \cdot \Delta^2}{u \cdot \frac{n}{k} \cdot \Delta^2 + \left( (k-u)\frac{n}{k} - (k-t) \right)\delta^2} \ge \frac{u\Delta^2}{u\Delta^2 + (k-u)\delta^2}.$$

Applying our inductive hypothesis with $t$ and $u$ both decreased by 1, we obtain a potential contribution from this case of at least,

$$\frac{u\Delta^2}{u\Delta^2 + (k-u)\delta^2} \left( \beta^{t+1} \cdot \alpha n\delta^2 (1 + H'_{u-1}) + \frac{n-k^2}{2n} \cdot \Delta^2 (u - t) \right).$$

The probability that our first center is chosen from a covered cluster is,

$$\frac{\left( (k-u)\frac{n}{k} - (k-t) \right)\delta^2}{u \cdot \frac{n}{k} \cdot \Delta^2 + \left( (k-u)\frac{n}{k} - (k-t) \right)\delta^2} \ge \frac{(k-u)\delta^2}{u\Delta^2 + (k-u)\delta^2}.$$

Applying our inductive hypothesis with $t$ decreased by 1 but with $u$ constant, we obtain a potential contribution from this case of at least,

$$\frac{(k-u)\delta^2}{u\Delta^2 + (k-u)\delta^2} \left( \beta^{t+1} \cdot \alpha n\delta^2 (1 + H'_u) + \frac{n-k^2}{2n} \cdot \Delta^2 (u - t + 1) \right).$$

Therefore, $E[\phi']$ is at least,

$$\beta^{t+1} \cdot \alpha n\delta^2 (1 + H'_u) + \frac{n-k^2}{2n} \cdot \Delta^2 (u - t) + \frac{\beta^{t+1}}{u\Delta^2 + (k-u)\delta^2} \left( (k-u)\delta^2 \cdot \frac{n-k^2}{2n} \cdot \Delta^2 - u\Delta^2 \left( H'_u - H'_{u-1} \right) \alpha n\delta^2 \right).$$

However, $H'_u - H'_{u-1} = \frac{k-u}{ku}$ and $\beta = \frac{\Delta^2 - 2k\delta^2}{\Delta^2}$, so $u\Delta^2 \left( H'_u - H'_{u-1} \right) \alpha n\delta^2 = (k-u)\delta^2 \cdot \frac{n-k^2}{2n} \cdot \Delta^2$, and the result follows.

As in the previous section, we obtain the desired result by specializing the induction.

Theorem 4.1. $D^2$ seeding is no better than $2(\ln k)$-competitive.

Proof. Suppose a clustering with potential $\phi$ is constructed using k-means++ on $X$ described above. Apply Lemma 4.1 with $u = t = k - 1$ after the first center has been chosen. Noting that $1 + H'_{k-1} = 1 + \sum_{i=1}^{k-1}\left( \frac{1}{i} - \frac{1}{k} \right) = H_k \ge \ln k$,
we then have, $E[\phi] \ge \beta^k \cdot \alpha n\delta^2 \cdot \ln k$. Now, fix $k$ and $\delta$ but let $n$ and $\Delta$ approach infinity. Then $\alpha$ and $\beta$ both approach 1, and the result follows from the fact that $\phi_{OPT} = \frac{n-k}{2}\delta^2$.

5 Generalizations

Although the k-means algorithm itself applies only in vector spaces with the potential function $\phi = \sum_{x \in X} \min_{c \in C} \|x - c\|^2$, we note that our seeding technique does not have the same limitations. In this section, we discuss extending our results to arbitrary metric spaces with the more general potential function, $\phi^{[\ell]} = \sum_{x \in X} \min_{c \in C} \|x - c\|^{\ell}$ for $\ell \ge 1$. In particular, note that the case of $\ell = 1$ is the k-medians potential function.

These generalizations require only one change to the algorithm itself. Instead of using $D^2$ seeding, we switch to $D^{\ell}$ seeding, i.e., we choose $x'$ as a center with probability $\frac{D(x')^{\ell}}{\sum_{x \in X} D(x)^{\ell}}$.

For the analysis, the most important change appears in Lemma 3.1. Our original proof uses an inner product structure that is not available in the general case. However, a slightly weaker result can be proven using only the triangle inequality.

Lemma 5.1. Let $A$ be an arbitrary cluster in $C_{OPT}$, and let $C$ be the clustering with just one center, which is chosen uniformly at random from $A$. Then, $E[\phi^{[\ell]}(A)] \le 2^{\ell}\phi^{[\ell]}_{OPT}(A)$.

Proof. Let $c^*$ denote the center of $A$ in $C_{OPT}$. Then,

$$E[\phi^{[\ell]}(A)] = \frac{1}{|A|} \sum_{a' \in A} \sum_{a \in A} \|a - a'\|^{\ell} \le \frac{2^{\ell-1}}{|A|} \sum_{a' \in A} \sum_{a \in A} \left( \|a - c^*\|^{\ell} + \|a' - c^*\|^{\ell} \right) = 2^{\ell}\phi^{[\ell]}_{OPT}(A).$$

The second step here follows from the triangle inequality and the power-mean inequality.

The rest of our upper bound analysis carries through without change, except that in the proof of Lemma 3.2, we lose a factor of $2^{\ell-1}$ from the power-mean inequality, instead of just 2.
Putting everything together, we obtain the general theorem.

Theorem 5.1. If $C$ is constructed with $D^{\ell}$ seeding, then the corresponding potential function $\phi^{[\ell]}$ satisfies, $E[\phi^{[\ell]}] \le 2^{2\ell}(\ln k + 2)\phi^{[\ell]}_{OPT}$.

6 Empirical results

In order to evaluate k-means++ in practice, we have implemented and tested it in C++ [3]. In this section, we discuss the results of these preliminary experiments. We found that $D^2$ seeding substantially improves both the running time and the accuracy of k-means.

6.1 Datasets

We evaluated the performance of k-means and k-means++ on four datasets.

The first dataset, Norm25, is synthetic. To generate it, we chose 25 "true" centers uniformly at random from a 15-dimensional hypercube of side length 500. We then added points from Gaussian distributions of variance 1 around each true center. Thus, we obtained a number of well separated Gaussians with the true centers providing a good approximation to the optimal clustering.

We chose the remaining datasets from real-world examples off the UC-Irvine Machine Learning Repository. The Cloud dataset [7] consists of 1024 points in 10 dimensions, and it is Philippe Collard's first cloud cover database. The Intrusion dataset [18] consists of 494019 points in 35 dimensions, and it represents features available to an intrusion detection system. Finally, the Spam dataset [25] consists of 4601 points in 58 dimensions, and it represents features available to an e-mail spam detection system.

For each dataset, we tested $k = 10$, 25, and 50.

6.2 Metrics

Since we were testing randomized seeding processes, we ran 20 trials for each case. We report the minimum and the average potential (actually divided by the number of points), as well as the mean running time.

          Average φ                Minimum φ                Average T
  k     k-means   k-means++     k-means   k-means++     k-means  k-means++
 10   1.365·10^5    8.47%     1.174·10^5    0.93%        0.12     46.72%
 25   4.233·10^4   99.96%     1.914·10^4   99.92%        0.90     87.79%
 50   7.750·10^3   99.81%     1.474·10^1    0.53%        2.04      1.62%

Table 1: Experimental results on the Norm25 dataset (n = 10000, d = 15). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means: 100% · (1 − k-means++ value / k-means value).
          Average φ                Minimum φ                Average T
  k     k-means   k-means++     k-means   k-means++     k-means  k-means++
 10   7.921·10^3   22.33%     6.284·10^3   10.37%        0.08     51.09%
 25   3.637·10^3   42.76%     2.550·10^3   22.60%        0.11     43.21%
 50   1.867·10^3   39.01%     1.407·10^3   23.07%        0.16     41.99%

Table 2: Experimental results on the Cloud dataset (n = 1024, d = 10). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.

          Average φ                Minimum φ                Average T
  k     k-means   k-means++     k-means   k-means++     k-means  k-means++
 10   3.387·10^8   93.37%     3.206·10^8   94.40%       63.94     44.49%
 25   3.149·10^8   99.20%     3.100·10^8   99.32%      257.34     49.19%
 50   3.079·10^8   99.84%     3.076·10^8   99.87%      917.00     66.70%

Table 3: Experimental results on the Intrusion dataset (n = 494019, d = 35). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.

          Average φ                Minimum φ                Average T
  k     k-means   k-means++     k-means   k-means++     k-means  k-means++
 10   3.698·10^4   49.43%     3.684·10^4   54.59%        2.36     69.00%
 25   3.288·10^4   88.76%     3.280·10^4   89.58%        7.36     79.84%
 50   3.183·10^4   95.35%     2.384·10^4   94.30%       12.20     75.76%

Table 4: Experimental results on the Spam dataset (n = 4601, d = 58). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.
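In miniature, the comparison methodology behind these tables (several independent trials, uniform seeding versus $D^2$ seeding, the same local search, comparing final potentials) looks like the following. This is an illustrative Python toy on a small synthetic Gaussian mixture in the spirit of Norm25; it is not the authors' C++ test code [3], and every name and parameter here is our own:

```python
import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def potential(points, centers):
    return sum(min(dist2(p, c) for c in centers) for p in points)

def lloyd(points, centers):
    """Run the standard assignment/update loop to convergence."""
    while True:
        clusters = [[] for _ in centers]
        for p in points:
            clusters[min(range(len(centers)),
                         key=lambda j: dist2(p, centers[j]))].append(p)
        new = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else c
               for cl, c in zip(clusters, centers)]
        if new == centers:
            return centers
        centers = new

def d2_seed(points, k, rng):
    """Careful seeding: each new center is sampled with probability
    proportional to its squared distance to the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(dist2(p, c) for c in centers) for p in points]
        r, acc = rng.random() * sum(d2), 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc > r:
                break
        centers.append(p)
    return centers

# A toy Norm25-style instance: 9 well separated Gaussian blobs in 2-D.
rng = random.Random(0)
truth = [(60.0 * i, 60.0 * j) for i in range(3) for j in range(3)]
points = [tuple(rng.gauss(m, 1.0) for m in c) for c in truth for _ in range(20)]

trials = 5
uniform_avg = sum(potential(points, lloyd(points, random.Random(t).sample(points, 9)))
                  for t in range(trials)) / trials
careful_avg = sum(potential(points, lloyd(points, d2_seed(points, 9, random.Random(t))))
                  for t in range(trials)) / trials
```

On instances like this, uniform seeding typically places two samples in the same blob and the local search never splits them apart, while $D^2$ seeding typically covers every blob, so `careful_avg` comes out far below `uniform_avg`, mirroring the Norm25 rows of Table 1.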
Our implementations are standard with no special optimizations.

6.3 Results

The results for k-means and k-means++ are displayed in Tables 1 through 4. We list the absolute results for k-means, and the percentage improvement achieved by k-means++ (e.g., a 90% improvement in the running time is equivalent to a factor 10 speedup). We observe that k-means++ consistently outperformed k-means, both by achieving a lower potential value, in some cases by several orders of magnitude, and also by having a faster running time. The $D^2$ seeding is slightly slower than uniform seeding, but it still leads to a faster algorithm since it helps the local search converge after fewer iterations.

The synthetic example is a case where standard k-means does very badly. Even though there is an "obvious" clustering, the uniform seeding will inevitably merge some of these clusters, and the local search will never be able to split them apart (see [12] for further discussion of this phenomenon). The careful seeding method of k-means++ avoided this problem altogether, and it almost always attained the optimal clustering on the synthetic dataset.

The difference between k-means and k-means++ on the real-world datasets was also substantial. In every case, k-means++ achieved at least a 10% accuracy improvement over k-means, and it often performed much better. Indeed, on the Spam and Intrusion datasets, k-means++ achieved potentials 20 to 1000 times smaller than those achieved by standard k-means. Each trial also completed two to three times faster, and each individual trial was much more likely to achieve a good clustering.

7 Conclusion and future work

We have presented a new way to seed the k-means algorithm that is O(log k)-competitive with the optimal clustering. Furthermore, our seeding technique is as fast and as simple as the k-means algorithm itself, which makes it attractive in practice. Towards that end, we ran preliminary experiments on several real-world datasets, and we observed that k-means++ substantially outperformed standard k-means in terms of both speed and accuracy.

Although our analysis of the expected potential $E[\phi]$ achieved by k-means++ is tight to within a constant factor, a few open questions still remain.
Most importantly, it is standard practice to run the k-means algorithm multiple times, and then keep only the best clustering found. This raises the question of whether k-means++ achieves asymptotically better results if it is allowed several trials. For example, if k-means++ is run $2^k$ times, our arguments can be modified to show it is likely to achieve a constant approximation at least once. We ask whether a similar bound can be achieved for a smaller number of trials.

Also, experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased $\phi$ as much as possible. Unfortunately, our proofs do not carry over to this scenario. It would be interesting to see a comparable (or better) asymptotic result proven here.

Finally, we are currently working on a more thorough experimental analysis. In particular, we are measuring the performance of not only k-means++ and standard k-means, but also other variants that have been suggested in the theory community.

Acknowledgements

We would like to thank Rajeev Motwani for his helpful comments.

References

[1] Pankaj K. Agarwal and Nabil H. Mustafa. k-means projective clustering. In PODS '04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 155-165, New York, NY, USA, 2004. ACM Press.
[2] D. Arthur and S. Vassilvitskii. Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method. In Symposium on Foundations of Computer Science, 2006.
[3] David Arthur and Sergei Vassilvitskii. k-means++ test code. http://www.stanford.edu/~darthur/kMeansppTest.zip.
[4] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In SCG '06: Proceedings of the twenty-second annual Symposium on Computational Geometry. ACM Press, 2006.
[5] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[6] Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing, pages 30-39, New York, NY, USA, 2003. ACM Press.
[7] Philippe Collard's cloud cover database. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/taylor/cloud.data.
[8] Sanjoy Dasgupta. How fast is k-means? In Bernhard Schölkopf and Manfred K. Warmuth, editors, COLT, volume 2777 of Lecture Notes in Computer Science, page 735. Springer, 2003.
[9] W. Fernandez de la Vega, Marek Karpinski, Claire Kenyon, and Yuval Rabani. Approximation schemes for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing, pages 50-58, New York, NY, USA, 2003. ACM Press.
[10] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3):9-33, 2004.
[11] Frederic Gibou and Ronald Fedkiw. A fast hybrid k-means level set algorithm for segmentation. In 4th Annual Hawaii International Conference on Statistics and Mathematics, pages 281-291, 2005.
[12] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, 2003.
[13] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In STOC '04: Proceedings of the thirty-sixth annual ACM Symposium on Theory of Computing, pages 291-300, New York, NY, USA, 2004. ACM Press.
[14] Sariel Har-Peled and Bardia Sadri. How fast is the k-means method? In SODA '05: Proceedings of the sixteenth annual ACM-SIAM Symposium on Discrete Algorithms, pages 877-885, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.
[15] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien. Large-scale clustering of cDNA-fingerprinting data. Genome Research, 9:1093-1105, 1999.
[16] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In SCG '94: Proceedings of the tenth annual Symposium on Computational Geometry, pages 332-339, New York, NY, USA, 1994. ACM Press.
[17] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89-112, 2004.
[18] KDD Cup 1999 dataset. http://kdd.ics.uci.edu//databases/kddcup99/kddcup99.html.
[19] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In FOCS '04: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pages 454-462, Washington, DC, USA, 2004. IEEE Computer Society.
[20] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-136, 1982.
[21] Jiří Matoušek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1):61-84, 2000.
[22] Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering. In Adnan Darwiche and Nir Friedman, editors, UAI, pages 344-351. Morgan Kaufmann, 2002.
[23] A. Meyerson. Online facility location. In FOCS '01: Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, page 426, Washington, DC, USA, 2001. IEEE Computer Society.
[24] R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Symposium on Foundations of Computer Science, 2006.
[25] Spam e-mail database. http://www.ics.uci.edu/~mlearn/databases/spambase/.