
k-means++: The Advantages of Careful Seeding

David Arthur*    Sergei Vassilvitskii†

*Stanford University. Supported in part by an NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC.
†Stanford University. Supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC.

Abstract

The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.

1 Introduction

Clustering is one of the classic problems in machine learning and computational geometry. In the popular k-means formulation, one is given an integer k and a set of n data points in R^d. The goal is to choose k centers so as to minimize φ, the sum of the squared distances between each point and its closest center.

Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it "is by far the most popular clustering algorithm used in scientific and industrial applications" [5].

Usually referred to simply as k-means, Lloyd's algorithm begins with k arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These two steps (assignment and center calculation) are repeated until the process stabilizes.

One can check that the total error φ is monotonically decreasing, which ensures that no clustering is repeated during the course of the algorithm. Since there are at most k^n possible clusterings, the process will always terminate. In practice, very few iterations are usually required, which makes the algorithm much faster than most of its competitors.

Unfortunately, the empirical speed and simplicity of the k-means algorithm come at the price of accuracy. There are many natural examples for which the algorithm generates arbitrarily bad clusterings (i.e., φ/φ_OPT is unbounded even when n and k are fixed). Furthermore, these examples do not rely on an adversarial placement of the starting centers, and the ratio can be unbounded with high probability even with the standard randomized seeding technique.

In this paper, we propose a way of initializing k-means by choosing random starting centers with very specific probabilities. Specifically, we choose a point p as a center with probability proportional to p's contribution to the overall potential. Letting φ denote the potential after choosing centers in this way, we show the following.

Theorem 1.1. For any set of data points, E[φ] ≤ 8(ln k + 2)·φ_OPT.

This sampling is both fast and simple, and it already achieves approximation guarantees that k-means cannot. We propose using it to seed the initial centers for k-means, leading to a combined algorithm we call k-means++.

This complements a very recent result of Ostrovsky et al. [24], who independently proposed much the same algorithm. Whereas they showed this randomized seeding is O(1)-competitive on data sets following a certain separation condition, we show it is O(log k)-competitive on all data sets.

We also show that the analysis for Theorem 1.1 is tight up to a constant factor, and that it can be easily extended to various potential functions in arbitrary metric spaces. In particular, we can also get a simple O(log k) approximation algorithm for the k-median objective. Furthermore, we provide preliminary experimental data showing that in practice, k-means++ really does outperform k-means in terms of both accuracy and speed, often by a substantial margin.

1.1 Related work

As a fundamental problem in machine learning, k-means has a rich history. Because of its simplicity and its observed speed, Lloyd's method [20] remains the most popular approach in practice,
despite its limited accuracy. The convergence time of Lloyd's method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy.

In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^{kd}). Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in k, and are unfortunately impractical even for relatively small n, k and d.

Kanungo et al. [17] proposed an O(n³ ε^{−d}) algorithm that is (9+ε)-competitive. However, n³ compares unfavorably with the almost linear running time of Lloyd's method, and the exponential dependence on d can also be problematic. For these reasons, Kanungo et al. also suggested a way of combining their techniques with Lloyd's algorithm, but in order to avoid the exponential dependence on d, their approach sacrifices all approximation guarantees.

Mettu and Plaxton [22] also achieved a constant-probability O(1) approximation using a technique called successive sampling. They match our running time of O(nkd), but only if k is sufficiently large and the spread is sufficiently small. In practice, our approach is simpler, and our experimental results seem to be better in terms of both speed and accuracy.

Very recently, Ostrovsky et al. [24] independently proposed an algorithm that is essentially identical to ours, although their analysis is quite different. Letting φ_OPT,k denote the optimal potential for a k-clustering on a given data set, they prove k-means++ is O(1)-competitive in the case where φ_OPT,k / φ_OPT,k−1 ≤ ε². The intuition here is that if this condition does not hold, then the data is not well suited for clustering with the given value for k.

Combining this result with ours gives a strong characterization of the algorithm's performance. In particular, k-means++ is never worse than O(log k)-competitive, and on very well formed data sets, it improves to being O(1)-competitive.

Overall, the seeding technique we propose is similar in spirit to that used by Meyerson [23] for online facility location, and Mishra et al. [12] and Charikar et al. [6] in the context of k-median clustering. However, our analysis is quite different from those works.

2 Preliminaries

In this section, we formally define the k-means problem, as well as the k-means and k-means++ algorithms.

For the k-means problem, we are given an integer k and a set of n data points X ⊂ R^d. We wish to choose k centers C so as to minimize the potential function,

    φ = Σ_{x∈X} min_{c∈C} ‖x − c‖².

Choosing these centers implicitly defines a clustering – for each center, we set one cluster to be the set of data points that are closer to that center than to any other. As noted above, finding an exact solution to the k-means problem is NP-hard.

Throughout the paper, we will let C_OPT denote the optimal clustering for a given instance of the k-means problem, and we will let φ_OPT denote the corresponding potential. Given a clustering C with potential φ, we also let φ(A) denote the contribution of A ⊂ X to the potential (i.e., φ(A) = Σ_{x∈A} min_{c∈C} ‖x − c‖²).

2.1 The k-means algorithm

The k-means method is a simple and fast algorithm that attempts to locally improve an arbitrary k-means clustering. It works as follows.

1. Arbitrarily choose k initial centers C = {c₁, …, c_k}.
2. For each i ∈ {1, …, k}, set the cluster C_i to be the set of points in X that are closer to c_i than they are to c_j for all j ≠ i.
3. For each i ∈ {1, …, k}, set c_i to be the center of mass of all points in C_i: c_i = (1/|C_i|) Σ_{x∈C_i} x.
4. Repeat Steps 2 and 3 until C no longer changes.

It is standard practice to choose the initial centers uniformly at random from X. For Step 2, ties may be broken arbitrarily, as long as the method is consistent.

Steps 2 and 3 are both guaranteed to decrease φ, so the algorithm makes local improvements to an arbitrary clustering until it is no longer possible to do so. To see that Step 3 does in fact decrease φ, it is helpful to recall a standard result from linear algebra (see [14]).

Lemma 2.1. Let S be a set of points with center of mass c(S), and let z be an arbitrary point. Then,

    Σ_{x∈S} ‖x − z‖² − Σ_{x∈S} ‖x − c(S)‖² = |S|·‖c(S) − z‖².

Monotonicity for Step 3 follows from taking S to be a single cluster and z to be its initial center.

As discussed above, the k-means algorithm is attractive in practice because it is simple and it is generally fast. Unfortunately, it is guaranteed only to find a local optimum, which can often be quite poor.

2.2 The k-means++ algorithm

The k-means algorithm begins with an arbitrary set of cluster centers. We propose a specific way of choosing these centers. At any given time, let D(x) denote the shortest distance from a data point x to the closest center we have already chosen. Then, we define the following algorithm, which we call k-means++.

1a. Choose an initial center c₁ uniformly at random from X.
1b. Choose the next center c_i, selecting c_i = x′ ∈ X with probability D(x′)² / Σ_{x∈X} D(x)².
1c. Repeat Step 1b until we have chosen a total of k centers.
2–4. Proceed as with the standard k-means algorithm.

We call the weighting used in Step 1b simply "D² weighting".

3 k-means++ is O(log k)-competitive

In this section, we prove our main result.

Theorem 3.1. If C is constructed with k-means++, then the corresponding potential function φ satisfies E[φ] ≤ 8(ln k + 2)·φ_OPT.

In fact, we prove this holds after only Step 1 of the algorithm above. Steps 2 through 4 can then only decrease φ. Not surprisingly, our experiments show this local optimization is important in practice, although it is difficult to quantify this theoretically.

Our analysis consists of two parts. First, we show that k-means++ is competitive in those clusters of C_OPT from which it chooses a center. This is easiest in the case of our first center, which is chosen uniformly at random.

Lemma 3.1. Let A be an arbitrary cluster in C_OPT, and let C be the clustering with just one center, which is chosen uniformly at random from A. Then, E[φ(A)] = 2·φ_OPT(A).

Proof. Let c(A) denote the center of mass of the data points in A. By Lemma 2.1, we know that since C_OPT is optimal, it must be using c(A) as the center corresponding to the cluster A. Using the same lemma again, we see E[φ(A)] is given by,

    Σ_{a′∈A} (1/|A|) Σ_{a∈A} ‖a − a′‖²
  = (1/|A|) Σ_{a′∈A} ( Σ_{a∈A} ‖a − c(A)‖² + |A|·‖a′ − c(A)‖² )
  = 2 Σ_{a∈A} ‖a − c(A)‖²,

and the result follows.

Our next step is to prove an analog of Lemma 3.1 for the remaining centers, which are chosen with D² weighting.

Lemma 3.2. Let A be an arbitrary cluster in C_OPT, and let C be an arbitrary clustering. If we add a random center to C from A, chosen with D² weighting, then E[φ(A)] ≤ 8·φ_OPT(A).

Proof. The probability that we choose some fixed a′ as our center, given that we are choosing our center from A, is precisely D(a′)² / Σ_{a∈A} D(a)². Furthermore, after choosing the center a′, a point a will contribute precisely min(D(a), ‖a − a′‖)² to the potential. Therefore,

    E[φ(A)] = Σ_{a′∈A} [ D(a′)² / Σ_{a∈A} D(a)² ] · Σ_{a∈A} min(D(a), ‖a − a′‖)².

Note by the triangle inequality that D(a′) ≤ D(a) + ‖a − a′‖ for all a, a′. From this, the power-mean inequality¹ implies that D(a′)² ≤ 2D(a)² + 2‖a − a′‖². Summing over all a, we then have that D(a′)² ≤ (2/|A|) Σ_{a∈A} D(a)² + (2/|A|) Σ_{a∈A} ‖a − a′‖², and hence, E[φ(A)] is at most,

    (2/|A|) Σ_{a′∈A} [ Σ_{a∈A} D(a)² / Σ_{a∈A} D(a)² ] · Σ_{a∈A} min(D(a), ‖a − a′‖)²
  + (2/|A|) Σ_{a′∈A} [ Σ_{a∈A} ‖a − a′‖² / Σ_{a∈A} D(a)² ] · Σ_{a∈A} min(D(a), ‖a − a′‖)².

In the first expression, we substitute min(D(a), ‖a − a′‖)² ≤ ‖a − a′‖², and in the second expression, we substitute min(D(a), ‖a − a′‖)² ≤ D(a)². Simplifying, we then have,

    E[φ(A)] ≤ (4/|A|) Σ_{a′∈A} Σ_{a∈A} ‖a − a′‖² = 8·φ_OPT(A).

The last step here follows from Lemma 3.1.

We have now shown that seeding by D² weighting is competitive as long as it chooses centers from each cluster of C_OPT, which completes the first half of our argument. We now use induction to show the total error in general is at most O(log k).

¹The power-mean inequality states for any real numbers a₁, …, a_m that Σ a_i² ≥ (1/m)(Σ a_i)². It follows from the Cauchy-Schwarz inequality. We are only using the case m = 2 here, but we will need the general case for Lemma 3.3.
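As a concrete companion to the two procedures of Section 2, the following is a minimal Python sketch of D² seeding followed by Lloyd iteration. It is not the authors' C++ test code [3]; the function names d2_seed and lloyd, and all implementation details (deterministic seed, cumulative-sum sampling), are our own illustrative choices.

```python
import random

def d2_seed(points, k, seed=0):
    """k-means++ seeding (Steps 1a-1c): pick centers with D^2 weighting."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]                 # Step 1a: uniform first center
    while len(centers) < k:
        # D(x)^2: squared distance from x to the nearest center chosen so far.
        d2 = [min(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers)
              for x in points]
        # Step 1b: choose x' with probability D(x')^2 / sum_x D(x)^2.
        r = rng.random() * sum(d2)
        acc = 0.0
        for x, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(x)
                break
    return centers

def lloyd(points, centers, max_iters=100):
    """Steps 2-4 of the k-means method: assign, recompute centers of mass, repeat."""
    k = len(centers)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for x in points:                            # Step 2: assign to nearest center
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(x, centers[i])))
            clusters[nearest].append(x)
        new_centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                       else centers[i]              # Step 3: centers of mass
                       for i, cl in enumerate(clusters)]
        if new_centers == centers:                  # Step 4: stop when C is stable
            break
        centers = new_centers
    return centers
```

On well-separated data (such as the Norm25 dataset of Section 6), D² weighting tends to place one initial center in each true cluster, which is exactly the behavior the analysis in this section quantifies.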
Lemma 3.3. Let C be an arbitrary clustering. Choose u > 0 "uncovered" clusters from C_OPT, and let X_u denote the set of points in these clusters. Also let X_c = X − X_u. Now suppose we add t ≤ u random centers to C, chosen with D² weighting. Let C′ denote the resulting clustering, and let φ′ denote the corresponding potential. Then, E[φ′] is at most,

    ( φ(X_c) + 8·φ_OPT(X_u) )·(1 + H_t) + ((u − t)/u)·φ(X_u).

Here, H_t denotes the harmonic sum, 1 + 1/2 + ⋯ + 1/t.

Proof. We prove this by induction, showing that if the result holds for (t−1, u) and (t−1, u−1), then it also holds for (t, u). Therefore, it suffices to check t = 0, u > 0 and t = u = 1 as our base cases.

If t = 0 and u > 0, the result follows from the fact that 1 + H_t = (u − t)/u = 1.

Next, suppose t = u = 1. We choose our one new center from the one uncovered cluster with probability exactly φ(X_u)/φ. In this case, Lemma 3.2 guarantees that E[φ′] ≤ φ(X_c) + 8·φ_OPT(X_u). Since φ′ ≤ φ even if we choose a center from a covered cluster, we have,

    E[φ′] ≤ (φ(X_u)/φ)·( φ(X_c) + 8·φ_OPT(X_u) ) + (φ(X_c)/φ)·φ
          ≤ 2·φ(X_c) + 8·φ_OPT(X_u).

Since 1 + H_t = 2 here, we have shown the result holds for both base cases.

We now proceed to prove the inductive step. It is convenient here to consider two cases.

First suppose we choose our first center from a covered cluster. As above, this happens with probability exactly φ(X_c)/φ. Note that this new center can only decrease φ. Bearing this in mind, apply the inductive hypothesis with the same choice of covered clusters, but with t decreased by one. It follows that our contribution to E[φ′] in this case is at most,

    (φ(X_c)/φ)·( ( φ(X_c) + 8·φ_OPT(X_u) )·(1 + H_{t−1}) + ((u − t + 1)/u)·φ(X_u) ).

On the other hand, suppose we choose our first center from some uncovered cluster A. This happens with probability φ(A)/φ. Let p_a denote the probability that we choose a ∈ A as our center, given the center is somewhere in A, and let φ_a denote φ(A) after we choose a as our center. Once again, we apply our inductive hypothesis, this time adding A to the set of covered clusters, as well as decreasing both t and u by 1. It follows that our contribution to E[φ′] in this case is at most,

    (φ(A)/φ)·Σ_{a∈A} p_a ( ( φ(X_c) + φ_a + 8·φ_OPT(X_u) − 8·φ_OPT(A) )·(1 + H_{t−1}) + ((u − t)/(u − 1))·(φ(X_u) − φ(A)) )
  ≤ (φ(A)/φ)·( ( φ(X_c) + 8·φ_OPT(X_u) )·(1 + H_{t−1}) + ((u − t)/(u − 1))·(φ(X_u) − φ(A)) ).

The last step here follows from the fact that Σ_{a∈A} p_a·φ_a ≤ 8·φ_OPT(A), which is implied by Lemma 3.2.

Now, the power-mean inequality implies that Σ_{A⊂X_u} φ(A)² ≥ (1/u)·φ(X_u)². Therefore, if we sum over all uncovered clusters A, we obtain a potential contribution of at most,

    (φ(X_u)/φ)·( φ(X_c) + 8·φ_OPT(X_u) )·(1 + H_{t−1}) + (1/φ)·((u − t)/(u − 1))·( φ(X_u)² − (1/u)·φ(X_u)² )
  = (φ(X_u)/φ)·( ( φ(X_c) + 8·φ_OPT(X_u) )·(1 + H_{t−1}) + ((u − t)/u)·φ(X_u) ).

Combining the potential contribution to E[φ′] from both cases, we now obtain the desired bound:

    E[φ′] ≤ ( φ(X_c) + 8·φ_OPT(X_u) )·(1 + H_{t−1}) + ((u − t)/u)·φ(X_u) + (φ(X_c)/φ)·(φ(X_u)/u)
          ≤ ( φ(X_c) + 8·φ_OPT(X_u) )·( 1 + H_{t−1} + 1/u ) + ((u − t)/u)·φ(X_u).

The inductive step now follows from the fact that 1/u ≤ 1/t.

We specialize Lemma 3.3 to obtain our main result.

Theorem 3.1. If C is constructed with k-means++, then the corresponding potential function φ satisfies E[φ] ≤ 8(ln k + 2)·φ_OPT.

Proof. Consider the clustering C after we have completed Step 1. Let A denote the C_OPT cluster in which we chose the first center. Applying Lemma 3.3 with t = u = k − 1 and with A being the only covered cluster, we have,

    E[φ] ≤ ( φ(A) + 8·φ_OPT − 8·φ_OPT(A) )·(1 + H_{k−1}).

The result now follows from Lemma 3.1, and from the fact that H_{k−1} ≤ 1 + ln k.

4 A matching lower bound

In this section, we show that the D² seeding used by k-means++ is no better than Ω(log k)-competitive in expectation, thereby proving Theorem 3.1 is tight within a constant factor.

Fix k, and then choose n, Δ, δ with n ≫ k and Δ ≫ δ. We construct X with n points. First choose k centers c₁, c₂, …, c_k such that ‖c_i − c_j‖² = Δ² − ((n − k)/n)·δ² for all i ≠ j. Now, for each c_i, add data points x_{i,1}, x_{i,2}, …, x_{i,n/k} arranged in a regular simplex with center c_i, side length δ, and radius δ·√((n − k)/(2n)). If we do this in orthogonal dimensions for each i, we then have,

    ‖x_{i,i′} − x_{j,j′}‖ = δ if i = j, or Δ otherwise.

We prove our seeding technique is in expectation Ω(log k) worse than the optimal clustering in this case. Clearly, the optimal clustering has centers {c_i}, which leads to an optimal potential of φ_OPT = ((n − k)/2)·δ². Conversely, using an induction similar to that of Lemma 3.3, we show D² seeding cannot match this bound. As before, we bound the expected potential in terms of the number of centers left to choose and the number of uncovered clusters (those clusters of C_OPT from which we have not chosen a center).

Lemma 4.1. Let C be an arbitrary clustering on X with k − t ≥ 1 centers, but with u clusters from C_OPT uncovered. Now suppose we add t random centers to C, chosen with D² weighting. Let C′ denote the resulting clustering, and let φ′ denote the corresponding potential. Furthermore, let α = (n − k²)/n, β = (Δ² − 2kδ²)/Δ², and H′_u = Σ_{i=1}^{u} (k − i)/(ki). Then, E[φ′] is at least,

    α^{t+1}·( β·nδ²·(1 + H′_u) + ((n/k)·Δ² − 2nδ²)·(u − t) ).

Proof. We prove this by induction on t.

If t = 0, note that,

    φ′ = φ ≥ (n − u·(n/k) − k)·δ² + u·(n/k)·Δ².

Since n − u·(n/k) ≥ n/k, we have (n − u·(n/k) − k)/(n − u·(n/k)) ≥ (n/k − k)/(n/k) = α. Also, α, β ≤ 1. Therefore,

    φ′ ≥ α·(n − u·(n/k))·δ² + u·(n/k)·Δ².

Finally, since β ≤ 1 and 1 + H′_u ≤ 2u for u ≥ 1, we have β·nδ²·(1 + H′_u) ≤ 2nδ²u, and hence,

    φ′ ≥ α·( β·nδ²·(1 + H′_u) + ((n/k)·Δ² − 2nδ²)·u ).

This completes the base case.

We now proceed to prove the inductive step. As with Lemma 3.3, we consider two cases. The probability that our first center is chosen from an uncovered cluster is,

    u·(n/k)·Δ² / ( u·(n/k)·Δ² + ((k − u)·(n/k) − (k − t))·δ² ) ≥ uΔ² / ( uΔ² + (k − u)δ² ).

Applying our inductive hypothesis with t and u both decreased by 1, we obtain a potential contribution from this case of at least,

    [ uΔ² / ( uΔ² + (k − u)δ² ) ] · α^{t+1}·( β·nδ²·(1 + H′_{u−1}) + ((n/k)·Δ² − 2nδ²)·(u − t) ).

The probability that our first center is chosen from a covered cluster is,

    ((k − u)·(n/k) − (k − t))·δ² / ( u·(n/k)·Δ² + ((k − u)·(n/k) − (k − t))·δ² )
  ≥ [ ((k − u)·(n/k) − (k − t)) / ((k − u)·(n/k)) ] · (k − u)δ² / ( uΔ² + (k − u)δ² )
  ≥ α·(k − u)δ² / ( uΔ² + (k − u)δ² ).

Applying our inductive hypothesis with t decreased by 1 but with u constant, we obtain a potential contribution from this case of at least,

    [ α·(k − u)δ² / ( uΔ² + (k − u)δ² ) ] · α^{t}·( β·nδ²·(1 + H′_u) + ((n/k)·Δ² − 2nδ²)·(u − t + 1) ).

Therefore, E[φ′] is at least,

    α^{t+1}·( β·nδ²·(1 + H′_u) + ((n/k)·Δ² − 2nδ²)·(u − t) )
  + [ α^{t+1} / ( uΔ² + (k − u)δ² ) ] · ( (k − u)δ²·((n/k)·Δ² − 2nδ²) − uΔ²·(H′_u − H′_{u−1})·β·nδ² ).

However, H′_u − H′_{u−1} = (k − u)/(ku) and β = (Δ² − 2kδ²)/Δ², so

    uΔ²·(H′_u − H′_{u−1})·β·nδ² = (k − u)δ²·((n/k)·Δ² − 2nδ²),

and the result follows.

As in the previous section, we obtain the desired result by specializing the induction.

Theorem 4.1. D² seeding is no better than 2(ln k)-competitive.

Proof. Suppose a clustering with potential φ is constructed using k-means++ on X described above. Apply Lemma 4.1 with u = t = k − 1 after the first center has been chosen. Noting that 1 + H′_{k−1} = 1 + Σ_{i=1}^{k−1} (1/i − 1/k) = H_k ≥ ln k, we then have,

    E[φ] ≥ α^k·β·nδ²·ln k.

Now, fix k and δ but let n and Δ approach infinity. Then α and β both approach 1, and the result follows from the fact that φ_OPT = ((n − k)/2)·δ².

5 Generalizations

Although the k-means algorithm itself applies only in vector spaces with the potential function φ = Σ_{x∈X} min_{c∈C} ‖x − c‖², we note that our seeding technique does not have the same limitations. In this section, we discuss extending our results to arbitrary metric spaces with the more general potential function,

    φ[ℓ] = Σ_{x∈X} min_{c∈C} ‖x − c‖^ℓ   for ℓ ≥ 1.

In particular, note that the case of ℓ = 1 is the k-medians potential function.

These generalizations require only one change to the algorithm itself. Instead of using D² seeding, we switch to D^ℓ seeding – i.e., we choose x′ as a center with probability D(x′)^ℓ / Σ_{x∈X} D(x)^ℓ.

For the analysis, the most important change appears in Lemma 3.1. Our original proof uses an inner product structure that is not available in the general case. However, a slightly weaker result can be proven using only the triangle inequality.

Lemma 5.1. Let A be an arbitrary cluster in C_OPT, and let C be the clustering with just one center, which is chosen uniformly at random from A. Then, E[φ[ℓ](A)] ≤ 2^ℓ·φ[ℓ]_OPT(A).

Proof. Let c denote the center of A in C_OPT. Then,

    E[φ[ℓ](A)] = (1/|A|) Σ_{a′∈A} Σ_{a∈A} ‖a − a′‖^ℓ
              ≤ (2^{ℓ−1}/|A|) Σ_{a′∈A} Σ_{a∈A} ( ‖a − c‖^ℓ + ‖a′ − c‖^ℓ )
              = 2^ℓ·φ[ℓ]_OPT(A).

The second step here follows from the triangle inequality and the power-mean inequality.

The rest of our upper bound analysis carries through without change, except that in the proof of Lemma 3.2, we lose a factor of 2^{ℓ−1} from the power-mean inequality, instead of just 2. Putting everything together, we obtain the general theorem.

Theorem 5.1. If C is constructed with D^ℓ seeding, then the corresponding potential function φ[ℓ] satisfies, E[φ[ℓ]] ≤ 2^{2ℓ}(ln k + 2)·φ[ℓ]_OPT.

6 Empirical results

In order to evaluate k-means++ in practice, we have implemented and tested it in C++ [3]. In this section, we discuss the results of these preliminary experiments. We found that D² seeding substantially improves both the running time and the accuracy of k-means.

6.1 Datasets

We evaluated the performance of k-means and k-means++ on four datasets.

The first dataset, Norm25, is synthetic. To generate it, we chose 25 "true" centers uniformly at random from a 15-dimensional hypercube of side length 500. We then added points from Gaussian distributions of variance 1 around each true center. Thus, we obtained a number of well separated Gaussians with the true centers providing a good approximation to the optimal clustering.

We chose the remaining datasets from real-world examples off the UC-Irvine Machine Learning Repository. The Cloud dataset [7] consists of 1024 points in 10 dimensions, and it is Philippe Collard's first cloud cover database. The Intrusion dataset [18] consists of 494019 points in 35 dimensions, and it represents features available to an intrusion detection system. Finally, the Spam dataset [25] consists of 4601 points in 58 dimensions, and it represents features available to an e-mail spam detection system.

For each dataset, we tested k = 10, 25, and 50.

6.2 Metrics

Since we were testing randomized seeding processes, we ran 20 trials for each case. We report the minimum and the average potential (actually divided by the number of points), as well as the mean running time. Our implementations are standard with no special optimizations.

Table 1: Experimental results on the Norm25 dataset (n = 10000, d = 15). For k-means, we list the actual potential and the time in seconds. For k-means++, we list the percentage improvement over k-means: 100%·(1 − k-means++ value / k-means value).

         Average φ                Minimum φ                Average T
  k    k-means    k-means++     k-means    k-means++     k-means  k-means++
  10   1.365·10⁵    8.47%       1.174·10⁵    0.93%       0.12     46.72%
  25   4.233·10⁴   99.96%       1.914·10⁴   99.92%       0.90     87.79%
  50   7.750·10³   99.81%       1.474·10¹    0.53%       2.04      1.62%

Table 2: Experimental results on the Cloud dataset (n = 1024, d = 10). For k-means, we list the actual potential and the time in seconds. For k-means++, we list the percentage improvement over k-means.

         Average φ                Minimum φ                Average T
  k    k-means    k-means++     k-means    k-means++     k-means  k-means++
  10   7.921·10³   22.33%       6.284·10³   10.37%       0.08     51.09%
  25   3.637·10³   42.76%       2.550·10³   22.60%       0.11     43.21%
  50   1.867·10³   39.01%       1.407·10³   23.07%       0.16     41.99%

Table 3: Experimental results on the Intrusion dataset (n = 494019, d = 35). For k-means, we list the actual potential and the time in seconds. For k-means++, we list the percentage improvement over k-means.

         Average φ                Minimum φ                Average T
  k    k-means    k-means++     k-means    k-means++     k-means  k-means++
  10   3.387·10⁸   93.37%       3.206·10⁸   94.40%       63.94    44.49%
  25   3.149·10⁸   99.20%       3.100·10⁸   99.32%       257.34   49.19%
  50   3.079·10⁸   99.84%       3.076·10⁸   99.87%       917.00   66.70%

Table 4: Experimental results on the Spam dataset (n = 4601, d = 58). For k-means, we list the actual potential and the time in seconds. For k-means++, we list the percentage improvement over k-means.

         Average φ                Minimum φ                Average T
  k    k-means    k-means++     k-means    k-means++     k-means  k-means++
  10   3.698·10⁴   49.43%       3.684·10⁴   54.59%       2.36     69.00%
  25   3.288·10⁴   88.76%       3.280·10⁴   89.58%       7.36     79.84%
  50   3.183·10⁴   95.35%       2.384·10⁴   94.30%       12.20    75.76%

6.3 Results

The results for k-means and k-means++ are displayed in Tables 1 through 4. We list the absolute results for k-means, and the percentage improvement achieved by k-means++ (e.g., a 90% improvement in the running time is equivalent to a factor 10 speedup). We observe that k-means++ consistently outperformed k-means, both by achieving a lower potential value, in some cases by several orders of magnitude, and also by having a faster running time. The D² seeding is slightly slower than uniform seeding, but it still leads to a faster algorithm since it helps the local search converge after fewer iterations.

The synthetic example is a case where standard k-means does very badly. Even though there is an "obvious" clustering, the uniform seeding will inevitably merge some of these clusters, and the local search will never be able to split them apart (see [12] for further discussion of this phenomenon). The careful seeding method of k-means++ avoided this problem altogether, and it almost always attained the optimal clustering on the synthetic dataset.

The difference between k-means and k-means++ on the real-world datasets was also substantial. In every case, k-means++ achieved at least a 10% accuracy improvement over k-means, and it often performed much better. Indeed, on the Spam and Intrusion datasets, k-means++ achieved potentials 20 to 1000 times smaller than those achieved by standard k-means. Each trial also completed two to three times faster, and each individual trial was much more likely to achieve a good clustering.

7 Conclusion and future work

We have presented a new way to seed the k-means algorithm that is O(log k)-competitive with the optimal clustering. Furthermore, our seeding technique is as fast and as simple as the k-means algorithm itself, which makes it attractive in practice. Towards that end, we ran preliminary experiments on several real-world datasets, and we observed that k-means++ substantially outperformed standard k-means in terms of both speed and accuracy.

Although our analysis of the expected potential E[φ] achieved by k-means++ is tight to within a constant factor, a few open questions still remain. Most importantly, it is standard practice to run the k-means algorithm multiple times, and then keep only the best clustering found. This raises the question of whether k-means++ achieves asymptotically better results if it is allowed several trials. For example, if k-means++ is run 2^k times, our arguments can be modified to show it is likely to achieve a constant approximation at least once. We ask whether a similar bound can be achieved for a smaller number of trials.

Also, experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased φ as much as possible. Unfortunately, our proofs do not carry over to this scenario. It would be interesting to see a comparable (or better) asymptotic result proven here.

Finally, we are currently working on a more thorough experimental analysis. In particular, we are measuring the performance of not only k-means++ and standard k-means, but also other variants that have been suggested in the theory community.

Acknowledgements

We would like to thank Rajeev Motwani for his helpful comments.

References

[1] Pankaj K. Agarwal and Nabil H. Mustafa. k-means projective clustering. In PODS '04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 155–165, New York, NY, USA, 2004. ACM Press.
[2] D. Arthur and S. Vassilvitskii. Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method. In Symposium on Foundations of Computer Science, 2006.
[3] David Arthur and Sergei Vassilvitskii. k-means++ test code. http://www.stanford.edu/~darthur/kMeansppTest.zip.
[4] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In SCG '06: Proceedings of the twenty-second annual Symposium on Computational Geometry. ACM Press, 2006.
[5] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[6] Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing, pages 30–39, New York, NY, USA, 2003. ACM Press.
[7] Philippe Collard's cloud cover database. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/taylor/cloud.data.
[8] Sanjoy Dasgupta. How fast is k-means? In Bernhard Scholkopf and Manfred K. Warmuth, editors, COLT, volume 2777 of Lecture Notes in Computer Science, page 735. Springer, 2003.
[9] W. Fernandez de la Vega, Marek Karpinski, Claire Kenyon, and Yuval Rabani. Approximation schemes for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing, pages 50–58, New York, NY, USA, 2003. ACM Press.
[10] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the singular value decomposition. Mach. Learn., 56(1-3):9–33, 2004.
[11] Frederic Gibou and Ronald Fedkiw. A fast hybrid k-means level set algorithm for segmentation. In 4th Annual Hawaii International Conference on Statistics and Mathematics, pages 281–291, 2005.
[12] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515–528, 2003.
[13] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In STOC '04: Proceedings of the thirty-sixth annual ACM Symposium on Theory of Computing, pages 291–300, New York, NY, USA, 2004. ACM Press.
[14] Sariel Har-Peled and Bardia Sadri. How fast is the k-means method? In SODA '05: Proceedings of the sixteenth annual ACM-SIAM Symposium on Discrete Algorithms, pages 877–885, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.
[15] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien. Large-scale clustering of cDNA-fingerprinting data. Genome Research, 9:1093–1105, 1999.
[16] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In SCG '94: Proceedings of the tenth annual Symposium on Computational Geometry, pages 332–339, New York, NY, USA, 1994. ACM Press.
[17] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004.
[18] KDD Cup 1999 dataset. http://kdd.ics.uci.edu//databases/kddcup99/kddcup99.html.
[19] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In FOCS '04: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pages 454–462, Washington, DC, USA, 2004. IEEE Computer Society.
[20] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–136, 1982.
[21] Jiri Matousek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1):61–84, 2000.
[22] Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering. In Adnan Darwiche and Nir Friedman, editors, UAI, pages 344–351. Morgan Kaufmann, 2002.
[23] A. Meyerson. Online facility location. In FOCS '01: Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, page 426, Washington, DC, USA, 2001. IEEE Computer Society.
[24] R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Symposium on Foundations of Computer Science, 2006.
[25] Spam e-mail database. http://www.ics.uci.edu/~mlearn/databases/spambase/.