k-means++: The Advantages of Careful Seeding

David Arthur*    Sergei Vassilvitskii†

* Stanford University. Supported in part by an NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC.
† Stanford University. Supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC.

Abstract

The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is $O(\log k)$-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.

1 Introduction

Clustering is one of the classic problems in machine learning and computational geometry. In the popular k-means formulation, one is given an integer $k$ and a set of $n$ data points in $\mathbb{R}^d$. The goal is to choose $k$ centers so as to minimize $\phi$, the sum of the squared distances between each point and its closest center. Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it "is by far the most popular clustering algorithm used in scientific and industrial applications" [5].

Usually referred to simply as k-means, Lloyd's algorithm begins with $k$ arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These two steps (assignment and center calculation) are repeated until the process stabilizes.

One can check that the total error $\phi$ is monotonically decreasing, which ensures that no clustering is repeated during the course of the algorithm. Since there are at most $k^n$ possible clusterings, the process will always terminate. In practice, very few iterations are usually required, which makes the algorithm much faster than most of its competitors.

Unfortunately, the empirical speed and simplicity of the k-means algorithm come at the price of accuracy. There are many natural examples for which the algorithm generates arbitrarily bad clusterings (i.e., $\frac{\phi}{\phi_{OPT}}$ is unbounded even when $n$ and $k$ are fixed). Furthermore, these examples do not rely on an adversarial placement of the starting centers, and the ratio can be unbounded with high probability even with the standard randomized seeding technique.

In this paper, we propose a way of initializing k-means by choosing random starting centers with very specific probabilities. Specifically, we choose a point $p$ as a center with probability proportional to $p$'s contribution to the overall potential. Letting $\phi$ denote the potential after choosing centers in this way, we show the following.

Theorem 1.1. For any set of data points, $E[\phi] \le 8(\ln k + 2)\phi_{OPT}$.

This sampling is both fast and simple, and it already achieves approximation guarantees that k-means cannot. We propose using it to seed the initial centers for k-means, leading to a combined algorithm we call k-means++.

This complements a very recent result of Ostrovsky et al. [24], who independently proposed much the same algorithm. Whereas they showed this randomized seeding is $O(1)$-competitive on data sets following a certain separation condition, we show it is $O(\log k)$-competitive on all data sets.

We also show that the analysis for Theorem 1.1 is tight up to a constant factor, and that it can be easily extended to various potential functions in arbitrary metric spaces. In particular, we can also get a simple $O(\log k)$ approximation algorithm for the k-median objective. Furthermore, we provide preliminary experimental data showing that in practice, k-means++ really does outperform k-means in terms of both accuracy and speed, often by a substantial margin.

1.1 Related work

As a fundamental problem in machine learning, k-means has a rich history. Because of its simplicity and its observed speed, Lloyd's method [20] remains the most popular approach in practice,
despite its limited accuracy. The convergence time of Lloyd's method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy.

In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with a running time of $O(n^{kd})$. Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in $k$, and are unfortunately impractical even for relatively small $n$, $k$ and $d$.

Kanungo et al. [17] proposed an $O(n^3 \epsilon^{-d})$ algorithm that is $(9+\epsilon)$-competitive. However, $n^3$ compares unfavorably with the almost linear running time of Lloyd's method, and the exponential dependence on $d$ can also be problematic. For these reasons, Kanungo et al. also suggested a way of combining their techniques with Lloyd's algorithm, but in order to avoid the exponential dependence on $d$, their approach sacrifices all approximation guarantees.

Mettu and Plaxton [22] also achieved a constant-probability $O(1)$ approximation using a technique called successive sampling. They match our running time of $O(nkd)$, but only if $k$ is sufficiently large and the spread is sufficiently small. In practice, our approach is simpler, and our experimental results seem to be better in terms of both speed and accuracy.

Very recently, Ostrovsky et al. [24] independently proposed an algorithm that is essentially identical to ours, although their analysis is quite different. Letting $\phi_{OPT,k}$ denote the optimal potential for a $k$-clustering on a given data set, they prove k-means++ is $O(1)$-competitive in the case where $\phi_{OPT,k} \le \epsilon^2 \phi_{OPT,k-1}$. The intuition here is that if this condition does not hold, then the data is not well suited for clustering with the given value for $k$.

Combining this result with ours gives a strong characterization of the algorithm's performance. In particular, k-means++ is never worse than $O(\log k)$-competitive, and on very well formed data sets, it improves to being $O(1)$-competitive.

Overall, the seeding technique we propose is similar in spirit to that used by Meyerson [23] for online facility location, and Mishra et al. [12] and Charikar et al. [6] in the context of k-median clustering. However, our analysis is quite different from those works.

2 Preliminaries

In this section, we formally define the k-means problem, as well as the k-means and k-means++ algorithms.

For the k-means problem, we are given an integer $k$ and a set of $n$ data points $X \subset \mathbb{R}^d$. We wish to choose $k$ centers $C$ so as to minimize the potential function,
$$\phi = \sum_{x \in X} \min_{c \in C} \|x - c\|^2.$$
Choosing these centers implicitly defines a clustering: for each center, we set one cluster to be the set of data points that are closer to that center than to any other. As noted above, finding an exact solution to the k-means problem is NP-hard.

Throughout the paper, we will let $C_{OPT}$ denote the optimal clustering for a given instance of the k-means problem, and we will let $\phi_{OPT}$ denote the corresponding potential. Given a clustering $C$ with potential $\phi$, we also let $\phi(A)$ denote the contribution of $A \subset X$ to the potential (i.e., $\phi(A) = \sum_{x \in A} \min_{c \in C} \|x - c\|^2$).

2.1 The k-means algorithm

The k-means method is a simple and fast algorithm that attempts to locally improve an arbitrary k-means clustering. It works as follows.

1. Arbitrarily choose $k$ initial centers $C = \{c_1, \ldots, c_k\}$.
2. For each $i \in \{1, \ldots, k\}$, set the cluster $C_i$ to be the set of points in $X$ that are closer to $c_i$ than they are to $c_j$ for all $j \ne i$.
3. For each $i \in \{1, \ldots, k\}$, set $c_i$ to be the center of mass of all points in $C_i$: $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$.
4. Repeat Steps 2 and 3 until $C$ no longer changes.

It is standard practice to choose the initial centers uniformly at random from $X$. For Step 2, ties may be broken arbitrarily, as long as the method is consistent.

Steps 2 and 3 are both guaranteed to decrease $\phi$, so the algorithm makes local improvements to an arbitrary clustering until it is no longer possible to do so. To see that Step 3 does in fact decrease $\phi$, it is helpful to recall a standard result from linear algebra (see [14]).

Lemma 2.1. Let $S$ be a set of points with center of mass $c(S)$, and let $z$ be an arbitrary point. Then,
$$\sum_{x \in S} \|x - z\|^2 - \sum_{x \in S} \|x - c(S)\|^2 = |S| \cdot \|c(S) - z\|^2.$$

Monotonicity for Step 3 follows from taking $S$ to be a single cluster and $z$ to be its initial center.

As discussed above, the k-means algorithm is attractive in practice because it is simple and it is generally fast. Unfortunately, it is guaranteed only to find a local optimum, which can often be quite poor.

2.2 The k-means++ algorithm

The k-means algorithm begins with an arbitrary set of cluster centers. We propose a specific way of choosing these centers. At any given time, let $D(x)$ denote the shortest distance from a data point $x$ to the closest center we have already chosen. Then, we define the following algorithm, which we call k-means++.

1a. Choose an initial center $c_1$ uniformly at random from $X$.
1b. Choose the next center $c_i$, selecting $c_i = x' \in X$ with probability $\frac{D(x')^2}{\sum_{x \in X} D(x)^2}$.
1c. Repeat Step 1b until we have chosen a total of $k$ centers.
2-4. Proceed as with the standard k-means algorithm.

We call the weighting used in Step 1b simply "$D^2$ weighting".

3 k-means++ is O(log k)-competitive

In this section, we prove our main result.

Theorem 3.1. If $C$ is constructed with k-means++, then the corresponding potential function $\phi$ satisfies $E[\phi] \le 8(\ln k + 2)\phi_{OPT}$.

In fact, we prove this holds after only Step 1 of the algorithm above. Steps 2 through 4 can then only decrease $\phi$. Not surprisingly, our experiments show this local optimization is important in practice, although it is difficult to quantify this theoretically.

Our analysis consists of two parts. First, we show that k-means++ is competitive in those clusters of $C_{OPT}$ from which it chooses a center. This is easiest in the case of our first center, which is chosen uniformly at random.

Lemma 3.1. Let $A$ be an arbitrary cluster in $C_{OPT}$, and let $C$ be the clustering with just one center, which is chosen uniformly at random from $A$. Then, $E[\phi(A)] = 2\phi_{OPT}(A)$.

Proof. Let $c(A)$ denote the center of mass of the data points in $A$. By Lemma 2.1, we know that since $C_{OPT}$ is optimal, it must be using $c(A)$ as the center corresponding to the cluster $A$. Using the same lemma again, we see $E[\phi(A)]$ is given by,
$$\sum_{a' \in A} \frac{1}{|A|} \left( \sum_{a \in A} \|a - a'\|^2 \right) = \frac{1}{|A|} \sum_{a' \in A} \left( \sum_{a \in A} \|a - c(A)\|^2 + |A| \cdot \|a' - c(A)\|^2 \right) = 2 \sum_{a \in A} \|a - c(A)\|^2,$$
and the result follows.

Our next step is to prove an analog of Lemma 3.1 for the remaining centers, which are chosen with $D^2$ weighting.

Lemma 3.2. Let $A$ be an arbitrary cluster in $C_{OPT}$, and let $C$ be an arbitrary clustering. If we add a random center to $C$ from $A$, chosen with $D^2$ weighting, then $E[\phi(A)] \le 8\phi_{OPT}(A)$.

Proof. The probability that we choose some fixed $a'$ as our center, given that we are choosing our center from $A$, is precisely $\frac{D(a')^2}{\sum_{a \in A} D(a)^2}$. Furthermore, after choosing the center $a'$, a point $a$ will contribute precisely $\min(D(a), \|a - a'\|)^2$ to the potential. Therefore,
$$E[\phi(A)] = \sum_{a' \in A} \frac{D(a')^2}{\sum_{a \in A} D(a)^2} \sum_{a \in A} \min(D(a), \|a - a'\|)^2.$$
Note by the triangle inequality that $D(a') \le D(a) + \|a - a'\|$ for all $a, a'$. From this, the power-mean inequality¹ implies that $D(a')^2 \le 2D(a)^2 + 2\|a - a'\|^2$. Summing over all $a$, we then have that $D(a')^2 \le \frac{2}{|A|} \sum_{a \in A} D(a)^2 + \frac{2}{|A|} \sum_{a \in A} \|a - a'\|^2$, and hence, $E[\phi(A)]$ is at most,
$$\frac{2}{|A|} \sum_{a' \in A} \frac{\sum_{a \in A} D(a)^2}{\sum_{a \in A} D(a)^2} \cdot \sum_{a \in A} \min(D(a), \|a - a'\|)^2 \;+\; \frac{2}{|A|} \sum_{a' \in A} \frac{\sum_{a \in A} \|a - a'\|^2}{\sum_{a \in A} D(a)^2} \cdot \sum_{a \in A} \min(D(a), \|a - a'\|)^2.$$
In the first expression, we substitute $\min(D(a), \|a - a'\|)^2 \le \|a - a'\|^2$, and in the second expression, we substitute $\min(D(a), \|a - a'\|)^2 \le D(a)^2$. Simplifying, we then have,
$$E[\phi(A)] \le \frac{4}{|A|} \sum_{a' \in A} \sum_{a \in A} \|a - a'\|^2 = 8\phi_{OPT}(A).$$
The last step here follows from Lemma 3.1.

We have now shown that seeding by $D^2$ weighting is competitive as long as it chooses centers from each cluster of $C_{OPT}$, which completes the first half of our argument. We now use induction to show the total error in general is at most $O(\log k)$.

¹ The power-mean inequality states for any real numbers $a_1, \ldots, a_m$ that $\sum a_i^2 \ge \frac{1}{m} \left( \sum a_i \right)^2$. It follows from the Cauchy-Schwarz inequality. We are only using the case $m = 2$ here, but we will need the general case for Lemma 3.3.
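For concreteness, the two procedures analyzed here, the $D^2$ seeding of Section 2.2 (Steps 1a-1c) and the standard k-means iterations (Steps 2-4), can be sketched in Python as follows. This is only a minimal illustration for small data sets; it is not the authors' C++ implementation [3], and the function names and plain-list data layout are ours.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def d2_seeding(points, k, rng=random.Random(0)):
    """Steps 1a-1c: choose k centers by D^2 weighting.

    The first center is uniform over the data; each subsequent center is
    x' with probability D(x')^2 / sum_x D(x)^2, where D(x) is the distance
    from x to the nearest center chosen so far.
    """
    centers = [rng.choice(points)]                       # Step 1a
    d2 = [dist2(p, centers[0]) for p in points]          # D(x)^2 for each x
    while len(centers) < k:                              # Steps 1b-1c
        r = rng.random() * sum(d2)                       # sample proportionally to d2
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
        # Update D(x)^2 against the newly chosen center.
        d2 = [min(old, dist2(p, centers[-1])) for old, p in zip(d2, points)]
    return centers

def lloyd(points, centers):
    """Steps 2-4: assign points to nearest center, recompute centers of
    mass, and repeat until the clustering stabilizes."""
    while True:
        clusters = [[] for _ in centers]
        for p in points:                                 # Step 2 (ties: lowest index)
            i = min(range(len(centers)), key=lambda j: dist2(p, centers[j]))
            clusters[i].append(p)
        new = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else ctr
               for cl, ctr in zip(clusters, centers)]    # Step 3
        if new == centers:                               # Step 4
            return centers
        centers = new

def potential(points, centers):
    """The potential phi = sum_x min_c ||x - c||^2."""
    return sum(min(dist2(p, c) for c in centers) for p in points)
```

Since Steps 2 and 3 never increase $\phi$, the potential after `lloyd` is at most the potential right after seeding, which is the quantity bounded by Theorem 3.1.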
Lemma 3.3. Let $C$ be an arbitrary clustering. Choose $u > 0$ "uncovered" clusters from $C_{OPT}$, and let $X_u$ denote the set of points in these clusters. Also let $X_c = X - X_u$. Now suppose we add $t \le u$ random centers to $C$, chosen with $D^2$ weighting. Let $C'$ denote the resulting clustering, and let $\phi'$ denote the corresponding potential. Then, $E[\phi']$ is at most,
$$\left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)(1 + H_t) + \frac{u-t}{u} \cdot \phi(X_u).$$
Here, $H_t$ denotes the harmonic sum, $1 + \frac{1}{2} + \cdots + \frac{1}{t}$.

Proof. We prove this by induction, showing that if the result holds for $(t-1, u)$ and $(t-1, u-1)$, then it also holds for $(t, u)$. Therefore, it suffices to check $t = 0, u > 0$ and $t = u = 1$ as our base cases.

If $t = 0$ and $u > 0$, the result follows from the fact that $1 + H_t = \frac{u-t}{u} = 1$. Next, suppose $t = u = 1$. We choose our one new center from the one uncovered cluster with probability exactly $\frac{\phi(X_u)}{\phi}$. In this case, Lemma 3.2 guarantees that $E[\phi'] \le \phi(X_c) + 8\phi_{OPT}(X_u)$. Since $\phi' \le \phi$ even if we choose a center from a covered cluster, we have,
$$E[\phi'] \le \frac{\phi(X_u)}{\phi} \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right) + \frac{\phi(X_c)}{\phi} \cdot \phi \le 2\phi(X_c) + 8\phi_{OPT}(X_u).$$
Since $1 + H_t = 2$ here, we have shown the result holds for both base cases.

We now proceed to prove the inductive step. It is convenient here to consider two cases. First suppose we choose our first center from a covered cluster. As above, this happens with probability exactly $\frac{\phi(X_c)}{\phi}$. Note that this new center can only decrease $\phi$. Bearing this in mind, apply the inductive hypothesis with the same choice of covered clusters, but with $t$ decreased by one. It follows that our contribution to $E[\phi']$ in this case is at most,
$$\frac{\phi(X_c)}{\phi} \left( \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)(1 + H_{t-1}) + \frac{u-t+1}{u} \phi(X_u) \right).$$
On the other hand, suppose we choose our first center from some uncovered cluster $A$. This happens with probability $\frac{\phi(A)}{\phi}$. Let $p_a$ denote the probability that we choose $a \in A$ as our center, given the center is somewhere in $A$, and let $\phi_a$ denote $\phi(A)$ after we choose $a$ as our center. Once again, we apply our inductive hypothesis, this time adding $A$ to the set of covered clusters, as well as decreasing both $t$ and $u$ by 1. It follows that our contribution to $E[\phi']$ in this case is at most,
$$\frac{\phi(A)}{\phi} \sum_{a \in A} p_a \left( \left( \phi(X_c) + \phi_a + 8\phi_{OPT}(X_u) - 8\phi_{OPT}(A) \right)(1 + H_{t-1}) + \frac{u-t}{u-1} \left( \phi(X_u) - \phi(A) \right) \right)$$
$$\le \frac{\phi(A)}{\phi} \left( \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)(1 + H_{t-1}) + \frac{u-t}{u-1} \left( \phi(X_u) - \phi(A) \right) \right).$$
The last step here follows from the fact that $\sum_{a \in A} p_a \phi_a \le 8\phi_{OPT}(A)$, which is implied by Lemma 3.2.

Now, the power-mean inequality implies that $\sum_{A \subset X_u} \phi(A)^2 \ge \frac{1}{u} \phi(X_u)^2$. Therefore, if we sum over all uncovered clusters $A$, we obtain a potential contribution of at most,
$$\frac{\phi(X_u)}{\phi} \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)(1 + H_{t-1}) + \frac{1}{\phi} \cdot \frac{u-t}{u-1} \left( \phi(X_u)^2 - \frac{1}{u} \phi(X_u)^2 \right)$$
$$= \frac{\phi(X_u)}{\phi} \left( \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)(1 + H_{t-1}) + \frac{u-t}{u} \phi(X_u) \right).$$
Combining the potential contribution to $E[\phi']$ from both cases, we now obtain the desired bound:
$$E[\phi'] \le \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right)(1 + H_{t-1}) + \frac{u-t}{u} \phi(X_u) + \frac{\phi(X_c)}{\phi} \cdot \frac{\phi(X_u)}{u}$$
$$\le \left( \phi(X_c) + 8\phi_{OPT}(X_u) \right) \left( 1 + H_{t-1} + \frac{1}{u} \right) + \frac{u-t}{u} \phi(X_u).$$
The inductive step now follows from the fact that $\frac{1}{u} \le \frac{1}{t}$.

We specialize Lemma 3.3 to obtain our main result.

Theorem 3.1. If $C$ is constructed with k-means++, then the corresponding potential function $\phi$ satisfies $E[\phi] \le 8(\ln k + 2)\phi_{OPT}$.

Proof. Consider the clustering $C$ after we have completed Step 1. Let $A$ denote the $C_{OPT}$ cluster in which we chose the first center. Applying Lemma 3.3 with $t = u = k-1$ and with $A$ being the only covered cluster, we have,
$$E[\phi] \le \left( \phi(A) + 8\phi_{OPT} - 8\phi_{OPT}(A) \right)(1 + H_{k-1}).$$
The result now follows from Lemma 3.1, and from the fact that $H_{k-1} \le 1 + \ln k$.

4 A matching lower bound

In this section, we show that the $D^2$ seeding used by k-means++ is no better than $\Omega(\log k)$-competitive in expectation, thereby proving Theorem 3.1 is tight within a constant factor.

Fix $k$, and then choose $n$, $\Delta$, $\delta$ with $n \ge k$ and $\Delta \ge \delta$. We construct $X$ with $n$ points. First choose $k$ centers $c_1, c_2, \ldots, c_k$ such that $\|c_i - c_j\|^2 = \Delta^2 - \frac{n-k}{n}\delta^2$ for all $i \ne j$. Now, for each $c_i$, add data points $x_{i,1}, x_{i,2}, \ldots, x_{i,n/k}$ arranged in a regular simplex with center $c_i$, side length $\delta$, and radius $\sqrt{\frac{n-k}{2n}} \cdot \delta$. If we do this in orthogonal dimensions for each $i$, we then have,
$$\|x_{i,i'} - x_{j,j'}\| = \delta \text{ if } i = j, \text{ or } \Delta \text{ otherwise}.$$
We prove our seeding technique is in expectation $\Omega(\log k)$ worse than the optimal clustering in this case. Clearly, the optimal clustering has centers $\{c_i\}$, which leads to an optimal potential of $\phi_{OPT} = \frac{n-k}{2}\delta^2$. Conversely, using an induction similar to that of Lemma 3.3, we show $D^2$ seeding cannot match this bound. As before, we bound the expected potential in terms of the number of centers left to choose and the number of uncovered clusters (those clusters of $C_{OPT}$ from which we have not chosen a center).

Lemma 4.1. Let $C$ be an arbitrary clustering on $X$ with $k - t \ge 1$ centers, but with $u$ clusters from $C_{OPT}$ uncovered. Now suppose we add $t$ random centers to $C$, chosen with $D^2$ weighting. Let $C'$ denote the resulting clustering, and let $\phi'$ denote the corresponding potential. Furthermore, let $\alpha = \frac{n-k^2}{n}$, $\beta = \frac{\Delta^2 - 2k\delta^2}{\Delta^2}$ and $H'_u = \sum_{i=1}^{u} \frac{k-i}{ki}$. Then, $E[\phi']$ is at least,
$$\alpha^{t+1} \left( n\delta^2 (1 + H'_u)\beta + \left( \frac{n}{k}\Delta^2 - 2n\delta^2 \right)(u-t) \right).$$

Proof. We prove this by induction on $t$.

If $t = 0$, note that,
$$\phi' = \phi = \left( n - \frac{un}{k} - k \right)\delta^2 + \frac{un}{k}\Delta^2.$$
Since $n - \frac{un}{k} \ge \frac{n}{k}$, we have $\frac{n - un/k - k}{n - un/k} \ge \frac{n/k - k}{n/k} = \frac{n-k^2}{n} = \alpha$. Also, $\alpha, \beta \le 1$. Therefore,
$$\phi' \ge \alpha \left( n - \frac{un}{k} \right)\beta\delta^2 + \frac{un}{k}\Delta^2.$$
Finally, since $n\delta^2 u \ge \frac{un}{k}\delta^2$ and $n\delta^2 u \ge n\delta^2 H'_u \beta$, we have,
$$\phi' \ge \alpha \left( n\delta^2 (1 + H'_u)\beta + \left( \frac{n}{k}\Delta^2 - 2n\delta^2 \right) u \right).$$
This completes the base case.

We now proceed to prove the inductive step. As with Lemma 3.3, we consider two cases. The probability that our first center is chosen from an uncovered cluster is,
$$\frac{u \cdot \frac{n}{k}\Delta^2}{u \cdot \frac{n}{k}\Delta^2 + \left( (k-u)\frac{n}{k} - (k-t) \right)\delta^2} \ge \frac{u\Delta^2}{u\Delta^2 + (k-u)\delta^2}.$$
Applying our inductive hypothesis with $t$ and $u$ both decreased by 1, we obtain a potential contribution from this case of at least,
$$\frac{u\Delta^2}{u\Delta^2 + (k-u)\delta^2} \cdot \alpha^{t+1} \left( n\delta^2 (1 + H'_{u-1})\beta + \left( \frac{n}{k}\Delta^2 - 2n\delta^2 \right)(u-t) \right).$$
The probability that our first center is chosen from a covered cluster is,
$$\frac{\left( (k-u)\frac{n}{k} - (k-t) \right)\delta^2}{u \cdot \frac{n}{k}\Delta^2 + \left( (k-u)\frac{n}{k} - (k-t) \right)\delta^2} \le \frac{(k-u)\delta^2}{u\Delta^2 + (k-u)\delta^2}.$$
Applying our inductive hypothesis with $t$ decreased by 1 but with $u$ constant, we obtain a potential contribution from this case of at least,
$$\frac{(k-u)\delta^2}{u\Delta^2 + (k-u)\delta^2} \cdot \alpha^{t+1} \left( n\delta^2 (1 + H'_u)\beta + \left( \frac{n}{k}\Delta^2 - 2n\delta^2 \right)(u-t+1) \right).$$
Therefore, $E[\phi']$ is at least,
$$\alpha^{t+1} \left( n\delta^2 (1 + H'_u)\beta + \left( \frac{n}{k}\Delta^2 - 2n\delta^2 \right)(u-t) \right) + \frac{\alpha^{t+1}}{u\Delta^2 + (k-u)\delta^2} \left( (k-u)\delta^2 \left( \frac{n}{k}\Delta^2 - 2n\delta^2 \right) - u\Delta^2 \left( H'_u - H'_{u-1} \right) n\delta^2\beta \right).$$
However, $H'_u - H'_{u-1} = \frac{k-u}{ku}$ and $\beta = \frac{\Delta^2 - 2k\delta^2}{\Delta^2}$, so
$$u\Delta^2 \left( H'_u - H'_{u-1} \right) n\delta^2\beta = (k-u)\delta^2 \left( \frac{n}{k}\Delta^2 - 2n\delta^2 \right),$$
and the result follows.

As in the previous section, we obtain the desired result by specializing the induction.

Theorem 4.1. $D^2$ seeding is no better than $2(\ln k)$-competitive.

Proof. Suppose a clustering with potential $\phi$ is constructed using k-means++ on $X$ described above. Apply Lemma 4.1 with $u = t = k-1$ after the first center has been chosen. Noting that $1 + H'_{k-1} = 1 + \sum_{i=1}^{k-1} \left( \frac{1}{i} - \frac{1}{k} \right) = H_k \ge \ln k$, we then have,
$$E[\phi] \ge \alpha^k \beta \cdot n\delta^2 \ln k.$$
Now, fix $k$ and $\delta$ but let $n$ and $\Delta$ approach infinity. Then $\alpha$ and $\beta$ both approach 1, and the result follows from the fact that $\phi_{OPT} = \frac{n-k}{2}\delta^2$.

5 Generalizations

Although the k-means algorithm itself applies only in vector spaces with the potential function $\phi = \sum_{x \in X} \min_{c \in C} \|x - c\|^2$, we note that our seeding technique does not have the same limitations. In this section, we discuss extending our results to arbitrary metric spaces with the more general potential function, $\phi^{[\ell]} = \sum_{x \in X} \min_{c \in C} \|x - c\|^\ell$ for $\ell \ge 1$. In particular, note that the case of $\ell = 1$ is the k-medians potential function.

These generalizations require only one change to the algorithm itself. Instead of using $D^2$ seeding, we switch to $D^\ell$ seeding; i.e., we choose $x'$ as a center with probability $\frac{D(x')^\ell}{\sum_{x \in X} D(x)^\ell}$.

For the analysis, the most important change appears in Lemma 3.1. Our original proof uses an inner product structure that is not available in the general case. However, a slightly weaker result can be proven using only the triangle inequality.

Lemma 5.1. Let $A$ be an arbitrary cluster in $C_{OPT}$, and let $C$ be the clustering with just one center, which is chosen uniformly at random from $A$. Then, $E[\phi^{[\ell]}(A)] \le 2^\ell \phi^{[\ell]}_{OPT}(A)$.

Proof. Let $c$ denote the center of $A$ in $C_{OPT}$. Then,
$$E[\phi^{[\ell]}(A)] = \frac{1}{|A|} \sum_{a' \in A} \sum_{a \in A} \|a - a'\|^\ell \le \frac{2^{\ell-1}}{|A|} \sum_{a' \in A} \sum_{a \in A} \left( \|a - c\|^\ell + \|a' - c\|^\ell \right) = 2^\ell \phi^{[\ell]}_{OPT}(A).$$
The second step here follows from the triangle inequality and the power-mean inequality.

The rest of our upper bound analysis carries through without change, except that in the proof of Lemma 3.2, we lose a factor of $2^{\ell-1}$ from the power-mean inequality, instead of just 2. Putting everything together, we obtain the general theorem.

Theorem 5.1. If $C$ is constructed with $D^\ell$ seeding, then the corresponding potential function $\phi^{[\ell]}$ satisfies $E[\phi^{[\ell]}] \le 2^{2\ell}(\ln k + 2)\phi^{[\ell]}_{OPT}$.

6 Empirical results

In order to evaluate k-means++ in practice, we have implemented and tested it in C++ [3]. In this section, we discuss the results of these preliminary experiments. We found that $D^2$ seeding substantially improves both the running time and the accuracy of k-means.

6.1 Datasets

We evaluated the performance of k-means and k-means++ on four datasets.

The first dataset, Norm25, is synthetic. To generate it, we chose 25 "true" centers uniformly at random from a 15-dimensional hypercube of side length 500. We then added points from Gaussian distributions of variance 1 around each true center. Thus, we obtained a number of well separated Gaussians with the true centers providing a good approximation to the optimal clustering.

We chose the remaining datasets from real-world examples off the UC-Irvine Machine Learning Repository. The Cloud dataset [7] consists of 1024 points in 10 dimensions, and it is Philippe Collard's first cloud cover database. The Intrusion dataset [18] consists of 494019 points in 35 dimensions, and it represents features available to an intrusion detection system. Finally, the Spam dataset [25] consists of 4601 points in 58 dimensions, and it represents features available to an e-mail spam detection system.

For each dataset, we tested k = 10, 25, and 50.

         Average phi               Minimum phi               Average T
  k      k-means     k-means++     k-means     k-means++     k-means   k-means++
  10     1.365*10^5    8.47%       1.174*10^5    0.93%        0.12      46.72%
  25     4.233*10^4   99.96%       1.914*10^4   99.92%        0.90      87.79%
  50     7.750*10^3   99.81%       1.474*10^1    0.53%        2.04       1.62%

Table 1: Experimental results on the Norm25 dataset (n = 10000, d = 15). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means: $100\% \cdot \left( 1 - \frac{\text{k-means++ value}}{\text{k-means value}} \right)$.

         Average phi               Minimum phi               Average T
  k      k-means     k-means++     k-means     k-means++     k-means   k-means++
  10     7.921*10^3   22.33%       6.284*10^3   10.37%        0.08      51.09%
  25     3.637*10^3   42.76%       2.550*10^3   22.60%        0.11      43.21%
  50     1.867*10^3   39.01%       1.407*10^3   23.07%        0.16      41.99%

Table 2: Experimental results on the Cloud dataset (n = 1024, d = 10). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.

         Average phi               Minimum phi               Average T
  k      k-means     k-means++     k-means     k-means++     k-means   k-means++
  10     3.387*10^8   93.37%       3.206*10^8   94.40%       63.94      44.49%
  25     3.149*10^8   99.20%       3.100*10^8   99.32%      257.34      49.19%
  50     3.079*10^8   99.84%       3.076*10^8   99.87%      917.00      66.70%

Table 3: Experimental results on the Intrusion dataset (n = 494019, d = 35). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.

         Average phi               Minimum phi               Average T
  k      k-means     k-means++     k-means     k-means++     k-means   k-means++
  10     3.698*10^4   49.43%       3.684*10^4   54.59%        2.36      69.00%
  25     3.288*10^4   88.76%       3.280*10^4   89.58%        7.36      79.84%
  50     3.183*10^4   95.35%       2.384*10^4   94.30%       12.20      75.76%

Table 4: Experimental results on the Spam dataset (n = 4601, d = 58). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.

6.2 Metrics

Since we were testing randomized seeding processes, we ran 20 trials for each case. We report
the minimum and the average potential (actually divided by the number of points), as well as the mean running time. Our implementations are standard, with no special optimizations.

6.3 Results

The results for k-means and k-means++ are displayed in Tables 1 through 4. We list the absolute results for k-means, and the percentage improvement achieved by k-means++ (e.g., a 90% improvement in the running time is equivalent to a factor 10 speedup). We observe that k-means++ consistently outperformed k-means, both by achieving a lower potential value, in some cases by several orders of magnitude, and also by having a faster running time. The $D^2$ seeding is slightly slower than uniform seeding, but it still leads to a faster algorithm since it helps the local search converge after fewer iterations.

The synthetic example is a case where standard k-means does very badly. Even though there is an "obvious" clustering, the uniform seeding will inevitably merge some of these clusters, and the local search will never be able to split them apart (see [12] for further discussion of this phenomenon). The careful seeding method of k-means++ avoided this problem altogether, and it almost always attained the optimal clustering on the synthetic dataset.

The difference between k-means and k-means++ on the real-world datasets was also substantial. In every case, k-means++ achieved at least a 10% accuracy improvement over k-means, and it often performed much better. Indeed, on the Spam and Intrusion datasets, k-means++ achieved potentials 20 to 1000 times smaller than those achieved by standard k-means. Each trial also completed two to three times faster, and each individual trial was much more likely to achieve a good clustering.

7 Conclusion and future work

We have presented a new way to seed the k-means algorithm that is $O(\log k)$-competitive with the optimal clustering. Furthermore, our seeding technique is as fast and as simple as the k-means algorithm itself, which makes it attractive in practice. Towards that end, we ran preliminary experiments on several real-world datasets, and we observed that k-means++ substantially outperformed standard k-means in terms of both speed and accuracy.

Although our analysis of the expected potential $E[\phi]$ achieved by k-means++ is tight to within a constant factor, a few open questions still remain. Most importantly, it is standard practice to run the k-means algorithm multiple times, and then keep only the best clustering found. This raises the question of whether k-means++ achieves asymptotically better results if it is allowed several trials. For example, if k-means++ is run $2^k$ times, our arguments can be modified to show it is likely to achieve a constant approximation at least once. We ask whether a similar bound can be achieved for a smaller number of trials.

Also, experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased $\phi$ as much as possible. Unfortunately, our proofs do not carry over to this scenario. It would be interesting to see a comparable (or better) asymptotic result proven here.

Finally, we are currently working on a more thorough experimental analysis. In particular, we are measuring the performance of not only k-means++ and standard k-means, but also other variants that have been suggested in the theory community.

Acknowledgements

We would like to thank Rajeev Motwani for his helpful comments.

References

[1] Pankaj K. Agarwal and Nabil H. Mustafa. k-means projective clustering. In PODS '04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 155-165, New York, NY, USA, 2004. ACM Press.
[2] D. Arthur and S. Vassilvitskii. Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method. In Symposium on Foundations of Computer Science, 2006.
[3] David Arthur and Sergei Vassilvitskii. k-means++ test code. http://www.stanford.edu/~darthur/kMeansppTest.zip.
[4] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In SCG '06: Proceedings of the twenty-second annual Symposium on Computational Geometry. ACM Press, 2006.
[5] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[6] Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing, pages 30-39, New York, NY, USA, 2003. ACM Press.
[7] Philippe Collard's cloud cover database. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/taylor/cloud.data.
[8] Sanjoy Dasgupta. How fast is k-means? In Bernhard Schölkopf and Manfred K. Warmuth, editors, COLT, volume 2777 of Lecture Notes in Computer Science, page 735. Springer, 2003.
[9] W. Fernandez de la Vega, Marek Karpinski, Claire Kenyon, and Yuval Rabani. Approximation schemes for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing, pages 50-58, New York, NY, USA, 2003. ACM Press.
[10] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3):9-33, 2004.
[11] Frederic Gibou and Ronald Fedkiw. A fast hybrid k-means level set algorithm for segmentation. In 4th Annual Hawaii International Conference on Statistics and Mathematics, pages 281-291, 2005.
[12] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, 2003.
[13] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In STOC '04: Proceedings of the thirty-sixth annual ACM Symposium on Theory of Computing, pages 291-300, New York, NY, USA, 2004. ACM Press.
[14] Sariel Har-Peled and Bardia Sadri. How fast is the k-means method? In SODA '05: Proceedings of the sixteenth annual ACM-SIAM Symposium on Discrete Algorithms, pages 877-885, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.
[15] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien. Large-scale clustering of cDNA-fingerprinting data. Genome Research, 9:1093-1105, 1999.
[16] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In SCG '94: Proceedings of the tenth annual Symposium on Computational Geometry, pages 332-339, New York, NY, USA, 1994. ACM Press.
[17] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89-112, 2004.
[18] KDD Cup 1999 dataset. http://kdd.ics.uci.edu//databases/kddcup99/kddcup99.html.
[19] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In FOCS '04: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pages 454-462, Washington, DC, USA, 2004. IEEE Computer Society.
[20] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-136, 1982.
[21] Jiří Matoušek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1):61-84, 2000.
[22] Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering. In Adnan Darwiche and Nir Friedman, editors, UAI, pages 344-351. Morgan Kaufmann, 2002.
[23] A. Meyerson. Online facility location. In FOCS '01: Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, page 426, Washington, DC, USA, 2001. IEEE Computer Society.
[24] R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Symposium on Foundations of Computer Science, 2006.
[25] Spam e-mail database. http://www.ics.uci.edu/~mlearn/databases/spambase/.
