k-means++: The Advantages of Careful Seeding

David Arthur*        Sergei Vassilvitskii†

* Stanford University. Supported in part by an NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC.
† Stanford University. Supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC.

Abstract

The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is $\Theta(\log k)$-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.

1 Introduction

Clustering is one of the classic problems in machine learning and computational geometry. In the popular k-means formulation, one is given an integer $k$ and a set of $n$ data points in $\mathbb{R}^d$. The goal is to choose $k$ centers so as to minimize $\phi$, the sum of the squared distances between each point and its closest center.

Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it "is by far the most popular clustering algorithm used in scientific and industrial applications" [5].

Usually referred to simply as k-means, Lloyd's algorithm begins with $k$ arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These two steps (assignment and center calculation) are repeated until the process stabilizes.

One can check that the total error $\phi$ is monotonically decreasing, which ensures that no clustering is repeated during the course of the algorithm. Since there are at most $k^n$ possible clusterings, the process will always terminate. In practice, very few iterations are usually required, which makes the algorithm much faster than most of its competitors.

Unfortunately, the empirical speed and simplicity of the k-means algorithm come at the price of accuracy. There are many natural examples for which the algorithm generates arbitrarily bad clusterings (i.e., $\phi / \phi_{OPT}$ is unbounded even when $n$ and $k$ are fixed). Furthermore, these examples do not rely on an adversarial placement of the starting centers, and the ratio can be unbounded with high probability even with the standard randomized seeding technique.

In this paper, we propose a way of initializing k-means by choosing random starting centers with very specific probabilities. Specifically, we choose a point $p$ as a center with probability proportional to $p$'s contribution to the overall potential. Letting $\phi$ denote the potential after choosing centers in this way, we show the following.

Theorem 1.1. For any set of data points, $E[\phi] \le 8(\ln k + 2)\,\phi_{OPT}$.

This sampling is both fast and simple, and it already achieves approximation guarantees that k-means cannot. We propose using it to seed the initial centers for k-means, leading to a combined algorithm we call k-means++.

This complements a very recent result of Ostrovsky et al. [24], who independently proposed much the same algorithm. Whereas they showed this randomized seeding is $O(1)$-competitive on data sets following a certain separation condition, we show it is $O(\log k)$-competitive on all data sets.

We also show that the analysis for Theorem 1.1 is tight up to a constant factor, and that it can be easily extended to various potential functions in arbitrary metric spaces. In particular, we can also get a simple $O(\log k)$ approximation algorithm for the k-median objective. Furthermore, we provide preliminary experimental data showing that in practice, k-means++ really does outperform k-means in terms of both accuracy and speed, often by a substantial margin.

1.1 Related work

As a fundamental problem in machine learning, k-means has a rich history. Because of its simplicity and its observed speed, Lloyd's method [20] remains the most popular approach in practice, despite its limited accuracy. The convergence time of Lloyd's method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy.

In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of $O(n^{kd})$. Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in $k$, and are unfortunately impractical even for relatively small $n$, $k$ and $d$.

Kanungo et al. [17] proposed an $O(n^3 \epsilon^{-d})$ algorithm that is $(9+\epsilon)$-competitive. However, $n^3$ compares unfavorably with the almost linear running time of Lloyd's method, and the exponential dependence on $d$ can also be problematic. For these reasons, Kanungo et al. also suggested a way of combining their techniques with Lloyd's algorithm, but in order to avoid the exponential dependence on $d$, their approach sacrifices all approximation guarantees.

Mettu and Plaxton [22] also achieved a constant-probability $O(1)$ approximation using a technique called successive sampling. They match our running time of $O(nkd)$, but only if $k$ is sufficiently large and the spread is sufficiently small. In practice, our approach is simpler, and our experimental results seem to be better in terms of both speed and accuracy.

Very recently, Ostrovsky et al. [24] independently proposed an algorithm that is essentially identical to ours, although their analysis is quite different. Letting $\phi_{OPT,k}$ denote the optimal potential for a $k$-clustering on a given data set, they prove k-means++ is $O(1)$-competitive in the case where $\phi_{OPT,k} / \phi_{OPT,k-1} \le \epsilon^2$.
The intuition here is that if this condition does not hold, then the data is not well suited for clustering with the given value for $k$.

Combining this result with ours gives a strong characterization of the algorithm's performance. In particular, k-means++ is never worse than $O(\log k)$-competitive, and on very well formed data sets, it improves to being $O(1)$-competitive.

Overall, the seeding technique we propose is similar in spirit to that used by Meyerson [23] for online facility location, and Mishra et al. [12] and Charikar et al. [6] in the context of k-median clustering. However, our analysis is quite different from those works.

2 Preliminaries

In this section, we formally define the k-means problem, as well as the k-means and k-means++ algorithms. For the k-means problem, we are given an integer $k$ and a set of $n$ data points $X \subset \mathbb{R}^d$. We wish to choose $k$ centers $C$ so as to minimize the potential function,

$$\phi = \sum_{x \in X} \min_{c \in C} \|x - c\|^2.$$

Choosing these centers implicitly defines a clustering: for each center, we set one cluster to be the set of data points that are closer to that center than to any other. As noted above, finding an exact solution to the k-means problem is NP-hard.

Throughout the paper, we will let $C_{OPT}$ denote the optimal clustering for a given instance of the k-means problem, and we will let $\phi_{OPT}$ denote the corresponding potential. Given a clustering $C$ with potential $\phi$, we also let $\phi(A)$ denote the contribution of $A \subset X$ to the potential (i.e., $\phi(A) = \sum_{x \in A} \min_{c \in C} \|x - c\|^2$).

2.1 The k-means algorithm

The k-means method is a simple and fast algorithm that attempts to locally improve an arbitrary k-means clustering. It works as follows.

1. Arbitrarily choose $k$ initial centers $C = \{c_1, \ldots, c_k\}$.
2. For each $i \in \{1, \ldots, k\}$, set the cluster $C_i$ to be the set of points in $X$ that are closer to $c_i$ than they are to $c_j$ for all $j \ne i$.
3. For each $i \in \{1, \ldots, k\}$, set $c_i$ to be the center of mass of all points in $C_i$: $c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$.
4. Repeat Steps 2 and 3 until $C$ no longer changes.

It is standard practice to choose the initial centers uniformly at random from $X$. For Step 2, ties may be broken arbitrarily, as long as the method is consistent.

Steps 2 and 3 are both guaranteed to decrease $\phi$, so the algorithm makes local improvements to an arbitrary clustering until it is no longer possible to do so. To see that Step 3 does in fact decrease $\phi$, it is helpful to recall a standard result from linear algebra (see [14]).

Lemma 2.1. Let $S$ be a set of points with center of mass $c(S)$, and let $z$ be an arbitrary point. Then, $\sum_{x \in S} \|x - z\|^2 - \sum_{x \in S} \|x - c(S)\|^2 = |S| \cdot \|c(S) - z\|^2$.

Monotonicity for Step 3 follows from taking $S$ to be a single cluster and $z$ to be its initial center.

As discussed above, the k-means algorithm is attractive in practice because it is simple and it is generally fast. Unfortunately, it is guaranteed only to find a local optimum, which can often be quite poor.

2.2 The k-means++ algorithm

The k-means algorithm begins with an arbitrary set of cluster centers. We propose a specific way of choosing these centers. At any given time, let $D(x)$ denote the shortest distance from a data point $x$ to the closest center we have already chosen. Then, we define the following algorithm, which we call k-means++.

1a. Choose an initial center $c_1$ uniformly at random from $X$.
1b. Choose the next center $c_i$, selecting $c_i = x' \in X$ with probability $\frac{D(x')^2}{\sum_{x \in X} D(x)^2}$.
1c. Repeat Step 1b until we have chosen a total of $k$ centers.
2-4. Proceed as with the standard k-means algorithm.

We call the weighting used in Step 1b simply "$D^2$ weighting".
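The whole procedure is compact. Below is a minimal NumPy sketch of k-means++ as just defined (our own rendering for illustration, not the authors' released C++ test code [3]): kmeans_pp_seed implements Steps 1a-1c with $D^2$ weighting, lloyd implements Steps 2-4, and potential evaluates $\phi$; all names are ours.

```python
import numpy as np

def kmeans_pp_seed(X, k, rng):
    """Steps 1a-1c: choose k centers, each new one with D^2 weighting."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]              # Step 1a: uniform first center
    d2 = ((X - centers[0]) ** 2).sum(axis=1)    # D(x)^2 for every point
    for _ in range(k - 1):
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])   # Step 1b
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
    return np.array(centers)

def lloyd(X, centers, max_iter=100):
    """Steps 2-4: assign each point to its nearest center, recompute centers
    as centers of mass, and repeat until the clustering stabilizes."""
    for _ in range(max_iter):
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        new_centers = np.array([X[labels == i].mean(axis=0) if (labels == i).any()
                                else centers[i] for i in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

def potential(X, centers):
    """phi = sum over x of the squared distance to the closest center."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).min(axis=1).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
centers, _ = lloyd(X, kmeans_pp_seed(X, 5, rng))
print("phi =", potential(X, centers))
```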
3 k-means++ is O(log k)-competitive

In this section, we prove our main result.

Theorem 3.1. If $C$ is constructed with k-means++, then the corresponding potential function $\phi$ satisfies $E[\phi] \le 8(\ln k + 2)\,\phi_{OPT}$.

In fact, we prove this holds after only Step 1 of the algorithm above. Steps 2 through 4 can then only decrease $\phi$. Not surprisingly, our experiments show this local optimization is important in practice, although it is difficult to quantify this theoretically.

Our analysis consists of two parts. First, we show that k-means++ is competitive in those clusters of $C_{OPT}$ from which it chooses a center. This is easiest in the case of our first center, which is chosen uniformly at random.

Lemma 3.1. Let $A$ be an arbitrary cluster in $C_{OPT}$, and let $C$ be the clustering with just one center, which is chosen uniformly at random from $A$. Then, $E[\phi(A)] = 2\phi_{OPT}(A)$.

Proof. Let $c(A)$ denote the center of mass of the data points in $A$. By Lemma 2.1, we know that since $C_{OPT}$ is optimal, it must be using $c(A)$ as the center corresponding to the cluster $A$. Using the same lemma again, we see $E[\phi(A)]$ is given by,

$$\sum_{a' \in A} \frac{1}{|A|} \left( \sum_{a \in A} \|a - a'\|^2 \right) = \frac{1}{|A|} \sum_{a' \in A} \left( \sum_{a \in A} \|a - c(A)\|^2 + |A| \cdot \|a' - c(A)\|^2 \right) = 2 \sum_{a \in A} \|a - c(A)\|^2,$$

and the result follows.

Our next step is to prove an analog of Lemma 3.1 for the remaining centers, which are chosen with $D^2$ weighting.

Lemma 3.2. Let $A$ be an arbitrary cluster in $C_{OPT}$, and let $C$ be an arbitrary clustering. If we add a random center to $C$ from $A$, chosen with $D^2$ weighting, then $E[\phi(A)] \le 8\phi_{OPT}(A)$.

Proof. The probability that we choose some fixed $a'$ as our center, given that we are choosing our center from $A$, is precisely $\frac{D(a')^2}{\sum_{a \in A} D(a)^2}$. Furthermore, after choosing the center $a'$, a point $a$ will contribute precisely $\min(D(a), \|a - a'\|)^2$ to the potential. Therefore,

$$E[\phi(A)] = \sum_{a' \in A} \frac{D(a')^2}{\sum_{a \in A} D(a)^2} \sum_{a \in A} \min(D(a), \|a - a'\|)^2.$$

Note by the triangle inequality that $D(a') \le D(a) + \|a - a'\|$ for all $a, a'$. From this, the power-mean inequality¹ implies that $D(a')^2 \le 2D(a)^2 + 2\|a - a'\|^2$. Summing over all $a$, we then have that $D(a')^2 \le \frac{2}{|A|} \sum_{a \in A} D(a)^2 + \frac{2}{|A|} \sum_{a \in A} \|a - a'\|^2$, and hence, $E[\phi(A)]$ is at most,

$$\frac{2}{|A|} \sum_{a' \in A} \frac{\sum_{a \in A} D(a)^2}{\sum_{a \in A} D(a)^2} \cdot \sum_{a \in A} \min(D(a), \|a - a'\|)^2 \;+\; \frac{2}{|A|} \sum_{a' \in A} \frac{\sum_{a \in A} \|a - a'\|^2}{\sum_{a \in A} D(a)^2} \cdot \sum_{a \in A} \min(D(a), \|a - a'\|)^2.$$

In the first expression, we substitute $\min(D(a), \|a - a'\|)^2 \le \|a - a'\|^2$, and in the second expression, we substitute $\min(D(a), \|a - a'\|)^2 \le D(a)^2$. Simplifying, we then have,

$$E[\phi(A)] \le \frac{4}{|A|} \sum_{a' \in A} \sum_{a \in A} \|a - a'\|^2 = 8\phi_{OPT}(A).$$

The last step here follows from Lemma 3.1.

We have now shown that seeding by $D^2$ weighting is competitive as long as it chooses centers from each cluster of $C_{OPT}$, which completes the first half of our argument. We now use induction to show the total error in general is at most $O(\log k)$.

¹ The power-mean inequality states for any real numbers $a_1, \ldots, a_m$ that $\sum a_i^2 \ge \frac{1}{m}\left(\sum a_i\right)^2$. It follows from the Cauchy-Schwarz inequality. We are only using the case $m = 2$ here, but we will need the general case for Lemma 3.3.
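Lemma 3.1 is easy to verify numerically. The short check below (ours) computes $E[\phi(A)]$ exactly by averaging the cost of $A$ over every possible choice of the uniformly random center $a'$ in $A$, and compares it against $\phi_{OPT}(A)$, the cost of $A$ against its center of mass; the ratio is exactly 2, as the lemma states.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 3))             # an arbitrary cluster A
c = A.mean(axis=0)                        # its center of mass c(A)
phi_opt = ((A - c) ** 2).sum()            # phi_OPT(A)

# E[phi(A)]: average, over a uniformly random center a0 in A, of the
# total squared distance from every a in A to a0
pairwise2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
expected_phi = pairwise2.sum(axis=1).mean()

print(expected_phi / phi_opt)             # 2.0, up to floating-point rounding
```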
Lemma 3.3. Let $C$ be an arbitrary clustering. Choose $u > 0$ "uncovered" clusters from $C_{OPT}$, and let $X_u$ denote the set of points in these clusters. Also let $X_c = X - X_u$. Now suppose we add $t \le u$ random centers to $C$, chosen with $D^2$ weighting. Let $C'$ denote the resulting clustering, and let $\phi'$ denote the corresponding potential. Then, $E[\phi']$ is at most,

$$\bigl(\phi(X_c) + 8\phi_{OPT}(X_u)\bigr)(1 + H_t) + \frac{u - t}{u}\,\phi(X_u).$$

Here, $H_t$ denotes the harmonic sum, $1 + \frac{1}{2} + \cdots + \frac{1}{t}$.

Proof. We prove this by induction, showing that if the result holds for $(t-1, u)$ and $(t-1, u-1)$, then it also holds for $(t, u)$. Therefore, it suffices to check $t = 0, u > 0$ and $t = u = 1$ as our base cases.

If $t = 0$ and $u > 0$, the result follows from the fact that $1 + H_t = \frac{u-t}{u} = 1$.

Next, suppose $t = u = 1$. We choose our one new center from the one uncovered cluster with probability exactly $\frac{\phi(X_u)}{\phi}$. In this case, Lemma 3.2 guarantees that $E[\phi'] \le \phi(X_c) + 8\phi_{OPT}(X_u)$. Since $\phi' \le \phi$ even if we choose a center from a covered cluster, we have,

$$E[\phi'] \le \frac{\phi(X_u)}{\phi}\bigl(\phi(X_c) + 8\phi_{OPT}(X_u)\bigr) + \frac{\phi(X_c)}{\phi}\cdot\phi \le 2\phi(X_c) + 8\phi_{OPT}(X_u).$$

Since $1 + H_t = 2$ here, we have shown the result holds for both base cases.

We now proceed to prove the inductive step. It is convenient here to consider two cases. First suppose we choose our first center from a covered cluster. As above, this happens with probability exactly $\frac{\phi(X_c)}{\phi}$. Note that this new center can only decrease $\phi$. Bearing this in mind, apply the inductive hypothesis with the same choice of covered clusters, but with $t$ decreased by one. It follows that our contribution to $E[\phi']$ in this case is at most,

$$\frac{\phi(X_c)}{\phi}\left(\bigl(\phi(X_c) + 8\phi_{OPT}(X_u)\bigr)(1 + H_{t-1}) + \frac{u - t + 1}{u}\,\phi(X_u)\right).$$

On the other hand, suppose we choose our first center from some uncovered cluster $A$. This happens with probability $\frac{\phi(A)}{\phi}$. Let $p_a$ denote the probability that we choose $a \in A$ as our center, given the center is somewhere in $A$, and let $\phi_a$ denote $\phi(A)$ after we choose $a$ as our center. Once again, we apply our inductive hypothesis, this time adding $A$ to the set of covered clusters, as well as decreasing both $t$ and $u$ by 1. It follows that our contribution to $E[\phi']$ in this case is at most,

$$\frac{\phi(A)}{\phi}\sum_{a \in A} p_a\left(\bigl(\phi(X_c) + \phi_a + 8\phi_{OPT}(X_u) - 8\phi_{OPT}(A)\bigr)(1 + H_{t-1}) + \frac{u - t}{u - 1}\bigl(\phi(X_u) - \phi(A)\bigr)\right)$$
$$\le \frac{\phi(A)}{\phi}\left(\bigl(\phi(X_c) + 8\phi_{OPT}(X_u)\bigr)(1 + H_{t-1}) + \frac{u - t}{u - 1}\bigl(\phi(X_u) - \phi(A)\bigr)\right).$$

The last step here follows from the fact that $\sum_{a \in A} p_a \phi_a \le 8\phi_{OPT}(A)$, which is implied by Lemma 3.2.

Now, the power-mean inequality implies that $\sum_{A \subset X_u} \phi(A)^2 \ge \frac{1}{u}\phi(X_u)^2$. Therefore, if we sum over all uncovered clusters $A$, we obtain a potential contribution of at most,

$$\frac{\phi(X_u)}{\phi}\bigl(\phi(X_c) + 8\phi_{OPT}(X_u)\bigr)(1 + H_{t-1}) + \frac{1}{\phi}\cdot\frac{u - t}{u - 1}\left(\phi(X_u)^2 - \frac{1}{u}\phi(X_u)^2\right)$$
$$= \frac{\phi(X_u)}{\phi}\bigl(\phi(X_c) + 8\phi_{OPT}(X_u)\bigr)(1 + H_{t-1}) + \frac{u - t}{u}\cdot\frac{\phi(X_u)^2}{\phi}.$$

Combining the potential contribution to $E[\phi']$ from both cases, we now obtain the desired bound:

$$E[\phi'] \le \bigl(\phi(X_c) + 8\phi_{OPT}(X_u)\bigr)(1 + H_{t-1}) + \frac{u - t}{u}\,\phi(X_u) + \frac{\phi(X_c)}{\phi}\cdot\frac{\phi(X_u)}{u}$$
$$\le \bigl(\phi(X_c) + 8\phi_{OPT}(X_u)\bigr)\left(1 + H_{t-1} + \frac{1}{u}\right) + \frac{u - t}{u}\,\phi(X_u).$$

The inductive step now follows from the fact that $\frac{1}{u} \le \frac{1}{t}$.

We specialize Lemma 3.3 to obtain our main result.

Theorem 3.1. If $C$ is constructed with k-means++, then the corresponding potential function $\phi$ satisfies $E[\phi] \le 8(\ln k + 2)\,\phi_{OPT}$.

Proof. Consider the clustering $C$ after we have completed Step 1. Let $A$ denote the $C_{OPT}$ cluster in which we chose the first center. Applying Lemma 3.3 with $t = u = k - 1$ and with $A$ being the only covered cluster, we have,

$$E[\phi] \le \bigl(\phi(A) + 8\phi_{OPT} - 8\phi_{OPT}(A)\bigr)(1 + H_{k-1}).$$

The result now follows from Lemma 3.1, and from the fact that $H_{k-1} \le 1 + \ln k$.

4 A matching lower bound

In this section, we show that the $D^2$ seeding used by k-means++ is no better than $\Omega(\log k)$-competitive in expectation, thereby proving Theorem 3.1 is tight within a constant factor.

Fix $k$, and then choose $n$, $\Delta$, $\delta$ with $n \gg k$ and $\Delta \gg \delta$. We construct $X$ with $n$ points. First choose $k$ centers $c_1, c_2, \ldots, c_k$ such that $\|c_i - c_j\|^2 = \Delta^2 - \frac{n-k}{n}\,\delta^2$ for all $i \ne j$. Now, for each $c_i$, add data points $x_{i,1}, x_{i,2}, \ldots, x_{i,n/k}$ arranged in a regular simplex with center $c_i$, side length $\delta$, and radius $\sqrt{\frac{n-k}{2n}}\,\delta$. If we do this in orthogonal dimensions for each $i$, we then have,

$$\|x_{i,i'} - x_{j,j'}\| = \begin{cases} \delta & \text{if } i = j, \\ \Delta & \text{otherwise.} \end{cases}$$

We prove our seeding technique is in expectation $\Omega(\log k)$ worse than the optimal clustering in this case. Clearly, the optimal clustering has centers $\{c_i\}$, which leads to an optimal potential of $\phi_{OPT} = \frac{n-k}{2}\,\delta^2$.
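This construction can be generated concretely, which also confirms the claimed distances. The sketch below (a hypothetical helper of ours, not code from the paper) places each cluster's regular simplex in its own orthogonal block of coordinates and puts the centers $c_i$ on $k$ additional coordinates, so the pairwise point distances collapse to $\delta$ within a cluster and $\Delta$ across clusters. It assumes $k$ divides $n$ and $\Delta^2 > \frac{n-k}{n}\delta^2$.

```python
import numpy as np

def lower_bound_instance(n, k, delta, Delta):
    """Build the Section 4 instance: k clusters of n/k points, each a regular
    simplex of side delta in its own orthogonal block; assumes k divides n and
    Delta**2 > (n - k) / n * delta**2."""
    m = n // k                        # points per cluster
    X = np.zeros((k * m, k * m + k))  # one m-dim block per cluster + k dims for centers
    s = np.sqrt((Delta**2 - (n - k) / n * delta**2) / 2.0)  # gives the required ||c_i - c_j||
    t = delta / np.sqrt(2.0)          # simplex scale giving side length delta
    for i in range(k):
        for j in range(m):
            x = X[i * m + j]
            x[k * m + i] = s                    # the cluster's center c_i
            x[i * m: (i + 1) * m] -= t / m      # centre the simplex on c_i
            x[i * m + j] += t
    return X

X = lower_bound_instance(n=20, k=4, delta=1.0, Delta=10.0)
D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
print(np.unique(np.round(D, 6)))      # [ 0.  1. 10.]  i.e. {0, delta, Delta}
```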
Conversely, using an induction similar to that of Lemma 3.3, we show $D^2$ seeding cannot match this bound. As before, we bound the expected potential in terms of the number of centers left to choose and the number of uncovered clusters (those clusters of $C_{OPT}$ from which we have not chosen a center).

Lemma 4.1. Let $C$ be an arbitrary clustering on $X$ with $k - t \ge 1$ centers, but with $u$ clusters from $C_{OPT}$ uncovered. Now suppose we add $t$ random centers to $C$, chosen with $D^2$ weighting. Let $C'$ denote the resulting clustering, and let $\phi'$ denote the corresponding potential. Furthermore, let $\alpha = \frac{n - k^2}{n}$, $\beta = \frac{\Delta^2 - 2k\delta^2}{\Delta^2}$ and $H'_u = \sum_{i=1}^{u} \frac{k-i}{ki}$. Then, $E[\phi']$ is at least,

$$\alpha^{t+1}\,\beta\, n\delta^2\,(1 + H'_u) + \left(\frac{n}{k}\Delta^2 - 2n\delta^2\right)(u - t).$$

Proof. We prove this by induction on $t$. If $t = 0$, note that,

$$\phi' = \phi = \left(n - u\frac{n}{k} - k\right)\delta^2 + u\frac{n}{k}\Delta^2.$$

Since $n - u\frac{n}{k} \ge \frac{n}{k}$, we have $\frac{n - u\frac{n}{k} - k}{n - u\frac{n}{k}} \ge \frac{\frac{n}{k} - k}{\frac{n}{k}} = \alpha$. Also, $\alpha, \beta \le 1$. Therefore,

$$\phi' \ge \alpha\left(n - u\frac{n}{k}\right)\delta^2 + u\frac{n}{k}\Delta^2.$$

Finally, since $u\frac{n}{k}\delta^2 \le n\delta^2 u$ and $n\delta^2 H'_u \le n\delta^2 u$, we have,

$$\phi' \ge \alpha\beta n\delta^2(1 + H'_u) + \left(\frac{n}{k}\Delta^2 - 2n\delta^2\right)u.$$

This completes the base case. We now proceed to prove the inductive step. As with Lemma 3.3, we consider two cases. The probability that our first center is chosen from an uncovered cluster is,

$$\frac{u\frac{n}{k}\Delta^2}{u\frac{n}{k}\Delta^2 + (k-u)\frac{n}{k}\delta^2 - (k-t)\delta^2} \ge \frac{u\Delta^2}{u\Delta^2 + (k-u)\delta^2}.$$

Applying our inductive hypothesis with $t$ and $u$ both decreased by 1, we obtain a potential contribution from this case of at least,

$$\frac{u\Delta^2}{u\Delta^2 + (k-u)\delta^2}\left(\alpha^{t+1}\beta n\delta^2(1 + H'_{u-1}) + \left(\frac{n}{k}\Delta^2 - 2n\delta^2\right)(u - t)\right).$$

The probability that our first center is chosen from a covered cluster is,

$$\frac{(k-u)\frac{n}{k}\delta^2 - (k-t)\delta^2}{u\frac{n}{k}\Delta^2 + (k-u)\frac{n}{k}\delta^2 - (k-t)\delta^2} \ge \frac{(k-u)\frac{n}{k}\delta^2 - (k-t)\delta^2}{(k-u)\frac{n}{k}\delta^2}\cdot\frac{(k-u)\delta^2}{u\Delta^2 + (k-u)\delta^2} \ge \alpha\cdot\frac{(k-u)\delta^2}{u\Delta^2 + (k-u)\delta^2}.$$

Applying our inductive hypothesis with $t$ decreased by 1 but with $u$ constant, we obtain a potential contribution from this case of at least,

$$\frac{(k-u)\delta^2}{u\Delta^2 + (k-u)\delta^2}\left(\alpha^{t+1}\beta n\delta^2(1 + H'_u) + \left(\frac{n}{k}\Delta^2 - 2n\delta^2\right)(u - t + 1)\right).$$

Therefore, $E[\phi']$ is at least,

$$\alpha^{t+1}\beta n\delta^2(1 + H'_u) + \left(\frac{n}{k}\Delta^2 - 2n\delta^2\right)(u - t) + \frac{\alpha^{t+1}}{u\Delta^2 + (k-u)\delta^2}\left((k-u)\delta^2\left(\frac{n}{k}\Delta^2 - 2n\delta^2\right) - u\Delta^2\,\beta\,\bigl(H'_u - H'_{u-1}\bigr)\,n\delta^2\right).$$

However, $H'_u - H'_{u-1} = \frac{k-u}{ku}$ and $\beta = \frac{\Delta^2 - 2k\delta^2}{\Delta^2}$, so

$$u\Delta^2\,\beta\,\bigl(H'_u - H'_{u-1}\bigr)\,n\delta^2 = (k-u)\delta^2\left(\frac{n}{k}\Delta^2 - 2n\delta^2\right),$$

and the result follows.

As in the previous section, we obtain the desired result by specializing the induction.

Theorem 4.1. $D^2$ seeding is no better than $2(\ln k)$-competitive.

Proof. Suppose a clustering with potential $\phi$ is constructed using k-means++ on the set $X$ described above. Apply Lemma 4.1 with $u = t = k - 1$ after the first center has been chosen. Noting that $1 + H'_{k-1} = 1 + \sum_{i=1}^{k-1}\left(\frac{1}{i} - \frac{1}{k}\right) = H_k \ge \ln k$, we then have,

$$E[\phi] \ge \alpha^k\,\beta\, n\delta^2 \ln k.$$

Now, fix $k$ and $\delta$ but let $n$ and $\Delta$ approach infinity. Then $\alpha$ and $\beta$ both approach 1, and the result follows from the fact that $\phi_{OPT} = \frac{n-k}{2}\,\delta^2$.

5 Generalizations

Although the k-means algorithm itself applies only in vector spaces with the potential function $\phi = \sum_{x \in X} \min_{c \in C} \|x - c\|^2$, we note that our seeding technique does not have the same limitations. In this section, we discuss extending our results to arbitrary metric spaces with the more general potential function, $\phi^{[\ell]} = \sum_{x \in X} \min_{c \in C} \|x - c\|^\ell$ for $\ell \ge 1$. In particular, note that the case of $\ell = 1$ is the k-medians potential function.

These generalizations require only one change to the algorithm itself. Instead of using $D^2$ seeding, we switch to $D^\ell$ seeding, i.e., we choose $x'$ as a center with probability $\frac{D(x')^\ell}{\sum_{x \in X} D(x)^\ell}$.

For the analysis, the most important change appears in Lemma 3.1. Our original proof uses an inner product structure that is not available in the general case. However, a slightly weaker result can be proven using only the triangle inequality.

Lemma 5.1. Let $A$ be an arbitrary cluster in $C_{OPT}$, and let $C$ be the clustering with just one center, which is chosen uniformly at random from $A$. Then, $E\bigl[\phi^{[\ell]}(A)\bigr] \le 2^\ell\,\phi^{[\ell]}_{OPT}(A)$.

Proof. Let $c$ denote the center of $A$ in $C_{OPT}$. Then,

$$E\bigl[\phi^{[\ell]}(A)\bigr] = \frac{1}{|A|}\sum_{a' \in A}\sum_{a \in A}\|a - a'\|^\ell \le \frac{2^{\ell-1}}{|A|}\sum_{a' \in A}\sum_{a \in A}\left(\|a - c\|^\ell + \|a' - c\|^\ell\right) = 2^\ell\,\phi^{[\ell]}_{OPT}(A).$$

The second step here follows from the triangle inequality and the power-mean inequality.

The rest of our upper bound analysis carries through without change, except that in the proof of Lemma 3.2, we lose a factor of $2^{\ell-1}$ from the power-mean inequality, instead of just 2. Putting everything together, we obtain the general theorem.

Theorem 5.1. If $C$ is constructed with $D^\ell$ seeding, then the corresponding potential function $\phi^{[\ell]}$ satisfies, $E\bigl[\phi^{[\ell]}\bigr] \le 2^{2\ell}(\ln k + 2)\,\phi^{[\ell]}_{OPT}$.
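Relative to the $D^2$ sketch given earlier, only the sampling weights change. A minimal rendering of $D^\ell$ seeding (ours); with $\ell = 1$ it seeds the k-medians objective:

```python
import numpy as np

def d_ell_seed(X, k, ell, rng):
    """Choose k centers, each new one with probability D(x)^ell / sum D^ell."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    d = np.sqrt(((X - centers[0]) ** 2).sum(axis=1))   # D(x), here Euclidean
    for _ in range(k - 1):
        w = d ** ell                                   # D^ell weighting
        centers.append(X[rng.choice(n, p=w / w.sum())])
        d = np.minimum(d, np.sqrt(((X - centers[-1]) ** 2).sum(axis=1)))
    return np.array(centers)
```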
6 Empirical results

In order to evaluate k-means++ in practice, we have implemented and tested it in C++ [3]. In this section, we discuss the results of these preliminary experiments. We found that $D^2$ seeding substantially improves both the running time and the accuracy of k-means.

6.1 Datasets

We evaluated the performance of k-means and k-means++ on four datasets.

The first dataset, Norm25, is synthetic. To generate it, we chose 25 "true" centers uniformly at random from a 15-dimensional hypercube of side length 500. We then added points from Gaussian distributions of variance 1 around each true center. Thus, we obtained a number of well separated Gaussians with the true centers providing a good approximation to the optimal clustering.

We chose the remaining datasets from real-world examples in the UC-Irvine Machine Learning Repository. The Cloud dataset [7] consists of 1024 points in 10 dimensions, and it is Philippe Collard's first cloud cover database. The Intrusion dataset [18] consists of 494019 points in 35 dimensions, and it represents features available to an intrusion detection system. Finally, the Spam dataset [25] consists of 4601 points in 58 dimensions, and it represents features available to an e-mail spam detection system.

For each dataset, we tested k = 10, 25, and 50.

6.2 Metrics

Since we were testing randomized seeding processes, we ran 20 trials for each case. We report the minimum and the average potential (actually divided by the number of points), as well as the mean running time. Our implementations are standard with no special optimizations.

Table 1: Experimental results on the Norm25 dataset (n = 10000, d = 15). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means: 100% · (1 − k-means++ value / k-means value).

            Average φ                 Minimum φ                 Average T
  k     k-means     k-means++     k-means     k-means++     k-means   k-means++
 10    1.365·10^5     8.47%      1.174·10^5     0.93%        0.12      46.72%
 25    4.233·10^4    99.96%      1.914·10^4    99.92%        0.90      87.79%
 50    7.750·10^3    99.81%      1.474·10^1     0.53%        2.04       1.62%

Table 2: Experimental results on the Cloud dataset (n = 1024, d = 10). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.

            Average φ                 Minimum φ                 Average T
  k     k-means     k-means++     k-means     k-means++     k-means   k-means++
 10    7.921·10^3    22.33%      6.284·10^3    10.37%        0.08      51.09%
 25    3.637·10^3    42.76%      2.550·10^3    22.60%        0.11      43.21%
 50    1.867·10^3    39.01%      1.407·10^3    23.07%        0.16      41.99%

Table 3: Experimental results on the Intrusion dataset (n = 494019, d = 35). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.

            Average φ                 Minimum φ                 Average T
  k     k-means     k-means++     k-means     k-means++     k-means   k-means++
 10    3.387·10^8    93.37%      3.206·10^8    94.40%       63.94      44.49%
 25    3.149·10^8    99.20%      3.100·10^8    99.32%      257.34      49.19%
 50    3.079·10^8    99.84%      3.076·10^8    99.87%      917.00      66.70%

Table 4: Experimental results on the Spam dataset (n = 4601, d = 58). For k-means, we list the actual potential and time in seconds. For k-means++, we list the percentage improvement over k-means.

            Average φ                 Minimum φ                 Average T
  k     k-means     k-means++     k-means     k-means++     k-means   k-means++
 10    3.698·10^4    49.43%      3.684·10^4    54.59%        2.36      69.00%
 25    3.288·10^4    88.76%      3.280·10^4    89.58%        7.36      79.84%
 50    3.183·10^4    95.35%      2.384·10^4    94.30%       12.20      75.76%
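For concreteness, a small harness in the spirit of this protocol (our own sketch, reusing kmeans_pp_seed, lloyd, and potential from the Section 2 example; uniform_seed stands in for the standard seeding):

```python
import time
import numpy as np

def uniform_seed(X, k, rng):
    """Standard k-means seeding: k distinct data points chosen uniformly."""
    return X[rng.choice(len(X), size=k, replace=False)]

def evaluate(X, k, seed_fn, trials=20):
    """Run `trials` seeded k-means runs; report the minimum and average
    potential per point and the mean wall-clock time, as in Section 6.2."""
    phis, times = [], []
    for trial in range(trials):
        rng = np.random.default_rng(trial)
        start = time.perf_counter()
        centers, _ = lloyd(X, seed_fn(X, k, rng))
        times.append(time.perf_counter() - start)
        phis.append(potential(X, centers) / len(X))
    return min(phis), sum(phis) / len(phis), sum(times) / len(times)
```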
6.3 Results

The results for k-means and k-means++ are displayed in Tables 1 through 4. We list the absolute results for k-means, and the percentage improvement achieved by k-means++ (e.g., a 90% improvement in the running time is equivalent to a factor 10 speedup). We observe that k-means++ consistently outperformed k-means, both by achieving a lower potential value, in some cases by several orders of magnitude, and also by having a faster running time. The $D^2$ seeding is slightly slower than uniform seeding, but it still leads to a faster algorithm since it helps the local search converge after fewer iterations.

The synthetic example is a case where standard k-means does very badly. Even though there is an "obvious" clustering, the uniform seeding will inevitably merge some of these clusters, and the local search will never be able to split them apart (see [12] for further discussion of this phenomenon). The careful seeding method of k-means++ avoided this problem altogether, and it almost always attained the optimal clustering on the synthetic dataset.

The difference between k-means and k-means++ on the real-world datasets was also substantial. In every case, k-means++ achieved at least a 10% accuracy improvement over k-means, and it often performed much better. Indeed, on the Spam and Intrusion datasets, k-means++ achieved potentials 20 to 1000 times smaller than those achieved by standard k-means. Each trial also completed two to three times faster, and each individual trial was much more likely to achieve a good clustering.

7 Conclusion and future work

We have presented a new way to seed the k-means algorithm that is $O(\log k)$-competitive with the optimal clustering. Furthermore, our seeding technique is as fast and as simple as the k-means algorithm itself, which makes it attractive in practice. Towards that end, we ran preliminary experiments on several real-world datasets, and we observed that k-means++ substantially outperformed standard k-means in terms of both speed and accuracy.

Although our analysis of the expected potential $E[\phi]$ achieved by k-means++ is tight to within a constant factor, a few open questions still remain. Most importantly, it is standard practice to run the k-means algorithm multiple times, and then keep only the best clustering found. This raises the question of whether k-means++ achieves asymptotically better results if it is allowed several trials. For example, if k-means++ is run $2^k$ times, our arguments can be modified to show it is likely to achieve a constant approximation at least once. We ask whether a similar bound can be achieved for a smaller number of trials.

Also, experiments showed that k-means++ generally performed better if it selected several new centers during each iteration, and then greedily chose the one that decreased $\phi$ as much as possible (a sketch of this variant appears after this section). Unfortunately, our proofs do not carry over to this scenario. It would be interesting to see a comparable (or better) asymptotic result proven here.

Finally, we are currently working on a more thorough experimental analysis. In particular, we are measuring the performance of not only k-means++ and standard k-means, but also other variants that have been suggested in the theory community.
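The greedy variant mentioned in the conclusion admits a short sketch (ours; the number of candidates drawn per step is a free parameter, not one the paper fixes):

```python
import numpy as np

def greedy_kmeans_pp_seed(X, k, rng, num_candidates=5):
    """Like D^2 seeding, but each step draws several candidate centers and
    greedily keeps the one whose addition decreases the potential the most."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        candidates = rng.choice(n, size=num_candidates, p=d2 / d2.sum())
        best = min(candidates,
                   key=lambda c: np.minimum(d2, ((X - X[c]) ** 2).sum(axis=1)).sum())
        centers.append(X[best])
        d2 = np.minimum(d2, ((X - X[best]) ** 2).sum(axis=1))
    return np.array(centers)
```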
Acknowledgements. We would like to thank Rajeev Motwani for his helpful comments.

References

[1] Pankaj K. Agarwal and Nabil H. Mustafa. k-means projective clustering. In PODS '04: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 155-165, New York, NY, USA, 2004. ACM Press.
[2] D. Arthur and S. Vassilvitskii. Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method. In Symposium on Foundations of Computer Science, 2006.
[3] David Arthur and Sergei Vassilvitskii. k-means++ test code. http://www.stanford.edu/~darthur/kMeansppTest.zip.
[4] David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In SCG '06: Proceedings of the twenty-second annual Symposium on Computational Geometry. ACM Press, 2006.
[5] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[6] Moses Charikar, Liadan O'Callaghan, and Rina Panigrahy. Better streaming algorithms for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing, pages 30-39, New York, NY, USA, 2003. ACM Press.
[7] Philippe Collard's cloud cover database. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/undocumented/taylor/cloud.data.
[8] Sanjoy Dasgupta. How fast is k-means? In Bernhard Scholkopf and Manfred K. Warmuth, editors, COLT, volume 2777 of Lecture Notes in Computer Science, page 735. Springer, 2003.
[9] W. Fernandez de la Vega, Marek Karpinski, Claire Kenyon, and Yuval Rabani. Approximation schemes for clustering problems. In STOC '03: Proceedings of the thirty-fifth annual ACM Symposium on Theory of Computing, pages 50-58, New York, NY, USA, 2003. ACM Press.
[10] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the singular value decomposition. Machine Learning, 56(1-3):9-33, 2004.
[11] Frederic Gibou and Ronald Fedkiw. A fast hybrid k-means level set algorithm for segmentation. In 4th Annual Hawaii International Conference on Statistics and Mathematics, pages 281-291, 2005.
[12] Sudipto Guha, Adam Meyerson, Nina Mishra, Rajeev Motwani, and Liadan O'Callaghan. Clustering data streams: Theory and practice. IEEE Transactions on Knowledge and Data Engineering, 15(3):515-528, 2003.
[13] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In STOC '04: Proceedings of the thirty-sixth annual ACM Symposium on Theory of Computing, pages 291-300, New York, NY, USA, 2004. ACM Press.
[14] Sariel Har-Peled and Bardia Sadri. How fast is the k-means method? In SODA '05: Proceedings of the sixteenth annual ACM-SIAM Symposium on Discrete Algorithms, pages 877-885, Philadelphia, PA, USA, 2005. Society for Industrial and Applied Mathematics.
[15] R. Herwig, A. J. Poustka, C. Muller, C. Bull, H. Lehrach, and J. O'Brien. Large-scale clustering of cDNA-fingerprinting data. Genome Research, 9:1093-1105, 1999.
[16] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). In SCG '94: Proceedings of the tenth annual Symposium on Computational Geometry, pages 332-339, New York, NY, USA, 1994. ACM Press.
[17] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Computational Geometry, 28(2-3):89-112, 2004.
[18] KDD Cup 1999 dataset. http://kdd.ics.uci.edu//databases/kddcup99/kddcup99.html.
[19] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. A simple linear time (1+ε)-approximation algorithm for k-means clustering in any dimensions. In FOCS '04: Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pages 454-462, Washington, DC, USA, 2004. IEEE Computer Society.
[20] Stuart P. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-136, 1982.
[21] Jiri Matousek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1):61-84, 2000.
[22] Ramgopal R. Mettu and C. Greg Plaxton. Optimal time bounds for approximate clustering. In Adnan Darwiche and Nir Friedman, editors, UAI, pages 344-351. Morgan Kaufmann, 2002.
[23] A. Meyerson. Online facility location. In FOCS '01: Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, page 426, Washington, DC, USA, 2001. IEEE Computer Society.
[24] R. Ostrovsky, Y. Rabani, L. Schulman, and C. Swamy. The effectiveness of Lloyd-type methods for the k-means problem. In Symposium on Foundations of Computer Science, 2006.
[25] Spam e-mail database. http://www.ics.uci.edu/~mlearn/databases/spambase/.