Scalable K-Means++

Bahman Bahmani (Stanford University) · Benjamin Moseley (University of Illinois, Urbana, IL, bmosele2@illinois.edu) · Andrea Vattani (University of California, San Diego, CA, avattani@cs.ucsd.edu) · Ravi Kumar (Yahoo! Research, Sunnyvale, CA, ravikumar@yahoo-inc.com) · Sergei Vassilvitskii (Yahoo! Research)
leads to an O(log k) approximation of the optimum [5], or a constant approximation if the data is known to be well-clusterable [30]. The experimental evaluation of k-means++ initialization and the variants that followed [1, 2, 15] demonstrated that correctly initializing Lloyd's iteration is crucial if one were to obtain a good solution not only in theory, but also in practice. On a variety of datasets, k-means++ initialization obtained order of magnitude improvements over the random initialization.

The downside of the k-means++ initialization is its inherently sequential nature. Although its total running time of O(nkd), when looking for a k-clustering of n points in R^d, is the same as that of a single Lloyd's iteration, it is not apparently parallelizable. The probability with which a point is chosen to be the ith center depends critically on the realization of the previous i-1 centers (it is the previous choices that determine which points are far away in the current solution). A naive implementation of k-means++ initialization will make k passes over the data in order to produce the initial centers.

This fact is exacerbated in the massive data scenario. First, as datasets grow, so does the number of classes into which one wishes to partition the data. For example, clustering millions of points into k = 100 or k = 1000 is typical, but a k-means++ initialization would be very slow in these cases. This slowdown is even more detrimental when the rest of the algorithm (i.e., Lloyd's iterations) can be implemented in a parallel environment like MapReduce [13]. For many applications it is desirable to have an initialization algorithm with guarantees similar to k-means++ that can be efficiently parallelized.

1.1 Our contributions
In this work we obtain a parallel version of the k-means++ initialization algorithm and empirically demonstrate its practical effectiveness. The main idea is that instead of sampling a single point in each pass of the k-means++ algorithm, we sample O(k) points in each round and repeat the process for approximately O(log n) rounds. At the end of the algorithm we are left with O(k log n) points that form a solution that is within a constant factor away from the optimum. We then recluster these O(k log n) points into k initial centers for the Lloyd's iteration. This initialization algorithm, which we call k-means||, is quite simple and lends itself to easy parallel implementations. However, the analysis of the algorithm turns out to be highly non-trivial, requiring new insights, and is quite different from the analysis of k-means++.

We then evaluate the performance of this algorithm on real-world datasets. Our key observations in the experiments are:
- O(log n) iterations are not necessary: after as little as five rounds, the solution of k-means|| is consistently as good as or better than that found by any other method.
- The parallel implementation of k-means|| is much faster than existing parallel algorithms for k-means.
- The number of iterations until Lloyd's algorithm converges is smallest when using k-means|| as the seed.

2. RELATED WORK
Clustering problems have been frequent and important objects of study for the past many years by data management and data mining researchers.¹ A thorough review of the clustering literature, even restricted to the work in the database area, is far beyond the scope of this paper; the readers are referred to the plethora of surveys available [8, 10, 25, 21, 19]. Below, we only discuss the highlights directly relevant to our work.

Recall that we are concerned with k-partition clustering: given a set of n points in Euclidean space and an integer k, find a partition of these points into k subsets, each with a representative, also known as a center. There are three common formulations of k-partition clustering depending on the particular objective used: k-center, where the objective is to minimize the maximum distance between a point and its nearest cluster center; k-median, where the objective is to minimize the sum of these distances; and k-means, where the objective is to minimize the sum of squares of these distances. All three of these problems are NP-hard, but a constant factor approximation is known for them.
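Since both the k-means objective just defined and the sequential k-means++ seeding discussed in the introduction recur throughout the paper, the following is a minimal NumPy sketch (ours, not the authors' code; the function names and interface are illustrative). It makes explicit that k-means++ needs one full pass over the data per chosen center, which is exactly the bottleneck the parallel algorithm of this paper is designed to remove.

    import numpy as np

    def kmeans_cost(X, C):
        """phi_X(C): sum over all points of the squared distance to the nearest center."""
        return (((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(1)).sum()

    def kmeanspp_seed(X, k, rng=np.random.default_rng(0)):
        """Sequential k-means++ seeding: one pass over the data per chosen center."""
        n = X.shape[0]
        centers = [X[rng.integers(n)]]                 # first center: uniform at random
        d2 = np.sum((X - centers[0]) ** 2, axis=1)     # squared distance to nearest chosen center
        for _ in range(1, k):
            probs = d2 / d2.sum()                      # D^2 weighting
            idx = rng.choice(n, p=probs)               # sample the next center
            centers.append(X[idx])
            d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
        return np.array(centers)

The loop body is cheap, but each of the k iterations must see the entire dataset before the next center can be drawn, which is why the initialization is hard to parallelize directly.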
The k-means algorithms have been extensively studied from database and data management points of view; we discuss some of them. Ordonez and Omiecinski [29] studied efficient disk-based implementations of k-means, taking into account the requirements of a relational DBMS. Ordonez [28] studied SQL implementations of k-means to better integrate it with a relational DBMS. The scalability issues in k-means are addressed by Farnstrom et al. [16], who used compression-based techniques of Bradley et al. [9] to obtain a single-pass algorithm. Their emphasis is to initialize k-means in the usual manner, but instead improve the performance of Lloyd's iteration.

The k-means algorithm has also been considered in parallel and other settings; the literature on this topic is extensive. Dhillon and Modha [14] considered k-means in the message-passing model, focusing on the speedup and scalability issues in this model. Several papers have studied k-means with outliers; see, for example, [22] and the references in [18]. Das et al. [12] showed how to implement EM (a generalization of k-means) in MapReduce; see also [36], which used similar tricks to speed up k-means. Sculley [31] presented modifications to k-means for batch optimizations and to take data sparsity into account. None of these papers focuses on doing a non-trivial initialization. More recently, Ene et al. [15] considered the k-median problem in MapReduce and gave a constant-round algorithm that achieves a constant approximation.

The k-means algorithms have also been studied from theoretical and algorithmic points of view. Kanungo et al. [23] proposed a local search algorithm for k-means with a running time of O(n³d) and an approximation factor of 9+ε. Although the running time is only cubic in the worst case, even in practice the algorithm exhibits slow convergence to the optimal solution. Kumar, Sabharwal, and Sen [26] obtained a (1+ε)-approximation algorithm with a running time linear in n and d but exponential in k and 1/ε. Ostrovsky et al. [30] presented a simple algorithm for finding an initial set of clusters for Lloyd's iteration and showed that under some data separability assumptions, the algorithm obtains a constant factor approximation.

¹ A paper on database clustering [35] won the 2006 SIGMOD Test of Time Award.

3.3 Our initialization algorithm: k-means||
In this section we present k-means||, our parallel version for initializing the centers. While our algorithm is largely inspired by k-means++, it uses an oversampling factor ℓ = Ω(k), which is unlike k-means++; intuitively, ℓ should be thought of as Θ(k). Our algorithm picks an initial center (say, uniformly at random) and computes ψ, the initial cost of the clustering after this selection. It then proceeds in log ψ iterations, where in each iteration, given the current set C of centers, it samples each point x with probability ℓ·d²(x, C)/φ_X(C); here d(x, C) denotes the distance from x to the nearest center in C and φ_X(C) = Σ_{x∈X} d²(x, C) denotes the cost of the current clustering. The sampled points are then added to C, the quantity φ_X(C) is updated, and the iteration continues.

Algorithm 2: k-means||(k, ℓ) initialization
1: C ← sample a point uniformly at random from X
2: ψ ← φ_X(C)
3: for O(log ψ) times do
4:   C′ ← sample each point x ∈ X independently with probability p_x = ℓ·d²(x, C)/φ_X(C)
5:   C ← C ∪ C′
6: end for
7: For x ∈ C, set w_x to be the number of points in X closer to x than to any other point in C
8: Recluster the weighted points in C into k clusters

As we will see later, the expected number of points chosen in each iteration is ℓ and, at the end, the expected number of points in C is ℓ·log ψ, which is typically more than k. To reduce the number of centers, Step 7 assigns weights to the points in C and Step 8 reclusters these weighted points to obtain k centers. The details are presented in Algorithm 2.

Notice that the size of C is significantly smaller than the input size; the reclustering can therefore be done quickly. For instance, in MapReduce, since the number of centers is small, they can all be assigned to a single machine and any provable approximation algorithm (such as k-means++) can be used to cluster the points to obtain k centers. A MapReduce implementation of Algorithm 2 is discussed in Section 3.5.

While our algorithm is very simple and lends itself to a natural parallel implementation (in log ψ rounds²), the challenging part is to show that it has provable guarantees. Note that ψ ≤ n²Δ², where Δ is the maximum distance between a pair of points in X. We now state our formal guarantee about this algorithm.

² In practice, our experimental results in Section 5 show that only a few rounds are enough to reach a good solution.

Theorem 1. If an α-approximation algorithm is used in Step 8, then Algorithm k-means|| obtains a solution that is an O(α)-approximation to k-means. Thus, if k-means++ initialization is used in Step 8, then k-means|| is an O(log k)-approximation.

In Section 3.4 we give an intuitive explanation of why the algorithm works; we defer the full proof to Section 6.

3.4 A glimpse of the analysis
In this section, we present the intuition behind the proof of Theorem 1. Consider a cluster A present in the optimum k-means solution, denote |A| = T, and sort the points in A in increasing order of their distance to centroid(A): let the ordering be a_1, …, a_T. Let q_t be the probability that a_t is the first point in the ordering chosen by k-means||, and let q_{T+1} be the probability that no point is sampled from cluster A. Letting p_t denote the probability of selecting a_t, we have, by definition of the algorithm, p_t = ℓ·d²(a_t, C)/φ_X(C). Also, since k-means|| picks each point independently, for any 1 ≤ t ≤ T we have q_t = p_t·∏_{j=1}^{t-1}(1 − p_j), and q_{T+1} = 1 − Σ_{t=1}^{T} q_t.

If a_t is the first point in A (w.r.t. the ordering) sampled as a new center, we can either assign all the points in A to a_t, or just stick with the current clustering of A. Hence, letting

  s_t = min(φ_A, Σ_{a∈A} ||a − a_t||²),

where φ_A = φ_A(C) = Σ_{a∈A} d²(a, C) is the current cost of A, we have

  E[φ_A(C ∪ C′)] ≤ Σ_{t=1}^{T} q_t·s_t + q_{T+1}·φ_A(C).

Now, we do a mean-field analysis, in which we assume all p_t's (1 ≤ t ≤ T) to be equal to some value p. Geometrically speaking, this corresponds to the case where all the points in A are very far from the current clustering (and are also rather tightly clustered, so that all d(a_t, C)'s, 1 ≤ t ≤ T, are equal). In this case, we have q_t = p(1 − p)^{t-1}, and hence {q_t}_{1≤t≤T} is a monotone decreasing sequence. By the ordering on the a_t's, letting

  s′_t = Σ_{a∈A} ||a − a_t||²,

we have that {s′_t}_{1≤t≤T} is an increasing sequence. Therefore

  Σ_{t=1}^{T} q_t·s_t ≤ Σ_{t=1}^{T} q_t·s′_t ≤ (1/T)·(Σ_{t=1}^{T} q_t)·(Σ_{t=1}^{T} s′_t),

where the last inequality, an instance of Chebyshev's sum inequality [20], uses the opposite monotonicity of the sequences {q_t}_{1≤t≤T} and {s′_t}_{1≤t≤T}. It is easy to see that (1/T)·Σ_{t=1}^{T} s′_t = 2φ*_A, where φ*_A = Σ_{a∈A} ||a − centroid(A)||² is the cost of A in the optimal solution (indeed, Σ_{a∈A} ||a − a_t||² = φ*_A + T·||a_t − centroid(A)||², so summing over t gives 2T·φ*_A). Therefore,

  E[φ_A(C ∪ C′)] ≤ (1 − q_{T+1})·2φ*_A + q_{T+1}·φ_A(C).

This shows that in each iteration of k-means||, for each optimal cluster A, we remove a fraction of φ_A and replace it with a constant factor times φ*_A. Thus, Steps 1–6 of k-means|| obtain a constant factor approximation to k-means after O(log ψ) rounds and return O(ℓ·log ψ) centers. The algorithm obtains a solution of size k by clustering the chosen centers using a known algorithm. Section 6 contains the formal arguments that work for the general case when the p_t's are not necessarily the same.
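As an illustration only (not the authors' implementation), the following NumPy sketch mirrors Steps 1–7 of Algorithm 2. The function name, the fixed number of rounds, the clipping of the sampling probability at 1, and the choice to leave Step 8 (the weighted reclustering) to the caller are our own assumptions.

    import numpy as np

    def kmeans_parallel_init(X, ell, rounds, rng=np.random.default_rng(0)):
        """Sketch of Steps 1-7 of k-means||: returns candidate centers and their weights."""
        n = X.shape[0]
        C = X[rng.integers(n)][None, :]                                    # Step 1: one uniform center
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(1)         # d^2(x, C) for every x
        for _ in range(rounds):                                            # O(log psi) in the paper; a few suffice in practice
            cost = d2.sum()                                                # phi_X(C)
            probs = np.minimum(1.0, ell * d2 / cost)                       # Step 4: independent sampling (clipped at 1)
            C_new = X[rng.random(n) < probs]
            if len(C_new):
                C = np.vstack([C, C_new])                                  # Step 5
                d2 = np.minimum(d2, ((X[:, None, :] - C_new[None, :, :]) ** 2).sum(-1).min(1))
        nearest = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1) # Step 7: weight by cluster size
        weights = np.bincount(nearest, minlength=len(C))
        return C, weights   # Step 8: recluster (C, weights) into k centers, e.g. with a weighted k-means++

In a distributed run, the per-point distance updates and the independent sampling inside the loop are the pieces that parallelize; Section 3.5 describes the MapReduce version.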
3.5 A parallel implementation
In this section we discuss a parallel implementation of k-means|| in the MapReduce model of computation. We assume familiarity with the MapReduce model and refer the reader to [13] for further details. As we mentioned earlier, Lloyd's iterations can be easily parallelized in MapReduce and hence, we only focus on Steps 1–7 of Algorithm 2. Step 4 is very simple in MapReduce: each mapper can sample independently, and Step 7 is equally simple given a set C of centers. Given a (small) set C of centers, computing φ_X(C) is also easy: each mapper working on an input partition X′ ⊆ X can compute φ_{X′}(C), and the reducer can simply add up these partial costs.

                              R=1            R=10           R=100
                              seed   final   seed   final   seed   final
    Random                    -      14      -      201     -      23,337
    k-means++                 23     14      62     31      30     15
    k-means|| (ℓ=k/2, r=5)    21     14      36     28      23     15
    k-means|| (ℓ=2k, r=5)     17     14      27     25      16     15

Table 1: The median cost (over 11 runs) on GaussMixture with k = 50, scaled down by 10^4. We show both the cost after the initialization step (seed) and the final cost after Lloyd's iterations (final).

…round of Õ(k^{3/2}·√n).) Note that this implies that the running time of Partition does not improve when the number of available machines surpasses a certain threshold. On the other hand, the running time of k-means|| improves linearly with the number of available machines (as discussed in Section 2, such issues were considered in [14]). Finally, notice that using this optimal setting, the expected size of the intermediate set used by Partition is 3√(nk log k), which is much larger than that obtained by k-means||. For instance, Table 5 shows that the size of the coreset returned by k-means|| is smaller by three orders of magnitude.

5. EXPERIMENTAL RESULTS
In this section we describe the experimental results based on the setup in Section 4. We present experiments on both sequential and parallel implementations of k-means||. Recall that the main merits of k-means|| were stated in Theorem 1: (i) k-means|| obtains a solution whose clustering cost is on par with k-means++ and hence is expected to be much better than Random, and (ii) k-means|| runs in a smaller number of rounds than k-means++, which translates into a faster running time, especially in the parallel implementation. The goal of our experiments is to demonstrate these improvements on massive, real-world datasets.

5.1 Clustering cost
To evaluate the clustering cost of k-means||, we compare it against the baseline approaches. Spam and GaussMixture are small enough to be evaluated on a single machine, and we compare their cost to that of k-means|| for moderate values of k ∈ {20, 50, 100}. We note that for k ≤ 50, the centers selected by Partition before reclustering represent the full dataset (as 3√(nk log k) ≥ n for these datasets), which means that the results of Partition would be identical to those of k-means++. Hence, in this case, we only compare k-means|| with k-means++ and Random. KDDCup1999 is sufficiently large that for large values of k ∈ {500, 1000}, k-means++ is extremely slow when run on a single machine. Hence, in this case, we only compare the parallel implementation of k-means|| with Partition and Random.

We present the results for GaussMixture in Table 1 and for Spam in Table 2. For each algorithm we list the cost of the solution both at the end of the initialization step, before any Lloyd's iteration, and the final cost. We present two parameter settings for k-means||; we explore the effect of the parameters on the performance of the algorithm in Section 5.3.

                              k=20           k=50           k=100
                              seed   final   seed   final   seed   final
    Random                    -      1,528   -      1,488   -      1,384
    k-means++                 460    233     110    68      40     24
    k-means|| (ℓ=k/2, r=5)    310    241     82     65      29     23
    k-means|| (ℓ=2k, r=5)     260    234     69     66      24     24

Table 2: The median cost (over 11 runs) on Spam, scaled down by 10^5. We show both the cost after the initialization step (seed) and the final cost after Lloyd's iterations (final).
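Returning to the parallel implementation of Section 3.5: as a purely schematic illustration (plain Python functions standing in for mappers and a reducer, not an actual MapReduce job; all names are ours), one round's cost computation and independent sampling can be split over input partitions as follows.

    import numpy as np

    def map_partial_cost(X_part, C):
        """Mapper: partial cost phi_{X'}(C) for one input partition X'."""
        d2 = ((X_part[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(1)
        return d2.sum()

    def map_sample_round(X_part, C, ell, total_cost, rng):
        """Mapper: one k-means|| sampling round on a partition, given the global cost."""
        d2 = ((X_part[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(1)
        keep = rng.random(len(X_part)) < np.minimum(1.0, ell * d2 / total_cost)
        return X_part[keep]

    def one_round(partitions, C, ell, rng=np.random.default_rng(0)):
        """Driver for one round: reduce (sum) the partial costs, then sample per partition."""
        total_cost = sum(map_partial_cost(P, C) for P in partitions)       # "reduce" step
        sampled = [map_sample_round(P, C, ell, total_cost, rng) for P in partitions]
        return np.vstack([C] + [s for s in sampled if len(s)])

Only the small set C and the scalar total cost need to be shared across machines between rounds, which is why the per-round communication is modest.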
We note that the initialization cost of k-means|| is typically lower than that of k-means++. This suggests that the centers produced by k-means|| avoid outliers, i.e., points that "confuse" k-means++. This improvement persists, although it is not as pronounced, if we look at the final cost of the clustering. In Table 3 we present the results for KDDCup1999. It is clear that both k-means|| and Partition outperform Random by orders of magnitude. The overall cost for k-means|| improves with larger values of ℓ and surpasses that of Partition for ℓ ≥ k.

                          k=500       k=1000
    Random                6.8×10^7    6.4×10^7
    Partition             7.3         1.9
    k-means||, ℓ=0.1k     5.1         1.5
    k-means||, ℓ=0.5k     19          5.2
    k-means||, ℓ=k        7.7         2.0
    k-means||, ℓ=2k       5.2         1.5
    k-means||, ℓ=10k      5.8         1.6

Table 3: Clustering cost (scaled down by 10^10) for KDDCup1999 for r = 5.

5.2 Running time
We now show that k-means|| is faster than Random and Partition when implemented to run in parallel. Recall that the running time of k-means|| consists of two components: the time required to generate the initial solution and the running time of Lloyd's iterations to convergence. The former is proportional to both the number of passes through the data and the size of the intermediate solution.

We first turn our attention to the running time of the initialization routine. It is clear that the number r of rounds used by k-means|| is much smaller than that used by k-means++. We therefore focus on the parallel implementation and compare k-means|| against Partition and Random. In Table 4 we show the total running time of these algorithms. For various settings of ℓ, k-means|| runs much faster than Random and Partition.

                          k=500    k=1000
    Random                300.0    489.4
    Partition             420.2    1,021.7
    k-means||, ℓ=0.1k     230.2    222.6
    k-means||, ℓ=0.5k     69.0     46.2
    k-means||, ℓ=k        75.6     89.1
    k-means||, ℓ=2k       69.8     86.7
    k-means||, ℓ=10k      75.7     101.0

Table 4: Time (in minutes) for KDDCup1999.

Figure 5.1: The effect of different values of ℓ and the number of rounds r on the final cost of the algorithm for a 10% sample of KDDCup1999. Each data point is the median of 11 runs of the algorithm.

By the induction hypothesis on E[ψ^(i)], we have

  E[ψ^(i+1)] ≤ ((1+α)/2)^{i+1}·ψ + 8φ*·(1 + (1+α)/(1−α)) = ((1+α)/2)^{i+1}·ψ + 16φ*/(1−α).

Corollary 3 implies that after O(log ψ) rounds, the cost of the clustering is O(φ*); Theorem 1 is then an immediate consequence.

We now proceed to establish Theorem 2. Consider any cluster A with centroid(A) in the optimal solution. Denote |A| = T and let a_1, …, a_T be the points in A sorted increasingly with respect to their distance to centroid(A). Let C′ denote the set of centers that are selected during a particular iteration. For 1 ≤ t ≤ T, we let q_t = Pr[a_t ∈ C′ and a_j ∉ C′ for all 1 ≤ j < t] be the probability that the first t−1 points {a_1, …, a_{t−1}} are not sampled during this iteration and a_t is sampled. Also, we denote by q_{T+1} the probability that no point is sampled from cluster A. Furthermore, for the remainder of this section, let D(a) = d(a, C), where C is the set of centers in the current iteration. Letting p_t denote the probability of selecting a_t, we have, by definition of the algorithm, p_t = ℓ·D²(a_t)/φ, where φ = φ_X(C). Since k-means|| picks each point independently, using the convention that p_{T+1} = 1, we have for all 1 ≤ t ≤ T+1,

  q_t = p_t·∏_{j=1}^{t−1}(1 − p_j).
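As a quick sanity check on these definitions (an illustrative snippet, not part of the paper), the following simulates independent sampling with an arbitrary toy vector of probabilities p_t and confirms numerically that the probability that a_t is the first sampled point matches q_t = p_t·∏_{j<t}(1 − p_j), with q_{T+1} = 1 − Σ_t q_t the probability that nothing is sampled.

    import numpy as np

    rng = np.random.default_rng(0)
    p = np.array([0.30, 0.20, 0.10, 0.05])            # toy sampling probabilities p_1..p_T
    T = len(p)

    # Closed-form q_t = p_t * prod_{j<t}(1 - p_j), and q_{T+1} = 1 - sum_t q_t.
    q = p * np.concatenate(([1.0], np.cumprod(1 - p)[:-1]))
    q_none = 1.0 - q.sum()

    # Monte Carlo estimate of the same quantities under independent sampling.
    trials = 200_000
    draws = rng.random((trials, T)) < p               # each a_t kept independently with prob p_t
    first = np.where(draws.any(1), draws.argmax(1), T)
    emp = np.bincount(first, minlength=T + 1) / trials

    print(np.round(q, 4), round(q_none, 4))           # exact values
    print(np.round(emp, 4))                           # empirical estimates (should be close)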
Figure 5.2: The cost of k-means|| followed by Lloyd's iterations as a function of the number of initialization rounds for GaussMixture.

The main idea behind the proof is to consider only those clusters in the optimal solution that have significant cost relative to the total clustering cost. For each of these clusters, the idea is to first express both its clustering cost and the probability that an early point is not selected as linear functions of the q_t's (Lemmas 4, 5), and then appeal to linear programming (LP) duality in order to bound the clustering cost itself (Lemma 6 and Corollary 7). To formalize this idea, we start by defining

  s_t = min(φ_A, Σ_{a∈A} ||a − a_t||²)

for all 1 ≤ t ≤ T, and s_{T+1} = φ_A. Then, letting φ′_A = φ_A(C ∪ C′) be the clustering cost of cluster A after the current round of the algorithm, we have the following.

Lemma 4. The expected cost of clustering an optimum cluster A after a round of Algorithm 2 is bounded as

  E[φ′_A] ≤ Σ_{t=1}^{T+1} q_t·s_t.    (6.1)

Proof. We can rewrite the expectation of the clustering cost φ′_A for cluster A after one round of the algorithm as follows:

  E[φ′_A] = Σ_{t=1}^{T} q_t·E[φ′_A | a_t ∈ C′, a_j ∉ C′ ∀ 1 ≤ j < t] + q_{T+1}·φ_A.    (6.2)

Observe that conditioned on the fact that a_t ∈ C′, we can either assign all the points in A to center a_t, or just stick with the former clustering of A, whichever has a smaller cost. Hence

  E[φ′_A | a_t ∈ C′, a_j ∉ C′ ∀ 1 ≤ j < t] ≤ s_t,

and since s_{T+1} = φ_A, the result follows from (6.2). □

In order to minimize the right hand side of (6.1), we want to be sure that the sampling done by the algorithm places a lot of weight on q_t for small values of t. Intuitively, this means that we are more likely to select a point close to the optimal center of the cluster than one further away. Our sampling based on D²(·) implies a constraint on the probability that an early point is not selected, which we detail below.

Lemma 5. Let η_0 = 1 and, for any 1 ≤ t ≤ T, let

  η_t = ∏_{j=1}^{t} (1 − (D²(a_j)/φ_A)·(1 − q_{T+1})).

Then, for any 0 ≤ t ≤ T,

  Σ_{r=t+1}^{T+1} q_r ≤ η_t.

Proof. First note that q_{T+1} = ∏_{t=1}^{T}(1 − p_t) ≥ 1 − Σ_{t=1}^{T} p_t. Therefore,

  1 − q_{T+1} ≤ Σ_{t=1}^{T} p_t = ℓ·φ_A/φ.

Thus

  p_t = ℓ·D²(a_t)/φ ≥ (D²(a_t)/φ_A)·(1 − q_{T+1}).

To prove the lemma, by the definition of q_r we have

  Σ_{r=t+1}^{T+1} q_r = (∏_{j=1}^{t}(1 − p_j)) · Σ_{r=t+1}^{T+1} (∏_{j=t+1}^{r−1}(1 − p_j))·p_r
                      ≤ ∏_{j=1}^{t}(1 − p_j)
                      ≤ ∏_{j=1}^{t}(1 − (D²(a_j)/φ_A)·(1 − q_{T+1})) = η_t. □

Having proved this lemma, we now slightly change our perspective and think of the values q_t (1 ≤ t ≤ T+1) as variables that (by Lemma 5) satisfy a number of linear constraints and (by Lemma 4) define a linear function that bounds E[φ′_A]. This naturally leads to an LP on these variables to get an upper bound on E[φ′_A]; see Figure 6.1. We will then use the properties of the LP and its dual to prove the following lemma.

Lemma 6. The expected potential of an optimal cluster A after a sampling step in Algorithm 2 is bounded as

  E[φ′_A] ≤ (1 − q_{T+1})·Σ_{t=1}^{T} (D²(a_t)/φ_A)·s_t + η_T·φ_A.

Proof. Since the points in A are sorted increasingly with respect to their distances to the centroid, letting

  s′_t = Σ_{a∈A} ||a − a_t||²    (6.3)

for 1 ≤ t ≤ T, we have that s′_1 ≤ ⋯ ≤ s′_T. Hence, since s_t = min{φ_A, s′_t}, we also have s_1 ≤ ⋯ ≤ s_T ≤ s_{T+1}. Now consider the LP in Figure 6.1 and its dual. Since {s_t} is an increasing sequence, the optimal solution to the dual must set the dual variable corresponding to the t-th constraint to s_{t+1} − s_t (letting s_0 = 0). Then, we can