Scalable K-Means++

Bahman Bahmani, Stanford University, Stanford, CA (bahman@stanford.edu)
Benjamin Moseley, University of Illinois, Urbana, IL (bmosele2@illinois.edu)
Andrea Vattani, University of California, San Diego, CA (avattani@cs.ucsd.edu)
Ravi Kumar, Yahoo! Research, Sunnyvale, CA (ravikumar@yahoo-inc.com)
Sergei Vassilvitskii, Yahoo! Research

leads to an O(log k) approximation of the optimum [5], or a constant approximation if the data is known to be well-clusterable [30]. The experimental evaluation of k-means++ initialization, and of the variants that followed [1, 2, 15], demonstrated that correctly initializing Lloyd's iteration is crucial if one were to obtain a good solution not only in theory, but also in practice. On a variety of datasets, k-means++ initialization obtained order-of-magnitude improvements over random initialization.

The downside of k-means++ initialization is its inherently sequential nature. Although its total running time of O(nkd), when looking for a k-clustering of n points in R^d, is the same as that of a single Lloyd's iteration, it is not apparently parallelizable. The probability with which a point is chosen to be the i-th center depends critically on the realization of the previous i−1 centers (it is the previous choices that determine which points are far away in the current solution). A naive implementation of k-means++ initialization will make k passes over the data in order to produce the initial centers.

This fact is exacerbated in the massive data scenario. First, as datasets grow, so does the number of classes into which one wishes to partition the data. For example, clustering millions of points into k = 100 or k = 1000 is typical, but a k-means++ initialization would be very slow in these cases. This slowdown is even more detrimental when the rest of the algorithm (i.e., Lloyd's iterations) can be implemented in a parallel environment like MapReduce [13]. For many applications it is desirable to have an initialization algorithm with similar guarantees to k-means++ that can also be efficiently parallelized.

1.1 Our contributions

In this work we obtain a parallel version of the k-means++ initialization algorithm and empirically demonstrate its practical effectiveness. The main idea is that instead of sampling a single point in each pass of the k-means++ algorithm, we sample O(k) points in each round and repeat the process for approximately O(log n) rounds. At the end of the algorithm we are left with O(k log n) points that form a solution within a constant factor of the optimum. We then recluster these O(k log n) points into k initial centers for Lloyd's iteration. This initialization algorithm, which we call k-means||, is quite simple and lends itself to easy parallel implementations. However, the analysis of the algorithm turns out to be highly non-trivial, requiring new insights, and is quite different from the analysis of k-means++.

We then evaluate the performance of this algorithm on real-world datasets. Our key observations in the experiments are:

- O(log n) iterations are not necessary: after as few as five rounds, the solution of k-means|| is consistently as good as, or better than, that found by any other method.
- The parallel implementation of k-means|| is much faster than existing parallel algorithms for k-means.
- The number of iterations until Lloyd's algorithm converges is smallest when using k-means|| as the seed.

2. RELATED WORK

Clustering problems have been frequent and important objects of study by data management and data mining researchers for many years.¹ A thorough review of the clustering literature, even restricted to the work in the database area, is far beyond the scope of this paper; the reader is referred to the plethora of surveys available [8, 10, 25, 21, 19]. Below, we only discuss the highlights directly relevant to our work.

¹ A paper on database clustering [35] won the 2006 SIGMOD Test of Time Award.

Recall that we are concerned with k-partition clustering: given a set of n points in Euclidean space and an integer k, find a partition of these points into k subsets, each with a representative, also known as a center. There are three common formulations of k-partition clustering, depending on the particular objective used: k-center, where the objective is to minimize the maximum distance between a point and its nearest cluster center; k-median, where the objective is to minimize the sum of these distances; and k-means, where the objective is to minimize the sum of squares of these distances. All three of these problems are NP-hard, but constant-factor approximations are known for them.
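For reference, these three objectives can be written compactly. With d(x, C) = min_{c∈C} ||x − c|| denoting the distance from a point x to its nearest center in a candidate center set C with |C| = k:

\[
\textbf{k-center: } \min_{C}\,\max_{x \in X} d(x, C), \qquad
\textbf{k-median: } \min_{C}\,\sum_{x \in X} d(x, C), \qquad
\textbf{k-means: } \min_{C}\,\sum_{x \in X} d^2(x, C).
\]

The k-means objective, the sum of squared distances, is exactly the clustering cost φ_X(C) that appears throughout Section 3.3 and the analysis below.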
The k-means algorithm has been extensively studied from database and data management points of view; we discuss some of this work. Ordonez and Omiecinski [29] studied efficient disk-based implementations of k-means, taking into account the requirements of a relational DBMS. Ordonez [28] studied SQL implementations of k-means to better integrate it with a relational DBMS. The scalability issues in k-means are addressed by Farnstrom et al. [16], who used the compression-based techniques of Bradley et al. [9] to obtain a single-pass algorithm. Their emphasis is to initialize k-means in the usual manner, but instead to improve the performance of Lloyd's iteration.

The k-means algorithm has also been considered in parallel and other settings; the literature on this topic is extensive. Dhillon and Modha [14] considered k-means in the message-passing model, focusing on the speedup and scalability issues in this model. Several papers have studied k-means with outliers; see, for example, [22] and the references in [18]. Das et al. [12] showed how to implement EM (a generalization of k-means) in MapReduce; see also [36], who used similar tricks to speed up k-means. Sculley [31] presented modifications to k-means for batch optimization and to take data sparsity into account. None of these papers focuses on doing a non-trivial initialization. More recently, Ene et al. [15] considered the k-median problem in MapReduce and gave a constant-round algorithm that achieves a constant approximation.

The k-means algorithm has also been studied from theoretical and algorithmic points of view. Kanungo et al. [23] proposed a local search algorithm for k-means with a running time of O(n³ε^(−d)) and an approximation factor of 9+ε. Although the running time is only cubic in the worst case, even in practice the algorithm exhibits slow convergence to the optimal solution. Kumar, Sabharwal, and Sen [26] obtained a (1+ε)-approximation algorithm with a running time linear in n and d but exponential in k and 1/ε. Ostrovsky et al. [30] presented a simple algorithm for finding an initial set of clusters for Lloyd's iteration and showed that, under some data separability assumptions, the algorithm obtains a constant-factor approximation.

3.3 Our initialization algorithm: k-means||

In this section we present k-means||, our parallel version for initializing the centers. While our algorithm is largely inspired by k-means++, it uses an oversampling factor ℓ = Ω(k), unlike k-means++; intuitively, ℓ should be thought of as Θ(k). Our algorithm picks an initial center (say, uniformly at random) and computes ψ, the initial cost of the clustering after this selection. It then proceeds in O(log ψ) iterations, where in each iteration, given the current set C of centers, it samples each point x with probability ℓ·d²(x, C)/φ_X(C). The sampled points are then added to C, the quantity φ_X(C) is updated, and the iteration continues. The details are presented in Algorithm 2.

Algorithm 2: k-means||(k, ℓ) initialization.
1: C ← sample a point uniformly at random from X
2: ψ ← φ_X(C)
3: for O(log ψ) times do
4:   C′ ← sample each point x ∈ X independently with probability p_x = ℓ·d²(x, C)/φ_X(C)
5:   C ← C ∪ C′
6: end for
7: For x ∈ C, set w_x to be the number of points in X closer to x than to any other point in C
8: Recluster the weighted points in C into k clusters
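To make the control flow of Steps 1-7 concrete, here is a minimal single-machine sketch in Python with NumPy. It is an illustration rather than the paper's implementation: the function names are ours, the number of rounds is fixed at ⌈log ψ⌉, and the sampling probability is clipped at 1 (ℓ·d²(x, C)/φ_X(C) can exceed 1 for far-away points). Step 8 is delegated to any weighted clustering routine, e.g., a weighted variant of k-means++.

    import numpy as np

    def cost(X, C):
        # phi_X(C): total squared distance from each point to its nearest center.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        return d2.min(axis=1).sum()

    def kmeans_pipe_init(X, k, ell, rng):
        # Steps 1-2: one uniform-random center; psi is the initial cost.
        C = X[rng.integers(len(X))][None, :]
        psi = cost(X, C)
        # Step 3: O(log psi) rounds of oversampling.
        for _ in range(max(1, int(np.ceil(np.log(psi))))):
            d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            # Step 4: keep each point independently with prob. ell * d^2(x,C) / phi_X(C).
            p = np.minimum(1.0, ell * d2 / d2.sum())
            # Step 5: add the sampled points to the center set.
            C = np.vstack([C, X[rng.random(len(X)) < p]])
        # Step 7: weight each candidate by the size of its Voronoi cell.
        nearest = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        w = np.bincount(nearest, minlength=len(C))
        # Step 8 (not shown; this is where k is used): recluster the
        # weighted candidates (C, w) into k centers.
        return C, w

A call such as kmeans_pipe_init(X, k, ell=2*k, rng=np.random.default_rng(0)) mirrors one of the parameter settings evaluated in Section 5.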
As we will see later, the expected number of points chosen in each iteration is ℓ and, at the end, the expected number of points in C is ℓ·log ψ, which is typically more than k. To reduce the number of centers, Step 7 assigns weights to the points in C and Step 8 reclusters these weighted points to obtain k centers. Notice that the size of C is significantly smaller than the input size; the reclustering can therefore be done quickly. For instance, in MapReduce, since the number of centers is small, they can all be assigned to a single machine and any provable approximation algorithm (such as k-means++) can be used to cluster the points to obtain k centers. A MapReduce implementation of Algorithm 2 is discussed in Section 3.5.

While our algorithm is very simple and lends itself to a natural parallel implementation (in log ψ rounds²), the challenging part is to show that it has provable guarantees. Note that ψ ≤ n²Δ², where Δ is the maximum distance among a pair of points in X. We now state our formal guarantee about this algorithm.

² In practice, our experimental results in Section 5 show that only a few rounds are enough to reach a good solution.

Theorem 1. If an α-approximation algorithm is used in Step 8, then Algorithm k-means|| obtains a solution that is an O(α)-approximation to k-means. Thus, if k-means++ initialization is used in Step 8, then k-means|| is an O(log k)-approximation.

In Section 3.4 we give an intuitive explanation of why the algorithm works; we defer the full proof to Section 6.

3.4 A glimpse of the analysis

In this section, we present the intuition behind the proof of Theorem 1. Consider a cluster A present in the optimum k-means solution; denote |A| = T, and sort the points in A in increasing order of their distance to centroid(A): let the ordering be a_1, ..., a_T. Let q_t be the probability that a_t is the first point in the ordering chosen by k-means||, and let q_{T+1} be the probability that no point is sampled from cluster A. Letting p_t denote the probability of selecting a_t, we have, by definition of the algorithm, p_t = ℓ·d²(a_t, C)/φ_X(C). Also, since k-means|| picks each point independently, for any 1 ≤ t ≤ T we have q_t = p_t·∏_{j=1}^{t−1} (1 − p_j), and q_{T+1} = 1 − Σ_{t=1}^{T} q_t.

If a_t is the first point in A (w.r.t. the ordering) sampled as a new center, we can either assign all the points in A to a_t, or just stick with the current clustering of A. Hence, letting

  s_t = min(φ_A(C), Σ_{a∈A} ||a − a_t||²),

we have

  E[φ_A(C ∪ C′)] ≤ Σ_{t=1}^{T} q_t·s_t + q_{T+1}·φ_A(C).

Now, we do a mean-field analysis, in which we assume all the p_t (1 ≤ t ≤ T) to be equal to some value p. Geometrically speaking, this corresponds to the case where all the points in A are very far from the current clustering (and are also rather tightly clustered, so that all the d(a_t, C) (1 ≤ t ≤ T) are equal). In this case, we have q_t = p(1−p)^{t−1}, and hence {q_t}_{1≤t≤T} is a monotone decreasing sequence. By the ordering on the a_t, letting

  s′_t = Σ_{a∈A} ||a − a_t||²,

we have that {s′_t}_{1≤t≤T} is an increasing sequence. Therefore

  Σ_{t=1}^{T} q_t·s_t ≤ Σ_{t=1}^{T} q_t·s′_t ≤ (1/T)·(Σ_{t=1}^{T} q_t)·(Σ_{t=1}^{T} s′_t),

where the last inequality, an instance of Chebyshev's sum inequality [20], uses the inverse monotonicity of the sequences {q_t}_{1≤t≤T} and {s′_t}_{1≤t≤T}. It is easy to see that (1/T)·Σ_{t=1}^{T} s′_t = 2φ*_A, where φ*_A = Σ_{a∈A} ||a − centroid(A)||² is the cost of A in the optimal solution. Therefore,

  E[φ_A(C ∪ C′)] ≤ (1 − q_{T+1})·2φ*_A + q_{T+1}·φ_A(C).

This shows that in each iteration of k-means||, for each optimal cluster A, we remove a fraction of the current cost φ_A(C) and replace it with a constant factor times the optimal cost φ*_A. Thus, Steps 1-6 of k-means|| obtain a constant-factor approximation to k-means after O(log ψ) rounds and return O(ℓ·log ψ) centers. The algorithm obtains a solution of size k by clustering the chosen centers using a known algorithm. Section 6 contains the formal arguments that work for the general case, when the p_t are not necessarily all the same.
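The averaging identity invoked above is worth recording; it follows from the standard decomposition of squared distances about the centroid c(A) = centroid(A), whose cross term vanishes because Σ_{a∈A} (a − c(A)) = 0:

\begin{align*}
s'_t = \sum_{a \in A} \|a - a_t\|^2
  &= \sum_{a \in A} \|a - c(A)\|^2 + T\,\|a_t - c(A)\|^2
   = \phi^*_A + T\,\|a_t - c(A)\|^2, \\
\frac{1}{T}\sum_{t=1}^{T} s'_t
  &= \frac{1}{T}\sum_{t=1}^{T}\Bigl(\phi^*_A + T\,\|a_t - c(A)\|^2\Bigr)
   = \phi^*_A + \sum_{t=1}^{T}\|a_t - c(A)\|^2
   = 2\,\phi^*_A .
\end{align*}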
)centers.Theal-gorithmobtainsasolutionofsizekbyclusteringthechosencentersusingaknownalgorithm.Section6containsthefor-malargumentsthatworkforthegeneralcasewhenpt'sarenotnecessarilythesame.3.5AparallelimplementationInthissectionwediscussaparallelimplementationofk-means||intheMapReducemodelofcomputation.WeassumefamiliaritywiththeMapReducemodelandreferthereaderto[13]forfurtherdetails.Aswementionedearlier,Lloyd'siterationscanbeeasilyparallelizedinMapReduceandhence,weonlyfocusonSteps1{7inAlgorithm2.Step4isverysimpleinMapReduce:eachmappercansampleindependentlyandStep7isequallysimplegivenasetCofcenters.Givena(small)setCofcenters,computingX(C)isalsoeasy:eachmapperworkingonaninputpartitionX0XcancomputeX0(C)andthereducercansimply R=1 R=10 R=100 seed nal seed nal seed nal Random | 14 | 201 | 23,337 k-means++ 23 14 62 31 30 15 k-means||`=k=2;r=5 21 14 36 28 23 15 k-means||`=2k;r=5 17 14 27 25 16 15 Table1:Themediancost(over11runs)onGauss-Mixturewithk=50,scaleddownby104.Weshowboththecostaftertheinitializationstep(seed)andthe nalcostafterLloyd'siterations( nal).roundof~O(k3=2p n).)Notethatthisimpliesthattherun-ningtimeofPartitiondoesnotimprovewhenthenumberofavailablemachinessurpassesacertainthreshold.Ontheotherhand,k-means||'srunningtimeimproveslinearlywiththenumberofavailablemachines(asdiscussedinSection2,suchissueswereconsideredin[14]).Finally,noticethatusingthisoptimalsetting,theexpectedsizeoftheinterme-diatesetusedbyPartitionis3p nklogk,whichismuchlargerthanthatobtainedbyk-means||.Forinstance,Ta-ble5showsthatthesizeofthecoresetreturnedbyk-means||issmallerbythreeordersofmagnitude.5.EXPERIMENTALRESULTSInthissectionwedescribetheexperimentalresultsbasedonthesetupinSection4.Wepresentexperimentsonbothsequentialandparallelimplementationsofk-means||.Recallthatthemainmeritsofk-means||werestatedinTheorem1:(i)k-means||obtainsasasolutionwhoseclusteringcostisonparwithk-means++andhenceisexpectedtobemuchbet-terthanRandomand(ii)k-means||runsinafewernumberofroundswhencomparedtok-means++,whichtranslatesintoafasterrunningtimeespeciallyintheparallelimplementa-tion.Thegoalofourexperimentswillbetodemonstratetheseimprovementsonmassive,real-worlddatasets.5.1ClusteringcostToevaluatetheclusteringcostofk-means||,wecompareitagainstthebaselineapproaches.SpamandGaussMixturearesmallenoughtobeevaluatedonasinglemachine,andwecomparetheircosttothatofk-means||formoderatevaluesofk2f20;50;100g.Wenotethatfork50,thecentersselectedbyPartitionbeforereclusteringrepresentthefulldataset(as3p nklogk�nforthesedatasets),whichmeansthatresultsofPartitionwouldbeidenticaltothoseofk-means++.Hence,inthiscase,weonlycomparek-means||withk-means++andRandom.KDDCup1999issucientlylargethatforlargevaluesofk2f500;1000g,k-means++isextremelyslowwhenrunonasinglemachine.Hence,inthiscase,wewillonlycomparetheparallelimplementationofk-means||withPartitionandRandom.WepresenttheresultsforGaussMixtureinTable1andforSpaminTable2.Foreachalgorithmwelistthecostofthesolutionbothattheendoftheinitializationstep,beforeanyLloyd'siterationandthe nalcost.Wepresenttwopa-rametersettingsfork-means||;wewillexplorethee ectoftheparametersontheperformanceofthealgorithminSec-tion5.3.Wenotethattheinitializationcostofk-means||istypicallylowerthanthatofk-means++.Thissuggeststhat k=20 k=50 k=100 seed nal seed nal seed nal Random | 1,528 | 1,488 | 1,384 k-means++ 460 233 110 68 40 24 k-means||`=k=2;r=5 310 241 82 65 29 23 k-means||`=2k;r=5 260 234 69 66 24 24 Table2:Themediancost(over11runs)onSpamscaleddownby105.Weshowboththecostaftertheinitializationstep(seed)andthe 
…round of Õ(k^(3/2)·√n).) Note that this implies that the running time of Partition does not improve when the number of available machines surpasses a certain threshold. On the other hand, the running time of k-means|| improves linearly with the number of available machines (as discussed in Section 2, such issues were considered in [14]). Finally, notice that using this optimal setting, the expected size of the intermediate set used by Partition is 3·√(nk·log k), which is much larger than that obtained by k-means||. For instance, Table 5 shows that the size of the coreset returned by k-means|| is smaller by three orders of magnitude.

5. EXPERIMENTAL RESULTS

In this section we describe the experimental results based on the setup in Section 4. We present experiments on both sequential and parallel implementations of k-means||. Recall that the main merits of k-means|| were stated in Theorem 1: (i) k-means|| obtains a solution whose clustering cost is on par with that of k-means++, and hence is expected to be much better than Random; and (ii) k-means|| runs in a smaller number of rounds than k-means++, which translates into a faster running time, especially in the parallel implementation. The goal of our experiments is to demonstrate these improvements on massive, real-world datasets.

5.1 Clustering cost

To evaluate the clustering cost of k-means||, we compare it against the baseline approaches. Spam and GaussMixture are small enough to be evaluated on a single machine, and we compare their cost to that of k-means|| for moderate values of k ∈ {20, 50, 100}. We note that for k ≤ 50, the centers selected by Partition before reclustering represent the full dataset (as 3·√(nk·log k) > n for these datasets), which means that the results of Partition would be identical to those of k-means++. Hence, in this case, we only compare k-means|| with k-means++ and Random. KDDCup1999 is sufficiently large that for large values of k ∈ {500, 1000}, k-means++ is extremely slow when run on a single machine. Hence, in this case, we only compare the parallel implementation of k-means|| with Partition and Random.

We present the results for GaussMixture in Table 1 and for Spam in Table 2. For each algorithm we list the cost of the solution both at the end of the initialization step, before any Lloyd's iteration (seed), and the final cost after Lloyd's iterations (final). We present two parameter settings for k-means||; we explore the effect of the parameters on the performance of the algorithm in Section 5.3. We note that the initialization cost of k-means|| is typically lower than that of k-means++. This suggests that the centers produced by k-means|| avoid outliers, i.e., points that "confuse" k-means++. This improvement persists, although it is not as pronounced, if we look at the final cost of the clustering.

Table 1: The median cost (over 11 runs) on GaussMixture with k = 50, scaled down by 10^4. We show both the cost after the initialization step (seed) and the final cost after Lloyd's iterations (final).

                              R=1           R=10          R=100
                           seed  final   seed  final   seed   final
  Random                      -     14      -    201      -   23,337
  k-means++                  23     14     62     31     30       15
  k-means||, ℓ=k/2, r=5      21     14     36     28     23       15
  k-means||, ℓ=2k,  r=5      17     14     27     25     16       15

Table 2: The median cost (over 11 runs) on Spam, scaled down by 10^5. We show both the cost after the initialization step (seed) and the final cost after Lloyd's iterations (final).

                              k=20          k=50          k=100
                           seed  final   seed  final   seed  final
  Random                      -  1,528      -  1,488      -  1,384
  k-means++                 460    233    110     68     40     24
  k-means||, ℓ=k/2, r=5     310    241     82     65     29     23
  k-means||, ℓ=2k,  r=5     260    234     69     66     24     24

In Table 3 we present the results for KDDCup1999. It is clear that both k-means|| and Partition outperform Random by orders of magnitude. The overall cost for k-means|| improves with larger values of ℓ and surpasses that of Partition for ℓ > k.

Table 3: Clustering cost (scaled down by 10^10) for KDDCup1999, for r = 5.

                        k=500      k=1000
  Random              6.8·10^7    6.4·10^7
  Partition                7.3         1.9
  k-means||, ℓ=0.1k        5.1         1.5
  k-means||, ℓ=0.5k         19         5.2
  k-means||, ℓ=k           7.7         2.0
  k-means||, ℓ=2k          5.2         1.5
  k-means||, ℓ=10k         5.8         1.6

5.2 Running time

We now show that k-means|| is faster than Random and Partition when implemented to run in parallel. Recall that the running time of k-means|| consists of two components: the time required to generate the initial solution and the running time of Lloyd's iterations to convergence. The former is proportional to both the number of passes over the data and the size of the intermediate solution.

We first turn our attention to the running time of the initialization routine. It is clear that the number r of rounds used by k-means|| is much smaller than that used by k-means++. We therefore focus on the parallel implementation and compare k-means|| against Partition and Random. In Table 4 we show the total running time of these algorithms. For various settings of ℓ, k-means|| runs much faster than Random and Partition.

Table 4: Time (in minutes) for KDDCup1999.

                        k=500    k=1000
  Random                300.0     489.4
  Partition             420.2   1,021.7
  k-means||, ℓ=0.1k     230.2     222.6
  k-means||, ℓ=0.5k      69.0      46.2
  k-means||, ℓ=k         75.6      89.1
  k-means||, ℓ=2k        69.8      86.7
  k-means||, ℓ=10k       75.7     101.0

[Figure 5.1: The effect of different values of ℓ and of the number of rounds r on the final cost of the algorithm, for a 10% sample of KDDCup1999. Each data point is the median of 11 runs of the algorithm.]

[Figure 5.2: The cost of k-means|| followed by Lloyd's iterations, as a function of the number of initialization rounds, for GaussMixture.]

…By the induction hypothesis on E[ψ^(i)], the expected cost of the clustering after i rounds of Algorithm 2, we have

  E[ψ^(i+1)] ≤ ((1+α)/2)^(i+1)·ψ + 8φ* + (8(1+α)/(1−α))·φ* = ((1+α)/2)^(i+1)·ψ + (16/(1−α))·φ*,

where α < 1 is a parameter determined by ℓ and k, and φ* denotes the cost of the optimal k-means solution.

Corollary 3 implies that after O(log ψ) rounds, the cost of the clustering is O(φ*); Theorem 1 is then an immediate consequence.

We now proceed to establish Theorem 2. Consider any cluster A, with centroid c(A), in the optimal solution. Denote |A| = T and let a_1, ..., a_T be the points in A sorted increasingly with respect to their distance to c(A). Let C′ denote the set of centers that are selected during a particular iteration. For 1 ≤ t ≤ T, we let q_t = Pr[a_t ∈ C′ and a_j ∉ C′ for all 1 ≤ j < t] be the probability that the first t−1 points a_1, ..., a_{t−1} are not sampled during this iteration while a_t is sampled. Also, we denote by q_{T+1} the probability that no point is sampled from cluster A. Furthermore, for the remainder of this section, let D(a) = d(a, C), where C is the set of centers in the current iteration, and abbreviate φ = φ_X(C) and φ_A = φ_A(C). Letting p_t denote the probability of selecting a_t, we have, by definition of the algorithm, p_t = ℓ·D²(a_t)/φ. Since k-means|| picks each point independently, using the convention that p_{T+1} = 1, we have for all 1 ≤ t ≤ T+1,

  q_t = p_t·∏_{j=1}^{t−1} (1 − p_j).

The main idea behind the proof is to consider only those clusters in the optimal solution that have significant cost relative to the total clustering cost. For each of these clusters, the idea is to first express both its clustering cost and the probability that an early point is not selected as linear functions of the q_t (Lemmas 4, 5), and then appeal to linear programming (LP) duality in order to bound the clustering cost itself (Lemma 6 and Corollary 7).
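The LP of Figure 6.1 is referenced below but not reproduced in this transcript. Read off from Lemma 4 (whose bound supplies the objective) and Lemma 5 (whose bounds supply the constraints), with the q_t treated as variables and with s_t and η_t as defined next, it plausibly takes the form:

\begin{align*}
\text{maximize}\quad   & \sum_{t=1}^{T+1} q_t\, s_t \\
\text{subject to}\quad & \sum_{r=t+1}^{T+1} q_r \le \eta_t, \qquad 0 \le t \le T, \\
                       & q_t \ge 0, \qquad 1 \le t \le T+1.
\end{align*}

Its dual assigns a multiplier to each constraint; these are the β_t appearing at the end of the proof of Lemma 6.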
To formalize this idea, we start by defining

  s_t = min(φ_A, Σ_{a∈A} ||a − a_t||²)

for all 1 ≤ t ≤ T, and s_{T+1} = φ_A. Then, letting φ′_A = φ_A(C ∪ C′) be the clustering cost of cluster A after the current round of the algorithm, we have the following.

Lemma 4. The expected cost of clustering an optimum cluster A after a round of Algorithm 2 is bounded as

  E[φ′_A] ≤ Σ_{t=1}^{T+1} q_t·s_t.   (6.1)

Proof. We can rewrite the expectation of the clustering cost φ′_A of cluster A after one round of the algorithm as follows:

  E[φ′_A] = Σ_{t=1}^{T} q_t·E[φ′_A | a_t ∈ C′ and a_j ∉ C′ for all 1 ≤ j < t] + q_{T+1}·E[φ′_A | A ∩ C′ = ∅].   (6.2)

Observe that, conditioned on the fact that a_t ∈ C′, we can either assign all the points in A to the center a_t, or just stick with the former clustering of A, whichever has the smaller cost. Hence

  E[φ′_A | a_t ∈ C′ and a_j ∉ C′ for all 1 ≤ j < t] ≤ s_t,

while the last conditional expectation is at most φ_A = s_{T+1}, and the result follows from (6.2). ∎

In order to minimize the right-hand side of (6.1), we want to be sure that the sampling done by the algorithm places a lot of weight on q_t for small values of t. Intuitively, this means that we are more likely to select a point close to the optimal center of the cluster than one further away. Our sampling based on D²(·) implies a constraint on the probability that an early point is not selected, which we detail below.

Lemma 5. Let η_0 = 1 and, for any 1 ≤ t ≤ T, let η_t = ∏_{j=1}^{t} (1 − (D²(a_j)/φ_A)·(1 − q_{T+1})). Then, for any 0 ≤ t ≤ T,

  Σ_{r=t+1}^{T+1} q_r ≤ η_t.

Proof. First note that q_{T+1} = ∏_{t=1}^{T} (1 − p_t) ≥ 1 − Σ_{t=1}^{T} p_t. Therefore,

  1 − q_{T+1} ≤ Σ_{t=1}^{T} p_t = ℓ·φ_A/φ.

Thus

  p_t = ℓ·D²(a_t)/φ ≥ (D²(a_t)/φ_A)·(1 − q_{T+1}).

To prove the lemma, by the definition of the q_r we have

  Σ_{r=t+1}^{T+1} q_r = (∏_{j=1}^{t} (1 − p_j))·Σ_{r=t+1}^{T+1} (∏_{j=t+1}^{r−1} (1 − p_j))·p_r ≤ ∏_{j=1}^{t} (1 − p_j) ≤ ∏_{j=1}^{t} (1 − (D²(a_j)/φ_A)·(1 − q_{T+1})) = η_t. ∎

Having proved this lemma, we now slightly change our perspective and think of the values q_t (1 ≤ t ≤ T+1) as variables that satisfy a number of linear constraints (by Lemma 5) and of which a linear function bounds E[φ′_A] (by Lemma 4). This naturally leads to an LP on these variables whose optimum gives an upper bound on E[φ′_A]; see Figure 6.1. We will then use the properties of the LP and its dual to prove the following lemma.

Lemma 6. The expected potential of an optimal cluster A after a sampling step in Algorithm 2 is bounded as

  E[φ′_A] ≤ (1 − q_{T+1})·Σ_{t=1}^{T} (D²(a_t)/φ_A)·s_t + η_T·φ_A.

Proof. Since the points in A are sorted increasingly with respect to their distances to the centroid, letting

  s′_t = Σ_{a∈A} ||a − a_t||²   (6.3)

for 1 ≤ t ≤ T, we have s′_1 ≤ ⋯ ≤ s′_T. Hence, since s_t = min{φ_A, s′_t}, we also have s_1 ≤ ⋯ ≤ s_T ≤ s_{T+1}. Now consider the LP in Figure 6.1 and its dual. Since {s_t} is an increasing sequence, the optimal solution to the dual must have β_t = s_{t+1} − s_t (letting s_0 = 0). Then, we can …