Author: Hila Becker
Date: May 5, 2005

A Survey of Correlation Clustering

Abstract
The problem of partitioning a set of data points into clusters is found in many applications. Correlation clustering is a clustering technique motivated by the problem [...]
[...] $f$ using a hypothesis class of vertex clusters. So whenever we have $(u, v)$ and $(v, w)$ as positive examples, $(u, w)$ must also be a positive example; therefore we might not be able to represent $f$ in the best possible way. Most of the work on this clustering method is fairly new, as it was only introduced in 2002 by [1]. For the rest of this paper, we will discuss some of the major results as well as a couple of open questions and an intuition for their solution.

2 Definitions

Let $G = (V, E)$ be a graph on $n$ vertices with edge weights $c_e \ge 0$. Let $e(u, v) \in \{+, -\}$ be the label of the edge $(u, v)$. The positive neighborhood of $u$ is $N^+(u) = \{u\} \cup \{v : e(u, v) = +\}$ and the negative neighborhood of $u$ is $N^-(u) = \{u\} \cup \{v : e(u, v) = -\}$. Let OPT represent the optimal clustering, and for a clustering $C$, we let $C(v)$ be the set of vertices that are in the same cluster as $v$.

Consider a clustering $C = \{C_1, C_2, \ldots, C_n\}$. A negative labeled edge inside a cluster is considered a negative mistake and a positive labeled edge between clusters is considered a positive mistake. If our goal is to minimize disagreements, we minimize the weight of positive edges between clusters plus the absolute valued weight of negative edges inside clusters. When maximizing agreements, we wish to maximize the weight of positive edges inside clusters plus the absolute valued weight of negative edges between clusters.

3 Previous Work

3.1 Overview

In the original paper that introduced the problem, Bansal et al. [1] showed a constant factor approximation algorithm for minimizing disagreements, based on the principle of counting erroneous triangles, which are triangles with two positive labeled edges and one negative labeled edge. In addition, they showed a PTAS for maximizing agreements similar to the PTAS for MAX CUT on dense graphs, focusing on complete graphs as well as graphs with edge labels -1 and 1.

The work on minimizing disagreements was extended in [3], where an $O(\log n)$ approximation algorithm for general graphs is presented using linear programming and region-growing techniques. Demaine and Immorlica [3] also prove that the problem of minimizing disagreements is as hard as the APX-hard problem minimum multicut.

Charikar et al. [2] gave a factor 4 approximation algorithm for minimizing disagreements on complete graphs, and also a factor $O(\log n)$ approximation for general graphs. In addition, they showed a factor 0.7664 approximation for maximizing agreements on general graphs and proved that this problem is APX-hard, which makes a PTAS unlikely. Similar results on the hardness of minimizing disagreements were obtained independently by Emanuel and Fiat [4]. Lastly, Swamy [6] gave a 0.7666-approximation algorithm for maximizing agreements in general graphs with non-negative edge weights using semidefinite programming.

We will show the constant factor approximation algorithm for minimizing disagreements in complete graphs presented in [1] as well as the $O(\log n)$ approximation algorithm of [3] for minimizing disagreements in general graphs. Each of these algorithms uses a different method for approximating and optimizing the clustering. Looking at these different methods is useful when considering the open questions related to this topic.

[...] each positive edge can be chosen at most twice, so we can account for at least 1/4 of the positive mistakes by using edge-disjoint erroneous triangles. Therefore, having both positive and negative mistakes, we can choose edge-disjoint erroneous triangles to account for at least 1/8 of the mistakes made by the clustering.

Lemma 3. There exists a clustering OPT' such that all non-singleton clusters are $\delta$-clean and the number of mistakes made by OPT' is at most $\left(\frac{9}{\delta^2} + 1\right)$ times the number of mistakes made by OPT.

Proof. We apply a "clean-up" procedure to the optimal clustering that results in the clustering OPT'. Let $C_1, C_2, \ldots, C_k$ be the clusters of OPT. Let $S = \emptyset$.
For $i = 1$ to $k$ do:
- If there are more than $\frac{\delta}{3}|C_i|$ $\frac{\delta}{3}$-bad vertices in $C_i$, "dissolve" the cluster $C_i$ by letting $C'_i = \emptyset$ and $S = S \cup C_i$.
- Else let $B_i$ be the set of $\frac{\delta}{3}$-bad vertices in $C_i$. Then $S = S \cup B_i$ and $C'_i = C_i \setminus B_i$.
Output the clustering OPT' $= C'_1, C'_2, \ldots, C'_k, \{x\}_{x \in S}$.

We will show that the mistakes made by OPT and OPT' are related, starting by showing that each $C'_i$ is $\delta$-clean. This claim is trivial when $C'_i$ is empty. Otherwise, we know that $|C_i| \ge |C'_i| \ge \left(1 - \frac{\delta}{3}\right)|C_i|$. For every vertex $v$ in cluster $C'_i$ the following holds:
$$|N^+(v) \cap C'_i| \ge \left(1 - \frac{\delta}{3}\right)|C_i| - \frac{\delta}{3}|C_i| \ge (1 - \delta)\,|C'_i|$$
Also, for $\overline{C'_i} = V \setminus C'_i$:
$$|N^+(v) \cap \overline{C'_i}| \le \frac{\delta}{3}|C_i| + \frac{\delta}{3}|C_i| \le \frac{2\delta/3}{1 - \delta/3}\,|C'_i| \le \delta\,|C'_i| \quad (\text{as } \delta \le 1)$$
This shows that every $C'_i$ is $\delta$-clean.

Next, we determine the number of mistakes. In the case where we dissolve a cluster $C_i$, the number of mistakes associated with it in OPT is at least $\left(\frac{\delta}{3}\right)^2 |C_i|^2 / 2$, while the number of mistakes we add by dissolving this cluster is at most $|C_i|^2 / 2$. If the cluster $C_i$ was not dissolved, then the number of mistakes associated with it is at least $\frac{\delta}{3}|C_i||B_i| / 2$, and using the above procedure we add at most $|C_i||B_i|$ mistakes. Note that dissolving a cluster adds at most a $\frac{9}{\delta^2}$ fraction of the mistakes made by the optimal cluster, and not dissolving a cluster adds at most a $\frac{6}{\delta} \le \frac{9}{\delta^2}$ fraction of the mistakes made by the optimal cluster. Therefore, we have shown the existence of a clustering OPT' in which all non-singleton clusters are $\delta$-clean and the number of mistakes made by OPT' is at most $\left(\frac{9}{\delta^2} + 1\right)$ times the number of mistakes made by OPT.

We now present an algorithm that tries to find clusters similar to OPT'. Note that the clusters $C'_i$ in OPT' are non-singleton clusters, while the clusters added to $S$ are singleton clusters.

Algorithm I
[...] in $S$. The lemma follows.

We now need to relate the number of internal mistakes made by Algorithm I to the number of internal mistakes made by OPT' and OPT. Using Lemma 2, we can see that the clustering of the vertices obtained using Algorithm I [...]

Lemma 6. The total number of internal mistakes made by Algorithm I is at most 8 times the number of mistakes made by OPT.

Using Lemmas 5, 6 and 3, we have proved that our algorithm gives a constant factor approximation to OPT.

Theorem 7. The number of mistakes made by Algorithm I is at most $9\left(\frac{1}{\delta^2} + 1\right)$ times the mistakes made by OPT.

3.3 Minimizing Disagreements in General Weighted Graphs

In this section, we present an $O(\log n)$ approximation algorithm for minimizing disagreements in general weighted graphs, which was given by Demaine et al. [5]. This algorithm uses a combination of linear programming, rounding and region-growing techniques. It first solves a linear program and then uses the resulting fractional values to determine the distance between two vertices, where a larger distance corresponds to weaker similarity. In the last step of the algorithm, we use the region-growing technique to group close vertices together and round the fractional values.

Consider the graph $G = (V, E)$. Let $x_{uv}$ be a Boolean variable associated with the edge $e(u, v) \in E$, $u, v \in V$. Given a clustering $C$, let $x_{uv} = 0$ if $u$ and $v$ are in the same cluster, and $x_{uv} = 1$ if they are in different clusters. Recall that the number of mistakes in clustering $C$ is the sum of the positive and negative mistakes. We define the weight of the clustering as the sum of the weights of the erroneous edges in $C$. For a clustering $C$, the weight of the clustering is $w(C) = w_p(C) + w_n(C)$, where
$$w_p(C) = \sum \{c_e : e = (u, v) \in E^+,\ u \notin C(v)\} \quad \text{and} \quad w_n(C) = \sum \{c_e : e = (u, v) \in E^-,\ u \in C(v)\}.$$
Note that $(1 - x_{uv}) = 1$ if the edge $(u, v)$ is inside a cluster and $(1 - x_{uv}) = 0$ otherwise. Letting $c_e$ denote the weight of an edge $e = (u, v) \in E$ and $x_e = x_{uv}$, we can rewrite the weight of $C$ as
$$w(C) = \sum_{e \in E^-} c_e (1 - x_e) + \sum_{e \in E^+} c_e x_e$$
In order to minimize disagreements, we need to find an assignment to the $x_{uv}$ that minimizes the weight such that $x_{uv} \in \{0, 1\}$ and the $x_{uv}$ satisfy the triangle inequality. We formulate the relaxed problem as a linear program:
$$\text{Minimize} \quad \sum_{e \in E^-} c_e (1 - x_e) + \sum_{e \in E^+} c_e x_e$$
$$\text{such that} \quad x_{uv} \in [0, 1], \quad x_{uv} = x_{vu}, \quad x_{uv} + x_{vw} \ge x_{uw}.$$
We show a procedure to round this LP in order to get an $O(\log n)$ approximation using the region-growing technique. Region-growing refers to the procedure of growing a "ball" around nodes in a graph by iteratively adding the nodes within a given distance $r$ to the ball, until every node belongs to some "ball". In our algorithm, the set of nodes contained in each ball corresponds to the set of nodes that make up each cluster.

[...]

Since our algorithm grows each ball $B$ until $\mathrm{Cut}(B) \le c \ln(n+1)\, \mathrm{Vol}(B)$, we get from (1)
$$w_p(\mathrm{ROUND}) \le \frac{c}{2} \ln(n+1) \sum_{B \in \mathcal{B}} \mathrm{Vol}(B) \tag{2}$$
By the design of the algorithm, all generated balls are disjoint, so using (2) and our prior assumption $F = w_p(\mathrm{FRAC})$ we get
$$w_p(\mathrm{ROUND}) \le \frac{c}{2} \ln(n+1) \left( \sum_{(u,v) \in E^+} c_{uv}\, \mathrm{FRAC}(x_{uv}) + \sum_{B \in \mathcal{B}} \frac{F}{n} \right) \le \frac{c}{2} \ln(n+1) \left( w_p(\mathrm{FRAC}) + F \right) \le c \ln(n+1)\, w_p(\mathrm{FRAC})$$
To observe the $O(\log n)$ approximation, we claim that the balls returned by the algorithm have radius $r \le 1/c$, which follows from the lemma below [5], [10].

Lemma 11. For any vertex $u$ and a family of balls $B(u, r)$, the condition $\mathrm{Cut}(B(u, r)) \le c \ln(n+1)\, \mathrm{Vol}(B(u, r))$ is achieved by some $r \le 1/c$.

We now show that the algorithm guarantees an $O(1)$ approximation to the cost of negative edges. We do so by using the above lemma to guarantee the bound on the radius and proving that the solution is a $\frac{c}{c-2}$-approximation of the cost of negative edges inside clusters. Let $\mathcal{B}$ be the set of balls found by Algorithm II. We get
$$w_n(\mathrm{FRAC}) = \sum_{(u,v) \in E^-} c_{uv}\,(1 - \mathrm{FRAC}(x_{uv})) \tag{3}$$
$$\ge \sum_{B \in \mathcal{B}} \sum_{(u,v) \in B \cap E^-} c_{uv}\,(1 - \mathrm{FRAC}(x_{uv})) \tag{4}$$
$$\ge \sum_{B \in \mathcal{B}} \sum_{(u,v) \in B \cap E^-} c_{uv}\,(1 - 2/c) \tag{5}$$
$$= (1 - 2/c) \sum_{B \in \mathcal{B}} \sum_{(u,v) \in B \cap E^-} c_{uv} \tag{6}$$
$$= \frac{c-2}{c}\, w_n(\mathrm{ROUND}) \tag{7}$$
Equation (5) follows from (4), the triangle inequality and the fact that $r \le 1/c$: two vertices in the same ball are each within distance $1/c$ of its center, so their distance is at most $2/c$. Rearranging (3)-(7) gives $w_n(\mathrm{ROUND}) \le \frac{c}{c-2}\, w_n(\mathrm{FRAC})$, so the algorithm guarantees an $O(1)$ approximation given that $c > 2$ in the approximation ratio $\frac{c}{c-2}$. We therefore get the total weight of the mistakes made by the algorithm to be
$$w(\mathrm{ROUND}) = w_p(\mathrm{ROUND}) + w_n(\mathrm{ROUND}) \le c \ln(n+1)\, w_p(\mathrm{FRAC}) + \frac{c}{c-2}\, w_n(\mathrm{FRAC}) \le \max\left\{ c \ln(n+1), \frac{c}{c-2} \right\} w(\mathrm{FRAC}) \le \max\left\{ c \ln(n+1), \frac{c}{c-2} \right\} w(\mathrm{OPT}),$$
where the last step holds because the linear program is a relaxation of the clustering problem, so $w(\mathrm{FRAC}) \le w(\mathrm{OPT})$.
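The quantities in this analysis can be made concrete with a small sketch. The Python code below is an illustration only, not the implementation of [5]: it evaluates the weight $w = w_p + w_n$ for any assignment of the variables $x_{uv}$ (0/1 values from a clustering, or fractional LP values) and computes the factor $\max\{c \ln(n+1), c/(c-2)\}$ from the final bound. The function names, the edge-list representation and the toy instance are assumptions made for this example.

```python
import math


def clustering_to_x(vertices, cluster_of):
    """0/1 assignment from the text: x_uv = 0 inside a cluster, 1 across clusters."""
    x = {}
    for i, u in enumerate(vertices):
        for v in vertices[i + 1:]:
            x[frozenset((u, v))] = 0 if cluster_of[u] == cluster_of[v] else 1
    return x


def objective(edges, x):
    """Return (w_p, w_n) for an assignment x.

    w_p sums c_e * x_e over positive edges (positive edges that are cut);
    w_n sums c_e * (1 - x_e) over negative edges (negative edges kept together).
    Works for 0/1 values and for fractional values alike.
    """
    w_p = sum(c * x[frozenset((u, v))] for u, v, label, c in edges if label == "+")
    w_n = sum(c * (1 - x[frozenset((u, v))]) for u, v, label, c in edges if label == "-")
    return w_p, w_n


def approximation_factor(c, n):
    """max{c ln(n+1), c/(c-2)} from the final bound; only meaningful for c > 2."""
    if c <= 2:
        raise ValueError("the ratio c/(c-2) requires c > 2")
    return max(c * math.log(n + 1), c / (c - 2))


# Hypothetical toy instance: a triangle with two positive edges and one negative edge.
vertices = ["a", "b", "c"]
edges = [("a", "b", "+", 1.0), ("b", "c", "+", 2.0), ("a", "c", "-", 3.0)]

# Clustering {a}, {b, c}: only the positive edge (a, b) is cut.
x = clustering_to_x(vertices, {"a": 0, "b": 1, "c": 1})
w_p, w_n = objective(edges, x)
print(w_p, w_n)  # prints "1.0 0.0"
```

Because `objective` accepts fractional values, the same function could be used to compare a fractional LP solution with its rounded clustering on small instances, which is one way to check the inequalities above empirically.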
[...] set of Web pages. We selected different samples of 200 pages each from different genres to test the clustering algorithm on. The accuracy of the algorithm compared to clustering by manual inspection was about 95%. Even though the results depend on the classifier function used to decide whether two documents are similar, we believe that the accuracy of the results suggests that the region-growing technique can help obtain better approximation results for the open correlation clustering problems that have not been shown to have a tight approximation factor.

References

[1] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, pages 238-250, Vancouver, Canada, November 2002.
[2] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, pages 524-533, 2003.
[3] Erik D. Demaine and Nicole Immorlica. Correlation clustering with partial information. In Proceedings of the 6th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, pages 1-13, Princeton, NJ, August 2003.
[4] Dotan Emanuel and Amos Fiat. Correlation clustering - minimizing disagreements on arbitrary weighted graphs. In Proceedings of the 11th Annual European Symposium on Algorithms, pages 208-220, 2003.
[5] Erik D. Demaine, Nicole Immorlica, Dotan Emanuel, and Amos Fiat. Correlation clustering in general weighted graphs. Special issue on approximation and online algorithms, 2005.
[6] Chaitanya Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 526-527, New Orleans, LA, 2004. Society for Industrial and Applied Mathematics.
[7] Martin Ester, Hans-Peter Kriegel, Jorg Sander, and Xiaowei Xu. Clustering for mining in large spatial databases. KI-Journal, 1, 1998. Special Issue on Data Mining. ScienTec Publishing.
[8] Ari Freund. The Local Ratio and Primal-Dual Methods in Approximation Algorithms, Seminar (203.3485). http://cs.haifa.ac.il/courses/approx algo/
[9] Dorit S. Hochbaum and David B. Shmoys. A unified approach to approximation algorithms for bottleneck problems. Journal of the ACM, 33:533-550, 1986.
[10] Vijay V. Vazirani. Approximation Algorithms. Springer, Berlin; New York, 2001.
[11] Moses Charikar and Anthony Wirth. Maximizing quadratic programs: Extending Grothendieck's inequality. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pages 54-60, 2004.
[12] Henry Wolkowicz. Optimization Software and Theory. http://orion.math.uwaterloo.ca/~hwolkowi/henry/software/readme.html#combopt