COMS E Advanced Topics in Computational Learning Theory
A Survey of Correlation Clustering

Author: Hila Becker
Date: May 5, 2005

Abstract: The problem of partitioning a set of data points into clusters is found in many applications. Correlation clustering is a clustering technique motivated by the problem ...


[...] f using a hypothesis class of vertex clusters. So whenever we have (u,v) and (v,w) as positive examples, (u,w) must also be a positive example; therefore we might not be able to represent f in the best possible way. Most of the work on this clustering method is fairly new, as it was only introduced in 2002 by [1]. For the rest of this paper, we will discuss some of the major results as well as a couple of open questions and an intuition for their solution.

2 Definitions

Let $G=(V,E)$ be a graph on $n$ vertices with edge weights $c_e \geq 0$. Let $e(u,v) \in \{+,-\}$ be the label of the edge $(u,v)$. The positive neighborhood of $u$ is $N^+(u) = \{u\} \cup \{v : e(u,v) = +\}$ and the negative neighborhood of $u$ is $N^-(u) = \{u\} \cup \{v : e(u,v) = -\}$. Let OPT represent the optimal clustering, and for a clustering $C$, let $C(v)$ be the set of vertices that are in the same cluster as $v$. Consider a clustering $C = \{C_1, C_2, \ldots, C_k\}$. A negatively labeled edge inside a cluster is considered a negative mistake, and a positively labeled edge between clusters is considered a positive mistake. If our goal is to minimize disagreements, we minimize the weight of positive edges between clusters plus the absolute value of the weight of negative edges inside clusters. When maximizing agreements, we wish to maximize the weight of positive edges inside clusters plus the absolute value of the weights of negative edges between clusters.

3 Previous Work

3.1 Overview

In the original paper that introduced the problem, Bansal et al. [1] showed a constant factor approximation algorithm for minimizing disagreements, based on the principle of counting erroneous triangles, which are triangles with two positively labeled edges and one negatively labeled edge. In addition, they showed a PTAS for maximizing agreements similar to the PTAS for MAX CUT on dense graphs, focusing on complete graphs as well as graphs with edge labels -1 and +1. The work on minimizing disagreements was extended in [3], where an $O(\log n)$ approximation algorithm for general graphs is presented using linear programming and region-growing techniques. Demaine and Immorlica [3] also prove that the problem of minimizing disagreements is as hard as the APX-hard problem minimum multicut. Charikar et al. [2] gave a factor 4 approximation algorithm for minimizing disagreements on complete graphs, and also a factor $O(\log n)$ approximation for general graphs. In addition, they showed a factor 0.7664 approximation for maximizing agreements on general graphs and proved that finding a PTAS is APX-hard. Similar results on the hardness of minimizing disagreements were obtained independently by Emanuel and Fiat [4]. Lastly, Swamy [6] gave a 0.7666-approximation algorithm for maximizing agreements in general graphs with non-negative edge weights using semidefinite programming.

We will show the constant factor approximation algorithm for minimizing disagreements in complete graphs presented in [1], as well as the $O(\log n)$ approximation algorithm of [3] for minimizing disagreements in general graphs. Each of these algorithms uses a different method for approximating and optimizing the clustering. Looking at these different methods is useful when considering the open questions related to this topic.

[...] each positive edge can be chosen at most twice, so we can account for at least 1/4 of the positive mistakes by using edge-disjoint erroneous triangles. Therefore, having both positive and negative mistakes, we can choose edge-disjoint erroneous triangles to account for at least 1/8 of the mistakes made by the clustering.
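To make the mistake-counting used throughout this section concrete, here is a minimal sketch in Python (my own illustration, not code from the survey); the dictionary-based graph representation and the function name disagreements are assumptions made for the example.

```python
# Sketch: counting the disagreements of a clustering, following the definitions above.
# Assumptions (not from the paper): edges is a dict mapping a frozenset {u, v} to a
# (label, weight) pair with label in {'+', '-'}, and cluster_of maps each vertex
# to a cluster id.

def disagreements(edges, cluster_of):
    """Return the total weight of positive mistakes plus negative mistakes."""
    positive_mistakes = 0.0  # positive edges between clusters
    negative_mistakes = 0.0  # negative edges inside a cluster
    for edge, (label, weight) in edges.items():
        u, v = tuple(edge)
        same_cluster = cluster_of[u] == cluster_of[v]
        if label == '+' and not same_cluster:
            positive_mistakes += weight
        elif label == '-' and same_cluster:
            negative_mistakes += weight
    return positive_mistakes + negative_mistakes

# Example: an erroneous triangle (two '+' edges, one '-' edge).
edges = {
    frozenset({'a', 'b'}): ('+', 1.0),
    frozenset({'b', 'c'}): ('+', 1.0),
    frozenset({'a', 'c'}): ('-', 1.0),
}
print(disagreements(edges, {'a': 0, 'b': 0, 'c': 0}))  # 1.0 (one negative mistake)
print(disagreements(edges, {'a': 0, 'b': 0, 'c': 1}))  # 1.0 (one positive mistake)
```

The two calls at the end illustrate that an erroneous triangle forces at least one mistake under any clustering, which is exactly the observation behind the accounting above.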
Lemma 3. There exists a clustering OPT' such that all non-singleton clusters are $\delta$-clean and the number of mistakes made by OPT' is at most $(9/\delta^2 + 1)$ times the number of mistakes made by OPT.

Proof. We apply a "clean-up" procedure to the optimal clustering that results in the clustering OPT'. Let $C_1, C_2, \ldots, C_k$ be the clusters of OPT. Let $S = \emptyset$. For $i = 1$ to $k$ do:

- If there are more than $\frac{\delta}{3}|C_i|$ $\delta/3$-bad vertices in $C_i$, "dissolve" the cluster $C_i$ by letting $C_i' = \emptyset$ and $S = S \cup C_i$.
- Else, let $B_i$ be the set of $\delta/3$-bad vertices in $C_i$. Then $S = S \cup B_i$ and $C_i' = C_i \setminus B_i$.

Output the clustering OPT' $= C_1', C_2', \ldots, C_k', \{x\}_{x \in S}$.

We will show that the mistakes made by OPT and OPT' are related, starting by showing that each $C_i'$ is $\delta$-clean. This claim is trivial when $C_i'$ is empty. Otherwise, we know that $|C_i| \geq |C_i'| \geq (1 - \frac{\delta}{3})|C_i|$. For every vertex $v$ in cluster $C_i'$ the following holds:
\[ |N^+(v) \cap C_i'| \geq \left(1 - \frac{\delta}{3}\right)|C_i| - \frac{\delta}{3}|C_i| \geq (1 - \delta)|C_i'| \]
Also, for $\overline{C_i'} = V \setminus C_i'$,
\[ |N^+(v) \cap \overline{C_i'}| \leq \frac{\delta}{3}|C_i| + \frac{\delta}{3}|C_i| \leq \frac{2\delta}{3} \cdot \frac{|C_i'|}{1 - \delta/3} \leq \delta|C_i'| \quad (\text{as } \delta \leq 1) \]
This shows that every $C_i'$ is $\delta$-clean. Next, we determine the number of mistakes. In the case where we dissolve a cluster $C_i$, the number of mistakes OPT makes inside it is at least $\left(\frac{\delta}{3}\right)^2 \frac{|C_i|^2}{2}$, while the number of mistakes we get by dissolving this cluster is at most $\frac{|C_i|^2}{2}$. If the cluster $C_i$ was not dissolved, then the number of mistakes OPT makes inside it is at least $\frac{\delta}{3}\frac{|C_i||B_i|}{2}$, and using the above procedure we only add at most $|C_i||B_i|$ mistakes. Note that dissolving a cluster therefore adds at most $9/\delta^2$ times the mistakes made by the optimal clustering on that cluster, and not dissolving a cluster adds at most $6/\delta \leq 9/\delta^2$ times the mistakes made by the optimal clustering on that cluster. Therefore, we have shown the existence of a clustering OPT' in which all non-singleton clusters are $\delta$-clean and the number of mistakes made by OPT' is at most $(9/\delta^2 + 1)$ times the number of mistakes made by OPT.

We now present an algorithm that tries to find clusters similar to OPT'. Note that the clusters $C_i'$ in OPT' are non-singleton clusters, while the clusters added to $S$ are singleton clusters.

Algorithm I

[...] in $S$. The lemma follows.

We now need to relate the number of internal mistakes made by Algorithm I to the number of internal mistakes made by OPT' and OPT. Using Lemma 2, we can see that the clustering of the vertices obtained using Algorithm I satisfies the following:

Lemma 6. The total number of internal mistakes made by Algorithm I is at most 8 times the number of mistakes made by OPT.

Using Lemmas 5, 6, and 3, we have proved that our algorithm gives a constant factor approximation to OPT.

Theorem 7. The number of mistakes made by Algorithm I is at most $9\left(\frac{1}{\delta^2} + 1\right)$ times the mistakes made by OPT.

3.3 Minimizing Disagreements in General Weighted Graphs

In this section, we present an $O(\log n)$ approximation algorithm for minimizing disagreements in general weighted graphs, which was given by Demaine et al. [5]. This algorithm uses a combination of linear programming, rounding, and region-growing techniques. It first solves a linear program and then uses the resulting fractional values to determine the distance between two vertices, where a larger distance corresponds to weaker similarity. In the last step of the algorithm, we use the region-growing technique to group close vertices together and round the fractional values.

Consider the graph $G=(V,E)$. Let $x_{uv}$ be a boolean variable associated with the edge $e(u,v) \in E$, $u,v \in V$. Given a clustering $C$, let $x_{uv} = 0$ if $u$ and $v$ are in the same cluster, and $x_{uv} = 1$ if they are in different clusters. Recall that the number of mistakes in clustering $C$ is the sum of the positive and negative mistakes. We define the weight of the clustering as the sum of the weights of the erroneous edges in $C$. For a clustering $C$, the weight is $w(C) = w_p(C) + w_n(C)$, where $w_p(C) = \sum_{e=(u,v) \in E^+,\, u \notin C(v)} c_e$ and $w_n(C) = \sum_{e=(u,v) \in E^-,\, u \in C(v)} c_e$. Note that $(1 - x_{uv}) = 1$ if the edge $(u,v)$ is inside a cluster and $(1 - x_{uv}) = 0$ otherwise. Letting $c_e$ denote the weight of an edge $e = (u,v) \in E$, we can rewrite the weight of $C$ as
\[ w(C) = \sum_{e \in E^-} c_e (1 - x_e) + \sum_{e \in E^+} c_e x_e \]
In order to minimize disagreements, we need to find an assignment to $x_{uv}$ that minimizes this weight such that $x_{uv} \in \{0,1\}$ and $x_{uv}$ satisfies the triangle inequality. We formulate the problem as a linear program (relaxing $x_{uv}$ to $[0,1]$):
\[ \text{Minimize} \quad \sum_{e \in E^-} c_e (1 - x_e) + \sum_{e \in E^+} c_e x_e \]
\[ \text{Subject to} \quad x_{uv} \in [0,1], \quad x_{uv} = x_{vu}, \quad x_{uv} + x_{vw} \geq x_{uw}. \]
We show a procedure to round this LP in order to get an $O(\log n)$ approximation using the region-growing technique. Region-growing refers to the procedure of growing a "ball" around nodes in a graph by iteratively adding nodes within a fixed distance $r$ of the ball, until all nodes belong to some ball. In our algorithm, the set of nodes contained in each ball corresponds to the set of nodes that make up each cluster.
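As an illustration of the LP relaxation just described, the following is a minimal sketch (not the implementation from [5]; the pair-indexing scheme, the helper name solve_lp_relaxation, and the use of SciPy's linprog are my own assumptions) that builds the objective and the triangle-inequality constraints and solves the relaxation.

```python
# Sketch of the LP relaxation for minimizing disagreements (illustrative only).
# Assumptions: labeled_edges maps a vertex pair (u, v) to (label, weight) with
# label in {'+', '-'}; one fractional variable x_uv is created for every pair of
# vertices so that the triangle inequality can be imposed on all triples.
from itertools import combinations, permutations
from scipy.optimize import linprog

def solve_lp_relaxation(vertices, labeled_edges):
    pairs = list(combinations(sorted(vertices), 2))
    index = {p: i for i, p in enumerate(pairs)}          # x_uv == x_vu: one var per pair
    key = lambda u, v: (u, v) if (u, v) in index else (v, u)

    # Objective: sum over E+ of c_e * x_e + sum over E- of c_e * (1 - x_e).
    # The constant term sum_{E-} c_e does not change the minimizer, so it is dropped.
    c = [0.0] * len(pairs)
    for (u, v), (label, weight) in labeled_edges.items():
        c[index[key(u, v)]] += weight if label == '+' else -weight

    # Triangle inequality x_uw <= x_uv + x_vw, written as x_uw - x_uv - x_vw <= 0
    # to fit linprog's A_ub @ x <= b_ub form (duplicate rows are harmless).
    A_ub, b_ub = [], []
    for u, v, w in permutations(sorted(vertices), 3):
        row = [0.0] * len(pairs)
        row[index[key(u, w)]] += 1.0
        row[index[key(u, v)]] -= 1.0
        row[index[key(v, w)]] -= 1.0
        A_ub.append(row)
        b_ub.append(0.0)

    result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0.0, 1.0)] * len(pairs))
    return {p: value for p, value in zip(pairs, result.x)}
```

Note that the constant term $\sum_{e \in E^-} c_e$ dropped from the objective must be added back to recover the actual fractional cost $w_p(\mathrm{FRAC}) + w_n(\mathrm{FRAC})$.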
[...] Since our algorithm grows each $B$ until $\mathrm{Cut}(B) \leq c \ln(n+1)\,\mathrm{Vol}(B)$, we get from (1)
\[ w_p(\text{ROUND}) \leq \frac{c}{2}\ln(n+1) \sum_{B \in \mathcal{B}} \mathrm{Vol}(B) \quad (2) \]
By the design of the algorithm, all generated balls are disjoint, so using (2) and our prior assumption $F = w_p(\text{FRAC})$ we get
\[ w_p(\text{ROUND}) \leq \frac{c}{2}\ln(n+1)\left( \sum_{(u,v) \in E^+} c_{uv}\,\text{FRAC}(x_{uv}) + \sum_{B \in \mathcal{B}} \frac{F}{n} \right) \leq \frac{c}{2}\ln(n+1)\,\big(w_p(\text{FRAC}) + F\big) \leq c \ln(n+1)\, w_p(\text{FRAC}) \]
To obtain the $O(\log n)$ approximation, we claim that the balls returned by the algorithm have radius $r \leq 1/c$, which follows from the following lemma [5], [10].

Lemma 11. For any vertex $u$ and a family of balls $B(u,r)$, the condition $\mathrm{Cut}(B(u,r)) \leq c \ln(n+1)\,\mathrm{Vol}(B(u,r))$ is achieved by some $r \leq 1/c$.

We now show that the algorithm guarantees an $O(1)$ approximation to the cost of negative edges. We do so by using the above lemma to guarantee the bound on the radius and proving that the solution is a $\frac{c}{c-2}$-approximation of the cost of negative edges inside clusters. Let $\mathcal{B}$ be the set of balls found by Algorithm II. We get
\[
\begin{aligned}
w_n(\text{FRAC}) &= \sum_{(u,v) \in E^-} c_{uv}\,\big(1 - \text{FRAC}(x_{uv})\big) && (3) \\
&\geq \sum_{B \in \mathcal{B}} \sum_{(u,v) \in B \cap E^-} c_{uv}\,\big(1 - \text{FRAC}(x_{uv})\big) && (4) \\
&\geq \sum_{B \in \mathcal{B}} \sum_{(u,v) \in B \cap E^-} c_{uv}\,(1 - 2/c) && (5) \\
&= (1 - 2/c) \sum_{B \in \mathcal{B}} \sum_{(u,v) \in B \cap E^-} c_{uv} && (6) \\
&= \frac{c-2}{c}\, w_n(\text{ROUND}) && (7)
\end{aligned}
\]
Equation (5) follows from (4), the triangle inequality, and the fact that $r \leq 1/c$. The algorithm guarantees an $O(1)$ approximation given that $c > 2$ in the approximation ratio $\frac{c}{c-2}$. We therefore get the total number of mistakes made by the algorithm to be
\[ w(\text{ROUND}) = w_p(\text{ROUND}) + w_n(\text{ROUND}) \leq c \ln(n+1)\, w_p(\text{OPT}) + \frac{c}{c-2}\, w_n(\text{OPT}) \leq \max\left\{ c \ln(n+1),\ \frac{c}{c-2} \right\} w(\text{OPT}) \]
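For intuition, here is a deliberately simplified Python sketch of the region-growing rounding (my own simplification, not Algorithm II of Demaine et al. [5]): the radius search is a crude incremental scan, only positive-edge weights enter Cut and Vol, and the partial-edge accounting of the real algorithm is omitted, but the stopping rule $\mathrm{Cut}(B) \leq c\ln(n+1)\mathrm{Vol}(B)$ and the $F/n$ seed term match the analysis above.

```python
# Simplified sketch of LP rounding by region growing (illustrative only): repeatedly
# pick an uncovered vertex, grow a ball around it in the LP distance x_uv, and stop
# once Cut(B) <= c * ln(n + 1) * Vol(B); each ball becomes one cluster.
import math

def region_growing_round(vertices, weights, x, c=3.0):
    """weights[(u, v)]: weight of a positive edge; x[(u, v)]: LP distance in [0, 1]."""
    n = len(vertices)
    dist = lambda u, v: x.get((u, v), x.get((v, u), 1.0))
    wt = lambda u, v: weights.get((u, v), weights.get((v, u), 0.0))
    F = sum(w * dist(u, v) for (u, v), w in weights.items())   # fractional positive cost

    remaining = set(vertices)
    clusters = []
    while remaining:
        u = next(iter(remaining))
        r = 0.0
        while True:
            ball = {v for v in remaining if dist(u, v) <= r} | {u}
            # Vol(B): the F/n seed term plus the fractional weight of positive
            # edges with both endpoints inside the ball (a simplification).
            vol = F / n + sum(wt(a, b) * dist(a, b)
                              for a in ball for b in ball if a < b)
            # Cut(B): weight of positive edges leaving the ball within the remaining graph.
            cut = sum(wt(a, b) for a in ball for b in remaining - ball)
            if cut <= c * math.log(n + 1) * vol:
                break
            r += 1.0 / (2 * n)   # crude radius scan; Lemma 11 guarantees some r <= 1/c works
        clusters.append(ball)
        remaining -= ball
    return clusters
```

Each returned ball becomes one cluster of the rounded solution ROUND, and the vertices it contains are removed before the next ball is grown, so the balls are disjoint as required by the analysis.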
[...] set of Web pages. We selected different samples of 200 pages each from different genres to test the clustering algorithm on. The accuracy of the algorithm compared to clustering by manual inspection was about 95%. Even though the results have a dependency on the classifier function used to decide whether two documents are similar, we believe that the accuracy of the results points to the fact that using the region-growing technique can help get better approximation results for the open correlation clustering problems that were not shown to have a tight approximation factor.

References

[1] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. In Proceedings of the 43rd Annual IEEE Symposium on Foundations of Computer Science, pages 238-250, Vancouver, Canada, November 2002.

[2] Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, pages 524-533, 2003.

[3] Erik D. Demaine and Nicole Immorlica. Correlation clustering with partial information. In Proceedings of the 6th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems, pages 1-13, Princeton, NJ, August 2003.

[4] Dotan Emanuel and Amos Fiat. Correlation clustering - minimizing disagreements on arbitrary weighted graphs. In Proceedings of the 11th Annual European Symposium on Algorithms, pages 208-220, 2003.

[5] Erik D. Demaine, Nicole Immorlica, Dotan Emanuel, and Amos Fiat. Correlation clustering in general weighted graphs. Special issue on approximation and online algorithms, 2005.

[6] Chaitanya Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of the 15th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 526-527, New Orleans, LA, 2004. Society for Industrial and Applied Mathematics.

[7] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. Clustering for mining in large spatial databases. KI-Journal, 1, 1998. Special Issue on Data Mining. ScienTec Publishing.

[8] Ari Freund. The Local Ratio and Primal-Dual Methods in Approximation Algorithms, Seminar (203.3485). http://cs.haifa.ac.il/courses/approx_algo/

[9] Dorit S. Hochbaum and David B. Shmoys. A unified approach to approximation algorithms for bottleneck problems. Journal of the ACM, 33:533-550, 1986.

[10] Vijay V. Vazirani. Approximation Algorithms. Berlin; New York: Springer, 2001.

[11] Moses Charikar and Anthony Wirth. Maximizing quadratic programs: Extending Grothendieck's inequality. In Proceedings of the 45th Annual IEEE Symposium on Foundations of Computer Science, pages 54-60, 2004.

[12] Henry Wolkowicz. Optimization Software and Theory. http://orion.math.uwaterloo.ca/hwolkowi/henry/software/readme.html#combopt