Defining and Evaluating Network Communities based on Ground-truth

Jaewon Yang
Stanford University
crucis@stanford.edu

Jure Leskovec
Stanford University
jure@cs.stanford.edu

This paper has been published in the Proceedings of 2012 IEEE International Conference on Data Mining (ICDM), 2012.

Abstract: Nodes in real-world networks organize into densely linked communities where edges appear with high concentration among the members of the community. Identifying such communities of nodes has proven to be a challenging task mainly due to a plethora of definitions of a community, intractability of algorithms, issues with evaluation and the lack of a reliable gold-standard ground-truth. In this paper we study a set of 230 large real-world social, collaboration and information networks where nodes explicitly state their group memberships. For example, in social networks nodes explicitly join various interest based social groups. We use such groups to define a reliable and robust notion of ground-truth communities. We then propose a methodology which allows us to compare and quantitatively evaluate how different structural definitions of network communities correspond to ground-truth communities. We choose 13 commonly used structural definitions of network communities and examine their sensitivity, robustness and performance in identifying the ground-truth. We show that the 13 structural definitions are heavily correlated and naturally group into four classes. We find that two of these definitions, Conductance and Triad-participation-ratio, consistently give the best performance in identifying ground-truth communities. We also investigate a task of detecting communities given a single seed node. We extend the local spectral clustering algorithm into a heuristic parameter-free community detection method that easily scales to networks with more than a hundred million nodes. The proposed method achieves 30% relative improvement over current local clustering methods.

I. INTRODUCTION

Networks are a natural way to represent social [20], biological [23], technological [16], and information [8] systems. Nodes in such networks organize into densely linked groups that are commonly referred to as network communities, clusters or modules [11]. There are many reasons why nodes in networks organize into densely linked clusters. For example, society is organized into social groups, families, villages and associations [7], [12]. On the World Wide Web, topically related pages link more densely among themselves [8]. And, in metabolic networks, densely linked clusters of nodes are related to functional units, such as pathways and cycles [23].

To extract communities from a given undirected network, one typically chooses a scoring function (e.g., modularity) that quantifies the intuition that communities correspond to densely linked sets of nodes. Then one applies a procedure to find sets of nodes with a high value of the scoring function. Identifying such communities in networks [14], [6], [26], [9] has proven to be a challenging task [10], [18], [17] due to three reasons:
- There exist multiple structural definitions of network communities [5], [24];
- Even if we would agree on a single common structural definition (i.e., a single scoring function), the formalizations of community detection lead to NP-hard problems [26];
- And, the lack of reliable ground-truth makes evaluation extremely difficult.

Currently the performance of community detection methods is evaluated by manual inspection. For each detected community an effort is made to interpret it as a “real” community by identifying a common property or external attribute shared by all the members of the community. For example, when examining communities in a scientific collaboration network, we might by manual inspection discover that many of the detected communities correspond to groups of scientists working in common areas of science [22]. Such anecdotal evaluation procedures require extensive manual effort, are non-comprehensive and are limited to small networks.

A possible solution would be to find a reliable definition of explicitly labeled gold-standard ground-truth communities. Using such ground-truth communities would allow for quantitative and large-scale evaluation and comparison of network community detection methods. Such ability would enable the field to move beyond the current standard of anecdotal evaluation of communities to a comprehensive evaluation of community detection methods based on their ability to extract the ground-truth.

The contributions of our work are threefold. First, we describe a set of 230 large social and information networks where we define ground-truth communities in a reliable way. Second, based on the ground-truth we quantitatively evaluate 13 commonly used structural definitions of network communities and examine their robustness and sensitivity to noise. Third, we extend the local spectral clustering algorithm into a parameter-free community detection method that scales to networks of hundreds of millions of nodes.

Present work: Ground-truth communities. Next we describe the proposed definition of ground-truth communities and argue why it corresponds to “real” communities. Generally, after communities are identified based on the structure of a given network, the essential next step is to interpret them by identifying a common external property or function that the members of a given community share and around which the community organizes [7]. For example, given a protein-protein interaction network of a cell we identify communities based on the structure of the network and then find that these communities correspond to real functional units of a cell. Thus, the goal of community detection is to identify sets of nodes with a common (often external/latent) function based only on the connectivity structure of the network. A common function can be a common role, affiliation, or attribute [12]. In our protein interaction network example above, such a common function of nodes would be “belonging to the same functional unit”. Community detection methods identify communities based on structure while the extracted communities are evaluated based on their function. So we distinguish between structural and functional definitions of communities. We use the common function of nodes to define ground-truth communities.

Present work: Networks with ground-truth. We gathered 230 networks from a number of different domains and research areas where nodes explicitly state their ground-truth community memberships. Our collection consists of social, collaboration and information networks for each of which we find a robust functional definition of ground-truth. For example, in online social networks (such as Orkut, LiveJournal, Friendster and 225 different Ning networks) we consider explicitly defined interest based groups (e.g., fans of Lady Gaga, students of the same school) as ground-truth communities. Nodes explicitly join such groups that organize around specific topics, interests, and affiliations [7], [12].
Next, we also consider the Amazon product co-purchasing network where we define ground-truth using hierarchically nested product categories. Here all members (i.e., products) of the same ground-truth community share a common function or purpose. Last, in the scientific collaboration network of DBLP we use publication venues as proxies for ground-truth research communities. Our reasoning here is that in scientific collaboration networks, real communities would correspond to areas of science. Thus, we use journals and conferences as proxies for scientific communities.

Present work: Methodology and findings. The ground-truth allows us to examine how well various structural definitions of network communities correspond to real functional groups (i.e., ground-truth communities). A good structural definition of a community would be such that it would detect connectivity patterns that correspond to real groups (i.e., the ground-truth). This means that we can evaluate different structural definitions based on their ability to identify the connectivity structure of ground-truth communities.

We study 13 commonly used structural definitions of communities and examine their quality, sensitivity and robustness. Each such definition corresponds to a scoring function that scores a set of nodes based on their connectivity. A high score means that a set of nodes closely resembles the connectivity of a community. By comparing correlations of scores that different structural definitions assign to ground-truth communities, we find that the 13 definitions naturally group into four distinct classes. These classes correspond to definitions that consider: (1) only internal community connectivity; (2) only external connectivity of the nodes to the rest of the network; (3) both internal and external community connectivity; and (4) network modularity.

We then consider an axiomatic approach and define four intuitive properties that communities would ideally have. Intuitively, a “good” community is cohesive, compact, and internally well connected while being also well separated from the rest of the network. This allows us to characterize which connectivity patterns a given structural definition detects and which ones it misses. We also investigate the robustness of community scoring functions based on four types of randomized perturbation strategies. Overall, evaluation shows that scoring functions that are based on triadic closure [29] and the conductance score [27] best capture the structure of ground-truth communities.

Last, we also investigate a task of detecting communities from a single seed node. The task is to discover all members of a community from a single seed member node. We extend the local spectral clustering algorithm [2] into a parameter-free community detection method that scales to networks of hundreds of millions of nodes. Our method recovers ground-truth communities with 30% relative improvement in the F1-score over the current local graph partitioning methods.

To the best of our knowledge our work is the first to use social and information networks with explicit community memberships to define an evaluation methodology for comparing network community detection methods based on their accuracy on real data. We believe that the present work will bring more rigor to the standard for the evaluation of community detection methods. All our datasets can be downloaded at http://snap.stanford.edu.

II. COMMUNITY SCORING FUNCTIONS AND DATASETS

We start by describing the network datasets and our proposed functional definitions of ground-truth communities. Then we continue with outlining 13 commonly used structural definitions of network communities.

Networks with ground-truth communities. Overall we consider 230 large social, collaboration and information networks, where for each network we have a graph and a set of functionally defined ground-truth communities. Members of these ground-truth communities share a common function, property or purpose. Networks that we study come from a wide range of domains and sizes. Table I gives the networks.

TABLE I. 230 social, collaboration and information networks with explicit ground-truth communities. N: number of nodes, E: number of edges, C: number of communities, S: average community size, A: community memberships per node. Ning statistics are aggregated over 225 different subnetworks.

| Dataset | N | E | C | S | A |
|---|---|---|---|---|---|
| LiveJournal | 4.0M | 34.9M | 311,782 | 40.06 | 3.09 |
| Friendster | 117.7M | 2,586.1M | 1,449,666 | 26.72 | 0.32 |
| Orkut | 3.0M | 117.2M | 8,455,253 | 34.86 | 95.9 |
| Ning (225 nets) | 7.0M | 35.5M | 137,177 | 46.89 | 0.92 |
| Amazon | 0.33M | 0.92M | 49,732 | 99.86 | 14.83 |
| DBLP | 0.42M | 1.34M | 2,547 | 429.79 | 2.56 |

First, we consider online social networks (the LiveJournal blogging community [4], the Friendster online network [20], and the Orkut social network [20]) where users create explicit functional groups which others then join and share content. These groups are created based on specific topics, interests, hobbies and geographical regions. For example, LiveJournal categorizes groups into the following types: culture, entertainment, expression, fandom, gaming, life/style, life/support, sports, student life and technology. Similarly, in other social networks considered in this study users define topical communities that others then join. We consider each such explicit interest-based group as a ground-truth community.

Similarly, we have a set of 225 different online social networks [13] that are all hosted by the Ning platform. It is important to note that each Ning network is a separate social network, that is, an independent website with a separate user community. For example, the NBA team Dallas Mavericks and the diabetes patients network TuDiabetes both use Ning to host their separate online social networks. After joining a specific network, users then create and join groups. For example, in the TuDiabetes Ning network, groups form around specific types of diabetes, different age groups, and similar. Note that these are exactly the properties around which we expect communities to form in a network of diabetes patients. Again, we use such explicitly defined functional groups as ground-truth communities.

The second type of network we consider is the Amazon product co-purchasing network [16]. The nodes of the network represent products and edges link commonly co-purchased products. Each product (i.e., node) belongs to one or more hierarchically organized product categories and products from the same category define a group which we view as a ground-truth community. Note that here the definition of ground-truth is somewhat different. In this case, nodes that belong to a common ground-truth community share a common function or purpose.

Finally, we also consider the DBLP scientific collaboration network [4] where nodes represent authors and edges connect authors that have co-authored a paper. To define ground-truth in this setting we reason as follows. Communities in a scientific domain correspond to people working in common areas and subareas of science. However, note that publication venues serve as good proxies for scientific areas: people publishing in the same conference form a scientific community. Thus we use publication venues (i.e., conferences) as ground-truth communities which serve as proxies for highly overlapping scientific communities around which the collaboration network then organizes.

All our networks and the corresponding ground-truths are complete and publicly available at http://snap.stanford.edu/data. The results we present here are consistent and robust across a wide range of networks and across an even wider range of groups. This gives further evidence that our approach is general and well-founded.

Our work is consistent with the premise that is implicit in all community detection works: members of structural communities share some functional role or property that serves as an organizing principle of the network. Here we use functionally defined groups as labeled ground-truth communities.

Note that our work is fundamentally different from Ahn et al. [1], who evaluated communities with attribute based node-node similarity of the members. This approach, for example, folds all social dimensions (family, school, interests) around which separate communities form into one similarity metric [19]. In contrast, we do not use node similarity to define communities. Rather, we harness explicitly labeled functional groups as labels of ground-truth communities.

Data preprocessing. To represent all networks in a consistent way we consider each network as an unweighted undirected static graph. Because members of the group may be disconnected in the network, we consider each connected component of the group as a separate ground-truth community. However, we allow ground-truth communities to be nested and to overlap.

Community scoring functions. We now proceed to discuss various scoring functions that characterize how community-like the connectivity structure of a given set of nodes is. The idea is that given a community scoring function, one can then find sets of nodes with high score and consider these sets as communities. All scoring functions build on the intuition that communities are sets of nodes with many connections between the
members and few connections from the members to the rest of the network. There are many possible ways to mathematically formalize this intuition. We gather 13 commonly used scoring functions, or equivalently, 13 structural definitions of network communities. Some scoring functions are well known in the literature, while others are proposed here for the first time.

Given a set of nodes $S$, we consider a function $f(S)$ that characterizes how community-like the connectivity of nodes in $S$ is. Let $G(V,E)$ be an undirected graph with $n = |V|$ nodes and $m = |E|$ edges. Let $S$ be the set of nodes, where $n_S$ is the number of nodes in $S$, $n_S = |S|$; $m_S$ the number of edges in $S$, $m_S = |\{(u,v) \in E : u \in S, v \in S\}|$; $c_S$ the number of edges on the boundary of $S$, $c_S = |\{(u,v) \in E : u \in S, v \notin S\}|$; and $d(u)$ the degree of node $u$.

We consider 13 scoring functions $f(S)$ that capture the notion of quality of a network community $S$. The experiments we will present later reveal that scoring functions naturally group into the following four classes:

(A) Scoring functions based on internal connectivity:
- Internal density: $f(S) = \frac{m_S}{n_S(n_S-1)/2}$ is the internal edge density of the node set $S$ [24].
- Edges inside: $f(S) = m_S$ is the number of edges between the members of $S$ [24].
- Average degree: $f(S) = \frac{2m_S}{n_S}$ is the average internal degree of the members of $S$ [24].
- Fraction over median degree (FOMD): $f(S) = \frac{|\{u : u \in S, |\{(u,v) : v \in S\}| > d_m\}|}{n_S}$ is the fraction of nodes of $S$ that have internal degree higher than $d_m$, where $d_m$ is the median value of $d(u)$ in $V$.
- Triangle Participation Ratio (TPR): $f(S) = \frac{|\{u : u \in S, \{(v,w) : v,w \in S, (u,v) \in E, (u,w) \in E, (v,w) \in E\} \neq \emptyset\}|}{n_S}$ is the fraction of nodes in $S$ that belong to a triad.

(B) Scoring functions based on external connectivity:
- Expansion measures the number of edges per node that point outside the cluster: $f(S) = \frac{c_S}{n_S}$ [24].
- Cut Ratio is the fraction of existing edges (out of all possible edges) leaving the cluster: $f(S) = \frac{c_S}{n_S(n-n_S)}$ [9].

(C) Scoring functions that combine internal and external connectivity:
- Conductance: $f(S) = \frac{c_S}{2m_S+c_S}$ measures the fraction of total edge volume that points outside the cluster [27].
- Normalized Cut: $f(S) = \frac{c_S}{2m_S+c_S} + \frac{c_S}{2(m-m_S)+c_S}$ [27].
- Maximum-ODF (Out Degree Fraction): $f(S) = \max_{u \in S} \frac{|\{(u,v) \in E : v \notin S\}|}{d(u)}$ is the maximum fraction of edges of a node in $S$ that point outside $S$ [8].
- Average-ODF: $f(S) = \frac{1}{n_S} \sum_{u \in S} \frac{|\{(u,v) \in E : v \notin S\}|}{d(u)}$ is the average fraction of edges of nodes in $S$ that point out of $S$ [8].
- Flake-ODF: $f(S) = \frac{|\{u : u \in S, |\{(u,v) \in E : v \in S\}| < d(u)/2\}|}{n_S}$ is the fraction of nodes in $S$ that have fewer edges pointing inside than to the outside of the cluster [8].

(D) Scoring function based on a network model:
- Modularity: $f(S) = \frac{1}{4}(m_S - E(m_S))$ is the difference between $m_S$, the number of edges between nodes in $S$, and $E(m_S)$, the expected number of such edges in a random graph with identical degree sequence [21].

[Figure 1. Clusters based on correlations of community scoring functions. Node labels: Conductance, Flake-ODF, Normalized Cut, Max ODF, Avg ODF, Avg Deg, FOMD, Edges Inside, TPR, Internal Density, Modularity, Expansion, Cut Ratio.]

Experimental result: Four classes of scoring functions. Next we examine the relationships among the 13 community scoring functions we introduced. For each of the 10 million ground-truth communities in our networks, we compute a score using each of the 13 scoring functions. We then create a correlation matrix of scoring functions and threshold it. Fig. 1 shows connections between scoring functions with correlation $\geq 0.6$ (on the LiveJournal network). We observe that scores naturally group into four clusters. This means that scoring functions of the same cluster return heavily correlated values and quantify the same aspect of connectivity structure. Overall, none of the scoring functions are negatively correlated, which means that none of them systematically disagree. Interestingly, Modularity is not correlated with any other scoring function (Avg. degree is the most correlated at 0.05 correlation). We observe similar results in all other datasets.

The experiment demonstrates that even though many different structural definitions of communities have been proposed, these definitions are heavily correlated. Essentially there are only 4 different structural notions of network communities as revealed by Fig. 1. For brevity, in the rest of the paper we present results for 6 representative scoring functions (denoted as blue nodes in Fig. 1): 4 from the two large clusters and 2 from the two small clusters.

We also note that here we computed the values of the 13 scores on ground-truth communities. In reality the aim of community detection is to find sets of nodes that maximize a given scoring function. Exact maximization of these functions is typically NP-hard and leads to its own set of interesting problems. (Refer to [17] for discussion.)

III. EVALUATION OF COMMUNITY SCORING FUNCTIONS

The second main purpose of the paper is to develop an evaluation methodology for network community detection. Based on ground-truth communities we now aim to compare and evaluate different community scoring functions.

Community goodness metrics. Our goal is to rank different structural definitions of a network community (i.e., community scoring functions) by their ability to detect ground-truth communities. We adopt the following axiomatic approach. First, we define four community “goodness” metrics that formalize the intuition that “good” communities are both compact and well connected internally while being relatively well-separated from the rest of the network.

The difference between community scoring functions from Section II and the goodness metrics defined here is that a community scoring function quantifies how community-like a set is, while a goodness metric in an axiomatic way quantifies a desirable property of a community. A set with a high goodness metric does not necessarily correspond to a community, but a set with a high community score should have a high value on one or more goodness metrics. In other words, the goodness metrics shed light on various (in many cases mutually exclusive) aspects of the network community structure.

Using the notation from Section II, we define four goodness metrics $g(S)$ for a node set $S$:
- Separability captures the intuition that good communities are well-separated from the rest of the network [27], [9], meaning that they have relatively few edges pointing from set $S$ to the rest of the network. Separability measures the ratio between the internal and the external number of edges of $S$: $g(S) = \frac{m_S}{c_S}$.
- Density builds on the intuition that good communities are well connected [9]. It measures the fraction of the edges (out of all possible edges) that appear between the nodes in $S$: $g(S) = \frac{m_S}{n_S(n_S-1)/2}$.
- Cohesiveness characterizes the internal structure of the community. Intuitively, a good community should be internally well and evenly connected, i.e., it should be relatively hard to split a community into two subcommunities. We characterize this by the conductance of the internal cut. Formally, $g(S) = \min_{S' \subset S} \phi(S')$ where $\phi(S')$ is the conductance of $S'$ measured in the subgraph induced by $S$. Intuitively, conductance measures the ratio of the edges in $S'$ that point outside the set and the edges inside the set $S'$. A good community should have high cohesiveness (high internal conductance) as it should require deleting many edges before the community would be internally split into disconnected components [17].
- Clustering coefficient is based on the premise that network communities are manifestations of locally inhomogeneous distributions of edges, because pairs of nodes with common neighbors are more likely to be connected with each other [29].

Experimental setup. We are interested in quantifying how “good” the communities chosen by a particular scoring function $f(S)$ are, by evaluating their goodness metric. We formulate our experiments as follows:
- For each of 230 networks, we have a set of ground-truth communities $S_i$.
- For each community scoring function $f(S)$, we rank the ground-truth communities by decreasing score $f(S_i)$.
- We measure the cumulative running average value of the goodness metric $g(S)$ of the top-$k$ ground-truth communities (under the ordering induced by $f(S_i)$).

The intuition for the experiments is the following. A perfect community scoring function would rank the communities in decreasing order of the goodness metric, and thus the cumulative running average of the goodness metric would decrease monotonically with $k$. In contrast, if a hypothetical community scoring function ranked the communities randomly, then the cumulative running average would be a constant function of $k$.

Experimental results. We found qualitatively similar results on all our datasets. Here we only present results for the LiveJournal network. Results are representative for all other networks. We point the reader to the extended version of the paper [31] for a complete set of results.

[Figure 2. Cumulative average of goodness metrics for LiveJournal communities ranked by each of the six representative scoring functions. Panels: (a) Separability, (b) Density, (c) Cohesiveness, (d) Clustering coefficient. Horizontal axis: rank $k$. Curves: C, T, M, F, D, CR, U.]

Figure 2(a) shows the results by plotting the cumulative running average of separability for LiveJournal ground-truth communities ranked by each of the six community scoring functions. Curve “U” presents the upper bound, i.e., it plots the cumulative running average of separability when ground-truth communities are ordered by decreasing separability. We observe that Conductance (C) and Cut Ratio (CR) give near optimal performance, i.e., they nearly perfectly order the ground-truth communities by separability. On the other hand, we observe that Triad Participation Ratio (T) and Modularity (M) score ground-truth communities in the inverse order of separability (especially for $k < 100$), which means that they both prefer densely linked sets of nodes.

Similarly, Figures 2(b), (c), and (d) show the cumulative running average of community density, cohesiveness and clustering coefficient. We observe that all scoring functions (except Modularity) rank denser, more cohesive and more clustered ground-truth communities higher. For the density metric, the Fraction over median degree (D) score performs best for high values of $k$, followed by Conductance (C) and Flake-ODF (F). In terms of cohesiveness and clustering coefficient, the Triad Participation Ratio (T) score gives by far the best results. In all cases the only exception is Modularity, which ranks the communities in nearly reverse order of the goodness metric (the cumulative running average increases as a function of $k$). We note that these are all well-known issues of Modularity [10] but they get further attenuated when tested on ground-truth communities.

The curves in Figure 2 illustrate the ability of the scoring functions to rank communities. To quantify this we perform the following experiment. For a given goodness metric $g$ and for each scoring function $f$, we measure the rank of each scoring function in comparison to other scoring functions at every value of $k$. For example, in Figure 2(a), the rank at $k = 100$ of Conductance is 1, Cut Ratio 2, Flake-ODF 3, FOMD 4, Modularity 5, and TPR 6. For every $k$, we rank the scores and compute the average rank over all values of $k$, which quantifies the ability of the scoring function to identify communities with a high goodness metric.

TABLE II. Average scoring function rank for each goodness metric.

| Scoring function | Separability | Density | Cohesiveness | Clustering |
|---|---|---|---|---|
| Conductance (C) | 1.0 | 3.5 | 3.4 | 3.1 |
| Flake-ODF (F) | 3.9 | 3.6 | 3.5 | 4.3 |
| FOMD (D) | 4.9 | 3.0 | 2.9 | 2.9 |
| TPR (T) | 4.5 | 2.3 | 2.1 | 1.2 |
| Modularity (M) | 4.0 | 5.5 | 5.7 | 3.9 |
| Cut Ratio (CR) | 2.6 | 3.1 | 3.2 | 5.5 |

Table II shows the average rank for each score and each goodness metric. An average rank of 1 means that a particular score always outperforms the other scores, while a rank of 6 means that the score gives the worst ranking out of all 6 scores. We observe that Conductance (C) performs best in terms of Separability but relatively poorly in the other three metrics. For Density, Cohesiveness and Clustering coefficient, Triad Participation Ratio (T) is the best. Perhaps not surprisingly, Triad Participation Ratio scores badly on Separability of ground-truth communities. Thus, Conductance is able to identify well-separated communities, but performs poorly in identifying dense and cohesive sets of nodes with high clustering coefficient. On the other hand, Triad Participation Ratio performs poorly in terms of Separability but scores the best for the other three metrics.

We conclude that depending on the network different definitions of network communities might be appropriate. When the network contains well-separated non-overlapping communities, Conductance is the best scoring function. When the network contains dense heavily overlapping communities, then the Triad Participation Ratio defines the most appropriate notion of a community. Further research is needed to identify the most appropriate structural definitions of communities for various types of networks and types of functional communities. For example, in social networks we have both identity-based as well as bond-based communities [25] and they may in fact have different structural signatures.

Lastly, in Figure 2 we also observe that the average goodness metric of the top $k$ communities remains flat but then quickly degrades. We observe the same pattern in all our datasets. Thus, for the remainder of the paper we focus our attention on a set of the top 5,000 communities of each network based on the average rank over the 6 scores.

IV. ROBUSTNESS OF COMMUNITY SCORING FUNCTIONS

In this section, we evaluate community scoring functions using a set of perturbation strategies. We develop a set of strategies to generate randomized perturbations of ground-truth communities, which allows us to investigate robustness and sensitivity of community scoring functions. Intuitively, a good community scoring function should be such that it is stable under small perturbations of the ground-truth community but degrades quickly with larger perturbations.

Our reasoning is as follows. We desire a community scoring function that scores well when evaluated on a ground-truth community but scores low when evaluated on a perturbed community. In other words, an ideal community scoring function should give a maximal value when evaluated on the ground-truth community. If we consider a slightly perturbed ground-truth community (i.e., a node set that differs very slightly from the ground-truth community), we would want the score to be nearly as good as the score of the original ground-truth community. This would mean that the scoring function is robust to noise. However, if the ground-truth community is perturbed so much that it resembles a random set of nodes, then a good scoring function should give it a low score.

Community perturbation strategies. We proceed by defining a set of community perturbation strategies. To vary the amount of perturbation, each perturbation strategy has a single parameter $p$ that controls the intensity of the perturbation. Given $p$ and a ground-truth community defined by its members $S$, the community perturbation starts with $S$ and then modifies it (i.e., changes its members) by executing the perturbation strategy $p|S|$ times. We define the following perturbation strategies:
- NODESWAP perturbation is based on the mechanism where the community memberships diffuse from the original community through the network. We achieve this by picking a random edge $(u,v)$ where $u \in S$ and $v \notin S$ and then swapping the memberships (i.e., removing $u$ from $S$ and adding $v$). Note that NODESWAP preserves the size of $S$ but if $v$ is not connected to the nodes in $S$,
then NODESWAP makes $S$ disconnected.

RANDOM takes community members and replaces them with random non-members. We pick a random node $u \in S$ and a random $v \notin S$ and then swap the memberships. Like NODESWAP, RANDOM maintains the size of $S$ but may disconnect $S$. Generally, RANDOM will degrade the quality of $S$ faster than NODESWAP, since NODESWAP only affects the "fringe" of the community.

EXPAND perturbation grows the membership set $S$ by expanding it at the boundary. We pick a random edge $(u, v)$ where $u \in S$ and $v \notin S$ and add $v$ to $S$. Adding $v$ to $S$ will generally decrease the quality of the community. EXPAND preserves the connectedness of $S$ but increases the size of $S$.

SHRINK removes members from the community boundary. We pick a random edge $(u, v)$ where $u \in S$, $v \notin S$ and remove $u$ from $S$. SHRINK will decrease the quality of $S$ and result in a smaller community while preserving connectedness.

For a given $S$, let $h(S, p)$ denote a perturbed version of the community generated by the perturbation $h$ of intensity $p$. We now quantify the difference of the score between the unperturbed ground-truth community and its perturbation. We use the Z-score, which measures in the units of standard deviation how much the scoring function changes as a function of perturbation intensity $p$. For ground-truth communities $S_i$, the Z-score $Z(f, h, p)$ of community scoring function $f$ under perturbation strategy $h$ with intensity $p$ is:

\[ Z(f, h, p) = \frac{E_i\left[ f(h(S_i, p)) - f(S_i) \right]}{\sqrt{\mathrm{Var}_i\left[ f(h(S_i, p)) \right]}} \]

where $E_i[\cdot]$ and $\mathrm{Var}_i[\cdot]$ are the mean and the variance over communities $S_i$, and $f(h(S_i, p))$ is the community score of perturbed $S_i$ under perturbation $h$ with intensity $p$. To measure $f(h(S_i, p))$, we run 20 trials of $h(S_i, p)$ and compute the average value of $f$. The Z-score is the difference between the average score of the perturbed communities $f(h(S_i, p))$ and the average score of the true communities $f(S_i)$, normalized by the standard deviation of the scores of the perturbed communities. Since the $f(h(S_i, p))$ are independent for each $i$, $E_i[f(h(S_i, p))]$ follows a Normal distribution by the Central Limit Theorem. Thus, $P(z < Z(f, h, p))$ gives the probability that $E_i[f(h(S_i, p))] > E_i[f(S_i)]$, where $z$ is a standard normal random variable. We measure $f$ so that lower values mean better communities, i.e., we add a negative sign to TPR, Modularity and FOMD. High Z-scores mean that $E_i[f(S_i)]$ is likely to be smaller than $E_i[f(h(S_i, p))]$ and that $S_i$ is better than $h(S_i, p)$ in terms of $f$.

Experimental results. For each of the 6 community scoring functions, we measure the Z-score for perturbation intensity $p$ ranging between 0.01 and 0.6. This means that we randomly swap between 1% and 60% of the community members and measure the Z-score for each scoring function. For small $p$, small Z-scores are desirable since they indicate that the scoring function is robust to noise. For high perturbation intensities $p$, high Z-scores are preferred because this suggests that the community scoring function is sensitive, i.e., as the community becomes more "random" we want the scoring function to significantly increase its value.

Figure 3 shows the Z-scores of LiveJournal communities as a function of perturbation intensity $p$. We plot the Z-score for each of the 6 community scoring functions. As expected, the Z-scores increase with $p$, which means that as the community gets more perturbed, its quality as measured by the score tends to decrease. However, the faster the increase, the more sensitive and thus the better the score. For example, under the NODESWAP perturbation Conductance (C) exhibits the highest Z-score after $p > 0.2$, and it has the steepest curve. Triad Participation Ratio (T) also exhibits desirable behavior. On the other hand, the Modularity (M) score does not change much as we perturb the ground-truth communities. This means that Modularity is not good at distinguishing true communities from randomized sets of nodes. We note very similar results on all of the remaining datasets considered in this study. Refer to the extended version for details [31].

[Figure 3: four panels, (a) NODESWAP, (b) RANDOM, (c) EXPAND, (d) SHRINK; x-axis: perturbation intensity (0 to 0.6); y-axis: Z-score; curves: C, T, M, F, D, CR.]
Figure 3. Z-scores as a function of the perturbation intensity. Conductance (C) and Triad Participation Ratio (T) best detect the perturbations of LiveJournal ground-truth communities.

Scoring function    NodeSwap   Random   Expand   Shrink
Conductance (C)     1.06*      1.59     0.50     0.45*
Flake-ODF (F)       0.51       1.15     0.11     0.41
FOMD (D)            0.18       0.57     0.19     0.12
TPR (T)             0.37       1.85*    0.74*    0.21
Modularity (M)      0.23       0.14     0.03     0.15
Cut Ratio (CR)      0.53       0.83     0.13     0.43
Table III. AVERAGE ABSOLUTE INCREMENT OF THE Z-SCORE BETWEEN SMALL AND LARGE COMMUNITY PERTURBATIONS. BEST PERFORMING SCORES ARE MARKED WITH *.

Sensitivity of community scoring functions. We also quantify the sensitivity of community scoring functions by computing the increase of the Z-score between small ($p = 0.05$) and large ($p = 0.2$) perturbations. As noted above, we prefer a community scoring function with a fast increase of the Z-score as the community perturbation intensity increases. Table III displays the difference of the Z-score between a large and a small perturbation: $Z(f, h, 0.2) - Z(f, h, 0.05)$. We compute the average increment across all the 230 networks. A high value of the increment means that the score is both robust and sensitive: it maintains a low Z-value even at small perturbation ($p = 0.05$), while at large perturbation ($p = 0.2$) it has a high Z-value, and thus the overall Z-score increment is high. Conductance is the most robust score under NODESWAP and SHRINK. The Triad Participation Ratio (T) is the most robust under RANDOM and EXPAND; in both of these cases Conductance follows closely.
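As an illustration of the Z-score and of the sensitivity increment $Z(f, h, 0.2) - Z(f, h, 0.05)$ described above, the following is a minimal Python sketch under stated assumptions; it is not the authors' implementation. The helper names (`ring_of_cliques`, `conductance`, `random_perturb`, `z_score`) and the toy graph are hypothetical. Conductance is taken in the common form $c_S / (2 m_S + c_S)$, with $c_S$ the number of boundary edges and $m_S$ the number of internal edges. The RANDOM perturbation is simplified to a single batch swap of about $p\,|S|$ members, and the population variance is assumed for $\mathrm{Var}_i$.

```python
import random
import statistics


def ring_of_cliques(sizes):
    """Toy graph (hypothetical): cliques joined in a ring by single edges."""
    adj, communities, start = {}, [], 0
    for size in sizes:
        nodes = list(range(start, start + size))
        for u in nodes:
            adj[u] = {v for v in nodes if v != u}
        communities.append(set(nodes))
        start += size
    firsts = [min(c) for c in communities]
    for a, b in zip(firsts, firsts[1:] + firsts[:1]):
        adj[a].add(b)
        adj[b].add(a)
    return adj, communities


def conductance(adj, members):
    """c_S / (2*m_S + c_S); lower values mean better communities."""
    members = set(members)
    cut = 0        # edges with exactly one endpoint in S
    internal2 = 0  # internal edges, each counted once per endpoint
    for u in members:
        for v in adj[u]:
            if v in members:
                internal2 += 1
            else:
                cut += 1
    total = internal2 + cut
    # Assumption: a set with no incident edges gets the worst score.
    return cut / total if total else 1.0


def random_perturb(adj, members, p, rng):
    """Simplified RANDOM: swap about p*|S| members for random non-members."""
    members = set(members)
    outside = [v for v in adj if v not in members]
    n_swap = min(round(p * len(members)), len(outside))
    removed = rng.sample(sorted(members), n_swap)
    added = rng.sample(outside, n_swap)
    return (members - set(removed)) | set(added)


def z_score(adj, communities, f, h, p, trials=20, seed=0):
    """(mean perturbed score - mean true score) / std of perturbed scores."""
    rng = random.Random(seed)
    original = [f(adj, S) for S in communities]
    # Average f over several perturbation trials for each community.
    perturbed = [
        statistics.mean(f(adj, h(adj, S, p, rng)) for _ in range(trials))
        for S in communities
    ]
    spread = statistics.pstdev(perturbed)  # assumption: population std
    if spread == 0:
        raise ValueError("perturbed scores have zero variance")
    return (statistics.mean(perturbed) - statistics.mean(original)) / spread


adj, communities = ring_of_cliques([5, 6, 7, 8])
z_small = z_score(adj, communities, conductance, random_perturb, p=0.05)
z_large = z_score(adj, communities, conductance, random_perturb, p=0.2)
print("Z-score increment:", z_large - z_small)
```

On this toy graph the communities are so small that $p = 0.05$ rounds to zero swaps, so the small-perturbation Z-score is zero and the increment equals the Z-score at $p = 0.2$. On real networks both terms would be non-trivial, and the increment would be averaged across networks as in Table III.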
Algorithm 1 Community detection from a seed node
Require: Graph $G(V, E)$, seed node $s$, scoring function $f$
(1) Compute random walk scores $r_u$ from seed node $s$ using PageRank-Nibble [2].
(2) Order nodes $u$ by the decreasing value of $r_u / d(u)$, where $d(u)$ is the degree of $u$.
(3) Compute the community scoring function $f(S_k)$ of the first $k$ nodes, $f_k = f(S_k = \{u_i \mid i \le k\})$, for every $k$.
(4) Detect local minima of $f(S_k)$ and detect one or more communities:
    if we want to detect one community then
        Find the index $k^*$ at the first local optimum of $f_k$.
        return $\hat{S} = \{u_i \mid i \le k^*\}$
    else
        Find the indices $k^*_j$ at every local optimum of $f_k$.
        return $\hat{S}_j = \{u_i \mid i \le k^*_j\}$
    end if

V. DISCOVERING COMMUNITIES FROM A SEED NODE

Now we focus on the task of inferring communities given a single seed node. We consider two tasks that build on two different viewpoints. The first task is motivated by a community-centric view where we discover all members of community $S$ given a single member $s \in S$. The second task is motivated by a node-centric view where we want to discover all communities that a single node $s$ belongs to. This means we discover both the number of communities $s$ belongs to as well as the members of these communities.

Proposed method. We extend the local spectral clustering algorithm [28], [3] into a scalable parameter-free community detection method. The benefits of our method are: First, the method requires no input parameters and is able to automatically detect the number of communities as well as the members of those communities. Second, the computational cost of our method is proportional to the size of the detected community (not the size of the network). Thus, our method is scalable to networks with hundreds of millions of nodes.

Our method (Algorithm 1) builds on the findings in Sections III and IV: First, we aim to find sets of well-connected nodes around node $s$. We achieve this by defining a local partitioning method based on random walks starting from a single seed node [2]. In particular, we use the PageRank-Nibble random walk method that computes the PageRank vector with error $\varepsilon$ in time $O(1/\varepsilon)$, independent of the network size [3]. The nodes with high PageRank scores from $s$ correspond to the well-connected nodes around $s$. Moreover, the random walk is "truncated" as it sets PageRank scores $r_u$ to 0 for nodes $u$ with $r_u < \varepsilon$, for some small constant $\varepsilon$ [2]. This way the computational cost is proportional to the size of the detected community and not the size of the network.

After PageRank-Nibble assigns the proximity scores $r_u$, we sort the nodes in decreasing proximity $r_u$ and proceed to the second step of our algorithm, which extends the approach of Spielman and Teng [28]. We evaluate the community score on a set $S_k$ of all the nodes up to the $k$-th one (note that by construction $S_{k-1} \subset S_k$). This means that for a chosen community scoring function $f$ we compute $f(S_k)$ of the set $S_k$ that is composed of the top $k$ nodes with the highest random walk score $r_u$. The local minima of the function $f(S_k)$ then correspond to extracted communities.

[Figure 4: log-log plot; x-axis: $k$ ($10^0$ to $10^3$); y-axis: value ($10^{-3}$ to $10^0$); curves: $f$, $f'$; markers: $f(S_{k^*})$, $f'(S_{k^*})$.]
Figure 4. Two community scoring functions $f$ (Conductance) and $f'$ (Triad Participation Ratio) evaluated on a set $S_k$ of top $k$ nodes with highest random walk proximity score to seed node $s$. Local optima of $f(S_k)$ correspond to detected communities.

We detect local minima of $f(S_k)$ using the following heuristic. For increasing $k = 1, 2, \ldots$, we measure $f(S_k)$. At some point $k^*$, $f(S_k)$ will stop decreasing and this $k^*$ becomes our "candidate point" for a local minimum. If $f(S_k)$ keeps increasing after $k^*$ and eventually becomes higher than $\alpha f(S_{k^*})$, we take $k^*$ as a valid local minimum. However, if $f(S_k)$ goes down again before it reaches $\alpha f(S_{k^*})$, we discard the candidate $k^*$. We experimented with several values of $\alpha$ and found that $\alpha = 1.2$ gives good results across all the datasets.

For example, Fig. 4 plots $f(S_k)$ for two community scoring functions $f$ (Conductance) and $f'$ (Triad Participation Ratio). We identify the local optima (denoted by stars and squares) and use the nodes in the corresponding sets $S_k$ as the detected communities. Note that our method can detect multiple communities that the seed node belongs to by identifying different local minima of $f(S_k)$. However, we assume that the communities are nested (smaller communities are contained in the larger ones) even though the ground-truth communities may not necessarily follow such a nested structure. Also, note that our method is parameter-free.

Our method differs from local graph clustering approaches [2], [28] in two important aspects. First, instead of sweeping only using Conductance, we consider sweeping using other scoring functions. Second, we find the local optima of the sweep curve instead of the global optimum; this change gives a large improvement over the conventional local spectral clustering approaches [2], [28].

Detecting a community from a single member. We first consider the task where we aim to reconstruct a single ground-truth community $S$ based on one member node $s$. For each community $S$, we pick a random member node $s$ as a seed node and compare the community we detect from $s$ with the ground-truth community $S$. Starting from node $s$, we generate a sweep curve $f(S_k)$. Let $k^*$ be the value of $k$ where $f(S_k)$ achieves the first local minimum. We then use the set $S_{k^*}$ as the detected community. Now, given the ground-truth community $S$ and the detected community $S_{k^*}$, we evaluate the precision, the recall and the F1-score.

We consider 6 community scoring functions $f(\cdot)$. We compare the performance of our method to two standard community detection methods: Local Spectral clustering (LC) [2], and the 3-clique Clique Percolation Method (CPM) [23].

F1-score    C     F     D     T     M     CR    LC    CPM
LJ          0.64  0.64  0.62  0.57  0.15  0.61  0.54  0.43
FS          0.23  0.22  0.24  0.25  0.24  0.18  0.13  0.14
Orkut       0.21  0.19  0.19  0.18  0.20  0.09  0.20  0.13
Ning        0.24  0.19  0.10  0.19  0.08  0.19  0.17  0.11
Amazon      0.87  0.75  0.73  0.79  0.06  0.85  0.74  0.85
DBLP        0.61  0.61  0.65  0.66  0.04  0.61  0.46  0.53
Avg. F1     0.46  0.43  0.42  0.44  0.13  0.42  0.37  0.36
Avg. Prec   0.50  0.53  0.52  0.55  0.13  0.53  0.49  0.38
Avg. Rec    0.60  0.47  0.51  0.47  0.71  0.49  0.65  0.69
Table IV. PERFORMANCE OF OUR 6 METHODS AND 2 BASELINES (LC, CPM) AT DETECTING COMMUNITIES FROM A SEED NODE.

Table IV shows the performance of the proposed method for each scoring function and for the two baselines. The first 6 rows show the F1-score for each of the datasets, and the last 3 rows show the average F1-score, precision and recall over all the datasets. We observe that Conductance (C) gives the best average F1-score, and outperforms all other scores on LiveJournal (LJ), Orkut, Amazon, and Ning. For Friendster (FS) and DBLP, the Triad Participation Ratio (T) performs best. This agrees with our intuition that for networks, like LiveJournal, that have fewer community overlaps, scoring functions that focus on good separability perform well. In networks where nodes belong to multiple communities (like DBLP, where authors publish at multiple venues), the Triad Participation Ratio (T) performs best. We also note that the average F1-score of Conductance is 0.46, while the baselines CPM and LC achieve F1-scores of only 0.36 and 0.37, respectively. Note this is a 10% absolute and 30% relative improvement over the state-of-the-art baselines.

Last, we observe that some methods detect larger communities than necessary (higher recall, lower precision). Modularity (M) most severely overestimates community size. Conductance (C) and both baselines (LC and CPM) exhibit similar behavior but to a lesser extent. On the contrary, Flake-ODF (F), Fraction over median (D), Triad Participation Ratio (T), and Cut Ratio (CR) tend to underestimate the community size (higher precision than recall).

Detecting all communities that a seed node belongs to. We also explore the second task where we want to detect all the communities to which a given seed node $s$ belongs. In this task, we are given a node $s$ that is a member of multiple communities, but we do not know which and how many communities $s$ belongs to. We detect multiple communities by detecting all the local minima (and not just the first one) of the sweep curve. This way our method detects both the number and the members of communities.

For each dataset, we sample a node $s$, detect communities $\hat{S}_j$, and compare them to the ground-truth communities $S_i$ that node $s$ belongs to. To measure correspondence between the true and the detected communities, we match ground-truth communities to detected communities by the Hungarian matching method [15]. We then compute the average F1-score over the matched pairs. We use Conductance as the community scoring function and report results in Table V.

g                 1     2     3     4     5     All nodes
LJ                0.52  0.59  0.52  0.42  0.38  0.53
FS                0.13  0.10  0.08  0.05  0.02  0.13
Orkut             0.21  0.17  0.13  0.11  0.10  0.20
Ning (225 nets)   0.11  0.09  0.07  0.06  0.05  0.11
Amazon            0.59  0.73  0.69  0.66  0.55  0.61
DBLP              0.34  0.24  0.20  0.21  0.16  0.33
Table V. AVERAGE F-SCORE BETWEEN DETECTED COMMUNITIES AND THE GROUND-TRUTH COMMUNITIES TO WHICH A SEED NODE BELONGS, WHEN THE SEED NODE BELONGS TO g DIFFERENT COMMUNITIES.

Note that this task is harder than the previous one as here we aim to discover multiple communities simultaneously. Whereas the previous task evaluated our method for each ground-truth community, here we first sample node $s$ and then search for the communities $S_i$ that $s$ belongs to. Therefore, larger ground-truth communities will be included in $S_i$ more often. Since larger ground-truth communities are less well separated [18], this makes the task harder.

Table V reports the average F1-score as a function of the number of communities $g$ that the seed node $s$ belongs to. Given that this is a harder task, we observe lower values of the F-score. Intuitively we also expect that the task becomes harder as $s$ belongs to more communities. In fact we observe that the performance degrades with increasing $g$. Interestingly, in LiveJournal and Amazon it appears to be easier to detect communities of nodes that belong to 2 communities than to detect a community of a node that belongs to only a single community. This is due to the fact that single-community nodes reside on the border of the community and consequently Conductance produces communities that are too small [18].

VI. CONCLUSION

The lack of reliable ground-truth gold-standard communities has made network community detection a very challenging task. In this paper, we studied a set of 230 different large social, collaboration and information networks in which we defined the notion of ground-truth communities by nodes explicitly stating their group memberships. We developed an evaluation methodology for comparing network community detection algorithms based on their accuracy on real data, compared different definitions of network communities, and examined their robustness. Our results demonstrate large differences in the behavior of community scoring functions. Last, we also studied the problem of community detection from a single seed node. We examined a class of scalable parameter-free community detection methods based on random walks and found that our methods reliably detect ground-truth communities.

The availability of ground-truth communities allows for a range of interesting future directions. For example, further examining the connectivity structure of ground-truth communities could lead to novel community detection methods [30]. Overall, we believe that the present work will bring more rigor to the evaluation of network community detection, and the datasets publicly released as a part of this work will benefit the research community.

Acknowledgements. This research has been supported in part by NSF IIS-1016909, CNS-1010921, CAREER IIS-1149837, IIS-1159679, DARPA XDATA, DARPA GRAPHS, Albert Yu & Mary Bechmann Foundation, Boeing, Allyes, Samsung, Intel, Alfred P. Sloan Fellowship and the Microsoft Faculty Fellowship.

REFERENCES

[1] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann. Link communities reveal multi-scale complexity in networks. Nature, Oct. 2010.
[2] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using PageRank vectors. In FOCS '06, pages 475–486, 2006.
[3] R. Andersen and K. Lang. Communities from seed sets. In WWW '06, pages 223–232, 2006.
[4] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In KDD '06, pages 44–54, 2006.
[5] L. Danon, J. Duch, A. Diaz-Guilera, and A. Arenas. Comparing community structure identification. J. of Stat. Mech., 2005.
[6] I. Dhillon, Y. Guan, and B. Kulis. Weighted graph cuts without eigenvectors: A multilevel approach. IEEE PAMI, 29(11):1944–1957, 2007.
[7] S. L. Feld. The focused organization of social ties. Am. J. of Sociology, 86(5):1015–1035, 1981.
[8] G. Flake, S. Lawrence, and C. Giles. Efficient identification of web communities. In KDD '00, pages 150–160, 2000.
[9] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.
[10] S. Fortunato and M. Barthélemy. Resolution limit in community detection. PNAS, 104(1):36–41, 2007.
[11] M. Girvan and M. Newman. Community structure in social and biological networks. PNAS, 99(12):7821–7826, 2002.
[12] M. S. Granovetter. The strength of weak ties. Am. J. of Sociology, 78:1360–1380, 1973.
[13] S. Kairam, D. Wang, and J. Leskovec. The life and death of online groups: Predicting group growth and longevity. In WSDM '12, 2012.
[14] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20:359–392, 1998.
[15] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistic Quarterly, 2:83–97, 1955.
[16] J. Leskovec, L. Adamic, and B. Huberman. The dynamics of viral marketing. ACM TWeb, 1(1), 2007.
[17] J. Leskovec, K. Lang, and M. Mahoney. Empirical comparison of algorithms for network community detection. In WWW '10, 2010.
[18] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.
[19] M. McPherson. An ecology of affiliation. American Sociological Review, 48(4):519–532, 1983.
[20] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In IMC '07, pages 29–42, 2007.
[21] M. Newman. Modularity and community structure in networks. PNAS, 103(23):8577–8582, 2006.
[22] M. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E, 69:026113, 2004.
[23] G. Palla, I. Derényi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005.
[24] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi. Defining and identifying communities in networks. PNAS, 101(9):2658–2663, 2004.
[25] Y. Ren, R. Kraut, and S. Kiesler. Applying common identity and bond theory to design of online communities. Organization Studies, 28(3):377–408, 2007.
[26] S. Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, 2007.
[27] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE PAMI, 22(8):888–905, 2000.
[28] D. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC '04, pages 81–90, 2004.
[29] D. Watts and S. Strogatz. Collective dynamics of small-world networks. Nature, 393:440–442, 1998.
[30] J. Yang and J. Leskovec. Community-Affiliation Graph Model for Overlapping Network Community Detection. In ICDM '12, 2012.
[31] J. Yang and J. Leskovec. Defining and Evaluating Network Communities based on Ground-truth. Extended version, 2012.