Mining Attribute-structure Correlated Patterns in Large Attributed Graphs

Arlei Silva
Universidade Federal de Minas Gerais, Belo Horizonte, Brasil
arlei@dcc.ufmg.br

Wagner Meira Jr.
Universidade Federal de Minas Gerais, Belo Horizonte, Brasil
meira@dcc.ufmg.br

Mohammed J. Zaki
Rensselaer Polytechnic Institute, Troy, NY
zaki@cs.rpi.edu

ABSTRACT

In this work, we study the correlation between attribute sets and the occurrence of dense subgraphs in large attributed graphs, a task we call structural correlation pattern mining. A structural correlation pattern is a dense subgraph induced by a particular attribute set. Existing methods are not able to extract relevant knowledge regarding how vertex attributes interact with dense subgraphs. Structural correlation pattern mining combines aspects of the frequent itemset and quasi-clique mining problems. We propose statistical significance measures that compare the structural correlation of attribute sets against their expected values using null models. Moreover, we evaluate the interestingness of structural correlation patterns in terms of size and density. An efficient algorithm that combines search and pruning strategies in the identification of the most relevant structural correlation patterns is presented. We apply our method to the analysis of three real-world attributed graphs: a collaboration, a music, and a citation network, verifying that it provides valuable knowledge in a feasible time.

1. INTRODUCTION

In several real-life graphs, attributes can be associated with vertices in order to represent vertex properties. In social networks, for example, vertex attributes are useful to model personal characteristics. Moreover, vertex attributes can be associated with content (e.g., keywords, tags) in the web graph. Such an extended graph representation, which is called an attributed graph, may support graph patterns that provide relevant knowledge in various application scenarios.

An interesting question related to attributed graphs is how particular attributes are associated with the topology of real graphs. In other words, do there exist patterns that explain how vertex attributes interact with the graph structure? How can we extract and evaluate such patterns? In this paper, we study the problem of correlating attribute sets with an important topological property of graphs, which is the organization of vertices into dense subgraphs. For instance, we aim to address questions such as: How does a particular set of interests induce communities in a social network? What are the communities that emerge around such interests? Such questions are related to important social phenomena such as homophily [11] and influence [2].

Although several definitions of dense subgraphs have been proposed in the literature, most of them do not take vertex attributes into consideration. Furthermore, such definitions do not provide any knowledge regarding how different sets of attributes induce dense subgraphs.

This work studies the correlation between vertex attributes and dense subgraphs, a task we call structural correlation pattern mining. The structural correlation of an attribute set is the probability that a vertex is a member of a dense subgraph in the graph induced by that attribute set. Moreover, a structural correlation pattern is a dense subgraph induced by a particular attribute set.

Figure 1 illustrates a dataset for structural correlation pattern mining. The vertex attributes are given in Figure 1(a) and the graph is shown in Figure 1(b). Example dense subgraphs are shown in Figures 1(c) and 1(d). The structural correlation of the attribute A is 0.82, since 9 out of 11 vertices are covered by dense subgraphs in its induced graph. On the other hand, the structural correlation of C is 0, because there is no dense subgraph inside the graph induced by C. The structural correlation of {A, B} is 1, due to the fact that every vertex is a member of a dense subgraph in the graph induced by {A, B}. The pair ({A, B}, {6, 7, 8, 9, 10, 11}) is an example of a structural correlation pattern, for which the subgraph is shown in Figure 1(d). Another example is the pattern ({A}, {3, 4, 5, 6}), for which the induced subgraph is shown in Figure 1(c).

The structural correlation of attribute sets and the structural correlation patterns are complementary information: while the first is a measure of the correlation between a given attribute set and the occurrence of dense subgraphs, the second provides representatives for such a correlation through specific subgraphs. We formulate structural correlation pattern mining in terms of two existing data mining problems: frequent itemset and quasi-clique mining. Frequent itemset mining [1, 19] is applied to handle the possibly large number of attribute sets from the graph, and quasi-cliques [14, 10] are used as a definition for dense subgraphs.

We study structural correlation pattern mining focusing on two important aspects. The first aspect is the significance of the patterns. More specifically, it is relevant to provide significance measures for the structural correlation of attribute sets and the structural correlation patterns. The second aspect is related to the computational cost of the proposed task. Our objective is to enable the analysis of large real graphs in a feasible time. Although significance and high performance are not necessarily concordant goals, we propose significance metrics that may lead to efficient pruning strategies for structural correlation pattern mining.

    vertex   attributes
    1        A, C
    2        A
    3        A, C, D
    4        A, D
    5        A, E
    6        A, B, C
    7        A, B, E
    8        A, B
    9        A, B
    10       A, B, D
    11       A, B

Figure 1: Structural correlation pattern mining (illustrative example). (a) Vertex attributes (table above); (b) Graph; (c) Dense subgraph; (d) Dense subgraph.

Regarding the significance of patterns, we formulate normalization approaches for structural correlation pattern mining in order to measure the statistical significance of the structural correlation of a given attribute set. The idea is to compare the structural correlation against its expected value, which is provided by a null model. Moreover, we evaluate the structural correlation patterns in terms of size (i.e., number of vertices) and density (i.e., cohesion). Such evaluation is useful to rank the most interesting patterns.

We combine the statistical significance of the structural correlation of attribute sets and the size and density of structural correlation patterns with effective constraints to prune down the search space. Moreover, we propose two strategies for computing the structural correlation of attribute sets efficiently. These pruning and search techniques are integrated into the SCPM (Structural Correlation Pattern Mining) algorithm, which is described and evaluated in this paper. In particular, we apply SCPM to the analysis of three real attributed graphs: collaboration, music, and citation networks. The results show that SCPM is able to extract relevant knowledge regarding how vertex attributes are correlated with dense subgraphs in large attributed graphs.

2. STRUCTURAL CORRELATION PATTERN MINING

2.1 Definitions

2.1.1 Structural Correlation

We define an attributed graph as a 4-tuple $G = (V, E, A, F)$ where $V$ is the set of vertices, $E$ is the set of edges, $A = \{a_1, a_2, \ldots, a_n\}$ is the set of attributes, and $F : V \rightarrow \mathcal{P}(A)$ is a function that returns the set of attributes of a vertex. $\mathcal{P}$ is the power set function. Each vertex $v_i$ in $V$ has a set of attributes $F(v_i) = \{a_{i1}, a_{i2}, \ldots, a_{ip}\}$, where $p = |F(v_i)|$ and $F(v_i) \subseteq A$. Figure 1(b) shows an example of an attributed graph where the vertex attributes are given in Figure 1(a).

Given the set of attributes $A$, we define an attribute set $S$ as a subset of $A$ ($S \subseteq A$). Moreover, we denote by $V(S) \subseteq V$ the vertex set induced by $S$ (i.e., $V(S) = \{v_i \in V \mid S \subseteq F(v_i)\}$) and by $E(S) \subseteq E$ the edge set induced by $S$ (i.e., $E(S) = \{(v_i, v_j) \in E \mid v_i, v_j \in V(S)\}$). The graph $G(S)$, induced by $S$, is the pair $(V(S), E(S))$. We also define a support function $\sigma$, which gives the number of occurrences of an attribute set in the input graph ($\sigma(S) = |V(S)|$), i.e., the number of vertices that contain $S$.

The structural correlation function measures the correlation between a given attribute set and the occurrence of dense subgraphs in an attributed graph. We apply quasi-cliques as a definition for dense subgraphs. Quasi-cliques are a natural extension of the traditional clique definition.

DEFINITION 1. (Quasi-clique) Given a minimum density threshold $\gamma_{min}$ ($0 < \gamma_{min} \leq 1$) and a minimum size threshold $min\_size$, a quasi-clique is a maximal vertex set $Q$ such that for each $v \in Q$, the degree of $v$ in $Q$ is at least $\lceil \gamma_{min} \cdot (|Q| - 1) \rceil$ and $|Q| \geq min\_size$.

Figures 1(c) and 1(d) are examples of a 1-quasi-clique of size 4 and a 0.6-quasi-clique of size 6, respectively, from the graph shown in Figure 1(b). The quasi-clique mining problem consists of identifying the quasi-cliques from a graph considering minimum size and density parameters, a problem known to be #P-hard [14, 17]. We define the structural correlation of an attribute set $S$ as the probability that a vertex $v$ with attribute set $S$ is part of a quasi-clique in $G(S)$.

DEFINITION 2. (Structural correlation function) Given an attribute set $S$, the structural correlation of $S$, $\epsilon(S)$, is given as:

$$\epsilon(S) = \frac{|K_S|}{|V(S)|} \qquad (1)$$

where $K_S$ is the set of vertices in quasi-cliques in $G(S)$.

In the graph from Figure 1, $K_{\{A\}} = \{3, 4, 5, 6, 7, 8, 9, 10, 11\}$, $K_{\{C\}} = \{\}$, and $K_{\{A,B\}} = \{6, 7, 8, 9, 10, 11\}$, and thus the corresponding values of $\epsilon(\{A\})$, $\epsilon(\{C\})$, and $\epsilon(\{A,B\})$ are 0.82, 0, and 1, respectively. Structural correlation measures the dependence between attribute set $S$ and the density of the associated vertices. It indicates how likely vertices with $S$ are to be part of dense subgraphs. Our formulation enables the identification of attributes that induce vertices that are well connected in the graph. In a social network, for instance, such attributes are of great interest since they may be related to homophily or influence. Nevertheless, it is also relevant to understand the dense subgraphs induced by attribute sets. We call a quasi-clique that is homogeneous w.r.t. an attribute set a structural correlation pattern.

DEFINITION 3. (Structural correlation pattern) A structural correlation pattern is a pair $(S, Q)$, where $S$ is an attribute set ($S \subseteq A$), and $Q$ is a quasi-clique from the graph induced by $S$ ($Q \subseteq V(S)$), given the quasi-clique parameters $\gamma_{min}$ and $min\_size$.

The pair ({A}, {3, 4, 5, 6}) is an example of a size 4 structural correlation pattern with density 1 induced by the attribute A in the graph from Figure 1. Another example of a structural correlation pattern is ({A, B}, {6, 7, 8, 9, 10, 11}), which is a size 6 structural correlation pattern with density 0.6 induced by the attribute set {A, B}.

2.1.2 Structural Correlation Pattern Mining Problem

Based on the definitions of structural correlation patterns and the structural correlation function, we formulate the structural correlation pattern mining problem. It comprises the identification of the attribute sets correlated with dense subgraphs and the dense subgraphs induced by such attribute sets. We apply a minimum support threshold $\sigma_{min}$ for attribute sets in order to prune down the number of patterns.

DEFINITION 4. (Structural correlation pattern mining problem) Given an attributed graph $G(V, E, A, F)$, a minimum support threshold $\sigma_{min}$, a minimum quasi-clique density $\gamma_{min}$ and size $min\_size$, and a minimum structural correlation $\epsilon_{min}$, structural correlation pattern mining consists of identifying the set of structural correlation patterns $(S, Q)$ from $G$, such that $S$ is an attribute set for which $\sigma(S) \geq \sigma_{min}$ and $\epsilon(S) \geq \epsilon_{min}$, and $Q$ is a $\gamma_{min}$-quasi-clique for which $Q \subseteq V(S)$ and $|Q| \geq min\_size$.

As an example, we consider the attributed graph shown in Figure 1 and the parameters $\sigma_{min}$, $\gamma_{min}$, $min\_size$, and $\epsilon_{min}$ set to 3, 0.6, 4, and 0.5, respectively. The set of structural correlation patterns is shown in Table 1. For each pattern, we give the pair (attribute set, dense subgraph), the respective quasi-clique size and density ($\gamma$), and the attribute set support ($\sigma$) and structural correlation ($\epsilon$).

    pattern                             size   γ      σ    ε
    ({A}, {6, 7, 8, 9, 10, 11})         6      0.60   11   0.82
    ({A}, {3, 4, 5, 6})                 4      1      11   0.82
    ({A}, {3, 4, 6, 7})                 4      0.67   11   0.82
    ({A}, {3, 5, 6, 7})                 4      0.67   11   0.82
    ({A}, {3, 6, 7, 8})                 4      0.67   11   0.82
    ({B}, {6, 7, 8, 9, 10, 11})         6      0.60   6    1.0
    ({A, B}, {6, 7, 8, 9, 10, 11})      6      0.60   6    1.0

Table 1: Patterns from the graph shown in Figure 1

Similar to quasi-clique mining, structural correlation pattern mining is #P-hard [17]. This is because the quasi-clique mining problem can be reduced to structural correlation pattern mining by assigning the same attribute to each vertex from the graph and setting $\sigma_{min}$ to 1.

Structural correlation pattern mining is based on the structural correlation function, which measures how a given attribute set is associated with the occurrence of dense subgraphs in an attributed graph. However, it is important to assess the significance/interestingness of a given structural correlation, which is the subject of the next section.

2.1.3 Statistical Significance of the Structural Correlation

Given the structural correlation of an attribute set, how can we evaluate it? In other words, what can be considered a high or low structural correlation? In this section, we address such questions by proposing null models for structural correlation. These models specify the expected structural correlation of an attribute set assuming that the correlation between vertex attributes and dense subgraphs is random. Normalized structural correlation measures how the structural correlation of an attribute set deviates from its expected value, and allows us to assess the statistical significance of a given structural correlation value.

DEFINITION 5. (Normalized structural correlation) Given an attribute set $S$ with support $\sigma(S)$ and a function $\epsilon_{exp}$, which gives the expected structural correlation of an attribute set based on its support and the attributed graph $G$, the normalized structural correlation of $S$ is given by:

$$\delta(S, G) = \frac{\epsilon(S)}{\epsilon_{exp}(\sigma(S), G)} \qquad (2)$$

According to Definition 5, the normalized structural correlation function gives how much the structural correlation of an attribute set $S$ is higher than expected. Therefore, it requires the definition of the function $\epsilon_{exp}$, which receives the support of $S$ ($\sigma(S)$) and the attributed graph $G$ as arguments. By normalizing the structural correlation, we expect to obtain a measure of the correlation of an attribute set $S$ that is independent of its support and the input graph.

We assume that the input graph $G$ comprises the object of interest, i.e., it is the "population" graph. Assume that we are given the attribute set support value $\sigma(S)$ (independent of the actual attribute set $S$). To compute the expected structural correlation, our sample space is the set of all vertex subsets of size $\sigma(S)$ drawn randomly from $G$. The statistic of interest is the mean structural correlation value, $\epsilon_{exp}$, that is, the expected probability that a random vertex in a given sample of size $\sigma(S)$ is a member of a dense subgraph (quasi-clique) in that sample. The quasi-clique parameters, $\gamma_{min}$ and $min\_size$, are assumed to be fixed as well.

An intuitive approach for computing $\epsilon_{exp}$ is through simulation. Given the support $\sigma(S)$ of the attribute set, a random sample of $\sigma(S)$ vertices from $G$ is selected. Each vertex from the sample is checked to be in a quasi-clique, according to the quasi-clique parameters. The structural correlation of the sample is the fraction of its vertices that are in at least one quasi-clique. The simulation-based expected structural correlation sim-$\epsilon_{exp}$ is given by the average structural correlation of $r$ random samples.

The simulation-based structural correlation is conceptually very simple but may require a high $r$ to achieve accurate estimates, which is prohibitive in real settings. Thus, we also propose an analytical formulation for an upper bound on the expected structural correlation of an attribute set. The idea is that a vertex must have a minimum degree of $\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$ in order to be a member of a $\gamma_{min}$-quasi-clique of minimum size $min\_size$. Consequently, the probability that a vertex has a degree of at least $\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$ in a random subgraph of size $\sigma(S)$ from $G$ gives an upper bound on the expected structural correlation of $S$. Given a random size-$\sigma(S)$ subgraph $G(S)$ from $G$, the degrees of $v$ in $G$ and $G(S)$ are related as follows.

THEOREM 1. (Probability of a vertex that has a degree $\alpha$ in $G$ to have a degree $\beta$ in $G(S)$) If a random vertex $v$ from $G$ with degree $\alpha$ is selected to be part of $G(S)$, the probability that this vertex has a degree $\beta$ in $G(S)$ is given by the following binomial function:

$$F(\alpha, \beta, \rho) = \binom{\alpha}{\beta} \cdot \rho^{\beta} \cdot (1 - \rho)^{\alpha - \beta} \qquad (3)$$

where $\rho$ is the probability of a specific vertex $u$ from $G$ being in $G(S)$, if $v$ is already chosen, which is given as:

$$\rho = \frac{\sigma(S) - 1}{|V| - 1} \qquad (4)$$

Proof sketch. There are $\alpha$ vertices adjacent to $v$ in $G$; thus, the probability that $v$ has a degree of $\beta$ in $G(S)$ is the probability of selecting $\beta$ out of $\alpha$ vertices to be part of $G(S)$. Since $v$ is already selected, the probability of selecting any remaining vertex from $G$ is given by Equation 4.

Based on Theorem 1, we define an upper bound on the expected structural correlation as the probability that a vertex has a degree of at least $\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$ in $G(S)$.

THEOREM 2. (Upper bound on the expected structural correlation) Given the quasi-clique parameters $\gamma_{min}$ and $min\_size$, the structural correlation of an attribute set with support $\sigma(S)$ is upper bounded by:

$$\text{max-}\epsilon_{exp}(\sigma(S)) = \sum_{\alpha = z}^{m} p(\alpha) \cdot \sum_{\beta = z}^{\alpha} F(\alpha, \beta, \rho) \qquad (5)$$

where $z = \lceil \gamma_{min} \cdot (min\_size - 1) \rceil$, $m$ is the maximum degree of a vertex from $G$, and $p$ is the degree distribution of $G$.

Proof sketch. Given a vertex with degree $\alpha$ in $G$, the probability that this vertex has a degree of at least $\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$ in $G(S)$ is the sum of Expression 3 over the degree interval from $\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$ to $\alpha$. If we multiply this sum by the probability of a vertex of degree $\alpha$ from $G$ being in $G(S)$, i.e., $p(\alpha)$, it gives the probability that any vertex with degree $\alpha$ from $G$ has a degree of at least $\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$ in $G(S)$. Equation 5 is the sum of such products over the vertex degrees of at least $\lceil \gamma_{min} \cdot (min\_size - 1) \rceil$.

The proposed upper bound on the expected structural correlation of an attribute set $S$ is based on the expected degree distribution of a random graph of size $\sigma(S)$ from $G$. However, the degree is not the only criterion for a vertex to be part of a quasi-clique. Vertices that satisfy the minimum degree threshold may not be part of a quasi-clique if they are connected to low-degree vertices. Nevertheless, since we apply the proposed formulation in order to normalize the structural correlation of attribute sets with different supports, our objective is to provide a function whose slope is similar to that of the expected structural correlation. In Section 4.1, we compare the expected structural correlation computed using simulation with the proposed upper bound.

We call $\delta_{sim}$ and $\delta_{lb}$ the normalized structural correlation functions that apply the expected structural correlation based on simulation, sim-$\epsilon_{exp}$, and the theoretical upper bound, max-$\epsilon_{exp}$, respectively. Since max-$\epsilon_{exp} \geq$ sim-$\epsilon_{exp}$, we have $\delta_{lb} = \epsilon(S) / \text{max-}\epsilon_{exp} \leq \epsilon(S) / \text{sim-}\epsilon_{exp} = \delta_{sim}$; thus, $\delta_{lb}$ is a lower bound on $\delta_{sim}$.

It is important to notice that max-$\epsilon_{exp}$ is monotonically non-decreasing, i.e., max-$\epsilon_{exp}(\sigma_1) \geq$ max-$\epsilon_{exp}(\sigma_2)$ if $\sigma_1 \geq \sigma_2$. This follows directly from the fact that the analytical upper bound (Equation 5) is based on a cumulative binomial function, which is known to be monotonically non-decreasing w.r.t. $\rho$. We also assume that sim-$\epsilon_{exp}$ is monotonically non-decreasing for sufficiently high values of $r$, since an increase in the size of the random graphs selected from $G$ is not expected to decrease the probability of finding a vertex in a quasi-clique. Such properties will be exploited by our pruning techniques, which are proposed later in this paper (see Section 3.2.1).

We apply the normalized structural correlation in the identification of statistically significant structural correlation values. Therefore, we extend the structural correlation pattern mining problem (Definition 4) by adding a minimum normalized structural correlation threshold $\delta_{min}$. Such a threshold may also be useful to improve the performance of structural correlation pattern mining algorithms, as will be discussed in Section 3.2. Since a user may be interested in patterns that have high structural correlation ($\epsilon$) as well as being statistically significant ($\delta$), we present results using both regular and normalized structural correlation.

2.2 Related Work

Finding communities [6, 3] and dense subgraphs [5, 10, 8, 20] has been an active research topic. A community is usually defined as a set of vertices significantly more connected among themselves than with vertices outside it [3]. On the other hand, dense subgraphs, such as cliques [18], are strongly based on internal cohesion and maximality. This work applies a dense subgraph definition called quasi-clique, which is a set of vertices where each vertex is connected to at least a fraction of the others. [14] introduces the problem of mining cross-graph quasi-cliques. They further studied the problem of mining frequent cross-graph quasi-cliques [8]. In [20] and [21], the authors study the problem of mining frequent coherent closed quasi-cliques. [10] studies the problem of finding quasi-cliques from a single graph, proposing pruning techniques for quasi-clique mining.

Graph clustering and dense subgraph discovery methods that consider vertex attributes as complementary information have attracted the interest of the research community in recent years [12, 4, 22, 13]. A general assumption of these methods is that clusters based on both the topology of the graph and the attributes of vertices are more meaningful than those based only on the topology or the attributes. [4] proposes two efficient algorithms for the connected k-center problem, whose objective is to partition a graph considering both the attributes and the topology. [22] proposes a random walk-based distance metric in an augmented graph where vertices from the original graph are connected to new vertices that represent vertex attributes. In [12], the authors introduce the problem of mining cohesive patterns, which are dense connected subgraphs where vertices have homogeneous attributes (or features). [13] considers the problem of computing maximal homogeneous cliques in attributed graphs. Different from these methods, structural correlation pattern mining does not assume that vertex attributes are complementary information. In fact, we are interested in finding attribute sets that explain the formation of dense subgraphs through correlation.

Assessing how vertex attributes are related to the graph
topologyhasledtothedenitionofnewpatterns.[15]proposedtheproblemofndingitemset-sharingsubgraphs,whichconsistsofextractingsubgraphswithcommonitem-sets.Itisimportanttonoticethatsuchmethoddonotcon-siderthedensityofsubgraphs.[9]denestheproximitypat-ternmining,whichevaluateshowclosevertexattributesareinthegraph.Aproximitypatternisasetoflabelsthatco-occurinneighborhoods.Therefore,proximitypatternsarenotnecessarilydensesubgraphsorcohesive,dierentlyfromstructuralcorrelationpatterns.In[7],theauthorsproposeadierentdenitionforthestructuralcorrelation,whichcom-parestheclosenessamongverticesinducedbyagivensingleattributeagainstasubgraphwhereattributesarerandomlydistributed.Ourworkdiersfrom[7]bycombiningmultipleattributesandconsideringaparticulartopologicalpropertywhichistheorganizationintodensesubgraphs.Moreover,besidestheevaluationofstructuralcorrelationofattributesets,weareinterestedinthediscoveryofrelevantdensesub-graphstoberepresentativesofthestructuralcorrelation.In[16],weintroducethestructuralcorrelationpatternminingandpresentanalgorithmforthisproblemcalledSCORP.Inthispaper,westudytheproblemofidentifyingstatisticallysignicantstructuralcorrelationpatternsbasedonanormalizationofthestructuralcorrelation.WealsopresenttheSCPMalgorithm,whichextendsSCORPwithnewpruningandsearchstrategiesforstructuralcorrelationpatternmining.DierentfromSCORP,SCPMenumeratesthetopstructuralcorrelationpatternsintermsofsizeanddensityeciently,insteadofthecompletesetofpatterns.3.ALGORITHMS3.1NaiveAlgorithmSincestructuralcorrelationpatternminingcombinesas-pectsofthefrequentitemsetminingandthequasi-cliqueminingproblems,wemaycombineafrequentitemsetmin-ingalgorithmandaquasi-cliqueminingalgorithmintoanaivealgorithmforstructuralcorrelationpatternmining.Thenaivealgorithmsolvesthestructuralcorrelationpat-ternminingproblem(seeDenition4)byrstenumeratingthesetoffrequentattributesetsFfromGandtheniden-tifyingthesetofquasi-cliquesQfromthegraphinducedbyeachfrequentattributesetSfromF.Thestructuralcorrelationofeachfrequen
tattributesetSiscomputedbycheckingwhethereachvertexv2V(S)ispartofaquasi-cliqueinQ.Frequentattributesetscanbeidentiedusingafrequentitemsetminingalgorithm[1,19].Inthiswork,weapplytheEclatalgorithm[19].Moreover,anyalgorithmforquasi-cliqueminingcanbeappliedbysuchnaivealgorithm.WeapplytheQuickalgorithm[10].Themaindrawbackofthenaivealgorithmisthatitenu-meratesthecompletesetoffrequentattributesetsfromGandthecompletesetofquasi-cliquesfromeachinducedgraphG(S),whereSisafrequentattributeset.Sincethefrequentitemsetminingandthequasi-cliqueminingprob-lemsareknowntobe#P-hard,thenaivealgorithmisex-pectedtonotbeabletoprocesslargeattributedgraphs.Inordertoachievesuchgoal,intheupcomingsections,wedescribeseveralstrategiesforecientstructuralcorrelationpatternmining.Wecombinesuchstrategiesintoanewal-gorithm,whichisdescribedinSection3.2.Furtherinthispaper,wecomparetheperformanceoftheproposedalgo-rithmagainstthisnaivemethod.3.2SCPMAlgorithmThissectionpresentstheSCPM(StructuralCorrelationPatternMining)algorithm,whichappliesseveralstrategiesinordertoenablethestructuralcorrelationpatternmin-inginlargeattributedgraphs.Unlikethenaivealgorithm,SCPMdoesnotenumerateeveryfrequentattributesetbutprunesthoseattributesetsthatcannotsatisfyaminimumstructuralcorrelationthreshold.Moreover,insteadofidenti-fyingeachquasi-cliquefromaninducedgraph,SCPMcheckswhetherverticesareinquasi-cliquesbyverifyingareducednumberofquasi-cliquecandidates.Finally,SCPMreturnsthesetofthetop-kmostrelevantstructuralcorrelationpat-ternsfromtheattributedgraph.3.2.1PruningStrategiesforSCPMiningThissectionpresentspruningtechniquesforstructuralcorrelationpatternmining.Theobjectiveofthesepruningtechniquesistoreducetheexecutiontimeofthestructuralcorrelationpatternminingalgorithmswithoutcompromis-ingitscorrectness.Theorem3allowsthepruningofverticesduringthelevel-wiseenumerationofattributesets.THEOREM3.(Vertexpruningforattributesets).LetKSbethesetofverticesindensesubgraphsinthegraphinducedbyanattributesetS.IfSiSj,thenKSjKSi.Proofsketch.Let
ssupposethatthereexistsavertexvsuchthatv2KSjandv=2KSi.Sincev2KSj,thereexistsadensesubgraphVV(Sj),suchthatv2V.Moreover,ifv=2KSi,theredoesnotexistanydensesubgraphUV(Si)suchthatv2U.Nevertheless,ifSiSj,thenV(Sj)V(Si),whichimpliesthatVV(Si)(contradiction).BasedonTheorem3,wecanpruneverticesthatarenotindensesubgraphsinthegraphinducedbyagivenattributesetbeforeextendingittogeneratelargerattributesets.At-tributesetscanalsobeprunedbasedonanupperboundonthestructuralcorrelationfunction,asstatedbyTheorem4.THEOREM4.(Attributesetpruningbasedontheupperboundonthestructuralcorrelation).Fortwoat-tributesetsSiandSj,ifSiSjand(Sj)min,then(Sj)(Si):jV(Si)j=minProofsketch.AccordingtoTheorem3,(Si):jV(Si)j(Sj):jV(Sj)j,sinceeveryvertexcoveredbyadensesub-graphinV(Sj)isalsocoveredbyadensesubgraphinV(Si).Moreover,since(Sj)min,(Sj)isupperboundedby(Si):jV(Si)j=minbasedonthedenitionofthestructuralcorrelationfunction(seeDenition2).GivenanattributesetSi,ofsizei,if(Si):jV(Si)j=minmin,thenSiisnotincludedinthesetofattributesetstobecombinedforthegenerationofsizei+1attributesets.Theorem4guaranteesthattheredoesnotexistanattributesetSj,suchthatSiSjand(Sj)min.Asimilarpruningrulecanbeformulatedbasedonthenormalizedstructuralcorrelationfunctiondenition.THEOREM5.(Attributesetpruningbasedontheupperboundonthenormalizedstructuralcorrelation).FortwoattributesetsSiandSj,ifSiSj,expisamono-tonicallynon-decreasing,and(Sj)min,then(Sj)(Si):jV(Si)j=(exp(min):min)Proofsketch.AccordingtoTheorem4,(Sj)(Si):jV(Si)j=min.Since(Sj)minandexpis Figure2:Setenumerationtree Algorithm1GeneralStructuralCorrelationAlgorithm Require:G(S), min,min sizeEnsure:Q1:Q ;2:X ;3:candExts(X) V(S)4:ApplyvertexpruningincandExts(X)5:qcCands f(X;candExts(X))g6:whileqcCands6=;do7:q qcCands:get()8:Applycandidatequasi-cliquepruninginq9:ifq:X[q:candExts(X)isaquasi-cliquethen10:Q Q[fq:X[q:candExts(X)g11:else12:ifq:Xisaquasi-cliquethen13:Q Q[fq:Xg14:endif15:insertextensionsofqintoqcCands16:endif17:endwhile 
monotonicallynon-decreasing,thenexp((Sj))exp(min).Therefore,(Sj)(Si):jV(Si)j=(exp(min):min).If(Si):jV(Si)j=(exp(min):min)min,theattributesetSi,ofsizei,isnotincludedinthesetofattributesetstobecombinedforthegenerationofsizei+1attributesets.Sincelbgivesalowerboundonthenormalizedstructuralcorrelation,thewholepruningpotentialofTheorem5maynotbeexplored.Nevertheless,theresultsshowthatuseoflbenablessignicantperformancegains(seeSection4.2).ThepruningstrategystatedbyTheorem3reducesthenumberofverticestobecheckedtobeinquasi-cliquesinthecomputationofstructuralcorrelation.Theorems4and5enablethereductionoftheattributesetsforwhichthestructuralcorrelationiscomputedtoasetthatisexpectedtobesmallerthanthesetoffrequentattributesets.3.2.2ComputingtheStructuralCorrelationAsdiscussedinSection3.1,thenaivealgorithmcomputesthestructuralcorrelationofanattributesetSthroughtheenumerationofthequasi-cliquesfromG(S).Inthissection,wedescribehowthestructuralcorrelationcanbecomputedbyidentifyingareducednumberofquasi-cliquecandidates.Quasi-cliquescanbeenumeratedbasedonavertexsetX,initiallysetas;,andasetofcandidateextensionsofX,candExts(X),initiallysetasV.VerticesaremovedfromcandExts(X)toX,oneatatime,untilthecompletesetofquasi-cliquecandidatesaregenerated.Figure2showsasetenumerationtreethatrepresentsthesearchspaceofquasi-cliquesconsideringasetof4vertices(1-4).Inordertoprunedownsuchsearchspace,quasi-cliqueminingalgo-rithmsapplyseveralpruningtechniques.Wedividethesetechniquesintotwogroups:1.Vertexpruning:Removalofverticesthatcannotbepartofanyquasi-cliqueinGaccordingtothequasi-cliquedenitionandthequasi-cliqueparameters.Ver-texpruningisperformediterativelyoverthegraphinordertominimizethesearchspaceofquasi-cliques.2.Candidatequasi-cliquepruning:Removalofcan-didatequasi-cliques(i.e.,pairs(X,candExts(X)))fromthesearchspaceofquasi-cliques.SuchremovalisbasedonthepropertiesofthesubgraphcomposedbyverticesfromXandcandExts(X).Algorithm1givesageneraldescriptionofhowquasi-cliquesareidentiedinthecomputationofstructural
corre-lation.Thisalgorithmisalsousedasthebasisfortheenu-merationofthetop-kstructuralcorrelationpatterns.ThealgorithmreceivesaninducedgraphG(S),andthemini-mumdensity( min)andsize(min size)forquasi-cliques.Itgivesasoutputasetofquasi-cliquesQfromG.Vertexandquasi-cliquecandidatepruningsareappliedinlines4and8,respectively.Candidatequasi-cliquesaremanagedbythedatastructureqcCands,whichwillbediscussedlater.Eachcandidatepatternischeckedtobealookaheadquasi-clique(i.e.,q:X[q:candExts(X)isaquasi-clique)rst,duetothefactthatquasi-cliquesaremaximal.Incasesuchacondi-tiondoesnothold,q:Xischeckedtobeaquasi-cliqueandtheextensionsofqareinsertedintoqcCands(line15).ThealgorithmnisheswhenqcCandsbecomesempty.ThesetKS,whichiscomposedofverticescoveredbyquasi-cliquesinG(S),canbeobtaineddirectlyfromQ.Sincethequasi-cliqueminingproblemisknowntobe#P-hard,theidenticationofquasi-cliquesmayrequireprocess-ingalargenumberofquasi-cliquecandidates,whichwouldconstituteanimportantlimitationtothecomputationofthestructuralcorrelationoflargeinducedgraphs.Neverthe-less,computingthestructuralcorrelationdoesnotrequiretheenumerationofthecompletesetofquasi-cliques.Thenecessaryinformationiswhethereachvertexfromthein-ducedgraphiscoveredbyaquasi-cliqueornot.Therefore,candidatequasi-cliquescomposedofverticesalreadyknowntobecoveredbyquasi-cliquescanbeprunedfromthenewquasi-cliquecandidatesgeneratedinline15ofAlgorithm1.Besidespruningcandidatequasi-cliquesthatarealreadyknowntobecoveredbydensesubgraphs,wealsoproposesearchstrategiesforcomputingthestructuralcorrelation.Thesesearchstrategiesdeterminetheorderinwhichcan-didatequasi-cliquesareenumerated.Abreadth-rstsearch(BFS)strategyforcomputingthestructuralcorrelationtra-versesthesearchspaceofquasi-cliquesinabreadth-rstorder,startingfromtherootandvisitingthesmallervertexsetsbeforethelargerones.Ontheotherhand,adepth-rstsearch(DFS)strategyextendsvertexsetsasmuchaspossible.TheBFSstrategyisexpectedtoperformbetterincasecoveringverticeswithsmallerquasi-cliquesismoreecientthanwithl
argerquasi-cliques.Consideringasetof4vertices,forwhichthesearchspaceofquasi-cliquesisshowninFigure2,theBFSandtheDFSstrategyvisitthequasi-cliquecandidatesasfollows: Algorithm2SCPMAlgorithm Require:G,min, min,min size,min,min,kEnsure:P1:P ;2:T ;3:I frequentattributesfromG4:forallS2Ido5: structuralcorrelationofS6:ifminAND=exp(S)minthen7:Q top-kpatternsfromG(S)8:forallq2Qdo9:P P[(S;q)10:endfor11:endif12:if:(S)min:minAND:(S)min:exp(min):minthen13:T T[S14:endif15:endfor16:P P[enumerate-patterns(T;G;min; min;min size;min;min;k) Algorithm3enumerate-patterns Require:T;G;min; min;min size;min;min;kEnsure:P1:P ;2:forallSi2Tdo3:R ;4:forallSj2Tdo5:ifijthen6:S Si[Sj7:if(S)minthen8: structuralcorrelationofS9:ifminAND=exp(S)minthen10:Q top-kpatternsfromG(S)11:forallq2Qdo12:P P[(S;q)13:endfor14:endif15:if:(S)min:minAND:(S)min:exp(min):minthen16:R R[S17:endif18:endif19:endif20:endfor21:P P[enumerate-patterns(R;G;min; min;min size;min;min;k)22:endfor BFS:f1g,f2g,f3g,f4g,f1;2g,:::f1;2;3;4g.DFS:f1g,f1;2g,f1;2;3g,f1;2;3;4g,f1;3g,:::f4g.Quasi-cliquescanbeenumeratedinBFSorderbyusingaqueueasadatastructuretomanagequasi-cliquecandidatesinAlgorithm1.Similarly,aDFSstrategyforenumeratingquasi-cliquescanapplyastackinordertomanipulatecandi-datepatterns.Furtherinthispaper,weevaluatethesearchstrategiespresentedinthissection.3.2.3EnumeratingTopkPatternsAsdiscussedinSection2.1.2,enumeratingstructuralcor-relationpatternsisacomputationallyexpensivetask.Inthissection,westudyhowtoreducethecostofenumerat-ingstructuralcorrelationpatternsbyrestrictingtheoutputsettoonlythetop-kmostrelevantpatternsintermsofsize(primarycriteria)anddensity(secondarycriteria).Theenumerationofthetop-kstructuralcorrelationpat-ternsfollowsthesameproceduredescribedinAlgorithm1.WeuseaDFSstrategyinthediscoveryofthetop-kpat-ternsbecausestructuralcorrelationpatternsaremaximal(seeDenition3).However,sincethenumberofpatternstobediscoveredisknown,acurrentsetofpatternscanbeap-pliedtoprunethesearchspaceofnewcandidates.Newcan-didatequasi-clique
s are generated in line 15. In case the current set of top patterns contains k patterns and a candidate pattern p cannot produce a pattern larger than the smallest current top-k pattern t (i.e., |p.X ∪ p.candExts(X)| ≤ |t|), p can be pruned. By updating the set of top-k patterns, the minimum size threshold is increased iteratively. As a consequence, the top-k patterns are enumerated more efficiently than the complete set of patterns from an induced graph.

Algorithm 2 is a high-level description of the SCPM algorithm, which applies the strategies for efficient structural correlation pattern mining presented in this section. The initial set of attributes I is composed of those with a support of at least σ_min (line 3). The structural correlation of each size-one attribute set S ∈ I is computed as described in Section 3.2.2. In case the structural correlation of S satisfies the minimum structural correlation (ε_min) and normalized structural correlation (δ_min) thresholds, the top-k patterns induced by S are identified using the algorithm described in this section (line 7). These patterns are included in a set of patterns P that will be given as output. The pruning rules for attribute sets based on ε and δ (see Section 3.2.1) are applied in line 12. Pruned attributes are not included in the set of attributes T to be extended. These attributes are extended by the function enumerate-patterns (line 16).

Algorithm 3 describes the function enumerate-patterns. It receives the same input parameters as SCPM, and also the set of patterns to be extended T. It returns the set of top-k patterns (S, V) that have attribute sets extended from those in T regarding the input parameters. New attribute sets are extended through the union of existing ones (line 6). Attribute sets are traversed in a DFS order (e.g., {A}, {A, B}, {A, B, C}, ..., {E}). The enumerate-patterns function is similar to Algorithm 2, except that each new attribute set S is checked to satisfy the minimum support threshold σ_min (line 7). All valid attribute sets are generated through recursive calls to enumerate-patterns (line 21).

4. EXPERIMENTAL RESULTS

This section presents case studies on structural correlation pattern mining using real datasets. Moreover, we evaluate the performance and study the sensitivity of important input parameters of SCPM. Experiments were executed on a 16-core Intel Xeon 2.4 GHz with 50 GB of RAM. The implementations are available as open-source¹.

¹ http://code.google.com/p/scpm/

4.1 Case Studies

4.1.1 DBLP

In the attributed graph extracted from the DBLP² digital library, each vertex represents an author and two authors are connected if they have co-authored a paper. The attributes of authors are terms that appear in the titles of papers authored by them³. In the DBLP dataset an attribute set defines a topic (i.e., set of terms that carry a specific meaning in the literature) and a dense subgraph is a community. The DBLP dataset has 108,030 vertices, 276,658 edges and 23,285 attributes.

² http://www.informatik.uni-trier.de/~ley/db
³ Stemming and removal of stop words were applied.

Figure 3: Examples of results from the DBLP dataset. (a) Graph induced by {search, rank}. (b) Pattern induced by {perform, system}.

Top-σ attribute sets:
S                                σ      ε      δ_lb
{base, system}                   5492   0.04   14.0
{base, us}                       5421   0.04   13.5
{base, model}                    4852   0.03   13.3
{model, us}                      4168   0.03   21.0
{system, us}                     3989   0.05   36.8
{base, network}                  3774   0.05   41.8
{model, system}                  3460   0.02   21.7
{base, data}                     3452   0.07   71.6
{base, imag}                     3424   0.02   17.6
{imag, us}                       3345   0.02   19.6

Top-ε attribute sets:
S                                σ      ε      δ_lb
{grid, applic}                   840    0.26   41577
{grid, servic}                   599    0.23   154703
{environ, grid}                  525    0.21   256793
{queri, xml}                     615    0.21   123533
{search, web}                    1031   0.20   13738
{search, rank}                   420    0.19   635349
{dynam, simul}                   469    0.19   383169
{queri, data}                    1540   0.19   2758
{chip, system}                   702    0.19   63351
{data, stream}                   1073   0.18   10653

Top-δ_lb attribute sets:
S                                σ      ε      δ_lb
{search, rank}                   420    0.19   635349
{perform, file}                  404    0.14   555067
{structur, index}                404    0.14   555067
{search, mine}                   413    0.14   490932
{us, xml}                        400    0.11   442638
{search, web, data}              424    0.14   431589
{base, search, analysi}          414    0.12   416385
{model, internet}                401    0.10   406059
{process, data, databas}         416    0.12   405363
{perform, distribut, parallel}   416    0.11   388818

Table 2: DBLP - Top support (σ), str. correlation (ε), and normalized str. correlation (δ_lb) attribute sets.

Table 2 shows the top 10 attribute
sets w.r.t. support (σ), structural correlation (ε), and normalized structural correlation (δ_lb). The minimum size (min_size) and density (γ_min) parameters were set to 10 and 0.5, respectively. The minimum support threshold (σ_min) was set to 400 and we considered only attribute sets of size at least 2. The parameters used in our case studies were selected empirically.

Top-σ attribute sets present a low correlation with the formation of dense subgraphs in the DBLP dataset. Such terms are popular in paper titles, but do not carry much knowledge regarding the formation of research communities. On the other hand, the top-ε attribute sets may be more easily associated with known topics in computer science. The attribute set {grid, applic} has the highest structural correlation (0.26), i.e., 26% of the authors that have the keywords "grid" and "applic" are inside a community of researchers of size at least 10 where each of them has collaborated with at least half of the other members. It is interesting to point out that the graph induced by {grid, applic} has more vertices in dense subgraphs than the graph induced by {base, system}, though {base, system} is more than 6 times more frequent than {grid, applic}. In general, high support attribute sets do not present high structural correlation.

Figure 4 shows the expected structural correlation for different support values in the DBLP dataset. The input parameters are the same as those used to generate the results shown in Table 2. For the simulation model, we executed 1000 simulations for each support value and also show the standard deviation of the estimated expected structural correlation. The analytical upper bound is not tight w.r.t. the simulation results, but presents a similar growth, which shows that it enables accurate comparisons between the structural correlation of attribute sets.

Figure 4: DBLP - Expected ε computed by the simulation (sim-exp) and analytical (max-exp) models.

Based on the proposed analytical model, the third column of Table 2 shows the top attribute sets in terms of analytical normalized structural correlation (δ_lb). The attribute set {search, rank} has the highest normalized structural correlation (635,349), i.e., the structural correlation is 635,349 times the upper bound on its expected structural correlation given by the analytical model. Figure 3(a) presents the graph induced by {search, rank}. Vertices contained in a dense subgraph are indicated. Dense subgraphs cover the densest components of the induced graph. In general, top-σ attribute sets have low δ_lb when compared to the top-δ_lb attribute sets. Moreover, high values of ε do not necessarily lead to high values of δ_lb. Figure 3(b) shows the largest structural correlation pattern in terms of number of vertices from DBLP, which represents two important interconnected research groups on high performance systems.

4.1.2 LastFm

LastFm⁴ is an online social music network. We use a sample of the LastFm users crawled through an API provided by LastFm. In the LastFm network, vertices represent users and edges represent friendships. The attributes of a vertex are the artists the respective user has listened to. An attribute set in the LastFm dataset represents, in a more general interpretation, a musical taste (i.e., set of artists) and a dense subgraph is a community. The LastFm dataset contains 272,412 vertices, 350,239 edges, and 3,929,101 attributes.

⁴ http://www.last.fm

Table 3 shows the top 10 attribute sets in terms of support (σ), structural correlation (ε) and normalized structural correlation (δ_lb) discovered from LastFm. The minimum size (min_size) and density (γ_min) parameters were set to 5 and 0.5, respectively. The minimum support threshold (σ_min) was set to 27,000.

Figure 5: Examples of results from the LastFm dataset. (a) Graph induced by {S Stevens, Wilco}. (b) Pattern induced by {Van Morrison}.

Top-σ attribute sets:
S                                    σ        ε      δ_lb
{Radiohead}                          121892   0.11   0.37
{Coldplay}                           118053   0.09   0.33
{Beatles}                            109037   0.09   0.36
{R Peppers}                          105984   0.09   0.35
{Nirvana}                            100604   0.07   0.31
{T Killers}                          96305    0.07   0.32
{Muse}                               94382    0.07   0.33
{Oasis}                              87875    0.06   0.30
{F Fighters}                         87001    0.06   0.33
{P Floyd}                            86807    0.07   0.34

Top-ε attribute sets:
S                                    σ        ε      δ_lb
{Radiohead}                          121892   0.11   0.37
{Coldplay}                           118053   0.09   0.33
{Beatles}                            109037   0.09   0.36
{R Peppers}                          105984   0.09   0.35
{Metallica}                          83587    0.08   0.41
{DC for Cutie}                       82025    0.07   0.41
{Beck}                               83360    0.07   0.40
{Muse}                               94382    0.07   0.33
{Nirvana}                            100604   0.07   0.31
{The Shins}                          68480    0.07   0.50

Top-δ_lb attribute sets:
S                                    σ        ε      δ_lb
{S Stevens, Wilco}                   28798    0.04   1.14
{S Stevens, Of Montreal}             28621    0.04   1.13
{Beirut}                             27605    0.04   1.11
{S Stevens, Decemberists, Beatles}   27415    0.04   1.11
{N Hotel, S Stevens}                 29260    0.04   1.10
{S Stevens, F Lips, Beatles}         27571    0.04   1.09
{A Collective}                       33555    0.05   1.09
{BS Scene, NM Hotel}                 27308    0.04   1.09
{Radiohead, Spoon, S Stevens}        27113    0.04   1.06
{N Hotel, Radiohead, Beatles}        28776    0.04   1.04

Table 3: LastFm - Top support (σ), str. correlation (ε), and normalized str. correlation (δ_lb) attribute sets.

In general, the top-ε attribute sets are the most frequent ones. However, such attribute sets present low normalized structural correlation. In other words, although these attributes are frequent and have several vertices covered by communities, this coverage is not much higher than expected. Considering the normalized structural correlation, which takes into account the expected structural correlation of an attribute set, the top patterns change significantly. Figure 7 shows the expected structural correlation for support values varying from 20,000 to 100,000. Each simulation-based expected structural correlation value corresponds to an average of 100 simulations. The top-δ_lb attribute set {S Stevens, Wilco} includes the American singer and songwriter Sufjan Stevens and the American band Wilco.

Figure 7: LastFm - Expected ε computed by the simulation (sim-exp) and analytical (max-exp) models.

Figure 5(a) shows the graph induced by the attribute set {S Stevens, Wilco}. For clarity, we removed vertices with degree lower than 2. By visualizing vertices inside and outside structural correlation patterns, we can understand how the structural correlation captures the relationship between attributes and dense subgraphs. The largest structural correlation pattern found is presented in Figure 5(b). It represents a community of 34 users who have listened to the Northern Irish singer and songwriter Van Morrison. Vertex identifiers are not shown due to privacy issues.

4.1.3 CiteSeer

CiteSeerX⁵ is a scientific literature digital library and search engine. We built a citation graph from CiteSeerX as of March of 2010. In the CiteSeer graph, papers are represented by vertices and citations by undirected edges. Each paper has as attributes terms extracted from its abstract⁶. Attribute sets represent topics and dense subgraphs define groups of related work in the CiteSeer graph. The CiteSeer dataset has 294,104 vertices, 782,147 edges, and 206,430 attributes. The parameter setting applied in this case study is σ_min = 2000, min_size = 5, and γ_min = 0.5.

⁵ http://citeseerx.ist.psu.edu
⁶ Stemming and stop words removal were applied.

Figure 6: Examples of results from the CiteSeer dataset. (a) Graph induced by {node, wireless}. (b) Pattern induced by {perform, system}.

Top-σ attribute sets:
S                         σ       ε      δ_lb
{system, paper}           57906   0.16   0.77
{base, paper}             56566   0.10   0.47
{paper, result}           47516   0.08   0.45
{paper, model}            43929   0.09   0.59
{us, paper}               43573   0.05   0.32
{system, base}            42079   0.09   0.63
{approach, paper}         38690   0.05   0.40
{perform, paper}          37349   0.13   1.04
{paper, propos}           37243   0.06   0.46
{paper, algorithm}        37027   0.12   0.95

Top-ε attribute sets:
S                         σ       ε      δ_lb
{network, sensor}         3276    0.47   108.7
{network, hoc}            2744    0.47   141.2
{ad, network, hoc}        2725    0.44   134.6
{network, rout}           5084    0.41   48.0
{network, wireless}       5242    0.40   44.7
{node, wireless}          2086    0.35   164.4
{protocol, rout}          2134    0.35   157.6
{ad, network}             3563    0.34   69.3
{program, logic}          5895    0.33   31.2
{memori, cach}            2150    0.32   143.8

Top-δ_lb attribute sets:
S                         σ       ε      δ_lb
{node, wireless}          2086    0.35   164.4
{protocol, rout}          2134    0.35   157.6
{memori, cach}            2150    0.32   143.8
{network, hoc}            2744    0.47   141.2
{protocol, wireless}      2048    0.29   138.7
{ad, network, hoc}        2725    0.44   134.7
{network, node, rout}     2075    0.25   118.3
{optim, queri}            2094    0.26   118.2
{perform, instruct}       2111    0.25   115.95
{paper, ad, network}      2081    0.23   108.86

Table 4: CiteSeer - Top support (σ), str. correlation (ε), and normalized str. correlation (δ_lb) attribute sets.

Table 4 shows the top structural correlation attribute
sets w.r.t. σ, ε, and δ_lb discovered. Top-σ attribute sets present low structural correlation and normalized structural correlation when compared to the top-ε and top-δ_lb attribute sets, respectively. Moreover, similar to the DBLP dataset, while the top-σ attribute sets from the CiteSeer dataset are generic terms, the top-ε and top-δ_lb attribute sets may be easily associated with known research topics (e.g., computer networks, query optimization).

Figure 9 shows the expected structural correlation for different support values in CiteSeer. The attribute set {node, wireless} has the highest normalized structural correlation (δ_lb = 164.40). Figure 6(a) shows the graph induced by the attribute set {node, wireless} in CiteSeer. Figure 6(b) presents the largest structural correlation pattern discovered in the CiteSeer dataset. Vertex labels are the initials of paper titles. The papers included in the pattern cover topics such as caching, memory management, computer networks, processor design, and instruction level optimization (e.g., Attribute Caches, Systems for Late Code Modification, Limits of Instruction Level Parallelism, Link-time Optimization of Address Calculation on a 64-bit Architecture). We do not show the full list of paper titles due to space limitations.

Figure 9: CiteSeer - Expected ε for the simulation (sim-exp) and analytical (max-exp) models.

4.2 Performance Evaluation

This section evaluates the performance of the structural correlation pattern mining algorithms. The dataset used is a smaller version of the DBLP dataset (SmallDBLP), which has 32,908 vertices, 82,376 edges, and 11,192 attributes. SCPM-BFS and SCPM-DFS are versions of the SCPM algorithm using the BFS and DFS strategy, respectively. The Naive algorithm enumerates the complete set of quasi-cliques from the induced graphs, as described in Section 3.1. We vary each parameter of the algorithms keeping the others constant. Default values for γ_min, min_size, and σ_min are 0.5, 11, and 100. Moreover, ε_min, δ_min, and k are set to 0.1, 1, and 5, respectively, unless stated otherwise.

Figure 8: Performance evaluation. (a) Runtime x γ_min. (b) Runtime x min_size. (c) Runtime x σ_min. (d) Runtime x ε_min. (e) Runtime x δ_min. (f) Runtime x k.

Figures 8(a), 8(b), and 8(c) show the runtime of the algorithms varying the values of γ_min, min_size, and σ_min, respectively. In general, SCPM-DFS achieves the best results, being up to 3 orders of magnitude faster than the Naive algorithm. Moreover, SCPM-BFS performs better than the Naive algorithm in all the experiments.

In terms of the ε_min (Figure 8(d)) and δ_min (Figure 8(e)) parameters, both SCPM-BFS and SCPM-DFS apply the pruning techniques described in Section 3.2.1. Based on the results shown in Figures 8(d) and 8(e), we can notice that such techniques lead to significant performance gains when the values of ε_min and δ_min are increased.

In Figure 8(f), we show the runtime of SCPM-DFS and the Naive algorithm for different values of k. The results of SCPM-BFS are omitted because both SCPM-BFS and SCPM-DFS apply the same strategy for identifying the top-k structural correlation patterns (see Section 3.2.3). The inset also shows the execution time of SCPM-DFS using a linear scale for the y-axis, to show the effect of k on the runtime more clearly. The results show that for low values of k, SCPM-DFS is able to achieve low running times, outperforming the Naive algorithm significantly.

4.3 Parameter Sensitivity and Setting

We now assess how different input parameters affect the output of structural correlation pattern mining. Our objective is to provide guidelines for setting the parameters of SCPM.

Figure 10 shows the average structural correlation and normalized structural correlation of the complete output (global) and the top-10% attribute sets from the SmallDBLP dataset varying the γ_min, min_size, and σ_min parameters. Default values for γ_min, min_size, and σ_min are 0.5, 10 and 100. The results show that more restrictive quasi-clique parameters (i.e., high values of γ_min and min_size) reduce the average ε but may increase δ, since dense subgraphs become less expected. Moreover, high values of σ_min are related to high values of structural correlation. However, such attribute sets also present high values of ε_exp, leading to low values of normalized structural correlation.

Figure 10: Parameter sensitivity. Panels (a)-(c) plot the average structural correlation and panels (d)-(f) the average normalized structural correlation against γ_min, min_size, and σ_min, respectively.

SCPM is an exploratory pattern mining method, and thus reasonable values for the different parameters can be obtained by searching the parameter space. The minimum density parameter, γ_min, and the minimum quasi-clique size, min_size, will depend on the application. For σ_min, a useful guideline is to select values that produce a significant expected structural correlation. Infrequent attribute sets may not be expected to induce any dense subgraph. The other parameters (ε_min, δ_min, and k) are intended to speed up the algorithm and must be set according to the available computational resources and time.

5. CONCLUSIONS

In this paper, we studied the problem of correlating vertex attributes and dense subgraphs in large attributed graphs. The concept of structural correlation, which measures how an attribute set induces dense subgraphs in an attributed graph, was proposed. We also presented normalization approaches that compare the structural correlation of a given attribute set against its expected value, which provides a measure of the statistical significance for the structural correlation. In order to enable the analysis of large databases, we introduced search and pruning strategies for structural correlation pattern mining. We also proposed an algorithm for the identification of the top structural correlation patterns, which are the largest and densest subgraphs induced by a given set of attributes. The patterns and algorithms proposed were applied to three real datasets. The attribute sets and patterns found represent relevant knowledge in terms of the correlation between attributes and dense subgraphs.

Acknowledgements: This work was supported by CNPQ, CAPES, Fapemig, FINEP, InWeb, NSF award EMT-0829835, and NIH award 1R01EB0080161. We would like to thank Alberto Laender and Loïc Cerf for their comments.
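As a rough illustration of the measures used throughout Section 4, the following Python sketch computes the structural correlation of an attribute set on a toy graph, estimates an expected value by random sampling, and takes their ratio as a normalized value. It is not the SCPM implementation. The function names, the toy data, and the brute-force enumeration are all hypothetical. The quasi-clique test (each vertex adjacent to at least ceil(γ(|Q| - 1)) other members) and the sampling scheme (uniform random vertex sets of size σ) are assumptions made for the example, and the enumeration is exponential, so it only suits very small inputs.

```python
import random
from itertools import combinations
from math import ceil


def is_quasi_clique(adj, Q, gamma):
    """Assumed definition: every vertex of Q has at least
    ceil(gamma * (|Q| - 1)) neighbours inside Q."""
    need = ceil(gamma * (len(Q) - 1))
    return all(len(adj[v] & Q) >= need for v in Q)


def coverage(adj, vertices, gamma, min_size):
    """Fraction of `vertices` that lie in at least one quasi-clique of
    size >= min_size in the graph induced by `vertices`.
    Brute force over all subsets: exponential, toy inputs only."""
    members = sorted(vertices)
    if not members:
        return 0.0
    covered = set()
    for size in range(min_size, len(members) + 1):
        for combo in combinations(members, size):
            Q = set(combo)
            if is_quasi_clique(adj, Q, gamma):
                covered |= Q
    return len(covered) / len(members)


def structural_correlation(adj, attrs, S, gamma, min_size):
    """Coverage of the vertices whose attributes include every item of S."""
    induced = {v for v, a in attrs.items() if S <= a}
    return coverage(adj, induced, gamma, min_size)


def simulated_expectation(adj, sigma, gamma, min_size, runs=100, seed=0):
    """Assumed simulation model: average coverage over `runs` uniformly
    random vertex sets of size sigma."""
    rng = random.Random(seed)
    population = sorted(adj)
    total = 0.0
    for _ in range(runs):
        sample = set(rng.sample(population, sigma))
        total += coverage(adj, sample, gamma, min_size)
    return total / runs


# Hypothetical toy graph: a triangle (1, 2, 3) with a tail (4) and an edge (5, 6).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (5, 6)]
adj = {v: set() for v in range(1, 7)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)
attrs = {1: {"A", "B"}, 2: {"A", "B"}, 3: {"A"}, 4: {"A"}, 5: {"B"}, 6: {"A", "B"}}

eps = structural_correlation(adj, attrs, {"A"}, gamma=0.5, min_size=3)
eps_exp = simulated_expectation(adj, sigma=5, gamma=0.5, min_size=3)
print("structural correlation:", eps)  # 0.8: 4 of the 5 vertices with A are covered
if eps_exp > 0:
    print("normalized (simulation-based):", eps / eps_exp)
```

On this toy graph the attribute set {A} induces five vertices, four of which fall inside a quasi-clique for γ = 0.5 and min_size = 3, which gives a structural correlation of 0.8. Dividing by the sampled expectation mirrors the idea of normalization discussed in the case studies, where a ratio above 1 means the coverage is higher than expected for that support.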