Uploaded On 2015-05-02

Mining Attribute-structure Correlated Patterns in Large Attributed Graphs

Arlei Silva
Universidade Federal de Minas Gerais, Belo Horizonte, Brasil
arlei@dcc.ufmg.br

Wagner Meira Jr.
Universidade Federal de Minas Gerais, Belo Horizonte, Brasil
meira@dcc.ufmg.br

Mohammed J. Zaki
Rensselaer Polytechnic Institute, Troy, NY
zaki@cs.rpi.edu

ABSTRACT

In this work, we study the correlation between attribute sets and the occurrence of dense subgraphs in large attributed graphs, a task we call structural correlation pattern mining. A structural correlation pattern is a dense subgraph induced by a particular attribute set. Existing methods are not able to extract relevant knowledge regarding how vertex attributes interact with dense subgraphs. Structural correlation pattern mining combines aspects of the frequent itemset and quasi-clique mining problems. We propose statistical significance measures that compare the structural correlation of attribute sets against their expected values using null models. Moreover, we evaluate the interestingness of structural correlation patterns in terms of size and density. An efficient algorithm that combines search and pruning strategies in the identification of the most relevant structural correlation patterns is presented. We apply our method to the analysis of three real-world attributed graphs: a collaboration, a music, and a citation network, verifying that it provides valuable knowledge in a feasible time.

1. INTRODUCTION

In several real-life graphs, attributes can be associated with vertices in order to represent vertex properties. In social networks, for example, vertex attributes are useful to model personal characteristics. Moreover, vertex attributes can be associated with content (e.g., keywords, tags) in the web graph. Such an extended graph representation, which is called an attributed graph, may support graph patterns that provide relevant knowledge in various application scenarios.

An interesting question related to attributed graphs is how particular attributes are associated with the topology of real graphs. In other words, do there exist patterns that explain how vertex attributes interact with the graph structure? How can we extract and evaluate such patterns? In this paper, we study the problem of correlating attribute sets with an important topological property of graphs, which is the organization of vertices into dense subgraphs. For instance, we aim to address questions such as: How does a particular set of interests induce communities in a social network? What are the communities that emerge around such interests? Such questions are related to important social phenomena such as homophily [11] and influence [2].

Although several definitions of dense subgraphs have been proposed in the literature, most of them do not take vertex attributes into consideration. Furthermore, such definitions do not provide any knowledge regarding how different sets of attributes induce dense subgraphs.

This work studies the correlation between vertex attributes and dense subgraphs, a task we call structural correlation pattern mining. The structural correlation of an attribute set is the probability of a vertex to be a member of a dense subgraph in its induced graph. Moreover, a structural correlation pattern is a dense subgraph induced by a particular attribute set.

Figure 1 illustrates a dataset for structural correlation pattern mining. The vertex attributes are given in Figure 1(a) and the graph is shown in Figure 1(b). Example dense subgraphs are shown in Figures 1(c) and 1(d). The structural correlation of the attribute A is 0.82, since 9 out of 11 vertices are covered by dense subgraphs in its induced graph. On the other hand, the structural correlation of C is 0, because there is no dense subgraph inside the graph induced by C. The structural correlation of {A, B} is 1, due to the fact that every vertex is a member of a dense subgraph in the graph induced by {A, B}. The pair ({A, B}, {6, 7, 8, 9, 10, 11}) is an example of a structural correlation pattern, for which the subgraph is shown in Figure 1(d). Another example is the pattern ({A}, {3, 4, 5, 6}), for which the induced subgraph is shown in Figure 1(c).

    vertex   attributes
    1        A, C
    2        A
    3        A, C, D
    4        A, D
    5        A, E
    6        A, B, C
    7        A, B, E
    8        A, B
    9        A, B
    10       A, B, D
    11       A, B

    (a) Vertex attributes   (b) Graph   (c) Dense subgraph   (d) Dense subgraph
    Figure 1: Structural correlation pattern mining (illustrative example)

The structural correlation of attribute sets and the structural correlation patterns are complementary information: while the first is a measure of the correlation between a given attribute set and the occurrence of dense subgraphs, the second provides representatives for such a correlation through specific subgraphs. We formulate structural correlation pattern mining in terms of two existing data mining problems: frequent itemset and quasi-clique mining. Frequent itemset mining [1, 19] is applied to handle the possibly large number of attribute sets from the graph, and quasi-cliques [14, 10] are used as a definition for dense subgraphs.

We study structural correlation pattern mining focusing on two important aspects. The first aspect is the significance of the patterns. More specifically, it is relevant to provide significance measures for the structural correlation of attribute sets and the structural correlation patterns. The second aspect is related to the computational cost of the proposed task. Our objective is to enable the analysis of large real graphs in a feasible time. Although significance and high performance are not necessarily concordant goals, we propose significance metrics that may lead to efficient pruning strategies for structural correlation pattern mining.

Regarding the significance of patterns, we formulate normalization approaches for structural correlation pattern mining in order to measure the statistical significance of the structural correlation of a given attribute set. The idea is to compare the structural correlation against its expected value, which is provided by a null model. Moreover, we evaluate the structural correlation patterns in terms of size (i.e., number of vertices) and density (i.e., cohesion). Such an evaluation is useful to rank the most interesting patterns.

We combine the statistical significance of the structural correlation of attribute sets and the size and density of structural correlation patterns with effective constraints to prune down the search space. Moreover, we propose two strategies for computing the structural correlation of attribute sets efficiently. These pruning and search techniques are integrated into the SCPM (Structural Correlation Pattern Mining) algorithm, which is described and evaluated in this paper. In particular, we apply SCPM to the analysis of three real attributed graphs: collaboration, music and citation networks. The results show that SCPM is able to extract relevant knowledge regarding how vertex attributes are correlated with dense subgraphs in large attributed graphs.

2. STRUCTURAL CORRELATION PATTERN MINING

2.1 Definitions

2.1.1 Structural Correlation

We define an attributed graph as a 4-tuple G = (V, E, A, F) where V is the set of vertices, E is the set of edges, A = {a_1, a_2, ..., a_n} is the set of attributes, and F : V → P(A) is a function that returns the set of attributes of a vertex. P is the power set function. Each vertex v_i in V has a set of attributes F(v_i) = {a_i1, a_i2, ..., a_ip}, where p = |F(v_i)| and F(v_i) ⊆ A. Figure 1(b) shows an example of an attributed graph where the vertex attributes are given in Figure 1(a).

Given the set of attributes A, we define an attribute set S as a subset of A (S ⊆ A). Moreover, we denote by V(S) ⊆ V the vertex set induced by S (i.e., V(S) = {v_i ∈ V | S ⊆ F(v_i)}) and by E(S) ⊆ E the edge set induced by S (i.e., E(S) = {(v_i, v_j) ∈ E | v_i, v_j ∈ V(S)}). The graph G(S), induced by S, is the pair (V(S), E(S)). We also define a support function σ, which gives the number of occurrences of an attribute set in the input graph (σ(S) = |V(S)|), i.e., the number of vertices that contain S.

The structural correlation function measures the correlation between a given attribute set and the occurrence of dense subgraphs in an attributed graph. We apply quasi-cliques as a definition for dense subgraphs. Quasi-cliques are a natural extension of the traditional clique definition.

DEFINITION 1. (Quasi-clique) Given a minimum density threshold γ_min (0 < γ_min ≤ 1) and a minimum size threshold min_size, a quasi-clique is a maximal vertex set Q such that for each v ∈ Q, the degree of v in Q is at least ⌈γ_min · (|Q| − 1)⌉ and |Q| ≥ min_size.

Figures 1(c) and 1(d) are examples of a 1-quasi-clique of size 4 and a 0.6-quasi-clique of size 6, respectively, from the graph shown in Figure 1(b). The quasi-clique mining problem consists of identifying the quasi-cliques from a graph considering minimum size and density parameters, a problem known to be #P-hard [14, 17]. We define the structural correlation of an attribute set S as the probability of a vertex v with attribute set S to be part of a quasi-clique in G(S).

DEFINITION 2. (Structural correlation function) Given an attribute set S, the structural correlation of S, ε(S), is given as:

$$\epsilon(S) = \frac{|K_S|}{|V(S)|} \qquad (1)$$

where K_S is the set of vertices in quasi-cliques in G(S).

In the graph from Figure 1, K_{A} = {3, 4, 5, 6, 7, 8, 9, 10, 11}, K_{C} = {} and K_{A,B} = {6, 7, 8, 9, 10, 11}, and thus the corresponding values of ε({A}), ε({C}), and ε({A, B}) are 0.82, 0, and 1, respectively. Structural correlation measures the dependence between an attribute set S and the density of the associated vertices. It indicates how likely the vertices that contain S are to be part of dense subgraphs. Our formulation enables the identification of attributes that induce vertices that are well connected in the graph. In a social network, for instance, such attributes are of great interest since they may be related to homophily or influence. Nevertheless, it is also relevant to understand the dense subgraphs induced by attribute sets. We call a structural correlation pattern a quasi-clique that is homogeneous w.r.t. an attribute set.

DEFINITION 3. (Structural correlation pattern) A structural correlation pattern is a pair (S, Q), where S is an attribute set (S ⊆ A), and Q is a quasi-clique from the graph induced by S (Q ⊆ V(S)), given the quasi-clique parameters γ_min and min_size.

The pair ({A}, {3, 4, 5, 6}) is an example of a size 4 structural correlation pattern with density 1 induced by the attribute A in the graph from Figure 1. Another example of a structural correlation pattern is ({A, B}, {6, 7, 8, 9, 10, 11}), which is a size 6 structural correlation pattern with density 0.6 induced by the attribute set {A, B}.

2.1.2 Structural Correlation Pattern Mining Problem

Based on the definitions of structural correlation patterns and of the structural correlation function, we formulate the structural correlation pattern mining problem. It comprises the identification of the attribute sets correlated with dense subgraphs and of the dense subgraphs induced by such attribute sets. We apply a minimum support threshold σ_min for attribute sets in order to prune down the number of patterns.

DEFINITION 4. (Structural correlation pattern mining problem) Given an attributed graph G(V, E, A, F), a minimum support threshold σ_min, a minimum quasi-clique density γ_min and size min_size, and a minimum structural correlation ε_min, structural correlation pattern mining consists of identifying the set of structural correlation patterns (S, Q) from G, such that S is an attribute set for which σ(S) ≥ σ_min and ε(S) ≥ ε_min, and Q is a γ_min-quasi-clique for which Q ⊆ V(S) and |Q| ≥ min_size.

As an example, we consider the attributed graph shown in Figure 1 and the parameters σ_min, γ_min, min_size and ε_min set to 3, 0.6, 4, and 0.5, respectively. The set of structural correlation patterns is shown in Table 1. For each pattern, we give the pair (attribute set, dense subgraph), the respective quasi-clique size and density (γ), and the attribute set support (σ) and structural correlation (ε).
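The definitions above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the vertex attributes are those of Figure 1(a), but the edge list of Figure 1(b) is not reproduced in the text, so the covered sets K_S are copied from the running example rather than mined. The function names (`induced_vertices`, `support`, `structural_correlation`, `satisfies_degree_condition`) are hypothetical and are not taken from the SCPM implementation.

```python
from math import ceil

# Vertex attributes F(v), copied from Figure 1(a).
F = {
    1: {"A", "C"}, 2: {"A"}, 3: {"A", "C", "D"}, 4: {"A", "D"},
    5: {"A", "E"}, 6: {"A", "B", "C"}, 7: {"A", "B", "E"},
    8: {"A", "B"}, 9: {"A", "B"}, 10: {"A", "B", "D"}, 11: {"A", "B"},
}

# Covered vertex sets K_S as stated in the running example (assumed given,
# not computed, because the edges of Figure 1(b) are not available here).
K = {
    frozenset({"A"}): {3, 4, 5, 6, 7, 8, 9, 10, 11},
    frozenset({"C"}): set(),
    frozenset({"A", "B"}): {6, 7, 8, 9, 10, 11},
}


def induced_vertices(S):
    """V(S): vertices whose attribute set contains every attribute in S."""
    return {v for v, attrs in F.items() if set(S) <= attrs}


def support(S):
    """sigma(S) = |V(S)|."""
    return len(induced_vertices(S))


def structural_correlation(S):
    """epsilon(S) = |K_S| / |V(S)| (Equation 1), using the given K_S."""
    return len(K[frozenset(S)]) / support(S)


def satisfies_degree_condition(Q, adj, gamma_min, min_size):
    """Degree and size conditions of Definition 1; maximality is not checked.

    adj maps each vertex to the set of its neighbours.
    """
    Q = set(Q)
    if len(Q) < min_size:
        return False
    needed = ceil(gamma_min * (len(Q) - 1))
    return all(len(adj[v] & Q) >= needed for v in Q)
```

For instance, `support({"A"})` gives 11 and `round(structural_correlation({"A"}), 2)` gives 0.82, matching the values quoted for attribute A. The degree check is kept separate from the coverage sets on purpose: deciding which vertices are covered requires mining quasi-cliques, which this sketch does not attempt.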
    pattern                           size   γ      σ    ε
    ({A}, {6, 7, 8, 9, 10, 11})       6      0.60   11   0.82
    ({A}, {3, 4, 5, 6})               4      1      11   0.82
    ({A}, {3, 4, 6, 7})               4      0.67   11   0.82
    ({A}, {3, 5, 6, 7})               4      0.67   11   0.82
    ({A}, {3, 6, 7, 8})               4      0.67   11   0.82
    ({B}, {6, 7, 8, 9, 10, 11})       6      0.60   6    1.0
    ({A, B}, {6, 7, 8, 9, 10, 11})    6      0.60   6    1.0

    Table 1: Patterns from the graph shown in Figure 1

Similar to quasi-clique mining, structural correlation pattern mining is #P-hard [17]. This is because the quasi-clique mining problem can be reduced to structural correlation pattern mining by assigning the same attribute to each vertex from the graph and setting σ_min to 1.

Structural correlation pattern mining is based on the structural correlation function, which measures how a given attribute set is associated with the occurrence of dense subgraphs in an attributed graph. However, it is important to assess the significance/interestingness of a given structural correlation, which is the subject of the next section.

2.1.3 Statistical Significance of the Structural Correlation

Given the structural correlation of an attribute set, how can we evaluate it? In other words, what can be considered a high or low structural correlation? In this section, we address such questions by proposing null models for structural correlation. These models specify the expected structural correlation of an attribute set assuming that the correlation between vertex attributes and dense subgraphs is random. Normalized structural correlation measures how the structural correlation of an attribute set deviates from its expected value, and allows us to assess the statistical significance of a given structural correlation value.

DEFINITION 5. (Normalized structural correlation) Given an attribute set S with support σ(S) and a function ε_exp, which gives the expected structural correlation of an attribute set based on its support and the attributed graph G, the normalized structural correlation of S is given by:

$$\delta(S, G) = \frac{\epsilon(S)}{\epsilon_{exp}(\sigma(S), G)} \qquad (2)$$

According to Definition 5, the normalized structural correlation function gives how much higher than expected the structural correlation of an attribute set S is. Therefore, it requires the definition of the function ε_exp, which receives the support of S (σ(S)) and the attributed graph G as arguments. By normalizing the structural correlation, we expect to obtain a measure of the correlation of an attribute set S that is independent of its support and the input graph.

We assume that the input graph G comprises the object of interest, i.e., it is the "population" graph. Assume that we are given the attribute set support value σ(S) (independent of the actual attribute set S). To compute the expected structural correlation, our sample space is the set of all vertex subsets of size σ(S) drawn randomly from G. The statistic of interest is the mean structural correlation value, ε_exp. That is, ε_exp is the expected probability that a random vertex in a given sample of size σ(S) is part of a dense subgraph (quasi-clique) in that sample. The quasi-clique parameters, γ_min and min_size, are assumed to be fixed as well.

An intuitive approach for computing ε_exp is through simulation. Given the support σ(S) of the attribute set, a random sample of σ(S) vertices from G is selected. Each vertex from the sample is checked to be in a quasi-clique, according to the quasi-clique parameters. The structural correlation of the sample is the fraction of its vertices that are in at least one quasi-clique. The simulation-based expected structural correlation, sim-ε_exp, is given by the average structural correlation of r random samples.

The simulation-based structural correlation is conceptually very simple but may require a high r to achieve accurate estimates, which is prohibitive in real settings. Thus we also propose an analytical formulation for an upper bound on the expected structural correlation of an attribute set. The idea is that a vertex must have a minimum degree of ⌈γ_min · (min_size − 1)⌉ in order to be a member of a γ_min-quasi-clique of minimum size min_size. Consequently, the probability of a vertex to have a degree of at least ⌈γ_min · (min_size − 1)⌉ in a random subgraph of size σ(S) from G gives an upper bound on the expected structural correlation of S. Theorem 1 relates the degree of a vertex v in G to its degree in a random subgraph G(S) of size σ(S) drawn from G.
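The degree-based argument above can also be explored empirically. The following Python sketch is a hedged illustration, not the simulation procedure of the paper: instead of checking quasi-clique membership, it only estimates the fraction of sampled vertices that reach the minimum degree ⌈γ_min · (min_size − 1)⌉ inside a random sample, which is the quantity the analytical upper bound targets. The function name `degree_proxy_estimate`, the adjacency-dictionary input, and the default values of `r` and `seed` are assumptions made for this example.

```python
import random
from math import ceil


def degree_proxy_estimate(adj, sigma, gamma_min, min_size, r=100, seed=0):
    """Average, over r random samples of sigma vertices, of the fraction of
    sampled vertices whose degree inside the sample is at least
    ceil(gamma_min * (min_size - 1)).

    adj maps each vertex to the set of its neighbours in G.
    This is a degree-based proxy only; no quasi-clique is mined.
    """
    rng = random.Random(seed)           # fixed seed for repeatable runs
    z = ceil(gamma_min * (min_size - 1))
    vertices = sorted(adj)
    total = 0.0
    for _ in range(r):
        sample = set(rng.sample(vertices, sigma))
        # Count vertices that keep enough neighbours inside the sample.
        hits = sum(1 for v in sample if len(adj[v] & sample) >= z)
        total += hits / sigma
    return total / r
```

On a complete graph every sampled vertex keeps σ − 1 neighbours, so the estimate is 1 whenever σ − 1 reaches the threshold, and on a graph without edges it is 0. Because a vertex can meet the degree threshold without belonging to any quasi-clique, values produced this way should be read as optimistic relative to the true expected structural correlation.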
THEOREM1.(Probabilityofavertexthathasadegree inGtohaveadegree inG(S)).IfarandomvertexvfromGwithdegree isselectedtobepartofG(S),theprobabilityofsuchvertextohaveadegree inG(S)isgivenbythefollowingbinomialfunction:F( ; ;)= !: :(1�) � (3)whereistheprobabilityofaspeci cvertexufromGtobeinG(S),ifvisalreadychosen,whichisgivenas:=(S)�1 jVj�1(4)Proofsketch.Thereare verticesadjacenttovinG,thus,theprobabilityofvtohaveadegreeof inG(S)istheprobabilityofselecting outof verticestobepartofG(S).Sincevisalreadyselected,theprobabilityofselectinganyremainingvertexfromGisgivenbyequation4.BasedonTheorem1,wede neanupperboundontheexpectedstructuralcorrelationastheprobabilityofavertextohaveadegreeofatleastd min:(min size�1)einG(S).THEOREM2.(Upperboundontheexpectedstruc-turalcorrelation).Giventhequasi-cliqueparameters minandmin size,thestructuralcorrelationofanattributesetwithsupport(S)isupperboundedby:max-exp((S))=mX =zp( ): X =zF( ; ;)(5)wherez=d min:(min size�1)e,misthemaximumdegreeofavertexfromG,andpisthedegreedistributionofG.Proofsketch.Givenavertexwithdegree inG,theprob-abilityofsuchvertextohaveadegreeofatleastd min:(min size�1)einG(S)isthesumofexpression3overthedegreeintervalfromd min:(min size�1)eto .Ifwemultiplythissumbytheprobabilityofavertexofdegree fromGtobeinG(S),i.e.,p( ),itgivestheprobabil-ityofanyvertexwithdegree fromGtohaveadegreeofatleastd min:(min size�1)einG(S).Equation5isthesumofsuchproductsoverthevertexdegreeshigherthand min:(min size�1)e.TheproposedupperboundontheexpectedstructuralcorrelationofanattributeSisbasedontheexpectedde-greedistributionofarandomgraphofsize(S)fromG.However,thedegreeisnottheonlycriteriaforavertextobepartofaquasi-clique.Verticesthatsatisfytheminimumdegreethresholdmaynotbepartofaquasi-cliqueiftheyareconnectedtolowdegreevertices.Nevertheless,sinceweapplytheproposedformulationinordertonormalizethestructuralcorrelationofattributesetswithdi 
erentsup-ports,ourobjectiveistoprovideafunctionthatpresentsaslopethatissimilartoexpectedstructuralcorrelation.InSection4.1,wecomparetheexpectedstructuralcorrelationcomputedusingsimulationwiththeproposedupperbound.Wecallsimandlbthenormalizedstructuralcorrela-tionfunctionsthatapplytheexpectedstructuralcorrela-tionbasedonsimulationsim-expandthetheoreticalupperboundmax-exp,respectively.Sincemax-expsim-exp,lb=(S) max�exp(S) sim�exp=sim,thus,lbisalowerboundonsim.Itisimportanttonoticethatmax-expismonotonicallynon-decreasing,i.e.,max-exp(1)�max-exp(2)ifandonlyif12.Itfollowsdirectlyfromthefactthattheanalyticalupperbound(Equation5)isbasedonacumula-tivebinomialfunction,whichisknowntobemonotonicallynon-decreasingw.r.t..Wealsoassumethatsim-expismonotonicallynon-decreasingforsucientlyhighvaluesofr,sinceanincreaseinthesizeoftherandomgraphsselectedfromGisnotexpectedtodecreasetheprobabilityof ndingavertexinaquasi-clique.Suchpropertieswillbeexploitedbyourpruningtechniques,whichwillbeproposedfurtherinthispaper(seeSection3.2.1).Weapplythenormalizedstructuralcorrelationintheiden-ti cationofstatisticallysigni cantstructuralcorrelationval-ues.Therefore,weextendthestructuralcorrelationpat-ternminingproblem(De nition4)byaddingaminimumnormalizedstructuralcorrelationthresholdmin.Suchathresholdmayalsobeusefultoimprovetheperformanceofstructuralcorrelationpatternminingalgorithms,aswillbediscussedinSection3.2.Sinceausermaybeinterestedinpatternsthathavehighstructuralcorrelation()aswellasbeingstatisticallysigni cant(),wepresentresultsusingbothregularandnormalizedstructuralcorrelation.2.2RelatedWorkFindingcommunities[6,3]anddensesubgraphs[5,10,8,20]hasbeenanactiveresearchtopic.Acommunityisusuallyde nedassetofverticessigni cantlymorecon-nectedamongthemselvesthanwithverticesoutsideit[3].Ontheotherhand,densesubgraphs,suchascliques[18],arestronglybasedoninternalcohesionandmaximality.Thisworkappliesadensesubgraphde 
nitioncalledquasi-clique,whichisasetofverticeswhereeachvertexiscon-nectedatleasttoafractionoftheothers.[14]introducestheproblemofminingcross-graphquasi-cliques.Theyfurtherstudiedtheproblemofminingfrequentcross-graphquasi-cliques[8].In[20]and[21]theauthorsstudytheproblemofminingfrequentcoherentclosedquasi-cliques.[10]stud-iestheproblemof ndingquasi-cliquesfromasinglegraph,proposingpruningtechniquesforquasi-cliquemining.Graphclusteringanddensesubgraphdiscoverymethodsthatconsidervertexattributesascomplementaryinforma-tionhaveattractedtheinterestoftheresearchcommunityintherecentyears[12,4,22,13].Ageneralassumptionofthesemethodsisthatclustersbasedonboththetopologyofthegraphandtheattributesofverticesaremoremeaningfulthanthosebasedonlyonthetopologyortheattributes.[4]proposestwoecientalgorithmsfortheconnectedk-centerproblem,whichhasasobjectivetopartitionagraphconsid-eringboththeattributesandthetopology.[22]proposesarandomwalk-baseddistancemetricinanaugmentedgraphwhereverticesfromtheoriginalgraphareconnectedtonewverticesthatrepresentvertexattributes.In[12],theauthorsintroducetheproblemofminingcohesivepatterns,whicharedenseconnectedsubgraphswhereverticeshavehomo-geneousattributes(orfeatures).[13]considerstheproblemofcomputingmaximalhomogeneouscliquesinattributedgraphs.Di erentfromthesemethods,structuralcorrela-tionpatternminingdoesnotassumethatvertexattributesarecomplementaryinformation.Infact,weareinterestedin ndingattributesetsthatexplaintheformationofdensesubgraphsthroughcorrelation.Assessinghowvertexattributesarerelatedtothegraph topologyhasledtothede nitionofnewpatterns.[15]proposedtheproblemof ndingitemset-sharingsubgraphs,whichconsistsofextractingsubgraphswithcommonitem-sets.Itisimportanttonoticethatsuchmethoddonotcon-siderthedensityofsubgraphs.[9]de nestheproximitypat-ternmining,whichevaluateshowclosevertexattributesareinthegraph.Aproximitypatternisasetoflabelsthatco-occurinneighborhoods.Therefore,proximitypatternsarenotnecessarilydensesubgraphsorcohesive,di 
erentlyfromstructuralcorrelationpatterns.In[7],theauthorsproposeadi erentde nitionforthestructuralcorrelation,whichcom-parestheclosenessamongverticesinducedbyagivensingleattributeagainstasubgraphwhereattributesarerandomlydistributed.Ourworkdi ersfrom[7]bycombiningmultipleattributesandconsideringaparticulartopologicalpropertywhichistheorganizationintodensesubgraphs.Moreover,besidestheevaluationofstructuralcorrelationofattributesets,weareinterestedinthediscoveryofrelevantdensesub-graphstoberepresentativesofthestructuralcorrelation.In[16],weintroducethestructuralcorrelationpatternminingandpresentanalgorithmforthisproblemcalledSCORP.Inthispaper,westudytheproblemofidentifyingstatisticallysigni cantstructuralcorrelationpatternsbasedonanormalizationofthestructuralcorrelation.WealsopresenttheSCPMalgorithm,whichextendsSCORPwithnewpruningandsearchstrategiesforstructuralcorrelationpatternmining.Di erentfromSCORP,SCPMenumeratesthetopstructuralcorrelationpatternsintermsofsizeanddensityeciently,insteadofthecompletesetofpatterns.3.ALGORITHMS3.1NaiveAlgorithmSincestructuralcorrelationpatternminingcombinesas-pectsofthefrequentitemsetminingandthequasi-cliqueminingproblems,wemaycombineafrequentitemsetmin-ingalgorithmandaquasi-cliqueminingalgorithmintoanaivealgorithmforstructuralcorrelationpatternmining.Thenaivealgorithmsolvesthestructuralcorrelationpat-ternminingproblem(seeDe nition4)by rstenumeratingthesetoffrequentattributesetsFfromGandtheniden-tifyingthesetofquasi-cliquesQfromthegraphinducedbyeachfrequentattributesetSfromF.ThestructuralcorrelationofeachfrequentattributesetSiscomputedbycheckingwhethereachvertexv2V(S)ispartofaquasi-cliqueinQ.Frequentattributesetscanbeidenti 
edusingafrequentitemsetminingalgorithm[1,19].Inthiswork,weapplytheEclatalgorithm[19].Moreover,anyalgorithmforquasi-cliqueminingcanbeappliedbysuchnaivealgorithm.WeapplytheQuickalgorithm[10].Themaindrawbackofthenaivealgorithmisthatitenu-meratesthecompletesetoffrequentattributesetsfromGandthecompletesetofquasi-cliquesfromeachinducedgraphG(S),whereSisafrequentattributeset.Sincethefrequentitemsetminingandthequasi-cliqueminingprob-lemsareknowntobe#P-hard,thenaivealgorithmisex-pectedtonotbeabletoprocesslargeattributedgraphs.Inordertoachievesuchgoal,intheupcomingsections,wedescribeseveralstrategiesforecientstructuralcorrelationpatternmining.Wecombinesuchstrategiesintoanewal-gorithm,whichisdescribedinSection3.2.Furtherinthispaper,wecomparetheperformanceoftheproposedalgo-rithmagainstthisnaivemethod.3.2SCPMAlgorithmThissectionpresentstheSCPM(StructuralCorrelationPatternMining)algorithm,whichappliesseveralstrategiesinordertoenablethestructuralcorrelationpatternmin-inginlargeattributedgraphs.Unlikethenaivealgorithm,SCPMdoesnotenumerateeveryfrequentattributesetbutprunesthoseattributesetsthatcannotsatisfyaminimumstructuralcorrelationthreshold.Moreover,insteadofidenti-fyingeachquasi-cliquefromaninducedgraph,SCPMcheckswhetherverticesareinquasi-cliquesbyverifyingareducednumberofquasi-cliquecandidates.Finally,SCPMreturnsthesetofthetop-kmostrelevantstructuralcorrelationpat-ternsfromtheattributedgraph.3.2.1PruningStrategiesforSCPMiningThissectionpresentspruningtechniquesforstructuralcorrelationpatternmining.Theobjectiveofthesepruningtechniquesistoreducetheexecutiontimeofthestructuralcorrelationpatternminingalgorithmswithoutcompromis-ingitscorrectness.Theorem3allowsthepruningofverticesduringthelevel-wiseenumerationofattributesets.THEOREM3.(Vertexpruningforattributesets).LetKSbethesetofverticesindensesubgraphsinthegraphinducedbyanattributesetS.IfSiSj,thenKSjKSi.Proofsketch.Letssupposethatthereexistsavertexvsuchthatv2KSjandv=2KSi.Sincev2KSj,thereexistsadensesubgraphVV(Sj),suchthatv2V.Moreo
ver,ifv=2KSi,theredoesnotexistanydensesubgraphUV(Si)suchthatv2U.Nevertheless,ifSiSj,thenV(Sj)V(Si),whichimpliesthatVV(Si)(contradiction).BasedonTheorem3,wecanpruneverticesthatarenotindensesubgraphsinthegraphinducedbyagivenattributesetbeforeextendingittogeneratelargerattributesets.At-tributesetscanalsobeprunedbasedonanupperboundonthestructuralcorrelationfunction,asstatedbyTheorem4.THEOREM4.(Attributesetpruningbasedontheupperboundonthestructuralcorrelation).Fortwoat-tributesetsSiandSj,ifSiSjand(Sj)min,then(Sj)(Si):jV(Si)j=minProofsketch.AccordingtoTheorem3,(Si):jV(Si)j(Sj):jV(Sj)j,sinceeveryvertexcoveredbyadensesub-graphinV(Sj)isalsocoveredbyadensesubgraphinV(Si).Moreover,since(Sj)min,(Sj)isupperboundedby(Si):jV(Si)j=minbasedonthede nitionofthestructuralcorrelationfunction(seeDe nition2).GivenanattributesetSi,ofsizei,if(Si):jV(Si)j=minmin,thenSiisnotincludedinthesetofattributesetstobecombinedforthegenerationofsizei+1attributesets.Theorem4guaranteesthattheredoesnotexistanattributesetSj,suchthatSiSjand(Sj)min.Asimilarpruningrulecanbeformulatedbasedonthenormalizedstructuralcorrelationfunctionde nition.THEOREM5.(Attributesetpruningbasedontheupperboundonthenormalizedstructuralcorrelation).FortwoattributesetsSiandSj,ifSiSj,expisamono-tonicallynon-decreasing,and(Sj)min,then(Sj)(Si):jV(Si)j=(exp(min):min)Proofsketch.AccordingtoTheorem4,(Sj)(Si):jV(Si)j=min.Since(Sj)minandexpis Figure2:Setenumerationtree Algorithm1GeneralStructuralCorrelationAlgorithm Require:G(S), min,min sizeEnsure:Q1:Q ;2:X ;3:candExts(X) V(S)4:ApplyvertexpruningincandExts(X)5:qcCands f(X;candExts(X))g6:whileqcCands6=;do7:q qcCands:get()8:Applycandidatequasi-cliquepruninginq9:ifq:X[q:candExts(X)isaquasi-cliquethen10:Q Q[fq:X[q:candExts(X)g11:else12:ifq:Xisaquasi-cliquethen13:Q Q[fq:Xg14:endif15:insertextensionsofqintoqcCands16:endif17:endwhile 
monotonicallynon-decreasing,thenexp((Sj))exp(min).Therefore,(Sj)(Si):jV(Si)j=(exp(min):min).If(Si):jV(Si)j=(exp(min):min)min,theattributesetSi,ofsizei,isnotincludedinthesetofattributesetstobecombinedforthegenerationofsizei+1attributesets.Sincelbgivesalowerboundonthenormalizedstructuralcorrelation,thewholepruningpotentialofTheorem5maynotbeexplored.Nevertheless,theresultsshowthatuseoflbenablessigni cantperformancegains(seeSection4.2).ThepruningstrategystatedbyTheorem3reducesthenumberofverticestobecheckedtobeinquasi-cliquesinthecomputationofstructuralcorrelation.Theorems4and5enablethereductionoftheattributesetsforwhichthestructuralcorrelationiscomputedtoasetthatisexpectedtobesmallerthanthesetoffrequentattributesets.3.2.2ComputingtheStructuralCorrelationAsdiscussedinSection3.1,thenaivealgorithmcomputesthestructuralcorrelationofanattributesetSthroughtheenumerationofthequasi-cliquesfromG(S).Inthissection,wedescribehowthestructuralcorrelationcanbecomputedbyidentifyingareducednumberofquasi-cliquecandidates.Quasi-cliquescanbeenumeratedbasedonavertexsetX,initiallysetas;,andasetofcandidateextensionsofX,candExts(X),initiallysetasV.VerticesaremovedfromcandExts(X)toX,oneatatime,untilthecompletesetofquasi-cliquecandidatesaregenerated.Figure2showsasetenumerationtreethatrepresentsthesearchspaceofquasi-cliquesconsideringasetof4vertices(1-4).Inordertoprunedownsuchsearchspace,quasi-cliqueminingalgo-rithmsapplyseveralpruningtechniques.Wedividethesetechniquesintotwogroups:1.Vertexpruning:Removalofverticesthatcannotbepartofanyquasi-cliqueinGaccordingtothequasi-cliquede nitionandthequasi-cliqueparameters.Ver-texpruningisperformediterativelyoverthegraphinordertominimizethesearchspaceofquasi-cliques.2.Candidatequasi-cliquepruning:Removalofcan-didatequasi-cliques(i.e.,pairs(X,candExts(X)))fromthesearchspaceofquasi-cliques.SuchremovalisbasedonthepropertiesofthesubgraphcomposedbyverticesfromXandcandExts(X).Algorithm1givesageneraldescriptionofhowquasi-cliquesareidenti 
edinthecomputationofstructuralcorre-lation.Thisalgorithmisalsousedasthebasisfortheenu-merationofthetop-kstructuralcorrelationpatterns.ThealgorithmreceivesaninducedgraphG(S),andthemini-mumdensity( min)andsize(min size)forquasi-cliques.Itgivesasoutputasetofquasi-cliquesQfromG.Vertexandquasi-cliquecandidatepruningsareappliedinlines4and8,respectively.Candidatequasi-cliquesaremanagedbythedatastructureqcCands,whichwillbediscussedlater.Eachcandidatepatternischeckedtobealookaheadquasi-clique(i.e.,q:X[q:candExts(X)isaquasi-clique) rst,duetothefactthatquasi-cliquesaremaximal.Incasesuchacondi-tiondoesnothold,q:Xischeckedtobeaquasi-cliqueandtheextensionsofqareinsertedintoqcCands(line15).Thealgorithm nisheswhenqcCandsbecomesempty.ThesetKS,whichiscomposedofverticescoveredbyquasi-cliquesinG(S),canbeobtaineddirectlyfromQ.Sincethequasi-cliqueminingproblemisknowntobe#P-hard,theidenti cationofquasi-cliquesmayrequireprocess-ingalargenumberofquasi-cliquecandidates,whichwouldconstituteanimportantlimitationtothecomputationofthestructuralcorrelationoflargeinducedgraphs.Neverthe-less,computingthestructuralcorrelationdoesnotrequiretheenumerationofthecompletesetofquasi-cliques.Thenecessaryinformationiswhethereachvertexfromthein-ducedgraphiscoveredbyaquasi-cliqueornot.Therefore,candidatequasi-cliquescomposedofverticesalreadyknowntobecoveredbyquasi-cliquescanbeprunedfromthenewquasi-cliquecandidatesgeneratedinline15ofAlgorithm1.Besidespruningcandidatequasi-cliquesthatarealreadyknowntobecoveredbydensesubgraphs,wealsoproposesearchstrategiesforcomputingthestructuralcorrelation.Thesesearchstrategiesdeterminetheorderinwhichcan-didatequasi-cliquesareenumerated.Abreadth- rstsearch(BFS)strategyforcomputingthestructuralcorrelationtra-versesthesearchspaceofquasi-cliquesinabreadth- rstorder,startingfromtherootandvisitingthesmallervertexsetsbeforethelargerones.Ontheotherhand,adepth- 
rstsearch(DFS)strategyextendsvertexsetsasmuchaspossible.TheBFSstrategyisexpectedtoperformbetterincasecoveringverticeswithsmallerquasi-cliquesismoreecientthanwithlargerquasi-cliques.Consideringasetof4vertices,forwhichthesearchspaceofquasi-cliquesisshowninFigure2,theBFSandtheDFSstrategyvisitthequasi-cliquecandidatesasfollows: Algorithm2SCPMAlgorithm Require:G,min, min,min size,min,min,kEnsure:P1:P ;2:T ;3:I frequentattributesfromG4:forallS2Ido5: structuralcorrelationofS6:ifminAND=exp(S)minthen7:Q top-kpatternsfromG(S)8:forallq2Qdo9:P P[(S;q)10:endfor11:endif12:if:(S)min:minAND:(S)min:exp(min):minthen13:T T[S14:endif15:endfor16:P P[enumerate-patterns(T;G;min; min;min size;min;min;k) Algorithm3enumerate-patterns Require:T;G;min; min;min size;min;min;kEnsure:P1:P ;2:forallSi2Tdo3:R ;4:forallSj2Tdo5:ifi�jthen6:S Si[Sj7:if(S)minthen8: structuralcorrelationofS9:ifminAND=exp(S)minthen10:Q top-kpatternsfromG(S)11:forallq2Qdo12:P P[(S;q)13:endfor14:endif15:if:(S)min:minAND:(S)min:exp(min):minthen16:R R[S17:endif18:endif19:endif20:endfor21:P P[enumerate-patterns(R;G;min; min;min size;min;min;k)22:endfor 
BFS: {1}, {2}, {3}, {4}, {1,2}, ..., {1,2,3,4}.
DFS: {1}, {1,2}, {1,2,3}, {1,2,3,4}, {1,3}, ..., {4}.

Quasi-cliques can be enumerated in BFS order by using a queue as the data structure that manages quasi-clique candidates in Algorithm 1. Similarly, a DFS strategy for enumerating quasi-cliques can use a stack to manage candidate patterns. Further in this paper, we evaluate the search strategies presented in this section.

3.2.3 Enumerating Top-k Patterns

As discussed in Section 2.1.2, enumerating structural correlation patterns is a computationally expensive task. In this section, we study how to reduce the cost of enumerating structural correlation patterns by restricting the output set to only the top-k most relevant patterns in terms of size (primary criterion) and density (secondary criterion).

The enumeration of the top-k structural correlation patterns follows the same procedure described in Algorithm 1. We use a DFS strategy in the discovery of the top-k patterns because structural correlation patterns are maximal (see Definition 3). However, since the number of patterns to be discovered is known, the current set of patterns can be applied to prune the search space of new candidates. New candidate quasi-cliques are generated in line 15. In case the current set of top patterns contains k patterns and a candidate pattern p cannot produce a pattern larger than the smallest current top-k pattern t (i.e., |p.X ∪ p.candExts(X)| ≤ |t|), p can be pruned. By updating the set of top-k patterns, the minimum size threshold is increased iteratively. As a consequence, the top-k patterns are enumerated more efficiently than the complete set of patterns from an induced graph.

Algorithm 2 is a high-level description of the SCPM algorithm, which applies the strategies for efficient structural correlation pattern mining presented in this section. The initial set of attributes I is composed of those with a support of at least σ_min (line 3). The structural correlation of each size-one attribute set S ∈ I is computed as described in Section 3.2.2. In case the structural correlation of S satisfies the minimum structural correlation (ε_min) and normalized structural correlation (δ_min) thresholds, the top-k patterns induced by S are identified using the algorithm described in this section (line 7). These patterns are included in a set of patterns P that will be given as output. The pruning rules for attribute sets based on ε and δ (see Section 3.2.1) are applied in line 12. Pruned attributes are not included in the set of attributes T to be extended. These attributes are extended by the function enumerate-patterns (line 16).

Algorithm 3 describes the function enumerate-patterns. It receives the same input parameters as SCPM, and also the set of patterns to be extended T. It returns the set of top-k patterns (S, V) that have attribute sets extended from those in T regarding the input parameters. New attribute sets are extended through the union of existing ones (line 6). Attribute sets are traversed in a DFS order (e.g., {A}, {A,B}, {A,B,C}, ..., {E}). The enumerate-patterns function is similar to Algorithm 2, except that each new attribute set S is checked to satisfy the minimum support threshold σ_min (line 7). All valid attribute sets are generated through recursive calls to enumerate-patterns (line 21).

4. EXPERIMENTAL RESULTS

This section presents case studies on structural correlation pattern mining using real datasets. Moreover, we evaluate the performance and study the sensitivity of important input parameters of SCPM. Experiments were executed on a 16-core Intel Xeon 2.4 GHz with 50 GB of RAM. The implementations are available as open source (http://code.google.com/p/scpm/).

4.1 Case Studies

4.1.1 DBLP

In the attributed graph extracted from the DBLP digital library (http://www.informatik.uni-trier.de/~ley/db), each vertex represents an author and two authors are connected if they have co-authored a paper. The attributes of authors are terms that appear in the titles of papers authored by them (stemming and removal of stop words were applied). In the DBLP dataset an attribute set defines a topic (i.e., set of terms that carry a specific meaning in the literature) and a dense subgraph is a community. The DBLP dataset has 108,030 vertices, 276,658 edges and 23,285 attributes. Table 2 shows the top 10 attribute
sets w.r.t. support (σ), structural correlation (ε), and normalized structural correlation (δ_lb). The minimum size (min_size) and density (γ_min) parameters were set to 10 and 0.5, respectively. The minimum support threshold (σ_min) was set to 400 and we considered only attribute sets of size at least 2. The parameters used in our case studies were selected empirically.

Figure 3: Examples of results from the DBLP dataset. (a) Graph induced by {search, rank}. (b) Pattern induced by {perform, system}.

Table 2: DBLP - Top support (σ), str. correlation (ε), and normalized str. correlation (δ_lb) attribute sets.

Top-σ attribute sets
S                            σ      ε      δ_lb
base system                  5492   0.04   14.0
base us                      5421   0.04   13.5
base model                   4852   0.03   13.3
model us                     4168   0.03   21.0
system us                    3989   0.05   36.8
base network                 3774   0.05   41.8
model system                 3460   0.02   21.7
base data                    3452   0.07   71.6
base imag                    3424   0.02   17.6
imag us                      3345   0.02   19.6

Top-ε attribute sets
S                            σ      ε      δ_lb
grid applic                  840    0.26   41577
grid servic                  599    0.23   154703
environ grid                 525    0.21   256793
queri xml                    615    0.21   123533
search web                   1031   0.20   13738
search rank                  420    0.19   635349
dynam simul                  469    0.19   383169
queri data                   1540   0.19   2758
chip system                  702    0.19   63351
data stream                  1073   0.18   10653

Top-δ_lb attribute sets
S                            σ      ε      δ_lb
search rank                  420    0.19   635349
perform file                 404    0.14   555067
structur index               404    0.14   555067
search mine                  413    0.14   490932
us xml                       400    0.11   442638
search web data              424    0.14   431589
base search analysi          414    0.12   416385
model internet               401    0.10   406059
process data databas         416    0.12   405363
perform distribut parallel   416    0.11   388818

Figure 4: DBLP - Expected ε computed by the simulation (sim-exp) and analytical (max-exp) models.

Top-σ attribute sets present a low correlation with the formation of dense subgraphs in the DBLP dataset. Such terms are popular in paper titles, but do not carry much knowledge regarding the formation of research communities. On the other hand, the top-ε attribute sets may be more easily associated with known topics in computer science. The attribute set {grid, applic} has the highest structural correlation (0.26), i.e., 26% of the authors that have the keywords "grid" and "applic" are inside a community of researchers of size at least 10 where each of them has collaborated with at least half of the other members. It is interesting to point out that the graph induced by {grid, applic} has more vertices in dense subgraphs than the graph induced by {base, system}, though {base, system} is more than 6 times more frequent than {grid, applic}. In general, high support attribute sets do not present high structural correlation.

Figure 4 shows the expected structural correlation for different support values in the DBLP dataset. The input parameters are the same as those used to generate the results shown in Table 2. For the simulation model, we executed 1000 simulations for each support value and also show the standard deviation of the estimated expected structural correlation. The analytical upper bound is not tight w.r.t. the simulation results, but presents a similar growth, which shows that it enables accurate comparisons between the structural correlation of attribute sets.

Based on the proposed analytical model, the third block of Table 2 shows the top attribute sets in terms of analytical normalized structural correlation (δ_lb). The attribute set {search, rank} has the highest normalized structural correlation (635,349), i.e., the structural correlation is 635,349 times the upper bound on its expected structural correlation given by the analytical model. Figure 3(a) presents the graph induced by {search, rank}. Vertices contained in a dense subgraph are indicated. Dense subgraphs cover the densest components of the induced graph. In general, top-σ attribute sets have low δ_lb when compared to the top-δ_lb attribute sets. Moreover, high values of ε do not necessarily lead to high values of δ_lb. Figure 3(b) shows the largest structural correlation pattern in terms of number of vertices from DBLP, which represents two important interconnected research groups on high performance systems.

4.1.2 LastFm

LastFm (http://www.last.fm) is an online social music network. We use a sample of the LastFm users crawled through an API provided by LastFm. In the LastFm network, vertices represent users and edges represent friendships. The attributes of a vertex are the artists the respective user has listened to. An attribute set in the LastFm dataset represents, in a more general interpretation, a musical taste (i.e., set of artists) and a dense subgraph is a community. The LastFm dataset contains 272,412 vertices, 350,239 edges, and 3,929,101 attributes. Table 3 shows the top 10 attribute sets in terms of support (σ), structural correlation (ε) and normalized structural correlation (δ_lb) discovered from LastFm. The minimum size (min_size) and density (γ_min) parameters were set to 5 and 0.5, respectively. The minimum support threshold (σ_min) was set to 27,000.

Figure 5: Examples of results from the LastFm dataset. (a) Graph induced by {SStevens, Wilco}. (b) Pattern induced by {Van Morrison}.

Table 3: LastFm - Top support (σ), str. correlation (ε), and normalized str. correlation (δ_lb) attribute sets.

Top-σ attribute sets
S                                 σ        ε      δ_lb
Radiohead                         121892   0.11   0.37
Coldplay                          118053   0.09   0.33
Beatles                           109037   0.09   0.36
RPeppers                          105984   0.09   0.35
Nirvana                           100604   0.07   0.31
TKillers                          96305    0.07   0.32
Muse                              94382    0.07   0.33
Oasis                             87875    0.06   0.30
FFighters                         87001    0.06   0.33
PFloyd                            86807    0.07   0.34

Top-ε attribute sets
S                                 σ        ε      δ_lb
Radiohead                         121892   0.11   0.37
Coldplay                          118053   0.09   0.33
Beatles                           109037   0.09   0.36
RPeppers                          105984   0.09   0.35
Metallica                         83587    0.08   0.41
DCforCutie                        82025    0.07   0.41
Beck                              83360    0.07   0.40
Muse                              94382    0.07   0.33
Nirvana                           100604   0.07   0.31
TheShins                          68480    0.07   0.50

Top-δ_lb attribute sets
S                                 σ        ε      δ_lb
SStevens, Wilco                   28798    0.04   1.14
SStevens, OfMontreal              28621    0.04   1.13
Beirut                            27605    0.04   1.11
SStevens, Decemberists, Beatles   27415    0.04   1.11
NHotel, SStevens                  29260    0.04   1.10
SStevens, FLips, Beatles          27571    0.04   1.09
ACollective                       33555    0.05   1.09
BSScene, NMHotel                  27308    0.04   1.09
Radiohead, Spoon, SStevens        27113    0.04   1.06
NHotel, Radiohead, Beatles        28776    0.04   1.04

Figure 7: LastFm - Expected ε computed by the simulation (sim-exp) and analytical (max-exp) models.

In general, the top-ε attribute sets are the most frequent ones. However, such attribute sets present low normalized structural correlation. In other words, although these attributes are frequent and have several vertices covered by communities, this coverage is not much higher than expected. Considering the normalized structural correlation, which takes into account the expected structural correlation of an attribute set, the top patterns change significantly. Figure 7 shows the expected structural correlation for support values varying from 20,000 to 100,000. Each simulation-based expected structural correlation value corresponds to an average of 100 simulations. The top-δ_lb attribute set {SStevens, Wilco} includes the American singer and songwriter Sufjan Stevens and the American band Wilco.

Figure 5(a) shows the graph induced by the attribute set {SStevens, Wilco}. For clarity, we removed vertices with degree lower than 2. By visualizing vertices inside and outside structural correlation patterns, we can understand how the structural correlation captures the relationship between attributes and dense subgraphs. The largest structural correlation pattern found is presented in Figure 5(b). It represents a community of 34 users who have listened to the Northern Irish singer and songwriter Van Morrison. Vertex identifiers are not shown due to privacy issues.

4.1.3 CiteSeer

CiteSeerX (http://citeseerx.ist.psu.edu) is a scientific literature digital library and search engine. We built a citation graph from CiteSeerX as of March of 2010. In the CiteSeer graph, papers are represented by vertices and citations by undirected edges. Each paper has as attributes terms extracted from its abstract (stemming and stop-word removal were applied). Attribute sets represent topics and dense subgraphs define groups of related work in the CiteSeer graph. The CiteSeer dataset has 294,104 vertices, 782,147 edges, and 206,430 attributes. The parameter setting applied in this case study is σ_min = 2000, min_size = 5, and γ_min = 0.5.

Figure 6: Examples of results from the CiteSeer dataset. (a) Graph induced by {node, wireless}. (b) Pattern induced by {perform, system}.

Table 4: CiteSeer - Top support (σ), str. correlation (ε), and normalized str. correlation (δ_lb) attribute sets.

Top-σ attribute sets
S                    σ       ε      δ_lb
system paper         57906   0.16   0.77
base paper           56566   0.10   0.47
paper result         47516   0.08   0.45
paper model          43929   0.09   0.59
us paper             43573   0.05   0.32
system base          42079   0.09   0.63
approach paper       38690   0.05   0.40
perform paper        37349   0.13   1.04
paper propos         37243   0.06   0.46
paper algorithm      37027   0.12   0.95

Top-ε attribute sets
S                    σ       ε      δ_lb
network sensor       3276    0.47   108.7
network hoc          2744    0.47   141.2
ad network hoc       2725    0.44   134.6
network rout         5084    0.41   48.0
network wireless     5242    0.40   44.7
node wireless        2086    0.35   164.4
protocol rout        2134    0.35   157.6
ad network           3563    0.34   69.3
program logic        5895    0.33   31.2
memori cach          2150    0.32   143.8

Top-δ_lb attribute sets
S                    σ       ε      δ_lb
node wireless        2086    0.35   164.4
protocol rout        2134    0.35   157.6
memori cach          2150    0.32   143.8
network hoc          2744    0.47   141.2
protocol wireless    2048    0.29   138.7
ad network hoc       2725    0.44   134.7
network node rout    2075    0.25   118.3
optim queri          2094    0.26   118.2
perform instruct     2111    0.25   115.95
paper ad network     2081    0.23   108.86

Table 4 shows the top structural correlation attribute
sets w.r.t. σ, ε, and δ_lb discovered. Top-σ attribute sets present low structural correlation and normalized structural correlation when compared to the top-ε and top-δ_lb attribute sets, respectively. Moreover, similar to the DBLP dataset, while the top-σ attribute sets from the CiteSeer dataset are generic terms, the top-ε and top-δ_lb attribute sets may be easily associated with known research topics (e.g., computer networks, query optimization).

Figure 9 shows the expected structural correlation for different support values in CiteSeer. The attribute set {node, wireless} has the highest normalized structural correlation (δ_lb = 164.40). Figure 6(a) shows the graph induced by the attribute set {node, wireless} in CiteSeer. Figure 6(b) presents the largest structural correlation pattern discovered in the CiteSeer dataset. Vertex labels are the initials of paper titles. The papers included in the pattern cover topics such as caching, memory management, computer networks, processor design, and instruction level optimization (e.g., Attribute Caches, Systems for Late Code Modification, Limits of Instruction Level Parallelism, Link-time Optimization of Address Calculation on a 64-bit Architecture). We do not show the full list of paper titles due to space limitations.

Figure 9: CiteSeer - Expected ε for the simulation (sim-exp) and analytical (max-exp) models.

4.2 Performance Evaluation

This section evaluates the performance of the structural correlation pattern mining algorithms. The dataset used is a smaller version of the DBLP dataset (SmallDBLP), which has 32,908 vertices, 82,376 edges, and 11,192 attributes. SCPM-BFS and SCPM-DFS are versions of the SCPM algorithm using the BFS and DFS strategies, respectively. The Naive algorithm enumerates the complete set of quasi-cliques from the induced graphs, as described in Section 3.1. We vary each parameter of the algorithms keeping the others constant. Default values for γ_min, min_size, and σ_min are 0.5, 11, and 100. Moreover, ε_min, δ_min, and k are set to 0.1, 1, and 5, respectively, unless stated otherwise.

Figure 8: Performance evaluation. (a) Runtime x γ_min. (b) Runtime x min_size. (c) Runtime x σ_min. (d) Runtime x ε_min. (e) Runtime x δ_min. (f) Runtime x k.

Figures 8(a), 8(b), and 8(c) show the runtime of the algorithms varying the values of γ_min, min_size, and σ_min, respectively. In general, SCPM-DFS achieves the best results, being up to 3 orders of magnitude faster than the Naive algorithm. Moreover, SCPM-BFS performs better than the Naive algorithm in all the experiments. In terms of the ε_min (Figure 8(d)) and δ_min (Figure 8(e)) parameters, both SCPM-BFS and SCPM-DFS apply the pruning techniques described in Section 3.2.1. Based on the results shown in Figures 8(d) and 8(e), we can notice that such techniques lead to significant performance gains when the values of ε_min and δ_min are increased.

In Figure 8(f), we show the runtime of SCPM-DFS and the Naive algorithm for different values of k. The results of SCPM-BFS are omitted because both SCPM-BFS and SCPM-DFS apply the same strategy for identifying the top-k structural correlation patterns (see Section 3.2.3). The inset also shows the execution time of SCPM-DFS using a linear scale for the y-axis, to show the effect of k on the runtime more clearly. The results show that for low values of k, SCPM-DFS is able to achieve low running times, outperforming the Naive algorithm significantly.

4.3 Parameter Sensitivity and Setting

We now assess how different input parameters affect the output of structural correlation pattern mining. Our objective is to provide guidelines for setting the parameters of SCPM. Figure 10 shows the average structural correlation and normalized structural correlation of the complete output (global) and the top-10% attribute sets from the SmallDBLP dataset varying the γ_min, min_size, and σ_min parameters. Default values for γ_min, min_size, and σ_min are 0.5, 10 and 100. The results show that more restrictive quasi-clique parameters (i.e., high values of γ_min and min_size) reduce the average ε but may increase δ, since dense subgraphs become less expected. Moreover, high values of σ_min are related to high values of structural correlation. However, such attribute sets also present high values of ε_exp, leading to low values of normalized structural correlation.

Figure 10: Parameter sensitivity. (a) ε x γ_min. (b) ε x min_size. (c) ε x σ_min. (d) δ x γ_min. (e) δ x min_size. (f) δ x σ_min.

SCPM is an exploratory pattern mining method, and thus reasonable values for the different parameters can be obtained by searching the parameter space. The minimum density parameter, γ_min, and the minimum quasi-clique size, min_size, will depend on the application. For σ_min, a useful guideline is to select values that produce a significant expected structural correlation. Infrequent attribute sets may not be expected to induce any dense subgraph. The other parameters (ε_min, δ_min, and k) are intended to speed up the algorithm and must be set according to the available computational resources and time.

5. CONCLUSIONS

In this paper, we studied the problem of correlating vertex attributes and dense subgraphs in large attributed graphs. The concept of structural correlation, which measures how an attribute set induces dense subgraphs in an attributed graph, was proposed. We also presented normalization approaches that compare the structural correlation of a given attribute set against its expected value, which provides a measure of the statistical significance of the structural correlation. In order to enable the analysis of large databases, we introduced search and pruning strategies for structural correlation pattern mining. We also proposed an algorithm for the identification of the top structural correlation patterns, which are the largest and densest subgraphs induced by a given set of attributes. The patterns and algorithms proposed were applied to three real datasets. The attribute sets and patterns found represent relevant knowledge in terms of the correlation between attributes and dense subgraphs.

Acknowledgements: This work was supported by CNPQ, CAPES, Fapemig, FINEP, InWeb, NSF award EMT-0829835, and NIH award 1R01EB0080161. We would like to thank Alberto Laender and Loïc Cerf for their comments.
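As a supplementary illustration of the top-k pruning rule of Section 3.2.3, the sketch below keeps the k best patterns in a min-heap ordered by size and then density, and discards a candidate whose vertex set plus candidate extensions cannot exceed the smallest kept pattern. The class `TopK`, its method names, and the use of a non-strict comparison in the pruning test are assumptions made for this example rather than details taken from the paper's implementation.

```python
import heapq
from typing import Iterable, List, Tuple

# (size, density, sorted vertices): tuples compare by size first, then density.
Entry = Tuple[int, float, Tuple[int, ...]]


class TopK:
    """Keeps the k best patterns, ranked by size (primary) and density (secondary)."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.heap: List[Entry] = []  # min-heap: heap[0] is the weakest kept pattern

    def can_prune(self, X: Iterable[int], cand_exts: Iterable[int]) -> bool:
        """True if a candidate (X, candExts(X)) cannot beat the smallest kept pattern."""
        if len(self.heap) != self.k:
            return False  # fewer than k patterns so far: nothing can be pruned
        upper_bound = len(set(X) | set(cand_exts))
        return upper_bound <= self.heap[0][0]  # assumed non-strict comparison

    def offer(self, vertices: Iterable[int], density: float) -> None:
        """Insert a discovered pattern, evicting the weakest one if the set is full."""
        vs = tuple(sorted(vertices))
        entry: Entry = (len(vs), density, vs)
        if len(self.heap) != self.k:
            heapq.heappush(self.heap, entry)
        elif entry > self.heap[0]:
            heapq.heapreplace(self.heap, entry)


top = TopK(k=2)
top.offer({1, 2, 3}, 1.0)
top.offer({4, 5, 6, 7}, 0.75)

# With two patterns kept, the implicit minimum size threshold is 3.
pruned_small = top.can_prune({8, 9}, {10})          # at most 3 vertices
pruned_large = top.can_prune({8, 9}, {10, 11, 12})  # up to 5 vertices

# A larger pattern evicts the size-3 one, raising the threshold to 4.
top.offer({8, 9, 10, 11, 12}, 0.5)
kept_sizes = sorted(entry[0] for entry in top.heap)
print(pruned_small, pruned_large, kept_sizes)
```

Each accepted pattern can only raise the size of the weakest kept entry, which mirrors the statement that the minimum size threshold increases iteratively as the top-k set is updated.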