
In press: The American Statistician, August 19, 2002 (v1.5)

Corrgrams: Exploratory displays for correlation matrices

Michael Friendly
York University

Correlation and covariance matrices provide the basis for all classical multivariate techniques. Many statistical tools exist for analyzing their structure, but, surprisingly, there are few techniques …

When we go beyond a relatively small number of variables, it becomes progressively more difficult to show all the data directly. The main approach, as indicated above, has been the application of dimension-reduction techniques. Here, we consider techniques to display the pattern of relations among a possibly large set of variables directly, in terms of their correlations. To do so in a comprehensible way, even for a moderately large number of variables, requires some schematic visual summary: an effective visual thinning, as in the boxplot (Tukey, 1977), which sacrifices detail in the middle to provide more essential information on univariate shape, center, spread and outliers.

In this paper we focus on techniques to display the pattern of correlations in terms of their signs and magnitudes using visual thinning and correlation-based variable ordering. Some of the specific ideas and techniques we illustrate have been suggested before, and some are novel. The main contributions of this paper are to integrate these methods within a coherent framework based on the principles of correlation rendering and correlation ordering, with details, comparisons, and software.

In particular, we compare a variety of visual encodings for schematic rendering of bivariate relations among quantitative variables and illustrate the perceptual differences among them for various data-analytic tasks. We also introduce a new method for arranging the variables in such displays so that the pattern of relations among variables may be more easily discerned. Finally, extensions of this framework lead directly to useful new displays for exploring conditional independence and partial independence.

For simplicity, we consider the case for variables, Y, assumed to be at least approximately multivariate normal, so that a correlation is a reasonable numerical summary, and expressed in standardized form, so that we may focus on a correlation matrix rather than a covariance matrix.

Section 2 describes several methods for visually encoding a correlation value to show both its sign and magnitude, with the goal of depicting the pattern of relations among variables in a potentially large matrix of correlations. Section 3 describes and illustrates a method for re-ordering the variables in a correlation matrix so that “similar” variables are positioned adjacently, in order to make such patterns more apparent. In Section 4, we extend these ideas to displays designed to show conditional and partial independence relations among variables. Section 5 describes some related methods, while Section 6 describes software implementing our procedures and some others. Section 7 presents some conclusions and questions for further work on this topic.

2 Correlation rendering

A matrix of correlations can be displayed schematically in a variety of forms: as numbers, shaded squares, bars, ellipses, or as circular ‘pac-man’ symbols, as shown in Figure 1. These schemes all attempt to show both the sign and magnitude of the correlation value, using a color mapping of two hues in varying lightness (Cleveland, 1993), where the intensity of color increases uniformly as the correlation value moves away from 0. Color (blue for positive values, red for negative values) is used to encode the sign of the correlation, but the renderings are designed so that the sign may still be discerned when reproduced in black and white.

In the shaded row, each cell is shaded blue or red depending on the sign of the correlation, and with the intensity of color scaled 0–100% in proportion to the magnitude of the correlation. (Such scaled colors are easily computed using RGB coding from red, through white, to blue. For simplicity, we ignore the non-linearities of color reproduction and perception, but note that these are easily accommodated in the color mapping function.) White diagonal lines are added so that the direction of the correlation may still be discerned in black and white. This bipolar scale of color was chosen to leave correlations near 0 empty (white), and to make positive and negative values of equal magnitude approximately equally intensely shaded. Gray scale and other color schemes are implemented in our software (Section 6), but not illustrated here.
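The red-through-white-to-blue scaling described for the shaded encoding can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAS implementation: the function name `corr_to_rgb` is hypothetical, and pure red (1, 0, 0) and pure blue (0, 0, 1) are assumed as the end points because the exact RGB codes are not given in the text. Like the paper, the sketch ignores non-linearities of color reproduction.

```python
def corr_to_rgb(r):
    """Map a correlation r in [-1, 1] to an (R, G, B) triple with channels in [0, 1].

    Assumed end points (not from the paper): r = -1 is pure red,
    r = 0 is white, r = +1 is pure blue. Intensity grows linearly with |r|.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    fade = 1.0 - abs(r)          # 1 at r = 0 (white), 0 at |r| = 1 (full color)
    if r >= 0:
        return (fade, fade, 1.0)  # white -> blue for positive values
    return (1.0, fade, fade)      # white -> red for negative values
```

A perceptually corrected scale could be substituted by transforming `fade` before it is used, which is the kind of accommodation the text alludes to.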
Figure 1: Some renderings for correlation values. (Horizontal axis: correlation value × 100, from −100 to 95.)

The bar and circular symbols also use the same scaled colors, but fill an area proportional to the absolute value of the correlation. For the bars, negative values are filled from the bottom, positive values from the top. The circles are filled clockwise for positive values, anti-clockwise for negative values. The ellipses have their eccentricity parametrically scaled to the correlation value (Murdoch and Chow, 1996). Perceptually, they have the property of becoming visually less prominent as the magnitude of the correlation increases, in contrast to the other glyphs.

We use these iconic encodings to display the pattern of correlations among variables in the entire matrix, as shown in Figure 2, which depicts the matrix of correlations among 11 measures of performance and salary for 263 baseball players in the 1986 season (from the 1988 Data Expo at the ASA meetings, as corrected by Hoaglin and Velleman (1994); see http://lib.stat.cmu.edu/data-expo/1988.html). To illustrate the differences among these encodings, we have used shading for the lower triangle, and circles for the upper triangle. The diagonal cells, which have values of 1.0, are intentionally left empty. The interpretation for this example, and the method used to order the variables, are described in Section 3.

Figure 2: Corrgram for Baseball data (panel title: “Baseball data: PC2/1 order”). Variables ordered by vector angles from Figure 3. Lower triangle: correlations shown by color and intensity of shading; upper: circle symbols.

The choice of visual representation for graphics always depends on the task to be carried out by the viewer. From Figure 1 and Figure 2 we note that it appears easiest to “read” the numerical value from the number itself, next from the circular symbols, then from the ellipses and the bars, and last for the pure shadings. For exploratory visualization, where the task is to detect patterns of relations, and anomalies, this ordering may well be reversed, from shaded boxes as the best to numerical values as the worst.

(Footnote: The order of the pies and bars may be up for grabs, but we put our money on the much-maligned Camembert, when the purpose is to be able to say “which is more,” or estimate the correlation value.)

Other forms of encoding may also be useful, or those shown here may be enhanced for certain purposes. For example, it is straightforward to add visual indications of the significance level, or of the value of a correlation required for significance. We do not consider these extensions here, because our emphasis is on exploratory display.

3 Correlation ordering

For exploratory visualization, the task of detecting patterns of relations, trends, and anomalies is made considerably easier when “similar” variables are arranged contiguously and ordered in a way that simplifies the pattern of relations among variables. This is an instance of a simple general principle, called “effect-ordered data display” (Friendly and Kwan, 2002), which says simply that in any data display (table or graph), unordered factors or variables should be ordered according to what we wish to show or see. This principle extends the idea of “main effect ordering” (e.g., Cleveland (1993)), that is, sorting quantitative, multi-way data by means or medians, and is grounded in the perceptual ideas of similarity and grouping which stem from Gestalt psychology.

Of course, for correlations, there are many ways of specifying what we mean by “similar”. Several variations have interpreted this criterion in terms of a clustering, typically hierarchical, of the variables whose correlations are displayed. These procedures induce only a partial-ordering on the variables: variables within clusters are contiguous, but those within clusters at any level may be permuted in any order with the same visual interpretation. (Gruvaeus and Wainer (1972) provide a method to make the ordering of variables unique, but this method is ad hoc and not necessarily optimal.)

Here, we take a different tack and opt for a slightly stronger criterion: doing a reasonable job of placing the variables in a well-defined optimal unidimensional order. We confine ourselves here to methods based on the eigenvalues and eigenvectors of the correlation matrix, $\mathbf{R}$, or some function of it; we consider only methods based on the eigenvectors associated with the largest eigenvalues. Friendly and Kwan (2002) show that this approach provides solutions for a wide range of visualization methods. Second, we do this with the hope that this approach may be more useful in some
cases, and give results which should not be substantively different from those obtained by the weaker clustering interpretation.

When the structure of correlations is well-described by a single, dominant dimension (as in a uni-dimensional scale or a simplex), ordering variables according to their positions on the first eigenvector, $\mathbf{e}_1$, of the correlation matrix, $\mathbf{R}$, will suffice. Geometrically, this implies that all variable-vectors are contained within a segment of the variable space, and all (or most) correlations are positive or near zero.

This is not usually the case, and in general, more satisfactory solutions are obtained by ordering variables according to the angles formed by the first two (or three) eigenvectors (principal components). For example, Figure 3 plots the first two eigenvectors of the correlation matrix among variables in the baseball data. Dimension 1 relates mostly to measures of batting performance, while Dimension 2 relates to two measures of fielding performance and to longevity in the major leagues. However, the lengths of the projections on these dimensions are determined by the adequacy (percent of variance) of the two-dimensional representation. On the other hand, the (cosines of) angles between vectors approximate the correlations between these variables, and so an ordering based on the angular positions of these vectors naturally places the most similar variables contiguously. Friendly and Kwan (2002) refer to this as “correlation ordering,” a general method for arranging variables in multivariate data displays.

Figure 3: Eigenvector plot for Baseball data. Each variable is represented by a vector whose end point is the coordinates of the first two eigenvectors.

In Figure 2 the variables have been arranged in the angular order of the eigenvectors from Figure 3. More precisely, the order of the variables is calculated from the order of the angles,

$$\alpha_i = \begin{cases} \tan^{-1}(e_{i2}/e_{i1}), & e_{i1} > 0 \\ \tan^{-1}(e_{i2}/e_{i1}) + \pi, & \text{otherwise,} \end{cases} \qquad (1)$$

where $\mathbf{e}_1, \mathbf{e}_2$ are the eigenvectors associated with the largest two eigenvalues. This circular order is unfolded to a linear order by splitting at the largest gap between adjacent vectors. Falissard (1996) describes a method for representing the variables in a correlation matrix on a unit sphere, using the first three principal components (PCs).

3.1 Examples

We continue the analysis of the baseball data and illustrate these methods with additional data on characteristics of automobile models. Both examples are illustrated in other forms in the corrgram extensions in Section 4.

3.1.1 Baseball data

Figure 4 compares an arbitrary, alphabetic ordering of variables with ordering based on the angles of the first two PCs for the baseball data using the shaded encoding. In the left panel (ordered alphabetically), it is difficult to see any overall pattern of relations among these variables, despite the fact that most correlations are positive, and the relations here are fairly simple. The right panel shows clearly that (a) Assists and Errors stand out as a separate cluster, (b) RBIs, Walks, Runs, Hits and Atbats form a relatively homogeneous grouping with high positive correlations, (c) Putouts has weaker positive correlations with these, and (d) there are a few correlations which stand out as higher or lower than their neighbors. (The variable Years actually has a non-linear relation with (log) Salary, and is better represented in a linear model as a piecewise linear function, $\min(\text{Years}, 7)$, which is linear up to seven years and flat thereafter. Similarly, a number of the counted variables, such as Hits, Runs, Homer, etc., are better represented on a square-root scale. These transformations do not affect the general nature of the interpretations drawn here.)

In this case, these observations could arguably be made more easily from the eigenvector display in Figure 3. For larger or more complex data sets, the corrgram may have some advantage for exploratory purposes, because it shows all the correlations, rather than just a low-dimensional summary.
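The correlation-ordering rule of Section 3 (angles of the variable vectors in the plane of the first two eigenvectors, unfolded at the largest gap) can be sketched as follows. This is a minimal Python/NumPy illustration, not the paper's SAS code; the function name `correlation_order` is hypothetical. It uses `arctan2`, which is assumed here as a convenient substitute for a quadrant-corrected arctangent of the coordinate ratio: both give the same circular order, and the split at the largest gap then gives the same linear order. Because the sign of an eigenvector is arbitrary, the result is only defined up to reversal.

```python
import numpy as np

def correlation_order(R):
    """Order variables by angle in the plane of the two leading eigenvectors of R."""
    R = np.asarray(R, dtype=float)
    # eigh returns eigenvalues in ascending order for a symmetric matrix
    _, vecs = np.linalg.eigh(R)
    e1, e2 = vecs[:, -1], vecs[:, -2]         # eigenvectors of the two largest eigenvalues
    angles = np.arctan2(e2, e1)               # angle of each variable vector, in (-pi, pi]
    circular = np.argsort(angles)             # variables in circular (angular) order
    sorted_angles = angles[circular]
    # gaps between neighbours around the circle, including the wrap-around gap
    gaps = np.diff(np.append(sorted_angles, sorted_angles[0] + 2 * np.pi))
    start = (int(np.argmax(gaps)) + 1) % len(circular)
    # unfold the circle into a line by starting just after the largest gap
    return np.roll(circular, -start)

# Toy example (made-up correlations): variables 0,1 and 2,3 form two groups.
R_demo = np.array([[ 1.0,  0.8, -0.3, -0.2],
                   [ 0.8,  1.0, -0.1, -0.4],
                   [-0.3, -0.1,  1.0,  0.7],
                   [-0.2, -0.4,  0.7,  1.0]])
order = correlation_order(R_demo)
```

In the toy example the two groups end up contiguous in `order`; which group comes first depends on the arbitrary eigenvector signs.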
Figure 4: Corrgrams for Baseball data. Each correlation is shown by color and intensity of shading. Left: variables in alphabetic order; right: variables ordered by angles of first two eigenvectors.

The baseball data set actually contains performance statistics for the 1986 season, and similar measures for the player’s career (whose names have a ‘c’ suffix). Figure 5 shows the corrgram for all 19 variables (including season and career batting averages, calculated from Hits and Atbats). Here we see, among other things, that: (a) the career hitting statistics are all nearly uniformly highly positively correlated, and, not surprisingly, highly correlated with Years, (b) Salary (logSal) is most highly related to the career totals (but it turns out that Years is an efficient proxy for most of these), (c) the season and career batting average statistics have a moderately strong correlation, but are weakly associated with most other variables, (d) the fielding variables, Putouts, Assists, and Errors (all seasonal) form a separate group, with weak correlations to most other variables (perhaps the correlation between Assists and Errors stands out).

Figure 5: Corrgram for Baseball data, using season and career variables (panel title: “Baseball data: All variables”).

3.1.2 Auto data

Figure 6 shows a corrgram of data on 74 automobile models from the 1979 model year (Chambers et al., 1983, pp. 352–355). The variables are various physical measures (gear ratio, head-room, trunk space, rear-seat, length, weight, engine displacement (Displa), turning circle diameter (Turn)) as well as Price, gas mileage (MPG), and repair-records for each of 1978 and 1979. It is immediately clear that there are two separate groups of variables: those related to overall size and weight (which have a positive correlation with Price), and the others, which include Gratio, MPG, and the two repair record variables. Within the first group, Length, Weight, Displa, and Turn are most positively correlated; within the second group, MPG and Gratio are highly correlated, as are the two repair record variables. We also see strong negative correlations between Gratio and MPG on the one hand, and the size variables on the other.

Figure 6: Corrgram for Auto data, variables ordered according to Eqn. (1) (panel title: “Auto data”).

4 Extensions

The corrgram is designed to display patterns of (linear) dependence among variables, as well as patterns of independence. This display is easily adapted to conditional or partial dependence and independence. See Friendly (1999), Whittaker (1990) for some relations among these forms of independence for both quantitative and qualitative variables.

4.1 Conditional independence

From Dempster (1969) and from the theory of graphical models (e.g., Whittaker (1990)) it is well known that the elements of the inverse of the correlation matrix, $\mathbf{R}^{-1}$, express conditional dependence and independence relations in the same way that corresponding elements of $\mathbf{R}$ express ordinary (linear) dependence and independence. More precisely,

$$r_{ij} = 0 \iff Y_i \perp Y_j, \qquad r^{ij} = 0 \iff Y_i \perp Y_j \mid \text{others},$$

where $r_{ij}$ is the $ij$th element of $\mathbf{R}$, $r^{ij}$ is the $ij$th element of $\mathbf{R}^{-1}$, $\perp$ means “is independent of,” and “others” refers to the complementary set excluding variables $i$ and $j$. Thus, $r_{ij}$ near 0 signify (bivariate, marginal) independence while near-0 elements in $\mathbf{R}^{-1}$ signify conditional independence, given all other variables in the set.

When the negative of $\mathbf{R}^{-1}$ is appropriately rescaled to have unit diagonals, the off-diagonal elements are all pairwise partial correlations, each of the form $r_{ij \mid \text{others}}$. Thus, a corrgram of $\mathbf{R}^{-1}$ provides a visualization of conditional independence and dependence, just as the corrgram of $\mathbf{R}$ does for marginal independence and dependence. In a corrgram of $\mathbf{R}^{-1}$, we should therefore pay particular attention to empty off-diagonal cells, as well as those which are strongly shaded.

Figure 7 gives an example, for the baseball data seasonal variables. For ease of comparison, the variables have been ordered in the same way as in Figure 4. We see that most of the partial correlations defined from $\mathbf{R}^{-1}$ are small in magnitude, but there are a few notable exceptions: controlling for all other variables, there are still sizeable correlations between Years and logSal, Homers and RBIs, Hits and Atbats, and Errors and Assists. All of these have sensible interpretations. For example, comparing Figure 7 with Figure 4, the positive relations between Years and logSal, and between Homers and RBIs remain when all other variables are controlled. On the other hand, although logSal was positively related (marginally) to all of the hitting performance statistics in Figure 4, we see in Figure 7 that these (conditional) relations are negligible when the other variables are taken into account. Figure 7 may therefore be interpreted to say that the relation between logSal and the hitting performance measures is largely a reflection of Years in the major leagues.

Figure 7: Conditional independence corrgram for Baseball data, a visualization of pairwise partial correlations (panel title: “Baseball data: R^-1”).

4.2 Partial independence

Partial correlations, $\mathbf{R}_{Y|X}$, give the correlations among one set of variables ($Y$), when another set ($X$) have been statistically controlled (“held constant”), adjusted for, or “partialled out”. Conceptually, they differ from the conditional correlations just discussed only in that the set of $X$ variables is fixed, rather than all the others, for each pair. Computationally, they may be viewed as the correlations among the residuals in the regressions of each of the $Y$s on all of the $X$s.

Several interpretations of partial correlations are useful for visual display by corrgrams. If ordinary (zero-order) correlations between two $Y$s are large in magnitude, but the partial correlations, given one or more $X$s, are near zero, then the $X$s may be said “to account” for the correlation between the corresponding $Y$s. Interchanging the typical roles of $Y$ and $X$, if the one or more $X$s are considered responses, and the $Y$s explanatory, then the partial correlations may be interpreted as correlations among the explanatory variables “focused on” (or partialling out) their relations to the response variables (Falissard, 1999).

In both cases, suppose that $X$ is a subset of the variables to be partialled out. Then, we show a corrgram of the partitioned matrix,

$$\begin{pmatrix} \mathbf{R}_{Y|X} & \mathbf{0} \\ \mathbf{0} & \mathbf{R}_{XX} \end{pmatrix},$$

where $\mathbf{R}_{Y|X}$ is the matrix of partial correlations. If there is only one $X$, that row and column will be the representation of zeros; otherwise, the $\mathbf{R}_{XX}$ portion will portray the correlations among the $X$s.
Figure 8: Partial independence corrgram for Auto data, MPG and Price partialled out.

To illustrate, using the Auto data, we might wish to explore the partial correlations among the remaining variables when Price and MPG are partialled out, for example to determine whether the dependencies among the remaining variables can be accounted for by Price and gas mileage. Figure 8 shows the corrgram display, using circular encodings to better depict the numerical values of the partial correlations. Here, the $\mathbf{R}_{XX}$ portion (bottom right) shows the moderately strong negative zero-order correlation between Price and gas mileage (MPG). The $\mathbf{R}_{Y|X}$ portion shows that, controlling for both of these, the size variables are all positively correlated, but particularly so for Length, Weight, Displacement and Turn. The remaining variables (Gratio, Rep77 and Rep78) are generally positively correlated with each other, although the two repair-record variables stand out most strongly. Between these two subsets, there is a consistent pattern for Gratio vs. the others, but somewhat weaker for the two repair-record variables.

Figure 9 illustrates the second case of a partial correlation display, showing the correlations among the (seasonal) predictors of logSal in the baseball data, focused on the salary response variable. Controlling for salary, years in the major leagues has (weak) negative correlations with all other variables. Errors and Assists are still highly correlated with each other, but weakly correlated with the batting variables. Even taking salary into account, the batting variables are still highly positively related, and again the relation between Homers and RBIs stands out against the relations of the other variables in the upper left corner.
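The two kinds of correlations used in Section 4 can be computed directly from a correlation matrix. The sketch below, in Python/NumPy, is an illustration under stated assumptions rather than the paper's implementation: the function names are hypothetical, the first function rescales the negative of the inverse to unit diagonal as described in Section 4.1, and the second uses the residual covariance (a Schur complement), which is assumed here as an equivalent shortcut to correlating regression residuals as described in Section 4.2.

```python
import numpy as np

def partial_corr_all_others(R):
    """Pairwise partial correlations, each controlling for all remaining variables."""
    P = np.linalg.inv(np.asarray(R, dtype=float))
    d = np.sqrt(np.diag(P))
    partial = -P / np.outer(d, d)      # negative inverse, rescaled
    np.fill_diagonal(partial, 1.0)     # unit diagonal by convention
    return partial

def partial_corr_given(R, x_idx):
    """Partial correlations among the other (Y) variables with a fixed X set held constant.

    Returns the partial correlation matrix and the list of Y indices it refers to.
    """
    R = np.asarray(R, dtype=float)
    x_idx = list(x_idx)
    y_idx = [i for i in range(R.shape[0]) if i not in x_idx]
    Ryy = R[np.ix_(y_idx, y_idx)]
    Ryx = R[np.ix_(y_idx, x_idx)]
    Rxx = R[np.ix_(x_idx, x_idx)]
    # covariance of the residuals of the Ys after regression on the Xs
    S = Ryy - Ryx @ np.linalg.solve(Rxx, Ryx.T)
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d), y_idx

# Toy example with made-up correlations among three variables.
R_demo = np.array([[1.0, 0.5, 0.6],
                   [0.5, 1.0, 0.7],
                   [0.6, 0.7, 1.0]])
cond = partial_corr_all_others(R_demo)
part, y_vars = partial_corr_given(R_demo, [2])
```

With three variables, controlling for "all others" and controlling for the single variable 2 are the same thing for the pair (0, 1), so the two functions agree on that entry.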
Figure 9: Partial independence corrgram for Baseball data (panel title: “Baseball data: Partialling logSal”).

5 Related methods

Several graphic methods, often ad hoc, for depicting the structure of matrices and rendering their values have been proposed in a variety of contexts. The brief review below attempts to relate the present methods to this other work.

5.1 Ordering

For example, Paolini and Santangelo (1991) use similar displays for visual analysis of the pattern of sparsity in the coefficient matrix, $\mathbf{A}$, in large linear systems of the form $\mathbf{A}\mathbf{x} = \mathbf{b}$, where $\mathbf{A}$ may be very large. Permutation of the matrix rows and columns is used to search for block structure in the non-zero elements, enabling specialized solvers to find solutions for these systems far more efficiently.

In earlier work, Hills (1969) proposed two techniques for graphical analysis of large correlation matrices: a half-normal plot of Fisher’s $z$-transforms to identify correlation values too large to have come from zero population values, and an application of metric multidimensional scaling (MDS) to identify clusters of variables “such that members of the same group are all fairly positively correlated with each other, and behave similarly in their relations with other variables.” In the metric MDS analysis, the variables are represented as points in $k$-dimensional Euclidean space determined from the first $k$ eigenvectors of a double-centered matrix derived from the correlations. Distances between pairs of points approximate the corresponding dissimilarities (to the extent that the first $k$ eigenvalues are large), so close points represent variables with high positive correlations.

For comparison with the present methods, Figure 10 shows a network graph representation of the conditional independence relations in the baseball data depicted in Figure 7. In the network graph, the positions of the variables were derived from a non-metric MDS analysis of the partial correlations derived from $\mathbf{R}^{-1}$, which allows a monotonic, but not necessarily linear, relation between the partial correlation value and distance in the 2D spatial representation. Note that the spacing of the points, by themselves, does not lead to identification of similar clusters of variables, nor does it provide any coherent interpretation.

Figure 10: Conditional independence network graph for the Baseball data. The locations of variables were determined by an MDS analysis of the partial correlation matrix derived from $\mathbf{R}^{-1}$. Line style and thickness are proportional to the partial correlation.

To show a graph representation of the conditional independencies in these data, we follow Friendly (1999) to extend the simple 0/1 graph diagrams of Whittaker (1990). In Figure 10 we have added lines between pairs of variables, for all cases where the magnitude of the partial correlation reaches a threshold, the smallest value required to make the graph connected. In this graph, line-style and thickness encode the magnitude, and color encodes the sign of the conditional correlation.

Comparing Figure 10 with Figure 7, we can see the same conditional relations identified earlier: strong positive relations between Years and logSal, Homers and RBIs, Hits and Atbats, and Errors and Assists, when all other variables are controlled. In addition several negative conditional relations attract attention, e.g., RBIs and Runs, Hits and Homers, Putouts and Assists. In this network graph, it is perhaps somewhat easier to see these relations than in the corresponding corrgram. We have found, however, that the usefulness of such graphs depends critically on the use of a (somewhat arbitrary) threshold for drawing lines, and the encoding of correlation value by line-style and thickness is often less effective than in the corrgram.

Other related techniques stem from the method of McQuitty (1968), which involves iteratively re-calculating correlations among the columns of the correlation matrix itself. If $\mathbf{R}^{(0)}$ is the original correlation matrix, the sequence $\mathbf{R}^{(t)}$, $t = 1, 2, \dots$, is calculated as $\mathbf{R}^{(t+1)} = \operatorname{corr}(\mathbf{R}^{(t)})$. McQuitty showed that this sequence often converges to a matrix whose elements are all $\pm 1$, which allows the variables to be partitioned into two groups. Recursive application of this method to each group may then be used to generate a hierarchical clustering of the variables. This method was apparently re-discovered by Breiger et al. (1975) as the CONCOR algorithm, who applied it to proximity matrices and social network analysis.

Most recently, Chen (1996, 1999) used these ideas to develop a “generalized association plot” in which (a) the iterative procedure is continued until $\mathbf{R}^{(t)}$ becomes (nearly) rank two, (b) the eigenvectors of $\mathbf{R}^{(t)}$ have an elliptical structure, whose order is used to seriate the variables, and (c) shadings of the re-ordered matrix are used to display its structure.

5.2 Rendering

Finally, it is of some interest to compare the techniques presented here with other methods for schematic rendering of correlation values. Murdoch and Chow (1996) adopt a minimalist approach by using elliptical glyphs whose eccentricity is scaled to the signed correlation value, an approach which is suitable for large matrices. However, they eschew the use of color and shading, and present no general scheme for variable ordering.

The current techniques may also be compared with methods based on the scatterplot matrix and its enhancements cited in the introduction. Fox (personal communication, 2001) suggested the combination of concentration ellipses and loess smooths as schematic visual summaries of linear and possibly nonlinear association. For the baseball data, our version of such a plot is shown in Figure 11, with a “1 standard deviation” ellipse of 68% coverage centered at the means, and the variables ordered as in Figure 4 (right). To highlight the patterns of association, all extraneous ink has been suppressed: points, plot frames, tick marks, etc.

For these variables, this display shows far more detail than the corrgrams presented here, but it also suggests that some of the relations we have assumed to be linear are actually nonlinear. While this form of rendering may be better for some tasks, it would be difficult to accommodate many more variables, and it is clearly more difficult to see the overall pattern of relations among variables in Figure 11 than in the corresponding corrgram in Figure 4.

6 Software

The corrgrams shown here are all drawn by a general SAS macro program, corrgram, described in http://www.math.yorku.ca/SCS/sasmac/corrgram.html, from which the source code may be downloaded. The program has a large variety of options and is easily used. For example, Figure 2, using shaded encodings below the diagonal, and circular encodings above, is produced by the macro call,

    title 'Baseball data: PC2/1 order';
    %corrgram(...,
        var=logSal Years Homer Runs Hits RBI Atbat Walks Putouts Assists Errors,
        fill=SEC);

The analogous partial independence corrgram in Figure 9 is obtained by adding the appropriate keyword option and removing the fill=SEC option. The program requires SAS/IML and SAS/GRAPH in addition to the basic SAS System.

A program for MATLAB, corrmap, was developed by Barry Wise, and is available at http://www.eigenvector.com/MATLAB/corrmap.html. This program uses a version of the k-nearest neighbor algorithm to reorder the variables. It appears to encode correlation values by a “pseudo-color map” which ranges from white through yellow and red, to black. This is not a good choice, but other color mappings may be readily used, and the source code is available.

Finally, SYSTAT provides a variety of matrix clustering algorithms, clustering both rows and columns. When applied to a correlation matrix, the rows and columns are permuted according to the cluster structure, and the correlation values are depicted by colored squares, but using an apparently arbitrary and fixed color scheme.
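As a complement to the programs listed in this section, the iterated-correlation procedure attributed to McQuitty (1968) in Section 5.1 is easy to try numerically. The following Python/NumPy sketch is an illustration only: the function names, the stopping tolerance and the iteration cap are assumptions, and it shows a single two-way split rather than the recursive clustering or the CONCOR program itself.

```python
import numpy as np

def iterated_correlations(R, tol=1e-10, max_iter=100):
    """Repeatedly replace a matrix by the correlations among its own columns."""
    R_t = np.asarray(R, dtype=float)
    for _ in range(max_iter):
        R_next = np.corrcoef(R_t, rowvar=False)   # correlations among columns
        if np.max(np.abs(R_next - R_t)) < tol:    # stop when the matrix stops changing
            return R_next
        R_t = R_next
    return R_t

def two_group_split(R_limit):
    """Split variables by the sign of their limiting correlation with variable 0."""
    return R_limit[0] > 0

# Toy example (made-up correlations): variables 0,1 versus 2,3.
R_demo = np.array([[ 1.0,  0.8, -0.3, -0.2],
                   [ 0.8,  1.0, -0.1, -0.4],
                   [-0.3, -0.1,  1.0,  0.7],
                   [-0.2, -0.4,  0.7,  1.0]])
limit = iterated_correlations(R_demo)
groups = two_group_split(limit)
```

For this toy matrix the limit consists of entries very close to +1 and -1, and `groups` separates the first two variables from the last two. Convergence to such a matrix is typical but, as the text notes, not guaranteed for every input.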
Figure 11: Schematic scatterplot matrix for Baseball data. Each panel shows the bivariate 68% concentration ellipse (truncated at the data bounding box) and a loess smoothed curve.

7 Conclusions

In a wide sense there is not much that is novel here: various methods for visually depicting correlation matrices have been proposed (or just used, e.g., Dobkins et al. (2000, Fig. 2)), and various schemes for reordering variables in such matrices have also been suggested. Yet, surprisingly, there has been no published work we have found treating these methods in any coherent way. We claim to have presented a more general and comprehensive account of the possibilities than has appeared previously. We have also (a) suggested a new scheme for ordering variables in such displays, (b) extended the idea of correlation mapping to more general concepts of dependence and independence, and (c) illustrated (we hope convincingly) why they might be useful. We also provide a flexible implementation of these ideas (Section 6) with which others can work, and perhaps extend.

In particular, the details of the various rendering techniques suggested here bear further study: continuously-scaled vs. classed colors, accounting for the non-linearity of color reproduction and perception, circles or bars vs. shaded boxes, and so forth. It was not until we had tried several alternatives that the differences among them became apparent.

For large matrices, these techniques scale relatively well, but the results are most often successful when the level of detail in the rendering is minimized (e.g., using shading, elliptical glyphs, etc.). Labeling of the variables, important for interpretation, also becomes more difficult, but this is easily solved by flexible font orientation and scaling.

Finally, we note that these graphical techniques are applicable to the wider class of symmetric matrices, including distance and proximity matrices. In addition, the method of correlation-based variable ordering described here has been shown (Friendly and Kwan, 2002) to facilitate perception of relations in other multivariate data displays (e.g., parallel coordinate plots, star plots).
References

Asimov, D. Grand tour. SIAM Journal of Scientific and Statistical Computing, 6(1):128–143, 1985.

Breiger, R. L., Boorman, S. A., and Arabie, P. An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling. Journal of Mathematical Psychology, 12:328–383, 1975.

Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. Graphical Methods for Data Analysis. Wadsworth, Belmont, CA, 1983.

Chen, C. H. The properties and applications of the convergence of correlation matrices. In Proceedings of the Statistical Computing Section, pp. 49–54, Alexandria, VA, 1996. American Statistical Association.

Chen, C. H. Extensions of generalized association plots (GAP). In Proceedings of the Statistical Computing Section, pp. 111–116, Alexandria, VA, 1999. American Statistical Association.

Cleveland, W. S. Visualizing Data. Hobart Press, Summit, NJ, 1993.

Dempster, A. P. Elements of Continuous Multivariate Analysis. Addison-Wesley, Reading, MA, 1969.

Dobkins, K. R., Gunther, K. L., and Peterzell, D. H. What covariance mechanisms underlie green/red equiluminance, luminance contrast sensitivity and chromatic (green/red) contrast sensitivity? Vision Research, 40(6):613–628, 2000.

Falissard, B. A spherical representation of a correlation matrix. Journal of Classification, 13(2):267–280, 1996.

Falissard, B. Focused principal component analysis: Looking at a correlation matrix with a particular interest in a given variable. Journal of Computational and Graphical Statistics, 8(4):906–912, 1999.

Friedman, J. Exploratory projection pursuit. Journal of the American Statistical Association, 82:249–266, 1987.

Friendly, M. SAS System for Statistical Graphics. SAS Institute, Cary, NC, 1st edition, 1991.

Friendly, M. Extending mosaic displays: Marginal, conditional, and partial views of categorical data. Journal of Computational and Graphical Statistics, 8:373–395, 1999.

Friendly, M. and Kwan, E. Effect ordering for data displays. Computational Statistics and Data Analysis, 37, 2002. In press.

Gabriel, K. R. The biplot graphic display of matrices with application to principal components analysis. Biometrika, 58(3):453–467, 1971.

Gruvaeus, G. and Wainer, H. Two additions to hierarchical cluster analysis. The British Journal of Mathematical and Statistical Psychology, 25(1):200–206, 1972.

Hills, M. On looking at large correlation matrices. Biometrika, 56:249–253, 1969.

Hoaglin, D. C. and Velleman, P. F. A critical look at some analyses of major league baseball salaries. The American Statistician, 49:277–285, 1994.

McQuitty, L. L. Multiple clusters, types, and dimensions from iterative intercolumnar correlational analysis. Multivariate Behavioral Research, 3:465–477, 1968.

Murdoch, D. J. and Chow, E. D. A graphical display of large correlation matrices. The American Statistician, 50(2):178–180, 1996.

Paolini, G. V. and Santangelo, P. An interactive graphic tool to plot the structure of large sparse matrices. IBM Journal of Research and Development, 35:231–237, 1991.

Tukey, J. W. Exploratory Data Analysis. Addison Wesley, Reading, MA, 1977.

Whittaker, J. Graphical Models in Applied Multivariate Statistics. John Wiley and Sons, New York, NY, 1990.