Analysis of variance*

Andrew Gelman†

March 22, 2006

* For the New Palgrave Dictionary of Economics, second edition. We thank Jack Needleman, Matthew Rafferty, David Pattison, Marc Shivers, Gregor Gorjanc, and several anonymous commenters for helpful suggestions and the National Science Foundation for financial support.
† Department of Statistics and Department of Political Science, Columbia University, New York, gelman@stat.columbia.edu, www.stat.columbia.edu/~gelman

Abstract

Analysis of variance (ANOVA) is a statistical procedure for summarizing a classical linear model: a decomposition of sum of squares into a component for each source of variation in the model, along with an associated test (the F-test) of the hypothesis that any given source of variation in the model is zero. When applied to generalized linear models, multilevel models, and other extensions of classical regression, ANOVA can be extended in two different directions. First, the F-test can be used (in an asymptotic or approximate fashion) to compare nested models, to test the hypothesis that the simpler of the models is sufficient to explain the data. Second, the idea of variance decomposition can be interpreted as inference for the variances of batches of parameters (sources of variation) in multilevel regressions.

1 Introduction

Analysis of variance (ANOVA) represents a set of models that can be fit to data, and also a set of methods for summarizing an existing fitted model. We first consider ANOVA as it applies to classical linear models (the context for which it was originally devised; Fisher, 1925) and then discuss how ANOVA has been extended to generalized linear models and multilevel models. Analysis of variance is particularly effective for analyzing highly structured experimental data (in agriculture, multiple treatments applied to different batches of animals or crops; in psychology, multi-factorial experiments manipulating several independent experimental conditions and applied to groups of people; industrial experiments in which multiple factors can be altered at different times and in different locations).

At the end of this article, we compare ANOVA to simple linear regression.

2 Analysis of variance for classical linear models

2.1 ANOVA as a family of statistical methods

When formulated as a statistical model, analysis of variance refers to an additive decomposition of data into a grand mean, main effects, possible interactions, and an error term. For example, Gawron et al. (2003) describe a flight-simulator experiment that we summarize as a 5 × 8 array of measurements under 5 treatment conditions and 8 different airports. The corresponding two-way ANOVA model is $y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}$. The data as described here have no replication, and so the two-way interaction becomes part of the error term.¹

¹ If, for example, each treatment × airport condition were replicated three times, then the 120 data points could be modeled as $y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$, with two sets of main effects, a two-way interaction, and an error term.

Source       Degrees of freedom   Sum of squares   Mean square   F-ratio   p-value
Treatment             4                0.078          0.020         0.39     0.816
Airport               7                3.944          0.563        11.13    <0.001
Residual             28                1.417          0.051

Figure 1: Classical two-way analysis of variance for data on 5 treatments and 8 airports with no replication. The treatment-level variation is not statistically distinguishable from noise, but the airport effects are statistically significant. This and the other examples in this article come from Gelman (2005) and Gelman and Hill (2006).

This is a linear model with 1 + 4 + 7 coefficients, which is typically identified by constraining $\sum_{i=1}^{5} \alpha_i = 0$ and $\sum_{j=1}^{8} \beta_j = 0$. The corresponding ANOVA display is shown in Figure 1:

• For each source of variation, the degrees of freedom represent the number of effects at that level, minus the number of constraints (the 5 treatment effects sum to zero, the 8 airport effects sum to zero, and each row and column of the 40 residuals sums to zero).

• The total sum of squares, that is, $\sum_{i=1}^{5} \sum_{j=1}^{8} (y_{ij} - \bar{y}_{..})^2$, is 0.078 + 3.944 + 1.417, which can be decomposed into these three terms corresponding to variance described by treatment, variance described by airport, and residuals.

• The mean square for each row is the sum of squares divided by degrees of freedom. Under the null hypothesis of zero row and column effects, their mean squares would, in expectation, simply equal the mean square of the residuals.

• The F-ratio for each row (excluding the residuals) is the mean square, divided by the residual mean square. This ratio should be approximately 1 (in expectation) if the corresponding effects are zero; otherwise we would generally expect the F-ratio to exceed 1. We would expect the F-ratio to be less than 1 only in unusual models with negative within-group correlations (for example, if the data y have been renormalized in some way, and this had not been accounted for in the data analysis).

• The p-value gives the statistical significance of the F-ratio with reference to the $F_{\nu_1,\nu_2}$ distribution, where $\nu_1$ and $\nu_2$ are the numerator and denominator degrees of freedom, respectively. (Thus, the two F-ratios in Figure 1 are being compared to $F_{4,28}$ and $F_{7,28}$ distributions, respectively.) In this example, the treatment mean square is lower than expected (an F-ratio of less than 1), but the difference from 1 is not statistically significant (a p-value of 82%), hence it is reasonable to judge this difference as explainable by chance, and consistent with zero treatment effects. The airport mean square is much higher than would be expected by chance, with an F-ratio that is highly statistically significantly larger than 1; hence we can confidently reject the hypothesis of zero airport effects.

More complicated designs have correspondingly complicated ANOVA models, and complexities arise with multiple error terms. We do not intend to explain such hierarchical designs and analyses here, but we wish to alert the reader to such complications. Textbooks such as Snedecor and Cochran (1989) and Kirk (1995) provide examples of analysis of variance for a wide range of designs.

2.2 ANOVA to summarize a model that has already been fitted

We have just demonstrated ANOVA as a method of analyzing highly structured data by decomposing variance into different sources, and comparing the explained variance at each level to what would be expected by chance alone. Any classical analysis of variance corresponds to a linear model (that is, a regression model, possibly with multiple error terms); conversely, ANOVA tools can be used to summarize an existing linear model.

The key is the idea of "sources of variation," each of which corresponds to a batch of coefficients in a regression. Thus, with the model $y = X\beta + \epsilon$, the columns of X can often be batched in a reasonable way (for example, from the previous section, a constant term, 4 treatment indicators, and 7 airport indicators), and the mean squares and F-tests then provide information about the amount of variance explained by each batch.

Such models could be fit without any reference to ANOVA, but ANOVA tools could then be used to make some sense of the fitted models, and to test hypotheses about batches of coefficients.

2.3 Balanced and unbalanced data

In general, the amount of variance explained by a batch of predictors in a regression depends on which other variables have already been included in the model. With balanced data, however, in which all groups have the same number of observations (for example, each treatment applied exactly eight times, and each airport used for exactly five observations), the variance decomposition does not depend on the order in which the variables are entered. ANOVA is thus particularly easy to interpret with balanced data. The analysis of variance can also be applied to unbalanced data, but then the sums of squares, mean squares, and F-ratios will depend on the order in which the sources of variation are considered.

3 ANOVA for more general models

Analysis of variance represents a way of summarizing regressions with large numbers of predictors that can be arranged in batches, and a way of testing hypotheses about batches of coefficients. Both these ideas can be applied in settings more general than linear models with balanced data.

3.1 F tests

In a classical balanced design (as in the examples of the previous section), each F-ratio compares a particular batch of effects to zero, testing the hypothesis that this particular source of variation is not necessary to fit the data.

More generally, the F test can compare two nested models, testing the hypothesis that the smaller model fits the data adequately (so that the larger model is unnecessary). In a linear model, the F-ratio is

$$\frac{(SS_2 - SS_1)/(df_2 - df_1)}{SS_1/df_1},$$

where $SS_1, df_1$ and $SS_2, df_2$ are the residual sums of squares and degrees of freedom from fitting the larger and smaller models, respectively.

For generalized linear models, formulas exist using the deviance (the log-likelihood multiplied by −2) that are asymptotically equivalent to F-ratios. In general, such models are not balanced, and the test for including another batch of coefficients depends on which other sources of variation have already been included in the model.

3.2 Inference for variance parameters

A different sort of generalization interprets the ANOVA display as inference about the variance of each batch of coefficients, which we can think of as the relative importance of each source of variation in predicting the data. Even in a classical balanced ANOVA, the sums of squares and mean squares do not exactly do this, but the information contained therein can be used to estimate the variance components (Cornfield and Tukey, 1956, Searle, Casella, and McCulloch, 1992). Bayesian simulation can then be used to obtain confidence intervals for the variance parameters. As illustrated below, we display inferences for standard deviations (rather than variances) because these are more directly interpretable. Compared to the classical ANOVA display, our plots emphasize the estimated variance parameters rather than testing the hypothesis that they are zero.

[Figure 2 graphic not reproduced. Two panels, each plotting the estimated sd of coefficients (axis from 0 to 1.5) for each source of variation, with degrees of freedom. First model: sex (1), ethnicity (1), sex × ethnicity (1), age (3), education (3), age × education (9), region (3), region × state (46). Second model: the same sources plus ethnicity × region (3) and ethnicity × region × state (46).]

Figure 2: ANOVA display for two logistic regression models of the probability that a survey respondent prefers the Republican candidate for the 1988 U.S. Presidential election, based on data from seven CBS News polls. Point estimates and error bars show median estimates, 50% intervals, and 95% intervals of the standard deviation of each batch of coefficients. The large coefficients for ethnicity, region, and state suggest that it might make sense to include interactions, hence the inclusion of ethnicity × region and ethnicity × state interactions in the second model.

3.3 Generalized linear models

The idea of estimating variance parameters applies directly to generalized linear models as well as unbalanced datasets. All that is needed is that the parameters of a regression model are batched into "sources of variation." Figure 2 illustrates with a multilevel logistic regression model, predicting vote preference given a set of demographic and geographic variables.

3.4 Multilevel models and Bayesian inference

Analysis of variance is closely tied to multilevel (hierarchical) modeling, with each source of variation in the ANOVA table corresponding to a variance component in a multilevel model (see Gelman, 2005). In practice, this can mean that we perform ANOVA by fitting a multilevel model, or that we use ANOVA ideas to summarize multilevel inferences. Multilevel modeling is inherently Bayesian in that it involves a potentially large number of parameters that are modeled with probability distributions (see, for example, Goldstein, 1995, Kreft and De Leeuw, 1998, Snijders and Bosker, 1999). The differences between Bayesian and non-Bayesian multilevel models are typically minor except in settings with many sources of variation and little information on each, in which case some benefit can be gained from a fully Bayesian approach which models the variance parameters.

4 Related topics

4.1 Finite-population and superpopulation variances

So far in this article we have considered, at each level (that is, each source of variation) of a model, the standard deviation of the corresponding set of coefficients. We call this the finite-population standard deviation. Another quantity of potential interest is the standard deviation of the hypothetical superpopulation from which these particular coefficients were drawn. The point estimates of these two variance parameters are similar (with the classical method of moments, the estimates are identical, because the superpopulation variance is the expected value of the finite-population variance), but they will have different uncertainties. The inferences for the finite-population standard deviations are more precise, as they correspond to effects for which we actually have data.

Figure 3 illustrates the finite-population and superpopulation inferences at each level of the model for the flight-simulator example. We know much more about the 5 treatments and 8 airports in our dataset than for the general populations of treatments and airports. (We similarly know more about the standard deviation of the 40 particular errors in our dataset than about their hypothetical superpopulation, but the differences here are not so large, because the superpopulation distribution is fairly well estimated from the 28 degrees of freedom available from these data.)

[Figure 3 graphic not reproduced. Two panels, "finite-population s.d.'s" and "superpopulation s.d.'s", each showing three standard deviations (treatment level, airport level, data level) on an axis from 0 to 0.8.]

Figure 3: Median estimates, 50% intervals, and 95% intervals for (a) finite-population and (b) superpopulation standard deviations of the treatment-level, airport-level, and data-level errors in the flight-simulator example from Figure 1. The two sorts of standard deviation parameters have essentially the same estimates, but the finite-population quantities are estimated much more precisely. (We follow the general practice in statistical notation, using Greek and Roman letters for population and sample quantities, respectively.)

There has been much discussion about fixed and random effects in the statistical literature (see Eisenhart, 1947, Green and Tukey, 1960, Plackett, 1960, Yates, 1967, LaMotte, 1983, and Nelder, 1977, 1994, for a range of viewpoints), and unfortunately the terminology used in these discussions is incoherent (see Gelman, 2005, Section 6). Our resolution to some of these difficulties is to always fit a multilevel model but to summarize it with the appropriate class of estimand (superpopulation or finite-population), depending on the context of the problem. Sometimes we are interested in the particular groups at hand; other times they are a sample from a larger population of interest. A change of focus should not require a change in the model, only a change in the inferential summaries.

4.2 Contrast analysis

Contrasts are a way of structuring the effects within a source of variation. In a multilevel modeling context, a contrast is simply a group-level coefficient. Introducing contrasts into an ANOVA allows a further decomposition of variance. Figure 4 illustrates for a 5 × 5 latin square experiment (this time, not a split plot): the left plot in the figure shows the standard ANOVA, and the right plot shows a contrast analysis including linear trends for the row, column, and treatment effects. The linear trends for the columns and treatments are large, explaining most of the variation at each of these levels, but there is no evidence for a linear trend in the row effects.

[Figure 4 graphic not reproduced. Two panels, each plotting the estimated sd of coefficients (axis from 0 to 40) with degrees of freedom. Left panel: row (4), column (4), treatment (4), error (12). Right panel: row (4), row.linear (1), row.error (3), column (4), column.linear (1), column.error (3), treatment (4), treatment.linear (1), treatment.error (3), error (12).]

Figure 4: ANOVA displays for a 5 × 5 latin square experiment (an example of a crossed three-way structure): (a) with no group-level predictors, (b) contrast analysis including linear trends for rows, columns, and treatments. See also the plots of coefficient estimates and trends in Figure 5.

Figure 5 shows the estimated effects and linear trends at each level (along with the raw data from the study), as estimated from a multilevel model. This plot shows in a different way that the variation among columns and treatments, but not among rows, is well explained by linear trends.

[Figure 5 graphic not reproduced. Three panels, "row effects", "column effects", and "treatment effects", each plotting y (axis from 200 to 300) against the factor levels 1 to 5.]

Figure 5: Estimates ±1 standard error for the row, column, and treatment effects for the latin square experiment summarized in Figure 4. The five levels of each factor are ordered, and the lines display the estimated linear trends.

4.3 Nonexchangeable models

In all the ANOVA models we have discussed so far, the effects within any batch (source of variation) are modeled exchangeably, as a set of coefficients with mean 0 and some variance. An important direction of generalization is to nonexchangeable models, such as in time series, spatial structures (Besag and Higdon, 1999), correlations that arise in particular application areas such as genetics (McCullagh, 2005), and dependence in multi-way structures (Aldous, 1981, Hodges et al., 2005). In these settings, both the hypothesis-testing and variance-estimating extensions of ANOVA become more elaborate. The central idea of clustering effects into batches remains, however. In this sense, "analysis of variance" represents all efforts to summarize the relative importance of different components of a complex model.

5 ANOVA compared to linear regression

The analysis of variance is often understood by economists in relation to linear regression (e.g., Goldberger, 1964). From the perspective of linear (or generalized linear) models, we identify ANOVA with the structuring of coefficients into batches, with each batch corresponding to a "source of variation" (in ANOVA terminology).

As discussed by Gelman (2005), the relevant inferences from ANOVA can be reproduced using regression, but not always least-squares regression. Multilevel models are needed for analyzing hierarchical data structures such as "split-plot designs," where between-group effects are compared to group-level errors, and within-group effects are compared to data-level errors.

Given that we can already fit regression models, what do we gain by thinking about ANOVA? To start with, the display of the importance of different sources of variation is a helpful exploratory summary. For example, the two plots in Figure 2 allow us to quickly understand and compare two multilevel logistic regressions, without getting overwhelmed with dozens of coefficient estimates.

More generally, we think of the analysis of variance as a way of understanding and structuring multilevel models, not as an alternative to regression but as a tool for summarizing complex high-dimensional inferences, as can be seen, for example, in Figure 3 (finite-population and superpopulation standard deviations) and Figures 4–5 (group-level coefficients and trends).

References

Aldous, D. J. (1981). Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis 11, 581–598.

Besag, J., and Higdon, D. (1999). Bayesian analysis of agricultural field experiments (with discussion). Journal of the Royal Statistical Society B 61, 691–746.

Cochran, W. G., and Cox, G. M. (1957). Experimental Designs, second edition. New York: Wiley.

Cornfield, J., and Tukey, J. W. (1956). Average values of mean squares in factorials. Annals of Mathematical Statistics 27, 907–949.

Eisenhart, C. (1947). The assumptions underlying the analysis of variance. Biometrics 3, 1–21.

Fisher, R. A. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.

Gawron, V. J., Berman, B. A., Dismukes, R. K., and Peer, J. H. (2003). New airline pilots may not receive sufficient training to cope with airplane upsets. Flight Safety Digest (July–August), 19–32.

Gelman, A. (2005). Analysis of variance: why it is more important than ever (with discussion). Annals of Statistics 33, 1–53.

Gelman, A., and Hill, J. (2006). Applied Regression and Multilevel (Hierarchical) Models. Cambridge University Press.

Gelman, A., Pasarica, C., and Dodhia, R. M. (2002). Let's practice what we preach: using graphs instead of tables. The American Statistician 56, 121–130.

Goldberger, A. S. (1964). Econometric Theory. New York: Wiley.

Goldstein, H. (1995). Multilevel Statistical Models, second edition. London: Edward Arnold.

Green, B. F., and Tukey, J. W. (1960). Complex analyses of variance: general problems. Psychometrika 25, 127–152.

Hodges, J. S., Cui, Y., Sargent, D. J., and Carlin, B. P. (2005). Smoothed ANOVA. Technical report, Department of Biostatistics, University of Minnesota.

Kirk, R. E. (1995). Experimental Design: Procedures for the Behavioral Sciences, third edition. Brooks/Cole.

Kreft, I., and De Leeuw, J. (1998). Introducing Multilevel Modeling. London: Sage.

LaMotte, L. R. (1983). Fixed-, random-, and mixed-effects models. In Encyclopedia of Statistical Sciences, ed. S. Kotz, N. L. Johnson, and C. B. Read, 3, 137–141.

McCullagh, P. (2005). Discussion of Gelman (2005). Annals of Statistics 33, 33–38.

Nelder, J. A. (1977). A reformulation of linear models (with discussion). Journal of the Royal Statistical Society A 140, 48–76.

Nelder, J. A. (1994). The statistics of linear models: back to basics. Statistics and Computing 4, 221–234.

Plackett, R. L. (1960). Models in the analysis of variance (with discussion). Journal of the Royal Statistical Society B 22, 195–217.

Searle, S. R., Casella, G., and McCulloch, C. E. (1992). Variance Components. New York: Wiley.

Snedecor, G. W., and Cochran, W. G. (1989). Statistical Methods, eighth edition. Iowa State University Press.

Snijders, T. A. B., and Bosker, R. J. (1999). Multilevel Analysis. London: Sage.

Yates, F. (1967). A fresh look at the basic principles of the design and analysis of experiments. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 4, 777–790.
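
Worked example

As a rough check on the arithmetic described in Section 2.1, the following Python sketch recomputes the mean squares and F-ratios from the sums of squares and degrees of freedom reported in Figure 1. The function and variable names are illustrative assumptions and are not part of the original article. The p-values are omitted because they require the F distribution, which is not available in the Python standard library.

```python
# Sums of squares and degrees of freedom as reported in Figure 1.
# The dictionary layout and names are hypothetical, chosen for illustration.
ANOVA_ROWS = {
    "Treatment": {"ss": 0.078, "df": 4},
    "Airport": {"ss": 3.944, "df": 7},
    "Residual": {"ss": 1.417, "df": 28},
}


def mean_square(ss, df):
    """Mean square: sum of squares divided by degrees of freedom."""
    return ss / df


def f_ratios(rows, residual="Residual"):
    """F-ratio for each non-residual row: its mean square over the residual mean square."""
    ms_resid = mean_square(rows[residual]["ss"], rows[residual]["df"])
    return {
        name: mean_square(row["ss"], row["df"]) / ms_resid
        for name, row in rows.items()
        if name != residual
    }


# The total sum of squares is the sum of the three components.
total_ss = sum(row["ss"] for row in ANOVA_ROWS.values())
ratios = f_ratios(ANOVA_ROWS)

for name, value in ratios.items():
    print(f"{name}: F = {value:.2f}")
```

Up to rounding, the two ratios agree with the F-ratio column of Figure 1 (0.39 for treatment and 11.13 for airport). Because the figure reports rounded sums of squares, small discrepancies in later decimal places are to be expected.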