37 52 On Differentia ariabilit of Expressio Ratios Improvin Statistica Inferenc abou Gen Expressio Change fro Microarra Dat MA NEWTON CM KENDZIORSKI CS RICHMOND R BL TTNER an K TSU ABSTRACT We conside th proble of inferrin fol change in gen expressi ID: 29378
Download Pdf The PPT/PDF document "JOURNA OF COMPU TIONA BIOLOG olum Numbe..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
JOURNALOFCOMPUTATIONALBIOLOGYVolume8,Number1,2001MaryAnnLiebert,Inc.Pp.3752OnDifferentialVariabilityofExpressionRatios:ImprovingStatisticalInferenceaboutGeneExpressionChangesfromMicroarrayDataM.A.NEWTON,1;2C.M.KENDZIORSKI,2C.S.RICHMOND,3F.R.BLATTNER,3andK.W.TSUI 1DepartmentofStatistics,UniversityofWisconsin,Madison,WI53792.2DepartmentofBiostatisticsandMedicalInformatics,UniversityofWisconsin,Madison,WI53792.3LaboratoryofGenetics,UniversityofWisconsin,Madison,WI53792.37 38NEWTONETAL.sample.Followingconvention,weuseredtoindicatethesampletaggedwiththeCy5dyeandgreentoindicatetheCy3dye.Dugganetal.(1999)orCheungetal.(1999),forexample,providefurtherdetailsonhowtoobtaincDNAmicroarraydata.Similarexpressiondataareobtainedonoligonucleotidearrays(Lipshutzetal.,1999),thoughwefocusonthecDNAmicroarraysinthepresentdevelopment.Thereisreasontoexpectthatthestatisticalmethodologydescribedherewillapplyinbothdomains.Toaccountforintrinsicdifferencesbetweenthehybridizingsamples,intensitymeasurementsarenor-malizedinsomefashion.Onewayistocomparesignalsatasetofhouse-keepinggenes,i.e.,genesthoughttonotpresentsignicantchangesinexpressionbetweensamples.Richmondetal.(1999)usedasimplemethodontheE.colimicroarraysreconsideredhere;theynormalizedbythetotalsignalintensityfromallspots.Otherpossibilitiesincludespikingthepreparedsamplewithknownconcentrationsofspecicgenesorcombiningmeasurementstakeninbothorientationsofthedyes.Acomponentofeachintensitymeasurementatagivenspotisbackgrounduorescence.Anestimateofthiscomponentcanbeobtainedfrompixelsnearthespot,andthereportedintensityisthentheoriginalmeasurementminusthebackgroundmeasurement.Inwhatfollowsweconsiderthemeasurementstobenormalizedandtobeadjustedforbackgroundintensity.Itistypicalthatinferenceaboutdifferentialgeneexpressionbetweentwocelltypesisbasedontheratioofmeasuredexpressionlevels.Abasicstatisticalproblemistoknowwhenthemeasureddifferentialexpressionislikelytoreectarealbiologicalshiftingeneexpression.Thisdependsontheamountofvariationinthesystem,soitisdifculttojustifyaxedrule,suchastofocusongenesexhibitingmorethana3-foldshift,say.Certainlyreplicationwillbecriticalinapplications;asmoreandmoremicroarraysaremeasured,somecondenceintheexpressionproleswillundoubtedlyemerge.Ourimmediateconcerniswithdatafromasinglemicroarray;weseesomeroomforimprovementintheinitialsignalprocessingwhichmayhavebearingondownstreamtaskssuchasclusteringorotherformsofdataanalysis(e.g.,Eisenetal.,1998;Bassettetal.,1999).Agivenfoldchangeinmeasuredexpressionmayhaveadifferentinterpretationforagenewhoseabsoluteexpressionislowascomparedtoagenethatisbrightinbothuorescentchannels(notedinBassettetal.[1999]forexample).Wearguethatanyprocedurewhichusestherawintensityratiosalonetoinferdifferentialexpressionmaybeinefcientandthusmayleadtoexcessiveerrors.Indeed,sourcesofvariationareexpectedtobesuchthattheabsoluteexpressionlevelsprovideinformationonthevariationofintensityratios.Thisinformationisignoredinthestandardtreatment.Onesolutionistoignoreanygeneswhosetranscriptsarepresentatalowtotalabundance.Wemayhavecondenceaboutthedifferentialexpressionofremaininggenes,butatthepriceofthrowingawaypotentiallyvaluabledata.Inanycase,thechoiceofacutoffmaybearbitrary.Agenemaybedeemedbelowthedetectionlevelinonechannelbutnotintheother.Furthermore,thetranscriptabundanceofmanyinterestinggenesmaybeverylow,andsothestrategyseemsfarfromoptimal.Thesolutionwedescribeisbasedonhierarchicalmodelsofmeasuredexpressionlevelsinwhichweaccountfortwoobvioussourcesofvariation.Therstwecallmeasurementerror.Inhypotheticalrepetitionsoftheexperiment,themeasureduorescencesignalwilluctuatearoundsomemeanvaluewhichisitselfapropertyofthecelltype,theparticulargene,andotherfactors.Theseuctuationsareduetomultiplesourcesofvariationthatariseinproducingthemeasurement,suchasvariationinthepreparationofthemRNAsamplesandintheincorporationofuorescenttags,opticalnoise,andcrosshybridization.Importantly,thisvariationmayinclude,butisaboveandbeyond,thebackgroundnoisementionedabove.Thesecondmainsourceofvariationweconsiderisduetothedifferentgenesspottedontothemicroarray.Themeanuorescencevaluearoundwhichmeasuredexpressionuctuateschangesfromgenetogeneandwillserveasarandomeffectinourproposedmodel.ThepopulationofmRNAsfromagivensampleiscomposedofmanydistinctmolecules,butthepartitionisnotuniform;somemRNAsareabundantandothersarerare.AsweseeinSections3and4,byformallycombiningthetwosourcesofvariationwecanreadilyobtainprobabilisticstatementsaboutactualdifferentialexpression.Wendthattheobservedratiosarenotoptimalestimators;wendthatfocusingonfoldchangesaloneisinsufcientandthatcondencestatementsaboutdifferentialexpressiondependontranscriptabundance.PerhapstherststatisticaltreatmentofmicroarraydataanalysisiscontainedinChenetal.(1997).Theseauthorsmaketheinterestingargumentthat,althoughexpectedexpressionlevelsdouctuatefromgeneto ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS39geneacrossthemicroarray,themeasurementsarelinkedbyhavingaconstantcoefcientofvariationc,say.Then,theobserveddifferentialexpression,say,TkDRk=Gk(ratioofredtogreenintensityatgenek),hasasamplingdistributiondependentonlyoncunderthenullhypothesisthatRkandGkhavethesameexpectation,i.e.,thatthereisnorealdifferentialexpression,andcomputationisunderaGaussianmodelforbothmeasurements.Usingasetofhouse-keepinggenes,whicharethoughttonotpresentrealdifferentialexpression,themaximumlikelihoodestimateofcisderivedandcondenceintervalsforactualdifferentialexpressionarecomputedfrompercentilesoftheestimatednulldistributionofTk.Theintervalsareeasytocomputeandareresponsivetotheintrinsicvariationofdataonthemicroarraybecausetheyuseadata-dependentvalueofc.(Bycontrast,theproceduretocallsignicantanygenepresenting,say,a3-foldorgreaterexpressiondifferentialisnotresponsivetosuchvariation.)Themethodhasignoredancillaryinformation;Rk£GkcontainsinformationaboutthevariationofTk.Inotherwords,TkisnotindependentofRk£Gk.Thereistheminortechnicalpoint,too,thatRkandGkaremodeledasGaussianwheninfacttheymustbepositive.WeconsiderasamplingmodelformeasuredintensitiesinSection2.OnecanviewtheChenetal.(1997)methodasproducingasetofhypothesistests,oneforeachgeneonthemicroarray,inwhichthenullhypothesisisthattheexpectationofbothintensitysignalsisequalandthealternativeisthattheyareunequal.WhenanobservedTkfallsinthetailsofthenullsamplingdistribution,werejectthenullanddeclaresignicantdifferentialexpression.Asinotherdomainsofapplication,weknowthatsomebenetcanbeattainedifthethousandsofparametersareconsideredsimultaneously,ratherthaninisolation(EfronandMorris,1973,1977;CarlinandLouis,1996).Thecalculationspresentedhereattempttodemonstratetheutilityoftreatingthegene-specicparametersthemselvesasmembersofancpopulation.InSection3weconsidertheproblemofestimatingandpossiblyformingacondenceintervalfortheactualdifferentialexpressionofagivengene.Thisinvolvescalculationsinatwo-layerhierarchicalmodeltoproduceaposteriorprobabilitydistributionfortheactualdifferentialexpressionandanempiricalBayesestimateofthesame.WeillustratewithapaneloffourE.colimicroarrays,highlightingdistinctionsbetweentheempiricalBayesestimatesandthenaiveestimates.Asimulationstudyshowsthattotalestimationerrorcanbereducedbyusingthisprocedure.TheproblemoftestingforsignicantdifferentialexpressionisthefocusofSection4,andhereathirdlayerisaddedtothehierarchicalmodel.Wederiveafunctionwhichgivestheoddsofactualdifferentialexpressionasafunctionofmeasuredintensities.Thisprovidesaneffectivesummary,asortofqualitynumber,foreachgene.Tosimplifyourdevelopment,wefocusondatafromasinglemicroarray,andwesupposethatthereisonespotforeachgene.WeaddresstheissueofmodelvalidationinSection5,wherepredictionsfromourhierarchicalmodelsarecomparedwithfeaturesintheavailabledata.Possiblemodelextensionsarealsodiscussed.2.ASAMPLINGMODELFORMEASUREDEXPRESSIONMeasuredintensitylevelsR(red)andG(green)approximatetargetmeanvalues¹rand¹gatagivenspotonthemicroarray.(Weavoidusingsubscriptsfordistinguishingspotsunlessitisabsolutelynecessary.)Thegoalistoestimate½D¹r=¹g.Thestandard(naive)estimatorisO½NDR=G.Measurementerrordependsonsignalstrength(Chenetal.,1997).Toaccountforthisexplicitly,RandGaremodeledasindependentsamplesfromdistinctdistributionswithacommoncoefcientofvariation.WenditconvenienttoworkwithGammadistributionshavingaconstantshapeparameter.Theyexhibitconstantcoefcientofvariation,theyareGaussian-like,theyaresupportedonthepositiveline,andtheyareeasytomanipulate.TheGammamodelhasalonghistoryinstatisticalecology(e.g.,Fisheretal.,1943),notonlyforanalyticalconveniencebutalsobecauseitmaypossessadeeperbiologicalinterpretation.DennisandPatil(1984)showedthattheGammaistheapproximatestationarydistributionfortheabundanceofapopulationuctuatingaroundastableequilibrium.InSection6wecommentonthesensitivityofourresultstotheGammamodelassumption.TheprobabilitydensityofaGammavariableRwithscaleµrandshapeparameteraisp.rjµr;a/Dµarra 1exp. rµr/ 0.a/forr0(1) 40NEWTONETAL.where0.a/DR10xa 1e xdx.WedenotethisdensitybyGamma(a;µr).Similarly,wemodelthemeasuredintensityGasGamma(a;µg),andweassumethatRandGareindependent.TheexpectationsofRandGarea=µranda=µg,respectively,andthusthetargetdifferentialexpression½D¹r=¹gDµg=µr.BothRandGhavethesamecoefcientofvariationcD1=p aeventhoughtheymayhavedifferentscales.ByintegratingthejointdistributionofRandG,wecanderivethesamplingdistributionofthemeasureddifferentialexpressionTDR=G,p.tjµr;µg;a/D0.2a/ 02.a/³1 ½´.t=½/a 1 .1Ct=½/2a(2)fort0whereagain½D¹r=¹gistheparameterofinterest.Thetailofthisdistributionisasymptoticto1=taC1,sowerestricta1toensurearstmoment.Theformofthissamplingdistributioniswellknown;itisascalemultipleofaBetadistributionofthesecondkind(KendallandStuart,1969,p.151).AtthispointwecouldfollowthedevelopmentinChenetal.(1997)usingthisGammamodelinsteadofthenormalmodelusedinthatwork.Forinstance,weseethatwhen½D1,i.e.,norealdifferentialexpression,thedistributionofTdependsonthecoefcientofvariation1=p aonly.AproblemwiththisapproachisthatweloseinformationwhenusingtheratioTalonetoassessdifferentialexpression.Toseewhy,considertheconditionalsamplingdistributionofTDR=GgivenSDRGwithintheGammamodel.Thecaseofnoactualdifferentialexpression,½D1,ismostsimple.Denotingbyµthecommonvalueofµrandµg,wehavep.tjs;µ;a//1 texpf µp s.p tC1=p t/g:(3)Onthemultiplicativescaleofintensitymeasurements,Sactsliketotalabundancefrombothchannels.InspectionshowsthatthescaleofthisconditionaldistributiondependsonS;i.e.,thevariationinTissmallerforlargerS.Thevariabilityofdifferentialexpressionisnotconstant,andsoignoringthesechangescanleadtoinefcientstatisticalprocedures.OnewaytomodifytheprocedurefromChenetal.(1997),forexample,wouldbetousesomethinglike(3)asareferencedistributioninsteadoftheanalogto(1).Weinsteadtakeahierarchicalmodelingapproachwhichenablesdirectparameterestimationandhypothesistesting.3.ESTIMATINGDIFFERENTIALEXPRESSIONExceptfortestmicroarraysorhouse-keepinggenes,wecertainlyexpectrealdifferencesingeneexpres-sionbetweencelltypes;andclearlydifferentgenescanexhibitdifferencesinactualexpressionwithinagivencelltype.Akeydistinctionofthepresentapproachfromearliereffortsistheformulationofaspecicprobabilitymodeltocharacterizetheseuctuations.Amongtherangeofpossiblespecications,werstconsiderasimpleGammamodelforthescaleparametersµrandµg.ThisformisconjugatetotheGammasamplingmodelandthuspermitsadetailedanalysis.ItentailsindependenceamongallthescaleparametersonthemicroarrayandassumesthattheyfollowthecommonGammadistributionGamma(a0;º).Modeltcanbeimprovedslightlyifweallowdifferentscaleparameters,say,ºgandºrforthetwodyes,butwetakeacommonparameterinthepresentdevelopment.Thismodelisreasonablyexible,skewedright,andpresentsincreasingvariationwithincreasingmean.Itrepresentsprioruncertaintyinactualexpressionlevels.AnextensionwhichallowscorrelationisdescribedinSection5.OurmainreasonsforchoosingaGammadistributiontogovernthelatentscaleparametersµrandµgareanalyticaltractabilityandmodelexibility.Wenoteinpassing,however,thatsometheoreticaljusticationmayalsoexist.Thetargetexpressionlevels¹r/1=µrand¹g/1=µgeachrepresentsomekindoftrueabundanceofthegiventranscriptinthetwomRNApools.Assuch,theirdistributionconcernsrelativefrequenciesoffrequenciesandthesizefrequencyrelationcharacteristicoftheZipf-Paretolawmayobtain(Johnsonetal.,1994).Ifso,therelativefrequencyofgeneswithtranscriptabundance¹isproportionalto1=¹a0C1forsomepowera0.ThereciprocalGammadistributionhasessentiallythesamedensityformoderatetolargevalues¹. ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS41Withthetwomodelcomponentsinplace,wecanderivesomeinterestingconsequences.Notably,wecancomputetheposteriordistributionofthetruedifferentialexpressionatagivenspotp.½jR;G;´//½ .aCa0C1/»1 ½C.GCº/ .RCº/¼ 2.aCa0/(4)where´D.a;a0;º/denotestheadditionalparametersyettobespecied.ThisisthedistributionoftheratiooftwoindependentGammavariables,anditcanbederivedinthesamewayasthesamplingdistribution(2).Uncertaintyaboutthetruedifferentialexpressionatagivenspotischaracterizedbythisdistribution,andso,dependingonourlossfunction,theBayesestimateof½issomemeasureofitscenter.Wenotethemodeandmeanofthisright-skeweddistributionaremodeD³RCº GCº´³aCa0 1 aCa0C1´andmeanD³RCº GCº´³aCa0 aCa0 1´:Asitisinbetweenthesetwovalues,andissomewhatsimpler,wetakeastheBayesestimateofdifferentialexpressionO½BDRCº GCº:(5)Thishastheclassicformofashrinkageestimator.Forstrongsignals,O½BwillbequiteclosetothenaiveestimatorO½NDR=G,butthereisattenuationoftheBayesestimate,especiallywhentheoverallsignalintensityislow.Thus,theBayesestimatenaturallyaccountsforthedecreasedvariationindifferentialexpressionwithincreasingsignal.Theamountofattenuationisgovernedbytheparameterº,whichisyettobespecied.Beingthescaleparameteroftheexpressionmodelcomponent,wecanrepresentitinmorefamiliarterms.Considerthemarginalexpectationofsignalintensityintheredchannel,say,obtainedbyintegratinguncertaintyinµr,E.R/DE[E.Rjµr/]DE.a=µr/Daº=.a0 1/andsoºD.a0 1/E.R/=a.Thissimpleformulationcontainsthreekeyquantities.Thecoefcientofvariationinthemeasurementerroriscontrolledbya;thecoefcientofvariationdescribinguctuationsofactualexpressionamonggenesiscontrolledbya0;andE.R/istheoverallaverageintensitymeasurementintheredchannel.(Weareassumingthedataarenormalized,andourmodelassertsthatRandGhaveidenticalmarginaldistributions,sowecoulduseGaboveinsteadofR.)Apragmaticapproachtodealingwiththeunknownparameters´D.a;a0;º/istoestimatethembymarginalmaximumlikelihood.Thatis,byintegratinguncertaintyinbothµrandµgweobtainthepredictiveprobabilitydensityofeachmeasurementpair(R;G)aspA.r;g/D»0.aCa0/ 0.a/0.a0/¼2.º/2a0.rg/a 1 [.rCº/.gCº/]aCa0:(6)Duetotheindependenceassumption,thisjointdistributionistheproductofthemarginaldistributionofRandthemarginaldistributionofG.EachmarginisascalemixtureofGammadistributionsandhenceisacompoundGammadistributionbelongingtoTypeVIofPearsonssystem(Johnsonetal.,1994,p.381).Themarginalloglikelihoodl.a;a0;º/isthesumofcontributionsfromallspotsonthemicroarrayl.a;a0;º/DXklogpA.rk;gk/where.rk;gk/areobservedatthekthspot.Wecalll.a;a0;º/amarginalloglikelihoodratherthananordinaryloglikelihoodbecausethegene-specicµparametershavebeenintegratedaway.Weoptimize 42NEWTONETAL.Table1.ParameterEstimatesinGamma-GammaModela MicroarrayNAaa0º Control372.062.1012.84Heatshock822.121.617.13IPTG-a2071.481.8515.29IPTG-b1491.191.5714.71 aNAisthenumberofspots,outof4,290,inwhichback-groundishigherthanspotintensity,andtheremainingcolumnsgivemaximumlikelihoodparametervalues.Thescaleºisfornormalizationtofractionsthentimes105.l.a;a0;º/numericallyusingtheSplusfunctionnlminb(StatisticalSciences,1993).Intheexamplesworkedsofar,wendthataisestimatedtobelargerthana0 1,sotheestimateofºissmallerthanthemeanintensity.TheinferencethatresultsbyestimatingparametersasaboveiscalledempiricalBayes(EB)(EfronandMorris,1973).EffortsatwholegenomeexpressionanalysiswerepioneeredinE.coliK-12(Chuangetal.,1993).Se-quencingoftheE.coliK-12genomehasenabledthefabricationofhighresolutionmicroarrayscontainingtheentirecomplementof4,290openreadingframesinthisgenome(Blattneretal.,1997;Richmondetal.,1999).Todemonstrateourstatisticalmethodology,wereanalyzedapaneloffourmicroarraysfromthisE.coliproject.Oneoftheseisacontrolmicroarray,twoarereplicatesinvolvingtreatmentwithisopropyl-¯-D-thiogalactopyranoside(IPTG)inwhichitisknownthatonlyafewtranscriptsshouldbeinduced,andoneisfromaheatshocktreatment(HS)inwhichmoreglobalchangesareexpected.Forthecontrolmicroarray,totalRNAwasisolatedfromE.coligrowninarichmedium.Thissinglepoolwassplitintwo,separatelylabeled,andthenmixedandcohybridizedtothemicroarray.IntheIPTGreplicates,controlRNA(labeledCy3)wascohybridizedwithRNA(labeledCy5)fromE.colitreatedwithIPTG.Similarlyobtainedwastheheatshockmicroarray.Richmondetal.(1999)providedetails.Followingtheseauthors,weinvokeasimplenormalizationmethod;oneachmicroarray,werstsubtractbackgroundintensityfromeachspot.Thenwedivideeachadjustedintensitybythetotalintensityobtainedbycombiningallpositiveadjustedmeasurements.Weomitfromtheestimationprocessanyspotswherethebackgroundishigherthanthesignal(column1,Table1).MaximumlikelihoodparameterestimatesforthehierarchicalGammaGammamodelarealsogiveninTable1.TheEBestimatesofdifferentialexpressionattenuatethenaiveestimates,assummarizedgraphicallyinFig.1.Pointsindicatethenormalized,background-adjustedintensitiesinthetwodyes,say,.G;R/.Linesegmentsrunfromeach.G;R/to.GCº;RCº/.Owingtothelogarithmicscale,ofcourse,thisshrinkageismostpronouncedforlowintensityspots.Itisinterestingthatonthecontrolmicroarraytheshrinkageconstantisfairlylargeincomparisontotheheat-shockmicroarray,suggestingthatthemethodcandistinguishnoisefromsignicantsignal.TheattenuationinherentintheEBestimatesaffectstherankingofthemosthighlydifferentiallyex-pressedgenes.Figure2illustratestherankingchangesfortwooftheE.colimicroarrays.Weconsiderthetop100mostdifferentiallyexpressedgenes,asmeasuredbyO½NDR=G.Foreachnintherange1to100,weaskhowmanyofthesetopngenesarerankedinthetopnbytheEBprocedure.Ifthemethodsagree(i.e.,ifºisverysmall),thentheanswerisn.Aboutonequarterofthe100mosthighlydifferentiallyexpressedgenesmeasuredviaO½Narenotinthetop100asmeasuredbytheEBprocedure.Dataanalysismethodsoftenfocusonthemostdifferentiallyexpressedgenes,andsoitisquitepossiblethattheuseofthemoreefcientBayesestimationprocedurewillhaveanimpactondownstreamcomputationsusingmeasureddifferentialexpression.TheEBestimate.RCº/=.GCº/isattenuatedcomparedtothenaiveestimatorR=G,butistheestimateanybetter?Wereporttheresultsofasmallsimulationtoaddressthisquestion.Greenandredintensitiesweresimulatedforasyntheticmicroarrayhaving4,000spots.IntensitiesarosefromGammadistributionswithshapeparameteraD2andGammadistributedscalesinwhicha0D2andºD8.Weincorporated ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS43FIG.1.Shrinkageestimation,E.coli:Pointsareplottedatmeasuredintensities(G;R),andlinesegmentsextendto.GCº;RCº/.Onlinesparalleltothediagonal,foldchangeisconstant.somepositivecorrelationbetweengreenandredscaleparametersusingthemodeldescribedinSection5with·D20.Thusthesimulationmodeldifferedfromthemodelusedfortting.Bykeepingtrackofthesimulatedscalesµgandµr,ofcoursewecanmeasuretheerrorinestimating½forboththenaiveprocedureandtheEBprocedure(Fig.3).ThereisafairlysignicanterrorreductionbytheEBprocedureinthiscase.Othercasesnotreportedshowedsimilarerrorreductions.Beyondpointestimatesofdifferentialexpression,wecanuse(4)toobtainBayesiancondenceintervals(credibleintervals).Byachangeofvariables,1=.1C½=O½B/isdistributedaposterioriasasymmetricBetadistributionwithshapeparameteraCa0,andsoendpointsofacredibleintervalmaybecomputedbyback-transformingquantilesfromthissymmetricBeta.Thecredibleintervalprovidesameasureofuncertaintyin½,butwendthatinferencebeyondpointestimationmaybemoreaccurateinamodel,suchastheonedescribednext,inwhichitisrecognizedthattherearenorealchangesinexpressionforsomesubsetofgenes. 44NEWTONETAL.FIG.2.Effectonrankingofgenes.4.IDENTIFYINGSIGNIFICANTDIFFERENTIALEXPRESSIONWeturnattentiontodecidingwhetherornottheobserveddifferencesatagivengenearesufcientlylargetoassertsignicance.Havingathirdlayerinthehierarchicalmodelfacilitatesthecalculations.Thetruemeanintensitiesofsomeproportionpofspotschangebetweenconditions(i.e.,¹r6D¹g),whiletheothersremainxed.¹rD¹g/.Forspotswhichchange,weusethepreviousmodelfromSection3.Inotherwords,scaleparametersµrandµgareindependentGammavariateswithcommonshapea0andscaleº.Forunchangedspots,thecommonscaleparameterµisdeemedtoarisefromthesameGammaAproblempresentedbytheGamma-Gamma-Bernoullispecicationisthattheidentityofthechangedspotsisunknown.Thelikelihoodcalculations(whichwewouldusetoestimatea,a0,º,andp)ap-pearimpossiblycomplex,sincetheyinvolvesummationoverall2ncongurationsofmissingindica-tors,wherenisthenumberofspots.Fortunately,themodeltswellintotheEMalgorithmframework(Dempsteretal.,1977),andsowehaveasimplerecursiontoinfertheseparametersfromthemarginalTherstingredientinthecalculationisthemarginalprobabilityofdataataspotifthereisnorealdifferentialexpression.ThisisobtainedbyintegratingtheGammamodelforthecommonscaleµ.In ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS45FIG.3.Errorreduction:averagejlog.O½N=½/jD0:88whereasaveragejlog.O½B=½/jD0:42.TheempiricalBayesestimategivesa50%reductioninerror.contrastto(6)inwhichtherearedifferentscales,themarginalprobabilityinthenullcaseis:p0.r;g/D0.2aCa0/ 02.a/0.a0/ºa0.rg/a 1 [.rCgCº/]2aCa0:(7)Lettingrkandgkdenotethemeasuredintensitiesatspotkandintroducingthebinaryindicatorvariablezktobe0unlessthereistruedifferentialexpression,thecompletedataloglikelihoodislc.a;a0;º;p/DXkfzklogpA.rk;gk/C.1 zk/logp0.rk;gk/Czklog.p/C.1 zk/log.1 p/g:TheE-stepistoobtaintheconditionalexpectationoflc.a;a0;º;p/,whichsimplyinvolvesreplacingzkbytheposteriorprobabilityofchangeOzkDP.zkD1jrk;gk;p/DppA.rk;gk/ ppA.rk;gk/C.1 p/p0.rk;gk/(8)andwithparametersa,a0,º,andpxedattentativevalues.TheM-stepistomaximizetheresultantforminthefourparameters.Havingbrokenthemixturestructure,thismaximizationissimplied.Immediately,wendtheupdatedestimateofpisthearithmeticmeanoffOzkg.Anoff-the-shelfnumericalprocedure,suchasnlminbinSplus,readilyoptimizestheremainingparametersineachiteration.FortyiterationsofEMwereusedtoobtaintheestimatesreportedinTable2,andresultswerecheckedfromvariousstartingcongurations.Placingapriordistributionoverpstabilizesthecomputationsandenablesaniceinterpretationoftheoutput.WeuseaBeta(2,2)priorinwhatwereportinTable2,whichamountstoapriorassumptionofexchangeabilityofthefzkgandthatP.zkD1/D0:5uponintegratinguncertaintyinp.Itisconvenientto 46NEWTONETAL.Table2.ParameterEstimatesinGammaGammaBernoulliModelviaEMAlgorithm Microarrayaa0ºp Control22.900.940.280.003Heatshock2.751.374.120.052IPTG-a12.530.820.370.007IPTG-b9.690.680.280.004 xotherparameters,a,a0,andºattheirestimatedvaluesratherthanintegratingagainstaprior.Owingtothelargesamplesize,thereshouldnotbesignicanterrorinthepresentexamples.Ourgoalistocomputeposterioroddsofchangeateachspot.Theoddssummarizeourinferenceaboutactualdifferentialexpressionateachspotusingallthedataonthemicroarray.WithDDfrk;gkgdenotingexpressionmeasurementsonthewholemicroarray,theposterioroddsofchangeatspotkare:oddsDP.zkD1jD/ P.zkD0jD/;whereP.zkD1jD/DZ10P.zkD1jp;rk;gk/P.pjD/dp(9)byconditionalindependenceofthedataatdifferentspotsgiventheparameterp.TheBayesruledeterminesP.zkD1jp;rk;gk/intermsofp0.rk;gk/andpA.rk;gk/;see(8).Also,theEM-algorithmndstheposteriormodeofP.pjD/,say,Op.Toarstapproximation,theintegral(9)equalstheintegrandP.zkD1jrk;gk;p/evaluatedatitsmodalvaluepDOp.Therefore,odds¼pA.r;g/ p0.r;g/Op 1 Op:(10)TheseposterioroddsmayalsobecalledBayesfactorsbecausetheprioroddsforchangeequalunity.AninspectionoftheproleloglikelihoodcurveforpindicatedthatP.pjD/ishighlyconcentratedfortheE.coliexamples,andso(10)providesagoodapproximation.Figure4showscontoursoftheposterioroddsoftruedifferentialexpressioncomputedintheGamma-Gamma-Bernoullimodelusing(10).Thecontourmapprovidesaninterestingsummaryofsignicantchange.Thegreyareaineachpanelindicatesthattheoddsfavornochange.Animportantfeatureofthemapisthatthecontourlinesarenotstraightonthisloglogscale,indicatingthattoconsiderfoldchangesaloneisnotenough.Asweexpected,wearelesscondentaboutnaivefoldchangeatlowsignalintensity.Dottedlinesoneachpanelcorrespondtothe99%rulefromChenetal.(1997).(Weusedallthespotstoestimatethecoefcientofvariation,andthiswassolargeintheheat-shockdatathattheChenetal.approximationtotheupperbandbrokedown.)Theselinesaredesignedsothattheywillbeexceededonlyfor1%ofspotswhichinfacthavenotchanged.Withinthepresentmodel,mostofthis1%canbeexpectedtooccuratlowtotalabundance,asignofinefciency.TheBayesfactorcomputationallowsustorankordergenesbytheirprobabilityofrealdifferentialexpression.WehavereplicatemicroarraysfortheIPTGtreatment,soitisinstructivetochecktherepro-ducibilityoftheseassessments.Asatechnicalnote,wehadremovedintheestimationphaseanyspotsforwhichthemeasuredintensitywasbelowtheestimatedbackground.Forassessingchanges,weincludethesespotsanddeemthemtopresentzerosignalinthatchannel.(Interestingly,thelogBayesfactoriscontinuousatzerointensitysothereisnocomputationalproblemindoingso.)OnIPTG-a,20geneshaveoddsbetterthan10:1favoringchangedexpression.ThereplicateIPTG-bshowssevenchangedgenes.FivevalidgenesareincommonbetweenthereplicatesandareinducedbyIPTG.TheseveareexactlythegenesidentiedbyRichmondetal.(1999)asbeinginducedonthebasisofthesamemicroarraydataandotherradioactivitydata. ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS47FIG.4.OddsofrealdifferentialexpressionE.colidata:Inshadedregionoddsarefornochange.Contoursareatoddsofchangeof1:1,10:1,and100:1,respectively.Theheat-shockdataprovideaninterestingdemonstrationofthemethodology.UsingBayesfactors,wend38genesexhibitdifferentialexpression(25inducedand13repressed).Thisrepresentsabout1%ofthegenesonthemicroarray,andpriorworksuggeststhatmoregeneshavechanged(Richmondetal.,1999).Ifwelookattheparameterestimatesforthiscase(Table2),weseethattheestimatedproportionofchangedspots,Op,isabout5%(about200genes),andthisnumberisinlinewiththeearlierreport.Beingtheoptimalparametervalue,OpsatisestheequationOpD[PnkD1Ozk.Op/]=nwhereOzk.p/istheposteriorprobabilityofchangeatspotk,asin(8).Atthesametime,theBayesfactor(oddsforchange)istheratioofOzk.Op/to1 Ozk.Op/.Somethinginterestingisgoingon.Onaverageoverspots,theposteriorprobabilityofchangeisabout5%,andthisleadsustoinferthatabout5%ofspotshavechanged.Whenitcomestodecidingwhichspotshavechanged,however,only38spots,about1%,haveOzk.Op/0:5andthushaveoddsfavoringchange.Ourinferenceabouttheproportionofchangedspotsdoesnotneedtobethesameastheproportionofspotswhichwecancondentlysayhavechanged.Infact,byMarkovsinequality,theproportionofspotsinwhichOzk.Op/0:5isnogreaterthan2Op,andsomerenementstothisboundmaybepossible. 48NEWTONETAL.Table3.ErrorRates,SimulatedGammaGammaBernoulliModela Odds-Bigger-Than-1RuleLessStringentRule Ozk0:5Ozk·0:5Ozk0:236Ozk·0:236Truechange577423694306Nochange7329273112689 Total650335010053995 aTwodecisionrulesareexaminedbelowfora4,000spotmicroarrayinwhich1,000geneshavechangesintruegeneexpression.Signicantchangeisinferrediftheposteriorprobabilityofchange,Ozk,exceedsacutoff.Tostudyourmethodologyfurther,weperformedasmallsimulation.WeconsideredasinglemicroarraywithnD4000spotsandinwhichdatafor1,000spotsarosefromtheGamma-GammamodelwithaD12,a0D1,andºD1.Theremaining3,000spotshadvariableexpressionlevels,buttherewasnochangeintrueexpressionfromgreentored.ThesameGammamodelwasusedtogeneratethesecommonexpressionlevelsandthenalsotogeneratethemeasuredintensities.SobasicallywesimulatedtheGammaGammaBernoullimodel,butweforcedexactly1,000spotstochange.Parameterestimates,obtainedbytheEMalgorithm,wereOaD12:5,Oa0D1:0,OºD0:95,andOpD0:26,andthuswerecoveredthemodelparametersextremelywell.Table3recordserrorratesbytwodecisionrules.Takingourstandardrule,tocallaspotaschangediftheoddsexceedunity,weinferredonly650changedspots,muchlessthanthe1,000orsowhichweconcludehavechangedfromOp.Thisunderestimationmirrorswhathappenedwiththeheat-shockdata.Intotal,wemake496incorrectcallsusingtheodds-bigger-than-onerule.Asecond,muchlessstringentruleistocallatotalofnOpspotsaschanged,rank-orderedbytheirindividualposteriorprobabilities.Forthesesimulateddata,wethusloweredthebarfromachangeprobabilityof0.5toachangeprobabilityof0.236,andbysodoingweproducedalistofabout1,000genes.Thisrulehastheadvantagethatourconclusionaboutthenumberofchangedgenesisinlinewithourreportingofparticulargenes.But,atleastinthissimulation,weampliedtheoverallerrorrate;withthisrule,617spotswerecalledincorrectly.Thesimulationhighlightsdifcultieswithinferringdifferentialgeneexpression,butwepointoutthatspecicerrorratesmaydependontheapplication.5.MODELVALIDATIONANDDISCUSSIONBoththeGammaGammamodelandtheGammaGammaBernoullimodelattempttocapturesomestructuralfeaturesexpectedinmicroarraydata,butofcoursetheyarehighlyparameterizedanditisimportanttocheckwhetherpredictionsimpliedbythemareinlinewithavailabledata.Wehaveconsideredseveralsimplechecks.Forinstance,wecancompareahistogramofmeasuredintensitiestothettedmarginalmodel(Fig.5).Plottedoneachhistogram(onthelogscale)isthettedmarginaldensityfromtheGammaGammamodelusedforshrinkageestimation(dottedline)andthetteddensityfromtheGammaGamma-Bernoullimodel(solidline).Clearlythereisroomforimprovementinthet,buttheprimaryfeaturesofthedataarecaptured.Asecondinterestingcheckisbasedonawell-knownpropertyoftheGammadistribution.IfRandGaremeasurementsononespot,andthereisnorealdifferentialexpression,thenbothmeasurementsarisefromacommonGammadistributionwithshapeaandscaleµ.TherenormalizeddifferenceBDf1C.R G/=.RCG/g=2hasasymmetricBetadistributionwithshapeparametersaanda;i.e.,itsdensityisproportionaltoba 1.1 b/a 1forb2.0;1/.Notably,thisdensitydoesnotdependonthescaleparameter,soitisthecommondistributionforallunchangedspotsonthemicroarray.Figure6comparesthehistogramofBvaluestothettedBetadensity.Forthetreatmentmicroarrays,wefocusedonlyonBvaluesfromspotsdeemedtobeprobablyunchangedbytheBayesfactorcomputation.Thisisatruemodelcheck;thettingproceduredoesnotattempttocapturevariationsontheshownscale.Indeed,thetispoorinsomerespects,butagainprimaryfeaturesofvariationarecaptured. ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS49FIG.5.Diagnosticcheck:Histogramsareofintensities(bothcolorspooled)onthenaturallogscale.DashedcurveistfromGammaGammamodel;solidcurveistfromGammaGammaBernoullimodel.Weareusingjustfourparameterstodescribemarginalvariationanddependencebetweentheredandgreenchannels,andimprovementsmaycomebyincreasingthenumberofparameters.Wedidletthescaleparameterºbecolorspecic,andthisimprovedthetssomewhat,especiallyontheIPTGmicroarrays,astheestimatedvalueofagoesup.Wealsolettheshapeparameterabecolorspecic,butthisdidnotsignicantlyimprovets.TheuseofadditionalscaleparametersleadtoproblemswiththeBayesfactorcontoursinFig.4,andsocurrentlyweareinvestigatingmodelelaborationsandidentiabilityquestions.Weconjecturethatimprovementsmayariseifsomepositivecorrelationisaddedtotheexpressionmodel.WecandothisbyretainingtheGammasamplingmodelbutaddingcorrelationbetweenµrandµgintheexpressionmodel.Forexample,wemightsayûGamma(a0;º)andtwomultipliersmr;mg»iidGamma(·;·)forapositivedependenceparameter·.ThenwriteµrDmrÃandµgDmgÃ.Themultipliersarecenteredonunity,andwillbeclosetounityif·islarge.Thismodelisanintermediatebetweenthenullandalternativemodelssofarstudied,anditrequiresmoresophisticatedmachinerytot,butitmaybeeffectiveatidentifyingsubtleexpressionchanges.OurmethodsuseGammadistributionsbutotherparametricformscanbeconsidered.Variousparametricmodelsentailconstantcoefcientofvariationonthepositiveline.Thelog-normalmodelhasbeenusedforthispurpose,andacomparisonofthedifferentformulationswillbeuseful(Wiens,1999).Onthe 50NEWTONETAL.FIG.6.Diagnosticcheck:HistogramsareofrenormalizeddifferencesBD.1=2/[.R G/=.RCG/C1]forspotsdeemedtohavenotchanged.CurvesarepredictedBetadensities.basisofpreliminarycomputations,wecansaythatthesamequalitativeanalysisfeaturescarryovertothelog-normal,inparticulartheshapeofcontoursintheBayesfactorplotsissimilar.Wehaveusedoneparticularmethodofnormalizationandbackgroundnoiseadjustment.Probablysomeadvantagecanbegainedbycombiningthesetaskswiththepresentmodelingmethodtobetteraccountforthesesourcesofvariation.Forinstance,onnormalization,wecouldsaythatthescaleparametersinonesampleareaglobalconstantmultipleofthoseintheothersample,andthentreatthisconstantasanothermodelparametertobeestimatedfromunnormalizedintensities.ThisissimilartothecalibrationproceduredescribedinChenetal.,(1997),butinthecontextofahierarchicalmodel.Ourmethodologydealswithasinglemicroarrayatatime,anddoesnotattempttocombinedata,thoughthemodelingframeworkcertainlyallowsthiselaboration.Oneapproachwouldbetodecomposesaylogµ,theexpressionscaleparameter,intocontributionsfromdifferentgenes,differentRNApreparations,anddifferentgrowthconditions.Combininginformationfrommultiplemicroarraysmaybeaneffectivewaytoobtainaccurateestimatesofthecontributionofdifferentsourcesofvariation.Kerretal.(2000)providedetailsofarelatedmethodwhichexpressestheexpectedvalueoflog-transformedintensitymeasurementsintermsofcontributionsfromsuchfactors. ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS51Hierarchicalstatisticalmodelingallowsforefcientdataprocessinginlarge-scaleexpressionstudies.Thisprovidesmorepreciseestimatesofdifferentialgeneexpressionandmoreaccurateassessmentsofsignicantchangesthanstandardmethodsbyaccountingfordifferentialvariabilityindata.Calculationsaccountforthemeasurementerrorprocessandfornaturaluctuationsinabsoluteexpressionlevels.Preprocessingimagedataviathesemethodsmayreduceerrorsindownstreamtasks,suchasclusteranalysisorclassication.TheauthorsthankBobMauforhiscriticalreviewofanearlierdraftandarefereeforhelpfulcomments.TheyalsoacknowledgeinterestingdiscussionswithGaryChurchillwho,independently,hasconsideredtheuseofshrinkageestimationfordifferentialexpression,andtheythankAlexLoguinovwhoidentiedabuginoneoftheplottingprograms.Codeanddatausedinthisarticlearefreelyavailableattherstauthorswebsite www.stat.wisc.edu/ » newton/. ThisresearchwasfundedinpartbytheNationalCancerInstitute,grantR29CA64364-01toM.A.N.andtraininggrantTA-CA09565forC.M.K.NIHgrantR01GM35682supportedC.S.R.andF.R.B.REFERENCESBassettJr.,D.E.,Eisen,M.B.,andBoguski,M.S.1999.Geneexpressioninformaticsitsallinyourmine.NatureGenet.Suppl.21,5155.Blattner,F.R.,Plunkett,G.III.,Bloch,C.A.,Perna,N.T.,Buland,V.,Riley,M.,Collado-Vides,J.,Glasner,J.D.,Rode,C.K.,Mayhew,G.F.,Gregor,J.,Davis,N.W.,Kirkpatrick,H.A.,Goeden,M.A.,Rose,D.J.,Mau,R.,andShao,Y.1997.ThecompletegenomesequenceofEscherichiacoliK-12.Science277,14531474.Brown,P.O.,andBotstein,D.1999.ExploringthenewworldofthegenomewithDNAmicroarrays.NatureGenet.Suppl.21,3337.Carlin,B.P.,andLouis,T.A.1996.BayesandEmpiricalBayesMethodsforDataAnalysis.ChapmanandHall,NewYork.Chen,Y.,Dougherty,E.R.,andBittner,M.L.1997.Ratio-baseddecisionsandthequantitativeanalysisofcDNAmicroarrayimages.J.Biomed.Optics2(4),364374.Cheung,V.G.,Morley,M.,Aguilar,F.,Massimi,A.,Kucherlapti,R.,andChilds,G.1999.MakingandreadingNatureGenet.Suppl.21,1519.Chuang,S.-E.,Daniels,D.L.,Blattner,F.R.1993.GlobalregulationofgeneexpressioninEscherichiacoli.J.Bacteriol.175(7),20262036.Dempster,A.P.,Laird,N.M.,andRubin,D.B.1977.MaximumlikelihoodfromincompletedataviatheEMalgorithm(withdiscussion).J.RoyalStatisticalSociety,SeriesB39,138.Dennis,B.,andPatil,G.P.1984.Thegammadistributionandweightedmultimodalgammadistributionsasmodelsofpopulationabundance.MathematicalBiosciences,68,187212.Duggan,D.J.,Bittner,M.,Chen,Y.,Meltzer,P.,andTrent,J.M.1999.ExpressionprolingusingcDNAmicroarrays.NatureGeneticsSupplement21,1014.Efron,B.,andMorris,C.1973.Combiningpossiblyrelatedestimationproblems(withdiscussion).JournaloftheRoyalStatisticalSociety,SeriesB35,379421.Efron,B.,andMorris,C.1977.Steinsparadoxinstatistics.Sci.Am.236,119127.Eisen,M.B.,Spellman,P.T.,Brown,P.O.,andBotstein,D.1998.Clusteranalysisanddisplayofgenome-wideexpres-sionpatterns.Proc.Natl.Acad.Sci.USA95,1486314868.Fisher,R.A.,Corbet,A.S.,andWilliams,C.B.1943.Therelationbetweenthenumberofspeciesandthenumberofindividualsinarandomsampleofananimalpopulation.J.AnimalEcology12,4258.Johnson,N.L.,Kotz,S.,andBalakrishnan,N.1994.ContinuousUnivariateDistributions,vol.1,2nded.Wiley,NewYork.Kendall,M.G.,andStuart,A.1969.TheAdvancedTheoryofStatistics,vol.1,3rded.Hafner,NewYork.Kerr,M.K.,Martin,M.,andChurchill,G.A.2000.Analysisofvarianceforgeneexpressionmicroarraydata.Manuscript. http://www.jax.org/research/churchill/. Lander,E.S.1999.Arrayofhope.NatureGenet.Suppl.21,34.Lipshutz,R.J.,Fodor,S.P.A.,Gingeras,T.R.,andLockhart,D.J.1999.Highdensitysyntheticoligonucleotidearrays.NatureGenet.Suppl.21,2024. 52NEWTONETAL.Richmond,C.S.,Glasner,J.D.,Mau,R.,Jin,H.,andBlattner,F.R.1999.Genome-wideexpressionprolinginEs-cherichiacoliK-12.Nucl.AcidsRes.27(19),38213835.StatisticalSciences,1993.S-PLUSGuidetoStatisticalandMathematicalAnalysis,Version3.2.StatSci,adivisionofMathSoft,Inc.,Seattle.Wiens,B.L.1999.Whenlog-normalandgammamodelsgivedifferentresults:Acasestudy.Am.Statistician53,8993.Addresscorrespondenceto:MichaelA.NewtonDepartmentofStatistics1210W.DaytonStreetUniversityofWisconsinMadison,WI53706-1685E-mail: