/
JOURNA OF COMPU TIONA BIOLOG olum  Numbe   Mar An Liebert Inc Pp JOURNA OF COMPU TIONA BIOLOG olum  Numbe   Mar An Liebert Inc Pp

JOURNA OF COMPU TIONA BIOLOG olum Numbe Mar An Liebert Inc Pp - PDF document

alida-meadow
alida-meadow . @alida-meadow
Follow
447 views
Uploaded On 2014-12-25

JOURNA OF COMPU TIONA BIOLOG olum Numbe Mar An Liebert Inc Pp - PPT Presentation

37 52 On Differentia ariabilit of Expressio Ratios Improvin Statistica Inferenc abou Gen Expressio Change fro Microarra Dat MA NEWTON CM KENDZIORSKI CS RICHMOND R BL TTNER an K TSU ABSTRACT We conside th proble of inferrin fol change in gen expressi ID: 29378

Differentia

Share:

Link:

Embed:

Download Presentation from below link

Download Pdf The PPT/PDF document "JOURNA OF COMPU TIONA BIOLOG olum Numbe..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

JOURNALOFCOMPUTATIONALBIOLOGYVolume8,Number1,2001MaryAnnLiebert,Inc.Pp.37–52OnDifferentialVariabilityofExpressionRatios:ImprovingStatisticalInferenceaboutGeneExpressionChangesfromMicroarrayDataM.A.NEWTON,1;2C.M.KENDZIORSKI,2C.S.RICHMOND,3F.R.BLATTNER,3andK.W.TSUI 1DepartmentofStatistics,UniversityofWisconsin,Madison,WI53792.2DepartmentofBiostatisticsandMedicalInformatics,UniversityofWisconsin,Madison,WI53792.3LaboratoryofGenetics,UniversityofWisconsin,Madison,WI53792.37 38NEWTONETAL.sample.Followingconvention,weuseredtoindicatethesampletaggedwiththeCy5dyeandgreentoindicatetheCy3dye.Dugganetal.(1999)orCheungetal.(1999),forexample,providefurtherdetailsonhowtoobtaincDNAmicroarraydata.Similarexpressiondataareobtainedonoligonucleotidearrays(Lipshutzetal.,1999),thoughwefocusonthecDNAmicroarraysinthepresentdevelopment.Thereisreasontoexpectthatthestatisticalmethodologydescribedherewillapplyinbothdomains.Toaccountforintrinsicdifferencesbetweenthehybridizingsamples,intensitymeasurementsarenor-malizedinsomefashion.Onewayistocomparesignalsatasetofhouse-keepinggenes,i.e.,genesthoughttonotpresentsigniŽcantchangesinexpressionbetweensamples.Richmondetal.(1999)usedasimplemethodontheE.colimicroarraysreconsideredhere;theynormalizedbythetotalsignalintensityfromallspots.OtherpossibilitiesincludespikingthepreparedsamplewithknownconcentrationsofspeciŽcgenesorcombiningmeasurementstakeninbothorientationsofthedyes.Acomponentofeachintensitymeasurementatagivenspotisbackgrounduorescence.Anestimateofthiscomponentcanbeobtainedfrompixelsnearthespot,andthereportedintensityisthentheoriginalmeasurementminusthebackgroundmeasurement.Inwhatfollowsweconsiderthemeasurementstobenormalizedandtobeadjustedforbackgroundintensity.Itistypicalthatinferenceaboutdifferentialgeneexpressionbetweentwocelltypesisbasedontheratioofmeasuredexpressionlevels.Abasicstatisticalproblemistoknowwhenthemeasureddifferentialexpressionislikelytoreectarealbiologicalshiftingeneexpression.Thisdependsontheamountofvariationinthesystem,soitisdifŽculttojustifyaŽxedrule,suchastofocusongenesexhibitingmorethana3-foldshift,say.Certainlyreplicationwillbecriticalinapplications;asmoreandmoremicroarraysaremeasured,someconŽdenceintheexpressionproŽleswillundoubtedlyemerge.Ourimmediateconcerniswithdatafromasinglemicroarray;weseesomeroomforimprovementintheinitialsignalprocessingwhichmayhavebearingondownstreamtaskssuchasclusteringorotherformsofdataanalysis(e.g.,Eisenetal.,1998;Bassettetal.,1999).Agivenfoldchangeinmeasuredexpressionmayhaveadifferentinterpretationforagenewhoseabsoluteexpressionislowascomparedtoagenethatisbrightinbothuorescentchannels(notedinBassettetal.[1999]forexample).WearguethatanyprocedurewhichusestherawintensityratiosalonetoinferdifferentialexpressionmaybeinefŽcientandthusmayleadtoexcessiveerrors.Indeed,sourcesofvariationareexpectedtobesuchthattheabsoluteexpressionlevelsprovideinformationonthevariationofintensityratios.Thisinformationisignoredinthestandardtreatment.Onesolutionistoignoreanygeneswhosetranscriptsarepresentatalowtotalabundance.WemayhaveconŽdenceaboutthedifferentialexpressionofremaininggenes,butatthepriceofthrowingawaypotentiallyvaluabledata.Inanycase,thechoiceofacutoffmaybearbitrary.Agenemaybedeemedbelowthedetectionlevelinonechannelbutnotintheother.Furthermore,thetranscriptabundanceofmanyinterestinggenesmaybeverylow,andsothestrategyseemsfarfromoptimal.Thesolutionwedescribeisbasedonhierarchicalmodelsofmeasuredexpressionlevelsinwhichweaccountfortwoobvioussourcesofvariation.TheŽrstwecallmeasurementerror.Inhypotheticalrepetitionsoftheexperiment,themeasureduorescencesignalwilluctuatearoundsomemeanvaluewhichisitselfapropertyofthecelltype,theparticulargene,andotherfactors.Theseuctuationsareduetomultiplesourcesofvariationthatariseinproducingthemeasurement,suchasvariationinthepreparationofthemRNAsamplesandintheincorporationofuorescenttags,opticalnoise,andcrosshybridization.Importantly,thisvariationmayinclude,butisaboveandbeyond,thebackgroundnoisementionedabove.Thesecondmainsourceofvariationweconsiderisduetothedifferentgenesspottedontothemicroarray.Themeanuorescencevaluearoundwhichmeasuredexpressionuctuateschangesfromgenetogeneandwillserveasarandomeffectinourproposedmodel.ThepopulationofmRNAsfromagivensampleiscomposedofmanydistinctmolecules,butthepartitionisnotuniform;somemRNAsareabundantandothersarerare.AsweseeinSections3and4,byformallycombiningthetwosourcesofvariationwecanreadilyobtainprobabilisticstatementsaboutactualdifferentialexpression.WeŽndthattheobservedratiosarenotoptimalestimators;weŽndthatfocusingonfoldchangesaloneisinsufŽcientandthatconŽdencestatementsaboutdifferentialexpressiondependontranscriptabundance.PerhapstheŽrststatisticaltreatmentofmicroarraydataanalysisiscontainedinChenetal.(1997).Theseauthorsmaketheinterestingargumentthat,althoughexpectedexpressionlevelsdouctuatefromgeneto ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS39geneacrossthemicroarray,themeasurementsarelinkedbyhavingaconstantcoefŽcientofvariationc,say.Then,theobserveddifferentialexpression,say,TkDRk=Gk(ratioofredtogreenintensityatgenek),hasasamplingdistributiondependentonlyoncunderthenullhypothesisthatRkandGkhavethesameexpectation,i.e.,thatthereisnorealdifferentialexpression,andcomputationisunderaGaussianmodelforbothmeasurements.Usingasetofhouse-keepinggenes,whicharethoughttonotpresentrealdifferentialexpression,themaximumlikelihoodestimateofcisderivedand“conŽdenceintervals”foractualdifferentialexpressionarecomputedfrompercentilesoftheestimatednulldistributionofTk.Theintervalsareeasytocomputeandareresponsivetotheintrinsicvariationofdataonthemicroarraybecausetheyuseadata-dependentvalueofc.(Bycontrast,theproceduretocallsigniŽcantanygenepresenting,say,a3-foldorgreaterexpressiondifferentialisnotresponsivetosuchvariation.)Themethodhasignoredancillaryinformation;Rk£GkcontainsinformationaboutthevariationofTk.Inotherwords,TkisnotindependentofRk£Gk.Thereistheminortechnicalpoint,too,thatRkandGkaremodeledasGaussianwheninfacttheymustbepositive.WeconsiderasamplingmodelformeasuredintensitiesinSection2.OnecanviewtheChenetal.(1997)methodasproducingasetofhypothesistests,oneforeachgeneonthemicroarray,inwhichthenullhypothesisisthattheexpectationofbothintensitysignalsisequalandthealternativeisthattheyareunequal.WhenanobservedTkfallsinthetailsofthenullsamplingdistribution,werejectthenullanddeclaresigniŽcantdifferentialexpression.Asinotherdomainsofapplication,weknowthatsomebeneŽtcanbeattainedifthethousandsofparametersareconsideredsimultaneously,ratherthaninisolation(EfronandMorris,1973,1977;CarlinandLouis,1996).Thecalculationspresentedhereattempttodemonstratetheutilityoftreatingthegene-speciŽcparametersthemselvesasmembersofanŽcpopulation.InSection3weconsidertheproblemofestimatingandpossiblyformingaconŽdenceintervalfortheactualdifferentialexpressionofagivengene.Thisinvolvescalculationsinatwo-layerhierarchicalmodeltoproduceaposteriorprobabilitydistributionfortheactualdifferentialexpressionandanempiricalBayesestimateofthesame.WeillustratewithapaneloffourE.colimicroarrays,highlightingdistinctionsbetweentheempiricalBayesestimatesandthenaiveestimates.Asimulationstudyshowsthattotalestimationerrorcanbereducedbyusingthisprocedure.TheproblemoftestingforsigniŽcantdifferentialexpressionisthefocusofSection4,andhereathirdlayerisaddedtothehierarchicalmodel.Wederiveafunctionwhichgivestheoddsofactualdifferentialexpressionasafunctionofmeasuredintensities.Thisprovidesaneffectivesummary,asortofqualitynumber,foreachgene.Tosimplifyourdevelopment,wefocusondatafromasinglemicroarray,andwesupposethatthereisonespotforeachgene.WeaddresstheissueofmodelvalidationinSection5,wherepredictionsfromourhierarchicalmodelsarecomparedwithfeaturesintheavailabledata.Possiblemodelextensionsarealsodiscussed.2.ASAMPLINGMODELFORMEASUREDEXPRESSIONMeasuredintensitylevelsR(red)andG(green)approximatetargetmeanvalues¹rand¹gatagivenspotonthemicroarray.(Weavoidusingsubscriptsfordistinguishingspotsunlessitisabsolutelynecessary.)Thegoalistoestimate½D¹r=¹g.Thestandard(naive)estimatorisO½NDR=G.Measurementerrordependsonsignalstrength(Chenetal.,1997).Toaccountforthisexplicitly,RandGaremodeledasindependentsamplesfromdistinctdistributionswithacommoncoefŽcientofvariation.WeŽnditconvenienttoworkwithGammadistributionshavingaconstantshapeparameter.TheyexhibitconstantcoefŽcientofvariation,theyareGaussian-like,theyaresupportedonthepositiveline,andtheyareeasytomanipulate.TheGammamodelhasalonghistoryinstatisticalecology(e.g.,Fisheretal.,1943),notonlyforanalyticalconveniencebutalsobecauseitmaypossessadeeperbiologicalinterpretation.DennisandPatil(1984)showedthattheGammaistheapproximatestationarydistributionfortheabundanceofapopulationuctuatingaroundastableequilibrium.InSection6wecommentonthesensitivityofourresultstotheGammamodelassumption.TheprobabilitydensityofaGammavariableRwithscaleµrandshapeparameteraisp.rjµr;a/Dµarra 1exp. rµr/ 0.a/forr�0(1) 40NEWTONETAL.where0.a/DR10xa 1e xdx.WedenotethisdensitybyGamma(a;µr).Similarly,wemodelthemeasuredintensityGasGamma(a;µg),andweassumethatRandGareindependent.TheexpectationsofRandGarea=µranda=µg,respectively,andthusthetargetdifferentialexpression½D¹r=¹gDµg=µr.BothRandGhavethesamecoefŽcientofvariationcD1=p aeventhoughtheymayhavedifferentscales.ByintegratingthejointdistributionofRandG,wecanderivethesamplingdistributionofthemeasureddifferentialexpressionTDR=G,p.tjµr;µg;a/D0.2a/ 02.a/³1 ½´.t=½/a 1 .1Ct=½/2a(2)fort�0whereagain½D¹r=¹gistheparameterofinterest.Thetailofthisdistributionisasymptoticto1=taC1,sowerestricta�1toensureaŽrstmoment.Theformofthissamplingdistributioniswellknown;itisascalemultipleofaBetadistributionofthesecondkind(KendallandStuart,1969,p.151).AtthispointwecouldfollowthedevelopmentinChenetal.(1997)usingthisGammamodelinsteadofthenormalmodelusedinthatwork.Forinstance,weseethatwhen½D1,i.e.,norealdifferentialexpression,thedistributionofTdependsonthecoefŽcientofvariation1=p aonly.AproblemwiththisapproachisthatweloseinformationwhenusingtheratioTalonetoassessdifferentialexpression.Toseewhy,considertheconditionalsamplingdistributionofTDR=GgivenSDRGwithintheGammamodel.Thecaseofnoactualdifferentialexpression,½D1,ismostsimple.Denotingbyµthecommonvalueofµrandµg,wehavep.tjs;µ;a//1 texpf µp s.p tC1=p t/g:(3)Onthemultiplicativescaleofintensitymeasurements,Sactsliketotalabundancefrombothchannels.InspectionshowsthatthescaleofthisconditionaldistributiondependsonS;i.e.,thevariationinTissmallerforlargerS.Thevariabilityofdifferentialexpressionisnotconstant,andsoignoringthesechangescanleadtoinefŽcientstatisticalprocedures.OnewaytomodifytheprocedurefromChenetal.(1997),forexample,wouldbetousesomethinglike(3)asareferencedistributioninsteadoftheanalogto(1).Weinsteadtakeahierarchicalmodelingapproachwhichenablesdirectparameterestimationandhypothesistesting.3.ESTIMATINGDIFFERENTIALEXPRESSIONExceptfortestmicroarraysorhouse-keepinggenes,wecertainlyexpectrealdifferencesingeneexpres-sionbetweencelltypes;andclearlydifferentgenescanexhibitdifferencesinactualexpressionwithinagivencelltype.AkeydistinctionofthepresentapproachfromearliereffortsistheformulationofaspeciŽcprobabilitymodeltocharacterizetheseuctuations.AmongtherangeofpossiblespeciŽcations,weŽrstconsiderasimpleGammamodelforthescaleparametersµrandµg.ThisformisconjugatetotheGammasamplingmodelandthuspermitsadetailedanalysis.ItentailsindependenceamongallthescaleparametersonthemicroarrayandassumesthattheyfollowthecommonGammadistributionGamma(a0;º).ModelŽtcanbeimprovedslightlyifweallowdifferentscaleparameters,say,ºgandºrforthetwodyes,butwetakeacommonparameterinthepresentdevelopment.Thismodelisreasonablyexible,skewedright,andpresentsincreasingvariationwithincreasingmean.Itrepresentsprioruncertaintyinactualexpressionlevels.AnextensionwhichallowscorrelationisdescribedinSection5.OurmainreasonsforchoosingaGammadistributiontogovernthelatentscaleparametersµrandµgareanalyticaltractabilityandmodelexibility.Wenoteinpassing,however,thatsometheoreticaljustiŽcationmayalsoexist.Thetargetexpressionlevels¹r/1=µrand¹g/1=µgeachrepresentsomekindoftrueabundanceofthegiventranscriptinthetwomRNApools.Assuch,theirdistributionconcernsrelativefrequenciesoffrequenciesandthesize–frequencyrelationcharacteristicoftheZipf-Paretolawmayobtain(Johnsonetal.,1994).Ifso,therelativefrequencyofgeneswithtranscriptabundance¹isproportionalto1=¹a0C1forsomepowera0.ThereciprocalGammadistributionhasessentiallythesamedensityformoderatetolargevalues¹. ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS41Withthetwomodelcomponentsinplace,wecanderivesomeinterestingconsequences.Notably,wecancomputetheposteriordistributionofthetruedifferentialexpressionatagivenspotp.½jR;G;´//½ .aCa0C1/»1 ½C.GCº/ .RCº/¼ 2.aCa0/(4)where´D.a;a0;º/denotestheadditionalparametersyettobespeciŽed.ThisisthedistributionoftheratiooftwoindependentGammavariables,anditcanbederivedinthesamewayasthesamplingdistribution(2).Uncertaintyaboutthetruedifferentialexpressionatagivenspotischaracterizedbythisdistribution,andso,dependingonourlossfunction,theBayesestimateof½issomemeasureofitscenter.Wenotethemodeandmeanofthisright-skeweddistributionaremodeD³RCº GCº´³aCa0 1 aCa0C1´andmeanD³RCº GCº´³aCa0 aCa0 1´:Asitisinbetweenthesetwovalues,andissomewhatsimpler,wetakeastheBayesestimateofdifferentialexpressionO½BDRCº GCº:(5)Thishastheclassicformofashrinkageestimator.Forstrongsignals,O½BwillbequiteclosetothenaiveestimatorO½NDR=G,butthereisattenuationoftheBayesestimate,especiallywhentheoverallsignalintensityislow.Thus,theBayesestimatenaturallyaccountsforthedecreasedvariationindifferentialexpressionwithincreasingsignal.Theamountofattenuationisgovernedbytheparameterº,whichisyettobespeciŽed.Beingthescaleparameteroftheexpressionmodelcomponent,wecanrepresentitinmorefamiliarterms.Considerthemarginalexpectationofsignalintensityintheredchannel,say,obtainedbyintegratinguncertaintyinµr,E.R/DE[E.Rjµr/]DE.a=µr/Daº=.a0 1/andsoºD.a0 1/E.R/=a.Thissimpleformulationcontainsthreekeyquantities.ThecoefŽcientofvariationinthemeasurementerroriscontrolledbya;thecoefŽcientofvariationdescribinguctuationsofactualexpressionamonggenesiscontrolledbya0;andE.R/istheoverallaverageintensitymeasurementintheredchannel.(Weareassumingthedataarenormalized,andourmodelassertsthatRandGhaveidenticalmarginaldistributions,sowecoulduseGaboveinsteadofR.)Apragmaticapproachtodealingwiththeunknownparameters´D.a;a0;º/istoestimatethembymarginalmaximumlikelihood.Thatis,byintegratinguncertaintyinbothµrandµgweobtainthepredictiveprobabilitydensityofeachmeasurementpair(R;G)aspA.r;g/D»0.aCa0/ 0.a/0.a0/¼2.º/2a0.rg/a 1 [.rCº/.gCº/]aCa0:(6)Duetotheindependenceassumption,thisjointdistributionistheproductofthemarginaldistributionofRandthemarginaldistributionofG.EachmarginisascalemixtureofGammadistributionsandhenceisacompoundGammadistributionbelongingtoTypeVIofPearson’ssystem(Johnsonetal.,1994,p.381).Themarginalloglikelihoodl.a;a0;º/isthesumofcontributionsfromallspotsonthemicroarrayl.a;a0;º/DXklogpA.rk;gk/where.rk;gk/areobservedatthekthspot.Wecalll.a;a0;º/amarginalloglikelihoodratherthananordinaryloglikelihoodbecausethegene-speciŽcµparametershavebeenintegratedaway.Weoptimize 42NEWTONETAL.Table1.ParameterEstimatesinGamma-GammaModela MicroarrayNAaa0º Control372.062.1012.84Heatshock822.121.617.13IPTG-a2071.481.8515.29IPTG-b1491.191.5714.71 aNAisthenumberofspots,outof4,290,inwhichback-groundishigherthanspotintensity,andtheremainingcolumnsgivemaximumlikelihoodparametervalues.Thescaleºisfornormalizationtofractionsthentimes105.l.a;a0;º/numericallyusingtheSplusfunctionnlminb(StatisticalSciences,1993).Intheexamplesworkedsofar,weŽndthataisestimatedtobelargerthana0 1,sotheestimateofºissmallerthanthemeanintensity.TheinferencethatresultsbyestimatingparametersasaboveiscalledempiricalBayes(EB)(EfronandMorris,1973).EffortsatwholegenomeexpressionanalysiswerepioneeredinE.coliK-12(Chuangetal.,1993).Se-quencingoftheE.coliK-12genomehasenabledthefabricationofhighresolutionmicroarrayscontainingtheentirecomplementof4,290openreadingframesinthisgenome(Blattneretal.,1997;Richmondetal.,1999).Todemonstrateourstatisticalmethodology,wereanalyzedapaneloffourmicroarraysfromthisE.coliproject.Oneoftheseisacontrolmicroarray,twoarereplicatesinvolvingtreatmentwithisopropyl-¯-D-thiogalactopyranoside(IPTG)inwhichitisknownthatonlyafewtranscriptsshouldbeinduced,andoneisfromaheatshocktreatment(HS)inwhichmoreglobalchangesareexpected.Forthecontrolmicroarray,totalRNAwasisolatedfromE.coligrowninarichmedium.Thissinglepoolwassplitintwo,separatelylabeled,andthenmixedandcohybridizedtothemicroarray.IntheIPTGreplicates,controlRNA(labeledCy3)wascohybridizedwithRNA(labeledCy5)fromE.colitreatedwithIPTG.Similarlyobtainedwastheheatshockmicroarray.Richmondetal.(1999)providedetails.Followingtheseauthors,weinvokeasimplenormalizationmethod;oneachmicroarray,weŽrstsubtractbackgroundintensityfromeachspot.Thenwedivideeachadjustedintensitybythetotalintensityobtainedbycombiningallpositiveadjustedmeasurements.Weomitfromtheestimationprocessanyspotswherethebackgroundishigherthanthesignal(column1,Table1).MaximumlikelihoodparameterestimatesforthehierarchicalGamma–GammamodelarealsogiveninTable1.TheEBestimatesofdifferentialexpressionattenuatethenaiveestimates,assummarizedgraphicallyinFig.1.Pointsindicatethenormalized,background-adjustedintensitiesinthetwodyes,say,.G;R/.Linesegmentsrunfromeach.G;R/to.GCº;RCº/.Owingtothelogarithmicscale,ofcourse,thisshrinkageismostpronouncedforlowintensityspots.Itisinterestingthatonthecontrolmicroarraytheshrinkageconstantisfairlylargeincomparisontotheheat-shockmicroarray,suggestingthatthemethodcandistinguishnoisefromsigniŽcantsignal.TheattenuationinherentintheEBestimatesaffectstherankingofthemosthighlydifferentiallyex-pressedgenes.Figure2illustratestherankingchangesfortwooftheE.colimicroarrays.Weconsiderthetop100mostdifferentiallyexpressedgenes,asmeasuredbyO½NDR=G.Foreachnintherange1to100,weaskhowmanyofthesetopngenesarerankedinthetopnbytheEBprocedure.Ifthemethodsagree(i.e.,ifºisverysmall),thentheanswerisn.Aboutonequarterofthe100mosthighlydifferentiallyexpressedgenesmeasuredviaO½Narenotinthetop100asmeasuredbytheEBprocedure.Dataanalysismethodsoftenfocusonthemostdifferentiallyexpressedgenes,andsoitisquitepossiblethattheuseofthemoreefŽcientBayesestimationprocedurewillhaveanimpacton“downstream”computationsusingmeasureddifferentialexpression.TheEBestimate.RCº/=.GCº/isattenuatedcomparedtothenaiveestimatorR=G,butistheestimateanybetter?Wereporttheresultsofasmallsimulationtoaddressthisquestion.Greenandredintensitiesweresimulatedforasyntheticmicroarrayhaving4,000spots.IntensitiesarosefromGammadistributionswithshapeparameteraD2andGammadistributedscalesinwhicha0D2andºD8.Weincorporated ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS43FIG.1.Shrinkageestimation,E.coli:Pointsareplottedatmeasuredintensities(G;R),andlinesegmentsextendto.GCº;RCº/.Onlinesparalleltothediagonal,foldchangeisconstant.somepositivecorrelationbetweengreenandredscaleparametersusingthemodeldescribedinSection5with·D20.ThusthesimulationmodeldifferedfromthemodelusedforŽtting.Bykeepingtrackofthesimulatedscalesµgandµr,ofcoursewecanmeasuretheerrorinestimating½forboththenaiveprocedureandtheEBprocedure(Fig.3).ThereisafairlysigniŽcanterrorreductionbytheEBprocedureinthiscase.Othercasesnotreportedshowedsimilarerrorreductions.Beyondpointestimatesofdifferentialexpression,wecanuse(4)toobtainBayesianconŽdenceintervals(credibleintervals).Byachangeofvariables,1=.1C½=O½B/isdistributedaposterioriasasymmetricBetadistributionwithshapeparameteraCa0,andsoendpointsofacredibleintervalmaybecomputedbyback-transformingquantilesfromthissymmetricBeta.Thecredibleintervalprovidesameasureofuncertaintyin½,butweŽndthatinferencebeyondpointestimationmaybemoreaccurateinamodel,suchastheonedescribednext,inwhichitisrecognizedthattherearenorealchangesinexpressionforsomesubsetofgenes. 44NEWTONETAL.FIG.2.Effectonrankingofgenes.4.IDENTIFYINGSIGNIFICANTDIFFERENTIALEXPRESSIONWeturnattentiontodecidingwhetherornottheobserveddifferencesatagivengenearesufŽcientlylargetoassertsigniŽcance.Havingathirdlayerinthehierarchicalmodelfacilitatesthecalculations.Thetruemeanintensitiesofsomeproportionpofspotschangebetweenconditions(i.e.,¹r6D¹g),whiletheothersremainŽxed.¹rD¹g/.Forspotswhichchange,weusethepreviousmodelfromSection3.Inotherwords,scaleparametersµrandµgareindependentGammavariateswithcommonshapea0andscaleº.Forunchangedspots,thecommonscaleparameterµisdeemedtoarisefromthesameGammaAproblempresentedbytheGamma-Gamma-BernoullispeciŽcationisthattheidentityofthechangedspotsisunknown.Thelikelihoodcalculations(whichwewouldusetoestimatea,a0,º,andp)ap-pearimpossiblycomplex,sincetheyinvolvesummationoverall2nconŽgurationsofmissingindica-tors,wherenisthenumberofspots.Fortunately,themodelŽtswellintotheEMalgorithmframework(Dempsteretal.,1977),andsowehaveasimplerecursiontoinfertheseparametersfromthemarginalTheŽrstingredientinthecalculationisthemarginalprobabilityofdataataspotifthereisnorealdifferentialexpression.ThisisobtainedbyintegratingtheGammamodelforthecommonscaleµ.In ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS45FIG.3.Errorreduction:averagejlog.O½N=½/jD0:88whereasaveragejlog.O½B=½/jD0:42.TheempiricalBayesestimategivesa50%reductioninerror.contrastto(6)inwhichtherearedifferentscales,themarginalprobabilityinthenullcaseis:p0.r;g/D0.2aCa0/ 02.a/0.a0/ºa0.rg/a 1 [.rCgCº/]2aCa0:(7)Lettingrkandgkdenotethemeasuredintensitiesatspotkandintroducingthebinaryindicatorvariablezktobe0unlessthereistruedifferentialexpression,thecompletedataloglikelihoodislc.a;a0;º;p/DXkfzklogpA.rk;gk/C.1 zk/logp0.rk;gk/Czklog.p/C.1 zk/log.1 p/g:TheE-stepistoobtaintheconditionalexpectationoflc.a;a0;º;p/,whichsimplyinvolvesreplacingzkbytheposteriorprobabilityofchangeOzkDP.zkD1jrk;gk;p/DppA.rk;gk/ ppA.rk;gk/C.1 p/p0.rk;gk/(8)andwithparametersa,a0,º,andpŽxedattentativevalues.TheM-stepistomaximizetheresultantforminthefourparameters.Havingbrokenthemixturestructure,thismaximizationissimpliŽed.Immediately,weŽndtheupdatedestimateofpisthearithmeticmeanoffOzkg.Anoff-the-shelfnumericalprocedure,suchasnlminbinSplus,readilyoptimizestheremainingparametersineachiteration.FortyiterationsofEMwereusedtoobtaintheestimatesreportedinTable2,andresultswerecheckedfromvariousstartingconŽgurations.Placingapriordistributionoverpstabilizesthecomputationsandenablesaniceinterpretationoftheoutput.WeuseaBeta(2,2)priorinwhatwereportinTable2,whichamountstoapriorassumptionofexchangeabilityofthefzkgandthatP.zkD1/D0:5uponintegratinguncertaintyinp.Itisconvenientto 46NEWTONETAL.Table2.ParameterEstimatesinGamma–Gamma–BernoulliModelviaEMAlgorithm Microarrayaa0ºp Control22.900.940.280.003Heatshock2.751.374.120.052IPTG-a12.530.820.370.007IPTG-b9.690.680.280.004 Žxotherparameters,a,a0,andºattheirestimatedvaluesratherthanintegratingagainstaprior.Owingtothelargesamplesize,thereshouldnotbesigniŽcanterrorinthepresentexamples.Ourgoalistocomputeposterioroddsofchangeateachspot.Theoddssummarizeourinferenceaboutactualdifferentialexpressionateachspotusingallthedataonthemicroarray.WithDDfrk;gkgdenotingexpressionmeasurementsonthewholemicroarray,theposterioroddsofchangeatspotkare:oddsDP.zkD1jD/ P.zkD0jD/;whereP.zkD1jD/DZ10P.zkD1jp;rk;gk/P.pjD/dp(9)byconditionalindependenceofthedataatdifferentspotsgiventheparameterp.TheBayesruledeterminesP.zkD1jp;rk;gk/intermsofp0.rk;gk/andpA.rk;gk/;see(8).Also,theEM-algorithmŽndstheposteriormodeofP.pjD/,say,Op.ToaŽrstapproximation,theintegral(9)equalstheintegrandP.zkD1jrk;gk;p/evaluatedatitsmodalvaluepDOp.Therefore,odds¼pA.r;g/ p0.r;g/Op 1 Op:(10)TheseposterioroddsmayalsobecalledBayesfactorsbecausetheprioroddsforchangeequalunity.AninspectionoftheproŽleloglikelihoodcurveforpindicatedthatP.pjD/ishighlyconcentratedfortheE.coliexamples,andso(10)providesagoodapproximation.Figure4showscontoursoftheposterioroddsoftruedifferentialexpressioncomputedintheGamma-Gamma-Bernoullimodelusing(10).ThecontourmapprovidesaninterestingsummaryofsigniŽcantchange.Thegreyareaineachpanelindicatesthattheoddsfavornochange.Animportantfeatureofthemapisthatthecontourlinesarenotstraightonthislog–logscale,indicatingthattoconsiderfoldchangesaloneisnotenough.Asweexpected,wearelessconŽdentaboutnaivefoldchangeatlowsignalintensity.Dottedlinesoneachpanelcorrespondtothe99%rulefromChenetal.(1997).(WeusedallthespotstoestimatethecoefŽcientofvariation,andthiswassolargeintheheat-shockdatathattheChenetal.approximationtotheupperbandbrokedown.)Theselinesaredesignedsothattheywillbeexceededonlyfor1%ofspotswhichinfacthavenotchanged.Withinthepresentmodel,mostofthis1%canbeexpectedtooccuratlowtotalabundance,asignofinefŽciency.TheBayesfactorcomputationallowsustorankordergenesbytheirprobabilityofrealdifferentialexpression.WehavereplicatemicroarraysfortheIPTGtreatment,soitisinstructivetochecktherepro-ducibilityoftheseassessments.Asatechnicalnote,wehadremovedintheestimationphaseanyspotsforwhichthemeasuredintensitywasbelowtheestimatedbackground.Forassessingchanges,weincludethesespotsanddeemthemtopresentzerosignalinthatchannel.(Interestingly,thelogBayesfactoriscontinuousatzerointensitysothereisnocomputationalproblemindoingso.)OnIPTG-a,20geneshaveoddsbetterthan10:1favoringchangedexpression.ThereplicateIPTG-bshowssevenchangedgenes.FivevalidgenesareincommonbetweenthereplicatesandareinducedbyIPTG.TheseŽveareexactlythegenesidentiŽedbyRichmondetal.(1999)asbeinginducedonthebasisofthesamemicroarraydataandotherradioactivitydata. ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS47FIG.4.OddsofrealdifferentialexpressionE.colidata:Inshadedregionoddsarefornochange.Contoursareatoddsofchangeof1:1,10:1,and100:1,respectively.Theheat-shockdataprovideaninterestingdemonstrationofthemethodology.UsingBayesfactors,weŽnd38genesexhibitdifferentialexpression(25inducedand13repressed).Thisrepresentsabout1%ofthegenesonthemicroarray,andpriorworksuggeststhatmoregeneshavechanged(Richmondetal.,1999).Ifwelookattheparameterestimatesforthiscase(Table2),weseethattheestimatedproportionofchangedspots,Op,isabout5%(about200genes),andthisnumberisinlinewiththeearlierreport.Beingtheoptimalparametervalue,OpsatisŽestheequationOpD[PnkD1Ozk.Op/]=nwhereOzk.p/istheposteriorprobabilityofchangeatspotk,asin(8).Atthesametime,theBayesfactor(oddsforchange)istheratioofOzk.Op/to1 Ozk.Op/.Somethinginterestingisgoingon.Onaverageoverspots,theposteriorprobabilityofchangeisabout5%,andthisleadsustoinferthatabout5%ofspotshavechanged.Whenitcomestodecidingwhichspotshavechanged,however,only38spots,about1%,haveOzk.Op/�0:5andthushaveoddsfavoringchange.OurinferenceabouttheproportionofchangedspotsdoesnotneedtobethesameastheproportionofspotswhichwecanconŽdentlysayhavechanged.Infact,byMarkov’sinequality,theproportionofspotsinwhichOzk.Op/�0:5isnogreaterthan2Op,andsomereŽnementstothisboundmaybepossible. 48NEWTONETAL.Table3.ErrorRates,SimulatedGamma–Gamma–BernoulliModela Odds-Bigger-Than-1RuleLessStringentRule Ozk�0:5Ozk·0:5Ozk�0:236Ozk·0:236Truechange577423694306Nochange7329273112689 Total650335010053995 aTwodecisionrulesareexaminedbelowfora4,000spotmicroarrayinwhich1,000geneshavechangesintruegeneexpression.SigniŽcantchangeisinferrediftheposteriorprobabilityofchange,Ozk,exceedsacutoff.Tostudyourmethodologyfurther,weperformedasmallsimulation.WeconsideredasinglemicroarraywithnD4000spotsandinwhichdatafor1,000spotsarosefromtheGamma-GammamodelwithaD12,a0D1,andºD1.Theremaining3,000spotshadvariableexpressionlevels,buttherewasnochangeintrueexpressionfromgreentored.ThesameGammamodelwasusedtogeneratethesecommonexpressionlevelsandthenalsotogeneratethemeasuredintensities.SobasicallywesimulatedtheGamma–Gamma–Bernoullimodel,butweforcedexactly1,000spotstochange.Parameterestimates,obtainedbytheEMalgorithm,wereOaD12:5,Oa0D1:0,OºD0:95,andOpD0:26,andthuswerecoveredthemodelparametersextremelywell.Table3recordserrorratesbytwodecisionrules.Takingourstandardrule,tocallaspotaschangediftheoddsexceedunity,weinferredonly650changedspots,muchlessthanthe1,000orsowhichweconcludehavechangedfromOp.Thisunderestimationmirrorswhathappenedwiththeheat-shockdata.Intotal,wemake496incorrectcallsusingtheodds-bigger-than-onerule.Asecond,muchlessstringentruleistocallatotalofnOpspotsaschanged,rank-orderedbytheirindividualposteriorprobabilities.Forthesesimulateddata,wethusloweredthebarfromachangeprobabilityof0.5toachangeprobabilityof0.236,andbysodoingweproducedalistofabout1,000genes.Thisrulehastheadvantagethatourconclusionaboutthenumberofchangedgenesisinlinewithourreportingofparticulargenes.But,atleastinthissimulation,weampliŽedtheoverallerrorrate;withthisrule,617spotswerecalledincorrectly.ThesimulationhighlightsdifŽcultieswithinferringdifferentialgeneexpression,butwepointoutthatspeciŽcerrorratesmaydependontheapplication.5.MODELVALIDATIONANDDISCUSSIONBoththeGamma–GammamodelandtheGamma–Gamma–Bernoullimodelattempttocapturesomestructuralfeaturesexpectedinmicroarraydata,butofcoursetheyarehighlyparameterizedanditisimportanttocheckwhetherpredictionsimpliedbythemareinlinewithavailabledata.Wehaveconsideredseveralsimplechecks.Forinstance,wecancompareahistogramofmeasuredintensitiestotheŽttedmarginalmodel(Fig.5).Plottedoneachhistogram(onthelogscale)istheŽttedmarginaldensityfromtheGamma–Gammamodelusedforshrinkageestimation(dottedline)andtheŽtteddensityfromtheGamma–Gamma-Bernoullimodel(solidline).ClearlythereisroomforimprovementintheŽt,buttheprimaryfeaturesofthedataarecaptured.Asecondinterestingcheckisbasedonawell-knownpropertyoftheGammadistribution.IfRandGaremeasurementsononespot,andthereisnorealdifferentialexpression,thenbothmeasurementsarisefromacommonGammadistributionwithshapeaandscaleµ.TherenormalizeddifferenceBDf1C.R G/=.RCG/g=2hasasymmetricBetadistributionwithshapeparametersaanda;i.e.,itsdensityisproportionaltoba 1.1 b/a 1forb2.0;1/.Notably,thisdensitydoesnotdependonthescaleparameter,soitisthecommondistributionforallunchangedspotsonthemicroarray.Figure6comparesthehistogramofBvaluestotheŽttedBetadensity.Forthetreatmentmicroarrays,wefocusedonlyonBvaluesfromspotsdeemedtobeprobablyunchangedbytheBayesfactorcomputation.Thisisatruemodelcheck;theŽttingproceduredoesnotattempttocapturevariationsontheshownscale.Indeed,theŽtispoorinsomerespects,butagainprimaryfeaturesofvariationarecaptured. ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS49FIG.5.Diagnosticcheck:Histogramsareofintensities(bothcolorspooled)onthenaturallogscale.DashedcurveisŽtfromGamma–Gammamodel;solidcurveisŽtfromGamma–Gamma–Bernoullimodel.Weareusingjustfourparameterstodescribemarginalvariationanddependencebetweentheredandgreenchannels,andimprovementsmaycomebyincreasingthenumberofparameters.WedidletthescaleparameterºbecolorspeciŽc,andthisimprovedtheŽtssomewhat,especiallyontheIPTGmicroarrays,astheestimatedvalueofagoesup.WealsolettheshapeparameterabecolorspeciŽc,butthisdidnotsigniŽcantlyimproveŽts.TheuseofadditionalscaleparametersleadtoproblemswiththeBayesfactorcontoursinFig.4,andsocurrentlyweareinvestigatingmodelelaborationsandidentiŽabilityquestions.Weconjecturethatimprovementsmayariseifsomepositivecorrelationisaddedtotheexpressionmodel.WecandothisbyretainingtheGammasamplingmodelbutaddingcorrelationbetweenµrandµgintheexpressionmodel.Forexample,wemightsayûGamma(a0;º)andtwomultipliersmr;mg»iidGamma(·;·)forapositivedependenceparameter·.ThenwriteµrDmrÃandµgDmgÃ.Themultipliersarecenteredonunity,andwillbeclosetounityif·islarge.Thismodelisanintermediatebetweenthenullandalternativemodelssofarstudied,anditrequiresmoresophisticatedmachinerytoŽt,butitmaybeeffectiveatidentifyingsubtleexpressionchanges.OurmethodsuseGammadistributionsbutotherparametricformscanbeconsidered.VariousparametricmodelsentailconstantcoefŽcientofvariationonthepositiveline.Thelog-normalmodelhasbeenusedforthispurpose,andacomparisonofthedifferentformulationswillbeuseful(Wiens,1999).Onthe 50NEWTONETAL.FIG.6.Diagnosticcheck:HistogramsareofrenormalizeddifferencesBD.1=2/[.R G/=.RCG/C1]forspotsdeemedtohavenotchanged.CurvesarepredictedBetadensities.basisofpreliminarycomputations,wecansaythatthesamequalitativeanalysisfeaturescarryovertothelog-normal,inparticulartheshapeofcontoursintheBayesfactorplotsissimilar.Wehaveusedoneparticularmethodofnormalizationandbackgroundnoiseadjustment.Probablysomeadvantagecanbegainedbycombiningthesetaskswiththepresentmodelingmethodtobetteraccountforthesesourcesofvariation.Forinstance,onnormalization,wecouldsaythatthescaleparametersinonesampleareaglobalconstantmultipleofthoseintheothersample,andthentreatthisconstantasanothermodelparametertobeestimatedfromunnormalizedintensities.ThisissimilartothecalibrationproceduredescribedinChenetal.,(1997),butinthecontextofahierarchicalmodel.Ourmethodologydealswithasinglemicroarrayatatime,anddoesnotattempttocombinedata,thoughthemodelingframeworkcertainlyallowsthiselaboration.Oneapproachwouldbetodecomposesaylogµ,theexpressionscaleparameter,intocontributionsfromdifferentgenes,differentRNApreparations,anddifferentgrowthconditions.Combininginformationfrommultiplemicroarraysmaybeaneffectivewaytoobtainaccurateestimatesofthecontributionofdifferentsourcesofvariation.Kerretal.(2000)providedetailsofarelatedmethodwhichexpressestheexpectedvalueoflog-transformedintensitymeasurementsintermsofcontributionsfromsuchfactors. ONDIFFERENTIALVARIABILITYOFEXPRESSIONRATIOS51HierarchicalstatisticalmodelingallowsforefŽcientdataprocessinginlarge-scaleexpressionstudies.ThisprovidesmorepreciseestimatesofdifferentialgeneexpressionandmoreaccurateassessmentsofsigniŽcantchangesthanstandardmethodsbyaccountingfordifferentialvariabilityindata.Calculationsaccountforthemeasurementerrorprocessandfornaturaluctuationsinabsoluteexpressionlevels.Preprocessingimagedataviathesemethodsmayreduceerrorsindownstreamtasks,suchasclusteranalysisorclassiŽcation.TheauthorsthankBobMauforhiscriticalreviewofanearlierdraftandarefereeforhelpfulcomments.TheyalsoacknowledgeinterestingdiscussionswithGaryChurchillwho,independently,hasconsideredtheuseofshrinkageestimationfordifferentialexpression,andtheythankAlexLoguinovwhoidentiŽedabuginoneoftheplottingprograms.CodeanddatausedinthisarticlearefreelyavailableattheŽrstauthor’swebsite www.stat.wisc.edu/ » newton/. ThisresearchwasfundedinpartbytheNationalCancerInstitute,grantR29CA64364-01toM.A.N.andtraininggrantTA-CA09565forC.M.K.NIHgrantR01GM35682supportedC.S.R.andF.R.B.REFERENCESBassettJr.,D.E.,Eisen,M.B.,andBoguski,M.S.1999.Geneexpressioninformatics—it’sallinyourmine.NatureGenet.Suppl.21,51–55.Blattner,F.R.,Plunkett,G.III.,Bloch,C.A.,Perna,N.T.,Buland,V.,Riley,M.,Collado-Vides,J.,Glasner,J.D.,Rode,C.K.,Mayhew,G.F.,Gregor,J.,Davis,N.W.,Kirkpatrick,H.A.,Goeden,M.A.,Rose,D.J.,Mau,R.,andShao,Y.1997.ThecompletegenomesequenceofEscherichiacoliK-12.Science277,1453–1474.Brown,P.O.,andBotstein,D.1999.ExploringthenewworldofthegenomewithDNAmicroarrays.NatureGenet.Suppl.21,33–37.Carlin,B.P.,andLouis,T.A.1996.BayesandEmpiricalBayesMethodsforDataAnalysis.ChapmanandHall,NewYork.Chen,Y.,Dougherty,E.R.,andBittner,M.L.1997.Ratio-baseddecisionsandthequantitativeanalysisofcDNAmicroarrayimages.J.Biomed.Optics2(4),364–374.Cheung,V.G.,Morley,M.,Aguilar,F.,Massimi,A.,Kucherlapti,R.,andChilds,G.1999.MakingandreadingNatureGenet.Suppl.21,15–19.Chuang,S.-E.,Daniels,D.L.,Blattner,F.R.1993.GlobalregulationofgeneexpressioninEscherichiacoli.J.Bacteriol.175(7),2026–2036.Dempster,A.P.,Laird,N.M.,andRubin,D.B.1977.MaximumlikelihoodfromincompletedataviatheEMalgorithm(withdiscussion).J.RoyalStatisticalSociety,SeriesB39,1–38.Dennis,B.,andPatil,G.P.1984.Thegammadistributionandweightedmultimodalgammadistributionsasmodelsofpopulationabundance.MathematicalBiosciences,68,187–212.Duggan,D.J.,Bittner,M.,Chen,Y.,Meltzer,P.,andTrent,J.M.1999.ExpressionproŽlingusingcDNAmicroarrays.NatureGeneticsSupplement21,10–14.Efron,B.,andMorris,C.1973.Combiningpossiblyrelatedestimationproblems(withdiscussion).JournaloftheRoyalStatisticalSociety,SeriesB35,379–421.Efron,B.,andMorris,C.1977.Stein’sparadoxinstatistics.Sci.Am.236,119–127.Eisen,M.B.,Spellman,P.T.,Brown,P.O.,andBotstein,D.1998.Clusteranalysisanddisplayofgenome-wideexpres-sionpatterns.Proc.Natl.Acad.Sci.USA95,14863–14868.Fisher,R.A.,Corbet,A.S.,andWilliams,C.B.1943.Therelationbetweenthenumberofspeciesandthenumberofindividualsinarandomsampleofananimalpopulation.J.AnimalEcology12,42–58.Johnson,N.L.,Kotz,S.,andBalakrishnan,N.1994.ContinuousUnivariateDistributions,vol.1,2nded.Wiley,NewYork.Kendall,M.G.,andStuart,A.1969.TheAdvancedTheoryofStatistics,vol.1,3rded.Hafner,NewYork.Kerr,M.K.,Martin,M.,andChurchill,G.A.2000.Analysisofvarianceforgeneexpressionmicroarraydata.Manuscript. http://www.jax.org/research/churchill/. Lander,E.S.1999.Arrayofhope.NatureGenet.Suppl.21,3–4.Lipshutz,R.J.,Fodor,S.P.A.,Gingeras,T.R.,andLockhart,D.J.1999.Highdensitysyntheticoligonucleotidearrays.NatureGenet.Suppl.21,20–24. 52NEWTONETAL.Richmond,C.S.,Glasner,J.D.,Mau,R.,Jin,H.,andBlattner,F.R.1999.Genome-wideexpressionproŽlinginEs-cherichiacoliK-12.Nucl.AcidsRes.27(19),3821–3835.StatisticalSciences,1993.S-PLUSGuidetoStatisticalandMathematicalAnalysis,Version3.2.StatSci,adivisionofMathSoft,Inc.,Seattle.Wiens,B.L.1999.Whenlog-normalandgammamodelsgivedifferentresults:Acasestudy.Am.Statistician53,89–93.Addresscorrespondenceto:MichaelA.NewtonDepartmentofStatistics1210W.DaytonStreetUniversityofWisconsinMadison,WI53706-1685E-mail: