1. Introduction

Researchers in many disciplines routinely collect data for high-dimensional, and potentially highly-correlated, predictors. In studies relating genetic polymorphisms to phenotypic outcomes, advances in genotyping technologies have made large dimensional data commonplace, with the number of predictors typically exceeding the number of observations. Inexpensive microarray chips with the ability to genotype even larger numbers of single nucleotide polymorphisms (SNPs) will make this problem much more severe in the near future (Thomas et al., 2005). Often, the effects of the predictors will not be estimable without the incorporation of prior information. The focus of this article is on the use of shrinkage to address this problem.

It has long been known that shrinkage, or regularization, can improve estimation performance, reducing mean square error while introducing bias (Hoerl and Kennard, 1970b). This is true even in low dimensions, though the impact is particularly apparent in higher dimensional models. Shrinkage estimators typically have a Bayesian interpretation, with different estimators corresponding to different priors. Ridge regression (Hoerl and Kennard, 1970a,b) is obtained using independent normal priors centered at zero, with the degree of shrinkage controlled by the prior variance. Replacing the normal prior with a double exponential (Laplace) distribution centered at zero results in the lasso procedure of Tibshirani (1996). The double exponential prior concentrates more of its mass near zero, but also has heavier tails. This favors a sparse structure, with many of the coefficients having values close to zero and few with large values. In addition, maximum a posteriori (MAP) estimates under the Laplace prior can take zero values, allowing variable selection, though posterior means or medians do not exhibit this property (Park and Casella, 2005).

There is a rich recent literature on shrinkage methods for high dimensional predictors. Griffin and Brown (2006) extend the lasso by expressing the double exponential
multiple prior means, with the locations of these means unknown. By placing a Dirichlet process (DP) prior (Ferguson, 1974) on the unknown mean and scale parameters, we induce clustering into a small number of groups with the degree of shrinkage varying across groups. In order to develop an efficient approach for posterior computation, which can be feasibly implemented even for very large numbers of predictors, we rely on a retrospective MCMC algorithm related to that proposed recently by Papaspiliopoulos and Roberts (2007).

In Section 2 we introduce the proposed hierarchical structure. In Section 3 we outline the MCMC algorithm. Section 4 presents simulated data results, while Section 5 implements the model in a commonly-used diabetes data set. Section 6 applies the approach to a data set with $p \gg n$: an analysis of SNP data on early versus late onset multiple myeloma. Section 7 contains a discussion.

2. Semi-Parametric Multiple Shrinkage Priors

2.1 Model and Prior Formulation

Suppose we collect data $(y_i, x_i)$, $i = 1, \ldots, n$, where $x_i$ is a $p \times 1$ vector of predictors (with $p$ possibly much larger than $n$) and $y_i$ is a binary outcome. A standard approach is to estimate the coefficients $\beta = (\beta_1, \ldots, \beta_p)'$ in a regression model (we focus on the most common model in epidemiologic studies, a logistic regression for a binary outcome; however, extensions to other generalized linear models are straightforward):

$$\text{logit}\{\Pr(y_i = 1 \mid x_i)\} = \gamma + x_i'\beta. \qquad (1)$$

For large $p$, maximum likelihood estimates will tend to have high variance and may not be unique. However, to regularize expression 1, we could incorporate a penalty by using a lasso prior $\pi(\beta) = \prod_{j=1}^{p} \text{DE}(\beta_j \mid 0, \lambda)$. Here, DE denotes a double exponential distribution with location parameter 0 and scale parameter $\lambda$. This model is easily implemented in a Bayesian setting by recognizing that the double exponential
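with all its mass at 0. Thus, with probability $\pi$ a coefficient is shrunk toward zero as in standard lasso estimation. With probability $1 - \pi$ the coefficient is shrunk toward a non-zero mean, $\mu_j$. The amount of shrinkage a coefficient exhibits toward its prior mean is determined by $\lambda_j$, with larger values resulting in greater shrinkage. The Gamma distribution is parameterized as $\text{Gamma}(\lambda \mid a, b) = \{b^a \Gamma(a)\}^{-1} \lambda^{a-1} \exp(-\lambda/b)$ and has mean $ab$. We can specify $a_0$ and $b_0$ to give support to larger values of $\lambda_j$ in order to allow strong shrinkage to 0, and $a_1$ and $b_1$ to give support to smaller values of $\lambda_j$ to allow less shrinkage toward non-zero prior locations.

Because the DP prior implies that $D$ is almost surely discrete, the prior will automatically group the $p$ coefficient-specific hyperparameter values, $\{\mu_j, \lambda_j\}_{j=1}^{p}$, into $p^*$ clusters, $\{\mu^*_j, \lambda^*_j\}_{j=1}^{p^*}$, with $p^* \le p$. One of these clusters will most likely correspond to $\mu_j = 0$, while the other clusters will not have zero means and will vary in the precision parameter and hence the degree of shrinkage. The prior on the number of clusters is controlled by $\alpha$, with smaller values favoring fewer clusters. However, the data are strongly informative about the number of clusters and the cluster-specific hyperparameters, so we obtain a procedure that adaptively shrinks coefficients toward non-zero locations to an extent suggested by the available data.

The clustering property of the DP prior in expression 3 can be seen more clearly when expressed in equivalent stick breaking (Sethuraman, 1994) form:

$$
\begin{aligned}
\beta_j &\sim N(\beta_j \mid \mu^*_{k_j}, \tau_j) \\
\tau_j &\sim \text{Exp}(\tau_j \mid 2/\lambda^*_{k_j}) \\
k_j &\sim \sum_{t=1}^{\infty} \pi_t\, \delta_t(k_j) \\
(\mu^*_t, \lambda^*_t) &\sim
\begin{cases}
\delta_0(\mu^*_t)\, \text{Gamma}(\lambda^*_t \mid a_0, b_0) & \text{if } t = 1 \\
N(\mu^*_t \mid c, d)\, \text{Gamma}(\lambda^*_t \mid a_1, b_1) & \text{if } t > 1
\end{cases}
\end{aligned}
\qquad (4)
$$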
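To make the clustering behaviour of the stick breaking form concrete, the following is a minimal sketch that draws coefficients from the prior. It rests on several assumptions that are not part of the paper: the infinite sum is truncated at a fixed number of sticks (the paper instead uses a retrospective sampler that avoids truncation), the stick variables follow the standard prior $V_t \sim \text{Beta}(1, \alpha)$, and the exponential and Gamma distributions are parameterized by their scale. The function name `draw_prior_clusters` and the `truncation` argument are hypothetical.

```python
import numpy as np

def draw_prior_clusters(p, alpha=1.0, a0=30.0, b0=30.0, a1=6.5, b1=6.5,
                        c=0.0, d=4.0, truncation=50, seed=0):
    """Draw p coefficients from a truncated version of the stick breaking prior."""
    rng = np.random.default_rng(seed)

    # Stick breaking weights: pi_t = V_t * prod_{h<t} (1 - V_h), V_t ~ Beta(1, alpha).
    v = rng.beta(1.0, alpha, size=truncation)
    v[-1] = 1.0  # assumption: close the stick so the truncated weights sum to one
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

    # Cluster atoms: bin 1 (index 0) is the null bin with location fixed at zero.
    mu = rng.normal(c, np.sqrt(d), size=truncation)       # d is a variance
    lam = rng.gamma(shape=a1, scale=b1, size=truncation)  # Gamma mean is a1 * b1
    mu[0] = 0.0
    lam[0] = rng.gamma(shape=a0, scale=b0)

    # Allocate each coefficient to a bin, then draw its variance and value.
    k = rng.choice(truncation, size=p, p=pi)
    tau = rng.exponential(scale=2.0 / lam[k])             # Exp with mean 2 / lambda
    beta = rng.normal(mu[k], np.sqrt(tau))
    return k, mu, lam, beta

k, mu, lam, beta = draw_prior_clusters(p=200)
print("occupied bins:", len(np.unique(k)))
```

With $\alpha = 1$ most of the weight falls on the first few sticks, so the 200 coefficients typically occupy only a handful of bins, which is the grouping behaviour described above.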
situations, including those with little prior information. Clustering coefficients for predictors having different scales is unappealing, so we suggest standardizing predictors. However, in most biomedical applications with very many predictors these predictors will tend to be collected on the same scale. For example, indicators of genotypes at different loci intrinsically have the same scale.

To specify $a_0$ and $b_0$, the hyperparameters of the double exponential prior with mean fixed at zero, we assume that any $\beta_j$ falling within some $\delta$ of zero will be viewed as having no meaningful biologic effect. That is, we treat the double exponential prior distribution with mean fixed at zero as a null cluster, and coefficients assigned to that cluster should have values sufficiently close to zero to be treated as a null result. With this in mind, we choose $a_0$ and $b_0$ such that $\int_{-\delta}^{\delta} \text{DE}(\beta_j \mid 0, \lambda_j)\, d\beta_j = z$, where $z$ is the prior probability that a coefficient chosen randomly from the null bin has a null effect. For instance, if $z = 0.95$ and $\delta = 0.1$, then values of $a_0 = b_0 = 30$ ensure that approximately 95% of coefficients drawn from this bin will have an effect that is viewed as indistinguishable from the null.

Values of $a_1$ and $b_1$ need to be specified for the priors on the scale parameter for the non-null bins. We recommend choosing smaller values for these hyperparameters that are large enough to encourage shrinkage but not so large as to overwhelm the data and arbitrarily force a huge number of bins to be generated. We specify default priors for the Gamma distribution as $a_1 = b_1 = 6.5$, so the DE prior has prior credible intervals of unit width. This is a robust choice allowing the data to inform about the amount of shrinkage. We set $\alpha = 1$, a common default in DP models.

We suggest setting the parameters $c$ and $d$, the prior mean and variance for the location parameter portion of the base distribution for the DP, so that the prior locations have support over a wide range of values. Setting $c = 0$, we choose $d = 4$ such that we assign 95% probability to a very wide range of reasonable prior effects. In some
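The calibration of $a_0$ and $b_0$ described above can be checked numerically. The sketch below is a Monte Carlo approximation under one stated assumption about the parameterization: if $\beta_j \mid \tau_j \sim N(0, \tau_j)$ and $\tau_j$ is exponential with mean $2/\lambda$, then $\beta_j$ is double exponential about zero with rate $\sqrt{\lambda}$, so that $\Pr(|\beta_j| < \delta \mid \lambda) = 1 - \exp(-\sqrt{\lambda}\,\delta)$. The helper name `null_mass` is hypothetical.

```python
import numpy as np

def null_mass(a, b, delta, n_draws=200_000, seed=0):
    """Monte Carlo estimate of Pr(|beta| < delta) when lambda ~ Gamma(a, scale=b)."""
    rng = np.random.default_rng(seed)
    lam = rng.gamma(shape=a, scale=b, size=n_draws)  # Gamma with mean a * b
    # Assumed parameterization: given lambda, beta is double exponential about 0
    # with rate sqrt(lambda), so the central mass has a closed form.
    return float(np.mean(1.0 - np.exp(-np.sqrt(lam) * delta)))

z_null = null_mass(30.0, 30.0, delta=0.1)   # null bin, threshold delta = 0.1
z_unit = null_mass(6.5, 6.5, delta=0.5)     # non-null bins, interval of unit width
print(round(z_null, 3), round(z_unit, 3))
```

Under this assumed parameterization both quantities come out close to 0.95, which agrees with the stated choices $a_0 = b_0 = 30$ for $\delta = 0.1$ and $a_1 = b_1 = 6.5$ for credible intervals of unit width.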
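algorithm by assuming the outcome $y_i = 1$ occurs when a latent variable $g_i > 0$. We assume $g_i = x_i'\beta + \epsilon_i/\sqrt{\phi_i}$, where $\epsilon_i \sim N(0, \sigma^2)$ and $\phi_i \sim \text{Gamma}(\nu/2, 2/\nu)$. This scale mixture of normals with $\sigma^2 = \pi^2(\nu - 2)/(3\nu)$ and $\nu = 7.3$ is a near exact representation of the logistic distribution. The data augmentation approach allows this algorithm to be easily modified for other regression models, such as probit or linear.

We propose a Metropolis within Gibbs sampling algorithm that proceeds through the following steps:

1a. Augment the data with $g = (g_1, \ldots, g_n)'$ sampled from $f(g_i \mid y_i, \beta, \phi_i)$:

$$
\begin{aligned}
f(g_i \mid y_i = 1, \beta, \phi_i) &= N_+(g_i \mid x_i'\beta, \sigma^2/\phi_i) \\
f(g_i \mid y_i = 0, \beta, \phi_i) &= N_-(g_i \mid x_i'\beta, \sigma^2/\phi_i)
\end{aligned}
\qquad (5)
$$

where $N_+$ is a normal distribution truncated on the left at 0 (so that $g_i > 0$) and $N_-$ is truncated on the right at 0 (so that $g_i \le 0$).

1b. Update $\phi_i$ by sampling from $f(\phi_i \mid \beta, g_i)$:

$$f(\phi_i \mid \beta, g_i) = \text{Gamma}\left(\phi_i \,\Big|\, \frac{\nu + 1}{2},\ \frac{2}{\nu + (g_i - x_i'\beta)^2/\sigma^2}\right) \qquad (6)$$

2. Use the current estimates of $\mu^* = (\mu^*_{k_1}, \ldots, \mu^*_{k_p})'$ to update the regression coefficients. Assume the intercept, $\gamma$, has prior distribution $N(\gamma \mid \gamma_0, \tau_0)$ with $\gamma_0$ and $\tau_0$ fixed hyperparameters. Let $\mu^*_0 = (\gamma_0, \mu^{*\prime})'$, $\beta_0 = (\gamma, \beta')'$ and $X$ be the $n \times (p+1)$ design matrix with first column equal to 1. Then we update $\beta_0$ by sampling from

$$f(\beta_0 \mid \tau, \phi, \mu^*_0, g) = N(\beta_0 \mid E, V) \qquad (7)$$

where $V = (X'\Sigma^{-1}X + T^{-1})^{-1}$ and $E = V(X'\Sigma^{-1}g + T^{-1}\mu^*_0)$. The matrix $T$ is a $(p+1) \times (p+1)$ diagonal matrix with $j$th element $\tau_{j-1}$, and $\Sigma$ is an $n \times n$ diagonal matrix with $i$th element $\sigma^2/\phi_i$.

bin is given by:

$$f(\lambda^*_1 \mid \tau, K) \propto \text{Gamma}(\lambda^*_1 \mid a_0, b_0) \prod_{j: k_j = 1} \text{Exp}(\tau_j \mid 2/\lambda^*_1) \propto \text{Gamma}\left(\lambda^*_1 \,\Big|\, m_1 + a_0,\ \frac{1}{\sum_{j: k_j = 1} \tau_j/2 + 1/b_0}\right) \qquad (9)$$

For subsequent bins, $t > 1$, the conditional posteriors are given by:

$$f(\mu^*_t \mid \beta, \tau, K) \propto N(\mu^*_t \mid c, d) \prod_{j: k_j = t} N(\beta_j \mid \mu^*_t, \tau_j) \propto N(\mu^*_t \mid \hat{E}_t, \hat{V}_t) \qquad (10)$$

$$f(\lambda^*_t \mid \tau, K) \propto \text{Gamma}\left(\lambda^*_t \,\Big|\, m_t + a_1,\ \frac{1}{\sum_{j: k_j = t} \tau_j/2 + 1/b_1}\right) \qquad (11)$$

where $\hat{V}_t = (1/d + \sum_{j: k_j = t} 1/\tau_j)^{-1}$ and $\hat{E}_t = \hat{V}_t\,(c/d + \sum_{j: k_j = t} \beta_j/\tau_j)$. Potentially, no coefficients will belong to one of the bins and in this case, sampling $(\mu^*_t, \lambda^*_t)$ amounts to sampling from the prior distribution, $N(\mu^*_t \mid c, d)\,\text{Gamma}(\lambda^*_t \mid a_1, b_1)$.

4b. Sample $V_t$ from $f(V_t \mid K, \alpha)$ and generate $\pi_t$ using the stick breaking formula above.

$$f(V_t \mid K, \alpha) = \text{Beta}\left(m_t + 1,\ p - \sum_{l=1}^{t} m_l + \alpha\right), \qquad \pi_t = V_t \prod_{h=1}^{t-1} (1 - V_h) \qquad (12)$$

4c. Update the vector of coefficient configurations, $K$, using a Metropolis step. To generate a proposal configuration, we calculate the posterior probability of the $j$th predictor falling in the $l$th bin as:

$$
q_j(l) \propto
\begin{cases}
\pi_l\, N(\beta_j \mid \mu^*_l, \tau_j)\, \text{Exp}(\tau_j \mid 2/\lambda^*_l) & \text{for } l \le \max(K) \\
\pi_l\, M_j(K) & \text{for } l > \max(K)
\end{cases}
\qquad (13)
$$

Following the recommendation of Papaspiliopoulos and Roberts (2007) we specify $M_j(K) = \max\{N(\beta_j \mid \mu^*_l, \tau_j)\, \text{Exp}(\tau_j \mid 2/\lambda^*_l);\ l \le \max(K)\}$ and the normalizing constant for $q_j$ is $n_j(K) = \sum_{l=1}^{\max(K)} q_j(l) + M_j(K)\,(1 - \sum_{l=1}^{\max(K)} \pi_l)$.

using mean squared error (MSE), bias, false positive and true positive rates. We implemented the multiple shrinkage model using the default specification of $\alpha = 1$, $a_0 = b_0 = 30$, $a_1 = b_1 = 6.5$, $c = 0$, and $d = 4$ for the simulations. Each MCMC algorithm was run for 100,000 iterations with the first 10,000 discarded as burn-in. Output of the algorithms was examined for the first few simulations to determine convergence. The algorithms ran quickly in Matlab v7.5 on a Dell desktop with a 2.99 GHz Xeon chip and 3 GB RAM, taking approximately 5 minutes when $p = 20$ and 45 minutes when $p = 200$.

For comparison we implemented a standard Bayesian lasso through a Gibbs sampler similar to that in Park and Casella (2005). In particular, we assume $\beta_j \sim N(0, \sigma^2\tau_j)$ with $\tau_j \sim \text{Exp}(2/\lambda)$ and $\lambda \sim \text{Gamma}(a, b)$. Hyperparameters $a$ and $b$ in the Bayesian lasso are set equal to $a_1$ and $b_1$ in expression 4.

Results from the first set of simulations indicate that the multiple shrinkage prior offers improvement over the standard Bayesian lasso. The MSE estimated over all simulated data sets is smaller for the multiple shrinkage prior. The reduction in MSE is largely a result of decreased bias. The multiple shrinkage prior model tends to identify the correct coefficient clustering. For instance, prior location and scale parameters are often grouped into one cluster for the first 10 coefficients and a second cluster for the last 10 coefficients. Each of the coefficients is shrunk towards a cluster specific prior mean. The first 10 coefficients are shrunk toward a prior mean that is close to 2, resulting in dramatically decreased bias for $\beta_1, \ldots, \beta_{10}$ in the multiple shrinkage model (MSE = 0.03) compared to the standard lasso model (MSE = 1.08). Additionally, those coefficients ($\beta_{11}, \ldots, \beta_{20}$) whose prior means are clustered into the null bin and assigned a prior mean of zero are shrunk more strongly toward that mean in the multiple shrinkage prior model (MSE = 0.01) than in the Bayesian lasso model (MSE = 0.04), as a result of our prior specification that $a_0 = b_0 > a_1 = b_1$.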
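Two of the conjugate Gibbs updates from the sampler, the $\phi_i$ update in equation (6) and the $\lambda^*_t$ updates in equations (9) and (11), are simple enough to sketch directly. The code below is an illustration rather than the authors' Matlab implementation; the function names `update_phi` and `update_lambda` are hypothetical, and both Gamma draws use the shape and scale parameterization (mean equal to shape times scale) stated in the paper.

```python
import numpy as np

NU = 7.3                                         # degrees of freedom from the paper
SIGMA2 = np.pi**2 * (NU - 2.0) / (3.0 * NU)      # sigma^2 = pi^2 (nu - 2) / (3 nu)

def update_phi(g, xb, rng):
    """Equation (6): draw phi_i given latent g_i and linear predictor x_i'beta."""
    resid2 = (g - xb) ** 2
    return rng.gamma(shape=(NU + 1.0) / 2.0, scale=2.0 / (NU + resid2 / SIGMA2))

def update_lambda(tau_in_bin, a, b, rng):
    """Equations (9)/(11): draw lambda*_t given the tau_j allocated to bin t."""
    m_t = len(tau_in_bin)                        # number of coefficients in the bin
    scale = 1.0 / (np.sum(tau_in_bin) / 2.0 + 1.0 / b)
    return rng.gamma(shape=m_t + a, scale=scale)

rng = np.random.default_rng(1)
phi = update_phi(np.array([0.5, -1.0]), np.array([0.0, 0.0]), rng)
lam_empty = update_lambda(np.array([]), 6.5, 6.5, rng)   # empty bin: a prior draw
```

Two checks follow from the formulas. The marginal variance of the scale mixture, $\sigma^2 \nu/(\nu - 2)$, equals $\pi^2/3$, the variance of the standard logistic distribution. For an empty bin the update reduces to $\text{Gamma}(a, b)$, matching the statement that sampling then amounts to drawing from the prior.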
interactions and standardize all predictors to have mean 0 and variance 1.

We use the prior distribution on $\beta$ as in expression 4 and use the default specification that we recommend in Section 2. The semi-parametric multiple shrinkage model was implemented as in Section 3 for 20,000 iterations. The chain converged rapidly and showed little autocorrelation. We excluded the initial 5,000 iterations as a burn-in. For comparison, we implemented a Bayesian version of the lasso by assuming $\beta_j \sim N(0, \sigma^2\tau_j)$ with $\tau_j \sim \text{Exp}(2/\lambda_j)$ and $\lambda_j \sim \text{Gamma}(a, b)$. A Gibbs sampling algorithm for this model can be found in Park and Casella (2005). We ran this model twice, once with hyperparameters $a = a_0$, $b = b_0$ and once with $a = a_1$, $b = b_1$, where $a_1$ and $b_1$ are the hyperprior terms from the multiple shrinkage prior model. Posterior medians and 90% credible intervals from the three models are shown in Figure 2. The Bayesian lasso model with a very concentrated prior distribution ($a = a_0$, $b = b_0$) shrinks all coefficients strongly toward zero, while the model with a less concentrated prior ($a = a_1$, $b = b_1$) provided less shrinkage. The multiple shrinkage prior model retained the ability to shrink some estimates strongly toward zero while allowing other estimates to be shrunk toward non-zero locations. For instance, the prior location and scale parameters for the effect of the 12th predictor are clustered with those for the 16th predictor 51% of the time. These coefficients were shrunk toward their group specific hyperprior mean rather than 0, as in the standard Bayesian lasso, resulting in larger effects for these parameters.

The Pima Indian data has been widely used for purposes of prediction. We generated posterior predictive distributions for the outcome, $\tilde{y}_i$, of the new observations in the testing data set by integrating over model parameters using the output of the MCMC algorithm. When $\Pr(\tilde{y}_i = 1) > 0.5$ we predict the outcome to be 1, and 0 otherwise. The multiple shrinkage prior approach was compared to the two Bayesian lassos and a support vector machine (SVM), a maximum margin classifier described in
for 600,000 iterations, keeping every 10th sample and discarding the initial 10,000 iterations as a burn-in.

In genetic applications, interest often focuses less on effect estimation than on significance testing. As previously, we assume that $|\beta_j| < \delta$ can be treated as null in substantive terms. We estimate $\rho_j = \sum_{g} I(|\beta_j^{g}| > \delta)/G$, computed over the $G$ retained MCMC draws, as the posterior probability of the $j$th genotype having an effect. Coefficients whose prior location and scale fall into the null bin frequently will typically have a very small posterior probability of having an effect, as shown in Figure 3. The threshold implied by guaranteeing an FDR $\le$ 50% is 0.46, and 72 genotype effects with posterior probability above that threshold are flagged as significant. If an FDR $\le$ 30% is specified, a threshold of 0.65 is chosen and 2 genotypes are flagged as significant. The two genotypes (896198 and 912651) flagged under this criterion are both located on the first chromosome. Previous research has linked chromosomal abnormalities in the 1p arm with poorer prognosis following diagnosis of multiple myeloma (Wu et al., 2007) and both of these SNPs fall in the 1p arm. Both SNPs fall in intergenic regions and do not fall on known genes; however, they may be in high linkage disequilibrium with a SNP in a gene that is related to multiple myeloma. SNP 896198 is found near a number of amylase coding and regulating genes (AMY1A, AMY1B and AMY1C, among others) and some research has suggested hyperamylasaemia among multiple myeloma patients (Hata et al., 2006).

We analyze these same data using two versions of the standard Bayesian lasso. As before, the first Bayesian lasso allows for greater shrinkage ($a = a_0$, $b = b_0$) while the second allows less shrinkage ($a = a_1$, $b = b_1$). MCMC sampling algorithms for each model are run for 20,000 iterations with the initial 5,000 discarded as a burn-in. The Bayesian lasso with $a = a_0$, $b = b_0$ provided such strong shrinkage toward zero that it did not result in any predictors being flagged as warranting further investigation, under FDR $\le$ 30% or FDR $\le$ 50%. The Bayesian lasso with $a = a_1$, $b = b_1$ flagged
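The posterior effect probabilities and the FDR-based cutoff discussed in this section can be sketched as follows. The first function implements the estimate of the posterior probability of an effect as the share of retained MCMC draws with $|\beta_j| > \delta$. The second function is an assumption: the exact rule used to pick the threshold is not shown in this excerpt, so the sketch uses one common Bayesian FDR rule, in which the estimated FDR of a flagged set is the average of one minus the effect probability over that set. The names `effect_probabilities` and `fdr_threshold` are hypothetical.

```python
import numpy as np

def effect_probabilities(beta_draws, delta):
    """beta_draws has shape (G, p); returns the fraction of draws with |beta_j| > delta."""
    return np.mean(np.abs(beta_draws) > delta, axis=0)

def fdr_threshold(probs, target_fdr):
    """Assumed rule: lowest cutoff whose flagged set has estimated FDR <= target.

    The estimated FDR of the top-n set is the mean of (1 - prob) over that set.
    Returns None when even the single most probable effect exceeds the target.
    """
    order = np.sort(probs)[::-1]                             # most probable first
    est_fdr = np.cumsum(1.0 - order) / np.arange(1, len(order) + 1)
    ok = np.nonzero(est_fdr <= target_fdr)[0]
    if ok.size == 0:
        return None
    return float(order[ok[-1]])

draws = np.array([[0.0, 0.5], [0.05, 0.4], [0.2, -0.3], [0.0, 0.6]])
probs = effect_probabilities(draws, delta=0.1)               # 0.25 and 1.0
cutoff = fdr_threshold(np.array([0.9, 0.8, 0.3, 0.1]), target_fdr=0.2)
```

Coefficients with probability at or above the returned cutoff are flagged. Because the running average of the sorted values of one minus the effect probability never decreases, a looser FDR target can only lower the cutoff, in line with the thresholds of 0.65 and 0.46 reported for the 30% and 50% targets.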
References

JH Albert and S Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88:669-679, 1993.

CM Bishop. Pattern Recognition and Machine Learning, chapter 7: Sparse Kernel Machines, pages 325-357. Springer, 2006.

RS Chhikara and L Folks. The Inverse Gaussian Distribution. Marcel Dekker, Inc., 1989.

DB Dahl and MA Newton. Multiple hypothesis testing by clustering treatment effects. Journal of the American Statistical Association, 102:517-526, 2007.

KA Do, P Müller, and F Tang. A Bayesian mixture model for differential gene expression. Journal of the Royal Statistical Society, Series C: Applied Statistics, 54(3):627-644, 2005.

DB Dunson, AH Herring, and SM Mulherin-Engel. Bayesian selection and clustering of polymorphisms in functionally related genes. JASA, to appear, 2007.

MD Escobar and M West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577-588, 1995.

TS Ferguson. Prior distributions on spaces of probability measures. The Annals of Statistics, 2:615-29, 1974.

A Gelman, A Jakulin, MG Pittau, and YS Su. A default prior distribution for logistic and other regression models. Technical report, Columbia University, 2006.

EI George. Minimax multiple shrinkage estimation. The Annals of Statistics, 14:188-205, 1986a.
I Loennstedt and T Britton. Hierarchical Bayes models for cDNA microarray gene expression. Biostatistics, 6(2):279-291, Apr 2005.

RF MacLehose, DB Dunson, AH Herring, and JA Hoppin. Bayesian methods for highly correlated exposure data. Epidemiology, 18(2):199-207, 2007.

P Müller, G Parmigiani, C Robert, and J Rousseau. Optimal sample size for multiple testing: the case of gene expression microarrays. Journal of the American Statistical Association, 99:990-1001, 2004.

SM O'Brien and DB Dunson. Bayesian multivariate logistic regression. Biometrics, 60(3):739-46, 2004.

O Papaspiliopoulos and GO Roberts. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, in press, 2007.

T Park and G Casella. The Bayesian lasso. Technical report, University of Florida, 2005.

LAG Ries, D Melbert, M Krapcho, A Mariotto, BA Miller, EJ Feuer, L Clegg, MJ Horner, N Howlader, MP Eisner, M Reichman, and BK Edwards, editors. SEER Cancer Statistics Review, 1975-2004. National Cancer Institute, Bethesda, MD, 2007.

J Sethuraman. A constructive definition of the Dirichlet process prior. Statistica Sinica, 2:639-650, 1994.

DC Thomas, RW Haile, and D Duggan. Recent developments in genomewide association scans: a workshop summary and review. Am J Hum Genet, 77(3):337-45, 2005.

Figure 1: Multiple shrinkage prior distribution with $\alpha = 1$, $c = 0$ and $d = 10$.

Figure 3: Posterior probability of genotype effects on early onset multiple myeloma.