celsa-spraggs
Uploaded On 2015-10-22

1. Introduction

Researchers in many disciplines routinely collect data for high-dimensional, and potentially highly correlated, predictors. In studies relating genetic polymorphisms to phenotypic outcomes, advances in genotyping technologies have made large dimensional data commonplace, with the number of predictors typically exceeding the number of observations. Inexpensive microarray chips with the ability to genotype even larger numbers of single nucleotide polymorphisms (SNPs) will make this problem much more severe in the near future (Thomas et al., 2005). Often, the effects of the predictors will not be estimable without the incorporation of prior information. The focus of this article is on the use of shrinkage to address this problem.

It has long been known that shrinkage, or regularization, can improve estimation performance, reducing mean square error while introducing bias (Hoerl and Kennard, 1970b). This is true even in low dimensions, though the impact is particularly apparent in higher dimensional models. Shrinkage estimators typically have a Bayesian interpretation, with different estimators corresponding to different priors. Ridge regression (Hoerl and Kennard, 1970a,b) is obtained using independent normal priors centered at zero, with the degree of shrinkage controlled by the prior variance. Replacing the normal prior with a double exponential (Laplace) distribution centered at zero results in the lasso procedure of Tibshirani (1996). The double exponential prior concentrates more of its mass near zero, but also has heavier tails. This favors a sparse structure, with many of the coefficients having values close to zero and few with large values. In addition, maximum a posteriori (MAP) estimates under the Laplace prior can take zero values, allowing variable selection, though posterior means or medians do not exhibit this property (Park and Casella, 2005).

There is a rich recent literature on shrinkage methods for high dimensional predictors. Griffin and Brown (2006) extend the lasso by expressing the double exponential [...]
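The contrast between ridge and lasso shrinkage can be made concrete with a minimal sketch for a single coefficient observed with unit-variance noise, z ~ N(beta, 1). This one-dimensional setting, the function names, and the parameters `prior_var` and `rate` are illustrative assumptions and do not come from the article. Under a N(0, prior_var) prior the posterior mode rescales z toward zero without ever reaching it; under a Laplace prior with density proportional to exp(-rate * |beta|) the posterior mode is the soft-thresholded observation, which is exactly zero whenever |z| <= rate.

```python
def ridge_map(z, prior_var):
    """Posterior mode of beta for z ~ N(beta, 1) and beta ~ N(0, prior_var).

    Minimizes (z - beta)^2 / 2 + beta^2 / (2 * prior_var).
    """
    return z * prior_var / (1.0 + prior_var)


def lasso_map(z, rate):
    """Posterior mode of beta for z ~ N(beta, 1) and a Laplace prior.

    Minimizes (z - beta)^2 / 2 + rate * |beta|, which gives soft thresholding.
    """
    magnitude = max(abs(z) - rate, 0.0)
    return magnitude if z >= 0 else -magnitude


if __name__ == "__main__":
    for z in (0.5, 1.5, 3.0):
        print(z, ridge_map(z, prior_var=1.0), lasso_map(z, rate=1.0))
```

The normal prior shrinks every observation by the same proportion, while the Laplace prior subtracts a fixed amount and sets small observations to zero, which is the sparsity property described above.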
multiple prior means, with the locations of these means unknown. By placing a Dirichlet process (DP) prior (Ferguson, 1974) on the unknown mean and scale parameters, we induce clustering into a small number of groups with the degree of shrinkage varying across groups. In order to develop an efficient approach for posterior computation, which can be feasibly implemented even for very large numbers of predictors, we rely on a retrospective MCMC algorithm related to that proposed recently by Papaspiliopoulos and Roberts (2007).

In Section 2 we introduce the proposed hierarchical structure. In Section 3 we outline the MCMC algorithm. Section 4 presents simulated data results, while Section 5 implements the model in a commonly used diabetes data set. Section 6 applies the approach to a data set with $p \gg n$: an analysis of SNP data on early versus late onset multiple myeloma. Section 7 contains a discussion.

2. Semi-Parametric Multiple Shrinkage Priors

2.1 Model and Prior Formulation

Suppose we collect data $(y_i, x_i)$, $i = 1, \ldots, n$, where $x_i$ is a $p \times 1$ vector of predictors (with $p$ possibly much larger than $n$) and $y_i$ is a binary outcome. A standard approach is to estimate the coefficients $\beta = (\beta_1, \ldots, \beta_p)'$ in a regression model (we focus on the most common model in epidemiologic studies, a logistic regression for a binary outcome; however, extensions to other generalized linear models are straightforward):

$$
\mathrm{Logit}\{\Pr(y_i = 1 \mid x_i)\} = \alpha + x_i'\beta. \tag{1}
$$

For large $p$, maximum likelihood estimates will tend to have high variance and may not be unique. To regularize expression 1, we could incorporate a penalty by using a lasso prior $\pi(\beta) = \prod_{j=1}^{p} \mathrm{DE}(\beta_j \mid 0, \lambda)$. Here, DE denotes a double exponential distribution with location parameter 0 and scale parameter $\lambda$. This model is easily implemented in a Bayesian setting by recognizing that the double exponential [...]
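One standard identity that makes double exponential priors convenient for Gibbs sampling is that a Laplace distribution is a scale mixture of normals: if the variance tau is exponential with rate rate^2 / 2 and beta given tau is N(0, tau), then beta is marginally Laplace with density (rate / 2) * exp(-rate * |beta|). The sketch below checks this by simulation. It is stated independently of the article's own derivation; the function names and the rate^2 / 2 parameterization are assumptions chosen for illustration, and the article's parameterization of the scale may differ.

```python
import math
import random


def sample_laplace_via_mixture(rate, n, seed=0):
    """Draw n values from Laplace(0, rate) as a normal scale mixture.

    tau ~ Exponential(rate = rate^2 / 2) is the variance,
    beta | tau ~ N(0, tau).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        tau = rng.expovariate(rate ** 2 / 2.0)
        draws.append(rng.gauss(0.0, math.sqrt(tau)))
    return draws


def laplace_prob_within(x, rate):
    """P(|beta| < x) for a Laplace(0, rate) variable."""
    return 1.0 - math.exp(-rate * x)


if __name__ == "__main__":
    rate = 2.0
    draws = sample_laplace_via_mixture(rate, 100_000, seed=1)
    mean_abs = sum(abs(b) for b in draws) / len(draws)
    frac = sum(abs(b) < 0.5 for b in draws) / len(draws)
    print("mean |beta|:", mean_abs, "theory:", 1.0 / rate)
    print("P(|beta| < 0.5):", frac, "theory:", laplace_prob_within(0.5, rate))
```

Conditional on the latent variances the prior is normal, so the coefficients keep a conjugate normal update inside a sampler, which is the practical reason for introducing the mixture.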
with all its mass at 0. Thus, with probability $\pi$ a coefficient is shrunk toward zero as in standard lasso estimation. With probability $1 - \pi$ the coefficient is shrunk toward a non-zero mean, $\mu_j$. The amount of shrinkage a coefficient exhibits toward its prior mean is determined by $\lambda_j$, with larger values resulting in greater shrinkage. The Gamma distribution is parameterized as $\mathrm{Gamma}(\lambda \mid a, b) = \frac{1}{b^a \Gamma(a)} \lambda^{a-1} \exp[-\lambda/b]$ and has mean $ab$. We can specify $a_0$ and $b_0$ to give support to larger values of $\lambda_j$ in order to allow strong shrinkage to 0, and $a_1$ and $b_1$ to give support to smaller values of $\lambda_j$ to allow less shrinkage toward non-zero prior locations.

Because the DP prior implies that $D$ is almost surely discrete, the prior will automatically group the $p$ coefficient-specific hyperparameter values, $\{\mu_j, \lambda_j\}_{j=1}^{p}$, into $p^*$ clusters, $\{\mu^*_j, \lambda^*_j\}_{j=1}^{p^*}$, with $p^* \le p$. One of these clusters will most likely correspond to $\mu^*_j = 0$, while the other clusters will not have zero means and will vary in the precision parameter and hence the degree of shrinkage. The prior on the number of clusters is controlled by $\alpha$, with smaller values favoring fewer clusters. However, the data are strongly informative about the number of clusters and the cluster-specific hyperparameters, so we obtain a procedure that adaptively shrinks coefficients toward non-zero locations to an extent suggested by the available data.

The clustering property of the DP prior in expression 3 can be seen more clearly when expressed in the equivalent stick-breaking (Sethuraman, 1994) form:

$$
\begin{aligned}
\beta_j &\sim N(\beta_j \mid \mu^*_{k_j}, \tau_j) \\
\tau_j &\sim \mathrm{Exp}(\tau_j \mid 2/\lambda^*_{k_j}) \\
k_j &\sim \sum_{t=1}^{\infty} \pi_t \, \delta(k_j \mid t) \\
(\mu^*_t, \lambda^*_t) &\sim
\begin{cases}
\delta(\mu^*_t \mid 0)\, \mathrm{Gamma}(\lambda^*_t \mid a_0, b_0) & \text{if } t = 1 \\
N(\mu^*_t \mid c, d)\, \mathrm{Gamma}(\lambda^*_t \mid a_1, b_1) & \text{if } t > 1
\end{cases}
\end{aligned}
\tag{4}
$$

[...]

situations, including those with little prior information. Clustering coefficients for predictors having different scales is unappealing, so we suggest standardizing predictors. However, in most biomedical applications with very many predictors these predictors will tend to be collected on the same scale. For example, indicators of genotypes at different loci intrinsically have the same scale.

To specify $a_0$ and $b_0$, the hyperparameters for the double exponential prior with mean fixed at zero, we assume that any $\beta_j$ falling within some $\epsilon$ of zero will be viewed as having no meaningful biologic effect. That is, we treat the double exponential prior distribution with mean fixed at zero as a null cluster, and coefficients assigned to that cluster should have values sufficiently close to zero to be treated as a null result. With this in mind, we choose $a_0$ and $b_0$ such that $\int_{-\epsilon}^{\epsilon} \mathrm{DE}(\beta_j \mid 0, \lambda_j)\, d\beta_j = z$, where $z$ is the prior probability that a coefficient chosen randomly from the null bin has a null effect. For instance, if $z = 0.95$ and $\epsilon = 0.1$, then values of $a_0 = b_0 = 30$ guarantee that 95% of coefficients drawn from this bin will have an effect that is viewed as indistinguishable from the null.

Values of $a_1$ and $b_1$ need to be specified for the priors on the scale parameter for the non-null bins. We recommend choosing smaller values for these hyperparameters that are large enough to encourage shrinkage but not so large as to overwhelm the data and arbitrarily force a huge number of bins to be generated. We specify default priors for the Gamma distribution as $a_1 = b_1 = 6.5$, so the DE prior has prior credible intervals of unit width. This is a robust choice allowing the data to inform about the amount of shrinkage. We set $\alpha = 1$, a common default in DP models.

We suggest setting the parameters $c$ and $d$, the prior mean and variance for the location parameter portion of the base distribution for the DP, so that the cluster locations $\mu^*_t$ have support over a wide range of values. Setting $c = 0$, we choose $d = 4$ such that we assign 95% probability to a very wide range of reasonable prior effects. In some [...]

[...] algorithm by assuming the outcome $y_i = 1$ occurs when a latent variable $g_i > 0$. We assume $g_i = x_i'\beta + \epsilon_i/\sqrt{\phi_i}$, where $\epsilon_i \sim N(0, \sigma^2)$ and $\phi_i \sim \mathrm{Gamma}(\nu/2, 2/\nu)$. This scale mixture of normals with $\sigma^2 = \pi^2(\nu - 2)/(3\nu)$ and $\nu = 7.3$ is a near exact representation of the logistic distribution. The data augmentation approach allows this algorithm to be easily modified for other regression models, such as probit or linear. We propose a Metropolis within Gibbs sampling algorithm that proceeds through the following steps:

1a. Augment the data with $g = (g_1, \ldots, g_n)'$ sampled from $f(g_i \mid y_i, \beta, \phi_i)$:

$$
\begin{aligned}
f(g_i \mid y_i = 1, \beta, \phi_i) &= N_{+}(g_i \mid x_i'\beta, \sigma^2/\phi_i) \\
f(g_i \mid y_i = 0, \beta, \phi_i) &= N_{-}(g_i \mid x_i'\beta, \sigma^2/\phi_i)
\end{aligned}
\tag{5}
$$

where $N_{+}$ is a normal distribution truncated below at 0 (so that $g_i > 0$) and $N_{-}$ is truncated above at 0 (so that $g_i < 0$).

1b. Update $\phi_i$ by sampling from $f(\phi_i \mid \beta, g_i)$:

$$
f(\phi_i \mid \beta, g_i) = \mathrm{Gamma}\!\left(\phi_i \,\Big|\, \frac{\nu + 1}{2}, \frac{2}{\nu + (g_i - x_i'\beta)^2/\sigma^2}\right) \tag{6}
$$

2. Use the current estimates of $\mu = (\mu^*_{k_1}, \ldots, \mu^*_{k_p})'$ to update the regression coefficients. Assume the intercept $\alpha$ has prior distribution $N(\alpha \mid \alpha_0, \tau_0)$ with $\alpha_0$ and $\tau_0$ fixed hyperparameters. Let $\mu_0 = (\alpha_0, \mu')'$, $\beta_0 = (\alpha, \beta')'$, and let $X$ be the $n \times (p+1)$ design matrix with first column equal to 1. Then we update $\beta_0$ by sampling from

$$
f(\beta_0 \mid \Sigma, T, \mu_0, g) = N(\beta_0 \mid E_\beta, V_\beta) \tag{7}
$$

where $V_\beta = (X'\Sigma^{-1}X + T^{-1})^{-1}$ and $E_\beta = V_\beta (X'\Sigma^{-1}g + T^{-1}\mu_0)$. The matrix $T$ is a $(p+1) \times (p+1)$ diagonal matrix with $j$th element $\tau_{j-1}$, and $\Sigma$ is an $n \times n$ diagonal matrix with $i$th element $\sigma^2/\phi_i$.

[...]

bin is given by:

$$
f(\lambda^*_1 \mid \tau, K) \propto \mathrm{Gamma}(\lambda^*_1 \mid a_0, b_0) \prod_{j: k_j = 1} \mathrm{Exp}(\tau_j \mid 2/\lambda^*_1)
\propto \mathrm{Gamma}\!\left(\lambda^*_1 \,\Big|\, m_1 + a_0, \frac{1}{\sum_{j: k_j = 1} \tau_j/2 + 1/b_0}\right) \tag{9}
$$

For subsequent bins, $t > 1$, the conditional posteriors are given by:

$$
f(\mu^*_t \mid \tau, \beta, K) \propto N(\mu^*_t \mid c, d) \prod_{j: k_j = t} N(\beta_j \mid \mu^*_t, \tau_j) \propto N(\mu^*_t \mid \hat{E}_t, \hat{V}_t) \tag{10}
$$

$$
f(\lambda^*_t \mid \tau, K) \propto \mathrm{Gamma}\!\left(\lambda^*_t \,\Big|\, m_t + a_1, \frac{1}{\sum_{j: k_j = t} \tau_j/2 + 1/b_1}\right) \tag{11}
$$

where $\hat{V}_t = \left(1/d + \sum_{j: k_j = t} 1/\tau_j\right)^{-1}$ and $\hat{E}_t = \hat{V}_t \left(c/d + \sum_{j: k_j = t} \beta_j/\tau_j\right)$. Potentially, no coefficients will belong to one of the bins, and in this case sampling $(\mu^*_t, \lambda^*_t)$ amounts to sampling from the prior distribution, $N(\mu^*_t \mid c, d)\, \mathrm{Gamma}(\lambda^*_t \mid a_1, b_1)$.

4b. Sample $V_t$ from $f(V_t \mid K, \alpha)$ and generate $\pi_t$ using the stick-breaking formula:

$$
f(V_t \mid K, \alpha) = \mathrm{Beta}\!\left(m_t + 1, \; p - \sum_{l=1}^{t} m_l + \alpha\right), \qquad
\pi_t = V_t \prod_{h=1}^{t-1} (1 - V_h) \tag{12}
$$

4c. Update the vector of coefficient configurations, $K$, using a Metropolis step. To generate a proposal configuration, we calculate the posterior probability of the $j$th predictor falling in the $l$th bin as:

$$
q_j(l) \propto
\begin{cases}
\pi_l \, N(\beta_j \mid \mu^*_l, \tau_j)\, \mathrm{Exp}(\tau_j \mid 2/\lambda^*_l) & \text{for } l \le \max(K) \\
\pi_l \, M_j(K) & \text{for } l > \max(K)
\end{cases}
\tag{13}
$$

Following the recommendation of Papaspiliopoulos and Roberts (2007), we specify $M_j(K) = \max_{l \le \max(K)} \{ N(\beta_j \mid \mu^*_l, \tau_j)\, \mathrm{Exp}(\tau_j \mid 2/\lambda^*_l) \}$, and the normalizing constant for $q_j$ is $n_j(K) = \sum_{l=1}^{\max(K)} q_j(l) + M_j(K)\left(1 - \sum_{l=1}^{\max(K)} \pi_l\right)$.

[...]

using mean squared error (MSE), bias, false positive and true positive rates. We implemented the multiple shrinkage model using the default specification of $\alpha = 1$, $a_0 = b_0 = 30$, $a_1 = b_1 = 6.5$, $c = 0$, and $d = 4$ for the simulations. Each MCMC algorithm was run for 100,000 iterations with the first 10,000 discarded as burn-in. Output of the algorithms was examined for the first few simulations to determine convergence. The algorithms ran quickly in Matlab v7.5 on a Dell desktop with a 2.99 GHz Xeon chip and 3 GB RAM, taking approximately 5 minutes when $p = 20$ and 45 minutes when $p = 200$.

For comparison we implemented a standard Bayesian lasso through a Gibbs sampler similar to that in Park and Casella (2005). In particular, we assume $\beta_j \sim N(0, \sigma^2\tau_j)$ with $\tau_j \sim \mathrm{Exp}(2/\lambda)$ and $\lambda \sim \mathrm{Gamma}(a, b)$. Hyperparameters $a$ and $b$ in the Bayesian lasso are set equal to $a_1$ and $b_1$ in expression 4.

Results from the first set of simulations indicate that the multiple shrinkage prior offers improvement over the standard Bayesian lasso. The MSE estimated over all simulated data sets is smaller for the multiple shrinkage prior. The reduction in MSE is largely a result of decreased bias. The multiple shrinkage prior model tends to identify the correct coefficient clustering. For instance, prior location and scale parameters are often grouped into one cluster for the first 10 coefficients and a second cluster for the last 10 coefficients. Each of the coefficients is shrunk toward a cluster-specific prior mean. The first 10 coefficients are shrunk toward a prior mean that is close to 2, resulting in dramatically decreased bias for $\beta_1, \ldots, \beta_{10}$ in the multiple shrinkage model (MSE = 0.03) compared to the standard lasso model (MSE = 1.08). Additionally, those coefficients ($\beta_{11}, \ldots, \beta_{20}$) whose prior means are clustered into the null bin and assigned a prior mean of zero are shrunk more strongly toward that mean in the multiple shrinkage prior model (MSE = 0.01) than in the Bayesian lasso model (MSE = 0.04), as a result of our prior specification that $a_0 = b_0 > a_1 = b_1$.

[...]

interactions and standardize all predictors to have mean 0 and variance 1.

We use the prior distribution on $\beta_p$ as in expression 4 and use the default specification that we recommend in Section 2. The semi-parametric multiple shrinkage model was implemented as in Section 3 for 20,000 iterations. The chain converged rapidly and showed little autocorrelation. We excluded the initial 5,000 iterations as burn-in. For comparison, we implemented a Bayesian version of the lasso by assuming $\beta_j \sim N(0, \sigma^2\tau_j)$ with $\tau_j \sim \mathrm{Exp}(2/\lambda_j)$ and $\lambda_j \sim \mathrm{Gamma}(a, b)$. A Gibbs sampling algorithm for this model can be found in Park and Casella (2005). We ran this model twice, once with hyperparameters $a = a_0$, $b = b_0$ and once with $a = a_1$, $b = b_1$, where these are the hyperparameters from the multiple shrinkage prior model.

Posterior medians and 90% credible intervals from the three models are shown in Figure 2. The Bayesian lasso model with a very concentrated prior distribution ($a = a_0$, $b = b_0$) shrinks all coefficients strongly toward zero, while the model with a less concentrated prior ($a = a_1$, $b = b_1$) provided less shrinkage. The multiple shrinkage prior model retained the ability to shrink some estimates strongly toward zero while allowing other estimates to be shrunk toward non-zero locations. For instance, the prior location and scale parameters for the effect of the 12th predictor are clustered with the prior for the 16th predictor 51% of the time. These coefficients were shrunk toward their group-specific hyperprior mean rather than toward 0 as in the standard Bayesian lasso, resulting in larger effects for these parameters.

The Pima Indian data have been widely used for purposes of prediction. We generated posterior predictive distributions for the outcome, $\tilde{y}_i$, of the new observations in the testing data set by integrating over model parameters using the output of the MCMC algorithm. When $\Pr(\tilde{y}_i = 1) > 0.5$ we predict the outcome to be 1, and 0 otherwise. The multiple shrinkage prior approach was compared to the two Bayesian lassos and a support vector machine (SVM), a maximum margin classifier described in [...]

[...] for 600,000 iterations, keeping every 10th sample and discarding the initial 10,000 iterations as burn-in.

In genetic applications, interest often focuses less on effect estimation than on significance testing. As previously, we assume that $|\beta_j| < \epsilon$ can be treated as null in substantive terms. We estimate the posterior probability of the $j$th genotype having an effect as $\sum_g I(|\beta_j^g| > \epsilon)/G$, the proportion of the $G$ retained samples in which the coefficient falls outside the null region. Coefficients whose prior location and scale fall into the null bin frequently will typically have a very small posterior probability of having an effect, as shown in Figure 3. The threshold chosen to guarantee an FDR of at most 50% is 0.46, and 72 genotype effects with posterior probability above that threshold are flagged as significant. If an FDR of at most 30% is specified, a threshold of 0.65 is chosen and 2 genotypes are flagged as significant. The two genotypes (896198 and 912651) flagged under this criterion are both located on the first chromosome. Previous research has linked chromosomal abnormalities in the 1p arm with poorer prognosis following diagnosis of multiple myeloma (Wu et al., 2007), and both of these SNPs fall in the 1p arm. Both SNPs fall in intergenic regions and do not fall on known genes; however, they may be in high linkage disequilibrium with a SNP in a gene that is related to multiple myeloma. SNP 896198 is found near a number of amylase coding and regulating genes (AMY1A, AMY1B and AMY1C, among others), and some research has suggested hyperamylasaemia among multiple myeloma patients (Hata et al., 2006).

We analyze these same data using two versions of the standard Bayesian lasso. As before, the first Bayesian lasso allows for greater shrinkage ($a = a_0$, $b = b_0$) while the second allows less shrinkage ($a = a_1$, $b = b_1$). MCMC sampling algorithms for each model are run for 20,000 iterations with the initial 5,000 discarded as burn-in. The Bayesian lasso with $a = a_0$, $b = b_0$ provided such strong shrinkage toward zero that it did not result in any predictors being flagged as warranting further investigation, under an FDR of at most 30% or 50%. The Bayesian lasso with $a = a_1$, $b = b_1$ flagged [...]

References

J H Albert and S Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88:669-679, 1993.

C M Bishop. Pattern Recognition and Machine Learning, chapter 7: Sparse Kernel Machines, pages 325-357. Springer, 2006.

R S Chhikara and L Folks. The Inverse Gaussian Distribution. Marcel Dekker, Inc., 1989.

D B Dahl and M A Newton. Multiple hypothesis testing by clustering treatment effects. Journal of the American Statistical Association, 102:517-526, 2007.

K A Do, P Muller, and F Tang. A Bayesian mixture model for differential gene expression. Journal of the Royal Statistical Society, Series C: Applied Statistics, 54(3):627-644, 2005.

D B Dunson, A H Herring, and S M Mulherin-Engel. Bayesian selection and clustering of polymorphisms in functionally related genes. Journal of the American Statistical Association, to appear, 2007.

M D Escobar and M West. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90:577-588, 1995.

T S Ferguson. Prior distributions on spaces of probability measures. The Annals of Statistics, 2:615-29, 1974.

A Gelman, A Jakulin, M G Pittau, and Y S Su. A default prior distribution for logistic and other regression models. Technical report, Columbia University, 2006.

E I George. Minimax multiple shrinkage estimation. The Annals of Statistics, 14:188-205, 1986a.

[...]

I Loennstedt and T Britton. Hierarchical Bayes models for cDNA microarray gene expression. Biostatistics, 6(2):279-291, Apr 2005.

R F MacLehose, D B Dunson, A H Herring, and J A Hoppin. Bayesian methods for highly correlated exposure data. Epidemiology, 18(2):199-207, 2007.

P Muller, G Parmigiani, C Robert, and J Rousseau. Optimal sample size for multiple testing: the case of gene expression microarrays. Journal of the American Statistical Association, 99:990-1001, 2004.

S M O'Brien and D B Dunson. Bayesian multivariate logistic regression. Biometrics, 60(3):739-46, 2004.

O Papaspiliopoulos and G O Roberts. Retrospective Markov chain Monte Carlo methods for Dirichlet process hierarchical models. Biometrika, in press, 2007.

T Park and G Casella. The Bayesian lasso. Technical report, University of Florida, 2005.

L A G Ries, D Melbert, M Kraphco, A Mariotto, B A Miller, E J Feuer, L Clegg, M J Horner, N Howlader, M P Eisner, M Reichman, and B K Edwards, editors. SEER Cancer Statistics Review, 1975-2004. National Cancer Institute, Bethesda, MD, 2007.

J Sethuraman. A constructive definition of the Dirichlet process prior. Statistica Sinica, 2:639-650, 1994.

D C Thomas, R W Haile, and D Duggan. Recent developments in genome-wide association scans: a workshop summary and review. Am J Hum Genet, 77(3):337-45, 2005.

[...]

Figure 1: Multiple shrinkage prior distribution with $\alpha = 1$, $c = 0$ and $d = 10$.

Figure 3: Posterior probability of genotype effects on early onset multiple myeloma.
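As a companion to the stick-breaking form in expression 4, the following minimal sketch simulates cluster weights, cluster parameters, and coefficients from the prior. Several details are assumptions made for illustration and are not taken from the article: the sticks are drawn from the standard Dirichlet process prior Beta(1, alpha), the infinite sum is truncated at a fixed number of bins, Exp(tau | 2/lambda) is read as an exponential with mean 2/lambda (an interpretation suggested by the conjugate Gamma update in expression 9), and all function and argument names are hypothetical. The default hyperparameter values are the ones quoted in the text.

```python
import math
import random


def draw_prior_clusters(alpha=1.0, a0=30.0, b0=30.0, a1=6.5, b1=6.5,
                        c=0.0, d=4.0, truncation=50, seed=0):
    """Draw truncated stick-breaking weights and cluster atoms (mu, lambda)."""
    rng = random.Random(seed)
    weights, atoms = [], []
    remaining = 1.0  # length of stick not yet broken off
    for t in range(1, truncation + 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
        if t == 1:
            # Null bin: location fixed at zero, strong shrinkage.
            mu = 0.0
            lam = rng.gammavariate(a0, b0)  # shape a0, scale b0, mean a0 * b0
        else:
            # Non-null bins: random location with variance d.
            mu = rng.gauss(c, math.sqrt(d))
            lam = rng.gammavariate(a1, b1)
        atoms.append((mu, lam))
    return weights, atoms


def draw_coefficient(weights, atoms, rng):
    """Draw a bin label k and a coefficient beta given weights and atoms."""
    u = rng.random()
    cumulative = 0.0
    k = len(weights) - 1  # fall back to the last bin for truncation leftovers
    for index, w in enumerate(weights):
        cumulative += w
        if u < cumulative:
            k = index
            break
    mu, lam = atoms[k]
    tau = rng.expovariate(lam / 2.0)  # exponential with mean 2 / lam
    beta = rng.gauss(mu, math.sqrt(tau))
    return k + 1, beta


if __name__ == "__main__":
    weights, atoms = draw_prior_clusters(seed=3)
    rng = random.Random(7)
    for _ in range(5):
        print(draw_coefficient(weights, atoms, rng))
```

With the default alpha = 1 most of the prior mass falls on the first few bins, so simulated coefficients tend to share a small number of locations, which is the clustering behavior the text describes.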
