mpisbmpgde alexa Contents 1 Introduction 2 Instalation 3 Quick start guide 31 Data preparation 3 32 Performing the enrichment tests 4 33 Analysis of resul ID: 83632
Download Pdf The PPT/PDF document "Gene set enrichment analysis with topGO ..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
GenesetenrichmentanalysiswithtopGOAdrianAlexa,JorgRahnenfuhrerOctober30,2020http://www.mpi-sb.mpg.de/alexaContents1Introduction22Instalation23Quickstartguide33.1Datapreparation...........................................33.2Performingtheenrichmenttests..................................43.3Analysisofresults..........................................54Loadinggenesandannotationsdata84.1Gettingstarted............................................84.2ThetopGOdataobject........................................84.3Customannotations.........................................94.4Predenedlistofinterestinggenes.................................104.5Usingthegenesscore.........................................114.6FilteringandmissingGOannotations...............................125WorkingwiththetopGOdataobject146Runningtheenrichmenttests166.1Deningandrunningthetest....................................176.2Theadjustmentofp-values.....................................196.3Addinganewtest..........................................196.4runTest:ahigh-levelinterfacefortesting.............................197Interpretationandvisualizationofresults207.1ThetopGOresultobject.......................................207.2Summarisingtheresults.......................................217.3AnalysingindividualGOs......................................227.4VisualisingtheGOstructure....................................238SessionInformation26References261 1IntroductionThetopGOpackageisdesignedtofacilitatesemi-automatedenrichmentanalysisforGeneOntology(GO)terms.Theprocessconsistsofinputofnormalisedgeneexpressionmeasurements,gene-wisecorrelationordierentialexpressionanalysis,enrichmentanalysisofGOterms,interpretationandvisualisationoftheresults.OneofthemainadvantagesoftopGOistheuniedgenesettestingframeworkitoers.BesidesprovidinganeasytousesetoffunctionsforperformingGOenrichmentanalysis,italsoenablestheusertoeasilyimplementnewstatisticaltestsornewalgorithmsthatdealwiththeGOgraphstructure.ThisuniedframeworkalsofacilitatesthecomparisonbetweendierentGOenrichmentmethodologies.ThereareanumberofteststatisticsandalgorithmsdealingwiththeGOgraphstructuredreadytouseintopGO.Table1presentsthecompatibilitytablebetweentheteststatisticsandGOgraphmethods. sherkstglobaltestsum classic elim weight weight01 lea parentchild Table1:AlgorithmscurrentlysupportedbytopGO.TheelimandweightalgorithmswereintroducedinAlexaetal.(2006).ThedefaultalgorithmusedbythetopGOpackageisamixturebetweentheelimandtheweightalgorithmsanditwillbereferredasweight01.TheparentChildalgorithmwasintroducedbyGrossmannetal.(2007).WeassumetheuserhasagoodunderstandingofGO,seeConsortium(2001),andisfamiliarwithgenesetenrichmenttests.AlsothisdocumentrequiresbasicknowledgeofRlanguage.ThenextsectionpresentsaquicktourintotopGOandisthoughttobeindependentoftherestofthismanuscript.TheremainingsectionsprovidedetailsonthefunctionsusedinthesamplesectionaswellasshowingmoreadvancefunctionalityimplementedinthetopGOpackage.2InstalationThissectionbrie ydescribethenecessarytogettopGOrunningonyoursystem.WeassumethattheuserhavetheRprogram(seetheRprojectathttp://www.r-project.org)alreadyinstalledanditsfamiliarwithit.YouwillneedtohaveR2.10.0orlatertobeabletoinstallandruntopGO.ThetopGOpackageisavailablefromtheBioconductorrepositoryathttp://www.bioconductor.orgTobeabletoinstallthepackageoneneedsrsttoinstallthecoreBioconductorpackages.IfyouhavealreadyinstalledBioconductorpackagesonyoursystemthenyoucanskipthetwolinesbelow.if(!requireNamespace("BiocManager",quietly=TRUE))+install.packages("BiocManager")BiocManager::install()OncethecoreBioconductorpackagesareinstalled,wecaninstallthetopGOpackagebyif(!requireNamespace("BiocManager",quietly=TRUE))+install.packages("BiocManager")BiocManager::install("topGO") 3QuickstartguideThissectiondescribesasimpleworkingsessionusingtopGO.Thereareonlyahandfulofcommandsnecessarytoperformagenesetenrichmentanalysiswhichwillbebrie ypresentedbelow.Atypicalsessioncanbedividedintothreesteps:1.Datapreparation:Listofgenesidentiers,genescores,listofdierentiallyexpressedgenesoracriteriaforselectinggenesbasedontheirscores,aswellasgene-to-GOannotationsareallcollectedandstoredinasingleRobject.2.Runningtheenrichmenttests:UsingtheobjectcreatedintherststeptheusercanperformenrichmentanalysisusinganyfeasiblemixtureofstatisticaltestsandmethodsthatdealwiththeGOtopology.3.Analysisoftheresults:Theresultsobtainedinthesecondstepareanalysedusingsummaryfunctionsandvisualisationtools.Beforegoingthrougheachofthosestepstheuserneedstodecidewhichbiologicalquestionhewouldliketoinvestigate.Theaimofthestudy,aswellasthenatureoftheavailabledata,willdictatewhichteststatistic/methodsneedtoused.InthissectionwewilltesttheenrichmentofGOtermswithdierentiallyexpressedgenesusingtwostatisticaltests,namelyKolmogorov-SmirnovtestandFisher'sexacttest.3.1DatapreparationIntherststepaconvenientRobjectofclasstopGOdataiscreatedcontainingalltheinformationrequiredfortheremainingtwosteps.Theuserneedstoprovidethegeneuniverse,GOannotationsandeitheracriteriaforselectinginterestinggenes(e.g.dierentiallyexpressedgenes)fromthegeneuniverseorascoreassociatedwitheachgene.InthissessionwewilltesttheenrichmentofGOtermswithdierentiallyexpressedgenes.Thus,thestartingpointisalistofgenesandtherespectivep-valuesfordierentialexpression.Atoyexampleofalistofgenep-valuesisprovidedbythegeneListobject.library(topGO)library(ALL)data(ALL)data(geneList)ThegeneListdataisbasedonadierentialexpressionanalysisoftheALL(AcuteLymphoblasticLeukemia)datasetthatwasextensivelystudiedintheliteratureonmicroarrayanalysisChiaretti,S.,etal.(2004).Ourtoyexamplecontainsjustasmallamount,323,ofgenesandtheircorrespondingp-values.Thenextdataoneneedsarethegenegroupsitself,theGOtermsinourcase,andthemappingthatassociateeachgenewithoneormoreGOterm(s).TheinformationonwheretondtheGOannotationsisstoredintheALLobjectanditiseasilyaccessible.affyLib-paste(annotation(ALL),"db",sep=".")ကlibrary(package=affyLib,character.only=TRUE)Themicroarrayusedintheexperimentisthehgu95av2fromAymetrix,aswecanseefromtheaffyLibobject.WhenweloadedthegeneListobjectaselectionfunctionusedfordeningthelistofdierentiallyexpressedgenesisalsoloadedunderthenameoftopDiffGenes.Thefunctionassumesthattheprovidedargumentisanamedvectorofp-values.Withthehelpofthisfunctionwecanseethatthereare50geneswitharawp-valuelessthan0:01outofatotalof323genes.ကsum(topDiffGenes(geneList))[1]50 WenowhavealldatanecessarytobuildanobjectoftypetopGOdata.Thisobjectwillcontainallgeneidentiersandtheirscores,theGOannotations,theGOhierarchicalstructureandallotherinformationneededtoperformthedesiredenrichmentanalysis.sampleGOdata-new("topGOdata",+description="Simplesession",ontology="BP",+allGenes=geneList,geneSel=topDiffGenes,+nodeSize=10,+annot=annFUN.db,affyLib=affyLib)ThenamesoftheargumentsusedforbuildingthetopGOdataobjectshouldbeself-explanatory.WequicklymentionthatnodeSize=10isusedtoprunetheGOhierarchyfromthetermswhichhavelessthan10annotatedgenesandthatannFUN.dbfunctionisusedtoextractthegene-to-GOmappingsfromtheaffyLibobject.Section4.2describestheparametersusedtobuildthetopGOdataindetails.AsummaryofthesampleGOdataobjectcanbeseenbytypingtheobjectnameattheRprompt.Havingallthedatastoredintothisobjectfacilitatestheaccesstoidentiers,annotationsandtobasicdatastatistics.ကsampleGOdata3.2PerformingtheenrichmenttestsOncewehaveanobjectofclasstopGOdatawecanstartwiththeenrichmentanalysis.Wewillusetwotypesofteststatistics:Fisher'sexacttestwhichisbasedongenecounts,andaKolmogorov-Smirnovliketestwhichcomputesenrichmentbasedongenescores.Wecanuseboththesetestssinceeachgenehasascore(representinghowdierentiallyexpressedageneis)andbythemeansoftopDiffGenesfunctionsthegenesarecategorizedintodierentiallyexpressedornotdierentiallyexpressedgenes.AllthesearestoredintosampleGOdataobject.ThefunctionrunTestisusedtoapplythespeciedteststatisticandmethodtothedata.Ithasthreemainarguments.TherstargumentneedstobeanobjectofclasstopGOdata.Thesecondandthirdargumentareoftypecharacter;theyspecifythemethodfordealingwiththeGOgraphstructureandtheteststatistic,respectively.First,weperformaclassicalenrichmentanalysisbytestingtheover-representationofGOtermswithinthegroupofdierentiallyexpressedgenes.ForthemethodclassiceachGOcategoryistestedindependently.ကresultFisher-runTest(sampleGOdata,algorithm="classic",statistic="fisher")runTestreturnsanobjectofclasstopGOresult.Ashortsummaryofthisobjectisshownbelow.ကresultFisherDescription:SimplesessionOntology:BP'classic'algorithmwiththe'fisher'test1093GOtermsscored:54termswithp0.01Annotationdata:Annotatedgenes:310Significantgenes:46Min.no.ofgenesannotatedtoaGO:10Nontrivialnodes:998NextwewilltesttheenrichmentusingtheKolmogorov-Smirnovtest.Wewillusetheboththeclassicandtheelimmethod.-523;resultKS-runTest(sampleGOdata,algorithm="classic",statistic="ks")ကresultKS.elim-runTest(sampleGOdata,algorithm="elim",statistic="ks") GO.IDTermAnnotatedSignicantExpectedRankinclassicFisherclassicFisherclassicKSelimKS 1GO:0051301celldivision1461621.669770.976577.8e-082.4e-072GO:0031668cellularresponsetoextracellularstimu...1181.6311.6e-054.0e-054.0e-053GO:0090068positiveregulationofcellcycleproces...3955.796510.722890.000180.000184GO:0010389regulationofG2/Mtransitionofmitotic...3074.452860.135350.000190.000195GO:0140014mitoticnucleardivision94713.959910.996800.001180.001186GO:0051726regulationofcellcycle1321819.596580.748808.4e-050.001527GO:0045931positiveregulationofmitoticcellcycl...3745.497990.836610.001710.001718GO:0050851antigenreceptor-mediatedsignalingpath...1171.6370.000210.002080.002089GO:1900221regulationofamyloid-betaclearance1051.48430.008270.002870.0028710GO:0031667responsetonutrientlevels1371.93200.000880.002920.00292 Table2:SignicanceofGOtermsaccordingtoclassicandelimmethods.Pleasenotethatnotallstatisticaltestsworkwitheverymethod.ThecompatibilitymatrixbetweenthemethodsandstatisticaltestsisshowninTable1.Thep-valuescomputedbytherunTestfunctionareunadjustedformultipletesting.Wedonotadvocateagainstadjustingthep-valuesofthetestedgroups,howeverinmanycasesadjustedp-valuesmightbemisleading.3.3AnalysisofresultsAftertheenrichmenttestsareperformedtheresearcherneedstoolsforanalysingandinterpretingtheresults.GenTableisaneasytousefunctionforanalysingthemostsignicantGOtermsandthecorrespondingp-values.Inthefollowingexample,welistthetop10signicantGOtermsidentiedbytheelimmethod.Atthesametimewealsocomparetheranksandthep-valuesoftheseGOtermswiththeonesobtainedbytheclassicmethod.allRes-GenTable(sampleGOdata,classicFisher=resultFisher,+classicKS=resultKS,elimKS=resultKS.elim,+orderBy="elimKS",ranksOf="classicFisher",topNodes=10)TheGenTablefunctionreturnsadataframecontainingthetoptopNodesGOtermsidentiedbytheelimalgorithm,seeorderByargument.ThedataframeincludessomestatisticsontheGOtermsandthep-valuescorrespondingtoeachofthetopGOresultobjectspeciedasarguments.Table2showstheresults.ForaccessingtheGOterm'sp-valuesfromatopGOresultobjecttheusershouldusethescorefunctions.Asasimpleexample,welookatthedierencesbetweentheresultsoftheclassicandtheelimmethodsinthecaseoftheKolmogorov-Smirnovtest.Theelimmethodwasdesigntobemoreconservativethentheclassicmethodandthereforeoneexpectsthep-valuesreturnedbytheformermethodarelowerboundedbythep-valuesreturnedbythelatermethod.Theeasiestwaytovisualizethispropertyistoscatterplotthetwosetsofp-valuesagainsteachother.ကpValue.classic-score(resultKS)ကpValue.elim-score(resultKS.elim)[names(pValue.classic)]ကgstat-termStat(sampleGOdata,names(pValue.classic))ကgSize-gstat$Annotated/max(gstat$Annotated)*4ကgCol-colMap(gstat$Significant)ကplot(pValue.classic,pValue.elim,xlab="p-valueclassic",ylab="p-valueelim",+pch=19,cex=gSize,col=gCol)WecanseeinFigure1thatthereareindeeddierencesbetweenthetwomethods.SomeGOtermsfoundsignicantbytheclassicmethodarelesssignicantintheelim,asexpected.However,insomecases,wecanvisibleidentifyafewGOtermsforwhichtheelimp-valueislessconservativethentheclassicp-value.IfsuchGOtermsexist,wecanidentifythemandndthenumberofannotatedgenes:ကsel.go-names(pValue.classic)[pValue.elimpValue.classic]-523;cbind(termStat(sampleGOdata,sel.go),+elim=pValue.elim[sel.go],+classic=pValue.classic[sel.go]) Figure1:p-valuesscatterplotfortheclassic(xaxis)andelim(yaxis)methods.Ontherightpanelthep-valuesareplottedonalinearscale.Theleftpannedplotsthesamep-valuesonalogarithmicscale.ThesizeofthedotisproportionalwiththenumberofannotatedgenesfortherespectiveGOtermanditscoloringrepresentsthenumberofsignicantlydierentiallyexpressedgenes,withthedarkredpointshavingmoregenesthentheyellowones.AnnotatedSignificantExpectedelimclassicGO:00081522534037.540.372688250.43290774GO:00090581753025.970.123740370.20607285GO:00090591212417.950.096318260.22106711GO:00098891051815.580.270043550.33891272GO:00192221712725.370.363613610.48920146GO:00313231622524.040.383081420.43204977GO:00346451172117.360.205798100.31818382GO:00442372463936.500.318906560.38029888GO:00442491712725.370.252852570.37807808GO:0050793841912.460.039153930.05828868GO:0051094691610.240.055902690.06214595GO:00717042524037.390.364694360.42515582GO:19015761753025.970.123740370.20607285Itisquiteinterestingthatsuchcasesappear-theabovetablewillnotbeempty.TheseGOtermsarerathergeneral(havingmanyannotatedgenes)andtheirp-valuesarenotsignicantatthe0:05level.ThereforetheseGOtermswouldnotappearinthelistoftopsignicantterms.MoresignicantGOtermsarelesslikelytobein uencedbythisnonmonotonicbehavior.AnotherinsightfulwayoflookingattheresultsoftheanalysisistoinvestigatehowthesignicantGOtermsaredistributedovertheGOgraph.Figure2showsthethesubgraphinducedbythe5mostsignicantGOtermsasidentiedbytheelimalgorithm.Signicantnodesarerepresentedasrectangles.Theplottedgraphistheupperinducedgraphgeneratedbythesesignicantnodes.showSigOfNodes(sampleGOdata,score(resultKS.elim),firstSigNodes=5,useInfo='all') Figure2:Thesubgraphinducedbythetop5GOtermsidentiedbytheelimalgorithmforscoringGOtermsforenrichment.Rectanglesindicatethe5mostsignicantterms.Rectanglecolorrepresentstherelativesignicance,rangingfromdarkred(mostsignicant)tobrightyellow(leastsignicant).Foreachnode,somebasicinformationisdisplayed.ThersttwolinesshowtheGOidentierandatrimmedGOname.Inthethirdlinetherawp-valueisshown.TheforthlineisshowingthenumberofsignicantgenesandthetotalnumberofgenesannotatedtotherespectiveGOterm. 4Loadinggenesandannotationsdata4.1GettingstartedTodemonstratethepackagefunctionalitywewillusetheALL(AcuteLymphoblasticLeukemia)geneexpres-siondatafromChiaretti,S.,etal.(2004).Thedatasetconsistsof128microarraysfromdierentpatientswithALLmeasuredusingtheHGU95aV2Aymetrixchip.Additionally,customannotationsandarticialdatasetswillbeusedtodemonstratespecicfeatures.Werstloadtherequiredlibrariesanddata:library(topGO)library(ALL)data(ALL)WhenthetopGOpackageisloadedthreeenvironmentsGOBPTerm,GOMFTermandGOCCTermarecreatedandboundtothepackageenvironment.TheseenvironmentsarebuildbasedontheGOTERMenvironmentfrompackageGO.db.Theyareusedforfastrecoveringoftheinformationspecictoeachofthethreeontologies:BP,MFandCC.InordertoaccessallGOgroupsthatbelongtoaspecicontology,e.g.BiologicalProcess(BP),onecantype:BPterms-ls(GOBPTerm)ကhead(BPterms)[1]"GO:0000001""GO:0000002""GO:0000003""GO:0000011""GO:0000012""GO:0000017"Usuallyoneneedstoremoveprobes/geneswithlowexpressionvalueaswellasprobeswithverysmallvariabilityacrosssamples.Packagegenefilterprovidestoolsforlteringgenes.Inthisanalysiswechoosetolterasmanygenesaspossibleforcomputationalreasons;workingwithasmallergeneuniverseallowsustoexemplifymoreofthefunctionalitiesimplementedinthetopGOpackageandatthesametimeallowsthisdocumenttobecompiledinarelativelyshorttime.TheeectofgenelteringisdiscussedinmoredetailsinSection4.6.ကlibrary(genefilter)ကselProbes-genefilter(ALL,filterfun(pOverA(0.20,log2(100)),function(x)(IQR(x)က0.25)))ကeset-ALL[selProbes,]Thelterselectsonly4101probesetsoutof12625probesetsavailableonthehgu95av2array.ThegeneuniverseandthesetofinterestinggenesThesetofallgenesfromthearray/studywillbereferredfromnowonasthegeneuniverse.Havingthegeneuniverse,theusercandenealistofinterestinggenesortocomputegene-wisescoresthatquantifythesignicanceofeachgene.Whengene-wisescoresareavailablethelistofinterestinggenesisdenedtobethesetofgenewithasignicantscore.ThetopGOpackagedealswiththesetwocasesinauniedwayoncethemaindatacontainer,thetopGOdataobject,isconstructed.Theonlytimetheuserneedstodistinguishbetweenthesetwocasesisduringtheconstructionofthedatacontainer.Usually,thegeneuniverseisdenedasallfeasiblegenesmeasuredbythemicroarray.InthecaseoftheALLdatasetwehave4101feasiblegenes,theonesthatwerenotremovedbythelteringprocedure.4.2ThetopGOdataobjectThecentralstepinusingthetopGOpackageistocreateatopGOdataobject.ThisobjectwillcontainallinformationnecessaryfortheGOanalysis,namelythelistofgenes,thelistofinterestinggenes,thegenescores(ifavailable)andthepartoftheGOontology(theGOgraph)whichneedstobeusedintheanalysis.ThetopGOdataobjectwillbetheinputofthetestingprocedures,theevaluationandvisualisationfunctions.Tobuildsuchanobjecttheuserneedsthefollowing: Alistofgeneidentiersandoptionallythegene-wisescores.Thescorecanbethet-teststatistic(orthep-value)fordierentialexpression,correlationwithaphenotype,oranyotherrelevantscore.AmappingbetweengeneidentiersandGOterms.InmostcasesthismappingisdirectlyavailableinBioconductorasamicroarrayspecicannotationpackage.Inthiscasetheuserjustneedstospecifythenameoftheannotationtobeused.Forexample,theannotationpackageneededfortheALLdatasetishgu95av2.db.Ofcourse,Bioconductordoesnotincludeup-to-dateannotationpackagesforallplatforms.UserswhoworkwithcustomarraysorwishtouseaspecicmappingbetweengenesandGOterms,havethepossibilitytoloadcustomannotations.ThisisdescribedinSection4.3.TheGOhierarchicalstructure.ThisstructureisobtainedfromtheGO.dbpackage.AtthemomenttopGOsupportsonlytheontologydenitionprovidedbyGO.db.Wefurtherdescribetheargumentsoftheinitializefunction(new)usedtoconstructaninstanceofthisdatacontainerobject.ontology:characterstringspecifyingtheontologyofinterest(BP,MForCC)description:characterstringcontainingashortdescriptionofthestudy[optional].allGenes:namedvectoroftypenumericorfactor.Thenamesattributecontainsthegenesidentiers.Thegeneslistedinthisobjectdenethegeneuniverse.geneSelectionFun:functiontospecifywhichgenesareinterestingbasedonthegenescores.ItshouldbepresentitheallGenesobjectisoftypenumeric.nodeSize:anintegerlargerorequalto1.ThisparameterisusedtoprunetheGOhierarchyfromthetermswhichhavelessthannodeSizeannotatedgenes(afterthetruepathruleisapplied).annotationFun:functionwhichmapsgenesidentierstoGOterms.Thereareacoupleofannotationfunctionincludedinthepackagetryingtoaddresstheuser'sneeds.Theannotationfunctionstakethreearguments.Oneofthoseargumentsisspecifyingwherethemappingscanbefound,andneedstobeprovidedbytheuser.Herewegiveashortdescriptionofeach:annFUN.dbthisfunctionisintendedtobeusedaslongasthechipusedbytheuserhasanannotationpackageavailableinBioconductor.annFUN.orgthisfunctionisusingthemappingsfromthe"org.XX.XX"annotationpackages.Currently,thefunctionsupportsthefollowinggeneidentiers:Entrez,GenBank,Alias,Ensembl,GeneSymbol,GeneNameandUniGene.annFUN.gene2GOthisfunctionisusedwhentheannotationsareprovidedasagene-to-GOsmapping.annFUN.GO2genethisfunctionisusedwhentheannotationsareprovidedasaGO-to-genesmapping.annFUN.filethisfunctionwillreadtheannotationsofthetypegene2GOorGO2genesfromatextle....:listofargumentstobepassedtotheannotationFun4.3CustomannotationsThissectiondescribeshowcustomGOannotationscanbeusedforbuildingatopGOdataobject.Annotationsneedtobeprovidedeitherasgene-to-GOsorasGO-to-genesmappings.Anexampleofsuchmappingcanbefoundinthe"topGO/examples"directory.Thele"geneid2go.map"containsgene-to-GOsmappings.ForeachgeneidentierarelistedtheGOtermstowhichthisgeneisspecicallyannotated.WeusethereadMappingsfunctiontoparsethisle.geneID2GO-readMappings(file=system.file("examples/geneid2go.map",package="topGO"))ကstr(head(geneID2GO)) Listof6$068724:chr[1:5]"GO:0005488""GO:0003774""GO:0001539""GO:0006935"...$119608:chr[1:6]"GO:0005634""GO:0030528""GO:0006355""GO:0045449"...$049239:chr[1:13]"GO:0016787""GO:0017057""GO:0005975""GO:0005783"...$067829:chr[1:16]"GO:0045926""GO:0016616""GO:0000287""GO:0030145"...$106331:chr[1:10]"GO:0043565""GO:0000122""GO:0003700""GO:0005634"...$214717:chr[1:7]"GO:0004803""GO:0005634""GO:0008270""GO:0003677"...TheobjectreturnedbyreadMappingsisanamedlistofcharactervectors.Thelistnamesgivethegenesidentiers.EachelementofthelistisacharactervectorandcontainstheGOidentiersannotatedtothespecicgene.ItissucientforthemappingtocontainonlythemostspecicGOannotations.However,topGOcanalsotakeasaninputlesinwhichallorsomeancestorsofthemostspecicGOannotationsareincluded.Thisredundancyisnotmakingforafasterrunningtimeandifpossibleitshouldbeavoided.TheusercanreadtheannotationsfromtextlesortheycanbuildanobjectsuchasgeneID2GOdirectlyintoR.ThetextleformatrequiredbythereadMappingsfunctionisverysimple.Itconsistsofonelineforeachgenewiththefollowingsyntax:gene_IDTABကGO_ID1,GO_ID2,GO_ID3,....ReadingGO-to-genesmappingsfromaleisalsopossibleusingthereadMappingsfunction.However,itistheuserresponsibilitytoknowthedirectionofthemappings.Theusercaneasilytransformamappingfromgene-to-GOstoGO-to-genes(orvice-versa)usingthefunctioninverseList:ကGO2geneID-inverseList(geneID2GO)ကstr(head(GO2geneID))Listof6$GO:0000122:chr"106331"$GO:0000139:chr[1:6]"133103""111846""109956""161395"...$GO:0000166:chr[1:10]"067829""157764""100302""074582"...$GO:0000186:chr"181104"$GO:0000209:chr"159461"$GO:0000228:chr"214717"4.4PredenedlistofinterestinggenesIftheuserhassomeaprioriknowledgeaboutasetofinterestinggenes,hecantesttheenrichmentofGOtermswithregardtothislistofinterestinggenes.Inthisscenario,whenonlyalistofinterestinggenesisprovided,theusercanuseonlytestsstatisticsthatarebasedongenecounts,likeFisher'sexacttest,Zscoreandalike.Todemonstratehowcustomannotationcanbeusedthissectionisbasedonthetoydataset,thegeneID2GOdata,fromSection4.3.Thegeneuniverseinthiscaseisgivenbythelistnames:ကgeneNames-names(geneID2GO)ကhead(geneNames)Sincefortheavailablegeneswedonothaveanymeasurementandthusnocriteriatoselectinterestinggenes,werandomlyselect10%genesfromthegeneuniverseandconsiderthemasinterestinggenes.ကmyInterestingGenes-sample(geneNames,length(geneNames)/10)ကgeneList-factor(as.integer(geneNames%in%myInterestingGenes))ကnames(geneList)-geneNamesကstr(geneList)Factorw/2levels"0","1":1111111111...-attr(*,"names")=chr[1:100]"068724""119608""049239""067829"... ThegeneListobjectisanamedfactorthatindicateswhichgenesareinterestingandwhichnot.Itshouldbestraightforwardtocomputesuchanamedvectorinarealcasesituation,wheretheuserhashisownpredenedlistofinterestinggenes.WenowhavealltheelementstoconstructatopGOdataobject.TobuildthetopGOdataobject,wewillusetheMFontology.ThemappingisgivenbythegeneID2GOlistwhichwillbeusedwiththeannFUN.gene2GOfunction.GOdata-new("topGOdata",ontology="MF",allGenes=geneList,+annot=annFUN.gene2GO,gene2GO=geneID2GO)ThebuildingoftheGOdataobjectcantakesometime,dependingonthenumberofannotatedgenesandonthechosenontology.InourexampletherunningtimeisquitefastgiventhatwehavearathersmallsizegeneuniversewhichalsoimplyamoderatesizeGOontology,especiallysinceweareusingtheMFontology.Theadvantageofhaving(informationon)thegenescores(orbettergenesmeasurements)aswellasawaytodenewhicharetheinterestinggenes,inthetopGOdataobjectisthatonecanapplyvariousgrouptestingprocedure,whichletustestmultiplehypothesisortunewithdierentparameters.BytypingGOdataattheRprompt,theusercanseeasummaryofthedata.ကGOdata-------------------------topGOdataobject-------------------------Description:-Ontology:-MF100availablegenes(allgenesfromthearray):-symbol:068724119608049239067829106331...-10significantgenes.87feasiblegenes(genesthatcanbeusedintheanalysis):-symbol:068724119608049239067829106331...-9significantgenes.GOgraph(nodeswithatleast1genes):-agraphwithdirectededges-numberofnodes=240-numberofedges=318-------------------------topGOdataobject-------------------------OneimportantpointtonoticeisthatnotallthegenesthatareprovidedbygeneList,theinitialgeneuniverse,canbeannotatedtotheGO.Thiscanbeseenbycomparingthenumberofallavailablegenes,thegenespresentingeneList,withthenumberoffeasiblegenes.Wearethereforeforcedatthispointtorestrictthegeneuniversetothesetoffeasiblegenesfortherestoftheanalysis.ThesummaryontheGOgraphshowsthenumberofGOtermsandtherelationsbetweenthemofthespeciedGOontology.ThisgraphcontainsonlyGOtermswhichhaveatleastonegeneannotatedtothem.4.5UsingthegenesscoreInmanycasesthesetofinterestinggenescanbecomputedbasedonascoreassignedtoallgenes,forexamplebasedonthep-valuereturnedbyastudyofdierentialexpression.Inthiscase,thetopGOdataobjectcanstorethegenesscoreandarulespecifyingthelistofinterestinggenes.Theadvantageofhavingboththe scoresandtheproceduretoselectinterestinggenesencapsulatedinthetopGOdataobjectisthattheusercanchoosedierenttypesoftestsstatisticsfortheGOanalysiswithoutmodifyingtheinputdata.AtypicalexamplefortheALLdatasetisthestudywhereweneedtodiscriminatebetweenALLcellsdeliveredfromeitherB-cellorT-cellprecursors.y-as.integer(sapply(eset$BT,function(x)return(substr(x,1,1)=='T')))ကtable(y)Thereare95B-cellALLsamplesand95T-cellALLsamplesinthedataset.Atwo-sidedt-testcanbyappliedusingthefunctiongetPvalues(awrapingfunctionforthemt.teststatfromthemulttestpackage).BydefaultthefunctioncomputesFDR(falsediscoveryrate)adjustedp-valueinordertoaccountformultipletesting.Adierenttypeofcorrectioncanbespeciedusingthecorrectionargument.ကgeneList-getPvalues(exprs(eset),classlabel=y,alternative="greater")geneListisanamednumericvector.Thegeneidentiersarestoredinthenamesattributeofthevector.Thissetofgenesdenesthegeneuniverse.Next,afunctionforspecifyingthelistofinterestinggenesmustbedened.Thisfunctionneedstoselectgenesbasedontheirscores(inourcasetheadjustedp-values)andmustreturnalogicalvectorspecifyingwhichgeneisselectedandwhichnot.Thefunctionmusthaveoneargument,namedallScoreandmustnotdependonanyattributesofthisobject.Inthisexamplewewillconsiderasinterestinggenesallgeneswithanadjustedp-valuelowerthan0:01.Thiscriteriaisimplementedinthefollowingfunction:ကtopDiffGenes-function(allScore){+return(allScore0.01)+}-523;x-topDiffGenes(geneList)ကsum(x)##thenumberofselectedgenesWithallthesestepsdone,theusercannowbuildthetopGOdataobject.ForashortdescriptionoftheargumentsusedbytheinitializefunctionseeSection4.4ကGOdata-new("topGOdata",+description="GOanalysisofALLdata;B-cellvsT-cell",+ontology="BP",+allGenes=geneList,+geneSel=topDiffGenes,+annot=annFUN.db,+nodeSize=5,+affyLib=affyLib)ItisoftenthecasethatmanyGOtermswhichhavefewannotatedgenesaredetectedtobesignicantlyenrichedduetoartifactsinthestatisticaltest.ThesesmallsizedGOtermsareoflessimportancefortheanalysisandinmanycasestheycanbeomitted.ByusingthenodeSizeargumenttheusercancontrolthesizeoftheGOtermsusedintheanalysis.OncethegenesareannotatedtotheeachGOtermandthetruepathruleisappliedthenodeswithlessthannodeSizeannotatedgenesareremovedfromtheGOhierarchy.Wefoundthatvaluesbetween5and10forthenodeSizeparameteryieldmorestableresults.ThedefaultvalueforthenodeSizeparameteris1,meaningthatnopruningisperformed.NotethattheonlydierenceintheinitialisationofanobjectofclasstopGOdatatothecaseinwhichwestartwithapredenedlistofinterestinggenesistheuseofthegeneSelargument.AllfurtheranalysisdependsonlyontheGOdataobject.4.6FilteringandmissingGOannotationsBeforegoingfurtherwiththeenrichmentanalysisweanalysewhichoftheprobesavailableonthearraycanbeusedintheanalysis. Figure3:ScatterplotofFDRadjustedp-valuesagainstvarianceofprobes.Pointsbelowthehorizontallinearesignicantprobes.WewanttoseeifthelteringperformedinSection4.1removesimportantprobes.Thereareatotalof12625probesonthehgu95av2chip.Oneassumesthatonlythenoisyprobes,probeswithlowexpressionvaluesorsmallvarianceacrosssamplesarelteredoutfromtheanalysis.Thenumberofprobeshaveadirecteectonthemultipletestingadjustmentofp-values.Toomanyprobeswillresultintooconservativeadjustedp-valueswhichcanbiastheresultoftestslikeFisher'sexacttest.Thusitisimportanttocarefullyanalysethisstep.allProb-featureNames(ALL)ကgroupProb-integer(length(allProb))+1ကgroupProb[allProb%in%genes(GOdata)]-0ကgroupProb[!selProbes]-2ကgroupProb-factor(groupProb,labels=c("Used","Notannotated","Filtered"))ကtt-table(groupProb)ကttgroupProbUsedNotannotatedFiltered38962058524Outofthelteredprobesonly96%haveannotationtoGOterms.Thelteringprocedureremoves8524probeswhichisaverylargepercentageofprobes(morethan50%),butwedidthisintentionallytoreducetheexpressionsetforcomputationalpurposes.Weperformadierentialexpressionanalysisonallavailableprobesandwecheckifdierentiallyexpressedgenesareleavedoutfromtheenrichmentanalysis.ကpValue-getPvalues(exprs(ALL),classlabel=y,alternative="greater")ကgeneVar-apply(exprs(ALL),1,var)ကdd-data.frame(x=geneVar[allProb],y=log10(pValue[allProb]),groups=groupProb)ကxyplot(y~x|groups,data=dd,groups=groups) Figure3showsforthethreegroupsofprobestheadjustedp-valuesandthegene-wisevariance.Probeswithlargechangesbetweenconditionshavelargevarianceandlowp-value.Inanidealcase,onewouldexpecttohavealargedensityofprobesinthelowerrightcornerofUsedpanelandfewprobesinthisregionintheothertwopanels.Wecanseethatthelteringprocessthrowsoutsomesignicantprobesandinarealanalysisamoreconservativelteringneedstobeapplied.However,therearealsomanydierentiallyexpressedprobeswithoutGOannotationwhichcannotbeusedintheanalysis.5WorkingwiththetopGOdataobjectOncethetopGOdataobjectiscreatedtheusercanusevariousmethodsdenedforthisclasstoaccesstheinformationencapsulatedintheobject.Thedescriptionslotcontainsinformationabouttheexperiment.Thisinformationcanbeaccessedorreplacedusingthemethodwiththesamename.description(GOdata)description(GOdata)-paste(description(GOdata),"Objectmodifiedon:",format(Sys.time(),"%d%b%Y"),sep="")ကdescription(GOdata)Methodstoobtainthelistofgenesthatwillbeusedinthefurtheranalysisormethodsforobtainingallgenescoresareexempliedbelow.ကa-genes(GOdata)##obtainthelistofgenesကhead(a)[1]"1000_at""1005_at""1007_s_at""1008_f_at""1009_at""100_g_at"ကnumGenes(GOdata)[1]3896Nextwedescribehowtoretrievethescoreofaspeciedsetofgenes,e.g.asetofrandomlyselectedgenes.Iftheobjectwasconstructedusingalistofinterestinggenes,thenthefactorvectorthatwasprovidedatthebuildingoftheobjectwillbereturned.ကselGenes-sample(a,10)ကgs-geneScore(GOdata,whichGenes=selGenes)ကprint(gs)Iftheuserwantsanunnamedvectororthescoreofallgenes:ကgs-geneScore(GOdata,whichGenes=selGenes,use.names=FALSE)ကprint(gs)ကgs-geneScore(GOdata,use.names=FALSE)ကstr(gs)ThelistofsignicantgenescanbeaccessedusingthemethodsigGenes().ကsg-sigGenes(GOdata)ကstr(sg)ကnumSigGenes(GOdata)AnotherusefulmethodisupdateGeneswhichallowstheusertoupdate/changethelistofgenes(andtheirscores)fromatopGOdataobject.Ifonewantstoupdatethelistofgenesbyincludingonlythefeasibleones,onecantype: .geneList-geneScore(GOdata,use.names=TRUE)ကGOdata##moreavailablegenesကGOdata-updateGenes(GOdata,.geneList,topDiffGenes)ကGOdata##theavailablegenesarenowthefeasiblegenesTherearealsomethodsavailableforaccessinginformationrelatedtoGOanditsstructure.First,wewanttoknowwhichGOtermsareavailableforanalysisandtoobtainallthegenesannotatedtoasubsetoftheseGOterms.ကgraph(GOdata)##returnstheGOgraphAgraphNELgraphwithdirectededgesNumberofNodes=5961NumberofEdges=13371ကug-usedGO(GOdata)ကhead(ug)[1]"GO:0000002""GO:0000003""GO:0000018""GO:0000027""GO:0000028""GO:0000038"Wefurtherselect10randomGOterms,countthenumberofannotatedgenesandobtaintheirannotation.ကsel.terms-sample(usedGO(GOdata),10)ကnum.ann.genes-countGenesInTerm(GOdata,sel.terms)##thenumberofannotatedgenesကnum.ann.genesကann.genes-genesInTerm(GOdata,sel.terms)##gettheannotationsကhead(ann.genes)Whenthesel.termsargumentismissingallGOtermsareused.Thescoresforallgenes,possiblyannotatedwithnamesofthegenes,canbeobtainedusingthemethodscoresInTerm().ကann.score-scoresInTerm(GOdata,sel.terms)ကhead(ann.score)ကann.score-scoresInTerm(GOdata,sel.terms,use.names=TRUE)ကhead(ann.score)Finally,somestatisticsforasetofGOtermsarereturnedbythemethodtermStat.Asmentionedpreviously,ifthesel.termsargumentismissingthenthestatisticsforallavailableGOtermsarereturned.ကtermStat(GOdata,sel.terms)AnnotatedSignificantExpectedGO:2000463700.61GO:0010878510.44GO:00069392442.09GO:00977201501.31GO:00007312632.26GO:19032061010.87GO:00097497176.18GO:00435253212.78GO:00514571821.57GO:0002026700.61 6RunningtheenrichmenttestsInthissectionweexplainhowwecanrunthedesiredenrichmentmethodoncethetopGOdataobjectisavailable.topGOpackagewasdesignedtoworkwithvariousteststatisticsandvariousalgorithmswhichtaketheGOdependenciesintoaccount.AtthebaseofthisdesignstandsaS4classmechanismwhichfacilitatesdeningandexecutinga(new)grouptest.Threetypesofenrichmenttestscanbedistinguishifwelookatthedatausedbytheeachtest.Testsbasedongenecounts.Thisisthemostpopularfamilyoftests,giventhatitonlyrequiresthepresenceofalistofinterestinggenesandnothingmore.TestslikeFisher'sexacttest,Hypegeometrictestandbinomialtestbelongtothisfamily.Draghicietal.(2006)Testsbasedongenescoresorgeneranks.ItincludesKolmogorov-Smirnovliketests(alsoknownasGSEA),Gentleman'sCategory,t-test,etc.AckermannandStrimmer(2009)Testsbasedongeneexpression.TestslikeGoeman'sglobaltestorGlobalAncovaseparatesfromtheotherssincetheyworkdirectlyontheexpressionmatrix.GoemanandBuhlmann(2007)Therearealsoanumberofstrategies/algorithmstoaccountfortheGOtopology,seeTable1,eachofthemhavingspecicrequirements.ForeachtesttypedescribedaboveandforeachalgorithmthereisS4classdenedinthepackage.Themainideaistohaveaclass(container)whichcanstore,foraspeciedgeneset(GOcategory),alldatanecessaryforcomputingthedesiredteststatistic,andamethodthatwilliterateoverallGOcategories.Insuchadesigntheuserneedstoinstantiateanobjectfromtheclasscorrespondingtothechosenmethod(teststatisticandalgorithm)andthenruntheiteratorfunctiononthisobject.ThedenedS4classesareorganisedinahierarchywhichisshowedinFigure4.Therearetwopossibilities(orinterfaces)forapplyingateststatistictoanobjectofclasstopGOdata.Thebasicinterface,whichprovidesthecoreofthetestingprocedureintopGO,oersmore exibilitytotheexperiencedRuserallowinghimtoimplementnewteststatisticsornewalgorithms.Thesecondinterfaceismoreuserfriendlybutatthesametimemorerestrictiveinthechoiceofthetestsandalgorithmsused.Wewillfurtherexplainhowthesetwointerfaceswork. Figure4:Theteststatisticsclassstructure. 6.1DeningandrunningthetestThemainfunctionforrunningtheGOenrichmentisgetSigGroups()andittakestwoarguments.TherstargumentisaninstanceofclasstopGOdataandthesecondargumentisaninstanceofclassgroupStatsoranyofitsdescendents.Tobetterunderstandthisprincipleconsiderthefollowingexample.Assumewedecidedtoapplytheclassicalgorithm.ThetwoclassesdenedforthisalgorithmareclassicCountandclassicScore.IfanobjectofthisclassisgivenasaargumenttogetSigGroups()thantheclassicalgorithmwillbeused.ThegetSig-Groups()functioncantakeawhile,dependingonthesizeofthegraph(theontologyused),sobepatient.ThegroupStatsclassesNextweshowhowaninstanceofthegroupStatsclasscanrepresentagenesetandhowtheteststatisticisperformed.Wecomputetheenrichmentofcellularlipidmetabolicprocess(GO:0044255)termusingFisher'sexacttest.Inordertodothisweneedtodenethegeneuniverse,toobtainthegenesannotatedtoGO:0044255,andtodenethesetofsignicantgenes.goID-"GO:0044255"ကgene.universe-genes(GOdata)ကgo.genes-genesInTerm(GOdata,goID)[[1]]ကsig.genes-sigGenes(GOdata)NowwecaninstantiateanobjectofclassclassicCount.Oncetheobjectisconstructedwecangetthe22contingencytableorapplytheteststatistic.ကmy.group-new("classicCount",testStatistic=GOFisherTest,name="fisher",+allMembers=gene.universe,groupMembers=go.genes,+sigMembers=sig.genes)ကcontTable(my.group)signotSiganno29237notAnno3103320ကrunTest(my.group)[1]0.1156745TheslottestStatisticcontainsthefunction(ortobemoreprecise,themethod)whichcomputestheteststatistic.WeusedtheGOFisherTestfunctionwhichisavailableinthetopGOpackageandasthenamestatesitimplementsFisher'sexacttest.Theusercandenehisownteststatisticfunctionandthenapplyitusingthepreferredalgorithm.Thefunction,however,shouldusethemethodsdenedforthegroupStatsclasstoaccessthedataencapsulatedinsuchanobject.(ForexampleafunctionwhichcomputestheZscorecanbeeasilyimplementedusingasanexampletheGOFisherTestmethod.)TherunTestmethodisdenedforthegroupStatsclassanditsusedtorun/computetheteststatistic,bycallingthetestStatisticfunction.ThevaluereturnedbytherunTestmethodinthiscaseisthevaluereturnedbyGOFisherTestmethod,whichistheFisher'sexacttestp-value.ThecontTablemethod,showedintheexampleabove,isonlydenedfortheclassesbasedongenecountsanditsusedtocompilethetwo-dimensionalcontingencytablebasedontheobject.ToshowhowthesameinterfaceisusedfortheclassesbasedongenecountswenextbuildaninstancefortheelimCountclass.Werandomlyselect25%oftheannotatedgenesasgenesthatshouldberemoved.ကset.seed(123)ကelim.genes-sample(go.genes,length(go.genes)/4) elim.group-new("elimCount",testStatistic=GOFisherTest,name="fisher",+allMembers=gene.universe,groupMembers=go.genes,+sigMembers=sig.genes,elim=elim.genes)ကcontTable(elim.group)signotSiganno24176notAnno3103320ကrunTest(elim.group)[1]0.0639882Weseethattheinterfaceaccountsforthegenesthatneedtobeeliminated,oncetheobjectisinstantiated.Thesamemechanismappliesfortheotherhierarchyofclasses(thescorebasedandexpressionbasedclasses),exceptthateachhierarchyhasitsownspecialisedmethodsforcomputingstatisticsfromthedata.PleasenotethatthegroupStatsclassoranydescendentclassdoesnotdependonGO,andanobjectofsuchaclasscanbeinstantiatedusinganygeneset.PerformingthetestAccordingtothemechanismdescribedabove,onerstdenesateststatisticforthechosenalgorithm,meaningthataninstanceofobjectspecicforthealgorithmisconstructedinwhichonlytheteststatisticmustbespecied,andthencallsagenericfunction(interface)torunthealgorithm.Accordingtothismechanism,onerstdenesateststatisticforthechosenalgorithm,inthiscaseclassicandthenrunsthealgorithm(seethesecondline).TheslottestStatisticcontainstheteststatisticfunction.IntheaboveexampleGOFisherTestfunctionwhichimplementsFisher'sexacttestandisavailableinthetopGOpackagewasused.Ausercandenehisownteststatisticfunctionandthenapplyitusingtheclassicalgorithm.(ForexampleafunctionwhichcomputestheZscorecanbeimplementedusingasanexampletheGOFisherTestmethod.)ကtest.stat-new("classicCount",testStatistic=GOFisherTest,name="Fishertest")ကresultFisher-getSigGroups(GOdata,test.stat)AshortsummaryontheusedtestandtheresultsisprintedattheRconsole.ကresultFisherDescription:GOanalysisofALLdata;B-cellvsT-cellObjectmodifiedon:30Oct2020Ontology:BP'classic'algorithmwiththe'Fishertest'test5961GOtermsscored:167termswithp0.01Annotationdata:Annotatedgenes:3896Significantgenes:339Min.no.ofgenesannotatedtoaGO:5Nontrivialnodes:4286TousetheKolmogorov-Smirnov(KS)testoneneedstoprovidethegene-wisescoresandthusweneedtoinstantiateanobjectofaclasswhichisabletodealofthescores.SuchaclassistheclassicScoreclass,seeFigure4whichwillletusruntheclassicalgorithm.-523;test.stat-new("classicScore",testStatistic=GOKSTest,name="KStests")ကresultKS-getSigGroups(GOdata,test.stat) Themechanismpresentedaboveforclassicalsoholdforelimandweight.Theusershouldpayattentiontothecompatibilitybetweenthechosenclassandthefunctionforcomputingtheteststatistic,sincenoincompatibilitytestaremadewhentheobjectisinstantiated.Forexampletheweightalgorithmwillnotworkwithclassesbasedongene-wisescores.ToruntheelimalgorithmwithKStestoneneedstotype:test.stat-new("elimScore",testStatistic=GOKSTest,name="Fishertest",cutOff=0.01)ကresultElim-getSigGroups(GOdata,test.stat)Similarly,fortheweightalgorithmwithFisher'sexacttestonetypes:ကtest.stat-new("weightCount",testStatistic=GOFisherTest,name="Fishertest",sigRatio="ratio")ကresultWeight-getSigGroups(GOdata,test.stat)6.2Theadjustmentofp-valuesThep-valuesreturnbythegetSigGroupsfunctionarerowp-values.Thereisnomultipletestingcorrectionappliedtothem,unlesstheteststatisticdirectlyincorporatesuchacorrection.Ofcourse,theresearchercanperformanadjustmentofthep-valuesifheconsidersitisimportantfortheanalysis.Thereasonfornotautomaticallycorrectingformultipletestingare:Inmanycasestherowp-valuesreturnbyanenrichmentanalysisarenotthatextreme.AFDR/FWERadjustmentprocedurecaninthiscaseproduceveryconservativep-valuesanddeclareno,orveryfew,termsassignicant.Thisisnotnecessaryabadthing,butitcanhappenthatthereareinterestingGOtermswhichdidn'tmakeitoverthecutobuttheyareomittedandthusvaluableinformationlost.InthiscasetheresearchermightbeinterestedintherankingoftheGOtermseventhoughnotoptermissignicantataspecifyFDRlevel.Oneshouldkeepinmindthatanenrichmentanalysisconsistofmanystepsandtherearemanyassumptionsdonebeforeapplying,forexample,Fisher'sexacttestonasetofGOterms.PerformingamultipletestingprocedureaccountingonlyonthenumberofGOtermsisfarfrombeingenoughtocontroltheerrorrate.ForthemethodsthataccountfortheGOtopologylikeelimandweight,theproblemofmultipletestingisevenmorecomplicated.Hereonecomputesthep-valueofaGOtermconditionedontheneighbouringterms.Thetestsarethereforenotindependentandthemultipletestingtheorydoesnotdirectlyapply.Weliketointerpretthep-valuesreturnedbythesemethodsascorrectedornotaectedbymultipletesting.6.3AddinganewtestExamplefortheCategorytest....6.4runTest:ahigh-levelinterfacefortestingOverthebasicinterfaceweimplementedanabstractlayertoprovidetheuserswithahigherlevelinterfaceforrunningtheenrichmenttests.Theinterfaceiscomposedbyafunction,namelytherunTestfunction,whichcanbeusedonlywithapredenedsetofteststatisticsandalgorithms.InfactrunTestisawarpingfunctionforthesetofcommandsusedfordeningandrunningatestpresentedinSection6.1.Therearethreemainargumentsthatthisfunctiontakes.TherstargumentisanobjectofclasstopGOdata.Thesecondargument,namedalgorithm,isoftypecharacterandspecieswhichmethodfordealingwiththeGOgraphstructurewillbeused.Thethirdargument,namedstatistic,specieswhichgroupteststatisticwillbeused.ToperformaclassicalenrichmentanalysisbyusingtheclassicmethodandFisher'sexacttest,theuserneedstotype: resultFis-runTest(GOdata,algorithm="classic",statistic="fisher")Variousalgorithmscanbeeasilycombinewithvariousteststatistics.Howevernotallthecombinationswillwork,asseeninTable1.Inthecaseofamismatchthefunctionwillthrowanerror.Thealgorithmargumentisoptionalandifnotspeciedtheweight01methodwillbeused.BellowwecanseemoreexamplesusingtherunTestfunction.ကweight01.fisher-runTest(GOdata,statistic="fisher")ကweight01.t-runTest(GOdata,algorithm="weight01",statistic="t")ကelim.ks-runTest(GOdata,algorithm="elim",statistic="ks")ကweight.ks-runTest(GOdata,algorithm="weight",statistic="ks")#willnotwork!!!ThelastlinewillreturnanerrorbecausewecannotusetheweightmethodwiththeKolmogorov-Smirnovtest.ThemethodsandthestatisticaltestswhichareaccessibleviatherunTestfunctionareavailableviathefollowingtwofunctions:ကwhichTests()[1]"fisher""ks""t""globaltest""sum""ks.ties"ကwhichAlgorithms()[1]"classic""elim""weight""weight01""lea""parentchild"ThereisnoadvantageofusingtherunTest()overgetSigGroups()exceptthatitismoreuserfriendlyanditgivescleanercode.However,iftheuserwantstodenehisownteststatisticorimplementanewalgorithmbasedontheavailablegroupStatsclasses,thenitwouldbenotpossibletousetherunTestfunction.Finally,thefunctioncanpassextraargumentstotheinitialisationmethodforangroupStatsobject.Thus,onecanspecifydierentcutosfortheelimmethod,orargumentsfortheweightmethod.7InterpretationandvisualizationofresultsThissectionpresenttheavailabletoolsforanalysingandinterpretingtheresultsoftheperformedtests.BothgetSigGroupsandrunTestfunctionsreturnanobjectoftypetopGOresult,andmostofthefollowingfunctionsworkwiththisobject.7.1ThetopGOresultobjectThestructureofthetopGOresultobjectisquitesimple.Itcontainsthep-valuesorthestatisticsreturnedbythetestandbasicinformationsontheusedteststatistic/algorithm.TheinformationstoredinthetopGOdataobjectisnotcarriedoverthisobject,andbothoftheseobjectswillbeneededbythediagnostictools.Sincetheteststatisticcanreturneitherap-valueorastatisticofthedata,wewillreferthemasscores!Toaccessthestoredp-values,theusershouldusethefunctionscore.Itreturnsanamednumericvector,werethenamesareGOidentiers.Forexample,wecanlookatthehistogramoftheresultsoftheFisher'sexacttestandtheclassicalgorithm.Bydefault,thescorefunctiondoesnotwarrantytheorderinwhichthep-valuesarereturned,aswecanseeifwecomparetheresultFisobjectwiththeresultWeightobject:ကhead(score(resultWeight))GO:1902894GO:0048812GO:1902895GO:0048813GO:0071870GO:00706691.00000000.86365940.28169631.00000000.82248521.0000000 pvalFis-score(resultFis)ကhead(pvalFis)GO:0000002GO:0000003GO:0000018GO:0000038GO:0000041GO:00000700.559646740.711209820.153506570.010701950.855953790.93563908ကhist(pvalFis,50,xlab="p-values") However,thescoremethodhasaparameter,whichGO,whichtakesalistofGOidentiersandreturnsthescoresforthesetermsinthespeciedorder.OnlythescoresforthetermsfoundintheintersectionbetweenthespeciedGOsandtheGOsstoredinthetopGOresultobjectarereturned.Toseehowthisworkletscomputethecorrelationbetweenthep-valuesoftheclassicandweightmethods:pvalWeight-score(resultWeight,whichGO=names(pvalFis))ကhead(pvalWeight)GO:0000002GO:0000003GO:0000018GO:0000038GO:0000041GO:00000700.559646740.987439471.000000000.010701950.950798091.00000000ကcor(pvalFis,pvalWeight)[1]0.5449325BasicinformationoninputdatacanbeaccessedusingthegeneDatafunction.Thenumberofannotatedgenes,thenumberofsignicantgenes(ifitisthecase),theminimalsizeofaGOcategoryaswellasthenumberofGOcategorieswhichhaveatleastonesignicantgeneannotatedarelisted:ကgeneData(resultWeight)AnnotatedSignificantNodeSizeSigTerms3896339542867.2SummarisingtheresultsWecanusetheGenTablefunctiontogenerateasummarytablewiththeresultsfromoneormoretestsappliedtothesametopGOdataobject.ThefunctioncantakeavariablenumberoftopGOresultobjectsanditreturnsadata.framecontainingthetoptopNodesGOtermsidentiedbythemethodspeciedwiththeorderByargument.Thisargumentallowstheuserdecidewhichp-valuesshouldbeusedfororderingtheGOterms. GO.IDTermAnnotatedSignicantExpectedRankinclassicclassicKSweight 1GO:0045061thymicTcellselection1281.0411.1e-060.000161.1e-062GO:0031580membraneraftdistribution650.52112.7e-050.000992.7e-053GO:0042102positiveregulationofTcellproliferat...45133.92207.3e-050.010507.3e-054GO:0042473outerearmorphogenesis750.61218.8e-050.003828.8e-055GO:0048935peripheralnervoussystemneurondevelop...750.61228.8e-050.002478.8e-056GO:0010863positiveregulationofphospholipaseCa...1571.31250.000120.011990.000127GO:2001187positiveregulationofCD8-positive,alp...850.70330.000220.005750.000228GO:0043651linoleicacidmetabolicprocess1261.04370.000240.008440.000249GO:0050854regulationofantigenreceptor-mediated...38113.31380.000260.002190.0002610GO:0036109alpha-linolenicacidmetabolicprocess540.44400.000260.002560.0002611GO:0042474middleearmorphogenesis950.78450.000450.017610.0004512GO:0045586regulationofgamma-deltaTcelldiere...950.78460.000450.013790.0004513GO:0034030ribonucleosidebisphosphatebiosynthetic...1871.57480.000490.000870.0004914GO:0034033purinenucleosidebisphosphatebiosynthe...1871.57490.000490.000870.0004915GO:0032760positiveregulationoftumornecrosisfa...2482.09520.000640.039270.0006416GO:0090270regulationofbroblastgrowthfactorp...640.52570.000730.016530.0007317GO:0090181regulationofcholesterolmetabolicproc...2071.74630.001020.003470.0010218GO:0006636unsaturatedfattyacidbiosyntheticproc...2482.09530.000640.019510.0010319GO:0048304positiveregulationofisotypeswitching...740.61770.001590.009640.0015920GO:0010453regulationofcellfatecommitment840.70910.002970.061050.00297 Table3:SignicanceofGOtermsaccordingtodierenttests.allRes-GenTable(GOdata,classic=resultFis,KS=resultKS,weight=resultWeight,+orderBy="weight",ranksOf="classic",topNodes=20)Pleasenotethatweneedtotypethefullnames(theexactname)ofthefunctionarguments:topNodes,rankOf,etc.Thisisthepricepaidfor exibilityofspecifyingdierentnumberoftopGOresultsobjects.ThetableincludesstatisticsontheGOtermsplusthep-valuesreturnedbytheotheralgorithms/teststatistics.Table3showstheinformationsincludedinthedata.frame.7.3AnalysingindividualGOsNextwewanttoanalysethedistributionofthegenesannotatedtoaGOtermofinterest.InanenrichmentanalysisoneexpectsthatthegenesannotatedtoasignicantlyenrichedGOtermhavehigherscoresthantheaveragegene'scoreofthegeneuniverse.OnewaytocheckthishypothesisistocomparethedistributionofthegenescoresannotatedtothespeciedGOtermwiththedistributionofthescoresofthecomplementarygeneset(allthegenesinthegeneuniversewhicharenotannotatedtotheGOterm).ThiscanbeeasilyachievedusingtheshowGroupDensityfunction.Forexample,letslookatthedistributionofthegenesannotatedtothemostsignicantGOtermw.r.t.theweightalgorithm.ကgoID-allRes[1,"GO.ID"]ကprint(showGroupDensity(GOdata,goID,ranks=TRUE)) Figure5:Distributionofthegene'rankfromGO:0045061,comparedwiththenulldistribution. WecanseeinFigure5thatthegenesannotatedtoGO:0045061havelowranks(geneswithlowp-valueofthet-test).Thedistributionoftheranksisskewedontheleftsidecomparedwiththereferencedistributiongivenbythecomplementarygeneset.Thisisaniceexampleinwhichthereisasignicantdierenceinthedistributionofscoresbetweenthegenesetandthecomplementaryset,andweseefromTable3thatthisGOisfoundassignicantlyenrichedbyallmethodsused.Intheaboveexample,thegeneswithap-valueequalto1wereomitted.TheycanbeincludedusingthevalueFALSEfortherm.oneargumentintheshowGroupDensityfunction.AnotherusefulfunctionforanalysingtermsofinterestisprintGenes.Thefunctionwillgenerateatablewithallthegenes/probesannotatedtothespeciedGOterm.Varioustypeofidentiers,thegenenameandthep-values/statisticsareprovidedinthetable.goID-allRes[10,"GO.ID"]ကgt-printGenes(GOdata,whichTerms=goID,chip=affyLib,numChar=40) ChipIDLL.idSymbol.idGenenamerawp-value 33821 at33821 at60481ELOVL5ELOVLfattyacidelongase53.53e-2139372 at39372 atNANA2.99e-1435117 at35117 at60481ELOVL5ELOVLfattyacidelongase51.45e-0932190 at32190 at9415FADS2fattyaciddesaturase27.68e-0740415 at40415 at30ACAA1acetyl-CoAacyltransferase11 Table4:GenesannotatedtoGO:0036109.Thedata.framecontainingthegenesannotatedtoGO:0036109isshowninTable4.OneormoreGOidentierscanbegiventothefunctionusingthewhichTermsargument.WhenmorethanoneGOisspecied,thefunctionreturnsalistofdata.frames,otherwiseonlyonedata.frameisreturned.Thefunctionhasaargumentlewhich,whenspecied,willsavetheresultsintoaleusingtheCSVformat.ForthemomentthefunctionwillworkonlywhenthechipusedhasanannotationpackageavailableinBioconductor.Itwillnotworkwithothertypeofcustomannotations.7.4VisualisingtheGOstructureAninsightfulwayoflookingattheresultsoftheanalysisistoinvestigatehowthesignicantGOtermsaredistributedovertheGOgraph.WeplotthesubgraphsinducedbythemostsignicantGOtermsreportedbyclassicandweightmethods.Therearetwofunctionsavailable.TheshowSigOfNodeswillplottheinducedsubgraphtothecurrentgraphicdevice.TheprintGraphisawarpingfunctionofshowSigOfNodesandwillsavetheresultinggraphintoaPDForPSle.showSigOfNodes(GOdata,score(resultFis),firstSigNodes=5,useInfo='all')showSigOfNodes(GOdata,score(resultWeight),firstSigNodes=5,useInfo='def')printGraph(GOdata,resultFis,firstSigNodes=5,fn.prefix="tGO",useInfo="all",pdfSW=TRUE)printGraph(GOdata,resultWeight,firstSigNodes=5,fn.prefix="tGO",useInfo="def",pdfSW=TRUE)Intheplots,thesignicantnodesarerepresentedasrectangles.Theplottedgraphistheupperinducedgraphgeneratedbythesesignicantnodes.ThesegraphplotsareusedtoseehowthesignicantGOtermsaredistributedinthehierarchy.ItisaveryusefultooltorealizebehaviourofvariousenrichmentmethodsandtobetterunderstandwhichofthesignicantGOtermsarereallyofinterest.WecanemphasisedierencesbetweentwomethodsusingtheprintGraphfunction:printGraph(GOdata,resultWeight,firstSigNodes=10,resultFis,fn.prefix="tGO",useInfo="def")printGraph(GOdata,resultElim,firstSigNodes=15,resultFis,fn.prefix="tGO",useInfo="all") Figure6:Thesubgraphinducedbythetop5GOtermsidentiedbytheclassicalgorithmforscoringGOtermsforenrichment.Boxesindicatethe5mostsignicantterms.Boxcolorrepresentstherelativesignicance,rangingfromdarkred(mostsignicant)tolightyellow(leastsignicant).Blackarrowsindicateis-arelationshipsandredarrowspart-ofrelationships. Figure7:Thesubgraphinducedbythetop5GOtermsidentiedbytheweightalgorithmforscoringGOtermsforenrichment.Boxesindicatethe5mostsignicantterms.Boxcolorrepresentstherelativesignicance,rangingfromdarkred(mostsignicant)tolightyellow(leastsignicant).Blackarrowsindicateis-arelationshipsandredarrowspart-ofrelationships. 8SessionInformationTheversionnumberofRandpackagesloadedforgeneratingthevignettewere:Rversion4.0.3(2020-10-10),x86_64-pc-linux-gnuLocale:LC_CTYPE=en_US.UTF-8,LC_NUMERIC=C,LC_TIME=en_US.UTF-8,LC_COLLATE=C,LC_MONETARY=en_US.UTF-8,LC_MESSAGES=en_US.UTF-8,LC_PAPER=en_US.UTF-8,LC_NAME=C,LC_ADDRESS=C,LC_TELEPHONE=C,LC_MEASUREMENT=en_US.UTF-8,LC_IDENTIFICATION=CRunningunder:Ubuntu18.04.5LTSMatrixproducts:defaultBLAS:/home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.soLAPACK:/home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.soBasepackages:base,datasets,grDevices,graphics,grid,methods,parallel,stats,stats4,utilsOtherpackages:ALL1.32.0,AnnotationDbi1.52.0,Biobase2.50.0,BiocGenerics0.36.0,GO.db3.12.1,IRanges2.24.0,Rgraphviz2.34.0,S4Vectors0.28.0,SparseM1.78,genelter1.72.0,graph1.68.0,hgu95av2.db3.2.3,lattice0.20-41,multtest2.46.0,org.Hs.eg.db3.12.0,topGO2.42.0,xtable1.8-4Loadedviaanamespace(andnotattached):DBI1.1.0,MASS7.3-53,Matrix1.2-18,R62.5.0,RSQLite2.2.1,Rcpp1.0.5,XML3.99-0.5,annotate1.68.0,bit4.0.4,bit644.0.5,blob1.2.1,compiler4.0.3,digest0.6.27,httr1.4.2,matrixStats0.57.0,memoise1.1.0,pkgcong2.0.3,rlang0.4.8,splines4.0.3,survival3.2-7,tools4.0.3,vctrs0.3.4ReferencesAckermann,M.andStrimmer,K.(2009).Ageneralmodularframeworkforgenesetenrichmentanalysis.BMCBioinformatics,10:47.10.1186/1471-2105-10-47.Alexa,A.,Rahnenfuhrer,J.,andLengauer,T.(2006).Improvedscoringoffunctionalgroupsfromgeneexpressiondatabydecorrelatinggographstructure.Bioinformatics(Oxford,England),22:1600{1607.10.1093/bioinformatics/btl140.Chiaretti,S.,etal.(2004).GeneexpressionproleofadultT-cellacutelymphocyticleukemiaidentiesdistinctsubsetsofpatientswithdierentresponsetotherapyandsurvival.Blood,103(7):2771{2778.Consortium,G.O.(2001).CreatingtheGeneOntologyResource:DesignandImplementation.GenomeResearch,11:1425{1433.ColdSpringHarborLaboratoryPress.Draghici,S.,Sellamuthu,S.,andKhatri,P.(2006).Babel'stowerrevisited:auniversalresourceforcross-referencingacrossannotationdatabases.Bioinformatics(Oxford,England),22:btl372v1{2939.10.1093/bioinformatics/btl372.Goeman,J.J.andBuhlmann,P.(2007).Analyzinggeneexpressiondataintermsofgenesets:methodologicalissues.Bioinformatics(Oxford,England),23:980{987.10.1093/bioinformatics/btm051.Grossmann,S.,Bauer,S.,Robinson,P.N.,andVingron,M.(2007).Improveddetectionofoverrepresentationofgene-ontologyannotationswithparentchildanalysis.Bioinformatics,23:3024.10.1093/bioinformat-ics/btm440.YonRhee,S.,Wood,V.,Dolinski,K.,andDraghici,S.(2008).Useandmisuseofthegeneontologyannotations.Naturereviews.Genetics,advancedonlinepublication:509{515.10.1038/nrg2363.