PEWO: a collection of workflows to benchmark phylogenetic placement

Benjamin Linard 1,2,*, Nikolai Romashchenko 1, Fabio Pardi 1 and Eric Rivals 1,3

1 LIRMM, University of Montpellier, CNRS, Montpellier, France
2 SPYGEN, 17 Rue du Lac Saint-André, 73370 Le Bourget-du-Lac, France
3 Institut Français de Bioinformatique, CNRS UMS 3601, Évry, France
* To whom correspondence should be addressed.

Abstract

Motivation: Phylogenetic placement (PP) is a process of taxonomic identification for which several tools are now available. However, it remains difficult to assess which tool is best suited to particular genomic data or to a particular reference taxonomy. We developed PEWO, the first benchmarking tool dedicated to PP assessment. Its automated workflows can evaluate PP at many levels, from parameter optimisation for a particular tool to the selection of the most appropriate genetic marker when PP-based species identifications are targeted. Our goal is that PEWO will become a community effort and a standard support for future developments and applications of PP.

Availability: https://github.com/phylo42/PEWO
Contact: benjamin.linard@lirmm.fr; rivals@lirmm.fr
Supplementary: Supplementary data is available at page 4.

1 Introduction

When a reference phylogeny is available, taxonomic identification of biological sequences can be achieved with phylogenetic placement (PP). PP provides the most informative type of classification because each query sequence is assigned to its putative origin in the tree. PP can be applied in many contexts, including community ecology, species diversity, or medical studies. Several PP tools were developed for these purposes (Matsen et al., 2010; Berger et al., 2011; Mirarab et al., 2012; Zheng et al., 2018), with four recent tools capable of processing larger sequence volumes (Barbera et al., 2018; Linard et al., 2019; Czech and Stamatakis, 2019; Balaban et al., 2020). In the preliminary phase of experimental design, assessing which tools answer the needs of a given application remains a tedious task, often involving manual tests (Mangul et al., 2019).

Strikingly, PP has a broad range of applications but lacks user guidelines and benchmarking. Some procedures to evaluate PP accuracy were proposed (Matsen et al., 2010), but never automated via dedicated software. Benchmarking is essential to determine which tool better suits a given metagenomic task or a specific dataset (Sczyrba et al., 2017). To fill this gap, we developed PEWO (Placement Evaluation WOrkflows), the first tool dedicated to PP benchmarking. PEWO automates existing evaluation procedures (which were not previously implemented for the community) and introduces novel procedures. Beyond benchmarking, PEWO can help decision-making in any metagenomic or metabarcoding project for PP-based taxonomic identification. With applications ranging from parameter optimization on particular genomic data to the selection of the most appropriate genetic marker, PEWO provides the user community with standardized workflows for easy and reproducible assessment of PP analyses.

2 Overview

PEWO implements evaluation workflows in Python and Snakemake (Köster and Rahmann, 2012), whose framework ensures flexibility, platform independence, and reproducibility. Each workflow automatically performs multiple steps, from query generation up to summary plots/tables, and can be tailored via Snakemake configuration files. PEWO and its dependencies are easily installed via a conda virtual environment. Currently, PEWO incorporates five state-of-the-art PP tools, which cover a majority of PP uses: EPA (RAxML), PPlacer, EPA-ng, RAPPAS and APPLES. Four are alignment-based tools, while RAPPAS is alignment-free. As input, each workflow takes a phylogenetic tree and the reference multiple sequence alignment from which it was built (Figure 1). Optionally, the user can provide a set of query sequences. Below we describe the workflows and some of their applications.

2.1 PEWO procedures

Pruning-based accuracy evaluation (PAC): in this standard procedure for assessing placement accuracy (Matsen et al., 2010; Berger et al., 2011), a subset of sequences is randomly pruned from the reference phylogeny and alignment. Each pruned sequence then serves to generate queries for placement, and the accuracy of each tool is measured as the number of nodes separating the predicted placement from the true placement. PEWO offers two versions of this topological metric: Node Distance (ND) and expected Node Distance (eND
). The eND accounts for placement uncertainty (e.g. likelihood weight ratios). All selected tools are compared for a user-selected combination of parameters.

Likelihood-based accuracy evaluation (LAC) is a new, faster evaluation procedure introduced in PEWO to assess the relative accuracy of PP. It iterates the following process for a set of queries: place the query, extend the phylogeny to include that query, optimize the branch lengths of this extended tree, and return its log-likelihood (LL). The user can then compare the LL values obtained with different tools, or with different settings of the same tool (e.g. by inspecting the distribution of the differences between LL values obtained with two different tools). See the Supplementary Materials for a more detailed description.

Resource evaluation (RES): outputs the runtime and memory usage of selected tools, with details for each placement step (e.g., profile alignment, database construction, placement...). One can compare the impact on time and memory of tool-specific parameter combinations while searching for an appropriate accuracy/resource trade-off, or evaluate the tools' scalability with respect to input size.

Fig. 1. A. Overview of PEWO inputs and outputs. B. An example of plots dynamically generated by the PAC (Pruning-based Accuracy Evaluation) procedure on a 16S rRNA bacterial reference. Measured mean expected Node Distances (eND) are reported (lower value = better accuracy). Panels report selected conditions for PPlacer and RAPPAS, e.g. different parameter values tested in different rows and columns. For PPlacer, varying parameters are ms (max-strikes, X axis) and sb (strike-box, Y axis). Parameter mp (max-pitches, grey box) is fixed. For RAPPAS, varying parameters are k (phylo-kmer size) and o (omega threshold). Parameters red (alignment reduction) and ar (software used for ancestral reconstruction) are fixed. C. Four PAC procedures were run for different Coleopteran mitogenome loci (rows) and compiled. Average expected Node Distance (eND) is measured for three tools (columns) using default parameters. For each locus, the lowest average eND is highlighted in bold. For RAPPAS, the last column shows that accuracy can be improved when increasing k-mer size (default is k=8). Examples B. and C. are more extensively discussed in Supplementary Materials.

2.2 Applications

PEWO procedures cover numerous use cases arising with PP, as illustrated by six exemplar applications provided on GitHub (two are reported in Figure 1B-C). As new PP tools can be incorporated in PEWO, PEWO procedures enable comparing existing and future tools on resource usage, scalability, or accuracy in a reproducible way.

With PEWO, users can optimize their PP pipeline design. For instance, for a given reference (tree and alignment), one can determine which tool and parameter combination will maximize placement accuracy, and at which computational cost. PEWO facilitates such tests, as in Figure 1-B, which shows two plots automatically generated by the PAC procedure running PPlacer and RAPPAS for 9 and 6 parameter combinations, respectively.

As a second example, we show how PEWO can be used to compare different genetic markers available for the same taxa, as the choice of the marker may impact the accuracy of placement. For example, we evaluated the placements for four loci (16S, 12S, cox1, cyt) on their associated phylogeny for 900 Coleopteran mitochondrial genomes (Linard et al., 2018). Figure 1-C displays the results (reproducible via GitHub example 4), highlighting that: i) 12S yields the most accurate placements, despite being the second shortest locus, ii) the tool achieving the best accuracy depends on the marker, and iii) with RAPPAS, a longer k-mer size is required to obtain accuracy similar to or better than alignment-based methods.

2.3 Availability and implementation

PEWO, with full documentation and example workflows, is freely available from its repository URL: https://github.com/phylo42/PEWO. Its modular, well-documented, and evolvable source code enables the community to easily extend it by adding new tools, procedures, or metrics. Notably, users can develop their own evaluation procedures starting from PEWO Snakemake rules as templates for their own workflows. Any PP tool can be integrated as long as it outputs results in jplace format (a json specification, standard in PP, see (Matsen et al., 2012)), can be parameterized via the command line, and is available on a conda or pip repository (see the documentation for guidelines).

3 Conclusion

Reproducibility of computational analyses in life sciences is a crucial issue, even more so when large-scale data comes into play, as in the case of metagenomics. With PEWO, we provide a resource that facilitates the evaluation and comparison of PP tools under a unified framework. It combines flexibility and extensibility with ease of use, while it inherits a standardized installation procedure from the conda framework. The set of workflows in PEWO aims to grow as a community effort, and extensions are welcome. In PEWO, we introduce a likelihood-based accuracy evaluation procedure, which is complementary to existing procedures (Matsen et al., 2010). PEWO will help the community in its efforts to develop future PP tools and will facilitate experimental decisions when PP is chosen as a means of species identification. With the help of future contributors, we hope that PEWO will evolve as a standard for PP benchmarking, and answer forthcoming unforeseen yet auspicious applications.

Acknowledgements

We thank Vincent Lefort for technical assistance, the ATGC bioinformatic platform, the Institut Français de Bioinformatique [ANR-11-INBS-0013].

Funding

This work has been supported by France Génomique [ANR-10-INBS-0009], MNERT fellowship to NR.

References

Balaban, M. et al. (2020). Apples: scalable distance-based phylogenetic placement with or without alignments. Systematic Biology, 69(3), 566–578.
Barbera, P. et al. (2018). EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences. Systematic Biology, 68(2), 365–369.
Berger, S. A. et al. (2011). Performance, accuracy, and web server for evolutionary placement of short sequence reads under maximum likelihood. Systematic Biology, 60(3), 291–302.
Czech, L. and Stamatakis, A. (2019). Scalable methods for analyzing and visualizing phylogenetic placement of metagenomic samples. PLOS ONE, 14(5), e0217050.
Köster, J. and Rahmann, S. (2012). Snakemake - a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520–2522.
Linard, B. et al. (2018). The contribution of mitochondrial metagenomics to large-scale data mining and phylogenetic analysis of coleoptera. Molecular Phylogenetics and Evolution, 128, 1–11.
Linard, B. et al. (2019). Rapid alignment-free phylogenetic identification of metagenomic sequences. Bioinformatics, 35(18), 3303–3312.
Mangul, S. et al. (2019). Systematic benchmarking of omics computational tools. Nature Communications, 10(1).
Matsen, F. A. et al. (2010). pplacer: Linear time maximum-likelihood and bayesian phylogenetic placement of sequences onto a fixed reference tree. BMC Bioinformatics, 11(1), 538.
Matsen, F. A. et al. (2012). A format for phylogenetic placements. PLoS ONE, 7(2), e31009.
Mirarab, S. et al. (2012). SEPP: sate-enabled phylogenetic placement. In R. B. Altman, A. K. Dunker, L. Hunter, T. Murray, and T. E. Klein, editors, Biocomputing 2012: Proceedings of the Pacific Symposium, Kohala Coast, Hawaii, USA, January 3-7, 2012, pages 247–258. World Scientific Publishing.
Sczyrba, A. et al. (2017). Critical assessment of metagenome interpretation - a benchmark of metagenomics software. Nature Methods, 14(11), 1063–1071.
Zheng, Q. et al. (2018). HmmUFOtu: An Hmm and Phylogenetic Placement Based Ultra-Fast Taxonomic Assignment and Otu Picking Tool for Microbiome Amplicon Sequencing Studies. Genome Biology, 19(1)
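
Illustrative note on the ND and eND metrics of Section 2.1. The following Python sketch is not taken from PEWO; it only illustrates how a node distance and an expected node distance could be derived from a jplace result once the node distance from every reference edge to the true (pruned) position is known. The function name, the `node_distance_to_truth` mapping, and the normalisation by the summed likelihood weight ratios are assumptions made for this example; the field names `edge_num` and `like_weight_ratio` are those commonly written by placement tools in jplace files.

```python
import json


def node_distances(jplace_text, node_distance_to_truth):
    """Return {query_name: (ND, eND)} for every query in a jplace document.

    node_distance_to_truth is a hypothetical, caller-supplied mapping from
    edge_num to the number of nodes separating that edge from the true edge.
    """
    doc = json.loads(jplace_text)
    fields = doc["fields"]
    edge_idx = fields.index("edge_num")
    lwr_idx = fields.index("like_weight_ratio")

    results = {}
    for entry in doc["placements"]:
        rows = entry["p"]
        # ND: distance of the single best placement (highest weight ratio).
        best = max(rows, key=lambda row: row[lwr_idx])
        nd = node_distance_to_truth[best[edge_idx]]
        # eND: weight-ratio-weighted mean distance over all reported placements
        # (normalising by the summed ratios is an assumption of this sketch).
        total = sum(row[lwr_idx] for row in rows)
        end = sum(
            row[lwr_idx] * node_distance_to_truth[row[edge_idx]] for row in rows
        ) / total
        # Query names are stored under "n", or under "nm" with multiplicities.
        names = entry.get("n") or [pair[0] for pair in entry.get("nm", [])]
        for name in names:
            results[name] = (nd, end)
    return results


# Toy jplace document: one query placed on edges 0 and 2.
example = json.dumps({
    "version": 3,
    "tree": "((A:0.1{0},B:0.1{1}):0.1{2},C:0.2{3});",
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [{
        "p": [[0, -1000.0, 0.75, 0.05, 0.01],
              [2, -1001.1, 0.25, 0.02, 0.01]],
        "n": ["query1"],
    }],
})
# Hypothetical distances, assuming the true position of query1 is edge 0.
distances = {0: 0, 1: 2, 2: 1, 3: 2}
scores = node_distances(example, distances)
print(scores)
```

In this toy case the best placement falls on the true edge, so ND is 0, while the eND is pulled above 0 by the 25% of placement weight assigned one node away. This is the sense in which the eND accounts for placement uncertainty.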