Probabilistic Reconstruction of Ancestral Gene Orders with Insertions and Deletions

Fei Hu, Jun Zhou, Lingxi Zhou and Jijun Tang

Abstract: Changes in gene order have been used extensively as a signal to reconstruct phylogenies and ancestral genomes. Inferring the gene order of an extinct species has a wide range of applications, including the potential to reveal more detailed evolutionary histories, to determine gene content and ordering, and to understand the consequences of structural changes for organismal function and species divergence. In this study, we propose a new adjacency-based method, PMAG+, to infer ancestral genomes under a more general model of gene evolution involving gene insertions and deletions (indels), in addition to gene rearrangements. PMAG+ improves on our previous method PMAG by developing a new approach to infer ancestral gene contents and by reducing the adjacency assembly problem to an instance of the traveling salesman problem (TSP). We designed a series of experiments to validate PMAG+ extensively and compared the results with the most recent comparable method, GapAdj. According to the results, ancestral gene contents predicted by PMAG+ closely match the actual contents, with error rates below 1%. Under various degrees of indels, PMAG+ consistently predicts ancestral gene orders more accurately and, at the same time, produces contigs very close to the actual chromosomes.

Index Terms: Ancestral Genome, Gene Order, Genome Rearrangement, Gene Insertion, Gene Deletion
1 INTRODUCTION

Gene order data has proved very useful in phylogenetic reconstruction, but determining the ancestral orders and orientations of genes is still far from solved. In recent years, reconstructing the hypothetical gene orders of ancestors has been studied both with and without a given speciation history. If the speciation history is given (in the form of a binary tree), the problem of finding ancestors at non-leaf nodes is defined as the small phylogeny problem (SPP); on the other hand, starting from a set of related species, the big phylogeny problem (BPP) searches for the phylogenetic tree along with all the ancestors in the tree.

Current methods to solve SPP are either event-based or adjacency-based. Event-based methods seek an assignment of gene orders to each ancestor such that the number of evolutionary events is minimized. These methods are very expensive, and may not be able to find a solution even after months of computation. To overcome this problem, several adjacency-based methods were proposed, which compute the score or probability of each gene adjacency and assemble individual adjacencies into a valid permutation of gene order based on their scores or probabilities.
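To make the notion of a gene adjacency concrete, the following Python sketch extracts adjacencies and telomeres from signed gene orders and encodes several species as aligned binary (presence or absence) characters, which is the kind of input an adjacency-based method scores. It is only an illustration under stated assumptions: genes are signed integers, the value 0 stands for a chromosome extremity, and the function names `canonical`, `adjacencies`, and `encode` are hypothetical rather than taken from PMAG or PMAG+.

```python
from typing import Dict, List, Set, Tuple

Adjacency = Tuple[int, int]
EXTREMITY = 0  # assumption: 0 marks a chromosome end; genes are nonzero signed ints


def canonical(a: int, b: int) -> Adjacency:
    """Return one fixed representative of an adjacency.

    (a, b) read on one strand equals (-b, -a) read on the other,
    so both spellings must map to the same character.
    """
    return min((a, b), (-b, -a))


def adjacencies(chromosome: List[int]) -> Set[Adjacency]:
    """All adjacencies and the two telomeres of one linear chromosome."""
    path = [EXTREMITY] + list(chromosome) + [EXTREMITY]
    return {canonical(x, y) for x, y in zip(path, path[1:])}


def encode(genomes: Dict[str, List[List[int]]]):
    """Encode each genome as a 0/1 vector over all observed adjacencies."""
    per_species = {
        name: set().union(*(adjacencies(c) for c in chromosomes))
        for name, chromosomes in genomes.items()
    }
    columns = sorted(set().union(*per_species.values()))
    matrix = {
        name: [1 if col in found else 0 for col in columns]
        for name, found in per_species.items()
    }
    return columns, matrix


if __name__ == "__main__":
    # Two toy genomes with one chromosome each (genes a..e numbered 1..5).
    cols, mat = encode({"G1": [[1, -3, 4]], "G2": [[2, 1, -5]]})
    for name, row in mat.items():
        print(name, row)
```

Reading a chromosome from either end yields the same character set, which is why the canonical form is taken before encoding.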
Currently, most methods are restricted to datasets involving only rearrangements. Under such a model, all species must have equal gene content, such that each gene has exactly one copy in every species. Therefore, in this study we propose PMAG+ as an extension of our previous method PMAG in order to efficiently handle datasets that underwent large-scale rearrangements, as well as deletions and insertions (indels) of a single gene or a segment of genes. Our experimental results on simulated datasets suggest that PMAG+ can efficiently and accurately predict both ancestral gene contents and ancestral gene orders.

Author affiliations: Fei Hu and Jijun Tang are affiliated with the Tianjin Key Laboratory of Cognitive Computing and Application at the Tianjin University of China, and the Department of Computer Science and Engineering at the University of South Carolina. E-mail: jtang@cse.sc.edu. Jun Zhou and Lingxi Zhou are Ph.D. students in the Department of Computer Science and Engineering at the University of South Carolina.

2 EVOLUTION OF GENE ORDERS

Given a set of $n$ genes labeled as $\{1, 2, \ldots, n\}$, a genome can be represented by an ordering of these genes. Each gene is assigned an orientation that is either positive, written $i$, or negative, written $-i$. Two genes $i$ and $j$ form an adjacency $(i, j)$ if $i$ is immediately followed by $j$, or, equivalently, $-j$ is immediately followed by $-i$. If gene $k$ lies at one end of a linear chromosome, we let $k$ be adjacent to an extremity $e$ that marks the beginning or end of the chromosome; this is written as $(e, k)$ or $(k, e)$ and called a telomere.

Genome rearrangement operations change the ordering of genes on chromosomes. An inversion operation (also called reversal) reverses a segment of a chromosome. A transposition is an operation that swaps two segments of a chromosome. In the case of multiple chromosomes, a translocation breaks a chromosome and reattaches a part to another chromosome, while a fusion joins two chromosomes and a fission splits one
chromosome into two. Yancopoulos et al. [1] proposed a universal double-cut-and-join (DCJ) operation that accounts for all common events.

There is another set of operations that can alter the gene content of a genome. A deletion (also called loss) deletes a single gene or a segment of genes from the genome. Its reverse operation, called insertion, introduces one gene or a segment of genes that has not been seen before into a chromosome at a time. Whole genome duplication (WGD) creates an additional copy of the entire genome of a species.

3 METHODS FOR SOLVING THE SMALL PHYLOGENY PROBLEM (SPP)

In the context of event-based methods, to find a solution for SPP it is typical to iterate over each internal node and solve for the median genomes until the sum of all edge distances (the tree score) is minimized. The median problem can be formalized as follows: given a set of $m$ genomes with permutations $\{x_i\}_{1 \le i \le m}$ and a distance measurement $d$, find another permutation $x_t$ such that the median score, defined as $\sum_{i=1}^{m} d(x_i, x_t)$, is minimized. GRAPPA [2] and MGR [3] (as well as their recently enhanced versions) are two widely referenced methods that implement a selection of median solvers for phylogeny and ancestral gene-order inference. However, solving even the simplest case of the median problem, when $m$ equals three, is NP-hard for most distance measurements.

Progress has been made in handling genomes with unequal gene content. Tang and Moret proposed a two-phase method [4] in which the best gene content for the median is computed and then a branch-and-bound approach is used to determine the best ordering of this gene content. Zhang et al. later extended Caprara's inversion median solver [5] and proposed a simplified DCJ-based distance computation for unichromosomal genomes with indels.

The first adjacency-based method in a probabilistic framework was introduced in InferCarsPro [6]. The key of this method is to estimate the posterior probability of observing an adjacency in the ancestor based on an extended Jukes-Cantor model for breakpoints. With the obtained adjacency probabilities, it then uses a greedy heuristic to find a valid gene order for each ancestor. Later, Hu et al. proposed a faster and more accurate method, PMAG [7]. Although PMAG also computes the probabilities of adjacencies and uses the same greedy heuristic to assemble gene orders, it avoids the analysis of predecessor and successor relationships, and directly calculates the probabilities for only the subset of adjacencies that appear in leaf nodes. However, both methods are unable to handle datasets with indels, and the greedy heuristic often returns an excessive number of contigs (fragments of chromosomes) when some adjacencies have equally high probabilities but conflict with each other.

In the past few years, several methods have been proposed to accommodate datasets with unequal gene content [8], [9], [10]. Among them, the most recent method, GapAdj [10], uses another scoring mechanism for gene adjacencies and reduces the assembly problem to an instance of TSP. To filter out less reliable adjacencies, it introduces a cutoff value and removes adjacencies with scores below it from the TSP solution. Further, by treating a pair of genes separated by up to a given number of genes as a direct gene adjacency, contigs are iteratively combined into longer ones. Compared to InferCars [11], GapAdj produces a number of contigs that is closer to the actual number of chromosomes, at the cost of accuracy. Through a natural process for the inference of ancestral gene contents described in [12], GapAdj also supports the analysis of unequal gene contents.

4 ALGORITHM DETAILS

Given a phylogeny, our new method computes the gene content and ordering of ancestral (internal) nodes one at a time. Prior to the inference of a target ancestral node, we reroot the given phylogenetic tree at that node, so that it becomes the root of the new tree. The underlying rationale is that the calculation of probabilities proceeds bottom-up and only the species in the subtree of the target node are considered; rerooting therefore prevents loss of information. As a standard procedure, rerooting has already found use in ancestral genome reconstruction [6], [7].

After rerooting, PMAG+ proceeds in the following three steps: 1) inferring the gene content of the target node to determine which genes should appear; 2) computing the probabilities of gene adjacencies; 3) forming and solving a TSP to place genes on chromosomes. The following subsections describe these steps in detail.

4.1 Inference of Ancestral Gene Contents

The very first step of ancestral reconstruction often involves explicitly estimating the gene content of ancestral nodes, using content information from the leaves. A number of approaches
have been developed and most of them are similar in spirit to the Fitch-Hartigan parsimony algorithm [4], [12], [13]. For pure rearrangements, every gene observed in leaf species should also be present in all ancestors; however, in the presence of gene indels, this correspondence no longer holds and a gene can be either present or absent in an ancestor. Our inference of ancestral contents therefore relies on viewing genes as independent characters (with binary states); we can then determine the state of every gene in the ancestor.

The first step involves encoding the gene contents of leaf species into binary sequences. In particular, suppose a dataset $G$ with $N$ species is given and a set of $n$ distinct genes $S = \{g_1, g_2, \ldots, g_n\}$ is identified from $G$. For each leaf species $G_i$, its gene content $S_i = \{g_1, \ldots, g_k\}$ with $k \le n$ can be equivalently represented by a sequence $\sigma_i = \{\sigma_1, \sigma_2, \ldots, \sigma_n\}$ in which each element has two states: if $g_j \in S_i$, then $\sigma_j = 1$; otherwise $\sigma_j = 0$, for all $j$ ($1 \le j \le n$). For instance (Table 1), a total of five distinct genes $\{a, b, c, d, e\}$ can be identified from two toy species $G_1$ and $G_2$ with gene orders $(+a, -c, +d)$ and $(+b, +a, -e)$ respectively.

TABLE 1: Example of binary encoding on gene content.

        a   b   c   d   e
  G1    1   0   1   1   0
  G2    1   1   0   0   1

Many methods are available to infer ancestral states from binary characters, including RAxML [14] for maximum likelihood and PAUP [15]. In this study, we chose RAxML (version 7.2.8 was used to produce the results given in this paper) to conduct the inference of states. Once the probabilities of the presence state, $P = \{p_1, p_2, \ldots, p_n\}$, have been computed for the root node, gene $i$ belongs to the gene content of the root, $S_{root}$, if $p_i \ge 0.5$; otherwise, gene $i$ is not in $S_{root}$. Following this paradigm, gene contents for all ancestral nodes can be separately inferred from leaf species. Our simulations show that this approach can estimate gene contents with less than 1% error even for very difficult datasets.

4.2 Inferring the Probabilities of Ancestral Gene Adjacencies

In [7], we presented an adjacency-based method in a probabilistic framework, called PMAG, to calculate the probability of observing an adjacency in the target ancestral node. The method proceeds in the following three main steps.

Step 1. Each species in the dataset is screened to identify all unique gene adjacencies and telomeres. By viewing each adjacency and telomere as an independent character with binary states (presence or absence), the gene orders of species can be rigorously encoded into aligned sequences of binary characters.

Step 2. The phylogenetic tree is rerooted at the target ancestral node in order to take all leaf species into consideration. At the same time, the $2n$ ratio for base compositions is set up such that the rate of presence-to-absence transitions is roughly $2n$ times as high as the rate of transitions in the other direction under the same evolutionary distance, where $n$ is the number of genes. Such a model has been used successfully for phylogeny reconstruction [16].

Step 3. The probabilities of character states for all gene adjacencies and telomeres at the root node are computed. The marginal ancestral reconstruction approach suggested by Yang [17] for molecular data was adopted and extended to compute for [...]

PMAG+ reuses the three steps described above to calculate probabilities for adjacencies and telomeres. Once these probabilities are obtained, it uses the following step to connect gene adjacencies and telomeres into contigs, from which the ancestral gene ordering can be identified.

4.3 Assembling Ancestral Adjacencies into Ancestral Gene Orders

The last step is to assemble gene adjacencies and telomeres into a valid gene order, with respect to the gene content inferred in the first step. In general, a higher probability of the presence state implies that an adjacency or telomere is more likely to be included in the ancestor; however, the decision to choose an adjacency or telomere cannot be made on its own probability alone, as each gene can only be selected once.

In PMAG, ancestral adjacencies are assembled by the greedy heuristic based on the adjacency graph proposed by Ma et al. This greedy method starts a contig with the first gene and picks its neighbor using the adjacency with the highest probability; it then continues adding new genes until there is no more valid connection, in which case the current contig is closed and a new one is started. Two issues with this approach motivated us to replace the greedy assembler with an exact solver. First, the greedy heuristic achieves a good approximation only when the genomes in the dataset are closely related, in which case most vertices in the graph have only one outgoing edge. Second, the greedy heuristic tends to return an excessive number of contigs, as it frequently runs into dead ends.

Obtaining gene orders from (conflicting) adjacencies can be transformed into an instance of the symmetric Traveling Salesman Problem (TSP), as shown in [10], [18]. In this case, we transform genes into cities and adjacency probabilities into edge weights in the TSP graph. In particular, suppose that for the target ancestral node $I$ we have identified a set of $m$ adjacencies $A = \{a_1, a_2, \ldots, a_m\}$ and $n$ telomeres $T = \{t_1, t_2, \ldots, t_n\}$ from leaf species. If the gene content of $I$ has been inferred as $S_I = \{g_1, g_2, \ldots, g_k\}$ and the probabilities $P = \{p_{a_1}, \ldots, p_{a_m}, p_{t_1}, \ldots, p_{t_n}\}$ for each adjacency and telomere are known, we can create the TSP graph $G$ as follows:

1) Each gene $g_i \in S_I$ is represented by two vertices, its head and its tail, denoted $g_i^h$ and $g_i^t$ respectively. Every extremity in a telomere $t_i \in T$ is represented by a unique vertex $e_i$, where $1 \le i \le n$. In this way, the total number of vertices in the graph is equal to $2k + n$.
2) Edges between the head and tail of the same gene, $(g^h, g^t)$, are added with weight $-\infty$ to guarantee that this connection is present in the solution. Edges with weight $-\infty$ are also established for all pairs of extremities $(e_i, e_j)$ where $i \ne j$ and $1 \le i, j \le n$.
3) For every adjacency $(f, g) \in A$, the corresponding edge is added to $G$ connecting $f^t$ and $g^h$. Similarly, for the other combinations of orientations $(-f, g)$, $(f, -g)$ and $(-f, -g)$, we add $(f^h, g^h)$, $(f^t, g^t)$ and $(f^h, g^t)$ respectively.
4) For every telomere $(e, g) \in T$, we add an edge to $G$ between $e$ and $g^h$. In the case of $(g, e)$, an edge between $g^t$ and $e$ is added.
5) For the rest of the edges in $G$, we set the edge weights to $+\infty$ to exclude them from the solution.

As the inferred probabilities range from 0 to 1, using them directly as edge weights may introduce undesirable effects associated with handling small floating-point values. It is critical for TSP to have a precise and fine-grained set of edge weights to assure the quality of its solution. The most straightforward way is to make the edge weight a linear function of its probability; however, in that case the differences in weight between adjacencies are too strong and adjacencies with smaller probabilities can hardly be considered. We therefore use the following equation to curve the probabilities into edge weights:

$$w_{(f,g)}(m) = \log_2\left(10^{m}\,\bigl(1 - p_{(f,g)}\bigr)\right) \qquad (1)$$

where $(f, g) \in \{A \cup T\}$ and $p_{(f,g)}$ is the probability of observing $(f, g)$. Here $m$ is the sole parameter determining the shape of the curve and, according to our experiments, TSP yields good results when $m = 6$.

We then use Concorde [19], one of the most widely used TSP solvers, to find the optimal path that traverses every vertex once with the minimum total score. In the solution path, multiple contiguous extremities are shrunk to a single one, and a gene segment between two extremities is taken as a contig. Our construction of the TSP topology is similar in spirit to GapAdj; however, GapAdj requires additional procedures and parameters to adjust the contig number. In contrast, our inference of the ancestral genome is uniform and comes directly from the solution of TSP, minimizing the risk of introducing artifacts.

5 RESULTS

5.1 Experimental Design

To evaluate the performance of PMAG+, we ran a series of experiments on simulated datasets under a wide variety of settings. We generated model topologies from uniformly distributed binary trees, each with $s$ species. An initial gene order of $n$ distinct genes and $m$ chromosomes was assigned at the root and evolved down to the leaves following the tree topology, mimicking the natural process of evolution by carrying out a set of predefined evolutionary events. We used different evolutionary rates $r$ with 50% relative fluctuation; thus the actual number of events per edge lies in the interval $\left[\tfrac{rn}{2},\, rn\right]$. Several evolutionary events were considered (inversions, translocations and indels), and each kind of event was assigned a probability of being selected during the simulation process.

In this paper, we only present results with 20 genomes, each with 1000 genes and 5 chromosomes, to closely mimic bacterial genomes. The evolutionary rates $r$ were set from 50 to 200 events, the latter representing highly disturbed datasets. For each combination of evolutionary events, we simulated 10 datasets and reported averages and standard deviations.

Our predicted ancestral genomes are evaluated by the ratio of correct adjacencies and telomeres recovered. Specifically, we used the following equation to compute the error rate of reconstruction:

$$E = \left(1 - \frac{|D \cap D'|}{|D \cup D'|}\right) \times 100\%$$

where $D$ represents the set of gene adjacencies and telomeres in the real genome and $D'$ the corresponding set in the predicted genome. We further refer to an element that is contained in the inferred set $S'$ but not in the true set $S$ as a false positive (FP); a false negative (FN) is defined similarly, by swapping $S$ and $S'$.

5.2 Assessing the Accuracy of Ancestral Gene Contents

We first ran simulations to test PMAG+ on the inference of ancestral gene contents. Each gene order, derived from its direct ancestor through a number of events, underwent random indels and inversions (the two boundaries of each inversion are uniformly distributed). Two different probabilities of occurrence for indels (5% and 10%) were used. We compared each inferred gene content with its corresponding true content and counted the number of FPs and FNs. For each dataset, we summed the number of FPs and FNs over all internal nodes and divided it by the total number of genes in all ancestral nodes that are missing or inserted.

Figure 1 shows our results. In this figure, the FP rates are always extremely low (only one dataset produced FPs), indicating that our inference avoids introducing erroneous gene content and that the inferred contents are reliable. FN rates increase slightly when more indel operations are performed, but even in the worst case the error rate stays below 1%. At the same time, we ran GapAdj without specifying any WGD node and set the cut-off value and maximal iterations to 0.6 and 25 as suggested. According to the results, GapAdj failed to infer a large portion of inserted genes, making the FP rates in all cases higher than 60%.

[Figure 1, plots not reproduced. Panels: (a) 5% Gene Insertion and Deletion; (b) 10% Gene Insertion and Deletion. x-axis: Evolutionary Rates (%), at 5, 10, 15, 20; y-axis: FP and FN rate (%). Values annotated in the plots, in x-axis order: (a) total inserted or deleted genes 500, 1045, 1370, 1599; False Positive 0, 0, 0, 0; False Negative 2.6, 3.4, 5.5, 7.2. (b) total inserted or deleted genes 951, 1807, 2432, 3231; False Positive 0, 1.4, 0, 0; False Negative 3.7, 9.6, 14.8, 21.8.]

Fig. 1: FP and FN rates (divided by the numbers on the upper x-axis) with standard deviations under various evolutionary rates and indel rates. Labels on the upper x-axis represent the total number of genes that are inserted or deleted over all internal nodes due to indel operations. Numbers above points indicate the actual number of errors on average.

5.3 Assessing the Accuracy of Ancestral Gene Orders

We conducted several tests to evaluate the accuracy of PMAG+ under different degrees of indels. Our first test compares PMAG+ with the current standard approach, which reduces the dataset to equal content by eliminating genes that are not present in every genome; this forms the baseline method (named PMAG+-Base). Our second test gives PMAG+ the "ground truth" content (named PMAG+-True) to eliminate all impact from gene contents. To compare the greedy heuristic with the TSP solution, we switched back to the greedy heuristic and redid the tests (named PMAG+-Greedy). Finally, the results of GapAdj (which is the most recent method to our knowledge) are reported. For a fair comparison, we also compared PMAG+ with GapAdj using datasets without indel operations.

[Figure 2, plots not reproduced. Panels: (a) 90% Inv and 10% Tsl, comparing PMAG+, PMAG+-Greedy and GapAdj; (b) 5% Ins and Del, 80% Inv and 10% Tsl, comparing PMAG+, PMAG+-True, PMAG+-Base, PMAG+-Greedy and GapAdj; (c) 10% Ins and Del, 70% Inv and 10% Tsl, same methods as (b); (d) Running time of tests in (a). x-axis: Evolutionary Rates (%), at 5, 10, 15, 20; y-axis: Error Rate (%) for (a) to (c), Running Time (s) for (d).]

Fig. 2: (a), (b) and (c) summarize the error rates under various evolutionary rates and combinations of evolutionary events (Ins for insertion, Del for deletion, Inv for inversion and Tsl for translocation). (d) shows the running time for the methods in (a). Error bars indicate the standard deviations.

The evaluation of the designed experiments in terms of error rates is shown in Figure 2. In the figure, the error rates of both PMAG+ and PMAG+-True are the lowest in all cases and the two approaches are almost indistinguishable, indicating that the errors introduced by a very limited amount of false content are not significant. As expected, PMAG+-Base recovered the fewest adjacencies due to the loss of contents. GapAdj, due to its failure in gene content inference, showed much higher error rates in the presence of indels. Even in the test with equal gene content, PMAG+ still outperforms GapAdj with around 5% higher accuracy. PMAG+-Greedy came very close to PMAG+; however, in all tests PMAG+ returned a more accurate reconstruction than PMAG+-Greedy, suggesting the usefulness of our TSP assembler.

Using different degrees of indels has little impact on the performance of PMAG+. From the perspective of adjacency evolution, an inversion always breaks two extant adjacencies and creates two new ones, and the disturbance of adjacencies introduced by an indel operation is essentially very similar. In particular, a deletion breaks two adjacencies and creates a new one, while an insertion breaks one adjacency and introduces two new adjacencies. Therefore, as long as ancestral gene contents can be accurately predicted, PMAG+ returns comparable results for all combinations of evolutionary events.

Figure 2(d) summarizes the running time of all methods. PMAG+-Greedy, benefiting from the greedy heuristic, is indeed slightly faster than PMAG+, while GapAdj, which solves the TSP heuristically, took longer to finish than PMAG+ using an exact solver.

5.4 Assessing the Number of Inferred Contigs

In [7], PMAG was tested only with unichromosomal genomes, but the inferred ancestral genomes were always composed of a large number of contigs. GapAdj uses a series of algorithms with two arguments to reconnect contigs into chromosomes under the restriction of local and small evolutionary operations. Our method PMAG+, on the other hand, by treating telomeres as a special type of adjacency, simultaneously finds the best set of adjacencies and telomeres in one step.

Translocation operations account for inter-chromosomal rearrangements, and each can be viewed equivalently as a fission followed by a fusion; thus all ancestors should have the same number of chromosomes as the root node, which is 5 in our test cases. For each dataset with $N$ ancestors, the number of contigs $c_i$ ($1 \le i \le N$) in each ancestor was counted, and the average absolute difference per ancestral node, $\frac{1}{N}\sum_{i=1}^{N} |c_i - 5|$, was computed to assess the accuracy of chromosomal assembly.

[Figure 3, plots not reproduced. Panels: (a) 0% Gene Insertion and Deletion; (b) 10% Gene Insertion and Deletion, each comparing PMAG+, PMAG+-Greedy and GapAdj. x-axis: Evolutionary Rates (%), at 5, 10, 15, 20; y-axis: Average Absolute Differences per Node.]

Fig. 3: The average of absolute differences per ancestral node produced by various methods. Error bars indicate the standard deviations.

Figure 3 summarizes our findings. As predicted, the number of contigs produced by PMAG was unrelated to the true number of chromosomes, while GapAdj does remove a large portion of redundant contigs. In comparison, the number of contigs returned by PMAG+ closely reflects the actual number of chromosomes in the true genomes.

6 CONCLUSIONS

In this study, we proposed a new adjacency-based method called PMAG+ to infer ancestral gene orders under a more general model of gene evolution, including intra-chromosomal and inter-chromosomal rearrangements as well as gene insertions and deletions. As real ancestors are unknown, we tested our method through a series of simulation studies. According to the results, PMAG+ can accurately deduce the ancestral gene contents with error rates below 1%. In the subsequent inference of ancestral gene orders, PMAG+ outperformed the methods it was compared against. Also, by adopting a TSP solution for adjacency assembly, PMAG+ not only overcame the problem of producing excessive contigs, but also achieved better performance than PMAG.

7 ACKNOWLEDGMENT

FH, JZ, LZ and JT are supported by grants US NSF #0904179 and #1161586.
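For readers who want to experiment with the quantities used in this paper, the following Python sketch illustrates the edge-weight transform of Eq. (1), the error rate E of Section 5.1, and the contig-count deviation of Section 5.4. It is a minimal sketch under stated assumptions, not the PMAG+ implementation: the function names are hypothetical, the exponent form of Eq. (1) follows the formula as printed in the source, and mapping a probability of exactly 1 to negative infinity is a choice made here because the logarithm is undefined at zero.

```python
import math
from typing import Hashable, Iterable, Set


def edge_weight(p: float, m: int = 6) -> float:
    """Eq. (1): w = log2(10**m * (1 - p)); lower weight means more likely.

    Assumption: p == 1 (a certain adjacency) is mapped to -inf.
    """
    q = 1.0 - p
    if q <= 0.0:
        return -math.inf
    return math.log2((10 ** m) * q)


def error_rate(true_set: Set[Hashable], predicted_set: Set[Hashable]) -> float:
    """E = (1 - |D ∩ D'| / |D ∪ D'|) * 100, in percent."""
    union = true_set | predicted_set
    if not union:
        return 0.0  # assumption: two empty genomes count as a perfect match
    return (1.0 - len(true_set & predicted_set) / len(union)) * 100.0


def contig_deviation(contig_counts: Iterable[int], true_chromosomes: int = 5) -> float:
    """Average absolute difference between contig counts and the true count."""
    counts = list(contig_counts)
    return sum(abs(c - true_chromosomes) for c in counts) / len(counts)


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9, 0.999):
        print(f"p={p}: weight={edge_weight(p):.2f}")
    print(error_rate({"ab", "bc", "cd"}, {"bc", "cd", "de"}))
    print(contig_deviation([5, 7, 4]))
```

Because the weight decreases as the probability grows, a minimum-weight tour prefers high-probability adjacencies, while the logarithm keeps low-probability ones from being ruled out entirely.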
REFERENCES

[1] S. Yancopoulos, O. Attie and R. Friedberg: Efficient sorting of genomic permutations by translocation, inversion and block interchange. Bioinformatics 21(16):3340-3346, 2005.
[2] B. Moret, S. Wyman, D. Bader, T. Warnow and M. Yan: A new implementation and detailed study of breakpoint analysis. In Proc. 6th Pacific Symp. Biocomputing (PSB'01), 583-594, 2001.
[3] G. Bourque and P. Pevzner: Genome-scale evolution: reconstructing gene orders in the ancestral species. Genome Research 12:26-36, 2002.
[4] J. Tang, B. Moret, L. Cui and C. dePamphilis: Phylogenetic reconstruction from arbitrary gene-order data. In Proc. 4th IEEE Symp. on Bioinformatics and Bioengineering (BIBE'04), 592-599, 2004.
[5] Y. Zhang, F. Hu and J. Tang: Phylogenetic reconstruction with gene rearrangements and gene losses. 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM'10), 35-38, 2010.
[6] J. Ma: A probabilistic framework for inferring ancestral genomic orders. 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM'10), 179-184, 2010.
[7] F. Hu, L. Zhou and J. Tang: Reconstructing Ancestral Genomic Orders Using Binary Encoding and Probabilistic Models. 9th International Symposium on Bioinformatics Research and Applications (ISBRA), 17-27, 2013.
[8] J. Ma, A. Ratan, B. Raney, B. Suh, W. Miller and D. Haussler: The infinite sites model of genome evolution. Proceedings of the National Academy of Sciences 105(38):14254-14261, 2008.
[9] S. Berard, C. Gallien, B. Boussau, G. Szollosi, V. Daubin and E. Tannier: Evolution of gene neighborhoods within reconciled phylogenies. Bioinformatics 28(18):i382-i388, 2012.
[10] Y. Gagnon, M. Blanchette and N. El-Mabrouk: A flexible ancestral genome reconstruction method based on gapped adjacencies. BMC Bioinformatics 13(Suppl 19):S4, 2012.
[11] J. Ma, L. Zhang, B. Suh, B. Raney, R. Burhans, W. Kent, M. Blanchette, D. Haussler and W. Miller: Reconstructing contiguous regions of an ancestral genome. Genome Research 16(12):1557-1565, 2006.
[12] J. Gordon, K. Byrne and K. Wolfe: Additions, losses, and rearrangements on the evolutionary route from a reconstructed ancestor to the modern Saccharomyces cerevisiae genome. PLoS Genetics 5(5):e1000485, 2009.
[13] V. Kunin and C. Ouzounis: GeneTRACE: reconstruction of gene content of ancestral species. Bioinformatics 19(11):1412-1416, 2003.
[14] A. Stamatakis: RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22(21):2688-2690, 2006.
[15] D. Swofford: PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. 2003.
[16] Y. Lin, F. Hu, J. Tang and B. Moret: Maximum Likelihood Phylogenetic Reconstruction From High-resolution Whole-genome Data And A Tree Of 68 Eukaryotes. Pacific Symposium on Biocomputing (PSB'13), 285-296, 2013.
[17] Z. Yang, S. Kumar and M. Nei: A new method of inference of ancestral nucleotide and amino acid sequences. Genetics 141(4):1641-1650, 1995.
[18] J. Tang and L. S. Wang: Improving Genome Rearrangement Phylogeny Using Sequence-Style Parsimony. Proc. 5th IEEE Symp. on Bioinformatics and Bioengineering (BIBE'05), 137-144, 2005.
[19] D. Applegate, R. Bixby, V. Chvatal and W. Cook: Concorde TSP solver. URL: http://www.math.uwaterloo.ca/tsp/concorde/ (2011).

Fei Hu received his bachelor's degree in biomedical engineering from the Hua Zhong University of Science and Technology. His research interests are mainly in phylogenetic reconstruction and the inference of ancestral genomes using gene-order data.

Jun Zhou completed his bachelor's degree in Biotechnology in 2008 at Nan Jing University, China. He had his first contact with bioinformatics in 2012, when he started working in the computer science department on a project on inferring ancestral genome information. He is currently a Ph.D. student in the computer science department, University of South Carolina, studying the small phylogeny problem.

Lingxi Zhou is a Ph.D. candidate in computer science and engineering, supervised by Dr. Jijun Tang at the bioinformatics lab of the University of South Carolina. Before that, he received his B.S. degree from the college of computer science and technology of Jilin University in July 2011.

Jijun Tang obtained his Ph.D. from the University of New Mexico in 2004. He is now an associate professor in Computer Science and Engineering, University of South Carolina, USA. He is also an adjunct professor in the School of Computer Science and Technology, Tianjin University, China. His main research area is computational biology, with a focus on algorithm development in phylogenetic reconstruction from genome rearrangement data.