Phylogenetic analysis: Phylogenetics


felicity
Uploaded On 2023-07-27



Presentation Transcript

1. Phylogenetic analysis

2. Phylogenetics
Phylogenetics is the study of the evolutionary history of living organisms, using treelike diagrams to represent the pedigrees of these organisms.
The tree branching patterns that represent the evolutionary divergence are referred to as a phylogeny.

3. Studying phylogenetics
Fossil records – morphological information, available only for certain species; the data can be fragmentary, morphological traits are ambiguous, and the fossil record is nonexistent for microorganisms.
Molecular data (molecular fossils) – more numerous than fossils, easier to obtain, and the favored source for reconstructing evolutionary history.
http://www.agiweb.org/news/evolution/fossilrecord.html

4. Tree of life
http://tikalon.com/blog/blog.php?article=2011/domains

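5. DNA sequence evolution
[Figure: a 7-nucleotide sequence diverging along a tree, with a time axis marked -3 mil yrs, -2 mil yrs, -1 mil yrs, today. Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, AAGGCCT, TGGACTT, AGCGCTT, AGCACAA, TAGACTT, TAGCCCA, AGGGCAT]
www.cs.utexas.edu/users/tandy/CSBtutorial.ppt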

6. Tree terminology
Ancestral node, or root of the tree
Internal nodes, or divergence points (represent hypothetical ancestors of the taxa)
Branches
Terminal nodes – taxa (singular: taxon), here A, B, C, D, E
Based on lectures by Tal Pupko

7. Monophyletic (clade) – a taxon that is derived from a single ancestral species.
Polyphyletic – a taxon whose members were derived from two or more ancestors not common to all members.
Paraphyletic – a taxon that excludes some members that share a common ancestor with members included in the taxon.

8. dichotomy – all branches bifurcate
polytomy – a node with more than two descendant branches, either because a taxon gave rise to more than two descendants or because the phylogeny is unresolved (the exact order of bifurcations cannot be determined)

9. unrooted – no knowledge of a common ancestor; shows the relative relationships of the taxa but no direction of the evolutionary path
rooted – includes the common ancestor and therefore the direction of evolution, which makes it more informative

10. Finding a true tree is difficult
Correct reconstruction of the evolutionary history means finding the correct tree topology with the correct branch lengths.
The number of potential tree topologies can be enormous even for a moderate number of taxa (NR = number of rooted trees, NU = number of unrooted trees):
6 taxa … NR = 945, NU = 105
10 taxa … NR = 34 459 425, NU = 2 027 025
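The counts on this slide can be reproduced with double factorials. The sketch below assumes the standard formulas for strictly bifurcating trees, (2n - 3)!! rooted and (2n - 5)!! unrooted topologies for n taxa; the function names are illustrative only.

```python
def double_factorial(k: int) -> int:
    """Product k * (k - 2) * (k - 4) * ... down to 1 (1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_rooted(n_taxa: int) -> int:
    # (2n - 3)!! rooted bifurcating topologies, valid for n >= 2
    return double_factorial(2 * n_taxa - 3)


def n_unrooted(n_taxa: int) -> int:
    # (2n - 5)!! unrooted bifurcating topologies, valid for n >= 3
    return double_factorial(2 * n_taxa - 5)


for n in (6, 10):
    print(n, "taxa: NR =", n_rooted(n), " NU =", n_unrooted(n))
```

For 6 and 10 taxa this gives the NR and NU values quoted on the slide.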

11.

12. Rooting the tree
To root a tree mentally, imagine that the tree is made of string. Grab the string at the root and tug on it until the ends of the string (the taxa) fall opposite the root.
[Figure: an unrooted tree of taxa A, B, C, D with the root position marked, and the resulting rooted tree]
Note that in this rooted tree, taxon A is no more closely related to taxon B than it is to C or D.

13. Now, try it again with the root at another position.
[Figure: the same unrooted tree of taxa A, B, C, D with the root at a different position, and the resulting rooted tree]
Note that in this rooted tree, taxon A is most closely related to taxon B, and together they are equally distantly related to taxa C and D.

14. The unrooted tree 1 and its rootings
[Figure: unrooted tree 1 of taxa A, B, C, D with root positions 1 to 5, and the corresponding rooted trees 1a, 1b, 1c, 1d, 1e]
An unrooted, four-taxon tree theoretically can be rooted in five different places to produce five different rooted trees.
These trees show five different evolutionary relationships among the taxa!

15. Rooting the tree
outgroup – taxa that are known to fall outside of the group of interest (the “ingroup”). Requires some prior knowledge about the relationships among the taxa.
The outgroup can either be species (e.g., birds to root a mammalian tree) or previous gene duplicates (e.g., α-globins to root β-globins).

16. Rooting the tree
midpoint rooting approach – roots the tree at the midway point between the two most distant taxa in the tree, as determined by branch lengths. Assumes that the taxa are evolving in a clock-like manner.
[Figure: tree of taxa A, B, C, D with branch lengths 10, 2, 3, 5, 2]
d(A,D) = 10 + 3 + 5 = 18
Midpoint = 18 / 2 = 9
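The midpoint arithmetic can be sketched in a few lines. This assumes the path between the two most distant taxa is already known and is given as an ordered list of branch lengths (here 10, 3, 5 from A to D); `midpoint_on_path` is a hypothetical helper, not part of any package.

```python
def midpoint_on_path(branch_lengths):
    """Locate the midpoint of a path between two taxa.

    Returns (branch_index, offset), where offset is the distance from the
    start of that branch (measured from the first taxon's side).
    """
    half = sum(branch_lengths) / 2
    walked = 0.0
    for index, length in enumerate(branch_lengths):
        if walked + length >= half:
            return index, half - walked
        walked += length
    raise ValueError("empty path")


path_a_to_d = [10, 3, 5]            # branch lengths from A to D on the slide
print(sum(path_a_to_d) / 2)         # midpoint distance from A
print(midpoint_on_path(path_a_to_d))
```

With these lengths the root falls on A's own branch, 9 units away from A.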

17. Molecular clock
This concept was proposed by Emil Zuckerkandl and Linus Pauling (1962) as well as by Emanuel Margoliash (1963).
For every given gene (or protein), the rate of molecular evolution is approximately constant.
Pioneering study by Zuckerkandl and Pauling
They observed the number of amino acid differences between human globins: β and δ (~6 differences), β and γ (~36 differences), α and β (~78 differences), and α and γ (~83 differences).
They could also compare human to gorilla (both β and α globins), observing 2 and 1 differences, respectively.
They knew from fossil evidence that humans and gorillas diverged from a common ancestor about 11 MYA.
Using this divergence time as a calibration point, they estimated that the gene duplication giving rise to β and δ occurred 44 MYA; β and γ derived from a common ancestor 260 MYA; α and β 565 MYA; and α and γ 600 MYA.
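The clock logic is a linear extrapolation from the calibration point. The sketch below assumes the calibration uses the average of the two human-gorilla counts (1.5 differences per 11 million years); that averaging is an assumption made here for illustration, chosen because it reproduces the 44 MYA figure, and it is not stated on the slide.

```python
CALIBRATION_MYA = 11.0
CALIBRATION_DIFFS = (2 + 1) / 2   # assumption: mean of the beta and alpha globin counts


def divergence_time(n_differences):
    """Estimated divergence time (MYA), assuming a constant substitution rate."""
    return n_differences / CALIBRATION_DIFFS * CALIBRATION_MYA


observed = {"beta/delta": 6, "beta/gamma": 36, "alpha/beta": 78, "alpha/gamma": 83}
for pair, diffs in observed.items():
    print(pair, round(divergence_time(diffs)), "MYA")
```

Under this assumption the other three estimates land close to, but not exactly on, the values quoted on the slide, which is expected since the slide gives approximate difference counts.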

18. Gene phylogeny vs. species phylogeny
Main objective of building phylogenetic trees based on molecular sequences: reconstruct the evolutionary history of the species involved.
A gene phylogeny only describes the evolution of that particular gene or encoded protein. This sequence may evolve more or less rapidly than other genes in the genome.
The evolution of a particular sequence does not necessarily correlate with the evolutionary path of the species.
Branching point in a species tree – the speciation event
Branching point in a gene tree – which event? The two events may or may not coincide.
To obtain a species phylogeny, phylogenetic trees from a variety of gene families need to be constructed to give an overall assessment of the species evolution.

19. Closest living relatives of humans?

20. Closest living relatives of humans?
[Figure: two trees of humans, bonobos, chimpanzees, gorillas and orangutans on MYA time axes, one running from 0 to 15-30 MYA and one from 0 to 14 MYA]
Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization all show that bonobos and chimpanzees are related more closely to humans than to gorillas.
The pre-molecular view was that the great apes (chimpanzees, gorillas and orangutans) formed a clade separate from humans, and that humans diverged from the apes at least 15-30 MYA.

21. [Figure: tree of Orangutan, Gorilla, Chimpanzee, Human]
From the Tree of Life Website, University of Arizona

22. Forms of tree representation
phylogram – branch lengths represent the amount of evolutionary divergence
cladogram – external taxa line up neatly, only the topology matters

23. ((A,(B,C)),(D,E)) = the phylogeny as nested parentheses
[Figure: tree with tips Taxon A, Taxon B, Taxon C, Taxon E, Taxon D]
There is no meaning to the spacing between the taxa, or to the order in which they appear from top to bottom.
The dimension along the branches either can have no scale (for ‘cladograms’), can be proportional to genetic distance or amount of change (for ‘phylograms’), or can be proportional to time (for ‘ultrametric trees’ or true evolutionary trees).
The nested parentheses say that B and C are more closely related to each other than either is to A, and that A, B, and C form a clade that is a sister group to the clade composed of D and E. If the tree has a time scale, then D and E are the most closely related.

24. Newick format
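The nested-parentheses notation from the previous slide, ((A,(B,C)),(D,E)), is the core of the Newick format. As a minimal sketch, a recursive parser can turn such a string into nested tuples. It assumes well-formed input with leaf names only; branch lengths, internal-node labels and quoted names, which the full format allows, are deliberately not handled, and `parse_newick` is a hypothetical name.

```python
def parse_newick(text):
    """Parse a topology-only Newick string into nested tuples of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                          # consume "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1                      # consume ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1                          # consume ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1                          # read a leaf name
        return text[start:pos].strip()

    tree = parse_node()
    if pos != len(text):
        raise ValueError("trailing characters after the tree")
    return tree


print(parse_newick("((A,(B,C)),(D,E));"))
```

The result mirrors the clade structure directly: B and C sit in the innermost tuple, and (D, E) is the sister group of (A, (B, C)).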

25. A consensus tree
Built by combining the nodes of several trees:
strict consensus – all conflicting nodes are collapsed into polytomies
majority rule – among the conflicting nodes, those supported by more than 50% of the trees are retained, whereas the remaining nodes are collapsed into polytomies
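A rough sketch of the two consensus rules, assuming rooted input trees written as nested tuples and treating each node as the set of taxa below it. The three input trees are made up for illustration. The code only reports which clades survive; it does not rebuild the consensus tree itself.

```python
from collections import Counter


def clades(tree):
    """Return (leaf_set, clade_list) for a rooted tree of nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def clade_support(trees):
    """Fraction of input trees that contain each clade."""
    counts = Counter()
    for tree in trees:
        counts.update(set(clades(tree)[1]))
    return {clade: n / len(trees) for clade, n in counts.items()}


# Hypothetical input trees (not from the slides)
trees = [
    (("A", "B"), ("C", ("D", "E"))),
    (("A", "B"), (("C", "D"), "E")),
    (("A", "B"), ("C", ("D", "E"))),
]
support = clade_support(trees)
strict = {c for c, s in support.items() if s == 1.0}      # present in every tree
majority = {c for c, s in support.items() if s > 0.5}     # present in > 50% of trees
print(sorted("".join(sorted(c)) for c in strict))
print(sorted("".join(sorted(c)) for c in majority))
```

Here the clade {D, E} appears in two of three trees, so it is kept by majority rule but collapsed under strict consensus.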

26. Procedure
Choice of molecular markers
Multiple sequence alignment
Choice of a model of evolution
Determine a tree building method
Assess tree reliability

27. Choice of molecular markers
Nucleotide or protein sequence data?
Nucleic acid (NA) sequences evolve more rapidly, so they can be used for studying very closely related organisms. E.g., for evolutionary analysis of different individuals within a population, noncoding regions of mtDNA are often used.
Evolution of more divergent organisms – either slowly evolving NA (e.g., rRNA) or protein sequences.
Deepest level (e.g., relationships between bacteria and eukaryotes) – conserved protein sequences

28. Positive and negative selection
synonymous substitution – a nucleotide change that does not change the amino acid sequence (genetic code degeneracy, 3rd codon position)
nonsynonymous substitution – a nucleotide change that changes the amino acid sequence
positive selection – nonsynonymous substitution rate > synonymous rate; certain parts of the protein are undergoing active mutations that may contribute to the evolution of new function
negative selection – synonymous rate > nonsynonymous rate; changes are neutral at the AA level, because the protein sequence is critical enough that changes to it are not tolerated
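To make the synonymous/nonsynonymous distinction concrete, the sketch below classifies codon differences between two aligned coding sequences using the standard genetic code. It only counts differing codons; real analyses estimate rates by normalising against the number of synonymous and nonsynonymous sites, which is not attempted here. The two example sequences are made up.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order at each position
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def count_changes(seq1, seq2):
    """Count differing codons as (synonymous, nonsynonymous)."""
    syn = nonsyn = 0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        codon1, codon2 = seq1[i:i + 3], seq2[i:i + 3]
        if codon1 == codon2:
            continue
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            syn += 1          # same amino acid: silent change
        else:
            nonsyn += 1       # amino acid replaced
    return syn, nonsyn


# Hypothetical sequences: CTT->CTC keeps Leu, GAT->GAA changes Asp to Glu
print(count_changes("CTTGATAAA", "CTCGAAAAA"))
```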

29. MSA
Critical step.
Multiple state-of-the-art alignment programs (e.g., T-Coffee, Praline, Poa, …) should be used.
The alignment results from multiple sources should be inspected and compared carefully to identify the most reasonable one.
Automatic sequence alignments almost always contain errors and should be further edited or refined if necessary – manual editing!
Rascal and NorMD can help to improve alignment by correcting alignment errors and removing potentially unrelated or highly divergent sequences.

30. Model of evolution
A simple measure of the divergence of two sequences is the number of substitutions in the alignment; the distance between two sequences is then the proportion of substitutions.
If A was replaced by C: was it A → C, or A → T → G → C?
Back mutation: G → C → G.
Parallel mutations – both sequences mutate into, e.g., T at the same time.
All of this obscures the estimation of the true evolutionary distances between sequences. This effect is known as homoplasy and must be corrected.
Statistical models infer the true evolutionary distances between sequences.

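31. Model of evolution
Homoplasy is corrected by substitution (evolutionary) models. Many such models exist.
Jukes-Cantor model
d_AB = -(3/4) ln(1 - (4/3) p_AB)
d_AB … distance, p_AB … proportion of substitutions
Example: the alignment of A and B is 20 nucleotides long and 6 pairs are different, so p_AB = 0.3 and d_AB = 0.38.
Kimura model
d_AB = -(1/2) ln(1 - 2 p_ti - p_tv) - (1/4) ln(1 - 2 p_tv)
p_ti … frequency of transitions, p_tv … frequency of transversions
Transition: Y ↔ Y, R ↔ R (pyrimidine to pyrimidine, purine to purine)
Transversion: Y ↔ R, R ↔ Y (pyrimidine to purine, or the reverse)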
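A minimal sketch of the two corrections, assuming the standard closed forms of the Jukes-Cantor and Kimura two-parameter distances. The Jukes-Cantor call uses the slide's own example (6 differences in 20 sites); the Kimura inputs are made-up values for illustration.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance from the proportion of differing sites p (p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(p_ti, p_tv):
    """Kimura two-parameter distance from transition and transversion proportions."""
    return -0.5 * math.log(1 - 2 * p_ti - p_tv) - 0.25 * math.log(1 - 2 * p_tv)


p_ab = 6 / 20
print(round(jukes_cantor(p_ab), 2))      # 0.38, as on the slide

# Hypothetical proportions: 20% transitions, 10% transversions
print(kimura_2p(0.20, 0.10))
```

The corrected distance is always larger than the raw proportion, reflecting the hidden multiple substitutions that the model tries to account for.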

32. Models of amino acid substitutions
Use an amino acid substitution matrix:
PAM
JTT – 1990s, the same methodology as PAM, but with a larger protein database
Protein equivalents of the Jukes–Cantor and Kimura models, e.g.,

33. Among-site variation
Up to now we have assumed that different positions in a sequence evolve at the same rate. In reality, this may not be true.
In DNA, the rates of substitution differ for different codon positions. The 3rd codon position mutates much faster.
In proteins, some AAs change more rarely than others owing to functional constraints.
It has been shown that there is always a proportion of positions in a sequence dataset that have invariant rates and a proportion that have more variable rates.

34. Gamma distribution
The distribution of variant sites follows a gamma distribution pattern.

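35. To account for site-dependent rate variation, a gamma correction factor α is used. α is the shape parameter of the gamma distribution and is derived from statistics.
For the Jukes–Cantor model, the evolutionary distance can be adjusted with the following formula:
d_AB = (3/4) α [(1 - (4/3) p_AB)^(-1/α) - 1]
For the Kimura model, the evolutionary distance becomes
d_AB = (α/2) [(1 - 2 p_ti - p_tv)^(-1/α) + (1/2) (1 - 2 p_tv)^(-1/α) - 3/2]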
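A minimal sketch of gamma-corrected distances, assuming the standard gamma forms of the Jukes–Cantor and Kimura two-parameter distances, with `alpha` as the gamma shape parameter. The input proportions are illustrative. As `alpha` grows large, rate variation vanishes and both functions approach the uncorrected model distances.

```python
def jc_gamma(p, alpha):
    """Gamma-corrected Jukes-Cantor distance (assumed standard form)."""
    return 0.75 * alpha * ((1 - 4 * p / 3) ** (-1 / alpha) - 1)


def k2p_gamma(p_ti, p_tv, alpha):
    """Gamma-corrected Kimura two-parameter distance (assumed standard form)."""
    return (alpha / 2) * (
        (1 - 2 * p_ti - p_tv) ** (-1 / alpha)
        + 0.5 * (1 - 2 * p_tv) ** (-1 / alpha)
        - 1.5
    )


for alpha in (0.5, 1.0, 5.0, 1e6):
    print(alpha, jc_gamma(0.3, alpha), k2p_gamma(0.20, 0.10, alpha))
```

Smaller `alpha` (stronger rate heterogeneity) gives larger corrected distances for the same observed proportion of differences.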

36. Tree building methods
Two major categories:
Distance-based methods – based on the amount of dissimilarity between pairs of sequences, computed on the basis of sequence alignment.
Character-based methods – based on discrete characters, which are molecular sequences from individual taxa.

37. Tree building methods
DATA TYPE     COMPUTATIONAL METHOD
              Optimality criterion                               Clustering algorithm
Characters    Maximum parsimony (MP), Maximum likelihood (ML)    -
Distances     Fitch-Margoliash (FM)                              UPGMA, Neighbor-joining (NJ)

38. Distance based methods
Calculate the evolutionary distances d_AB between sequences using one of the evolutionary models.
Construct a distance matrix – distances between all pairs of taxa.
Based on the distance scores, construct a phylogenetic tree:
clustering algorithms – UPGMA, neighbor joining (NJ)
optimality based – Fitch-Margoliash (FM)
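The first two steps can be sketched as follows, assuming the Jukes-Cantor model as the distance correction. The four 9-nucleotide sequences are borrowed from the parsimony worked example in this transcript purely as sample data; the dictionary layout is an arbitrary choice for illustration.

```python
import math

SEQS = {"1": "AAGAGTGCA", "2": "AGCCGTGCG", "3": "AGATATCCA", "4": "AGAGATCCG"}


def p_distance(a, b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jc_distance(p):
    """Jukes-Cantor correction of a p-distance (requires p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(seqs):
    """All pairwise corrected distances, keyed by (name, name)."""
    names = list(seqs)
    return {
        (m, n): 0.0 if m == n else jc_distance(p_distance(seqs[m], seqs[n]))
        for m in names
        for n in names
    }


matrix = distance_matrix(SEQS)
for (m, n), d in matrix.items():
    if m < n:
        print(m, n, round(d, 3))
```

The resulting matrix is what a clustering or optimality-based method would take as input.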

39. Clustering methods
UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
Hierarchical, agglomerative clustering, also known as average linkage.
Produces a rooted tree (most phylogenetic methods produce unrooted trees).
Basic assumption of the UPGMA method: all taxa evolve at a constant rate and are equally distant from the root, implying that a molecular clock is in effect.
However, real data rarely meet this assumption. Thus, UPGMA often produces erroneous tree topologies.
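A compact sketch of UPGMA as average-linkage clustering. It returns only the rooted topology as nested tuples and omits branch lengths; the input distances are hypothetical and deliberately clock-like, so the method's assumption holds.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster taxa by average linkage.

    distances maps frozenset({x, y}) to the distance between taxa x and y.
    Returns the rooted topology as nested tuples.
    """
    size = {name: 1 for name in names}      # cluster -> number of taxa it holds
    d = dict(distances)
    while len(size) > 1:
        # join the closest pair of current clusters
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        for other in size:
            if other in (a, b):
                continue
            # size-weighted mean, so every original taxon counts equally
            d[frozenset((merged, other))] = (
                size[a] * d[frozenset((a, other))]
                + size[b] * d[frozenset((b, other))]
            ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size))


# Hypothetical clock-like distances
pairs = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
         ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
dist = {frozenset(k): v for k, v in pairs.items()}
print(upgma(["A", "B", "C", "D"], dist))
```

A and B are joined first, then C, then D, which is the expected nesting for these distances.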

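40. Neighbor joining
[Figure: trees of taxa A, B, C, D, E before and after A and B are grouped as A,B]
Distance matrix:
       A     B     C     D     E
A      0     2     3     4     4
B            0     3     4     5
C                  0     3     4
D                        0     5
E                              0
After grouping A and B:
       A,B   C     D     E
A,B    0     2.5   4.5   3.5
C            0     3     4
D                  0     5
E                        0
The Minimum Evolution (ME) criterion: in each iteration we separate from the rest the two sequences that result in the minimal sum of branch lengths.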
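As a sketch of how the pair to join is chosen, the code below applies the commonly used neighbor-joining Q criterion, Q(i,j) = (n - 2) d(i,j) - r_i - r_j with r_i the row sum, to the five-taxon matrix on this slide. The Q formulation is an assumption brought in here; the slide itself only states the minimum-evolution idea. Only the selection step is shown, not the matrix update or branch lengths.

```python
from itertools import combinations

TAXA = ["A", "B", "C", "D", "E"]
# Upper triangle of the five-taxon distance matrix from the slide
D = {
    ("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 4, ("A", "E"): 4,
    ("B", "C"): 3, ("B", "D"): 4, ("B", "E"): 5,
    ("C", "D"): 3, ("C", "E"): 4,
    ("D", "E"): 5,
}


def dist(x, y):
    """Symmetric lookup into the upper-triangular matrix."""
    return 0 if x == y else D.get((x, y), D.get((y, x)))


def nj_pair(taxa):
    """Return the pair of taxa with the smallest Q value."""
    n = len(taxa)
    row_sum = {x: sum(dist(x, y) for y in taxa) for x in taxa}

    def q(pair):
        x, y = pair
        return (n - 2) * dist(x, y) - row_sum[x] - row_sum[y]

    return min(combinations(taxa, 2), key=q)


print(nj_pair(TAXA))
```

For this matrix the criterion selects A and B, the same pair that is grouped on the slide.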

41. Optimality based methods
Clustering methods produce a single tree. There is no criterion for judging how this tree compares with alternative trees.
Optimality based methods have a well-defined algorithm to compare all possible tree topologies and select the tree that best fits the actual evolutionary distance matrix.

42. Distance based – pros and cons
Clustering:
Fast, can handle large datasets
Not guaranteed to find the best tree
UPGMA – assumes a constant rate of evolution of the sequences in all branches of the tree (molecular clock assumption)
NJ – does not assume that the rate of evolution is the same in all branches of the tree
NJ is slower but better than UPGMA
Exhaustive tree searching (Fitch-Margoliash):
Better accuracy, but prohibitive for more than 12 taxa

43. Character based methods
Also called discrete methods.
Based directly on the sequence characters.
They count mutational events accumulated on the sequences and may therefore avoid the loss of information that occurs when characters are converted to distances.
The evolutionary dynamics of each character can be studied.
The two most popular character-based approaches: maximum parsimony (MP) and maximum likelihood (ML) methods.

44. Maximum parsimony
Based on Occam’s razor (William of Occam, 14th century): the simplest explanation is probably the correct one.
This is because the simplest explanation requires the fewest assumptions and the fewest leaps of logic.
A tree with the least number of substitutions is probably the best to explain the differences among the taxa under study.

45. A worked example
     1 2 3 4 5 6 7 8 9
1    A A G A G T G C A
2    A G C C G T G C G
3    A G A T A T C C A
4    A G A G A T C C G
To save computing time, only a small number of sites that have the richest phylogenetic information are used in tree determination.
informative site – a site that has at least two different kinds of characters, each occurring at least twice
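The definition of an informative site translates directly into code. The sketch below scans the columns of the four-sequence alignment from this slide and reports the 1-based positions that qualify; the function name is illustrative.

```python
from collections import Counter

ALIGNMENT = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]


def informative_sites(alignment):
    """1-based positions with at least two characters that each occur twice or more."""
    sites = []
    for position, column in enumerate(zip(*alignment), start=1):
        counts = Counter(column)
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            sites.append(position)
    return sites


print(informative_sites(ALIGNMENT))   # [5, 7, 9]
```

Only positions 5, 7 and 9 qualify, so the tree search can work with three columns instead of nine.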


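47. How many possible unrooted trees?
Informative sites only:
     1 2 3
1    G G A
2    G G G
3    A C A
4    A C G
Tree I: sequences 1 and 2 grouped, 3 and 4 grouped
Tree II: sequences 1 and 3 grouped, 2 and 4 grouped
Tree III: sequences 1 and 4 grouped, 3 and 2 grouped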

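48. [Figure: informative site 1 (G, G, A, A for sequences 1 to 4) mapped onto Tree I, Tree II and Tree III]

49. [Figure: informative site 2 (G, G, C, C for sequences 1 to 4) mapped onto Tree I, Tree II and Tree III]

50. [Figure: informative site 3 (A, G, A, G for sequences 1 to 4) mapped onto Tree I, Tree II and Tree III]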

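51. Number of changes required by each informative site on each tree:
Site           Tree I   Tree II   Tree III
GGAA           1        2         2
GGCC           1        2         2
AGAG           2        1         2
Tree length    4        5         6
[Figure: Tree I with tip sequences GGA, GGG, ACA, ACG, internal-node sequences ACA and GGA, and branch labels 2, 1, 1]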
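The tree lengths in this table can be recomputed with Fitch's small-parsimony algorithm. The sketch below assumes each unrooted four-taxon tree is rooted on its internal branch, which does not change the parsimony length, and uses the three informative sites of sequences 1 to 4.

```python
def fitch(tree, states):
    """Return (possible_states, change_count) for a rooted binary tree of nested tuples."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:                                   # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


# Informative sites (alignment positions 5, 7, 9) for sequences 1 to 4
SITES = {"1": "GGA", "2": "GGG", "3": "ACA", "4": "ACG"}
TREES = {
    "I": (("1", "2"), ("3", "4")),
    "II": (("1", "3"), ("2", "4")),
    "III": (("1", "4"), ("3", "2")),
}


def tree_length(tree):
    """Total number of changes over all informative sites."""
    n_sites = len(next(iter(SITES.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in SITES.items()})[1]
        for i in range(n_sites)
    )


for name, tree in TREES.items():
    print("Tree", name, "length", tree_length(tree))
```

Tree I needs the fewest changes and is therefore the most parsimonious of the three.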

52. Weighted parsimony
The parsimony method discussed so far is unweighted because it treats all mutations as equivalent.
This may be an oversimplification; some mutations are known to occur less frequently than others, for example, transversions versus transitions, or changes at functionally important sites versus neutral sites.
A weighting scheme takes into account the different kinds of mutations.

53.

54. Branch-and-bound
The parsimony method examines all possible tree topologies to find the maximally parsimonious tree.
This is an exhaustive search method, and it is expensive. Number of topologies to examine:
N = 10 … 2 027 025
N = 20 … 2.22 × 10^20
Rationale of branch-and-bound: a maximally parsimonious tree must be equal to or shorter than the distance-based tree.
First build a distance tree using NJ or UPGMA.
Compute the minimum number of substitutions for this tree.
The resulting number defines the upper bound to which any other trees are compared. I.e., when you build a parsimonious tree, you stop growing it when its length exceeds the upper bound.

55. Heuristic methods
When the number of taxa exceeds 20, even branch-and-bound becomes computationally infeasible. Then a heuristic search can be applied.
Both exhaustive search and branch-and-bound lead to the optimum tree.
A heuristic search is not guaranteed to find the optimum and may return a suboptimal tree (compare to BLAST, which is also heuristic).

56. MP – pros and cons
Intuitive – its assumptions are easily understood.
The character-based method is able to provide evolutionary information about the sequence characters, such as information regarding homoplasy and ancestral states.
It tends to produce more accurate trees than the distance-based methods when sequence divergence is low, because this is the circumstance in which the parsimony assumption of rarity in evolutionary changes holds true.
When sequence divergence is high, tree estimation by MP can be less effective, because the original parsimony assumption no longer holds.
Estimation of branch lengths may also be erroneous because MP does not employ substitution models to correct for multiple substitutions.

57. Maximum likelihood – ML
Uses probabilistic models to choose the best tree, the one that has the highest probability (likelihood) of reproducing the observed data.
ML is an exhaustive method that searches every possible tree topology and considers every position in an alignment, not just informative sites.
By employing a particular substitution model that has probability values of residue substitutions, ML calculates the total likelihood of ancestral sequences evolving to internal nodes and eventually to existing sequences.
It sometimes also incorporates parameters that account for rate variations across sites.
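The likelihood idea can be illustrated on the smallest possible case: two aligned sequences and a single parameter, their distance. The sketch below assumes the Jukes-Cantor model and finds the distance that maximises the likelihood by a simple grid search. Likelihood on a full tree additionally sums over the unknown states at internal nodes, which is not shown here.

```python
import math


def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of n_diff differences in n_sites under Jukes-Cantor at distance d."""
    decay = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * decay     # a site keeps its base
    p_diff = 0.25 - 0.25 * decay     # a site shows one specific other base
    # the leading 0.25 is the equilibrium frequency of the base in the first sequence
    return (n_sites - n_diff) * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def ml_distance(n_sites, n_diff, step=0.001, max_d=3.0):
    """Grid-search the distance with the highest likelihood."""
    grid = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc_log_likelihood(d, n_sites, n_diff))


# 6 differences in 20 aligned sites
print(ml_distance(20, 6))
```

For 6 differences in 20 sites the maximum lies near 0.38, which matches the closed-form Jukes-Cantor distance for a proportion of 0.3.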

58. ML – pros and cons
Based on well-founded statistics instead of a medieval philosophy.
More robust; uses the full sequence information, not just informative sites.
Employs a substitution model – a strength, but also a weakness (choosing the wrong model leads to an incorrect tree).
Accurately reconstructs the relationships between sequences that have been separated for a long time.
Very time consuming, considerably more so than MP, which is itself more time consuming than clustering methods.

59. Phylogeny packages
PHYLIP (Phylogenetic Inference Package), by Felsenstein: evolution.genetics.washington.edu/phylip.html. Free!
PAUP (Phylogenetic Analysis Using Parsimony), by Swofford: paup.csit.fsu.edu