Slide 1
Improving gene trees without more data
Master Thesis by Ashu Gupta
Slide 2
Phylogenetic Pipeline
Most common pipeline using a summary method
Step 1: Get gene alignments from sequence data (e.g. PRANK, MAFFT, etc.)
Step 2: Get gene trees from gene alignments (e.g. RAxML, FastTree, etc.)
Step 3: Get species tree from gene trees (e.g. ASTRAL2, ASTRID, NJst, etc.)
Getting good gene trees is a major problem
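As a rough sketch of the data flow through the three steps, assuming each external tool is wrapped in a Python callable: the names `align`, `infer_gene_tree` and `summarize` are hypothetical wrappers introduced here for illustration, not APIs of the tools listed above.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases, for illustration only.
Sequences = Dict[str, str]   # taxon name -> unaligned sequence
Alignment = Dict[str, str]   # taxon name -> aligned sequence
Tree = str                   # a tree written in Newick format


def run_summary_pipeline(
    genes: List[Sequences],
    align: Callable[[Sequences], Alignment],
    infer_gene_tree: Callable[[Alignment], Tree],
    summarize: Callable[[List[Tree]], Tree],
) -> Tree:
    """Chain the three steps; each callable stands in for an external tool."""
    alignments = [align(g) for g in genes]                  # Step 1
    gene_trees = [infer_gene_tree(a) for a in alignments]   # Step 2
    return summarize(gene_trees)                            # Step 3
```

Because the species tree is computed only from the gene trees, any error introduced in Step 2 is passed straight to Step 3.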
Slide 3
Motivation
Most common problem of the phylogenetic pipeline? Poorly estimated gene trees
Summary methods (e.g. ASTRAL2, ASTRID, NJst, etc.) suffer from gene tree error
Why can't we get good gene trees?
Errors in gene alignment
Short alignment length for individual genes
An accurate gene tree is impossible from limited data under certain conditions (e.g. short branches, very long branches, etc.)
Individual genes have low phylogenetic signal
Slide 4
Solution?
Collect more data and re-estimate: costly, time consuming
How to improve an estimated gene tree?
Use data from other estimated gene trees
Co-estimate gene trees and species tree, e.g. *BEAST (too computationally expensive)
Naïve binning (Bayzid et al.)
Statistical Binning (Mirarab et al.)
Weighted Statistical Binning (Bayzid et al.)
Slide 5
Weighted Statistical Binning Pipeline
WSB+CAML
Observations
Concatenation is better than summary methods in low ILS conditions
Concatenation can be used to boost phylogenetic signal among similar gene trees
Idea
Partition genes into disjoint bins using initially estimated gene trees
Genes within a bin are similar to each other (less discord)
Concatenate gene alignments within a bin (supergene alignment for a bin)
Estimate supergene tree using ML-based methods (e.g. RAxML, FastTree)
Use supergene trees as new gene trees
Slide 6
WSB+CAML (contd.)
Incompatibility Graph
[Figure: incompatibility graph over gene trees g1 to g12, built using binning threshold t]
Example with binning threshold 75%: g1 and g2 are trees on taxa A, B, C, D, E with branch supports 65%, 80%, 85% and 60%
Check incompatibility of g1 and g2 after collapsing edges with support below the threshold (only the 80% and 85% edges remain)
g1 and g2 are incompatible: A E D | B C (g1) is incompatible with A C | D E B (g2)
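The check of g1 against g2 can be reproduced with the standard pairwise test for bipartitions: two bipartitions of the same taxon set fit in one tree if and only if at least one of the four intersections between their sides is empty. The sketch below covers only this test for a single pair of bipartitions. The helper names `parse_bipartition` and `compatible` are introduced here for illustration and are not taken from the WSB software.

```python
from typing import FrozenSet, Tuple

Bipartition = Tuple[FrozenSet[str], FrozenSet[str]]


def parse_bipartition(text: str) -> Bipartition:
    """Parse a string such as 'A E D | B C' into two sets of taxa."""
    left, right = text.split("|")
    return frozenset(left.split()), frozenset(right.split())


def compatible(b1: Bipartition, b2: Bipartition) -> bool:
    """True if at least one pair of sides (one from each bipartition) is disjoint."""
    return any(not (s1 & s2) for s1 in b1 for s2 in b2)


g1 = parse_bipartition("A E D | B C")
g2 = parse_bipartition("A C | D E B")
print(compatible(g1, g2))  # False: all four intersections are non-empty
```

Here the intersections are {A}, {D, E}, {C} and {B}, so no tree can contain both bipartitions.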
Slide 7
WSB+CAML (contd.)
[Figure: WSB+CAML pipeline. WSB partitions the incompatibility graph (binning threshold t) into bins of gene trees g1 to g12; CAML builds a supergene alignment and a supergene tree for each bin; the supergene trees become the new gene trees G1 to G12, from which the species tree on taxa A to E is estimated.]
Slide 8
Problems?
Each gene within a bin has the same new gene tree
Concatenation is used to get supergene trees
Large running time for large bin sizes
True gene trees of genes within a bin still have some discord
The initially estimated gene tree is not explicitly used to compute the new gene tree, although it can be fairly accurate in certain conditions (e.g. decent alignment length)
Only tested on MLBS analysis (not on BestML analysis, which is known to be more accurate for a large enough number of genes)
Slide 9
MLBS vs BestML
WLOG assume N gene trees, each having K bootstrap replicates
MLBS
Run the phylogenetic pipeline K times, with
B1 from g1, B1 from g2, …, B1 from gN as input for the 1st run
B2 from g1, B2 from g2, …, B2 from gN as input for the 2nd run
BK from g1, BK from g2, …, BK from gN as input for the Kth run
Take the greedy consensus of the K species trees obtained to get the final tree
BestML
Run the phylogenetic pipeline 1 time, with
Best ML tree for g1, Best ML tree for g2, …, Best ML tree for gN as input
The output species tree is the final tree
Slide 10
Weighted Quartet Max Cut (WQMC)
Quartet-based tree estimation method
Input: a set of weighted quartets Q, a set of taxa X
Output: tree T* (approximate solution to MQC)
T* tries to maximize the total weight of induced quartets
Divide-and-conquer amalgamation technique
Robust, doesn't need all quartets
Slide 11
WSB+WQMC
Novel technique aimed at tackling the problems of WSB+CAML
Tested on BestML analysis
Features
Computes a unique new gene tree for each initial gene tree
The initially estimated gene tree is used to get the new gene tree
Modifies initial gene trees based on frequent quartet topologies in similar gene trees
Uses WQMC rather than concatenation (faster and scalable)
Slide 12
WSB+WQMC (contd.)
Idea
Partition input gene trees into disjoint bins using WSB (binning threshold t)
For each gene tree, extract the weighted quartet topologies
For each gene, combine the weighted quartet topologies from the genes within its bin, upweighting its own quartets by confidence_value * BIN_SIZE
Run WQMC on the combined quartet topologies for each gene to get its new gene tree
Slide 13
WSB+WQMC (contd.)
[Figure: WSB (binning threshold t) partitions gene trees g1 to g12 into bins; the example bin contains g3, g4 and g9, each a quartet tree on taxa A, B, C, D]
Quartets in the bin: g3 has AD|BC with weight w3, g4 has AB|CD with weight w4, g9 has AB|CD with weight w9
Confidence value c = 2/3
Upweight = c * 3 (BIN_SIZE) = 2
Weighted quartets for g3: AD|BC : 2*w3, AB|CD : w4+w9
Weighted quartets for g4: AB|CD : 2*w4+w9, AD|BC : w3
Weighted quartets for g9: AB|CD : 2*w9+w4, AD|BC : w3
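The combination step in this example can be sketched as follows, assuming the upweight multiplies the gene's own quartet weights (as the 2*w3 term suggests) while the other genes in the bin contribute with factor 1. The function name `combine_bin_quartets` and the numeric weights w3 = 5, w4 = 3, w9 = 4 are made up for illustration.

```python
from collections import defaultdict
from fractions import Fraction
from typing import Dict

QuartetWeights = Dict[str, Fraction]  # quartet topology -> weight


def combine_bin_quartets(
    bin_quartets: Dict[str, Dict[str, int]],
    target: str,
    confidence: Fraction,
) -> QuartetWeights:
    """Sum quartet weights over all genes in a bin, multiplying the
    target gene's own weights by confidence * bin size."""
    upweight = confidence * len(bin_quartets)
    combined: QuartetWeights = defaultdict(Fraction)
    for gene, quartets in bin_quartets.items():
        factor = upweight if gene == target else 1
        for topology, weight in quartets.items():
            combined[topology] += factor * weight
    return dict(combined)


# Illustrative weights only: w3 = 5, w4 = 3, w9 = 4.
example_bin = {
    "g3": {"AD|BC": 5},
    "g4": {"AB|CD": 3},
    "g9": {"AB|CD": 4},
}
for_g3 = combine_bin_quartets(example_bin, "g3", Fraction(2, 3))
# For g3: AD|BC gets 2*w3 = 10 and AB|CD gets w4 + w9 = 7.
```

Exact fractions are used so that c = 2/3 times a bin size of 3 gives exactly 2. Each gene's combined quartet set would then be passed to WQMC to build that gene's new tree.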
Slide 14
WSB+WQMC (contd.)
[Figure: WSB+WQMC pipeline. WSB (binning threshold t) partitions gene trees g1 to g12 into bins; weighted quartets are extracted and combined for each gene; WQMC produces the new gene trees G1 to G12, from which the species tree on taxa A to E is estimated.]
Slide 15
Experimental Study
Evaluated WSB+WQMC on various datasets
Compared WSB+WQMC and WSB+CAML
How to determine WSB+WQMC parameters?
Can't try all
Binning threshold: 75% used (also used in WSB paper)
Confidence value: set in a training phase
Simulated datasets used for training
Simulated datasets used for testing
Measure of ILS: average FN % between true gene trees and true species tree
Slide 16
AD% measures ILS (average FN % between true gene trees and true species tree)
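A minimal sketch of this measure, assuming each tree is given as the set of bipartitions on its internal edges and that the species tree serves as the reference when counting false negatives (the slides do not state the direction; for fully resolved trees on the same taxa it makes no difference). The helper names and the two toy gene trees are invented for illustration.

```python
from typing import FrozenSet, Iterable, List, Set

Split = FrozenSet[str]


def canonical(side: Iterable[str], taxa: Iterable[str]) -> Split:
    """Represent a bipartition by the side that excludes the smallest taxon name."""
    side_set, taxa_set = frozenset(side), frozenset(taxa)
    return side_set if min(taxa_set) not in side_set else taxa_set - side_set


def fn_rate(reference: Set[Split], estimated: Set[Split]) -> float:
    """Fraction of the reference tree's bipartitions missing from the other tree."""
    if not reference:
        return 0.0
    return len(reference - estimated) / len(reference)


def average_distance_percent(
    species_tree: Set[Split], gene_trees: List[Set[Split]]
) -> float:
    """AD%: mean FN rate of the gene trees against the species tree, in percent."""
    rates = [fn_rate(species_tree, g) for g in gene_trees]
    return 100.0 * sum(rates) / len(rates)


taxa = "ABCDE"
species = {canonical("AB", taxa), canonical("DE", taxa)}  # ((A,B),C,(D,E))
gene_1 = {canonical("AB", taxa), canonical("DE", taxa)}   # same topology
gene_2 = {canonical("AC", taxa), canonical("DE", taxa)}   # ((A,C),B,(D,E))
print(average_distance_percent(species, [gene_1, gene_2]))  # 25.0
```

The same `fn_rate` function, applied to an estimated tree against its true tree, gives the FN rate used as the estimation error in the results.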
Results
Gene tree estimation error
Gene trees improve for low ILS and worsen for high ILS
Confidence value 0.0 worse than 0.2 and 0.3
Confidence value 0.2 same as 0.3
Confidence value 0.2 used for rest of the study
Slide 17
Results
[Plot: gene tree estimation error (avg. FN rate of WSB+WQMC gene trees minus avg. FN rate of original gene trees) across ILS levels]
Slide 18
Results
[Plot: gene tree estimation error across ILS levels, WSB+WQMC vs WSB+CAML]
Slide 19
Results
[Plot: gene tree estimation error across ILS levels, WSB+WQMC vs WSB+CAML]
Slide 20
Results
[Plot: ASTRAL2 species tree estimation error (avg. FN rate of WSB+WQMC species tree minus avg. FN rate of original species tree) across ILS levels]
Slide 21
Results
[Plot: ASTRAL2 species tree estimation error across ILS levels, WSB+WQMC vs WSB+CAML]
Slide 22
Results
[Plot: ASTRAL2 species tree estimation error across ILS levels, WSB+WQMC vs WSB+CAML]
Slide 23
Conclusions
In general, both WSB+WQMC and WSB+CAML improve gene trees and species trees in low/medium ILS
In general, both WSB+WQMC and WSB+CAML worsen gene trees and species trees in high ILS
Better gene trees need not give better species trees
WSB+WQMC is better than WSB+CAML in high ILS (more than 30% AD), for both gene tree and species tree estimation
It may be better to just use the original gene trees in some situations
WSB+CAML is better than WSB+WQMC for getting the species tree (30% AD and below)
No clear winner for gene tree estimation in low and medium ILS (30% AD and below)
Slide 24
Future Extensions
Dynamic inference of parameters: binning threshold, confidence value
Other quartet-based heuristics, e.g. wASTRAL
Results on biological datasets
Handling cases of high ILS
Slide 25
Questions?
Slide 26
Thank You