Improving gene trees without more data


giovanna-bartolotta
Uploaded On 2017-05-06



Presentation Transcript

Slide1

Improving gene trees without more data

Master Thesis by Ashu Gupta

Slide2

Phylogenetic Pipeline

Most common pipeline using summary method

Step 1: Get gene alignments from sequence data (e.g. Prank, MAFFT)

Step 2: Get gene trees from gene alignments (e.g. RAxML, FastTree)

Step 3: Get species tree from gene trees (e.g. ASTRAL2, ASTRID, NJst)

Getting good gene trees is a major problem

Slide3

Motivation

Most common problem of the phylogenetic pipeline? Poorly estimated gene trees

Summary methods (e.g. ASTRAL2, ASTRID, NJst) suffer from gene tree error

Why can’t we get good gene trees?

  Errors in gene alignment

  Short alignment length for individual genes

  Accurate gene tree estimation is impossible from limited data under certain conditions (e.g. short branches, very long branches)

  Individual genes have low phylogenetic signal

Slide4

Solution?

Collect more data and re-estimate

  Costly

  Time consuming

How to improve an estimated gene tree?

Use data from other estimated gene trees

  Co-estimate gene trees and species tree, e.g. *BEAST (too computationally expensive)

  Naïve binning (Bayzid et al.)

  Statistical Binning (Mirarab et al.)

  Weighted Statistical Binning (Bayzid et al.)

Slide5

Weighted Statistical Binning Pipeline

WSB+CAML

Observations

  Concatenation is better than summary methods in low ILS (incomplete lineage sorting) conditions

  Concatenation can be used to boost phylogenetic signal among similar gene trees

Idea

  Partition genes into disjoint bins using the initially estimated gene trees

  Genes within a bin are similar to each other (less discord)

  Concatenate gene alignments within a bin (supergene alignment for a bin)

  Estimate a supergene tree using ML-based methods (e.g. RAxML, FastTree)

  Use supergene trees as new gene trees

Slide6

WSB+CAML (contd.)

Incompatibility Graph

[Figure: incompatibility graph over gene trees g1 to g12 at binning threshold t; gene trees g1 and g2 on taxa A, B, C, D, E, drawn with branch supports 65%, 80%, 85% and 60%]

Binning threshold: 75%

Check incompatibility of g1 and g2 after collapsing edges with support below the threshold (only the 80% and 85% edges remain)

g1 and g2 are incompatible: A E D | B C (g1) is incompatible with A C | D E B (g2)
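The compatibility check on this slide can be sketched in Python. This is a minimal illustration, not the WSB implementation. Each tree is given as a list of (one side of a bipartition, support) pairs. The assignment of the 80% and 85% supports to the two bipartitions named on the slide is an assumption, the two low-support bipartitions are hypothetical, and all function names are invented here. Two bipartitions are compatible when at least one of the four pairwise intersections of their sides is empty.

```python
TAXA = frozenset("ABCDE")


def bipartitions_compatible(side1, side2, taxa=TAXA):
    """Two bipartitions fit in one tree iff some pair of their sides is disjoint."""
    a, b = frozenset(side1), taxa - frozenset(side1)
    c, d = frozenset(side2), taxa - frozenset(side2)
    return any(not (x & y) for x in (a, b) for y in (c, d))


def collapse(edges, threshold):
    """Keep only bipartitions whose support reaches the threshold (>= is assumed)."""
    return [side for side, support in edges if support >= threshold]


def trees_incompatible(edges1, edges2, threshold, taxa=TAXA):
    """True if any pair of surviving bipartitions conflicts."""
    kept1 = collapse(edges1, threshold)
    kept2 = collapse(edges2, threshold)
    return any(
        not bipartitions_compatible(s1, s2, taxa) for s1 in kept1 for s2 in kept2
    )


# AED|BC (g1) and AC|DEB (g2) come from the slide; the low-support
# bipartitions AE|BCD and DE|ABC are hypothetical fillers.
g1 = [("AED", 80), ("AE", 65)]
g2 = [("AC", 85), ("DE", 60)]

print(trees_incompatible(g1, g2, threshold=75))
```

With a threshold above every support value, all edges collapse and the two trees can no longer conflict, which is why the threshold controls how dense the incompatibility graph is.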

Slide7

WSB+CAML (contd.)

Incompatibility Graph

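[Figure: WSB partitions the incompatibility graph over g1 to g12 into bins at binning threshold t; the alignments in each bin are concatenated into supergene alignments; CAML estimates a supergene tree per bin; the supergene trees become the new gene trees G1 to G12, from which the species tree on taxa A, B, C, D, E is estimated]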
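The concatenation step in this pipeline can be sketched as follows. This is a minimal sketch under stated assumptions, not the thesis code: each gene alignment is assumed to be a dict from taxon name to an aligned sequence, taxa missing from a gene are assumed to be padded with gap characters, and the function name `concatenate_bin` is hypothetical.

```python
def concatenate_bin(alignments):
    """Concatenate the gene alignments of one bin into a supergene alignment.

    alignments: list of dicts mapping taxon -> aligned sequence (equal length
    within each gene). Taxa absent from a gene are padded with '-' (assumption).
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy bin with two genes; the sequences are made up for illustration.
bin_alignments = [
    {"A": "ACGT", "B": "ACGA"},
    {"A": "GG", "B": "GC", "C": "GA"},
]
supergene = concatenate_bin(bin_alignments)
for taxon, sequence in supergene.items():
    print(taxon, sequence)
```

The resulting supergene alignment would then be passed to an ML tree estimator such as RAxML or FastTree, which is not shown here.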

Slide8

Problems?

Each gene within a bin has the same new gene tree

Concatenation used for getting supergene trees

  Large running time for large bin sizes

  True gene trees of genes within a bin still have some discord

Initially estimated gene tree is not explicitly used to compute new gene tree

  Can be fairly accurate in certain conditions (e.g. decent alignment length)

Only tested on MLBS analysis (not in BestML analysis, known to be more accurate for a large enough number of genes)

Slide9

MLBS vs BestML

WLOG assume N gene trees, each having K bootstrap replicates

MLBS

  Run the phylogenetic pipeline K times, with:

  B1 from g1, B1 from g2 … B1 from gN as input for the 1st run

  B2 from g1, B2 from g2 … B2 from gN as input for the 2nd run

  BK from g1, BK from g2 … BK from gN as input for the Kth run

  Take the greedy consensus of the K species trees obtained to get the final tree

BestML

  Run the phylogenetic pipeline 1 time, with the best ML tree for g1, the best ML tree for g2 … the best ML tree for gN as input

  The output species tree is the final tree

Slide10

Weighted Quartet Max Cut (WQMC)

Quartet-based tree estimation method

Input: a set of weighted quartets Q and a set of taxa X

Output: tree T* (approximate solution to MQC)

T* tries to maximize the total weight of induced quartets

Divide-and-conquer amalgamation technique

Robust, doesn’t need all quartets

Slide11

WSB+WQMC

Novel technique aimed at tackling the problems of WSB+CAML

Tested on BestML analysis

Features

  Computes a unique new gene tree for each initial gene tree

  Initially estimated gene tree is used to get the new gene tree

  Modifies initial gene trees based on frequent quartet topologies in similar gene trees

  Uses WQMC rather than concatenation (faster and more scalable)

Slide12

WSB+WQMC (contd.)

Idea

  Partition input gene trees into disjoint bins using WSB (binning threshold t)

  For each gene tree, extract the weighted quartet topologies

  For each gene, combine the weighted quartet topologies from the genes within its bin, upweighting its own quartets by confidence_value * BIN_SIZE

  Run WQMC on the combined quartet topologies for each gene to get its new gene tree

Slide13

WSB+WQMC (contd.)

[Figure: WSB (binning threshold t) partitions gene trees g1 to g12 into bins; the worked example uses the bin containing g3, g4 and g9 on taxa A, B, C, D]

Confidence value c = 2/3

Upweight = c * BIN_SIZE = (2/3) * 3 = 2

Quartet topology in each gene tree of the bin:

  g3: AD|BC, weight w3

  g4: AB|CD, weight w4

  g9: AB|CD, weight w9

Weighted quartets for g3: AD|BC : 2*w3, AB|CD : w4+w9

Weighted quartets for g4: AB|CD : 2*w4+w9, AD|BC : w3

Weighted quartets for g9: AB|CD : 2*w9+w4, AD|BC : w3
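The upweighting arithmetic in this example can be reproduced with a short Python sketch. It is only an illustration of the combination step, assuming each gene contributes a dict of quartet topology to weight; the function name `combined_quartets` and the numeric values chosen for w3, w4 and w9 are placeholders that do not come from the slides, and the WQMC run that follows is not shown.

```python
from collections import defaultdict
from fractions import Fraction


def combined_quartets(bin_quartets, gene, confidence):
    """Sum quartet weights over a bin, scaling the gene's own quartets.

    bin_quartets: dict gene -> dict quartet topology -> weight.
    The gene's own quartets are multiplied by confidence * BIN_SIZE.
    """
    upweight = confidence * len(bin_quartets)
    combined = defaultdict(int)
    for other, quartets in bin_quartets.items():
        factor = upweight if other == gene else 1
        for topology, weight in quartets.items():
            combined[topology] += factor * weight
    return dict(combined)


# Placeholder weights standing in for w3, w4 and w9.
w3, w4, w9 = 5, 7, 11
bin_quartets = {
    "g3": {"AD|BC": w3},
    "g4": {"AB|CD": w4},
    "g9": {"AB|CD": w9},
}
c = Fraction(2, 3)  # confidence value from the worked example

for gene in bin_quartets:
    print(gene, combined_quartets(bin_quartets, gene, c))
```

With the confidence value of 0.2 used later in the study and a bin of size 3, the same factor would be 0.6, so a gene's own quartets would count for less than those of its bin mates.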

Slide14

WSB+WQMC (contd.)

[Figure: gene trees g1 to g12 are binned by WSB (binning threshold t); weighted quartets are extracted and combined for each gene; WQMC produces the new gene trees G1 to G12, from which the species tree on taxa A, B, C, D, E is estimated]

Slide15

Experimental Study

Evaluated WSB+WQMC on various datasets

Compared WSB+WQMC and WSB+CAML

How to determine WSB+WQMC parameters? Can’t try all

  Binning threshold: 75% used (also used in WSB paper)

  Confidence value: training phase

Simulated datasets used for training

Simulated datasets used for testing

Measure of ILS: average FN % between true gene trees and true species tree

Slide16

AD% measures ILS (average FN % between true gene trees and true species tree)
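As a rough illustration of the AD% definition, the sketch below computes a false-negative (FN) rate over bipartitions and averages it across gene trees. It assumes each tree is already given as a list of its non-trivial bipartitions (one side each) and treats the species tree as the reference; the function names and the toy trees are hypothetical, and parsing real tree files is out of scope.

```python
def canonical(side, taxa):
    """Represent a bipartition by the side that excludes the smallest taxon."""
    taxa = frozenset(taxa)
    side = frozenset(side)
    return taxa - side if min(taxa) in side else side


def fn_rate(reference, estimate, taxa):
    """Fraction of reference bipartitions that are missing from the estimate."""
    ref = {canonical(side, taxa) for side in reference}
    est = {canonical(side, taxa) for side in estimate}
    if not ref:
        return 0.0
    return len(ref - est) / len(ref)


def average_discordance(gene_trees, species_tree, taxa):
    """AD%: mean FN rate of the gene trees against the species tree, in percent."""
    rates = [fn_rate(species_tree, gene, taxa) for gene in gene_trees]
    return 100.0 * sum(rates) / len(rates)


taxa = "ABCDE"
species_tree = ["AB", "DE"]   # bipartitions AB|CDE and ABC|DE
gene_same = ["CDE", "ABC"]    # same tree, written from the other sides
gene_diff = ["AB", "ABD"]     # shares AB|CDE but has ABD|CE instead of ABC|DE

print(average_discordance([gene_same, gene_diff], species_tree, taxa))
```

For fully resolved trees on the same taxa, the FN rate is the same in both directions, so the choice of reference does not change AD% in that case.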

Results (gene tree estimation error)

Gene trees improve for low ILS and worsen for high ILS

Confidence value 0.0 worse than 0.2 and 0.3

Confidence value 0.2 same as 0.3

Confidence value 0.2 used for rest of the study

Slide17

Results

[Figure: axes ILS and gene tree estimation error (avg. FN rate of WSB+WQMC gene trees minus avg. FN rate of original gene trees)]

Slide18

Results

[Figure: WSB+WQMC vs WSB+CAML; axes ILS and gene tree estimation error]

Slide19

Results

[Figure: WSB+WQMC vs WSB+CAML; axes ILS and gene tree estimation error]

Slide20

Results

[Figure: axes ILS and ASTRAL2 species tree estimation error (avg. FN rate of WSB+WQMC species tree minus avg. FN rate of original species tree)]

Slide21

Results

[Figure: WSB+WQMC vs WSB+CAML; axes ILS and ASTRAL2 species tree estimation error]

Slide22

Results

[Figure: WSB+WQMC vs WSB+CAML; axes ILS and ASTRAL2 species tree estimation error]

Slide23

Conclusions

In general, both WSB+WQMC and WSB+CAML improve gene trees and species trees in low/medium ILS

In general, both WSB+WQMC and WSB+CAML worsen gene trees and species trees in high ILS

Better gene trees need not give better species trees

WSB+WQMC better than WSB+CAML in high ILS (more than 30% AD)

  For both gene tree and species tree estimation

  May be better to just use original gene trees in some situations

WSB+CAML better than WSB+WQMC to get species tree (30% AD and below)

No clear winner for gene tree estimation in low and medium ILS (30% AD and below)

Slide24

Future Extensions

Dynamic inference of parameters

  Binning threshold

  Confidence value

Other quartet-based heuristics, e.g. wASTRAL

Results on biological datasets

Handling cases of high ILS

Slide25

Questions?

Slide26

Thank You
