James Lindsay*

James Lindsay* - PowerPoint Presentation

Uploaded On 2016-10-26

James Lindsay, Hamed Salooti, Alex Zelikovski, Ion Mandoiu. Scaffolding Large Genomes Using Integer Linear Programming. ACM-BCB 2012. University of Connecticut, Georgia State University.

Presentation Transcript

Slide1

James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*

ACM-BCB 2012

Scaffolding Large Genomes Using Integer Linear Programming

*University of Connecticut

Georgia State University

Slide2

De-novo Assembly Paradigm

[Figure: the genome -> (shotgun sequencing) -> short reads -> (de novo assembly) -> short contigs -> (scaffolding) -> the scaffolds]

Slide3

Why Scaffolding?

Annotation
Comparative biology
Re-sequencing and gap filling
Structural variation!

[Figure: gene XYZ with its 5’ UTR and 3’ UTR, shown with a scaffold and with no scaffold]

Slide4

Why Scaffolding?

Annotation
Comparative biology
Re-sequencing and gap filling
Structural variation!

[Figure: gene XYZ with its 5’ UTR and 3’ UTR, shown twice; label: Sanger Sequencing]

Biologist: There are holes in my genes!

Slide5

Why Scaffolding?

Annotation
Comparative biology
Re-sequencing and gap filling
Structural variation!

Slide6

Massive Sequencing Projects
I5k: 5000 insect and arthropod species
G10k: 10,000 vertebrate species

Effects of Read Length
Dog Genome: 7.5x Sanger, N50: 180Kb
Chicken Genome: 6x Illumina, N50: 12Kb
Human Genome: 100x Illumina, N50: 24Kb

Fragmented Genomes

Slide7

The Scaffolding Problem

Given: Contigs, Paired reads
Find: Orientation, Ordering, Relative distance
Goal: Recreate true scaffolds

Slide8

Paired Read Construction

Paired Read Styles
Mate Pair
Paired End

[Figure: Paired Reads. Read pairs R1 and R2 drawn in two styles. Labels on the first: 2kb, 2kb, "same strand and orientation". Labels on the second: 100b, 100b, 10kb, "different strand and orientation"]

Slide9

Linkage Information

Two contigs are adjacent if:
A read pair spans the contigs

State (A, B, C, D)
Depends on orientation of the read
Order of contigs is arbitrary

Each read pair can be “consistent” with one of the four states

[Figure: Possible States (mate pair). Reads R1 and R2 on contig i and contig j, drawn 5’ to 3’, for each of the states A, B, C, D]

Slide10

Scaffolding Graph

Nodes
Nodes are contigs

Edges
Adjacent contigs have 4 edges (one for each state)
Weighted by overlap with repetitive region

[Figure: contig i and contig j joined by an edge labelled State A]
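A minimal sketch of how such a graph might be held in memory, assuming read pairs have already been mapped and each supporting pair carries a state and a weight. The function name `build_scaffolding_graph`, the tuple layout and the example weights are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict

STATES = ("A", "B", "C", "D")


def build_scaffolding_graph(links):
    """Aggregate read-pair links into per-state edge weights.

    links: iterable of (contig_i, contig_j, state, weight) tuples
    (hypothetical layout). Callers are assumed to list each contig
    pair in one fixed order. Returns {(i, j): {state: total_weight}},
    so each adjacent pair carries four weighted edges.
    """
    graph = defaultdict(lambda: dict.fromkeys(STATES, 0.0))
    for i, j, state, weight in links:
        if state not in STATES:
            raise ValueError(f"unknown state: {state!r}")
        graph[(i, j)][state] += weight
    return dict(graph)


# Hypothetical links: two pairs support state A between c1 and c2.
links = [
    ("c1", "c2", "A", 1.0),
    ("c1", "c2", "A", 0.5),
    ("c1", "c2", "D", 0.25),
    ("c2", "c3", "B", 1.0),
]
graph = build_scaffolding_graph(links)
```

Keeping all four state weights on every adjacent pair makes the later choice of a single state per pair a simple lookup.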

Slide11

Integer Linear Program Formulation

Variables
Contig pair state
Contig orientation
Adjacent contig consistency
[variable symbols not preserved in the transcript]

Objective
Maximize weight of consistent pairs

Slide12

Constraints

Variables: contig pair state, contig orientation, adjacent contig consistency

Pairwise Orientation
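The constraint formulas on this slide did not survive in the transcript. One standard way to tie a pairwise variable to two binary orientation variables is to force it to equal their XOR with four linear inequalities. The sketch below checks that linearization by enumeration; the names `s_i`, `s_j` and `s_ij` are assumptions and may not match the symbols on the original slide.

```python
from itertools import product


def satisfies_pairwise(s_i, s_j, s_ij):
    """Four linear inequalities that together force s_ij == s_i XOR s_j.

    s_i, s_j: assumed 0/1 orientation variables of two contigs.
    s_ij: assumed 0/1 variable that is 1 when the orientations differ.
    """
    return (
        s_ij <= s_i + s_j
        and s_ij <= 2 - s_i - s_j
        and s_ij >= s_i - s_j
        and s_ij >= s_j - s_i
    )


# Enumerate every 0/1 assignment and keep the ones the inequalities allow.
feasible = [t for t in product((0, 1), repeat=3) if satisfies_pairwise(*t)]
```

Only the four assignments with `s_ij == s_i ^ s_j` remain feasible, which is what lets an ILP solver reason about relative orientation without a nonlinear term.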

 

 

 

Slide13

Constraints

Variables: contig pair state, contig orientation, adjacent contig consistency

State Variables
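How state variables interact with orientations and the objective can be illustrated by brute force on a tiny instance. This is only a sketch: it assumes each state corresponds to one combination of the two contigs' orientations (the mapping in `STATE_ORIENTATIONS` is hypothetical), it ignores ordering and cycles, and it enumerates assignments instead of solving an ILP.

```python
from itertools import product

# Hypothetical mapping from a state to the orientations (0 = forward,
# 1 = reverse) of the two contigs; the slides do not spell this out.
STATE_ORIENTATIONS = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}
ORIENTATION_STATE = {v: k for k, v in STATE_ORIENTATIONS.items()}


def best_orientation(contigs, weights):
    """Exhaustively pick orientations maximizing the weight of consistent pairs.

    weights: {(i, j): {state: weight}}. Each pair contributes only the
    weight of the single state its two orientations select, so at most
    one state per pair is ever counted. Feasible for a handful of contigs.
    """
    best_score, best_assign = float("-inf"), None
    for bits in product((0, 1), repeat=len(contigs)):
        orient = dict(zip(contigs, bits))
        score = sum(
            w.get(ORIENTATION_STATE[(orient[i], orient[j])], 0.0)
            for (i, j), w in weights.items()
        )
        if score > best_score:
            best_score, best_assign = score, orient
    return best_score, best_assign


# Made-up weights for three contigs.
weights = {
    ("c1", "c2"): {"A": 3.0, "B": 1.0},
    ("c2", "c3"): {"B": 2.0},
    ("c1", "c3"): {"A": 1.0, "B": 4.0},
}
score, assignment = best_orientation(["c1", "c2", "c3"], weights)
```

With these weights the best assignment keeps c1 and c2 forward and reverses c3, for a total weight of 9.0. The exponential enumeration is the reason a real solver is needed beyond toy sizes.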

 

 

 

Slide14

Constraints

Variables: contig pair state, contig orientation, adjacent contig consistency

Mutual Exclusivity

[constraint equations not preserved in the transcript]

Slide15

Constraints

Forbid 2 Cycles

Forbid 3 Cycles

[constraint equations not preserved in the transcript]

*larger cycles are broken at the end

Slide16

Largest Connected Component

Slide17

Graph Decomposition: Articulation Points

[Figure labels: solve; Articulation point]
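Articulation points are nodes whose removal disconnects the graph, so they mark places where the scaffolding graph can be split into parts. A minimal sketch of finding them with a depth-first search and low-link values follows; the function name and the adjacency-dict input are illustrative assumptions, and the recursive form is only suitable for small graphs.

```python
def articulation_points(adj):
    """Return the set of articulation points of a simple undirected graph.

    adj: {node: iterable of neighbours}. Uses a recursive DFS that
    records discovery times and low-link values.
    """
    disc, low, points = {}, {}, set()

    def dfs(u, parent):
        disc[u] = low[u] = len(disc)
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                # Back edge: u can reach an earlier node.
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # v's subtree cannot climb above u, so u separates it.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        # A DFS root is an articulation point only with 2+ children.
        if parent is None and children > 1:
            points.add(u)

    for node in adj:
        if node not in disc:
            dfs(node, None)
    return points


# Two triangles that share contig "c".
adj = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d", "e"],
    "d": ["c", "e"],
    "e": ["c", "d"],
}
cut_nodes = articulation_points(adj)
```

In this example only "c" is an articulation point, so the two triangles could be handled as separate subproblems that agree on "c".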

(MIP, Salmela 2011)

Slide18

Largest Biconnected Component

Slide19

Non-Serial Dynamic Programming

A technique which exploits the sparsity of the scaffolding graph by computing the solution in stages, incorporating the results from previous stages
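The staged idea can be illustrated with generic variable elimination over binary variables: each stage removes one variable, folds every score table that mentions it into a new table over its neighbours, and passes that table on to later stages. This is a generic sketch, not the SILP implementation; the function name `eliminate_max`, the (scope, table) representation and the example scores are assumptions.

```python
from itertools import product


def eliminate_max(factors, order):
    """Maximize a sum of local score tables by eliminating variables in order.

    factors: list of (scope, table) with scope a tuple of variable names
    and table a dict from 0/1 tuples over scope to scores.
    order: every variable, in the order it should be eliminated.
    Returns the optimal total score.
    """
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        # The new table ranges over the neighbours of var only.
        scope = tuple(sorted({v for s, _ in touching for v in s if v != var}))
        table = {}
        for values in product((0, 1), repeat=len(scope)):
            env = dict(zip(scope, values))
            best = float("-inf")
            for x in (0, 1):
                env[var] = x
                total = sum(t[tuple(env[v] for v in s)] for s, t in touching)
                best = max(best, total)
            table[values] = best
        factors.append((scope, table))
    # Every variable is gone, so each remaining table has an empty scope.
    return sum(t[()] for _, t in factors)


# A chain a - b - c - d with made-up pairwise scores.
chain = [
    (("a", "b"), {(0, 0): 3, (0, 1): 0, (1, 0): 0, (1, 1): 1}),
    (("b", "c"), {(0, 0): 0, (0, 1): 2, (1, 0): 2, (1, 1): 0}),
    (("c", "d"), {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 4}),
]
best = eliminate_max(chain, ["a", "d", "b", "c"])
```

On a sparse graph each intermediate table stays small, which is where the savings come from; a poor elimination order on a dense graph would make the tables grow exponentially.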

~inspired by (Neumaier, 06)

Slide20

Non-Serial Dynamic Programming

[Figure: a 2-cut, with the four orientation combinations of its two contigs: (+, +), (+, -), (-, +), (-, -)]

Slide21

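Non-Serial Dynamic Programming

[Figure: the four orientation combinations of the 2-cut: (+, +), (+, -), (-, +), (-, -)]

Objective Modification:
[equations not preserved in the transcript]

Slide22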

SPQR-tree Based Implementation

SPQR-tree efficiently finds 2-cuts (Tarjan, 73)

DFS of SPQR-tree is used to schedule elimination order for NSDP

Slide23

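Post Processing ILP Solution

May have cycles
Not a total ordering for each connected component

[Figure: the ILP solution on contigs A to F is redrawn as a bipartite graph with an outgoing copy and an incoming copy of each contig. Labels: ILP Solution, outgoing, incoming, Bipartite matching]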
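The bipartite view can be sketched as follows: each contig appears once on an outgoing side and once on an incoming side, and a matching keeps at most one successor and one predecessor per contig. The code below uses the classic augmenting-path method for maximum cardinality only. It does not handle weights and does not break any cycles the matching may still contain; the function name and the input layout are assumptions.

```python
def max_cardinality_matching(successors):
    """Augmenting-path matching on the outgoing/incoming bipartite graph.

    successors: {contig: list of candidate next contigs} (hypothetical
    layout for the ILP solution). Returns {contig: chosen_next} in which
    every contig is used at most once as a predecessor and at most once
    as a successor.
    """
    matched_pred = {}  # incoming side: successor -> predecessor

    def try_assign(u, seen):
        for v in successors.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current predecessor can move.
            if v not in matched_pred or try_assign(matched_pred[v], seen):
                matched_pred[v] = u
                return True
        return False

    for u in successors:
        try_assign(u, set())
    return {u: v for v, u in matched_pred.items()}


# c1 first takes c2, then is rerouted to c3 so that c5 can take c2.
chosen = max_cardinality_matching({"c1": ["c2", "c3"], "c5": ["c2"]})
```

The rerouting in the example is the augmenting step: a greedy choice alone would have left c5 without a successor.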

Objectives:
Max weight
Max cardinality
Max cardinality / Max weight

Slide24

GAGE Framework

Genome                    Size (Mb)   # reads
Staphylococcus aureus     2.9         3,494,070
Rhodobacter sphaeroides   4.6         2,050,868
Human Chr14               107         22,669,408

Assembled using: ABySS, Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet

Scaffolded using: SILP (our method), Opera, MIP, Bambus2

Slide25

Testing Metrics

TPN50
Break scaffold at incorrect edges, then find N50
N50: the length such that pieces of that length or longer contain at least 50% of the total assembled length

Binary Classification
Given n contigs in a scaffold
How many of n-1 adjacencies can you predict
PPV
Sensitivity
MCC
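These metrics can be sketched in a few lines, using the usual definitions of N50, positive predictive value, sensitivity and the Matthews correlation coefficient. The function names and input layouts are assumptions, and gap sizes between contigs are ignored when scaffold pieces are measured.

```python
import math


def n50(lengths):
    """Length L such that pieces of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def tp_n50(scaffolds):
    """Break scaffolds at incorrect adjacencies, then take N50 of the pieces.

    scaffolds: list of (contig_lengths, edge_is_correct) where
    edge_is_correct[k] describes the join between contig k and k + 1.
    """
    pieces = []
    for lengths, edge_ok in scaffolds:
        current = lengths[0]
        for length, ok in zip(lengths[1:], edge_ok):
            if ok:
                current += length
            else:
                pieces.append(current)
                current = length
        pieces.append(current)
    return n50(pieces)


def classification_metrics(tp, fp, tn, fn):
    """Return (PPV, sensitivity, MCC) from adjacency prediction counts."""
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, sensitivity, mcc


# One scaffold of four contigs whose middle join is wrong.
example = tp_n50([([10, 20, 30, 40], [True, False, True])])
```

In the example the scaffold breaks into pieces of 30 and 70, so the TPN50 is 70, while the N50 of the four contigs on their own is 30.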

Slide26

Results

Slide27

Results

Slide28

Results

Slide29

Results

Slide30

Conclusions

Success
ILP solves scaffolding problem!
NSDP works

Improvements
Include SOAPdenovo, Allpaths-LG scaffolds in comparison
Look at parameter effects
Practical considerations (read style, multi-libraries, merging contigs)

Future Work
Where else can I apply NSDP?
Scaffold before assembly … promising
Structural Variation??

Slide31

Questions?