Slide1
Scaffolding Large Genomes Using Integer Linear Programming
James Lindsay*, Hamed Salooti, Alex Zelikovski, Ion Mandoiu*
ACM-BCB 2012
*University of Connecticut
Georgia State University
Slide2
De-novo Assembly Paradigm
[Figure: the genome -> shotgun sequencing -> short reads -> de novo assembly -> short contigs -> scaffolding -> the scaffolds]
Slide3
Why Scaffolding?
Annotation
Comparative biology
Re-sequencing and gap filling
Structural variation!
[Figure: "Scaffold" shows gene XYZ with its 5’ UTR and 3’ UTR; "No scaffold" shows gene XYZ alone]
Slide4
Why Scaffolding?
Annotation
Comparative biology
Re-sequencing and gap filling
Structural variation!
[Figure: two drawings of gene XYZ with its 5’ UTR and 3’ UTR, labeled "Sanger Sequencing"]
Biologist: There are holes in my genes!
Slide5
Why Scaffolding?
Annotation
Comparative biology
Re-sequencing and gap filling
Structural variation!
Slide6
Massive Sequencing Projects
  I5k: 5,000 insect and arthropod species
  G10k: 10,000 vertebrate species
Effects of Read Length
  Dog Genome: 7.5x Sanger, N50: 180Kb
  Chicken Genome: 6x Illumina, N50: 12Kb
  Human Genome: 100x Illumina, N50: 24Kb
Fragmented Genomes
Slide7
The Scaffolding Problem
Given: Contigs, Paired reads
Find: Orientation, Ordering, Relative distance
Goal: Recreate true scaffolds
Slide8
Paired Read Construction
[Figure: paired reads R1 and R2; labels: 2kb, 2kb, 100b, 100b, 10kb]
Paired Read Styles: Mate Pair, Paired End
[Figure: R1 and R2 on the same strand and orientation; R1 and R2 on different strand and orientation]
Slide9
Linkage Information
Possible States (mate pair)
Two contigs are adjacent if:
  A read pair spans the contigs
State (A, B, C, D)
  Depends on orientation of the read
  Order of contigs is arbitrary
Each read pair can be “consistent” with one of the four states
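As a rough illustration of assigning each read pair to one of the four states, the sketch below maps the strands of R1 (on contig i) and R2 (on contig j) to a state letter and tallies support per contig pair. The strand-to-state table and the names `STATE_OF`, `classify`, and `tally` are assumptions made for this example; the slides do not say which orientation corresponds to which letter.

```python
from collections import Counter

# Hypothetical mapping from (R1 strand on contig i, R2 strand on contig j)
# to a state letter; the real assignment depends on the library type.
STATE_OF = {("+", "+"): "A", ("+", "-"): "B", ("-", "+"): "C", ("-", "-"): "D"}

def classify(strand_r1, strand_r2):
    """Return the single state a read pair is consistent with."""
    return STATE_OF[(strand_r1, strand_r2)]

def tally(read_pairs):
    """Count supporting read pairs per (contig_i, contig_j, state)."""
    counts = Counter()
    for contig_i, strand_r1, contig_j, strand_r2 in read_pairs:
        if contig_i == contig_j:
            continue  # a pair inside one contig links nothing
        # Contig order is arbitrary, so store each pair under a fixed order
        # and swap the strands along with the contigs.
        if contig_i > contig_j:
            contig_i, contig_j = contig_j, contig_i
            strand_r1, strand_r2 = strand_r2, strand_r1
        counts[(contig_i, contig_j, classify(strand_r1, strand_r2))] += 1
    return counts

pairs = [
    ("c1", "+", "c2", "+"),
    ("c2", "+", "c1", "+"),
    ("c1", "+", "c2", "-"),
    ("c3", "-", "c3", "+"),
]
support = tally(pairs)
```

Storing every pair under a fixed contig order keeps all evidence for one contig pair in the same place, regardless of which read landed on which contig.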
[Figure: contig i and contig j, drawn 5’ to 3’, with reads R1 and R2 in the four states A, B, C, D]
Slide10
Scaffolding Graph
Nodes
  Nodes are contigs
Edges
  Adjacent contigs have 4 edges (one for each state)
  Weighted by overlap with repetitive region
[Figure: contig i and contig j joined by the edge for State A]
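A minimal sketch of such a graph, assuming it is stored as a dictionary with one entry per adjacent contig pair and one weight per state. Down-weighting a read pair by the fraction of it that overlaps a repeat is only one possible reading of "weighted by overlap with repetitive region"; `build_scaffold_graph` and `repeat_fraction` are hypothetical names.

```python
from collections import defaultdict

STATES = ("A", "B", "C", "D")

def build_scaffold_graph(links):
    """links: iterable of (contig_i, contig_j, state, repeat_fraction).

    repeat_fraction is the assumed share of the read pair that overlaps a
    repetitive region (0.0 = unique sequence, 1.0 = entirely repeat).
    """
    graph = defaultdict(lambda: dict.fromkeys(STATES, 0.0))
    for contig_i, contig_j, state, repeat_fraction in links:
        key = tuple(sorted((contig_i, contig_j)))  # one entry per contig pair
        graph[key][state] += 1.0 - repeat_fraction  # repeats contribute less
    return dict(graph)

links = [
    ("c1", "c2", "A", 0.0),
    ("c2", "c1", "A", 0.5),
    ("c2", "c3", "B", 1.0),
]
graph = build_scaffold_graph(links)
```

Each adjacent pair carries all four state weights, so a later optimization step can pick the state that agrees with the chosen orientations.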
Slide11
Integer Linear Program Formulation
Variables
  Contig pair state
  Contig orientation
  Adjacent contig consistency
Objective
  Maximize weight of consistent pairs
Slide12
Constraints
Variables
  Contig pair state
  Contig orientation
  Adjacent contig consistency
Pairwise Orientation
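One standard way to write a pairwise orientation constraint is to force a binary variable to equal the exclusive-or of the two contig orientations. The symbols $S_i$, $S_j$ (contig orientation) and $S_{ij}$ (pairwise orientation) are notation assumed for this sketch; the slides may use different symbols or a different linearization.

```latex
\begin{align}
S_{ij} &\le S_i + S_j \\
S_{ij} &\le 2 - S_i - S_j \\
S_{ij} &\ge S_i - S_j \\
S_{ij} &\ge S_j - S_i \\
S_i, S_j, S_{ij} &\in \{0, 1\}
\end{align}
```

With these four inequalities, $S_{ij} = 1$ exactly when the two contigs have different orientations, and $S_{ij} = 0$ when they agree.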
Slide13
Constraints
Variables
  Contig pair state
  Contig orientation
  Adjacent contig consistency
State Variables
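To make the link between state variables, orientations, and the objective concrete, here is a tiny exhaustive search that stands in for an ILP solver on a toy instance. It assumes each state simply names an orientation pair (A = both forward, D = both reversed, B and C mixed) and ignores ordering and cycle constraints; the table `STATE_ORIENT` and the weights are invented for illustration.

```python
from itertools import product

# Assumed meaning of each state: the orientations of (contig i, contig j).
STATE_ORIENT = {"A": (0, 0), "B": (0, 1), "C": (1, 0), "D": (1, 1)}

def consistent_weight(orient, weights):
    """Total weight of pair states that agree with the chosen orientations."""
    total = 0.0
    for (i, j), by_state in weights.items():
        for state, w in by_state.items():
            if STATE_ORIENT[state] == (orient[i], orient[j]):
                total += w
    return total

def best_orientation(contigs, weights):
    """Try every 0/1 orientation assignment; only feasible for toy inputs."""
    best = None
    for bits in product((0, 1), repeat=len(contigs)):
        orient = dict(zip(contigs, bits))
        score = consistent_weight(orient, weights)
        if best is None or score > best[0]:
            best = (score, orient)
    return best

weights = {
    ("c1", "c2"): {"A": 5.0, "B": 1.0},
    ("c2", "c3"): {"B": 4.0, "D": 2.0},
    ("c1", "c3"): {"A": 3.0},
}
score, orient = best_orientation(["c1", "c2", "c3"], weights)
```

The search is exponential in the number of contigs, which is why a real instance needs an ILP solver and the decomposition techniques described on later slides.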
Slide14
Constraints
Variables
  Contig pair state
  Contig orientation
  Adjacent contig consistency
Mutual Exclusivity
Slide15
Constraints
Forbid 2 Cycles
Forbid 3 Cycles
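A common way to forbid short cycles in an ILP is to bound the number of selected edges on each cycle by the cycle length minus one. Here $x_{ij}$ is an assumed binary indicator that contig $i$ is placed directly before contig $j$; this is a generic sketch and not necessarily the exact constraints shown on the slide.

```latex
\begin{align}
x_{ij} + x_{ji} &\le 1 && \text{(no 2-cycles)} \\
x_{ij} + x_{jk} + x_{ki} &\le 2 && \text{(no 3-cycles)}
\end{align}
```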
*larger cycles are broken at the end
Slide16
Largest Connected Component
Slide17
Graph Decomposition: Articulation Points
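Articulation points can be found with a single depth-first search. The sketch below is the textbook low-link method, shown on a made-up graph; it is meant only to illustrate where a scaffolding graph could be split, not to reproduce any particular tool's code. It is recursive, so it suits small graphs.

```python
def articulation_points(adj):
    """Return the set of articulation points of an undirected graph.

    adj maps each node to an iterable of neighbours (both directions listed).
    """
    disc, low, points = {}, {}, set()
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for nxt in adj[node]:
            if nxt == parent:
                continue
            if nxt in disc:
                low[node] = min(low[node], disc[nxt])  # back edge
            else:
                children += 1
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                # A non-root node is a cut vertex if a child cannot reach above it.
                if parent is not None and low[nxt] >= disc[node]:
                    points.add(node)
        # The DFS root is a cut vertex only if it has two or more DFS children.
        if parent is None and children > 1:
            points.add(node)

    for start in adj:
        if start not in disc:
            dfs(start, None)
    return points

# Two triangles sharing contig "c": removing "c" disconnects the graph.
adj = {
    "a": ["b", "c"], "b": ["a", "c"],
    "c": ["a", "b", "d", "e"],
    "d": ["c", "e"], "e": ["c", "d"],
}
cut = articulation_points(adj)
```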
[Figure labels: solve, Articulation point]
MIP, Salmela 2011
Slide18
Largest Biconnected Component
Slide19
Non-Serial Dynamic Programming
A technique which exploits the sparsity of the scaffolding graph by computing the solution in stages, incorporating the results from previous stages
~inspired by (Neumaier, 06)
Slide20
Non-Serial Dynamic Programming
2-cut
[Figure: a 2-cut whose two nodes take the orientation combinations + +, + -, - +, - -]
Slide21
Non-Serial Dynamic Programming
[Figure: orientation combinations + +, + -, - +, - - of the two cut nodes]
Objective Modification
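A minimal sketch of the staged idea, under a simplified model in which every contig has a 0/1 orientation and each pair weight is earned only when both orientations match. The component behind a 2-cut (u, v) is solved once per orientation combination of u and v, and the resulting four-entry table is added to the objective of the remaining graph as an extra u-v term. All names and weights are hypothetical, and brute force stands in for the ILP solver at each stage.

```python
from itertools import product

def score(orient, weights):
    """Sum the weights whose required orientation pair matches `orient`."""
    return sum(w for (i, j), table in weights.items()
               for (oi, oj), w in table.items()
               if (orient[i], orient[j]) == (oi, oj))

def brute_force(nodes, weights, fixed=None):
    """Best score over all 0/1 orientations of `nodes`, given `fixed` ones."""
    fixed = fixed or {}
    free = [n for n in nodes if n not in fixed]
    return max(score({**fixed, **dict(zip(free, bits))}, weights)
               for bits in product((0, 1), repeat=len(free)))

def eliminate_component(inner_nodes, inner_weights, cut):
    """Stage 1: solve the component once per orientation of the 2-cut (u, v)."""
    u, v = cut
    return {(ou, ov): brute_force(inner_nodes, inner_weights, {u: ou, v: ov})
            for ou, ov in product((0, 1), repeat=2)}

# Toy instance: x and y reach the rest of the graph only through u and v.
inner = {("u", "x"): {(0, 0): 3.0}, ("x", "y"): {(0, 1): 2.0}, ("y", "v"): {(1, 1): 4.0}}
outer = {("r", "u"): {(0, 1): 5.0}, ("r", "v"): {(0, 0): 6.0}}

table = eliminate_component(["u", "v", "x", "y"], inner, ("u", "v"))
# Stage 2: the table becomes an extra u-v term in the objective of the rest
# (this toy graph has no existing u-v edge that would need merging).
reduced = dict(outer)
reduced[("u", "v")] = table
staged = brute_force(["r", "u", "v"], reduced)
direct = brute_force(["r", "u", "v", "x", "y"], {**inner, **outer})
```

Because x and y interact with the rest of the graph only through u and v, the staged optimum equals the optimum of the full problem, while each stage works on fewer variables.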
Slide22
SPQR-tree Based Implementation
SPQR-tree efficiently finds 2-cuts (Tarjan, 73)
DFS of SPQR-tree is used to schedule elimination order for NSDP
Slide23
Post Processing ILP Solution
May have cycles
Not a total ordering for each connected component
[Figure: the ILP solution on contigs A-F is turned into a bipartite matching between outgoing copies (A-F) and incoming copies (A-F)]
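One way to read the bipartite step: every contig gets an outgoing copy and an incoming copy, each selected edge u -> v links outgoing u to incoming v, and a matching keeps at most one successor and one predecessor per contig. The sketch below uses the classic augmenting-path method for maximum cardinality; a maximum-weight objective would need a different algorithm, and the edge list is invented for illustration.

```python
def max_cardinality_matching(successors):
    """successors maps a contig to the contigs it may be placed directly before.

    Returns {contig: chosen successor}, with every contig used at most once
    on the outgoing side and at most once on the incoming side.
    """
    matched_pred = {}  # incoming contig -> outgoing contig currently matched

    def try_assign(u, seen):
        for v in successors.get(u, ()):
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current predecessor can move on.
            if v not in matched_pred or try_assign(matched_pred[v], seen):
                matched_pred[v] = u
                return True
        return False

    for u in successors:
        try_assign(u, set())
    return {u: v for v, u in matched_pred.items()}

# Hypothetical ILP output in which A has two candidate successors.
ilp_edges = {"A": ["C", "B"], "B": ["C"], "C": ["D"]}
chain = max_cardinality_matching(ilp_edges)
```

Here A first claims C, then gives it up for B so that B can take C, which yields the single chain A, B, C, D. A matching alone can still leave cycles, so a final cycle-breaking pass is still needed.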
Objectives:
  Max weight
  Max cardinality
  Max cardinality / Max weight
Slide24
GAGE Framework
Genome                    Size (Mb)   # reads
Staphylococcus aureus     2.9         3,494,070
Rhodobacter sphaeroides   4.6         2,050,868
Human Chr14               107         22,669,408
Assembled using: ABySS, Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet
Scaffolded using: SILP (our method), Opera, MIP, Bambus2
Slide25
Testing Metrics
TPN50
  Break scaffold at incorrect edges, then find N50
  N50: contig size such that contigs of this size or larger cover 50% of the total length
Binary Classification
  Given n contigs in a scaffold
  How many of n-1 adjacencies can you predict
  PPV
  Sensitivity
  MCC
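The metrics above can be sketched in a few lines. The input layout (contig lengths plus one correct/incorrect flag per adjacency) and the function names are assumptions for this example, and gap sizes between contigs are ignored.

```python
from math import sqrt

def n50(lengths):
    """Largest length L such that pieces of length >= L cover half the total."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0

def tp_n50(scaffolds):
    """scaffolds: list of (contig_lengths, edge_is_correct) per scaffold.

    Break each scaffold at its incorrect adjacencies, then take the N50
    of the resulting pieces.
    """
    pieces = []
    for lengths, edge_ok in scaffolds:
        current = lengths[0]
        for length, ok in zip(lengths[1:], edge_ok):
            if ok:
                current += length
            else:
                pieces.append(current)
                current = length
        pieces.append(current)
    return n50(pieces)

def classification_metrics(tp, fp, tn, fn):
    """PPV, sensitivity and Matthews correlation coefficient from counts."""
    ppv = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return ppv, sensitivity, mcc

example_n50 = n50([2, 3, 4, 5, 6])
example_tp_n50 = tp_n50([([10, 20, 30], [True, False]), ([40], [])])
ppv, sensitivity, mcc = classification_metrics(tp=8, fp=2, tn=5, fn=2)
```

In the second example the first scaffold is cut at its incorrect adjacency into pieces of 30 and 30, so the pieces are 30, 30, and 40 before the N50 is taken.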
Slide26
Results
Slide27
Results
Slide28
Results
Slide29
Results
Slide30
Conclusions
Success
  ILP solves scaffolding problem!
  NSDP works
Improvements
  Include SOAPdenovo, Allpaths-LG scaffolds in comparison
  Look at parameter effects
  Practical considerations (read style, multi-libraries, merge contigs)
Future Work
  Where else can I apply NSDP?
  Scaffold before assembly … promising
  Structural Variation??
Slide31
Questions?