Slide1
Optimizing pooling strategies for the massive next-generation sequencing of viral samples
Pavel Skums (1)
Joint work with Olga Glebova (2), Alex Zelikovsky (2), Ion Mandoiu (3) and Yury Khudyakov (1)
(1) Centers for Disease Control and Prevention, Atlanta, GA
(2) Georgia State University, Atlanta, GA
(3) University of Connecticut, Storrs, CT
Slide2
Outline
1. Massive NGS of viral samples
2. Optimal pooling design problem
3. Algorithm and results
Slide3
NGS in epidemiology
Molecular surveillance
Prediction of epidemic progress
Transmission networks
Epidemiological parameters
Vaccination strategies
Slide4
NGS in epidemiology
Large-scale molecular surveillance requires sequencing of unprecedentedly large sets of viral samples.
NGS of tens of thousands of samples is highly cost- and labor-intensive.
Example: sequencing 100K samples using a 454 Junior system with 50 MIDs, doing one sequencing run per day
Cost: 5000 × (100 000)/50 = $10 000 000
Time: (100 000)/50 = 2000 days ≈ 5.5 years
Slide5
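As a rough check of the cost and time estimate on the previous slide, the arithmetic can be sketched in Python. The function name `sequencing_budget` and its parameters are illustrative assumptions; the per-run cost of $5,000 is read off the slide's formula and is not an independently sourced price.

```python
def sequencing_budget(n_samples, samples_per_run, cost_per_run, runs_per_day=1):
    """Return (runs, total cost, days) for sequencing every sample separately."""
    # Ceiling division: a partially filled run still counts as a full run.
    runs = -(-n_samples // samples_per_run)
    total_cost = runs * cost_per_run
    days = runs / runs_per_day
    return runs, total_cost, days


# Figures from the slide: 100K samples, 50 MIDs per run, one run per day.
runs, cost, days = sequencing_budget(100_000, 50, 5_000)
print(runs, cost, round(days / 365, 1))  # 2000 10000000 5.5
```

Changing `samples_per_run` or `runs_per_day` shows how strongly the total depends on the number of runs, which is the quantity the pooling design tries to reduce.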
Optimal Pooling Design Problem
Goal: a framework for identifying viral sequences from a large number of samples using the smallest possible number of NGS runs.
Idea: for n samples, generate m pools (i.e. mixtures of samples) with m << n in such a way that every sample is uniquely identified by the pools to which it belongs.
Slide6
Optimal Pooling Design Problem
Example.
8 samples: 1,2,3,4,5,6,7,8
Sequencing each sample separately: 8 runs
4 pools: M1 = {1,2,3,4}, M2 = {5,6,7,8}, M3 = {1,2,5,6}, M4 = {1,3,5,7}
Sequencing pools: 4 runs
Slide7
Optimal Pooling Design Problem
4 pools: M1 = {1,2,3,4}, M2 = {5,6,7,8}, M3 = {1,2,5,6}, M4 = {1,3,5,7}
{1} = M1 ∩ M3 ∩ M4
{2} = (M1 ∩ M3) \ M4
……………………………………
{8} = (M2 \ M3) \ M4
Slide8
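The set identities on the previous slide can be checked mechanically. The sketch below is only an illustration: the helper names `signature` and `isolate` are hypothetical, and the pools are the four pools M1 to M4 from the example. A sample is recovered by intersecting the pools that contain it and subtracting the pools that do not.

```python
# The four pools from the example (samples are numbered 1..8).
pools = {
    "M1": {1, 2, 3, 4},
    "M2": {5, 6, 7, 8},
    "M3": {1, 2, 5, 6},
    "M4": {1, 3, 5, 7},
}
universe = set(range(1, 9))


def signature(sample, pools):
    """Membership pattern of a sample across the pools."""
    return tuple(sample in members for members in pools.values())


def isolate(sample, pools, universe):
    """Intersect the pools containing the sample, subtract the others."""
    result = set(universe)
    for members in pools.values():
        result = result & members if sample in members else result - members
    return result


for s in sorted(universe):
    print(s, signature(s, pools), isolate(s, pools, universe))
```

Every sample has a distinct membership pattern, which is exactly why `isolate` returns a single-element set for each of the eight samples.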
Optimal Pooling Design Problem
Problem 1 (Optimal Pooling Design Problem).
Given: a set of samples S = {S1,…,Sn}
Find: a set of pools P = {P1,…,Pm}, Pk ⊆ S for k = 1,…,m, such that
1) P1 ∪ … ∪ Pm = S
2) for every Si, Sj ∈ S there exists Pk ∈ P such that |Pk ∩ {Si,Sj}| = 1 (Pk separates Si and Sj)
3) m is minimal
Theorem 1. There exists a solution of Problem 1 with m = log₂(n) + 1
Slide9
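One construction that reaches the bound of Theorem 1 is a binary encoding of the sample indices. The sketch below is an assumption about how such a design can be built, not necessarily the construction used by the authors; the names `binary_pools` and `is_valid_design` are hypothetical, and for n that is not a power of two it uses ⌈log₂(n)⌉ + 1 pools.

```python
def binary_pools(n):
    """Build ceil(log2(n)) + 1 pools for samples 1..n from their binary codes."""
    if n < 2:
        raise ValueError("need at least two samples")
    bits = (n - 1).bit_length()  # equals ceil(log2(n)) for n >= 2
    samples = range(1, n + 1)
    # One pool per bit position: samples whose code (sample - 1) has a 0 there.
    zero_pools = [
        {s for s in samples if not ((s - 1) >> b) & 1}
        for b in range(bits - 1, -1, -1)
    ]
    # One extra pool so that a sample with an all-ones code is covered too.
    top_one = {s for s in samples if ((s - 1) >> (bits - 1)) & 1}
    return [zero_pools[0], top_one] + zero_pools[1:]


def is_valid_design(pools, n):
    """Check conditions 1) and 2) of Problem 1."""
    samples = range(1, n + 1)
    covered = set().union(*pools) == set(samples)
    # All pairs are separated exactly when all membership patterns differ.
    patterns = {tuple(s in p for p in pools) for s in samples}
    return covered and len(patterns) == n


print(binary_pools(8))
```

For n = 8 this reproduces the pools M1 to M4 of the earlier example; two distinct samples differ in some bit, so the pool for that bit contains exactly one of them.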
Optimal Pooling Design Problem
Additional conditions for the problem
Condition: each pool contains at most k samples, |Pj| ≤ k for j = 1,…,m
Reasons: the number of reads that can be obtained by each NGS technology is bounded; if a large number of samples are mixed in one pool, some of them may be lost due to PCR bias
Condition: each sample belongs to at least l pools, |{j : Si ∈ Pj}| ≥ l for i = 1,…,n
Reason: to ensure sufficient coverage for the sequences of each sample
Condition: some pairs of samples should not be put into the same pool
Reason: some samples may intersect (if they belong to the same transmission cluster)
Slide12
Optimal Pooling Design Problem
Condition: each pool Pj should be a clique of the graph G(S)
Reason: some samples may intersect (if they belong to the same transmission cluster)
A graph G(S) = (V,E), where V = S and SiSj ∈ E if and only if there is confidence that the samples Si and Sj do not intersect.
Slide13
Optimal Pooling Design Problem
Problem 2 (Minimum Clique Test Set Problem).
Given: a graph G = G(S), natural numbers k, l
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i = 1,…,m
2) every vertex v ∈ V(G) belongs to at least l cliques from P
3) for every u,v ∈ V(G) there exists Pi ∈ P such that |Pi ∩ {u,v}| = 1
4) m is minimal
Slide14
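Conditions 1) to 3) of Problem 2 are straightforward to verify for a candidate set of pools. The checker below is a minimal sketch under assumed conventions: the function name `is_clique_test_set` is hypothetical, the graph is given as a vertex list plus an edge list, and pools are Python sets. It does not test minimality of m.

```python
from itertools import combinations


def is_clique_test_set(vertices, g_edges, pools, k, l):
    """Check conditions 1)-3) of Problem 2 for a candidate list of pools."""
    edges = {frozenset(e) for e in g_edges}
    for pool in pools:
        # 1) size bound.
        if len(pool) > k:
            return False
        # Each pool must be a clique of G.
        if any(frozenset(pair) not in edges for pair in combinations(pool, 2)):
            return False
    # 2) every vertex belongs to at least l pools.
    if any(sum(v in pool for pool in pools) < l for v in vertices):
        return False
    # 3) every pair of vertices is separated by some pool.
    return all(
        any(len(pool & {u, v}) == 1 for pool in pools)
        for u, v in combinations(vertices, 2)
    )


# Complete graph on 8 samples with the pools from the earlier example.
V = list(range(1, 9))
E = list(combinations(V, 2))
P = [{1, 2, 3, 4}, {5, 6, 7, 8}, {1, 2, 5, 6}, {1, 3, 5, 7}]
print(is_clique_test_set(V, E, P, k=4, l=1))
```

With l = 2 the same pools fail, because sample 8 belongs to only one pool.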
Optimal Pooling Design Problem
Minimum Test Set Problem (Garey, Johnson)
Given: a collection Q = {Q1,…,Qn} of subsets of a finite set S
Find: a subcollection P = {P1,…,Pm} ⊆ Q such that
1) for every si, sj ∈ S there exists Pr ∈ P such that |Pr ∩ {si,sj}| = 1
2) m is minimal
Slide15
Problem reformulations
Minimum Clique Test Set Problem
Given: a graph G = G(S), natural numbers k, l
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i = 1,…,m
2) every vertex v ∈ V(G) belongs to at least l cliques from P
3) for every u,v ∈ V(G) there exists Pi ∈ P such that |Pi ∩ {u,v}| = 1
4) m is minimal
Only some pairs of vertices should be separated:
a graph H with V(H) = V(G), where uv ∈ E(H) if and only if u and v should be separated
Slide16
Problem reformulations
Minimum Clique Test Set Problem
Given: graphs G and H on the same set of vertices V, natural numbers k, l
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i = 1,…,m
2) every vertex v ∈ V(G) belongs to at least l cliques from P
3) for every uv ∈ E(H) there exists Pi ∈ P such that |Pi ∩ {u,v}| = 1
4) m is minimal
Only some pairs of vertices should be separated:
a graph H with V(H) = V(G), where uv ∈ E(H) if and only if u and v should be separated
Slide17
Problem reformulations
Minimum Clique Test Set Problem
Given: graphs G and H on the same set of vertices V, natural numbers k, l
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i = 1,…,m
2) every vertex v ∈ V belongs to at least l cliques from P
3) for every uv ∈ E(H) there exists Pi ∈ P such that |Pi ∩ {u,v}| = 1
4) m is minimal
Replace each vertex v ∈ V(G) with l pairwise non-adjacent copies
Slide18
Problem reformulations
Minimum Clique Test Set Problem
Given: graphs G and H on the same set of vertices V, natural number k
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i = 1,…,m
2) P1 ∪ … ∪ Pm = V(G)
3) for every uv ∈ E(H) there exists Pi ∈ P such that |Pi ∩ {u,v}| = 1
4) m is minimal
Replace each vertex v ∈ V(G) with l pairwise non-adjacent copies
Slide19
Problem reformulations
Minimum Clique Test Set Problem
Given: graphs G and H on the same set of vertices V, natural number k
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i = 1,…,m
2) P1 ∪ … ∪ Pm = V
3) for every uv ∈ E(H) there exists Pi ∈ P such that |Pi ∩ {u,v}| = 1
4) m is minimal
For every u ∈ V add a vertex x_u and an edge ux_u ∈ E(H)
[Figure: vertices 1, 2, 3 and the added vertices x1, x2, x3]
Slide21
Problem reformulations
Minimum Clique Test Set Problem
Given: graphs G and H on the same set of vertices, natural number k
Find: a set of cliques P = {P1,…,Pm} of G such that
1) |Pi| ≤ k for every i = 1,…,m
3) for every uv ∈ E(H) there exists Pi ∈ P such that |Pi ∩ {u,v}| = 1
4) m is minimal
For every u ∈ V add a vertex x_u and an edge ux_u ∈ E(H)
[Figure: vertices 1, 2, 3 and the added vertices x1, x2, x3]
Slide22
Heuristic algorithm
Input: graphs G and H on the same set of vertices V, natural number k
P := ∅
WHILE ∪_{C ∈ P} C ≠ V OR E(H) ≠ ∅
    find maximum cut (A,B) in H (using local search);
    for every a ∈ A put w(a) = # of neighbors of a from B in H;
    for every b ∈ B put w(b) = # of neighbors of b from A in H;
    find maximum clique C1 with |C1| ≤ k in a subgraph G[A] with weights w;
    find maximum clique C2 with |C2| ≤ k in a subgraph G[B] with weights w;
    C := argmax{w(C1), w(C2)};
    P := P ∪ {C};
    E(H) := E(H) \ {uv : u ∈ C, v ∈ V \ C}
ENDWHILE
Slide23
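The loop on the previous slide can be sketched in Python as follows. This is an illustrative approximation, not the authors' implementation: all function names are hypothetical, the maximum-weight clique step is replaced by a simple greedy clique (a swapped-in technique), and the loop runs only while E(H) is non-empty, on the assumption that coverage has already been encoded in H by the auxiliary x_u vertices of the last reformulation.

```python
def local_search_max_cut(vertices, h_edges):
    """Flip single vertices while that increases the number of cut edges."""
    side = {v: 0 for v in vertices}
    nbrs = {v: set() for v in vertices}
    for u, v in h_edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    improved = True
    while improved:
        improved = False
        for v in vertices:
            same = sum(side[u] == side[v] for u in nbrs[v])
            if same > len(nbrs[v]) - same:  # moving v cuts more edges
                side[v] ^= 1
                improved = True
    return side, nbrs


def greedy_clique(candidates, g_adj, w, k):
    """Greedy stand-in for the maximum-weight clique of size at most k."""
    clique = []
    for v in sorted(candidates, key=lambda x: -w[x]):
        if len(clique) < k and all(u in g_adj[v] for u in clique):
            clique.append(v)
    return set(clique)


def heuristic_pools(vertices, g_edges, h_edges, k):
    """Return cliques of G (size <= k) that separate every edge of H."""
    g_adj = {v: set() for v in vertices}
    for u, v in g_edges:
        g_adj[u].add(v)
        g_adj[v].add(u)
    remaining = {frozenset(e) for e in h_edges}  # pairs still to separate
    pools = []
    while remaining:
        side, nbrs = local_search_max_cut(vertices, [tuple(e) for e in remaining])
        # w(v) = number of H-neighbors of v on the other side of the cut.
        w = {v: sum(side[u] != side[v] for u in nbrs[v]) for v in vertices}
        candidates = (
            greedy_clique([v for v in vertices if side[v] == s], g_adj, w, k)
            for s in (0, 1)
        )
        best = max(candidates, key=lambda c: sum(w[v] for v in c))
        pools.append(best)
        # Drop every pair that the new pool separates.
        remaining = {e for e in remaining if len(e & best) != 1}
    return pools


V = list(range(8))
all_pairs = [(u, v) for u in V for v in V if u < v]
print(heuristic_pools(V, all_pairs, all_pairs, k=4))
```

Each iteration removes at least one edge of H, because a locally optimal cut of a non-empty edge set always has a vertex with a neighbor across the cut, so the loop terminates.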
Algorithm results
1. All samples are unrelated (i.e. G is a complete graph)
# samples    # generated pools
4            3
8            4
16           5
32           6
64           7
128          8
256          9
512          10
1024         11
2048         12
Slide24
Algorithm results
2. Some samples are related (G is a random graph with edge probability p)
Slide25
Thank you!