Slide1
Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing
Jorge Duitama (1,2), Thomas Huebsch (1), Gayle McEwen (1), Sabrina Schulz (1), Eun-Kyung Suk (1), Margret R. Hoehe (1)
1. Max Planck Institute for Molecular Genetics, Berlin, Germany
2. Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA
Slide2
MHC: Key Region for Common Diseases & Transplant Medicine
[Figure: map of the MHC with the class I, class III and class II regions; position labels 29.74, 31.59, 32.34 and 33.21 (units not shown)]
Slide3
MHC: Variation amongst Haplotypes
Variation and annotation map for eight MHC haplotypes, Horton et al., Immunogenetics (2008) 60, 1-18
Variation amongst 8 MHC haplotypes:
- 37,451 substitutions
- 7,093 short indels
[Figure: variation of 7 further MHC haplotype sequences against the PGF reference sequence, across MHC class III and MHC class II; marked regions: RCCX CNV, HLA-DRB CNV]
Slide4
Experimental Approach
[Figure: workflow diagram with the following labels]
- 100 individuals, 100 libraries
- 40 kb haploid molecules
- 5000 fosmids per pool; 3x96-well = 288 fosmid pools
- SNP mapping for prioritization of MHC informative pools
- Targeted enrichment / complete fosmid pool
- SOLiD NGS platform: shotgunning complete 40 kb fosmids
- Data analysis pipeline: identification of 40 kb fosmid sequences
- Phasing molecular fosmid sequences (alleles A, G, T; Haplotype A, Haplotype B)
- Contiguous MHC haplotype sequence
Slide5
Data Analysis Pipeline
SOLiD Standard Pipeline:
- Read Alignment against Genome
- Pairing
- Consensus Calling
- SNP Analysis
In House Project Specific Analysis Pipeline:
- Fosmid Detection Program
- Fosmid Specific Matching Algorithm
- Fosmid Sequences Based Phasing
- Visualization & MHC Database
Slide6
Data Analysis Pipeline (overview diagram repeated from Slide5)
Slide7
Mapping real data
Pool of 15,000 fosmids, 22 million reads of 50 bp
Slide8
Data Analysis Pipeline (overview diagram repeated from Slide5)
Slide9
gDNA
# cov  ref  consen  F3        coord
335    C    Y       177/17    62511614
3345   T    C       3191/56   62512095
875    G    A       862/25    62513689
1795   G    K       722/23    62513754
707    C    S       528/13    62515375
2643   C    Y       1391/20   62517737
643    C    Y       417/23    62518998
1074   A    R       554/21    62522445
606    C    S       226/21    62524689
639    A    M       167/15    62532474
158    G    R       89/14     62533464
1032   A    R       443/26    62534973
7      A    G       7/4       62537153
775    T    G       742/26    62540402
10     G    C       10/5      62540465
698    G    C       684/29    62541769
40     C    T       40/4      62542550
94     C    G       93/9      62542574
286    C    T       283/16    62543011
194    C    A       190/22    62543067

Fosmid
# cov  ref  consen  F3        coord
595    C    T       572/91    62511614
3418   T    C       3278/98   62512095
2089   G    A       2048/98   62513689
2238   G    T       2194/98   62513754
1134   C    G       1107/73   62515375
3104   C    T       2922/98   62517737
1033   C    T       1014/83   62518998
1799   A    G       1753/98   62522445
1053   C    G       1049/83   62524689
54     G    A       39/22     62527964
32     A    C       27/23     62529870
1374   A    C       1355/95   62532474
973    G    A       946/97    62533464
2850   A    G       2745/98   62534973
49     A    G       48/33     62537153
1888   T    G       1845/95   62540402
37     G    C       36/20     62540465
923    G    C       901/97    62541769
8411   A    W       2006/78   62542258
253    C    T       253/47    62542550
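The two tables can be compared position by position. At a heterozygous site the gDNA consensus is an IUPAC ambiguity code (for example Y = C or T), while a haploid fosmid is expected to show a single base taken from that code. A minimal Python sketch of this check follows; the IUPAC mapping is the standard one, the rows are a small subset copied from the tables, and the function names are hypothetical and not part of the presented pipeline.

```python
# Standard IUPAC nucleotide codes that occur in the consensus columns.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}

# (coord, ref, gDNA consensus, fosmid consensus): a few rows from the tables.
ROWS = [
    (62511614, "C", "Y", "T"),
    (62512095, "T", "C", "C"),
    (62513754, "G", "K", "T"),
    (62515375, "C", "S", "G"),
    (62522445, "A", "R", "G"),
    (62532474, "A", "M", "C"),
]


def is_heterozygous(consensus: str) -> bool:
    """A diploid consensus is heterozygous if its code stands for two bases."""
    return len(IUPAC[consensus]) == 2


def fosmid_call_is_compatible(gdna_consensus: str, fosmid_consensus: str) -> bool:
    """True if every base of the fosmid call is among the alleles of the gDNA call."""
    return IUPAC[fosmid_consensus] <= IUPAC[gdna_consensus]


for coord, ref, gdna, fosmid in ROWS:
    zygosity = "het" if is_heterozygous(gdna) else "hom"
    print(coord, ref, gdna, fosmid, zygosity, fosmid_call_is_compatible(gdna, fosmid))
```

In the selected rows every fosmid base is one of the two alleles of the corresponding gDNA call, which is what a haploid molecule should look like.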
SNP calls: Haploid fosmids vs. genomic DNA
Slide10
SNP Calling Accuracy in the MHC
Affymetrix genotype information for 1583 SNP positions as reference standard:
- Homozygous identical with reference: 957
- Heterozygous: 562
- Homozygous different from reference: 64
Compared to variants called from the SOLiD sequenced genomic DNA sample (15x average read coverage):
- Percentage of error in genotype calling: 3.66%
- False positive rate: 0.1%
- False negative rate: 9.25%
Slide11
Data Analysis Pipeline (overview diagram repeated from Slide5)
Slide12
[Figure: UCSC Genome Browser view, http://genome.ucsc.edu/; Kent et al. 2002, Genome Res. 12(6):996-1006]
Fosmid Detection Algorithm
1. Assign each read to a single 1 kb long bin. Select bins with more than 5 reads.
2. Perform allele calls for each heterozygous SNP. Mark bins with heterozygous calls.
3. Cluster adjacent bins as belonging to the same fosmid if:
   - the gap distance between them is less than 10 kb, and
   - there are no bins with heterozygous SNPs between them.
4. Keep fosmids with lengths between 3 kb and 60 kb.
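The detection steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's Fosmid Detection Program: reads are binned by start coordinate, the set of heterozygous bins is passed in ready-made (allele calling is not shown), a heterozygous bin is assumed to end the current cluster, and `detect_fosmids` is a hypothetical name.

```python
from collections import Counter

BIN_SIZE = 1_000                   # 1 kb bins
MIN_READS = 5                      # keep bins with more than 5 reads
MAX_GAP = 10_000                   # join bins separated by less than 10 kb
MIN_LEN, MAX_LEN = 3_000, 60_000   # accepted fosmid lengths


def detect_fosmids(read_starts, het_bins):
    """Return (start, end) intervals of candidate fosmids on one chromosome.

    read_starts: iterable of read start coordinates.
    het_bins: set of bin indices marked as containing heterozygous calls.
    """
    counts = Counter(pos // BIN_SIZE for pos in read_starts)
    covered = sorted(b for b, n in counts.items() if n > MIN_READS)

    clusters, current = [], []
    for b in covered:
        if b in het_bins:
            # Assumption: a heterozygous bin closes the current cluster
            # and is not itself assigned to any fosmid.
            if current:
                clusters.append(current)
            current = []
            continue
        if current:
            gap = (b - current[-1] - 1) * BIN_SIZE
            het_between = any(current[-1] < h < b for h in het_bins)
            if gap >= MAX_GAP or het_between:
                clusters.append(current)
                current = []
        current.append(b)
    if current:
        clusters.append(current)

    fosmids = []
    for cluster in clusters:
        start, end = cluster[0] * BIN_SIZE, (cluster[-1] + 1) * BIN_SIZE
        if MIN_LEN <= end - start <= MAX_LEN:
            fosmids.append((start, end))
    return fosmids


# Synthetic example: 40 covered bins (6 reads each) plus a 2 kb island far away.
reads = [b * BIN_SIZE + i for b in list(range(100, 140)) + [200, 201] for i in range(6)]
print(detect_fosmids(reads, het_bins=set()))  # [(100000, 140000)]
```

The 2 kb island is dropped by the length filter; marking a bin inside the 40 kb stretch as heterozygous would split it into two shorter candidates.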
Fosmid Detection
Slide13
Fosmid Detection
[Figure: size distribution of read-contigs; fosmid sized contigs labelled at 20-50 kb]
Slide14
Data Analysis Pipeline (overview diagram repeated from Slide5)
Slide15
Haplotyping
Locus  Event      Alleles
1      SNV        C,T
2      Deletion   C,-
3      SNV        A,G
4      Insertion  -,GC

Locus  Event      Alleles Hap 1  Alleles Hap 2
1      SNV        T              C
2      Deletion   C              -
3      SNV        A              G
4      Insertion  -              GC

The process of grouping alleles that are present together on the same chromosome copy of an individual is called haplotyping.
Slide16
Single Individual Haplotyping
Input: Matrix M of m fragments covering n loci
Locus  1  2  3  4  5  ...  n
f1     -  0  1  1  0  ...  0
f2     1  1  0  -  1  ...  1
f3     0  0  0  1  1  ...  -
...
fm     -  -  1  -  1  ...  1
Slide17
Single Individual Haplotyping (input matrix M repeated from Slide16; Slide18 and Slide19 repeat it again)
Slide20
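ReFHap Problem Formulation
For two alleles a_1, a_2:
$$ s(a_1, a_2) = \begin{cases} -1 & \text{if } a_1 \neq -,\ a_2 \neq -,\ a_1 = a_2 \\ 1 & \text{if } a_1 \neq -,\ a_2 \neq -,\ a_1 \neq a_2 \\ 0 & \text{otherwise} \end{cases} $$
For two rows i_1, i_2 of M:
$$ s(M, i_1, i_2) = \sum_{j=1}^{n} s(M[i_1, j], M[i_2, j]) $$
f1     -  0  1  1  0
f2     1  1  1  -  1
Score  0  1 -1  0  1
s(M,1,2) = 1
Slide21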
ReFHap Problem Formulation
For a cut I of rows of M:
$$ s(M, I) = \sum_{i_1 \in I} \sum_{i_2 \notin I} s(M, i_1, i_2) $$
Slide22
ReFHap Algorithm
Locus  1  2  3  4  5
f1     -  0  1  1  0
f2     1  1  0  -  1
f3     1  -  -  0  -
f4     -  0  0  -  1
[Figure: graph on nodes 1, 2, 3, 4 with edge weights 3, 1, 1, -1, -1]
h1  00110
h2  11001
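The example above can be reproduced with a short Python sketch. It assumes the per-locus score implied by the worked example on the problem formulation slide (0 if an allele is missing, -1 if the two alleles agree, +1 if they differ), solves Max-Cut by exhaustive search, which is only feasible for a toy input like this one, and builds the haplotypes by a per-locus majority vote. The voting rule and all function names are assumptions for illustration, not the ReFHap implementation.

```python
from itertools import combinations

# Fragment matrix from the slide; "-" marks a locus not covered by the fragment.
FRAGMENTS = {1: "-0110", 2: "110-1", 3: "1--0-", 4: "-00-1"}


def allele_score(a1, a2):
    """0 if either allele is missing, -1 if they agree, +1 if they differ."""
    if a1 == "-" or a2 == "-":
        return 0
    return -1 if a1 == a2 else 1


def edge_weights(fragments):
    """Weight each pair of fragments that share at least one covered locus."""
    weights = {}
    for i, j in combinations(sorted(fragments), 2):
        pairs = list(zip(fragments[i], fragments[j]))
        if any(a != "-" and b != "-" for a, b in pairs):
            weights[(i, j)] = sum(allele_score(a, b) for a, b in pairs)
    return weights


def cut_score(weights, side):
    """Sum of the weights of edges with exactly one endpoint in `side`."""
    return sum(w for (i, j), w in weights.items() if (i in side) != (j in side))


def best_cut(fragments, weights):
    """Exhaustive Max-Cut; the first node is fixed on one side to skip mirror cuts."""
    first, *rest = sorted(fragments)
    candidates = (
        {first, *extra}
        for r in range(len(rest) + 1)
        for extra in combinations(rest, r)
    )
    return max(candidates, key=lambda side: cut_score(weights, side))


def build_haplotypes(fragments, side):
    """Majority vote per locus; fragments outside `side` vote for the complement."""
    n_loci = len(next(iter(fragments.values())))
    h1 = ""
    for locus in range(n_loci):
        votes = 0  # positive favours allele 1 on h1
        for name, frag in fragments.items():
            if frag[locus] == "-":
                continue
            vote = 1 if frag[locus] == "1" else -1
            votes += vote if name in side else -vote
        h1 += "1" if votes > 0 else "0"  # ties go to 0 (arbitrary choice)
    h2 = "".join("1" if a == "0" else "0" for a in h1)
    return h1, h2


weights = edge_weights(FRAGMENTS)
cut = best_cut(FRAGMENTS, weights)
print(weights)                            # {(1, 2): 3, (1, 3): 1, (1, 4): 1, (2, 3): -1, (2, 4): -1}
print(cut, cut_score(weights, cut))       # {1} 5
print(build_haplotypes(FRAGMENTS, cut))   # ('00110', '11001')
```

Under these assumptions the computed edge weights and haplotypes agree with the values shown on the slide; fragments 3 and 4 share no locus, so no edge is created between them.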
1. Reduce the problem to Max-Cut.
2. Solve Max-Cut.
3. Build haplotypes according to the cut.
Slide23
ReFHap Algorithm
Build G = (V, E, w) from M
Sort E from largest to smallest weight
Init I with a random subset of V
For each e in the first k edges:
    I' ← GreedyInit(G, e)
    I' ← GreedyImprovement(G, I')
    If s(M, I) < s(M, I') then I ← I'
Slide24
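The ReFHap pseudocode (Build G, sort E, try the first k edges) can be turned into a runnable Python sketch. The slides do not spell out GreedyInit or GreedyImprovement, so both bodies below are assumptions: `greedy_init` puts the endpoints of the seed edge on opposite sides and places each remaining node where it adds the most cut weight, and `greedy_improvement` is a local search that flips single nodes or both endpoints of an edge while the cut score increases. Building G from M is omitted; the edge weights are passed in as a dictionary.

```python
import random


def cut_score(weights, side):
    """Sum of the weights of edges with exactly one endpoint in `side`."""
    return sum(w for (i, j), w in weights.items() if (i in side) != (j in side))


def weight_to(node, group, weights):
    """Total weight of the edges between `node` and the nodes in `group`."""
    total = 0
    for (i, j), w in weights.items():
        if (i == node and j in group) or (j == node and i in group):
            total += w
    return total


def greedy_init(nodes, weights, seed_edge):
    """Assumed GreedyInit: split the seed edge, then place the other nodes greedily."""
    u, v = seed_edge
    side, other = {u}, {v}
    for x in nodes:
        if x in side or x in other:
            continue
        # Placing x in `side` cuts its edges towards `other`, and vice versa.
        if weight_to(x, other, weights) >= weight_to(x, side, weights):
            side.add(x)
        else:
            other.add(x)
    return side


def greedy_improvement(nodes, weights, side):
    """Assumed GreedyImprovement: node flips and edge flips until no gain is left."""
    side = set(side)
    best = cut_score(weights, side)
    moves = [{x} for x in nodes] + [set(edge) for edge in weights]
    improved = True
    while improved:
        improved = False
        for move in moves:
            candidate = side ^ move  # symmetric difference flips the chosen nodes
            score = cut_score(weights, candidate)
            if score > best:
                side, best, improved = candidate, score, True
    return side


def refhap_cut(nodes, weights, k, rng=None):
    """Try the k heaviest edges as seeds and keep the best cut found."""
    rng = rng or random.Random(0)
    edges = sorted(weights, key=weights.get, reverse=True)
    best = {x for x in nodes if rng.random() < 0.5}  # random initial subset
    for edge in edges[:k]:
        candidate = greedy_init(nodes, weights, edge)
        candidate = greedy_improvement(nodes, weights, candidate)
        if cut_score(weights, best) < cut_score(weights, candidate):
            best = candidate
    return best


# Edge weights of the four-fragment example graph.
example_weights = {(1, 2): 3, (1, 3): 1, (1, 4): 1, (2, 3): -1, (2, 4): -1}
result = refhap_cut([1, 2, 3, 4], example_weights, k=2)
print(result, cut_score(example_weights, result))
```

Because the local search only accepts moves that strictly increase the cut score, it terminates; the cut it returns is a local optimum and is not guaranteed to be the maximum cut on larger graphs.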
ReFHap Algorithm
Classical greedy algorithm
[Figure: two drawings of the graph on nodes 1, 2, 3, 4]
Slide25
ReFHap Algorithm
Edge flipping
[Figure: two drawings of the graph on nodes 1, 2, 3, 4]
Slide26
Phasing the MHC: Mixed Diploid vs Fosmid-Based NGS
                         Mixed Diploid                       Fosmid-Based                   Change
Libraries                Mate Pair & Paired End Genomic DNA  Paired End, 16 Barcoded Pools
Uniquely Mapped          47 Gb                               15 Gb                          1/3rd
Number of Blocks         407                                 40                             1/10th
Av. Block Length         438 bp                              85 kb                          194 x
Max. Block Length        3.7 kb                              691 kb                         186 x
Total Length all Blocks  178 kb                              3.4 Mb                         19 x
% of Phased SNPs         12 %                                66 %                           5 x
Slide27
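The fold changes quoted in the comparison of mixed diploid and fosmid-based phasing follow directly from the two value columns. A small Python check, with lengths converted to base pairs and the row labels shortened; the tuple layout and function name are illustrative only.

```python
# (metric, mixed diploid value, fosmid-based value, change reported on the slide)
ROWS = [
    ("Av. block length (bp)", 438, 85_000, 194),
    ("Max. block length (bp)", 3_700, 691_000, 186),
    ("Total length of all blocks (bp)", 178_000, 3_400_000, 19),
    ("Phased SNPs (%)", 12, 66, 5),
]


def fold_change(mixed, fosmid):
    """Ratio of the fosmid-based value to the mixed diploid value."""
    return fosmid / mixed


for metric, mixed, fosmid, reported in ROWS:
    print(f"{metric}: {fold_change(mixed, fosmid):.1f}x (slide: {reported} x)")

# The two reductions quoted as fractions on the slide.
print(f"Uniquely mapped: {15 / 47:.2f} of the mixed diploid data (slide: 1/3rd)")
print(f"Number of blocks: {40 / 407:.2f} of the mixed diploid count (slide: 1/10th)")
```

The computed ratios match the slide when rounded down to whole numbers (for example 66 / 12 = 5.5 is shown as 5 x).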
Phasing MHC: Preliminary Results
- Number of blocks: 8
- N50 block length: 793 kb
- Maximum block length: 1.6 Mb
- Total extent of all blocks: 3.8 Mb
- Fraction of MHC phased into haplotype blocks: 95%
- Number of heterozygous SNPs: 8030
- Fraction of SNPs phased: 86%
Slide28
Acknowledgements
The Life Tech Team: Kevin McKernan, Alexander Sartori, Clarence Lee, Dustin Holloway, Jessica Spangler, Heather Peckham, Tristen Weaver, Stephen McLaughlin, Tamara Gilbert, Tim Harkins
Anita Suk, Sabrina Schulz, Steffi Palczewski, Britta Horstmann, Margret Hoehe, Gayle McEwen, Roger Horton, Thomas Hübsch
Thank You!
Slide29
Comparison of mapping algorithms
COX haplotype, simulated reads
Slide30
Phasing MHC