Uploaded On 2017-12-30


Bioinformatics Pipeline for Fosmid based Molecular Haplotype Sequencing

Jorge Duitama (1,2), Thomas Huebsch (1), Gayle McEwen (1), Sabrina Schulz (1), Eun-Kyung Suk (1), Margret R. Hoehe (1)

1. Max Planck Institute for Molecular Genetics, Berlin, Germany

2. Department of Computer Science and Engineering, University of Connecticut, Storrs, CT, USA

MHC: Key Region for Common Diseases & Transplant Medicine

[Figure: map of the MHC region divided into MHC class I, MHC class III and MHC class II, with position labels 29,74; 31,59; 32,34; 33,21]

MHC: Variation amongst Haplotypes

Variation amongst 8 MHC Haplotypes:

- 37,451 Substitutions
- 7,093 Short Indels

[Figure: Variation of MHC Haplotypes against PGF reference. 7 further MHC haplotype sequences are compared with the PGF reference sequence; labelled features: RCCX CNV, HLA-DRB CNV, MHC class III, MHC class II]

Source: Variation and annotation map for eight MHC haplotypes, Horton et al. Immunogenetics (2008) 60, 1-18

Experimental Approach

[Workflow diagram; labels as they appear on the slide:]

- 100 Individuals, 100 Libraries
- 40 kb haploid molecules
- 5000 fosmids, one pool
- 3x96-well = 288 fosmid pools
- SNP Mapping for Prioritization of MHC Informative Pools
- Targeted Enrichment / Complete Fosmid Pool
- SOLiD NGS Platform: shotgunning complete 40 kb fosmids
- Data Analysis Pipeline: identification of 40 kb fosmid sequences
- Phasing molecular fosmid sequences (alleles A, G, T assigned to Haplotype A and Haplotype B)
- Contiguous MHC haplotype sequence

Data Analysis Pipeline

SOLiD Standard Pipeline:

- Read Alignment against Genome
- Pairing
- Consensus Calling
- SNP Analysis

In House Project Specific Analysis Pipeline:

- Fosmid Detection Program
- Fosmid Specific Matching Algorithm
- Fosmid Sequences Based Phasing
- Visualization & MHC Database


Mapping real data

Pool of 15,000 fosmids, 22 million reads, 50 bp


SNP calls: Haploid fosmids vs. genomic DNA

gDNA

# cov   ref   consen   F3        coord
335     C     Y        177/17    62511614
3345    T     C        3191/56   62512095
875     G     A        862/25    62513689
1795    G     K        722/23    62513754
707     C     S        528/13    62515375
2643    C     Y        1391/20   62517737
643     C     Y        417/23    62518998
1074    A     R        554/21    62522445
606     C     S        226/21    62524689
639     A     M        167/15    62532474
158     G     R        89/14     62533464
1032    A     R        443/26    62534973
7       A     G        7/4       62537153
775     T     G        742/26    62540402
10      G     C        10/5      62540465
698     G     C        684/29    62541769
40      C     T        40/4      62542550
94      C     G        93/9      62542574
286     C     T        283/16    62543011
194     C     A        190/22    62543067

Fosmid

# cov   ref   consen   F3        coord
595     C     T        572/91    62511614
3418    T     C        3278/98   62512095
2089    G     A        2048/98   62513689
2238    G     T        2194/98   62513754
1134    C     G        1107/73   62515375
3104    C     T        2922/98   62517737
1033    C     T        1014/83   62518998
1799    A     G        1753/98   62522445
1053    C     G        1049/83   62524689
54      G     A        39/22     62527964
32      A     C        27/23     62529870
1374    A     C        1355/95   62532474
973     G     A        946/97    62533464
2850    A     G        2745/98   62534973
49      A     G        48/33     62537153
1888    T     G        1845/95   62540402
37      G     C        36/20     62540465
923     G     C        901/97    62541769
8411    A     W        2006/78   62542258
253     C     T        253/47    62542550
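The consensus column uses IUPAC ambiguity codes (Y, K, S, R, M, W) where the diploid genomic DNA shows two alleles, while the haploid fosmid pool mostly shows a single base at the same coordinate. A minimal sketch of that comparison follows. The rows are copied from the slide's two tables; the function name `describe` and the notion of "consistent" (the fosmid base is one of the alleles called in gDNA) are illustrative assumptions, not part of the presented pipeline.

```python
# IUPAC nucleotide ambiguity codes (standard definitions).
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}

# (coord, ref, gDNA consensus, fosmid consensus), taken from the slide.
ROWS = [
    (62511614, "C", "Y", "T"),
    (62512095, "T", "C", "C"),
    (62513754, "G", "K", "T"),
    (62515375, "C", "S", "G"),
    (62522445, "A", "R", "G"),
    (62532474, "A", "M", "C"),
]

def describe(ref, gdna, fosmid):
    """Hypothetical helper: summarise one position."""
    alleles = IUPAC[gdna]
    zygosity = "het" if len(alleles) == 2 else "hom"
    # Assumption: a fosmid call is "consistent" if its base is among the gDNA alleles.
    consistent = IUPAC[fosmid] <= alleles
    carried = "ref" if fosmid == ref else "alt"
    return zygosity, "/".join(sorted(alleles)), consistent, carried

summary = {coord: describe(ref, g, f) for coord, ref, g, f in ROWS}
for coord, (zyg, alleles, ok, carried) in summary.items():
    status = "consistent" if ok else "mismatch"
    print(coord, zyg, alleles, f"fosmid carries {carried} allele ({status})")
```

In these rows every heterozygous gDNA call resolves to a single allele in the fosmid pool, which is the information later used for phasing.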

SNP Calling Accuracy in the MHC

Affymetrix genotype information for 1583 SNP positions as reference standard:

- Homozygous identical with reference: 957
- Heterozygous: 562
- Homozygous different from reference: 64

Compared to variants called from the SOLiD sequenced genomic DNA sample (15x average read coverage):

- Percentage of error in genotype calling: 3.66%
- False positive rate: 0.1%
- False negative rate: 9.25%


Fosmids Detection

Fosmid Detection Algorithm

- Assign each read to a single 1kb long bin. Select bins with more than 5 reads
- Perform allele calls for each heterozygous SNP. Mark bins with heterozygous calls
- Cluster adjacent bins as belonging to the same fosmid if:
  - The gap distance between them is less than 10kb and
  - There are no bins with heterozygous SNPs between them
- Keep fosmids with lengths between 3kb and 60kb

[Figure: UCSC Genome browser view, http://genome.ucsc.edu/, Kent et al. 2002 Genome Res. 12(6):996-1006.]
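The detection steps listed above can be sketched as follows. This is a minimal illustration, not the in-house program: it assumes each read is represented by its start coordinate, that bins with heterozygous calls are supplied as a precomputed set, that such bins are excluded from fosmids and act as breakpoints, and that fosmid length is measured from the start of the first bin to the end of the last bin. The names `detect_fosmids` and `reads_for` are hypothetical.

```python
from collections import Counter

BIN_SIZE = 1_000                    # 1 kb bins
MIN_READS = 6                       # "more than 5 reads"
MAX_GAP = 10_000                    # cluster only if the gap is below 10 kb
MIN_LEN, MAX_LEN = 3_000, 60_000    # keep fosmids of 3 kb to 60 kb

def detect_fosmids(read_starts, het_bins=frozenset()):
    """Return (start, end) intervals of putative fosmids in one pool."""
    counts = Counter(pos // BIN_SIZE for pos in read_starts)
    # Assumption: bins marked heterozygous are not part of any fosmid.
    covered = sorted(b for b, n in counts.items()
                     if n >= MIN_READS and b not in het_bins)
    clusters, current = [], []
    for b in covered:
        if current:
            gap = (b - current[-1] - 1) * BIN_SIZE
            blocked = any(current[-1] < h < b for h in het_bins)
            if gap >= MAX_GAP or blocked:
                clusters.append(current)
                current = []
        current.append(b)
    if current:
        clusters.append(current)
    intervals = [(c[0] * BIN_SIZE, (c[-1] + 1) * BIN_SIZE) for c in clusters]
    return [(s, e) for s, e in intervals if MIN_LEN <= e - s <= MAX_LEN]

def reads_for(bins, per_bin):
    """Synthetic read starts: per_bin reads inside each listed bin."""
    return [b * BIN_SIZE + 10 * i for b in bins for i in range(per_bin)]

# Toy pool: a 40 kb fosmid, an under-covered bin, a 2 kb fragment,
# and a 40 kb stretch interrupted by one heterozygous bin (320).
reads = (reads_for(range(100, 140), 6) + reads_for([150], 5)
         + reads_for(range(200, 202), 6) + reads_for(range(300, 340), 6))
fosmids = detect_fosmids(reads, het_bins={320})
print(fosmids)
```

In the toy pool the heterozygous bin splits the last stretch into two intervals, reflecting the idea that a heterozygous call inside a pool indicates overlapping molecules from both haplotypes.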

Fosmids Detection

[Figure: Size distribution of read-contigs; fosmid sized contigs at 20 – 50 kb]


Haplotyping

Locus   Event       Alleles
1       SNV         C,T
2       Deletion    C,-
3       SNV         A,G
4       Insertion   -,GC

Locus   Event       Alleles Hap 1   Alleles Hap 2
1       SNV         T               C
2       Deletion    C               -
3       SNV         A               G
4       Insertion   -               GC

The process of grouping alleles that are present together on the same chromosome copy of an individual is called haplotyping.

Single Individual Haplotyping

Input: Matrix M of m fragments covering n loci

Locus   1   2   3   4   5   ...   n
f1      -   0   1   1   0         0
f2      1   1   0   -   1         1
f3      0   0   0   1   1         -
...
fm      -   -   1   -   1         1


ReFHap Problem Formulation

For two alleles a1, a2: [formula not preserved]

For two rows i1, i2 of M: [formula not preserved]

f1      -   0   1    1   0
f2      1   1   1    -   1
Score   0   1   -1   0   1

s(M,1,2) = 1
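The two definitions on this slide were images and did not survive extraction. A reconstruction that is consistent with the worked example (0 where either fragment has a gap, 1 where the alleles differ, -1 where they agree, summed over loci to give s(M,1,2) = 0 + 1 - 1 + 0 + 1 = 1) would be the following. It is inferred from the example, so treat it as an assumption rather than the authors' exact notation.

```latex
s(a_1, a_2) =
\begin{cases}
  -1 & \text{if } a_1 \neq -,\ a_2 \neq -,\ a_1 = a_2 \\
  \phantom{-}1 & \text{if } a_1 \neq -,\ a_2 \neq -,\ a_1 \neq a_2 \\
  \phantom{-}0 & \text{otherwise}
\end{cases}
\qquad
s(M, i_1, i_2) = \sum_{j=1}^{n} s\bigl(M[i_1, j],\, M[i_2, j]\bigr)
```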

ReFHap Problem Formulation

For a cut I of rows of M: [formula not preserved]
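The formula for the score of a cut is also missing from the transcript. Given that the algorithm slides reduce the problem to Max-Cut and compare candidate cuts by s(M, I), a natural reading is that the score of a cut sums the pairwise row scores over all pairs of fragments separated by the cut. This is an assumption based on that reduction, not a formula recovered from the slide.

```latex
s(M, I) = \sum_{i_1 \in I} \; \sum_{i_2 \notin I} s(M, i_1, i_2)
```

Under this reading, fragments that disagree at many loci contribute positive weight when placed on opposite sides, so a maximum-score cut separates fragments by haplotype of origin.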

ReFHap Algorithm

- Reduce the problem to Max-Cut.
- Solve Max-Cut
- Build haplotypes according to the cut

Locus   1   2   3   4   5
f1      -   0   1   1   0
f2      1   1   0   -   1
f3      1   -   -   0   -
f4      -   0   0   -   1

[Figure: graph on fragments 1-4 with edge weights 3, 1, 1, -1, -1]

h1   00110
h2   11001
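The three steps can be traced on the four-fragment example with a short sketch. It assumes the pairwise score implied by the earlier worked example (0 for a gap, 1 for differing alleles, -1 for equal alleles), solves Max-Cut by brute force (feasible only for a toy instance and not the heuristic ReFHap uses), and builds the haplotypes by a per-locus majority vote, which is an illustrative choice. All function names are hypothetical.

```python
from itertools import combinations

# Fragment matrix from the slide: rows f1..f4, loci 1..5, "-" = not covered.
M = ["-0110", "110-1", "1--0-", "-00-1"]

def allele_score(a, b):
    if a == "-" or b == "-":
        return 0
    return 1 if a != b else -1

def row_score(r1, r2):
    return sum(allele_score(a, b) for a, b in zip(r1, r2))

def cut_score(weights, side):
    # Sum the weights of edges whose endpoints lie on different sides.
    return sum(w for (i, j), w in weights.items() if (i in side) != (j in side))

def best_cut(weights, n):
    """Brute-force Max-Cut; fragment 0 is fixed on one side to break symmetry."""
    best, best_side = float("-inf"), frozenset()
    for k in range(n):
        for extra in combinations(range(1, n), k):
            side = frozenset((0, *extra))
            score = cut_score(weights, side)
            if score > best:
                best, best_side = score, side
    return best_side, best

def build_haplotypes(M, side):
    """Majority vote per locus; fragments outside `side` vote complemented."""
    flip = {"0": "1", "1": "0"}
    h1 = ""
    for j in range(len(M[0])):
        votes = {"0": 0, "1": 0}
        for i, row in enumerate(M):
            if row[j] != "-":
                votes[row[j] if i in side else flip[row[j]]] += 1
        h1 += "0" if votes["0"] >= votes["1"] else "1"  # ties and gaps default to 0
    return h1, "".join(flip[a] for a in h1)

weights = {(i, j): row_score(M[i], M[j])
           for i, j in combinations(range(len(M)), 2)}
side, score = best_cut(weights, len(M))
h1, h2 = build_haplotypes(M, side)
print(weights, sorted(side), score, h1, h2)
```

The nonzero edge weights come out as 3, 1, 1, -1, -1, matching the numbers in the slide's graph, and the best cut places f1 alone on one side, which yields h1 = 00110 and h2 = 11001 as shown on the slide.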

ReFHap Algorithm

Build G=(V,E,w) from M
Sort E from largest to smallest weight
Init I with a random subset of V
For each e in the first k edges
    I' ← GreedyInit(G,e)
    I' ← GreedyImprovement(G,I')
    If s(M, I) < s(M, I') then
        I ← I'
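The outer loop above can be sketched in Python once the two subroutines are given some concrete behaviour. The slides name GreedyInit and GreedyImprovement but do not define them, so the versions below are assumptions: `greedy_init` places the endpoints of the seed edge on opposite sides and adds every other vertex to the side that gains more cut weight, and `greedy_improvement` repeatedly flips a single vertex or both endpoints of an edge while the cut score increases. The graph is passed as a dictionary of edge weights, here the weights of the four-fragment example.

```python
import random

def cut_score(w, side):
    return sum(x for (i, j), x in w.items() if (i in side) != (j in side))

def weight(w, i, j):
    return w.get((i, j), w.get((j, i), 0))

def greedy_init(vertices, w, edge):
    """Assumed behaviour: seed with one edge, then place vertices greedily."""
    u, v = edge
    side, placed = {u}, {u, v}
    for x in vertices:
        if x in placed:
            continue
        gain_in = sum(weight(w, x, y) for y in placed if y not in side)
        gain_out = sum(weight(w, x, y) for y in placed if y in side)
        if gain_in > gain_out:
            side.add(x)
        placed.add(x)
    return side

def greedy_improvement(vertices, w, side):
    """Assumed behaviour: local search over vertex flips and edge flips."""
    side, best = set(side), cut_score(w, side)
    improved = True
    while improved:
        improved = False
        for move in [(x,) for x in vertices] + list(w):
            trial = side.symmetric_difference(move)
            score = cut_score(w, trial)
            if score > best:
                side, best, improved = trial, score, True
    return side

def refhap_like_cut(vertices, w, k, seed=0):
    rng = random.Random(seed)
    best = {x for x in vertices if rng.random() < 0.5}   # random initial cut
    edges = sorted(w, key=w.get, reverse=True)           # heaviest edges first
    for e in edges[:k]:
        cand = greedy_improvement(vertices, w, greedy_init(vertices, w, e))
        if cut_score(w, best) < cut_score(w, cand):
            best = cand
    return best

# Edge weights of the four-fragment example (fragments indexed from 0).
w = {(0, 1): 3, (0, 2): 1, (0, 3): 1, (1, 2): -1, (1, 3): -1}
cut = refhap_like_cut([0, 1, 2, 3], w, k=2)
print(sorted(cut), cut_score(w, cut))
```

Restricting the search to the k heaviest edges keeps the number of restarts small while seeding each one with two fragments that very likely come from different haplotypes.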

ReFHap Algorithm

Classical greedy algorithm

[Figure: graph on vertices 1-4, shown before and after the greedy step]

ReFHap Algorithm

Edge flipping

[Figure: graph on vertices 1-4, shown before and after an edge flip]

Phasing the MHC: Mixed Diploid vs Fosmid-Based NGS

                          Mixed Diploid                        Fosmid-Based                    Ratio (Fosmid-Based / Mixed Diploid)
Libraries                 Mate Pair & Paired End Genomic DNA   Paired End, 16 Barcoded Pools
Uniquely Mapped           47 Gb                                15 Gb                           1/3rd
Number of Blocks          407                                  40                              1/10th
Av. Block Length          438 bp                               85 kb                           194 x
Max. Block Length         3.7 kb                               691 kb                          186 x
Total Length all Blocks   178 kb                               3.4 Mb                          19 x
% of Phased SNPs          12 %                                 66 %                            5 x
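The last column of the comparison can be checked by dividing the fosmid-based value by the mixed-diploid value for each metric. The short sketch below does this with the numbers from the slide converted to base pairs; the dictionary keys are illustrative names only.

```python
# Values from the comparison slide, converted to base pairs where applicable.
mixed_diploid = {
    "uniquely_mapped_bp": 47e9,
    "number_of_blocks": 407,
    "avg_block_bp": 438,
    "max_block_bp": 3_700,
    "total_block_bp": 178_000,
    "phased_snp_pct": 12,
}
fosmid_based = {
    "uniquely_mapped_bp": 15e9,
    "number_of_blocks": 40,
    "avg_block_bp": 85_000,
    "max_block_bp": 691_000,
    "total_block_bp": 3_400_000,
    "phased_snp_pct": 66,
}

fold = {key: fosmid_based[key] / mixed_diploid[key] for key in mixed_diploid}
for key, value in fold.items():
    print(f"{key}: {value:.2f}")
```

The computed ratios (about 0.32, 0.10, 194, 187, 19 and 5.5) agree with the rounded figures on the slide: roughly a third of the mapped data gives a tenth as many blocks that are about two orders of magnitude longer.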

Phasing MHC: Preliminary Results

- Number of blocks: 8
- N50 block length: 793 kb
- Maximum block length: 1.6 Mb
- Total extent of all blocks: 3.8 Mb
- Fraction of MHC phased into haplotype blocks: 95%
- Number of heterozygous SNPs: 8030
- Fraction of SNPs phased: 86%
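For readers unfamiliar with the N50 statistic reported above: it is the block length L such that blocks of length at least L together cover at least half of the total phased extent. A minimal sketch using the common definition follows; the function name `n50` and the toy lengths are illustrative and are not the block lengths from this study.

```python
def n50(lengths):
    """Length L such that blocks of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty list")

# Toy block lengths in kb (hypothetical, for illustration only).
toy_blocks = [40, 30, 20, 10]
print(n50(toy_blocks))
```

Unlike the mean, the N50 is dominated by the long blocks, which is why it is often reported alongside the maximum block length when summarising phasing results.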

Acknowledgements

The Life Tech Team: Kevin McKernan, Alexander Sartori, Clarence Lee, Dustin Holloway, Jessica Spangler, Heather Peckham, Tristen Weaver, Stephen McLaughlin, Tamara Gilbert, Tim Harkins

Anita Suk, Sabrina Schulz, Steffi Palczewski, Britta Horstmann, Margret Hoehe, Gayle McEwen, Roger Horton, Thomas Hübsch

Thank You!

Comparison of Mapping Algorithms

COX haplotype, simulated reads

Phasing MHC