/
CSE291: Personal genomics for CSE291: Personal genomics for

CSE291: Personal genomics for - PowerPoint Presentation

basidell
basidell . @basidell
Follow
344 views
Uploaded On 2020-08-03

CSE291: Personal genomics for - PPT Presentation

bioinformaticians Class meetings TR 330450 MCGIL 2315 Office hours M 300500 W 400500 CSE 4216 Contact mgymrekucsdedu Todays schedule 330 355 Sequence alignment 4 55 ID: 795960

reads bwt reference calling bwt reads calling reference alignment snp genome bwa read short match mapping string variant www

Share:

Link:

Embed:

Download Presentation from below link

Download The PPT/PDF document "CSE291: Personal genomics for" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Slide1

CSE291: Personal genomics for

bioinformaticians

Class meetings: TR 3:30-4:50 MCGIL 2315

Office hours: M 3:00-5:00, W 4:00-5:00 CSE 4216

Contact:

mgymrek@ucsd.edu

Today’s schedule:

3:30

-

3:55

Sequence alignment

4

:55-

4

:25

Variant calling – SNPs/indels

4:

25-

4

:

30

Break

4

:

30

-

4:50

lobSTR

: genotyping STRs from short reads

Announcements

:

Proposals due today

Time to work on PS4 on Tuesday (bring laptops)

Slide2

Short read alignment and variant calling

CSE291: Personal Genomics for

Bioinformaticians

02/

16/

17

Slide3

Outline

Mapping using BWT

Local realignment

Basic SNP calling approaches

State of the art variant calling

Genotyping STRs from short reads

Slide4

Mapping using BWT

Slide5

The data! – “FASTQ”

ERR194146_1.fastq

@ERR194146.1 HSQ1008:141:D0CC8ACXX:3:1308:20201:36071/1

ACATCTGGTTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCA

+

?@@FFBFFDDHHBCEAFGEGIIDHGH@GDHHHGEHID@C?GGDG@FHIGGH@FHBEG:GGIEEHH;@CHH=EDFFBBE>;@?3;;;>;;;AA?<C>C@@C:

@ERR194146.2 HSQ1008:141:D0CC8ACXX:4:1208:5231:93701/1

CCAGCGTCTCGCAATGCTATCGCGTGCATACCCCCCAGACGAAAATACCAAATGCATGGAGAGCTCCCGTGAGTGGTTAATAGGGTGATAGACCTGTGATC

+

?88ADA?BDF8F:@@DB49C>B7C806B<?DEF=@FFC=AABEA<;@D>AC265;ADA;@@?<@B@A<;?5(8+:0>(4:@AAA832?@@BBBB:<4(:>>

@ERR194146.3 HSQ1008:141:D0CC8ACXX:2:1305:18078:8420/1GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGG+@CCFFFFFHHFGHJIGJJJJJJJJJJJIJIIJJJIJJJIGJJJJIIJIGIHIJI=FFIJJJJHHJHGHHDD;@CDDEDDDDDBDDDCECDDDDDBDDDDDD

ERR194146_2.fastq

@ERR194146.1 HSQ1008:141:D0CC8ACXX:3:1308:20201:36071/2CCAGCGTCTCGCAATGCTATCGCGTGCATACCCCCCAGACGAAAATACCAAATGCATGGAGAGCTCCCGTGAGTGGTTAATAGGGTGATAGACCTGTGATC+<?@DDDBDHHHAAH@CG@FBBGGIEG@D@GCHFG@F/5=C4=B'=7?CA9@?ACDCDCA@<?BDCCCB=BB0<4>?4?>@:CCD???A:@C3>A>AC:++:@ERR194146.2 HSQ1008:141:D0CC8ACXX:4:1208:5231:93701/2CCCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGG+@??BD1:AFHBDDGGHIJIF@EEE>GFHGHEAG?DEG<FD>B;?BB<*/9??C8BB@FHC7=CCGAE;=,5,,.;@@>(5(;>ACCC(,,5;>CCB88<??@ERR194146.3 HSQ1008:141:D0CC8ACXX:2:1305:18078:8420/2GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTATGCACGCGATAGCATTGCGAGACGCTGG+@CCFFFFFHHFGHJIGJJJJJJJJJJJIJIIJJJIJJJIGJJJJIIJIGIHIJI=FFIJJJJHHJHGHHDD;@CDDEDDDDDBDDDCECDDDDDBDDDDDD

(first end of each pair)

(second end of each pair)

Slide6

Challenge: Rapidly align short reads to a reference

Goal:

Where in the genome does each read originate?

Reference genome: ~3×10

9

bp

Reads: ~100bp

Solution:

Index genome in quickly searchable data structure

Genome

Index

?

(Burrows Wheeler Transform)

ACGATAG

Chr1:3899292

ATCGACGATAGCCG

ACGATAG

Slide7

Burrows-Wheeler Transform

String T

:

AACATCGAGCTACGTACGTACGTACGTACGTACGTACGTACGACGAGGGGAGAGAAT

BWT(T)

(Burrows Wheeler Transform

)

Size O(|T|), searching in time O(|query string|)

T

Burrows Wheeler Transform (and related auxiliary data structures) allows:

Compression

Indexing/searching

Slide8

Burrows-Wheeler Transform

String T: BANANA$

m

=|T| (length of T)

T ends with a terminator character ($), lexicographically prior to all other characters

Creating BWT(T)

Write down rotations of T

Sort the rows lexicographically to make BWM(T) (Burrows Wheeler Matrix)

The final column of BWM(T) gives BWT(T)

Step 1:

B A N A N A

$$ B A N A N AA

$ B A N A NN A $ B A N AA N A $ B A NN A N A $ B AA N A N A $ BStep 2:$ B A N A N AA

$ B A N A NA N A $

B A N

A N A N A

$

B

B A N A N A

$

N A

$

B A N A

N A N A

$

B A

BWT(T) = ANNB$AA

Slide9

Relationship to suffix arrays

String T: BANANA$

Creating

SuffixArray

(T)

Write T’s suffixes, note the starting index of each suffix

Sort lexicographically

BWM(T)

SuffixArray

(T)

Suffix given by SA

$ B A N A N A

A $ B A N A NA N A $ B A NA N A N A $ BB A N A N A $N A $ B A N AN A N A $ B A$ A $A N A

$A N A N A

$

B A N A N A

$

N A

$

N A N A

$

6

5

3

1

04

2BWT(T)[i

] =

T[SA[i]-1] if SA[i

]>0

$ if SA[i

]==0

Slide10

BWT for compression

BWT is

reversible

, necessary for compression

BWT(T) has same length as T, so not immediately clear where compression comes in.

BUT BWT often brings similar characters together, easier to shrink with run length encoding methods, e.g. bzip2

BWT(BANANA$) =

ANNB$AA

BWT(tomorrow and tomorrow and tomorrow) =

wwwdd

nnoooaatttmmmrrrrrrooo $

ooo

Slide11

Reversing the BWT with LF mapping

Step 1:

B

0

A

0

N

0

A

1

N1 A2 $0$0

B0 A0

N0 A1 N1 A2A2 $0 B0 A0 N0 A1 N1

N1 A2

$

0

B

0

A

0

N

0

A

1A1 N1

A2 $0

B

0 A0 N0

N0

A1 N

1 A2 $

0

B0

A0A0

N0 A1

N

1

A

2

$

0

B

0

Step 2:

Create the BWM, but this time include order of each character

e.g.:

B

0

A

0

N

0

A

1

N

1

A

2

$

$

0

B

0

A

0

N

0

A

1

N

1

A

2

A

2

$

0

B0 A0 N0 A1 N1A1 N1 A2 $0 B0 A0 N0A0 N0 A1 N1 A2 $0 B0B0 A0 N0 A1 N1 A2 $0N1 A2 $0 B0 A0 N0 A1N0 A1 N1 A2 $0 B0 A0

LF Property

: the

i

th

occurrence of a character

c

in the last column (L) has the same rank as the

i

th

occurrence of

c

in the first column (F)

e.g. for A, has order 2, 1, 0 in both L and F

e.g. for N, has order 1, 0 in both L and F

Slide12

Reversing the BWT with LF mapping

T =

B

0

A

0

N

0

A

1

N1 A2 $$0

B0 A0 N0

A1 N1 A2A2 $0 B0 A0 N0 A1 N1A

1 N1 A

2

$

0

B

0

A

0

N

0

A0 N0 A1

N1 A2

$0

B0B

0 A0

N0 A

1 N1 A2

$

0N

1 A2 $

0 B0

A

0

N

0

A

1

N

0

A

1

N

1

A

2

$

0

B

0

A

0

BWM(T)=

First col (F)

Last col (L)

Rank in last col.

$

A

A

A

B

0

N

N

A

N

N

B

$

A

A

0

0

1

0

0

23Start with the first row, with $ in first columnLast column of first row must have character to the left in T, so A$. Rank of A is 0, so must correspond to 1st “A” in F. So we get NA$. Rank of N is 0, so must correspond to 1st “N” in F. So we get ANA$.Repeat steps to get BANANA$

Slide13

Searching using LF mapping

T =

B

0

A

0

N

0

A

1

N1 A2 $$0 B

0 A0 N0

A1 N1 A2A2 $0 B0 A0 N0 A1 N1A

1 N1 A

2

$

0

B

0

A

0

N

0

A0 N0 A1

N1 A2

$0

B0B

0 A0

N0 A

1 N1 A2

$

0N

1 A2 $

0 B0

A

0

N

0

A

1

N

0

A

1

N

1

A

2

$

0

B

0

A

0

BWT(T)=

F

Rank in last col.

0

0

1

0

0

2

3

L

Searching for

P =

N A N

Backwards matching:

repeatedly apply LF mapping to range of rows prefixed by successively longer suffixes of P.

First find rows beginning with shortest suffix of P, “N”

From “L”, see both are

preceeded

by “A”, corresponding to “A2” and “A3”

Now find all rows beginning with next-longest suffix of P, “AN”

See only one is

preceeded

by “N”, corresponding to “N1”

Number of rows remaining = number of occurrences of P in T.

Slide14

String search with FM-index

Searching with LF-mapping requires O(m) time, where m=|T|

. We’d like O(|P|)

Augment rank array with m x |

Σ

| matrix. Each row gives how many times that character has been observed up to and including that position in BWT(T)

F

$ A B N

1

1

11

123

L$ B A N A N AA $ B A N A NA N A $ B A NA N A N A $ BB A N A N A $N A $ B A N AN A N A $ B A

0

0

0

1

1

1

1

0

1

2

2

2

22

Occ0

00

0111

Now, searching for P=N A N

Look for row beginning with first suffix, “N”. Look up range of next character using corresponding row of Occ

(A). Only need to look at the first and last.Now, only need to lookup next character “A” with the ranks of those occurrencesStill only gives us how many times P occurs in T, not where!

Slide15

Looking up offsets

BWM(T)

SuffixArray

(T)

$

B A N A N A

A

$

B A N A N

A N A

$ B A NA N A N A $ BB A N A N A

$N A $ B A N AN A N A

$ B A6531042T = B0 A0 N0 A1 N1

A2 $

Searching for

P =

N A N

F

L

Suffix array tells us that the match occurs at offset 2 in the original string T!

To save space, we can store only subsets of

SuffixArray

and

Occ

tables

Allows string searching in O(length of query string). Independent of reference (T) size!

Slide16

Tools for

mapping short

reads

Tool

Publication

Link

Notes

Maq

Li,

Ruan

, Durbin 2008 Genome Research

http://

maq.sourceforge.net

/

maq-man.shtml

Widely used early on

Novoalign

www.novocraft.com

/

http://

www.novocraft.com

/

High

sensitivity

bwa

aln

Li and Durbin 2009 Bioinformatics

https://github.com/lh3/bwa

Allows short indels, widely

used (e.g. 1000 Genomes)

bowtie

Langmead

,

et al. 2009 Genome Biology

http://bowtie-

bio.sourceforge.net

/

index.shtml

no

indel

support

bowtie2

Langmead

&

Salzberg

2012 Nature Methods

http://bowtie-

bio.sourceforge.net

/bowtie2/

index.shtml

Allows

indels, faster

bwa

mem

Li

2013

Arxiv

https://github.com/lh3/bwa

Faster, more

indel

sensitive

Many others: e.g.

Stampy

,

mrsFast

, SOAP

Slide17

BWA Aligner

Slide18

BWA

Exact matching:

backward search

as described above

Inexact matching: find intervals of the reference that match the query string with no more than z differences (mismatches or gaps)

Use bounded traversal/backtracking.

In practice, use a “seed” sequence from beginning of read, which must have minimal match to the genome

The more errors in the reads, the slower the inexact mapping process is

Slide19

BWA - continued

BWA-SW:

Designed specifically to align longer reads (up to 1Mb)

Build FM-index for both the reference and query sequence

Dynamic programming to compare the two, similar to the Smith

Waterman algorithm (coming up)

BWA-MEM

“Seed-and-extend” paradigm: for each position, find longest exact match

Chain together co-linear seeds

Extend the longest seedsAllows for larger gaps, variable length reads

Slide20

Example alignment view

Reference genome

Aligned reads

Slide21

Local realignment

Slide22

Needleman-

Wunsch

pairwise alignment

Goal:

align two sequences to each other in order to maximize some score function

Score: (Example)

Match: +1

Mismatch: -1

Indel

: -1

E.g.: GCATGCA and GATTACAGCATG-CAG-ATTACAScore = 2.

Possible alignments between two length N sequences is 22N/sqrt(pi*N). Need a smart way to narrow this down!

Slide23

Needleman-

Wunsch

pairwise alignment

G

C

A

T

G

C

A

0-1-2-3-4

-5-6-7G

-1A-2T-3T-4

A

-5

C

-6

A

-7

Dynamic programming principle:

simplify the problem by solving

subproblems

.

Subproblem

solutions

make it easier to compute the overall solution

For NW alignment:

D[

i, j] gives optimal alignment ending at position A[

i] and B[j]. Where “A” and “B” are the two sequences and “D” is the matrix depicted here.

Slide24

Needleman-

Wunsch

pairwise alignment

G

C

A

T

G

C

A

0

-1-2-3

-4-5-6-7G-110-1-2-3-4-5A-20010

-1-2-3

T

-3

-1

-1

0

2

1

0

-1

T

-4

-2-2

-1110-1

A

-5-3-3-1

000

-1C

-6-4-2

-2-1

-11

0A-7

-5-3-1-2

-2

0

2

D[

i,j

] = max{

D[i-1, j-1] + match(A[

i

], B[j])

D[i-1,j] + G

D[i,j-1] + G

Match(A, B) = +1 if A=B, else -1

G = gap penalty

Trace back the process to create the alignment!

Variation (Smith-Waterman), doesn’t require end-to-end alignment

Slide25

Affine gap penalties

Motivation:

Scoring scheme above treats all gaps the same

More plausible to have one long gap vs. multiple short gaps

ACGATCATCAAAAAAAAAAAACTGATGTTCATA

ACGATCATC-A-AA--A--A-CTGATGTTCATA

ACGATCATCAAAAA-------CTGATGTTCATA

Affine gap penalty:

Gap open penalty (expensive) (O)

Gap extension penalty (cheaper) (E)

Score = O + E*L, where L is length of gap

Slide26

CIGAR Scores

M: match

D: deletion

I: insertion

ACGATCATCAAAAAAAAAACTGATGTTCATA

ACGATCATCAAAAAAAAAACTGATGTTCATA

Reference

Read

31M

ACGATCATCAAAAAAAAAAAACTGATGTTCATA

Reference

Read

19M2I12MACGATCATCAAAAAAAAAA--CTGATGTTCATAACGATCATC--AAAAAAAACTGATGTTCATAReference

Read

9M2D20M

ACGATCATCAAAAAAAAAACTGATGTTCATA

Slide27

CIGAR Scores

Slide28

Basic SNP calling approaches

Slide29

SNP calling overview

BAM file (read alignments)

SNP caller

VCF file (list of locations that vary from the reference genome)

Slide30

SNP calling – take 1

At a given position we see:

N total reads

K match the reference allele (A)

N-K don’t match (B)

Goal:

classify each column as:

Homozygous reference (

AA

)

Heterozygous (

AB)Homozygous non-reference (BB

)Algorithm I: use allele frequenciesK/N in [0, 0.2) -> homozygous referenceK/N in [0.2, 0.8] -> heterozygousK/N in (0.8, 1] -> homozygous non-referenceLimitations?

Slide31

SNP calling – a probabilistic approach

Recall Bayes Theorem:

P(G|D) =

P(D|G)P(G)

P(D)

Posterior

Likelihood

Prior

“G”: genotype (AA, AB, BB)

“D”: data (the list of reads)

P(G|D): (posterior) probability of a given genotype, given the observed data

P(D|G): (likelihood) probability of the data, given an underlying genotype

P(G): (prior) probability to observe a given genotype

Maximum a posteriori (MAP) approach: find assignment for G that maximizes the posterior P(G|D)Maximum likelihood approach:

find assignment for G that maximizes the likelihood P(D|G)

Slide32

SNP calling – maximum likelihood

N: total number of reads

K: number of non-reference alleles

e

: error rate

P(N,K | G=<AA>) =

(1-e)

N-

k

e

kN

k(

)P(N,K | G=<BB>) = (1-e)keN-kNk

()

P(N,K | G=<AB>) =

(0.5)

k

(0.5)

N-k

N

k

(

)

Can calculate (approximately) the likelihoods as:

Slide33

SNP calling – maximum a posteriori

N: total number of reads

K: number of non-reference alleles

e

: error rate

r

: prior probability of a heterozygous genotype

P(G=<AA>| N, K) =

(1-e)

N-

kek × (1-r)/2 / P(D)

N

k()(1-e)keN-k × (1-r)/2 / P(D)N

k

(

)

(0.5)

k

(0.5)

N-k

× r

/ P(D)

N

k

(

)

P(G=<BB>| N, K) =

P(G=<AB>| N, K) =

Slide34

SNP calling – alternative approaches

Heuristic:

E.g.

Varscan

,

samtools

Maximum a posteriori:

E.g. MAQ

Expectation maximization:

E.g.

SNVMixMaximum likelihood:E.g. GATK returns genotype likelihood scores

Slide35

SNP calling – complications and artifacts

MANY

potential sources/signs of error, including:

Variant only seen at read ends

Variant only seen on one strand

Multiple nearby variants, complicate alignment

Indels get messy!

Poorly

mappable

regions, e.g. repeats, are source of false positive SNPs

Regions with very low or very high coverage are error prone

Slide36

End of reads

Slide37

Strand bias

Slide38

Evaluating short variant calls

1. Comparison to “ground truth” (e.g. SNP array) via a “confusion table”

AA

AB

BB

AA

100

20

0

AB

101505

BB21090

NGS“Ground truth”False negativesFalse positives

2. Experimentally validate (e.g. Sanger sequencing)

3. “Sanity checks”

How many SNPs? ~3 million/people

Transition to

transversion

ratio (Ti/

Tv

) should be around 2.1

Transitions: A<->G, C<->T

Transversions

: A<->C, G<->T, A<->T, C<->G

Slide39

State of the art variant calling

Slide40

The Genome Analysis Toolkit

http://

www.broadinstitute.org

/

gatk

/guide/best-practices

Slide41

Variant score recalibration

http://

weallseqtoseq.blogspot.com

/2013/10/

gatk

-best-practices-workshop-

variant.html

Slide42

GATK

HaplotypeCaller

http://

gatkforums.broadinstitute.org

/

Slide43

Copy number variants

Genome

STRiP

: infer multi-allelic copy number status from read depth distributions

Handsaker

et al.

2015

Slide44

Structural variants

Tattini

,

D’Aurizio

, and Magi 2015

Slide45

Examples of tools for SNP/

indel

calling

Local reassembly based (slower, but more accurate)

Based on input alignment

Tool

Publication

Link

Notes

Samtools

Li et al. 2009 Bioinformatics

www.htslib.org/

Simple

to use

Varscan

Koboldt

et al. 2009

Bionformatics

dkoboldt.github.io/varscan/

Simple, heuristic based, handles tumor/normal

GATK Unified

Genotyper

McKenna et al. 2010 Genome Research

www.broadinstitute.org/gatk/

More preprocessing, probabilistic

Tool

Publication

Link

Notes

GATK

HaplotypeCaller

www.broadinstitute.org/gatk/

Multi-sample

, widely used

Freebayes

Garrison &

Marth

2012

arXiv

github.com/ekg/freebayes

Fast,

high accuracy

Platypus

Rimmer

et al.

2014 Nature Genetics

www.well.ox.ac.uk/platypus

Focus

on indels

Slide46

Genotyping STRs from short reads

Slide47

References + further reading

http://www.cs.jhu.edu/~langmea/resources/bwt_fm.pdf

“Fast and accurate short read

aligment

with Burrows-Wheeler transform”

bio-

bwa.sourcefourge.net

gatkforums.broadinstitute.org