Variant Calling PowerPoint Presentation, PPT - DocSlides

Uploaded by lois-ondreau | 2018-01-12 | Chris Fields, Mayo-Illinois Computational Genomics Workshop, June 19, 2017.


Slide1

Variant Calling

Chris Fields

Mayo-Illinois Computational Genomics Workshop, June 19, 2017

Slide2

Up-front acknowledgments

Many figures/slides come from:

GATK Workshop slides: http://www.broadinstitute.org/gatk/guide/events?id=2038

IGV Workshop slides: http://lanyrd.com/2013/vizbi/scdttf/

Denis Bauer (CSIRO): http://www.allpower.de/

Many varied publications

Slide3

Background

Variant calling and use cases

Errors vs. actual variants

Experimental design (GATK focus)

Small variant (SNV/Small Indel) analysis

GATK Pipeline

Formats encountered within

Structural Variation Analysis (SV)

Association analysis (briefly)

Slide4

Variant Calling

As the name implies, we’re looking for differences (variations)

Reference – reference genome (hg19, b37)

Sample(s) – one or more comparative samples

Start with raw sequence data

End with a human ‘diff’ file, recording the variants

Additional information added downstream:

Filters (quality of the calls)

Functional annotation

Slide5

Variations

Difference between 2 individuals: 1 every 1000 bp

~2.7 million differences

Small (<50 bp)

SNV – single nucleotide

Small insertions or deletions (‘indels’)

Large (structural variations)

Indels > 50 bp

Copy Number Variations

Inversions

Translocations

Slide6

Variations

Mainly focus on diploid organisms

Human:

22 pairs of autosomal chromosomes

One from mother, one from father

2 sex chromosomes (female XX, male XY)

One from mother, one from father (where does the Y come from for male offspring?)

Mitochondrial genome (generally maternally inherited)

100-10,000 copies per cell

Variation can be in

One chromosome (heterozygous, or ‘het’)

Both chromosomes (homozygous, or ‘hom’)

Slide7

Use cases

Medicine

Hereditary or genetic diseases, genetic predisposition to disease

Normal vs. tumor analyses

Heteroplasmy

Population genetics

Slide8

Population genetics

The 1000 Genomes Project

Slide9

Cancer

E. Mardis, Current Opinion in Genetics & Development 2012, 22:245–250

Slide10

Heteroplasmy

Mixture of more than one type of organelle genome within cell

100-10,000 mtDNA copies/cell

Predisposing factor for mitochondrial diseases

Late (adult) onset

Possibly beneficial in some cases

Centenarians have above avg freq for heteroplasmy

Coble et al, PLOS One, 2009;4(3):e4838

Slide11

Variants vs. Errors

Must distinguish between actual variation (real change) and errors (artifacts) introduced into the analysis

Errors can creep in on various levels:

PCR artifacts (amplification of errors)

Sequencing (errors in base calling)

Alignment (misalignment, mis-gapped alignments)

Variant calling (low depth of coverage, few samples)

Genotyping (poor annotation)

Try to control for these when possible to reduce false positives w/o incurring (worse) false negatives

Slide12

Example: Initial raw sequence data

Slide13

Sequence quality

Different technologies have different errors, error rates

454 – homopolymer tract errors

Illumina – substitution errors

PacBio – indels. Lots and lots of ‘em.

Represented as a quality score (Phred)

Q = -10 log10(e), where e is the probability that the base call is incorrect

Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90%
20                    1 in 100                             99%
30                    1 in 1,000                           99.9%
40                    1 in 10,000                          99.99%
50                    1 in 100,000                         99.999%
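The Phred relationship above can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and not taken from any library.

```python
import math

def phred_to_error_prob(q):
    """Probability of an incorrect base call for Phred score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(e):
    """Phred score for an error probability e, i.e. Q = -10 log10(e)."""
    return -10 * math.log10(e)

# Reproduce the rows of the table above.
for q in (10, 20, 30, 40, 50):
    e = phred_to_error_prob(q)
    print(f"Q{q}: P(error) = {e:g}, accuracy = {100 * (1 - e):.3f}%")
```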

Slide14

Formats: FASTQ – ‘sequence with quality’

Three ‘variants’ – Sanger, Illumina, Solexa (Sanger is most common)

May be ‘raw’ data (straight from seq pipeline) or processed (trimmed for various reasons)

Can hold 100’s of millions of records per sample

Files can be very large (100’s of GB) apiece

@HWI-ST1155:109:D0L23ACXX:5:1101:2247:1985 1:N:0:GCCAAT
NTTCCTTTGACAAATATTAAAATTAAGAATCAAATATGGTAGTGTATGCCAAGACCTAGTCTGAGTCAGTAGGAT
+
#1=DDFFFHHHHHJJJJJJIJJJIJJIJIJJJJJJJIJI?FHFHEIJEIIIEGFFHHGIGHIJEIFGIJHGDIII
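A minimal sketch of reading the four-line record above and decoding its quality string. It assumes the Sanger (Phred+33) encoding, where each quality character's ASCII code minus 33 is the Phred score; the function name is illustrative.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (name, sequence, Phred scores).

    Assumes Sanger / Phred+33 quality encoding.
    """
    header, seq, plus, qual = [ln.rstrip("\n") for ln in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], seq, [ord(c) - 33 for c in qual]

record = [
    "@HWI-ST1155:109:D0L23ACXX:5:1101:2247:1985 1:N:0:GCCAAT",
    "NTTCCTTTGACAAATATTAAAATTAAGAATCAAATATGGTAGTGTATGCCAAGACCTAGTCTGAGTCAGTAGGAT",
    "+",
    "#1=DDFFFHHHHHJJJJJJIJJJIJJIJIJJJJJJJIJI?FHFHEIJEIIIEGFFHHGIGHIJEIFGIJHGDIII",
]

name, seq, quals = parse_fastq_record(record)
# Positions (0-based) whose Phred score is below 20.
low_quality_positions = [i for i, q in enumerate(quals) if q < 20]
print(len(seq), quals[:3], low_quality_positions)
```

Under this encoding the leading `#` decodes to Q2 and the following `1` to Q16, so only the first two bases fall below Q20.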

Slide15

Formats: FASTQ – ‘sequence with quality’

Very low Phred score, less than 10

Slide16

Formats: FASTQ – ‘sequence with quality’

Low Phred score, < 20

Slide17

Formats: FASTQ – ‘sequence with quality’

High quality reads, Phred score > 30

Slide18

How do sequencing errors occur?

Slide19

Illumina Sequencing

Slide20

Illumina Sequencing

Slide21

Illumina Sequencing

Slide22

Illumina Sequencing

Slide23

Check sequence data!

Slide24

Basic Experimental Design

Slide25

Terminology

Lane – Physical sequencing lane

Library – Unit of DNA prep pooled together

Sample – Single individual

Cohort – Collection of samples analyzed together

Figure labels: lane, flowcell

Slide26

Terminology

Sample (single) vs Cohort (multiple)

Library

Lane

Flowcell

Figure labels: lane, flowcell, Library

In general, Library = Sample

Slide27

Terminology

WGS vs Exome Capture

Whole genome sequencing – everything

High cost per sample if sequencing deeply (>25-30x)

Can run multisample low coverage samples

Exome capture – targeted sequencing (1-5% of genome)

Deeper coverage of transcribed regions

Miss other important non-coding regions (promoters, introns, enhancers, small RNA, etc.)

Slide28

Slide29

Data requirements per sample

Target bases: 3 Gb
Coverage: Avg. 30x
# sequenced bases: 100 Gb
# per lane (HiSeq 4000): ~1

Variant detection among multiple samples

Variants found per sample: ~3-5M
Percent of variation in genome: >99%
Pr{singleton discovery}: >99%
Pr{common allele discovery}: >99%
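As a rough check on these numbers, raw coverage is simply sequenced bases divided by target bases. The sketch below uses an illustrative function name. Note that the quotient is an upper bound: duplicates, unmapped reads and off-target reads reduce the usable coverage.

```python
def raw_mean_coverage(sequenced_bases, target_bases):
    """Naive mean depth: total sequenced bases spread evenly over the target."""
    return sequenced_bases / target_bases

# 100 Gb of sequence over a 3 Gb genome.
wgs_depth = raw_mean_coverage(100e9, 3e9)
print(f"{wgs_depth:.1f}x raw coverage")
```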

Slide30

Data requirements per sample

Target bases: 3 Gb
Coverage: Avg. 4x
# sequenced bases: 20 Gb
# per lane (HiSeq 4000): ~6

Variant detection among multiple samples

Variants found per sample: ~3M
Percent of variation in genome: ~90%
Pr{singleton discovery}: < 50%
Pr{common allele discovery}: ~99%

Slide31

Exome Capture

Slide32

Exome Capture (TruSeq Exome Capture)

Slide33

Data requirements per sample

Target bases: 45 Mb
Coverage: >80% at 20x
# sequenced bases: 5 Gb
# per lane (HiSeq 4000): 24

Variant detection among multiple samples

Variants found per sample: ~25,000
Percent of variation in genome: 0.005 (0.5%)
Pr{singleton discovery}: ~95%
Pr{common allele discovery}: ~95%

Slide34

General variant calling pipelines

Common pattern:

Align reads

Optimize alignment

Call variants

Filter called variants

Annotate

Slide35

Pipeline examples

Examples:

Broad Genome Analysis Toolkit (GATK)

samtools mpileup

VarScan2

Specialized pipelines

Heteroplasmy, tumor sample analyses

Slide36

GATK Pipeline

Slide37

Phase I : NGS Data Processing

Slide38

Phase I : NGS Data Processing

Alignment of raw reads

Duplicate marking

Base quality recalibration

Local realignment no longer required if you use the HaplotypeCaller

Slide39

Phase I : Alignment of raw reads

Accuracy

Sensitivity – maps reads accurately, allowing for errors or variation

Specificity – maps to the correct region

Heng Li’s aligner assessment

Slide40

Phase I : Alignment of raw reads

Accuracy assessed using simulated data

Generally, BWA-MEM or Novoalign are recommended

Unique vs. multi-mapped reads

Should we retain reads mapping to repetitive regions?

May depend on the application

Heng Li’s aligner assessment

Slide41

Phase I : Alignment of raw reads

Almost 100 of these (2016) : http://wwwdev.ebi.ac.uk/fg/hts_mappers/

Slide42

Slide43

Slide44

Alignment output : SAM/BAM

SAM – Sequence Alignment/Map format

SAM file format stores alignment information

Normally converted into BAM (text format is mostly useless for analysis)

Specification: http://samtools.sourceforge.net/SAM1.pdf

Contains FASTQ reads, quality information, meta data, alignment information, etc.

Files are typically very large:

Many 100’s of GB or more

Slide45

Alignment output : SAM/BAM

BAM – BGZF compressed SAM format

May be unsorted, or sorted by sequence name or genome coordinates

May be accompanied by an index file (.bai) (only if coord-sorted)

Makes the alignment information easily accessible to downstream applications (large genome file not necessary)

Relatively simple format makes it easy to extract specific features, e.g. genomic locations

BAM is the compressed/binary version of SAM and is not human readable. Uses a specialized compression algorithm optimized for indexing and record retrieval (bgzip)

Files are typically very large:

1/5 of SAM, but still very large

Slide46

Alignment

SAM format

Alignment output : SAM/BAM

Slide47

SAM format

Alignment output : SAM/BAM

Slide48

SAM format

Slide49

SAM format

Slide50

Bit Flags

Hex   0x80  0x40  0x20  0x10  0x8  0x4  0x2  0x1
Bit   128   64    32    16    8    4    2    1
r001  1     0     1     0     0    0    1    1    = 163
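A minimal sketch of decoding the flag value above by testing each bit. The bit meanings follow the SAM specification for the eight lowest bits; the table and function names are illustrative.

```python
# Lowest eight SAM flag bits and their meanings (per the SAM specification).
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
]

def decode_flag(flag):
    """Return the meanings of every bit set in a SAM flag."""
    return [name for bit, name in FLAG_BITS if flag & bit]

# 163 = 128 + 32 + 2 + 1
print(decode_flag(163))
```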

Slide51

SAM format

Slide52

SAM format

Slide53

SAM format

Slide54

CIGAR

Slide55

SAM format

Slide56

SAM format

Slide57

SAM format

Slide58

SAM format

Slide59

Alignment output : SAM/BAM

Too many to go over!!!

Slide60

Alignment output : SAM/BAM

Tools

samtools

Picard

Mining information from a properly formatted BAM file:

Reads in a region (good for RNA-Seq, ChIP-Seq)

Quality of alignments

Coverage

…and of course, differences (variants)

Slide61

Phase I : Duplicate Marking

Slide62

Phase I : Duplicate Marking

Slide63

Phase I : Sorting, Read Groups

Slide64

Terminology

Read groups – information about the samples and how they were run

ID – Simple unique identifier

Library

Sample name

Platform – sequencing platform

Platform unit – barcode or identifier

Sequencing center (optional)

Description (optional)

Run date (optional)
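In a SAM/BAM header these fields appear on an @RG line using the two-letter tags from the SAM specification (ID, LB, SM, PL, PU, CN, DS, DT). The sketch below builds such a line. The function name and all example values are hypothetical.

```python
def read_group_line(rg_id, library, sample, platform, platform_unit,
                    center=None, description=None, run_date=None):
    """Build a tab-separated @RG header line; optional fields are skipped if None."""
    fields = [
        ("ID", rg_id),          # simple unique identifier
        ("LB", library),        # library
        ("SM", sample),         # sample name
        ("PL", platform),       # sequencing platform
        ("PU", platform_unit),  # barcode or other unit identifier
        ("CN", center),         # sequencing center (optional)
        ("DS", description),    # description (optional)
        ("DT", run_date),       # run date (optional)
    ]
    parts = ["@RG"] + [f"{tag}:{value}" for tag, value in fields if value is not None]
    return "\t".join(parts)

# Hypothetical values for illustration only.
rg_line = read_group_line("rg1", "lib1", "sampleA", "ILLUMINA", "flowcell1.5.GCCAAT")
print(rg_line)
```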

Slide65

Phase I : Sorting, Read Groups

Slide66

Phase I : Local Realignment

Realign around indels

Common misalignment problem

Can cause problems with base quality recalibration, variant calls

Slide67

Slide68

Phase I : Base Quality Score Recalibration

Quality scores from sequencers are biased and somewhat inaccurate

Quality scores are critical for all downstream analysis

Biases are a major contributor to bad variant calls

Caveat: In practice, generally requires having a known set of variants (dbSNP)

Slide69

Slide70

Phase II : Variant Discovery/Genotyping

Slide71

Phase II : Variant Calling

This is where we actually call the variants

Prior steps leading up to this help remove potential causes of variant calling errors

I’ll be covering use of the UnifiedGenotyper variant calling tool in GATK

… but you should keep an eye on the newest tool, HaplotypeCaller

Slide72

Phase II : Variant Calling

In general, uses a probabilistic method, e.g. Bayesian model

Determine the possible SNP and indel alleles

Only “good bases” are included:

Those satisfying minimum base quality, mapping read quality, pair mapping quality, etc.

Compute, for each sample, for each genotype, likelihoods of data given genotypes

Compute the allele frequency distribution to determine most likely allele count; emit a variant call if determined

If we are going to emit a variant, assign a genotype to each sample

Slide73

Het – REF: 50%, ALT: 50%

Hom – REF: 0%, ALT: 100%

?? – REF: 77%, ALT: 23%
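One way to reason about the ambiguous 77%/23% case is to compute the likelihood of the observed read counts under each diploid genotype. The sketch below uses a textbook binomial model with a single per-base error rate. It is a simplification for illustration, not the model implemented by UnifiedGenotyper or HaplotypeCaller, and it ignores priors, base qualities and mapping qualities. The read depth of 100 and the 1% error rate are assumptions.

```python
from math import comb

def genotype_likelihoods(ref_count, alt_count, error_rate=0.01):
    """P(read counts | genotype) for 0/0, 0/1 and 1/1 under a binomial model."""
    n = ref_count + alt_count
    likelihoods = {}
    for genotype, alt_copies in (("0/0", 0), ("0/1", 1), ("1/1", 2)):
        alt_fraction = alt_copies / 2
        # Chance that a single read shows ALT: true ALT read correctly,
        # or true REF read with an error.
        p_alt = alt_fraction * (1 - error_rate) + (1 - alt_fraction) * error_rate
        likelihoods[genotype] = (
            comb(n, alt_count) * p_alt ** alt_count * (1 - p_alt) ** ref_count
        )
    return likelihoods

def most_likely_genotype(ref_count, alt_count, error_rate=0.01):
    lk = genotype_likelihoods(ref_count, alt_count, error_rate)
    return max(lk, key=lk.get)

# Assumed depth of 100 reads for each case above.
for ref, alt in ((50, 50), (0, 100), (77, 23)):
    print(ref, alt, most_likely_genotype(ref, alt))
```

Under these assumptions 23 ALT reads out of 100 are far more likely from a heterozygous site than from sequencing error alone, but at low depth the same fractions are much less decisive.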

Slide74

Variant calling output: VCF

VCF (Variant Call Format)

Like SAM/BAM, also has a versioned specification

From the 1000 Genomes Project

http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Slide75

Formats: VCF

##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTC G,GTCT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3

Slide76

Formats: VCF

Slide77

Formats: VCF

Slide78

Formats: VCF

Slide79

Formats: VCF

Slide80

Formats: VCF

Slide81

Formats: VCF

Slide82

Formats: VCF

Slide83

Formats: VCF

Samples

Slide84

VCF

Col   Field      Description
1     CHROM      Chromosome name
2     POS        1-based position. For an indel, this is the position preceding the indel.
3     ID         Variant identifier. Usually the dbSNP rsID.
4     REF        Reference sequence at POS involved in the variant. For a SNP, it is a single base.
5     ALT        Comma delimited list of alternative sequence(s).
6     QUAL       Phred-scaled probability of all samples being homozygous reference.
7     FILTER     Semicolon delimited list of filters that the variant fails to pass.
8     INFO       Semicolon delimited list of variant information.
9     FORMAT     Colon delimited list of the format of individual genotypes in the following fields.
10+   Sample(s)  Individual genotype information defined by FORMAT.
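The column layout above maps directly onto a small parser. This is a minimal sketch for a single data line. It assumes the tab-delimited layout used in real VCF files (the slide text shows spaces), and the function name is illustrative. Real projects would normally use an existing VCF library instead.

```python
def parse_vcf_line(line, sample_names):
    """Parse one tab-delimited VCF data line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = cols[:8]

    # INFO: semicolon-delimited key=value pairs; bare keys are flags.
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True

    record = {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": variant_id,
        "REF": ref,
        "ALT": alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": filt.split(";"),
        "INFO": info_dict,
        "samples": {},
    }

    # FORMAT names the colon-delimited sub-fields of each sample column.
    if len(cols) > 9:
        keys = cols[8].split(":")
        for name, value in zip(sample_names, cols[9:]):
            record["samples"][name] = dict(zip(keys, value.split(":")))
    return record

line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
rec = parse_vcf_line(line, ["NA00001", "NA00002", "NA00003"])
print(rec["POS"], rec["ALT"], rec["samples"]["NA00003"]["GT"])
```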

Slide85

Phase II : Filtering

Two basic methods:

Hard filtering

Variant quality score recalibration (VQSR)

Slide86

Phase II : Hard Filtering

Reducing false positives by e.g. requiring

Sufficient Depth

Variant to be in >30% reads

High quality

Strand balance

Etc etc etc

Very high dimensional search space… so, very subjective!

Strand Bias
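A hard filter is just a set of fixed thresholds applied to each call. In the sketch below only the 30% ALT-read fraction comes from the slide; the depth, quality and strand thresholds, the function name and its parameters are hypothetical placeholders that would need tuning per data set.

```python
def passes_hard_filters(depth, alt_reads, qual, alt_fwd, alt_rev,
                        min_depth=10, min_alt_fraction=0.30,
                        min_qual=30.0, min_strand_fraction=0.1):
    """Return True if a variant call clears every fixed threshold."""
    if depth < min_depth:                       # sufficient depth
        return False
    if alt_reads / depth <= min_alt_fraction:   # variant in >30% of reads
        return False
    if qual < min_qual:                         # high quality
        return False
    alt_total = alt_fwd + alt_rev
    if alt_total == 0:
        return False
    # Strand balance: ALT reads should not come from one strand only.
    if min(alt_fwd, alt_rev) / alt_total < min_strand_fraction:
        return False
    return True

print(passes_hard_filters(depth=40, alt_reads=18, qual=50, alt_fwd=9, alt_rev=9))
```

Every threshold adds another dimension to tune, which is the subjectivity the slide warns about.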

Slide87

Phase II : Hard Filtering

“The effect of strand bias in Illumina short-read sequencing data”

Guo et al., BMC Genomics 2012, 13:666

Last sentence: “Indiscriminant use of strand bias as a filter will result in a large loss of true positive SNPs”

Strand Bias

Slide88

Phase II : Variant Quality Score Recalibration (VQSR)

Considered GATK ‘best practice’

Train on trusted variants

Require the new variants to live in the same hyperspace

Potential problems:

Over-fitting

Biasing to features of known SNPs

Slide89

Phase III : Integrative Analysis

Slide90

Phase III : Functional Annotation

Are these mutations in important regions?

Genes? UTR?

Are they changing the coding sequence?

Would these changes have an effect?

Tools:

SnpEff/SnpSift

Annovar

Slide91

The end of the (pipe)line

Slide92

Follow-up Quality Control

Transition/Transversion ratio (Ti/Tv)

bcftools can help here

Concordance with known variants: dbSNP, HapMap, 1000genomes

Others?

Condition      Expected Ti/Tv
random         0.5
whole genome   2.0-2.1
exome          3.0-3.5
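The Ti/Tv ratio is straightforward to compute from REF/ALT pairs. Transitions are purine-to-purine (A<->G) or pyrimidine-to-pyrimidine (C<->T) changes; everything else is a transversion. A minimal sketch with illustrative function names:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for A<->G or C<->T substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def ti_tv_ratio(snvs):
    """Ti/Tv for an iterable of (ref, alt) single-base substitutions."""
    ti = tv = 0
    for ref, alt in snvs:
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue  # skip indels and non-variants
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")

# All 12 possible substitutions: 4 transitions and 8 transversions,
# which gives the 'random' expectation of 0.5 from the table above.
all_substitutions = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
print(ti_tv_ratio(all_substitutions))
```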

Slide93

Calling variants on cohorts of samples

When running HaplotypeCaller, can use a specific output type called GVCF (Genomic VCF)

Contains genotype likelihood and annotation for each site in genome

Perform joint genotyping calls on cohort

Can rerun as needed if more samples added to cohort

Used for ExAC cohort (92K exomes)

Slide94

Calling variants on cohorts of samples

Slide95

Structural Variation

Tools like GATK, samtools can’t currently detect larger structural changes easily

Alkan et al, Nature Reviews Genetics 12:363, 2011

Slide96

Structural Variation

With the latest releases of GATK (v3.7 or higher), this is changing:

https://software.broadinstitute.org/gatk/best-practices/

Slide97

Structural Variation

Detection using NGS data generally requires multi-layer analyses, may focus on specific SV types

Common tools:

CNVnator – gross detection of CNVs

BreakDancer, Cortex – breakpoint detection

Pindel – large deletions

Hydra-SV – read pair discordance

Recent tools (lumpy-sv, GASVPRO) integrate approaches

Slide98

Structural Variation Strategies

Read pair discordance

Insert size is off, orientation of reads is wrong

Read depth

Region deviates from expected read depth

Split reads

Single read is split, parts align in two distinct unique locations

‘Assembly’

Reference-based local assemblies indicate inconsistencies
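The read pair discordance strategy can be sketched as a simple classifier. It assumes a standard Illumina paired-end library in which concordant pairs face each other (orientation "FR") and the insert size lies within a few standard deviations of the library mean. The function name, the orientation labels and the 3-SD cutoff are illustrative assumptions, not the behavior of any particular SV caller.

```python
def classify_read_pair(insert_size, orientation, mean_insert, sd_insert, n_sd=3.0):
    """Label a read pair as concordant or as one kind of discordant evidence."""
    if orientation != "FR":
        # Wrong orientation: possible inversion or other rearrangement.
        return "discordant: orientation"
    if insert_size > mean_insert + n_sd * sd_insert:
        # Mates map further apart than expected: possible deletion in the sample.
        return "discordant: insert too large"
    if insert_size < mean_insert - n_sd * sd_insert:
        # Mates map closer than expected: possible insertion in the sample.
        return "discordant: insert too small"
    return "concordant"

# Hypothetical library with mean insert 300 bp and SD 30 bp.
print(classify_read_pair(310, "FR", 300, 30))
print(classify_read_pair(2500, "FR", 300, 30))
```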

Slide99

Structural Variation Analyses

Ref

Ref

Slide100

Structural Variation Analyses

Slide101

Structural Variation

Still an active area of research

Problems:

Lots of false positives

Hard to compare methodologies

Slide102

Association Studies

Genome-wide association studies (GWAS)

Trying to determine whether specific variant(s) in many individuals can be associated with a trait

Ex: comparison of groups of people with a disease (cases) and without (controls)
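For a single variant, the simplest case/control comparison is a 2x2 table of allele counts. The sketch below computes the Pearson chi-square statistic (1 degree of freedom, no continuity correction) and the odds ratio. The counts are hypothetical, the function name is illustrative, and a real GWAS would also correct for multiple testing and population structure.

```python
def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Chi-square statistic and odds ratio for a 2x2 allele-count table.

    Assumes every row and column total is non-zero.
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    odds_ratio = (a * d) / (b * c)
    return chi2, odds_ratio

# Hypothetical counts: 30/100 ALT alleles in cases vs 10/100 in controls.
chi2, odds_ratio = allelic_association(30, 70, 10, 90)
print(f"chi2 = {chi2:.2f}, OR = {odds_ratio:.2f}")
```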

Slide103

Finding the causal variant in ideal situations*

Spot the variant that is common amongst all affected but absent in all unaffected

This variant is in a gene with known function and causes the protein to be disrupted

* e.g. some rare autosomal disease

Slide104

In reality

You can’t spot the difference

You deal with ~3.5 million SNPs

You need to employ methods that systematically identify variants that stand out: GWAS

GWAS taught us that it is unlikely to find a causal common variant for complex diseases

Rare variant?

A bunch of rare and common variants?

An even more complex model?

1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature. 2010 Oct 28;467(7319):1061-73. PubMed PMID: 20981092

Slide105

GWAS

Criticisms of initial GWAS studies

Poor controls

Severely underpowered

Tons of false-positives

Not worth the cost

Led to some retractions

Wikipedia article outlines the limitations very well

http://en.wikipedia.org/wiki/Genome-wide_association_study

Slide106

GWAS

Most of these use genotyping arrays

Use of NGS may help address some of the past limitations

Slide107

GATK v4

Out now in pre-release

~5x faster with HaplotypeCaller

Completely open-source, but has specific software requirements

https://software.broadinstitute.org/gatk/download/alpha
