Slide1
Personalized genomics
Slide2
Goal
Input
Genomic sequence (WGS) from family
Pedigree & affectedness
Disease (standard ontology needed)
Output
Genes/mutations relevant to the disease.
Slide3
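[Figure: pipeline overview. Stages shown: Read Mapping, BAM prep, SNV Calling, SV Calling & Validation, Merge VCFs, Variant annotation, Variant filtering. Inputs and resources shown: Genomic Sequence, GRCh37, known sites, dbSNP, Pedigree & Affectedness, SeattleSeq, HGMD. Output shown: Disease Gene/Mutation. File formats shown: FASTQ, BAM, VCF.]
Slide4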
Steps
Sequencing
Mapping
BAM Prep
Variant calling (SNV)
Variant Calling (SV)
VCF manipulation/merging
Variant annotation
Variant filtering
Disease gene association
Slide5
1. Sequencing
Platform
HiSeq/MiSeq
PacBio
Ion Proton (Life)
CGI…
Mode
Whole Genome
Exome
RNA-Seq…
Slide6
2. Mapping
Map short reads (FASTQ format) to a reference
Output a BAM file
Mapping tools
BWA
Bowtie
Custom
Compute/disk-intensive part of the pipeline. WGS file size: ~200 GB per sample.
Slide7
3. BAM Prep
Input: BAM file
Output: BAM file
Sorting BAM
Picard Tools
Marking (PCR) Duplicates
Picard Tools
INDEL Re-alignment
GATK
Base-Q Covariates & Recalibration
GATK
Compute intensive part of the pipeline
Slide8
4. Variant Calling
Input: multiple BAMs
Output: VCF (loci that differ from the reference)
SNVs
Broad’s GATK Caller
SVs
Custom pipelines needed
Browsing variant calls
Genome Savant
Confirming variants via resequencing
Compute intensive part of the pipeline. Integrating SVs and SNVs.
Slide9
5. SV calling & validation
[Figure: SV workflow. Labels shown: BreakDancer, CNVer, Bowtie, Reprever, Extract FASTQ, VCF merging and validation, GQL+Genome Savant, Zygosity calling.]
Slide10
Push-button pipeline or VM
SNPs (VCF)
SVs
CNVs
BAM
GATK
BreakDancer
CNVer
ISCA
Recombination Blocks
Known SNPs
de novo SNPs
Slide11
6. Merging VCFs
Given multiple VCF files, merge them (each column corresponds to an individual sample).
Can be mostly done by VCFtools. Our goal would be to visualize problematic regions for manual validation, and design primers for confirmation automatically.
Slide12
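For step 6 (merging VCFs), the idea of one column per individual sample can be sketched in plain Python. This is only an illustration of the data layout, not how VCFtools works internally; the function name `merge_calls`, the toy trio, and the use of "./." for a sample without a call are assumptions made for the example.

```python
def merge_calls(per_sample_calls):
    """Merge per-sample genotype calls into one table.

    per_sample_calls maps sample name -> {(chrom, pos, ref, alt): genotype}.
    Returns (samples, rows); each row is (site, [genotype per sample]).
    A sample with no call at a site gets "./." (missing).
    """
    samples = sorted(per_sample_calls)
    sites = set()
    for calls in per_sample_calls.values():
        sites.update(calls)
    rows = []
    # Note: chromosome names sort as strings here ("10" before "2").
    for site in sorted(sites):
        genotypes = [per_sample_calls[s].get(site, "./.") for s in samples]
        rows.append((site, genotypes))
    return samples, rows


# Hypothetical toy calls for a trio
calls = {
    "father": {("1", 1000, "A", "G"): "0/1"},
    "mother": {("1", 1000, "A", "G"): "0/1", ("2", 500, "C", "T"): "0/1"},
    "child": {("1", 1000, "A", "G"): "1/1"},
}
samples, rows = merge_calls(calls)
for site, genotypes in rows:
    print(site, dict(zip(samples, genotypes)))
```

A missing entry in such a merged table does not distinguish "homozygous reference" from "no coverage", which is one reason problematic regions may still need manual validation.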
7. Variant annotation
Input: variant calls (raw VCF)
Output: annotation of variants (annotated VCF)
Coding
Synonymous
Splice-variant
Regulatory
ncRNA
Annotating coding variation for deleteriousness
SIFT
PolyPhen
GERP
SeattleSeq
Slide13
GERP score
Slide14
8. Variant Filtering
Input: VCF (annotated)
Output: set of relevant variants/genes
Filters based on variant annotation
deleterious: missense/nonsense/splice
Filters based on inheritance patterns
Disease model (recessive/dominant/compound het)
Filtering tools:
Gemini (http://gemini.readthedocs.org/en/latest/)
FamAnn (https://sites.google.com/site/famannotation/home)
Slide15
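The two kinds of filters in step 8 (annotation-based and inheritance-based) can be combined as in the sketch below. It does not use Gemini or FamAnn; the record fields `effect` and `genotypes`, the function names, and the toy trio are hypothetical. Only the simple recessive and dominant models are covered, since compound het needs pairs of variants per gene.

```python
DELETERIOUS = {"missense", "nonsense", "splice"}


def alleles(gt):
    """Split a genotype string such as "0/1" or "0|1" into its alleles."""
    return gt.replace("|", "/").split("/")


def fits_model(genotypes, affected, model):
    """Check that every sample's genotype agrees with its affectedness.

    recessive: affected samples are 1/1, unaffected samples are not.
    dominant:  affected samples carry the alt allele, unaffected do not.
    A missing genotype ("./.") never counts as carrying the variant.
    """
    for sample, is_affected in affected.items():
        a = alleles(genotypes.get(sample, "./."))
        if model == "recessive":
            matches = a == ["1", "1"]
        elif model == "dominant":
            matches = "1" in a
        else:
            raise ValueError(f"unsupported model: {model}")
        if matches != is_affected:
            return False
    return True


def filter_variants(variants, affected, model):
    """Keep deleterious variants whose segregation fits the disease model."""
    return [v for v in variants
            if v["effect"] in DELETERIOUS
            and fits_model(v["genotypes"], affected, model)]


# Hypothetical trio: unaffected parents, affected child
affected = {"father": False, "mother": False, "child": True}
variants = [
    {"id": "v1", "effect": "missense",
     "genotypes": {"father": "0/1", "mother": "0/1", "child": "1/1"}},
    {"id": "v2", "effect": "synonymous",
     "genotypes": {"father": "0/1", "mother": "0/1", "child": "1/1"}},
    {"id": "v3", "effect": "nonsense",
     "genotypes": {"father": "0/0", "mother": "0/1", "child": "0/1"}},
]
recessive_hits = [v["id"] for v in filter_variants(variants, affected, "recessive")]
dominant_hits = [v["id"] for v in filter_variants(variants, affected, "dominant")]
print(recessive_hits, dominant_hits)
```

In this toy family only v1 survives the recessive filter: v2 is synonymous, and v3 is carried by an unaffected parent, which also rules it out under the dominant model.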
9. Annotating genes
Input: collection of genes with mutations.
Output: relevant diseases, functional information
Basic Information
Genecards
Adding pathway
Ingenuity
Databases of Disease gene links
HGMD
OMIM
ClinVar
We are currently using an outdated version of HGMD, but can possibly do better, or just replace it with Step 9.
Slide16
9. Identifying Disease genes
Automated machine learning approach to correlating genes with diseases
Standard ontologies for diseases
MeSH
Disease Ontology
Standard vocabulary for gene names
ML approach (parse abstracts to make these connections)
Slide17
Computational Resource Consumption
Step                   Disk/sample   CPU/sample
Read Mapping           800 GB        320 h*
BAM prep               150 GB        140 h
SNV & INDEL calling    20 GB         540 h*
SV & CNV calling       200 GB        30 h + 30 h
Merging VCF            1.5 GB        1 h
Variant Annotation     20 GB         1 h
Variant Filtering      -             1 h
Disease/Gene Assoc.    ?             ?
*amenable to multithread parallelization (up to a point when memory becomes bottleneck)
Slide18
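Summing the per-step figures from the resource consumption table gives a rough per-sample budget. The sketch below only adds up the numbers as listed; it assumes the two SV/CNV figures (30 h + 30 h) are additive, treats the "-" and "?" entries as unknown, and ignores any savings from parallelization or from deleting intermediate files.

```python
# (disk, CPU hours) per sample, copied from the table; disk is in the
# table's own units. None marks entries given as "-" or "?".
resources = {
    "Read mapping": (800, 320),
    "BAM prep": (150, 140),
    "SNV & INDEL calling": (20, 540),
    "SV & CNV calling": (200, 30 + 30),
    "Merging VCF": (1.5, 1),
    "Variant annotation": (20, 1),
    "Variant filtering": (None, 1),
    "Disease/gene assoc.": (None, None),
}

total_disk = sum(d for d, _ in resources.values() if d is not None)
total_cpu = sum(c for _, c in resources.values() if c is not None)
print(f"disk per sample: {total_disk}, CPU hours per sample: {total_cpu}")
```

Read mapping and SNV/INDEL calling dominate the CPU total, and these are the two steps marked as amenable to multithreading.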
Gene Prioritization
Variant annotation
Variant filtering
Gene Disease connection
Slide19
The HPO aims to act as a central resource to connect several genomics datasets with the diseasome.
Sebastian Köhler et al. Nucl. Acids Res. 2014;42:D966-D974
© The Author(s) 2013. Published by Oxford University Press.
Slide20
Human Phenotype Ontology
10,000 terms describing human phenotypic abnormalities (7300 human hereditary syndromes).
2741 genes used to create DAG (Disease Associated Genes)
3 independent sub-ontologies
mode of inheritance
onset and clinical course
phenotypic abnormalities
The phenotypic terms are cross-linked
Slide21
Applications of HPO
Slide22
Differential diagnosis using Phenomizer
Slide23
Sequencing
Whole genome sequencing
Exome sequencing
Disease associated genome sequencing
Slide24
Depth of coverage (exome or disease oriented sequencing)
At 20X coverage, what fraction of het variants will be called?
15% will be missed
Slide25
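The depth-of-coverage question above can be explored with a simple binomial model. This is a sketch under strong assumptions, not a derivation of the 15% figure: the depth is taken to be exactly 20 at the site, each read is assumed to carry the alternate allele independently with probability 0.5, sequencing errors are ignored, and `min_alt_reads` is a hypothetical calling threshold.

```python
from math import comb


def prob_het_missed(depth, min_alt_reads, p_alt=0.5):
    """Probability that fewer than min_alt_reads of `depth` reads carry
    the alternate allele at a heterozygous site (binomial model)."""
    return sum(comb(depth, k) * p_alt**k * (1 - p_alt)**(depth - k)
               for k in range(min_alt_reads))


# Miss probability at depth 20 for a few hypothetical thresholds
for min_alt in (4, 6, 8):
    print(min_alt, round(prob_het_missed(20, min_alt), 4))
```

In practice the depth varies from site to site around its mean and callers use more elaborate genotype likelihoods, so actual miss rates depend on the caller and on the coverage distribution.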
Phenotypic interpretation of eXomes: PhenIX
Remove off-target and synonymous variants
Test population frequency of other variants
frequency score: max(0, 1 - 0.13*exp(100*f))
These are known SNPs
Scores from SIFT/PolyPhen
Most pathogenic score was taken
Final variant score: pathogenic score × frequency score
Clinical relevance score: semantic similarity between phenotypic abnormalities and 2741 genes.
Average (clinical, variant)
Slide26
Phenotypic interpretation of eXomes: PhenIX
Simulated mutation data from HGMD
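The PhenIX scoring steps listed on the earlier slide (frequency score, variant score, and the average with the clinical relevance score) can be written out as a small sketch. It follows the formulas as given on the slide and is not the PhenIX implementation. It assumes that f is the population allele frequency as a fraction between 0 and 1, that the predictor scores have already been rescaled so that higher means more pathogenic (raw SIFT scores run the other way), and that the clinical relevance score is supplied as a number; the function names are made up for the example.

```python
from math import exp


def frequency_score(f):
    """Frequency score from the slide: max(0, 1 - 0.13 * exp(100 * f))."""
    return max(0.0, 1.0 - 0.13 * exp(100.0 * f))


def variant_score(pathogenicity_scores, f):
    """Most pathogenic predictor score times the frequency score."""
    return max(pathogenicity_scores) * frequency_score(f)


def combined_score(clinical_relevance, pathogenicity_scores, f):
    """Average of the clinical relevance score and the variant score."""
    return (clinical_relevance + variant_score(pathogenicity_scores, f)) / 2.0


# Hypothetical variant: two predictor scores, never seen in the population
print(frequency_score(0.0))
print(variant_score([0.2, 0.9], 0.0))
print(combined_score(0.5, [0.2, 0.9], 0.0))
```

With these constants the frequency score falls to zero once f exceeds roughly 2%, so common variants contribute nothing to the variant score regardless of their predicted pathogenicity.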