Slide 1
Annotation of SNPs and indels from 1000 bulls project run 6.0
Paul Stothard
Department of Agricultural, Food and Nutritional Science (AFNS)
1400 College Plaza
8215 - 112 Street
Edmonton, Alberta
Canada T6G 2C8
September 2017
Slide 2: Input data
- 42,920,227 SNPs
- 1,758,199 indels
Slide 3: Annotation approach
NGS-SNP (Grant et al., 2011):
- annotate_SNPs.pl script for SNPs
- annotate_INDELs.pl script for indels
The following databases were used for annotation:
- Ensembl release 87
- Entrez Gene and UniProt, used for some annotation fields (March 2017 versions)
- OMIA (June 2017)
For information about the annotation approach and output fields, see: http://www.ualberta.ca/~stothard/downloads/NGS-SNP/
Grant JR, Arantes AS, Liao X, Stothard P (2011) In-depth annotation of SNPs arising from resequencing projects using NGS-SNP. Bioinformatics 27:2300-2301.
Slide 4: Gene and transcript types
- All transcript and gene types are considered when predicting SNP and indel consequences. The type of gene and transcript is given in the “Comments” field, with the “Gene_Status” and “Transcript_Status” keys.
- Pseudogenes are also included. Variants affecting pseudogenes are given appropriate functional classes (e.g. “nc_transcript_variant” for “non-coding transcript variant”) and have “Gene_Biotype=pseudogene” in the “Comments” field.
Slide 5: Function class
- Each variant is assigned a “function class” using the Ensembl API.
- The function classes are defined relative to the reference genome sequence. For example:
  - stop_lost means that the non-reference allele in the input variant leads to the loss of a stop codon annotated on the reference genome.
  - stop_gained indicates that the alternative allele adds a stop codon to the coding region of a transcript annotated on the reference genome.
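The idea that a function class describes the alternative allele relative to the reference can be illustrated with a minimal Python sketch. This is not the Ensembl API logic used by NGS-SNP; it is a simplified codon-level comparison using the standard genetic code, and the function name `classify_codon_change` is hypothetical. It ignores transcript context such as start codons, splice sites and strand.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (bases T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon, alt_codon):
    """Classify a codon substitution relative to the reference codon.

    Hypothetical helper for illustration only; real annotation works on
    annotated transcripts and considers many more consequence types.
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == "*" and alt_aa == "*":
        return "stop_retained_variant"
    if ref_aa == "*":
        return "stop_lost"      # reference stop codon is lost
    if alt_aa == "*":
        return "stop_gained"    # alternative allele introduces a stop codon
    if ref_aa == alt_aa:
        return "synonymous_variant"
    return "missense_variant"


# Codon change reported for the sample SNP later in the deck (gAc -> gGc, D -> G).
sample_class = classify_codon_change("gAc", "gGc")
```

The codon pair `gAc`/`gGc` is the one reported by VEP for the sample SNP in this deck, which both tools classify as a missense variant.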
Slide 6: One variant can have multiple function classes
A single variant can be assigned multiple function classes (also called consequences) due to the presence of multiple overlapping transcripts or genes, and due to overlap among the function classes. For example:
- A SNP can be located in the 3’UTR of one transcript and in the translated region of another. This SNP could therefore have two consequence types: 3_prime_UTR_variant and missense_variant.
- A SNP in a start codon can be assigned all of the following function classes: coding_sequence_variant, missense_variant, and initiator_codon_variant.
Slide 7: One variant can have multiple function classes
- We report one consequence for each variant, in the “Functional_Class” column.
- When there are multiple consequences of different types, we choose the one we consider to be of the highest importance (e.g. missense_variant will be reported over synonymous_variant).
- Other consequences for each variant are given using the “Other_Consequences” key in the “Comments” column.
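The selection of a single reported consequence can be sketched as a ranked lookup. The ranking below is an assumption made for illustration: the slides only state that missense_variant outranks synonymous_variant, and the actual order used by NGS-SNP is not given here. The names `SEVERITY_ORDER` and `split_consequences` are hypothetical.

```python
# Assumed importance order, most important first. Only the relative order of
# missense_variant and synonymous_variant is confirmed by the slides.
SEVERITY_ORDER = [
    "splice_acceptor_variant",
    "splice_donor_variant",
    "stop_gained",
    "frameshift_variant",
    "stop_lost",
    "start_lost",
    "inframe_insertion",
    "inframe_deletion",
    "missense_variant",
    "protein_altering_variant",
    "splice_region_variant",
    "stop_retained_variant",
    "synonymous_variant",
    "coding_sequence_variant",
    "mature_miRNA_variant",
    "5_prime_UTR_variant",
    "3_prime_UTR_variant",
    "non_coding_transcript_exon_variant",
    "intron_variant",
    "non_coding_transcript_variant",
    "upstream_gene_variant",
    "downstream_gene_variant",
    "intergenic_variant",
]
RANK = {name: index for index, name in enumerate(SEVERITY_ORDER)}


def split_consequences(consequences):
    """Return (reported consequence, other consequences) for one variant."""
    unique = list(dict.fromkeys(consequences))  # de-duplicate, keep input order
    if not unique:
        raise ValueError("at least one consequence is required")
    # Unknown terms sort last; sorted() is stable, so ties keep input order.
    ordered = sorted(unique, key=lambda name: RANK.get(name, len(RANK)))
    return ordered[0], ordered[1:]


functional_class, other_consequences = split_consequences(
    ["3_prime_UTR_variant", "missense_variant"]
)
```

With this ranking, the SNP from the previous slide that lies in the 3’UTR of one transcript and the translated region of another would be reported as missense_variant, with 3_prime_UTR_variant listed among the other consequences.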
Slide 8: Annotation fields for SNPs
- See the “Sample annotated SNP” slides.
- The “Comments” annotation field includes several (more than 20) key-value pairs providing more information about certain variant types, such as SIFT score (for missense_variant SNPs) and length of protein sequence lost (for stop_gained SNPs).
- The “Model_Annotations” annotation field includes up to five key-value pairs providing information related to the human orthologue of the gene containing the variant. For example, phenotypes associated with the human orthologue are listed, if available.
- The full list of fields is available at: http://www.ualberta.ca/~stothard/downloads/NGS-SNP/annotate_SNPs.html
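The key-value layout of the “Comments” field can be unpacked with a few lines of Python. This is a minimal sketch that assumes pairs are separated by semicolons and that keys and values are separated by the first equals sign, as in the sample value shown later in this deck; the function name `parse_key_values` is hypothetical.

```python
def parse_key_values(field):
    """Split a 'key=value;key=value;' string into a dictionary."""
    pairs = {}
    for item in field.split(";"):
        if not item:
            continue  # skip the empty piece after a trailing semicolon
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


# "Comments" value taken from the sample annotated SNP in this deck.
comments = (
    "Gene_Status=KNOWN_BY_PROJECTION;Transcript_Status=KNOWN_BY_PROJECTION;"
    "Gene_Biotype=protein_coding;Transcript_Biotype=protein_coding;"
    "SIFT_Prediction_Ensembl=deleterious(0);"
)
parsed = parse_key_values(comments)
```

After parsing, `parsed["SIFT_Prediction_Ensembl"]` holds the SIFT call for the sample SNP, which makes it straightforward to filter variants on individual keys.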
Slide 9: Annotation fields for indels
- See the “Sample annotated indel” slides.
- The “Comments” and “Model_Annotations” fields provide additional information in the form of key-value pairs.
- Indel position ambiguity is handled appropriately for annotation and for identifying matches in dbSNP (for a discussion of this issue, see “Equivalent Indels – Ambiguous Functional Classes and Redundancy in Databases” by Assmus et al., 2013).
- The full list of fields is available at: http://www.ualberta.ca/~stothard/downloads/NGS-SNP/annotate_INDELs.html
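Indel position ambiguity arises because, in repetitive sequence, deleting the same number of bases at neighbouring positions can give an identical result. The sketch below enumerates the equivalent start positions of a deletion by shifting it left and right. It illustrates the general idea only and is not the procedure implemented in NGS-SNP; the function name and the toy sequence are hypothetical.

```python
def equivalent_deletion_starts(ref, start, length):
    """List all 0-based start positions whose deletion of `length` bases
    yields the same sequence as deleting ref[start:start + length]."""
    if length <= 0 or start < 0 or start + length > len(ref):
        raise ValueError("deletion must lie within the reference sequence")
    left = start
    # Shifting left by one is neutral when the base entering the deleted
    # window equals the base leaving it on the right.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = start
    # Shifting right by one is neutral under the mirrored condition.
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return list(range(left, right + 1))


toy_ref = "GCACACT"  # toy sequence, not taken from the data set
starts = equivalent_deletion_starts(toy_ref, 2, 2)
results = {toy_ref[:s] + toy_ref[s + 2:] for s in starts}
```

For the toy sequence, `starts` is `[1, 2, 3, 4]` and every one of those deletions produces the same sequence, which is why a single indel can match several database records or receive different function classes depending on how it is positioned.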
Slide 10: Function class values
- For indels and SNPs, the consequence type of the variant is given in the “Functional_Class” field.
- The values that appear in this field are defined by Ensembl and the Sequence Ontology (SO) project: http://www.sequenceontology.org
Slide 11: Output files
Indels
- Tab-delimited annotated indels.
SNPs
- Tab-delimited annotated SNPs.
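Because the output files are tab-delimited, they can be summarised with Python's standard `csv` module. The sketch below tallies variants per function class. It assumes the file starts with a header row containing the field names shown in the sample slides (in particular “Functional_Class”); that layout is an assumption, and the small in-memory table is made up for illustration.

```python
import csv
import io
from collections import Counter


def count_function_classes(handle):
    """Count rows per Functional_Class in a tab-delimited annotation file.

    Assumes a header row with a 'Functional_Class' column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["Functional_Class"] for row in reader)


# Made-up rows standing in for an annotated SNP file.
example = io.StringIO(
    "CHROM\tPOS\tREF\tALT\tFunctional_Class\n"
    "1\t100\tT\tC\tmissense_variant\n"
    "1\t200\tG\tA\tintron_variant\n"
    "2\t300\tC\tT\tintron_variant\n"
)
counts = count_function_classes(example)
```

For a real file, the same function can be called with a handle opened via `open(path, newline="")`, which streams the rows rather than loading the whole file into memory.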
Slide 12: Summary of results for SNPs
Slide 13: Known vs. novel SNPs
- “Known” is used here to describe input variants where the variant and all of its alleles exist already in the reference database.
SNP type  Count
Known     31,952,868 (74.45%)
Novel     10,967,359 (25.55%)
All       42,920,227
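The percentages in the table follow directly from the counts, as this short check shows. The helper name `share` is hypothetical; the counts are those reported on the slide.

```python
def share(count, total):
    """Percentage of `total` represented by `count`, rounded to two decimals."""
    return round(100 * count / total, 2)


known_snps = 31_952_868
novel_snps = 10_967_359
all_snps = known_snps + novel_snps  # equals the 42,920,227 input SNPs

known_pct = share(known_snps, all_snps)  # 74.45
novel_pct = share(novel_snps, all_snps)  # 25.55
```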
Slide 14: Numbers of SNPs in each function class
intergenic_variant 28353891
intron_variant 11232495
upstream_gene_variant 1510605
downstream_gene_variant 1282230
missense_variant 185046
synonymous_variant 179442
3_prime_UTR_variant 99283
splice_region_variant 32194
5_prime_UTR_variant 23431
non_coding_transcript_exon_variant 12878
stop_gained 3831
splice_donor_variant 1876
splice_acceptor_variant 1618
mature_miRNA_variant 407
start_lost 292
stop_lost 257
coding_sequence_variant 243
stop_retained_variant 126
non_coding_transcript_variant 82
Total 42920227
Slide 15: Sample annotated SNP
Field number  Field name        Value
1             CHROM             1
2             POS               145114963
3             ID                .
4             REF               T
5             ALT               C
6             QUAL              .
7             FILTER            .
8             INFO              .
9             Functional_Class  missense_variant
10            Chromosome        1
See http://www.ualberta.ca/~stothard/downloads/NGS-SNP/annotate_SNPs.html
Sample values of interest highlighted in red
Slide 16: Sample annotated SNP cont.
Field number  Field name               Value
11            Chromosome_Position      145114963
12            Chromosome_Strand        forward
13            Chromosome_Reference     T
14            Chromosome_Reads         C
15            Gene_Description         Integrin beta-2 [Source:UniProtKB/Swiss-Prot;Acc:P32592]
16            Ensembl_Gene_ID          ENSBTAG00000017060
17            Entrez_Gene_Name         ITGB2
18            Entrez_Gene_ID           281877
19            Ensembl_Transcript_ID    ENSBTAT00000022687
20            Transcript_SNP_Position  488
Slide 17: Sample annotated SNP cont.
Field number  Field name                       Value
21            Transcript_SNP_Reference         A
22            Transcript_SNP_Reads             G
23            Transcript_To_Chromosome_Strand  reverse
24            Ensembl_Protein_ID               ENSBTAP00000022687
25            UniProt_ID                       P32592
26            Amino_Acid_Position              128
27            Overlapping_Protein_Domains      superfamily IPR002035 von Willebrand factor, type A;pfam IPR002369 Integrin beta subunit, N-terminal;smart IPR002369 Integrin beta subunit, N-terminal;pirsf IPR015812 Integrin beta subunit;prints IPR015812 Integrin beta subunit
28            Overlapping_Protein_Features     CHAIN:23:769:Integrin beta-2.;TOPO_DOM:23:700:Extracellular. {ECO:0000255}.;DOMAIN:124:363:VWFA.;DISULFID:33:447:{ECO:0000250}.;VARIANT:128:128:D -> G (in LAD).
Slide 18: Sample annotated SNP cont.
Field number  Field name                  Value
29            Amino_Acid_Reference        D
30            Amino_Acid_Reads            G
31            Amino_Acids_In_Orthologues  DDDDDDDDDDDDDDDDXDDDDDDDDDDDDXDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
32            Alignment_Score_Change      -0.454
33            C_blosum                    1.0
34            Context_Conservation        84.7
35            Orthologue_Species          Ovis_aries;Tursiops_truncatus;Sus_scrofa;Mustela_putorius_furo;Canis_lupus_familiaris;Felis_catus;Ailuropoda_melanoleuca;Equus_caballus;Pteropus_vampyrus;Myotis_lucifugus;Nomascus_leucogenys;Gorilla_gorilla_gorilla;Pongo_abelii;Callithrix_jacchus;Chlorocebus_sabaeus;Pan_troglodytes;Tupaia_belangeri;Macaca_mulatta;Homo_sapiens;Microcebus_murinus;Ictidomys_tridecemlineatus;Papio_anubis;Dipodomys_ordii;Otolemur_garnettii;Oryctolagus_cuniculus;Ochotona_princeps;Rattus_norvegicus;Mus_musculus;Mus_musculus;Carlito_syrichta;Mus_spretus;Cavia_porcellus;Mus_spretus;Sorex_araneus;Dasypus_novemcinctus;Echinops_telfairi;Procavia_capensis;Loxodonta_africana;Macropus_eugenii;Monodelphis_domestica;Ficedula_albicollis;Meleagris_gallopavo;Gallus_gallus;Pelodiscus_sinensis;Anas_platyrhynchos;Taeniopygia_guttata;Anolis_carolinensis;Xenopus_tropicalis;Ornithorhynchus_anatinus;Latimeria_chalumnae;Danio_rerio;Tetraodon_nigroviridis;Gadus_morhua;Lepisosteus_oculatus;Poecilia_formosa;Oreochromis_niloticus;Takifugu_rubripes;Astyanax_mexicanus;Xiphophorus_maculatus;Oryzias_latipes;Poecilia_formosa;Gasterosteus_aculeatus;Oreochromis_niloticus;Petromyzon_marinus;Ciona_intestinalis;Ciona_savignyi;Ciona_intestinalis;Ciona_savignyi;Drosophila_melanogaster;Ciona_intestinalis;Caenorhabditis_elegans;Ciona_intestinalis;Ciona_savignyi;Drosophila_melanogaster
Slide 19: Sample annotated SNP cont.
Field number  Field name         Value
36            Gene_Ontology      [GO:0001948]:glycoprotein binding;[GO:0004872]:receptor activity;[GO:0005515]:protein binding;[GO:0006909]:phagocytosis;[GO:0007155]:cell adhesion;[GO:0007159]:leukocyte cell-cell adhesion;[GO:0007160]:cell-matrix adhesion;[GO:0007229]:integrin-mediated signaling pathway;[GO:0008305]:integrin complex;[GO:0009986]:cell surface;[GO:0016020]:membrane;[GO:0016021]:integral component of membrane;[GO:0019901]:protein kinase binding;[GO:0030369]:ICAM-3 receptor activity;[GO:0030593]:neutrophil chemotaxis;[GO:0031623]:receptor internalization;[GO:0034113]:heterotypic cell-cell adhesion;[GO:0034687]:integrin alphaL-beta2 complex;[GO:0035987]:endodermal cell differentiation;[GO:0043113]:receptor clustering;[GO:0043235]:receptor complex;[GO:0046872]:metal ion binding;[GO:0050730]:regulation of peptidyl-tyrosine phosphorylation;[GO:0050839]:cell adhesion molecule binding;[GO:0070062]:extracellular exosome;[GO:0071404]:cellular response to low-density lipoprotein particle stimulus;[GO:1903561]:extracellular vesicle
37            Model_Annotations  Phenotypes_Position=Source: OMIM Description: LEUKOCYTE ADHESION DEFICIENCY Variation_names: rs137852615|Source: Uniprot Phenotype_name: LAD1 Description: Leukocyte adhesion deficiency 1 Variation_names: rs137852615|Source: ClinVar Description: ClinVar: phenotype not specified|Source: ClinVar Description: Leukocyte adhesion deficiency type 1|Source: ClinVar Description: LEUKOCYTE ADHESION DEFICIENCY|Source: HGMD-PUBLIC Phenotype_name: HGMD_MUTATION Description: Annotated by HGMD but no phenotype description is publicly available;Phenotypes_Gene=Leukocyte adhesion deficiency type 1|GTR|MedGen|OMIM;
Slide 20: Sample annotated SNP cont.
Field number  Field name      Value
38            Comments        Gene_Status=KNOWN_BY_PROJECTION;Transcript_Status=KNOWN_BY_PROJECTION;Gene_Biotype=protein_coding;Transcript_Biotype=protein_coding;SIFT_Prediction_Ensembl=deleterious(0);
39            Ref_SNPs        rs445709131
40            Is_Fully_Known  yes
Conclusion: missense variant, predicted to affect protein function, predicted to cause leukocyte adhesion deficiency.
This example is meant to demonstrate the utility of the various annotation fields.
Slide 21: Same SNP annotated with Ensembl VEP
Field name          Value
Uploaded_variation  1_145114963_T/C
Location            1:145114963-145114963
Allele              C
Consequence         missense_variant
IMPACT              MODERATE
SYMBOL              ITGB2
Gene                ENSBTAG00000017060
Feature_type        Transcript
Feature             ENSBTAT00000022687
BIOTYPE             protein_coding
Slide 22: Same SNP annotated with Ensembl VEP cont.
Field name          Value
EXON                5/16
cDNA_position       488
CDS_position        383
Protein_position    128
Amino_acids         D/G
Codons              gAc/gGc
Existing_variation  rs445709131
STRAND              -1
ENSP                ENSBTAP00000022687
SWISSPROT           P32592
Slide 23: Same SNP annotated with Ensembl VEP cont.
Field name  Value
UNIPARC     UPI000012DA0A
SIFT        deleterious(0)
DOMAINS     Pfam_domain:PF00362,PIRSF_domain:PIRSF002512,Prints_domain:PR01186,SMART_domains:SM00187,Superfamily_domains:SSF53300
CLIN_SIG    -
PHENO       -
BLOSUM62    -1
Slide 24: Same SNP annotated with Ensembl VEP cont.
Conclusion: the Ensembl VEP annotation is consistent with the NGS-SNP annotation, but the relationship to leukocyte adhesion deficiency is not apparent from the annotation information provided by VEP.
Slide 25: Summary of results for indels
Slide 26: Known vs. novel indels
Indel type  Count
Known       1,367,963 (77.80%)
Novel       390,236 (22.20%)
All         1,758,199
- “Known” is used here to describe input variants where the variant and all of its alleles exist already in the reference database.
Slide 27: Numbers of indels in each function class
INTERGENIC 1144901
intron_variant 476122
upstream_gene_variant 66438
downstream_gene_variant 59886
3_prime_UTR_variant 4839
frameshift_variant 1619
inframe_deletion 1325
splice_region_variant 1236
5_prime_UTR_variant 849
non_coding_transcript_exon_variant 346
inframe_insertion 253
splice_acceptor_variant 135
splice_donor_variant 104
coding_sequence_variant 72
stop_gained 27
non_coding_transcript_variant 20
mature_miRNA_variant 17
protein_altering_variant 10
Total 1758199
Slide 28: Length distribution of all indels and indels in coding regions
Slide 29: Length distribution of all deletions and deletions in coding regions
Slide 30: Length distribution of all insertions and insertions in coding regions
Slide 31: Sample annotated indel
Field number  Field name            Value
1             CHROM                 2
2             POS                   6218378
3             ID                    .
4             REF                   GATGAACACTCCA
5             ALT                   GA
6             QUAL                  .
7             FILTER                .
8             INFO                  .
9             Functional_Class      frameshift_variant
10            Chromosome_Reference  ATGAACACTCC
See http://www.ualberta.ca/~stothard/downloads/NGS-SNP/annotate_INDELs.html
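The relationship between the VCF-style REF/ALT alleles and the “Chromosome_Reference” value can be illustrated by trimming the bases the two alleles share. The sketch below trims the shared suffix first and then the shared prefix, an order that happens to reproduce the values shown for this sample indel; whether NGS-SNP derives the field this way is an assumption, and the function name `trim_alleles` is hypothetical.

```python
def trim_alleles(ref, alt):
    """Remove bases shared by REF and ALT.

    Returns (offset, ref_allele, alt_allele), where offset is the number of
    leading bases trimmed and an empty allele is written as '-'.
    """
    # Trim the shared suffix first.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim the shared prefix, tracking how far the start moves.
    offset = 0
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        offset += 1
    return offset, ref or "-", alt or "-"


# REF and ALT from the sample annotated indel.
offset, deleted, inserted = trim_alleles("GATGAACACTCCA", "GA")
```

For the sample indel this gives an 11-base deletion, `ATGAACACTCC`, with `-` as the alternative allele. Trimming the prefix first would instead yield `TGAACACTCCA`, an equivalent representation of the same deletion, which is an instance of the position ambiguity mentioned earlier.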
Slide 32: Sample annotated indel cont.
Field number  Field name                       Value
11            Chromosome_Reads                 -
12            Gene_Description                 Growth/differentiation factor 8 [Source:UniProtKB/Swiss-Prot;Acc:O18836]
13            Ensembl_Gene_ID                  ENSBTAG00000011808
14            Entrez_Gene_Name                 MSTN
15            Entrez_Gene_ID                   281187
16            Ensembl_Transcript_ID            ENSBTAT00000015674
17            Transcript_INDEL_Position        951
18            Transcript_INDEL_Reference       ATGAACACTCC
19            Transcript_INDEL_Reads           -
20            Transcript_To_Chromosome_Strand  forward
Slide 33: Sample annotated indel cont.
Field number  Field name                    Value
21            Ensembl_Protein_ID            ENSBTAP00000015674
22            UniProt_ID                    O18836;C6KEF7;MSTN-201
23            Amino_Acid_Position           273
24            Overlapping_Protein_Domains   superfamily IPR029034 Cystine-knot cytokine
25            Overlapping_Protein_Features  CHAIN:267:375:Growth/differentiation factor 8.;DISULFID:272:282:{ECO:0000250|UniProtKB:O08689}.;CHAIN:19:375:{ECO:0000256|SAM:SignalP}.;DOMAIN:263:375:TGF_BETA_2.
26            Gene_Ontology                 [GO:0005102]:receptor binding;[GO:0005125]:cytokine activity;[GO:0005160]:transforming growth factor beta receptor binding;[GO:0005576]:extracellular region;[GO:0005615]:extracellular space;[GO:0005623]:cell;[GO:0007179]:transforming growth factor beta receptor signaling pathway;[GO:0008083]:growth factor activity;[GO:0008201]:heparin binding;[GO:0010862]:positive regulation of pathway-restricted SMAD protein phosphorylation;[GO:0014732]:skeletal muscle atrophy;[GO:0033673]:negative regulation of kinase activity;[GO:0040007]:growth;[GO:0042803]:protein homodimerization activity;[GO:0042981]:regulation of apoptotic process;[GO:0043408]:regulation of MAPK cascade;[GO:0045662]:negative regulation of myoblast differentiation;[GO:0045893]:positive regulation of transcription, DNA-templated;[GO:0046627]:negative regulation of insulin receptor signaling pathway;[GO:0046716]:muscle cell cellular homeostasis;[GO:0048147]:negative regulation of fibroblast proliferation;[GO:0048468]:cell development;[GO:0048632]:negative regulation of skeletal muscle tissue growth;[GO:0051898]:negative regulation of protein kinase B signaling;[GO:0060395]:SMAD protein signal transduction;[GO:0071549]:cellular response to dexamethasone stimulus
Slide 34: Sample annotated indel cont.
Field number  Field name         Value
27            Model_Annotations  Phenotypes_Gene=Muscle hypertrophy|GTR|MedGen|OMIM|Myostatin-related muscle hypertrophy|Gene Reviews;Overlapping_Protein_Features=CHAIN:267:375:Growth/differentiation factor 8.|DISULFID:272:282:{ECO:0000250UniProtKB:O08689}.|CHAIN:19:375:{ECO:0000256SAM:SignalP}.|DOMAIN:263:375:TGF_BETA_2.;
28            Comments           Number_Of_Equivalent_Indels=1(0,1);Gene_Status=KNOWN;Transcript_Status=KNOWN;Gene_Biotype=protein_coding;Transcript_Biotype=protein_coding;Length_Downstream_Protein=102(27.20);
29            Ref_INDELs         .
30            Is_Fully_Known     no