Slide1
Transcripts: Background and Curation Strategies
Marina DiStefano, Ph.D.
Clinical Molecular Genetics Fellow,
Harvard Medical School Genetics Training Program
Biocurator Call, 1.10.19
Slide2
Outline
Why does transcript curation matter? Examples from Hearing Loss
How are transcripts annotated?
How can you curate transcripts? Examples from Hearing Loss
Slide3
Variant interpretation requires gene curation
If a gene is not associated with disease, the clinical significance of a variant in that gene cannot be interpreted.
Figure modified from Daniel MacArthur
Slide4
Variant Interpretation Requires Transcript Curation
Variant interpretation may be dependent on the transcript (we often want to assess the most severe molecular consequence)
Transcript choice may be dependent on the disease and tissue-specific expression
Clinical labs often choose the longest transcript for annotation, which may not be the most biologically relevant
Slide5
Example 1: Tissue-specific Exon Expression
TBC1D24
Associated with nonsyndromic hearing loss, DOORS syndrome, and a spectrum of epilepsy conditions
RefSeq lists 2 curated (NM) transcripts: NM_001199107.1 and NM_020705.2
NM_001199107.1: Expressed in mouse neurons
NM_020705.2: Expressed in mouse cochlea and non-neuronal tissues
c.969_970delGT (p.Ser324Thrfs)
NM_001199107.1:c.969_970delGT (p.Ser324Thrfs) is associated with severe lethal epileptic encephalopathy without hearing loss (Guven 2013)
Slide6
Example 2: Multiple Molecular Consequences
Clarin 1 (CLRN1/USH3A)
Integral membrane glycoprotein, definitive for Usher Syndrome Type 3 (ClinGen HL CDWG 3.2.17)
RefSeq lists 4 transcripts: NM_001195794, NM_052995, NM_174878, NM_001256819
Longest transcript: NM_001195794
Reporting transcript (LMM, ClinVar and HGMD): NM_174878
Reported LP/P variants are a mixture of missense and nonsense variants
NM_001195794.1: c.368C>A (p.Ala123Asp)
NM_001256819.1: c.540C>A (p.Cys180X)
NM_052995.2: c.140C>A (p.Ala47Asp)
NM_174878.2: c.368C>A (p.Ala123Asp)
Functional studies showed that the variant protein is not correctly localized in the cell and is rapidly degraded (Isosomppi 2009). We cannot be sure which molecular consequence is causative for disease.
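The CLRN1 example can be made concrete with a small sketch that picks the most severe consequence of one variant across its transcript annotations. The transcript IDs and HGVS strings come from the slide; the severity ranking and the string-based classifier are simplifying assumptions for illustration, not a standard consequence ontology.

```python
# Rough severity ranking; an assumption for illustration only.
SEVERITY = {"synonymous": 0, "missense": 1, "frameshift": 2, "nonsense": 2}

# The same CLRN1 variant as annotated on each RefSeq transcript (from the slide).
CLRN1_ANNOTATIONS = {
    "NM_001195794.1": ("c.368C>A", "p.Ala123Asp"),
    "NM_001256819.1": ("c.540C>A", "p.Cys180X"),
    "NM_052995.2": ("c.140C>A", "p.Ala47Asp"),
    "NM_174878.2": ("c.368C>A", "p.Ala123Asp"),
}


def classify(protein_change: str) -> str:
    """Very rough classifier based only on the HGVS protein string."""
    if protein_change.endswith(("X", "*", "Ter")):
        return "nonsense"
    if "fs" in protein_change:
        return "frameshift"
    if protein_change.endswith("="):
        return "synonymous"
    return "missense"


def most_severe(annotations):
    """Return the (transcript, (cDNA, protein)) pair with the highest severity."""
    return max(annotations.items(), key=lambda item: SEVERITY[classify(item[1][1])])


transcript, (cdna, protein) = most_severe(CLRN1_ANNOTATIONS)
print(transcript, cdna, protein, classify(protein))
```

Here the reporting transcript NM_174878 yields a missense change, while NM_001256819 yields a nonsense change, which is why the choice of transcript changes the interpretation.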
Slide7
Transcript Annotation Efforts
Ensembl/GENCODE (EMBL-EBI): Transcripts designated by automatic and manual annotation. Sequence is predictive, pulled directly from the genome build.
ENST label (ENSP is for protein)
References that use Ensembl transcripts:
gnomAD/ExAC
100,000 Genomes Project
GTEx
DECIPHER
COSMIC
Slide8
Ensembl
Evidence
Flags
Corresponding RefSeq
https://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000162065;r=16:2475051-2509560
Slide9
Transcript Annotation Efforts
RefSeq (NCBI): Transcripts are designated by curation. These annotations are independent of the genome build (GenBank).
RefSeqGene records are the most thoroughly curated
Prefixes:
First letter N (e.g. NM_): Transcript supported by some evidence, whether it is published literature or GenBank cDNA or EST data.
First letter X (e.g. XM_): Predicted transcripts that are not confirmed with curation or evidence.
Second letter M: mRNA, R: non-coding RNA, P: protein, e.g. NM_: mRNA transcript supported by evidence
Predominant transcript set for clinical labs/clinical annotation pipelines and research publications (gene curation)
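The prefix scheme above can be sketched as a small parser. This is a minimal illustration that handles only the two-letter prefixes named on the slide; the function name and the returned dictionary layout are assumptions, and other RefSeq prefixes (for example genomic records) are deliberately rejected.

```python
import re

# Meanings taken from the slide.
FIRST_LETTER = {"N": "supported by evidence", "X": "predicted"}
SECOND_LETTER = {"M": "mRNA", "R": "non-coding RNA", "P": "protein"}

# Accession pattern: prefix, underscore, digits, optional ".version".
ACCESSION = re.compile(r"^([NX])([MRP])_(\d+)(?:\.(\d+))?$")


def describe_refseq(accession: str) -> dict:
    """Decode a RefSeq accession such as NM_020705.2 (hypothetical helper)."""
    match = ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"not an NM/NR/NP/XM/XR/XP accession: {accession}")
    first, second, _number, version = match.groups()
    return {
        "support": FIRST_LETTER[first],
        "molecule": SECOND_LETTER[second],
        "version": int(version) if version else None,
    }


print(describe_refseq("NM_020705.2"))
```

Keeping the version number matters when curating, because HGVS coordinates can change between versions of the same accession.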
Slide10
RefSeqGene
https://www.ncbi.nlm.nih.gov/gene/57465#reference-sequences
Slide11
Transcript Annotation Efforts: Collaboration
LRG, Locus Reference Genomic (EMBL-EBI/NCBI collaboration):
Aim is to ensure efficient and consistent variant reporting by defining reference sequences.
This is usually a single transcript per gene
It is stable and does not change
There are currently 1182 LRGs.
LRG has curators, but also solicits curation expertise from the community
Slide12
Collaboration on Hearing Loss genes:
Manually reviewed suggested NMs and updated, as necessary
Updated corresponding GENCODE annotation
Harmonized NMs and corresponding ENSTs (100% identity) – foster bi-directional exchange of data
Created LRGs using harmonized transcripts
In the future, this curation effort will feed into the new Matched Annotation from NCBI and EMBL-EBI (MANE) project
Talk in this session describes MANE in more detail
Records:
Display reference sequences and include additional annotation at the locus
Collaboration with the LRG project
Slide from Joannella Morales
Slide13
Transcript selection
Harmonized transcripts
Slide from Joannella Morales
Slide14
Newest Collaboration: MANE Project
Matched Annotation from NCBI and EMBL-EBI project
A collaborative transcript set (best of RefSeq and Ensembl):
Aligned to GRCh38, which is the newer genome build
100% identical for RefSeq and Ensembl (UTRs and coding)
Transcripts are well-supported, expressed, and conserved
Transcripts are fairly stable
Clinical or biological data is manually curated
MANE Select: 1 transcript per locus (50% released in Dec 2018)
MANE Plus: this is an evolution of the LRG project (well-supported, tissue specific, relevant to certain user groups)
Modified from Fiona Cunningham
Slide15
MANE Project Resources
A webinar on this effort was held recently
The 50% of MANE Select transcripts released so far are available on NCBI’s FTP site (ftp://ftp.ncbi.nlm.nih.gov/refseq/MANE)
Slides from the webinar can be found here:
https://drive.google.com/drive/folders/1HyY0vvJ9e-Ocm14JOm7Y2p8QgXU4bokt
A recording will be available soon
Slide16
How to map across transcript annotations
Resources with nomenclature will help you map
ClinGen Allele Registry
ClinGen VCI (VEP)
References themselves (Ensembl)
Slide17
Allele registry
http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/allele?hgvs=NM_001009921.2%3Ac.3130G%3EA
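The link above is simply the registry's allele endpoint with a percent-encoded HGVS expression as the `hgvs` query parameter. A minimal sketch of building such a link follows; the base URL and the example variant are taken from the slide, the helper name is hypothetical, and no request is actually sent.

```python
from urllib.parse import quote

# Base URL exactly as shown on the slide.
BASE = "http://reg.clinicalgenome.org/redmine/projects/registry/genboree_registry/allele"


def allele_registry_url(hgvs: str) -> str:
    """Build a lookup link for one HGVS expression (hypothetical helper)."""
    # safe="" forces ':' and '>' to be percent-encoded, as in the slide's URL.
    return f"{BASE}?hgvs={quote(hgvs, safe='')}"


print(allele_registry_url("NM_001009921.2:c.3130G>A"))
```

Encoding matters because `:` and `>` are not safe to leave raw in a query string.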
Slide18
Slide19
Ensembl
Corresponding RefSeq
Slide20
Where can you find information for curating transcripts?
Slide21
Transcript curation resources
The goal is to figure out the expression profile/levels of each transcript to make an informed choice
OMIM
PubMed/Google search
RefSeqGene
GTEx
Disease-specific efforts (RNA-seq datasets, etc.)
Slide22
OMIM
“Cloning and Expression” section is often most useful, but go to the primary source
“Gene Structure” and “Gene Function” sections can also provide relevant information
Slide23
PubMed/Google
“XX Gene Transcript Expression”
“XX Gene Human Transcript Expression”
“XX Gene Isoforms”
“XX Gene Splicing”
Slide24
GTEx
Genotype-Tissue Expression project
53 non-diseased tissue sites from ~1000 adults, post-mortem
WGS, WES, RNA-Seq
https://gtexportal.org/home/gene/TBC1D24
You can evaluate gene-level expression as well as transcript- and exon-level expression
Slide25
How do you determine which transcript to use?
DiStefano et al. J Mol Diagn. 2018
Slide26
Transcript Curation Process
Gene-level curation, Variant spectrum
DiStefano et al. J Mol Diagn. 2018
Impact on LoF Variant Interpretation
if mRNA doesn’t escape NMD
Must be further defined
Is transcript biologically relevant?
Is exon biologically relevant?
PVS1
Slide27
Categorization Results
Slide28
Exon-level Curation to Support Variant Interpretation
Classify Exons
Evaluate interpreted variation (ClinVar, HGMD)
GTEx does not have cochlea
Slide29
Example 2: Exons of Uncertain Significance
Endothelin 3 (EDN3) - Waardenburg syndrome (autosomal recessive)
RefSeq lists 5 transcripts, all share exons 1-3, 5
There is a frameshift variant in exon 4 in 0.6% of Finnish European alleles in gnomAD (including 2 homozygotes)
GTEx predicts exon 4 to be spliced out where EDN3 is expressed
Calls into question the pathogenicity of two DM variants in HGMD that are located in exon 4
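To see why a 0.6% allele frequency is hard to reconcile with a rare recessive condition, it helps to work out the expected genotype frequencies. The sketch below assumes Hardy-Weinberg equilibrium, which is an assumption added here and not stated on the slide; the only input from the slide is the 0.6% allele frequency.

```python
def hardy_weinberg(allele_freq: float) -> dict:
    """Expected genotype frequencies under Hardy-Weinberg equilibrium (assumed)."""
    q = allele_freq
    p = 1.0 - q
    return {"carrier": 2 * p * q, "homozygote": q * q}


# 0.6% of Finnish European alleles in gnomAD (from the slide).
expected = hardy_weinberg(0.006)
one_in = round(1 / expected["homozygote"])

print(f"carriers: {expected['carrier']:.2%}")
print(f"homozygotes: about 1 in {one_in}")
```

Under this assumption roughly 1.2% of individuals would be carriers and about 1 in 28,000 would be homozygous for this single variant, a figure a curator can compare against the known prevalence of the disease.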
Slide30
Impact on Interpretation
6% of all exons were of “uncertain significance”
These exons contained 124 "clinically significant" variants
These variants require further evaluation to determine whether other data support pathogenicity
Slide31
Technically challenging regions
Although it may not be a curator’s job to evaluate technical aspects of testing, it is important to be aware of sequencing issues with a gene to score variants appropriately. Experts can help with this. It could even be part of the precuration, as transcript choice might be.
If it’s a common problem, publications will often allude to it in the methods.
Example of technically challenging regions in hearing loss
Slide32
Technically Challenging Regions
NGS data from Partners Laboratory for Molecular Medicine and Children’s Hospital of Philadelphia were used to calculate average mapping quality and depth of coverage for 109 hearing loss genes
43 technically challenging exons in 20 different genes had inadequate coverage and/or homology issues which might lead to false variant calls
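The per-exon screen described above can be sketched as follows. This is only an illustration: the thresholds, the exon names, and the per-base values are all hypothetical, and the slide does not state which cutoffs the two laboratories actually used.

```python
from statistics import mean

# Hypothetical cutoffs; the slide does not give the real ones.
MIN_MEAN_MAPQ = 20.0
MIN_MEAN_DEPTH = 20.0


def flag_exons(per_base: dict) -> dict:
    """Map exon name -> reasons it looks technically challenging.

    per_base maps an exon name to a list of (mapping_quality, depth) pairs,
    one pair per base in the exon.
    """
    flagged = {}
    for exon, bases in per_base.items():
        avg_mapq = mean(mapq for mapq, _ in bases)
        avg_depth = mean(depth for _, depth in bases)
        reasons = []
        if avg_mapq < MIN_MEAN_MAPQ:
            reasons.append("low mapping quality (possible homology)")
        if avg_depth < MIN_MEAN_DEPTH:
            reasons.append("low depth of coverage")
        if reasons:
            flagged[exon] = reasons
    return flagged


# Made-up data for three made-up exons.
example = {
    "GENE_A:exon2": [(60, 120), (60, 110), (59, 130)],
    "GENE_B:exon7": [(5, 90), (8, 95), (3, 100)],
    "GENE_C:exon1": [(60, 4), (60, 6), (58, 5)],
}
print(flag_exons(example))
```

Averaging over the exon keeps the sketch short; a real pipeline might instead look at the fraction of bases below each cutoff.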
Slide33
Technically Challenging Regions
http://exomeslicer.chop.edu/
Niazi et al. J Mol Diagn. 2018
Slide34
Future Directions and Scaling
Scaling transcript curation is a challenge
Discussion of scaling this process is in progress
Genes can be auto-categorized by RefSeq gene annotations (C1, C2, C3)
Exons can be filtered by high frequency LoF variants in gnomAD
Exons can be filtered by GTEx expression
Limitations:
Literature curation is still a manual process
GTEx only contains adult, post-mortem tissue
Relevant tissues and timepoints may be missing (e.g. cochlea, neonatal)
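The two exon filters mentioned on this slide could be combined along these lines. This is a sketch under stated assumptions: both cutoffs are hypothetical, the record layout is invented for illustration, and apart from the 0.6% frequency quoted earlier for EDN3 exon 4, every number is a placeholder.

```python
from dataclasses import dataclass

# Hypothetical cutoffs, chosen only to make the example run.
MAX_LOF_ALLELE_FREQ = 0.005
MIN_RELATIVE_EXPRESSION = 0.10


@dataclass
class Exon:
    name: str
    max_lof_allele_freq: float  # highest LoF allele frequency seen in gnomAD
    relative_expression: float  # exon expression relative to the gene, 0..1


def needs_manual_review(exon: Exon) -> bool:
    """Flag an exon that carries a common LoF variant or is barely expressed."""
    return (
        exon.max_lof_allele_freq > MAX_LOF_ALLELE_FREQ
        or exon.relative_expression < MIN_RELATIVE_EXPRESSION
    )


exons = [
    # 0.006 is the frequency from the EDN3 slide; 0.0 is a placeholder
    # standing in for an exon that is spliced out.
    Exon("EDN3 exon 4", 0.006, 0.0),
    # Entirely made-up control exon.
    Exon("hypothetical exon", 0.00001, 0.9),
]
flagged = [exon.name for exon in exons if needs_manual_review(exon)]
print(flagged)
```

Flagged exons would still go to a curator, since the literature review step remains manual and the relevant tissue may be absent from GTEx.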
Slide35
Conclusions
Transcript curation is critical for variant interpretation
Annotation efforts are harmonizing
Resources are available to curate transcripts
Slide36
Acknowledgements
Heidi Rehm
Ahmad Abou Tayoun
Andrea Oza
Sarah Hemphill, Brandon Cushman, Andy Grant, Becky Siegert
Sami Amr
Mark Bowser, Beth Hynes, Mike Gonzalez
Joannella Morales, Fiona Cunningham, LRG team
DiStefano et al. J Mol Diagn. 2018