Slide1
Considerations for multi-omics data integration
Michael Tress, CNIO
Slide2
Predictions: Ensembl/GENCODE automatic pipelines, HBM and GENCODE RNA-seq data, individual large-scale studies.
Coding potential is determined from similarity to known proteins, conservation, and the presence of Pfam functional domains.
Some transcripts are annotated as coding or non-coding based on the balance of probabilities. Good proteomics evidence could help here.
A few years ago the human reference genome was missing a number of coding genes, in part due to gaps in the reference build used for Ensembl and RefSeq. Now the set of coding genes is probably almost complete.
GENCODE genome annotation
Slide3
We collected peptides from a number of large-scale proteomics resources
The NIST database
We wanted to make sure that we had reliably identified peptides
Slide4
The older the ancestral gene, the higher the chance of detecting peptides.
Genes that appeared since primates are practically not detected!
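The trend stated here amounts to a per-bin detection rate. The sketch below shows one way to tabulate it. The gene names, age labels and detection flags are made-up placeholders, not data from the study, and detection_rate_by_age is a hypothetical helper name.

```python
from collections import defaultdict

def detection_rate_by_age(genes):
    """Return {age_bin: % of genes with at least one detected peptide}.

    `genes` is an iterable of (gene_id, age_bin, detected) tuples.
    """
    totals = defaultdict(int)
    detected = defaultdict(int)
    for _gene_id, age_bin, was_detected in genes:
        totals[age_bin] += 1
        if was_detected:
            detected[age_bin] += 1
    return {age: 100.0 * detected[age] / totals[age] for age in totals}

# Toy input (placeholder values, not results from the talk).
toy_genes = [
    ("geneA", "Bilateria", True),
    ("geneB", "Bilateria", True),
    ("geneC", "Vertebrates", True),
    ("geneD", "Vertebrates", False),
    ("geneE", "Primates", False),
    ("geneF", "Primates", False),
]

rates = detection_rate_by_age(toy_genes)
for age, pct in rates.items():
    print(f"{age}: {pct:.0f}% of genes detected")
```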
Gene family ages based on Ensembl Compara
Ezkurdia, Juan et al, Hum Mol Gen, 2014
Slide5
Genes with no protein features at all (structure, function, etc.) were not detected
Y-axis: % of genes in each bin detected in proteomics experiments
Slide6
Paralogues                        Ancestor
ACSL1, ACSL6                      Jawed vertebrates
ACTN1, ACTN2, ACTN4               One AS in fruit fly, one in vertebrates
ATP2B1, ATP2B2, ATP2B3, ATP2B4    Bilateria
DNM1, DNM2                        Vertebrates
GNAL, GNAS                        Jawed vertebrates
ITGA3, ITGA6                      Vertebrates
PDLIM3, LDB3                      Chordates
TPM1, TPM2, TPM3, TPM4            Vertebrates
All 60 homologous exons were conserved in jawed vertebrates, e.g. fugu and zebrafish, which implies that they evolved at least 460 million years ago. As a comparison, mouse and human conserve fewer than 20% of AS exons.
Abascal et al, PLoS Comp Biol, 2015
We found evidence for just 282 splice events - many were of ancient origin
Slide7
Most detected alternative isoforms would not break Pfam domains
ISE = isoforms detected with peptide evidence; GENCODE20 is the background of the whole genome; AI genes are all isoforms annotated for the 246 genes with detected alternative isoforms.
Slide8
What does that mean for proteogenomics analyses?
Most (but not all!) detected novel coding genes/isoforms are likely to have little evolutionary history and few protein features. We find that standard proteomics experiments are less likely to detect peptides for these regions.
If many novel regions are identified in a study, quality control is needed, because many will have been identified from less reliable peptides (semi-tryptic peptides, low-scoring PSMs, poor spectra).
Multi-omics considerations
Slide9
XXX ORFs – no protein features
Results: More than 200 previously uncharacterized coding regions
A recent paper identified many peptides for these new ORFs. These candidates are short and have no protein features.
Problem: Peptides were cleaved by trypsin in the experiment, yet more than 80% of the peptides are semi-tryptic or non-tryptic.
Caveat: that is not to say that these novel regions do not code for proteins, just that they are not found in standard proteomics experiments.
Slide10
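The tryptic versus semi-tryptic distinction raised for the novel ORFs can be made concrete with a small check on peptide termini. This is a minimal sketch under a simplified cleavage rule (cut after K or R, but not before P). The function names and the toy protein sequence are hypothetical, and details such as N-terminal methionine removal and missed cleavages are ignored.

```python
def is_tryptic_site(prev_res, next_res):
    """True if trypsin would cut between prev_res and next_res.

    Simplified rule: cleave after K or R, but not before P.
    """
    return prev_res in "KR" and next_res != "P"

def classify_peptide(protein, peptide):
    """Classify a peptide as 'tryptic', 'semi-tryptic' or 'non-tryptic'
    from where it sits in its parent protein sequence."""
    start = protein.find(peptide)  # first occurrence only
    if start < 0:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    # N-terminal end: protein start, or a trypsin site just before the peptide.
    n_ok = start == 0 or is_tryptic_site(protein[start - 1], protein[start])
    # C-terminal end: protein end, or a trypsin site just after the peptide.
    c_ok = end == len(protein) or is_tryptic_site(protein[end - 1], protein[end])
    if n_ok and c_ok:
        return "tryptic"
    if n_ok or c_ok:
        return "semi-tryptic"
    return "non-tryptic"

# Toy protein sequence (made up for illustration).
toy_protein = "MAGTKLSEERAQWVNKDDTFR"
for pep in ("LSEER", "SEER", "SEERAQ"):
    print(pep, classify_peptide(toy_protein, pep))
```

In a digest that used trypsin, a candidate set dominated by the second and third classes is the pattern flagged as a problem above.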
Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nat Methods. 2014
Novel peptides identified using proteogenomics should be held to a higher standard of evidence than known peptides (spectra!).
It is important to use a multi-stage data analysis strategy.
If you search with a combined database and few modifications, you will find that many pseudogenes appear to express peptides.
Initial searches should first be carried out against known coding genes (with a range of possible modifications) and possibly known single amino acid variants (SAVs).
Proteogenomics strategy
Slide11
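The multi-stage idea (known coding genes first, novel sequences only for spectra left unexplained, and a stricter bar for novel hits) could be sketched as below. Everything here is an assumption made for illustration: the input layout, the probability-like scores, the two cutoffs and the placeholder peptides are hypothetical. A real pipeline would re-search the leftover spectra and estimate error rates separately for each stage.

```python
def multi_stage_assign(candidates, known_cutoff=0.90, novel_cutoff=0.99):
    """Assign each spectrum in two stages.

    `candidates` maps spectrum id -> {"known": (peptide, score) or None,
                                       "novel": (peptide, score) or None}.
    Stage 1 accepts matches to known coding genes. Stage 2 looks at novel
    candidates only for spectra stage 1 left unexplained, with a stricter
    (hypothetical) cutoff.
    """
    assigned = {}
    leftover = []
    # Stage 1: known coding genes.
    for spectrum_id, hits in candidates.items():
        known = hits.get("known")
        if known is not None and known[1] >= known_cutoff:
            assigned[spectrum_id] = ("known", known[0])
        else:
            leftover.append(spectrum_id)
    # Stage 2: novel sequences, unexplained spectra only.
    for spectrum_id in leftover:
        novel = candidates[spectrum_id].get("novel")
        if novel is not None and novel[1] >= novel_cutoff:
            assigned[spectrum_id] = ("novel", novel[0])
    return assigned

# Placeholder peptides and scores, not real search results.
toy_candidates = {
    "s1": {"known": ("AAAGGGK", 0.97), "novel": ("AAAGGLK", 0.98)},
    "s2": {"known": None, "novel": ("LLLSSSR", 0.995)},
    "s3": {"known": None, "novel": ("VVVTTTK", 0.93)},
}

result = multi_stage_assign(toy_candidates)
print(result)
```

Spectrum s1 never reaches the novel stage because a known peptide already explains it, which is the behaviour a combined-database search does not give.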
Spectrum matched (incorrectly) to peptide EITALAPSIMK from the putative POTEPK gene. The match is nearly perfect.
The same spectrum matched (probably correctly) to actin peptide EITALAPSTMK with a lysine dimethylation. This peptide is identified 63,000 times in PeptideAtlas.
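One way to see how two such matches can compete for the same spectrum is to compare precursor masses. The sketch below uses standard monoisotopic residue masses. The slide does not say which modification, if any, accompanied the incorrect match, so the methionine oxidation used here is purely an assumption, chosen to show that a T-to-I difference plus a common modification can come out nearly isobaric with the dimethylated actin peptide.

```python
# Monoisotopic residue masses (Da) for the amino acids in the two peptides.
RESIDUE_MASS = {
    "A": 71.03711, "E": 129.04259, "I": 113.08406, "K": 128.09496,
    "L": 113.08406, "M": 131.04049, "P": 97.05276, "S": 87.03203,
    "T": 101.04768,
}
WATER = 18.01056
DIMETHYL = 28.03130   # +C2H4, the lysine dimethylation named on the slide
OXIDATION = 15.99491  # +O on methionine: an assumption, not from the slide

def peptide_mass(sequence, modification_mass=0.0):
    """Neutral monoisotopic mass of a peptide plus one optional modification."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + modification_mass

actin = peptide_mass("EITALAPSTMK", DIMETHYL)
pote_plain = peptide_mass("EITALAPSIMK")
pote_oxidised = peptide_mass("EITALAPSIMK", OXIDATION)

print(f"actin + dimethyl:        {actin:.4f}")
print(f"EITALAPSIMK unmodified:  {pote_plain:.4f}")
print(f"EITALAPSIMK + oxidation: {pote_oxidised:.4f}")
print(f"difference:              {abs(actin - pote_oxidised):.4f}")
```

Under that assumption the two candidates differ by far less than a typical precursor tolerance, so fragment-level evidence and the allowed modification list decide which match wins.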
Pseudogene detection - PeptideAtlas
Slide12
Dominant isoforms
Slide13
We found evidence of AS in just over 1% of human genes, so 98% of protein-coding genes have evidence for just a single isoform.
Can we predict this isoform?
Slide14
Five methods for selecting a reference isoform
LONGEST: Standard reference isoform in all databases/large-scale experiments
CCDS: Unique CCDS. CCDS variants are a consensus between RefSeq and Ensembl/GENCODE
RNASEQ: 5-fold dominant transcripts from HBM data (Gonzalez-Porta et al, Genome Biol., 2013)
APPRIS: Principal isoforms based on structure, function and conservation (Rodriguez et al, NAR, 2012)
HCI: Highest connected isoforms, trained on RNA-seq data (Li et al, JPR, 2015)
Slide15
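As an illustration of the RNA-seq criterion among the five reference methods, the sketch below calls a transcript dominant when its expression is at least five times that of the runner-up for the same gene. That reading of "5-fold dominant" is an assumption, and the function name, transcript ids and expression values are hypothetical.

```python
def dominant_transcript(expression, fold=5.0):
    """Return the id of the dominant transcript for one gene, or None.

    `expression` maps transcript id -> expression value. The top transcript
    is called dominant if it is at least `fold` times the second highest.
    """
    if not expression:
        return None
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    top_id, top_value = ranked[0]
    if top_value <= 0:
        return None  # gene not expressed
    if len(ranked) == 1:
        return top_id  # only one annotated transcript
    runner_up_value = ranked[1][1]
    return top_id if top_value >= fold * runner_up_value else None

# Placeholder expression values, not HBM data.
print(dominant_transcript({"tx1": 50.0, "tx2": 8.0, "tx3": 1.0}))
print(dominant_transcript({"tx1": 20.0, "tx2": 10.0}))
```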
Five means of selecting reference isoforms
We calculated % agreement between the main proteomics isoform we found and the five reference methods: the longest sequence, APPRIS principal isoforms, unique CCDS variants, the dominant RNAseq transcripts and the Highest Connected Isoforms
Chart values as shown on the slide: 98.6%, 97.8%, 77.2%, 77.7%, 78%
Slide16
For those 3,000+ genes with a main experimental isoform, an APPRIS principal isoform and a unique CCDS variant, all three isoforms agreed for over 99% of the genes.
The clear agreement between three orthogonal sources (and the large number of tissues sampled) suggests that the main proteomics isoform is the dominant protein isoform in the cell.
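The agreement figures are simple proportions over the genes that have a call from every source being compared. A minimal sketch follows, assuming each source is a mapping from gene to chosen isoform. The gene and isoform ids are placeholders and the helper names are hypothetical.

```python
def percent_agreement(reference_calls, other_calls):
    """% of genes, among those called by both sources, with the same isoform."""
    shared = [gene for gene in reference_calls if gene in other_calls]
    if not shared:
        return 0.0
    same = sum(reference_calls[g] == other_calls[g] for g in shared)
    return 100.0 * same / len(shared)

def three_way_agreement(a, b, c):
    """% of genes, among those called by all three sources, where all agree."""
    shared = set(a) & set(b) & set(c)
    if not shared:
        return 0.0
    same = sum(a[g] == b[g] == c[g] for g in shared)
    return 100.0 * same / len(shared)

# Placeholder calls, not data from the study.
proteomics = {"GENE1": "iso1", "GENE2": "iso1", "GENE3": "iso2", "GENE4": "iso1"}
appris = {"GENE1": "iso1", "GENE2": "iso1", "GENE3": "iso2", "GENE4": "iso3"}
ccds = {"GENE1": "iso1", "GENE2": "iso1", "GENE3": "iso2"}
longest = {"GENE1": "iso2", "GENE2": "iso1", "GENE3": "iso2", "GENE4": "iso1"}

print("vs APPRIS: ", percent_agreement(proteomics, appris))
print("vs longest:", percent_agreement(proteomics, longest))
print("three-way: ", three_way_agreement(proteomics, appris, ccds))
```

Restricting each comparison to the shared genes matters: a gene without a unique CCDS variant, like GENE4 in the toy data, drops out of the three-way figure rather than counting as a disagreement.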
Indeed, alternative isoforms (non-APPRIS principal isoforms) “are significantly enriched in amino acid-changing variants, particularly those that have a strong impact on protein function” (Liu et al, Molecular BioSystems, 2015)
Ezkurdia et al, J. Proteome Res, 2015