Uploaded On 2017-05-10

Presentation Transcript

Slide1

Considerations for multi-omics data integration

Michael Tress, CNIO

Slide2

Predictions: Ensembl/GENCODE automatic pipelines, HBM and GENCODE RNA-seq data, individual large-scale studies.

Coding potential is determined from similarity to known proteins, conservation, and the presence of Pfam functional domains.

Some transcripts are annotated as coding or non-coding based on the balance of probabilities. Good proteomics evidence could help here.

A few years ago the human reference genome was missing a number of coding genes, in part due to gaps in the reference build used for Ensembl and RefSeq. Now the set of coding genes is probably almost complete.

GENCODE genome annotation

Slide3

We collected peptides from a number of large-scale proteomics resources

The NIST database

We wanted to make sure that we had reliably identified peptides

Slide4

The older the ancestral gene, the higher the chance of detecting peptides.

Genes that appeared since primates are practically not detected!
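The per-age-group detection percentages behind a plot like this can be computed with a simple tally. The sketch below is only an illustration: the function name, the input layout and the toy gene list are assumptions, not the pipeline or data used in the study.

```python
from collections import defaultdict


def detection_rate_by_bin(genes):
    """Return {bin: percent of genes in that bin with peptide evidence}.

    `genes` is an iterable of (bin_label, has_peptides) pairs.
    """
    totals = defaultdict(int)
    detected = defaultdict(int)
    for bin_label, has_peptides in genes:
        totals[bin_label] += 1
        if has_peptides:
            detected[bin_label] += 1
    return {label: 100.0 * detected[label] / totals[label] for label in totals}


# Hypothetical toy input, NOT data from the study.
toy_genes = [
    ("Bilateria", True), ("Bilateria", True),
    ("Bilateria", False), ("Bilateria", True),
    ("Primates", False), ("Primates", False),
]
rates = detection_rate_by_bin(toy_genes)
```

The same tally works for any binning, for example by number of annotated protein features instead of gene family age.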

Gene family ages based on ENSEMBL Compara

Ezkurdia, Juan et al, Hum Mol Gen, 2014

Slide5

Genes with no protein features at all (structure, function, etc.) were not detected

Y-axis: % of genes in each bin detected in proteomics experiments

Slide6

Paralogues                         Ancestor
ACSL1, ACSL6                       Jawed vertebrates
ACTN1, ACTN2, ACTN4                One AS in fruit fly, one in vertebrates
ATP2B1, ATP2B2, ATP2B3, ATP2B4     Bilateria
DNM1, DNM2                         Vertebrates
GNAL, GNAS                         Jawed vertebrates
ITGA3, ITGA6                       Vertebrates
PDLIM3, LDB3                       Chordates
TPM1, TPM2, TPM3, TPM4             Vertebrates

All 60 homologous exons were conserved in jawed vertebrates, e.g. fugu and zebrafish, which implies that they evolved at least 460 million years ago. As a comparison, mouse and human conserve fewer than 20% of AS exons.

Abascal et al, PLoS Comp Biol, 2015

We found evidence for just 282 splice events - many were of ancient origin

Slide7

Most detected alternative isoforms would not break Pfam domains

ISE = isoforms detected with peptide evidence; GENCODE20 is the background of the whole genome; AI genes are all isoforms annotated for the 246 genes with detected alternative isoforms.

Slide8

What does that mean for proteogenomics analyses?

Most (but not all!) detected novel coding genes/isoforms are likely to have little evolutionary history and few protein features. We find that standard proteomics experiments are less likely to detect peptides for these regions.

If many novel regions are identified in a study, quality control is needed, because many will have been identified by less reliable peptides (semi-tryptic peptides, low-scoring PSMs, poor spectra).

Multi-omics considerations

Slide9

XXX ORFs - no protein features

Results: More than 200 previously uncharacterized coding regions

A recent paper identified many peptides for these new ORFs. These candidates are short and have no protein features.

Problem: Peptides were cleaved by trypsin in the experiment, yet more than 80% of the peptides are semi-tryptic or non-tryptic.

Caveat: that is not to say that these novel regions do not code for proteins, just that they are not found in standard proteomics experiments.

Slide10

Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nat Methods. 2014

Novel peptides identified using proteogenomics should be held to a higher standard of evidence than known peptides (spectra!).
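One simple evidence check, in line with the semi-tryptic problem raised on the earlier slides, is to test whether both ends of a peptide are consistent with trypsin cleavage. The sketch below is a minimal illustration under stated assumptions: trypsin is taken to cut after K or R, the "not before proline" exception is ignored, and the function names and the toy protein sequence are hypothetical.

```python
def is_tryptic_site(protein, pos):
    """True if a cut between protein[pos-1] and protein[pos] fits trypsin.

    Simplified rule (assumption): trypsin cuts after K or R.
    The protein's own N- and C-terminus count as valid ends.
    """
    if pos == 0 or pos == len(protein):
        return True
    return protein[pos - 1] in "KR"


def classify_peptide(protein, peptide):
    """Label a peptide as 'tryptic', 'semi-tryptic' or 'non-tryptic'."""
    start = protein.find(peptide)
    if start < 0:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    n_term_ok = is_tryptic_site(protein, start)
    c_term_ok = is_tryptic_site(protein, end)
    if n_term_ok and c_term_ok:
        return "tryptic"
    if n_term_ok or c_term_ok:
        return "semi-tryptic"
    return "non-tryptic"


# Hypothetical toy protein, for illustration only.
toy_protein = "MAGKEITALRASTMKLLDE"
labels = {p: classify_peptide(toy_protein, p)
          for p in ("EITALR", "ITALR", "ITALRAS", "LLDE")}
```

In a trypsin-digested sample most genuine peptides are expected to be fully tryptic, so a set of novel identifications dominated by the other two classes deserves a closer look at the spectra.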

It is important to use a multi-stage data analysis strategy.

If you search with a combined database and few modifications you will find that many pseudogenes express peptides.
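A staged search of this kind can be sketched as below. This is an illustration only: `search_known` and `search_novel` are hypothetical stand-ins for real search-engine runs, the scores and thresholds are invented, and the stricter threshold for novel matches is an assumption that reflects the "higher standard of evidence" point above.

```python
def multi_stage_search(spectra, search_known, search_novel,
                       known_min_score, novel_min_score):
    """Assign spectra to known peptides first, then to novel candidates.

    Each search function takes a spectrum id and returns
    (peptide, score) or None. A spectrum explained by a known
    peptide in stage 1 is never offered to the novel database.
    """
    known_hits, novel_hits, unassigned = {}, {}, []
    for spectrum_id in spectra:
        match = search_known(spectrum_id)
        if match is not None and match[1] >= known_min_score:
            known_hits[spectrum_id] = match
            continue
        match = search_novel(spectrum_id)
        if match is not None and match[1] >= novel_min_score:
            novel_hits[spectrum_id] = match
        else:
            unassigned.append(spectrum_id)
    return known_hits, novel_hits, unassigned


# Hypothetical lookup tables standing in for two database searches.
known_results = {"s1": ("EITALAPSTMK", 0.95)}
novel_results = {"s1": ("EITALAPSIMK", 0.97),
                 "s2": ("AAGLSVEK", 0.99),
                 "s3": ("GGTLDAK", 0.60)}

known, novel, rest = multi_stage_search(
    ["s1", "s2", "s3"],
    known_results.get, novel_results.get,
    known_min_score=0.90, novel_min_score=0.98,
)
```

Here spectrum `s1` stays with the known peptide even though the toy novel match scores slightly higher, which is the behaviour a combined single-pass search would not give.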

Initial searches should first be carried out against known coding genes (with a range of possible modifications) and possibly known single amino acid variants (SAVs).

Proteogenomics strategy

Slide11

Spectrum matched (incorrectly) to peptide EITALAPSIMK from putative POTEPK gene. The match is nearly perfect.

The same spectrum matched (probably correctly) to actin peptide EITALAPSTMK with a lysine dimethylation. This peptide is identified 63,000 times in PeptideAtlas.
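A short mass calculation shows how two different sequences can compete for the same spectrum. The two peptides differ only by T versus I, and the dimethylated actin peptide is heavier than the unmodified POTEPK peptide by almost exactly one oxygen atom. The slide does not say which modifications the POTEPK match carried, so the oxidised-methionine case below is purely an assumption chosen to illustrate the point. The masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the residues used here.
RESIDUE_MASS = {
    "A": 71.03711, "E": 129.04259, "I": 113.08406, "K": 128.09496,
    "L": 113.08406, "M": 131.04049, "P": 97.05276, "S": 87.03203,
    "T": 101.04768,
}
WATER = 18.01056      # added once per peptide
DIMETHYL = 28.03130   # +C2H4, e.g. on lysine
OXIDATION = 15.99491  # +O, e.g. on methionine (assumed, not from the slide)


def peptide_mass(sequence, modification_delta=0.0):
    """Neutral monoisotopic mass of a peptide plus a total modification delta."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + modification_delta


actin_dimethyl = peptide_mass("EITALAPSTMK", DIMETHYL)
pote_unmodified = peptide_mass("EITALAPSIMK")
pote_oxidised = peptide_mass("EITALAPSIMK", OXIDATION)  # hypothetical state

gap_to_unmodified = actin_dimethyl - pote_unmodified
gap_to_oxidised = actin_dimethyl - pote_oxidised
```

Under that assumption the two candidates are isobaric (isoleucine plus O has the same elemental composition as threonine plus C2H4), so precursor mass alone cannot separate them, and the search space of modifications decides which match is reported.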

Pseudogene detection - PeptideAtlas

Slide12

Dominant isoforms

Slide13

We found evidence of AS in just over 1% of human genes, so 98% of protein coding genes have evidence for just a single isoform

Can we predict this isoform?

Slide14

Five methods for selecting a reference isoform

LONGEST: Standard reference isoform in all databases/large-scale experiments

CCDS: Unique CCDS. CCDS variants are a consensus between RefSeq and Ensembl/GENCODE

RNASEQ: 5-fold dominant transcripts from HBM data (Gonzalez-Porta et al, Gen. Biol. 2013)

APPRIS: Principal isoforms based on structure, function and conservation (Rodriguez et al, NAR, 2012)

HCI: Highest connected isoforms trained on RNA-seq data in Li et al, JPR, 2015

Slide15

98.6%, 97.8%, 77.2%, 77.7%, 78% (chart values; the label for each value is not preserved in the transcript)

Five means of selecting reference isoforms

We calculated % agreement between the main proteomics isoform we found and the five reference methods: the longest sequence, APPRIS principal isoforms, unique CCDS variants, the dominant RNAseq transcripts and the Highest Connected Isoforms

Slide16

For those 3,000+ genes with a main experimental isoform, an APPRIS principal isoform and a unique CCDS variant, all three isoforms agreed for over 99% of the genes.
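Agreement figures of this kind are per-gene comparisons: for every gene that has both a main proteomics isoform and a reference isoform, check whether the two are the same. A minimal sketch follows, with hypothetical gene and isoform identifiers and an assumed dictionary layout. It is not the code used for the published numbers.

```python
def percent_agreement(main_isoform, reference_isoform):
    """Percent of shared genes where the reference picks the main isoform.

    Both arguments map gene id -> isoform id. Genes missing from
    either mapping are left out of the comparison.
    """
    shared = [gene for gene in main_isoform if gene in reference_isoform]
    if not shared:
        raise ValueError("no genes in common")
    agreeing = sum(1 for gene in shared
                   if main_isoform[gene] == reference_isoform[gene])
    return 100.0 * agreeing / len(shared)


# Hypothetical toy mappings, NOT data from the study.
main = {"GENE_A": "iso1", "GENE_B": "iso2", "GENE_C": "iso1", "GENE_D": "iso3"}
principal = {"GENE_A": "iso1", "GENE_B": "iso2", "GENE_C": "iso1", "GENE_D": "iso1"}
longest = {"GENE_A": "iso2", "GENE_B": "iso2", "GENE_C": "iso1", "GENE_D": "iso1",
           "GENE_E": "iso1"}

agreement_principal = percent_agreement(main, principal)
agreement_longest = percent_agreement(main, longest)
```

Restricting the comparison to genes present in both mappings matters, because each reference method covers a different subset of genes.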

The clear agreement between three orthogonal sources (and the large number of tissues sampled) suggests that the main proteomics isoform is the dominant protein isoform in the cell.

Indeed, alternative isoforms (non-APPRIS principal isoforms) "are significantly enriched in amino acid-changing variants, particularly those that have a strong impact on protein function" (Liu et al, Molecular BioSystems, 2015)

Ezkurdia et al, J. Proteome Res, 2015