Presentation Transcript

Slide 1

Some statistical musings

Naomi Altman

Penn State

2015 Dagstuhl Workshop

Slide 2

Some topics that might be interesting

Feature matching across samples and platforms

Preprocessing

number of features >> number of samples

feature screening

replication and possibly other design issues

PCA and relatives

mixture modeling

Slide 3

Feature Matching

e.g. (simple) should we match RNA-seq with a gene expression microarray by “gene” or by “oligo”?

protein MS with RNA-seq or Ribo-seq?

how should we match features such as methylation sites, protein binding regions, SNPs, transcripts and proteins?
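To make the “gene” vs. “oligo” choice concrete, here is a hypothetical sketch in Python/pandas; the tables, column names, and values are invented for illustration, not taken from the talk.

```python
# Hypothetical tables: `array` has one row per probe (oligo) with a gene
# column; `rnaseq` has one row per gene. All names/values are made up.
import pandas as pd

array = pd.DataFrame({"probe": ["p1", "p2", "p3"],
                      "gene":  ["BRCA1", "BRCA1", "TP53"],
                      "signal": [5.2, 4.8, 7.1]})
rnaseq = pd.DataFrame({"gene": ["BRCA1", "TP53"],
                       "counts": [120, 340]})

# Match by "gene": collapse probes to one value per gene, then join.
by_gene = (array.groupby("gene", as_index=False)["signal"].mean()
                .merge(rnaseq, on="gene"))

# Match by "oligo": keep every probe, duplicating the gene-level counts.
by_oligo = array.merge(rnaseq, on="gene")
```

The two joins give tables of different lengths and different implied correlations, which is exactly why the choice of matching unit matters.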

Slide 4

Preprocessing

[Figure omitted: plots showing the concordance of 3 normalizations of the same Affymetrix microarray.]

Dozens of normalization methods are available for each platform.

Matching features across platforms is going to be very dependent on which set of normalizations is selected.
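As a rough illustration of such a concordance check (not from the talk), the sketch below normalizes the same synthetic matrix two ways, quantile normalization versus a simple scale-and-log transform, and correlates the results per sample; the data and method choices are assumptions.

```python
import numpy as np

def quantile_normalize(X):
    """Force every sample (column) of X to share the same distribution."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # rank within column
    mean_quantiles = np.sort(X, axis=0).mean(axis=1)    # average distribution
    return mean_quantiles[ranks]

rng = np.random.default_rng(4)
X = rng.lognormal(size=(1000, 6))      # 1000 features x 6 arrays, synthetic
log_norm = np.log2(X / X.sum(axis=0))  # scale-then-log normalization
qnorm = np.log2(quantile_normalize(X))

# Concordance: correlate the two normalized versions, sample by sample.
print([round(np.corrcoef(log_norm[:, j], qnorm[:, j])[0, 1], 3)
       for j in range(6)])
```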

Slide 5

p >> n

When the number of features >> the number of samples:

correlations of magnitude very close to 1 are common

we can always obtain multiple “perfect” predictors, so selecting “interesting” features is difficult

“extreme” p-values, Bayes factors, etc. become common

singular matrices occur in optimization algorithms
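A minimal pure-noise simulation (not from the slides) makes the first two points concrete: with p >> n, some feature is almost perfectly correlated with a completely unrelated response by chance alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 100_000          # 10 samples, 100,000 features
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)  # response independent of every feature

# Correlation of each feature with y.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

print(np.abs(r).max())  # typically > 0.9 even though nothing is "real"
```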

Slide 6

p >> n

New statistical methods for feature selection such as “sparse” and “sure screening” selectors may be useful.

The idea of “sure screening” selectors is that prescreening brings us down to p < n - 1, but with high probability all the “important” features are still selected (along with others that we will screen out later).
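A rough sketch of that two-stage recipe, assuming marginal correlation as the screening statistic and the lasso (via scikit-learn) as the second-stage sparse selector; the function name and defaults are illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def sis_then_lasso(X, y, d=None):
    """Sure-screening sketch: keep top-d marginal features, then lasso."""
    n, p = X.shape
    d = d or n - 1                      # prescreen down to p < n - 1
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / np.linalg.norm(Xc, axis=0)
    keep = np.argsort(score)[-d:]       # indices of the top-d features
    lasso = LassoCV(cv=5).fit(X[:, keep], y)
    return keep[lasso.coef_ != 0]       # survivors with nonzero weight

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2000))
y = X[:, 0] - 2 * X[:, 1] + rng.standard_normal(40)
print(sis_then_lasso(X, y))  # should retain features 0 and 1
```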

Slide 7

Experimental Design

Randomization, replication and matching enhance our ability to reproduce research.

In particular, replication ensures that the results are not sample specific, while blocking accommodates variability among the samples without swamping the effects.

Multi-omics is best done on single samples measured on multiple platforms.

Technical replication is seldom worth the cost compared to taking more biological replicates.
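A toy simulation, with assumed noise levels, of why technical replicates buy little: the variance of a group mean is sigma_bio^2/n_bio + sigma_tech^2/(n_bio * n_tech), so extra technical replicates shrink only the (typically smaller) measurement-noise term.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_bio, sigma_tech = 1.0, 0.3  # illustrative noise levels

def group_mean_sd(n_bio, n_tech, reps=20_000):
    """Monte Carlo SD of a group mean with given replication structure."""
    bio = rng.normal(0, sigma_bio, (reps, n_bio, 1))
    tech = rng.normal(0, sigma_tech, (reps, n_bio, n_tech))
    return (bio + tech).mean(axis=(1, 2)).std()

print(group_mean_sd(3, 1))  # 3 biological samples, 1 measurement each
print(group_mean_sd(3, 4))  # 4 technical replicates: little improvement
print(group_mean_sd(6, 1))  # double the biological samples: big improvement
```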

Slide 8

Dimension Reduction

PCA (or SVD) has many relatives that can be used to reduce the number of features via projections onto a lower-dimensional space.

The components are often not interpretable.

Many variations are available from both the machine learning and statistics communities.

Machine learning stresses fitting the data.

Statistics stresses fitting the data-generating process.
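For concreteness, a minimal PCA-via-SVD sketch for an n << p matrix; the synthetic data and the interface are illustrative, not anything from the talk.

```python
import numpy as np

def pca_svd(X, k):
    """Project samples-by-features matrix X onto its first k components."""
    Xc = X - X.mean(axis=0)                  # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]                # sample coordinates (n x k)
    explained = s[:k] ** 2 / np.sum(s ** 2)  # variance share per component
    return scores, explained

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 5000))          # n = 10 samples, p = 5000
scores, explained = pca_svd(X, k=2)
```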

Slide 9

Mixture Modeling

In many cases we can think of a sample as a mixture of subpopulations.

We can use the EM algorithm or Bayesian methods to deconvolve the mixture into its components.
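A minimal sketch of the EM route, using scikit-learn's GaussianMixture on a one-dimensional sample drawn from two subpopulations; the component count and parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 300),    # subpopulation A
                    rng.normal(4, 1, 700)])   # subpopulation B

# EM fit of a two-component Gaussian mixture.
gm = GaussianMixture(n_components=2, random_state=0).fit(x.reshape(-1, 1))

print(gm.weights_)  # estimated mixing proportions (~0.3 / 0.7)
print(gm.means_)    # estimated component means (~0 and ~4)
```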

Slide 10

Some other statistical topics already mentioned

missing features (present but not detected) which differ between samples

mis-identified features

do p-values (or FDR estimates) matter?

multiple times; multiple cells; multiple individuals

biological variation vs. measurement noise & error propagation

how can we enhance reproducibility (statistical issues)?

can we fit complex models? should we?

the data are too big for most statistically trained folks

how are we going to train the current and next generation?