/
Probe analysis and data preprocessing Probe analysis and data preprocessing

Probe analysis and data preprocessing - PowerPoint Presentation

cheryl-pisano
cheryl-pisano . @cheryl-pisano
Follow
395 views
Uploaded On 2016-06-01

Probe analysis and data preprocessing - PPT Presentation

Affymetrix Probe level analysis Normalization Constant Loess Rank invariant Quantile normalization Expression measure MAS 40 LIWong dChip MAS 50 RMA Background adjustment PMMM PM only RMA GCRMA ID: 344308

normalization genes expression analysis genes normalization analysis expression array data gene missing lsa level probe imputation intensities log probes rma model cdna

Share:

Link:

Embed:

Download Presentation from below link

Download Presentation The PPT/PDF document "Probe analysis and data preprocessing" is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.


Presentation Transcript

Slide1

Probe analysis and data preprocessing

Affymetrix Probe level analysisNormalizationConstant, Loess, Rank invariant, Quantile normalizationExpression measureMAS 4.0, LI-Wong (dChip), MAS 5.0, RMABackground adjustmentPM-MM, PM only, RMA, GC-RMAStatistical analysis of cDNA arrayImage analysisNormalizationAssess expression level (A case study with Bayesian hierarchical model)Experimental design Source of variations; Calibration and replicate; Choice of reference sample; Design of two-color arrayPreprocessingData transformationFiltering (in all platforms)Missing value imputation (in all platforms)

1Slide2

From experiment to down-stream analysis

2Slide3

Experimental design

Image analysisPreprocessing(Normalization, filtering, MV imputation)Data visualizationIdentify differentially expressed genesRegulatory networkClustering

Classification

Statistical Issues in Microarray Analysis

Pathway

analysis

Integrative analysis & meta-analysis

3Slide4

Data PreprocessingPreliminary analyses extract and summarize information from the microarray experiments.

These steps are irrelevant to biological discovery But are for preparation of meaningful down-stream analyses for targeted biological purposes. (i.e. DE gene detection, classification, pathway analysis…)From scanned images Image analysis (extract intensity values from the images) Probe analysis (generate data matrix of expression profile) Preprocessing (data transformation, gene filtering and missing value imputation)4Slide5

1. Affymetrix probe level analysis

5Slide6

Hybridization

                                                                                                                                       from Affymetrix Inc.Overview of the technology6Slide7

25-mer unique oligo

mismatch in the middle nuclieotidemultiple probes (11~16) for each genefrom Affymetrix Inc.Array Design7Slide8

Background

adjustmentNormalizationSummarizationGive an expression measure for each probe set on each array (how to pool information of 16 probes?)The result will greatly affect subsequent analysis (e.g. clustering and classification). If not modeled properly, => “Garbage in, garbage out”Array Probe Level AnalysisWe will leave the discussion of “backgound adjustment” to the last because there’re more new exciting & technical advances.NormalizationBackground adjustmentSummarization

8Slide9

1.1 Normalization

The need for normalization:Intensities of array 2 is intrinsically larger than array 1. (about two fold)9Slide10

1.1. Normalization

Reason:Different labeling efficiency.Different hybridization time or hybridization condition.Different scanning sensitivity.…..Normalization is needed in any microarray platform. (including Affy & cDNA)10Slide11

Constant scaling

1.1. NormalizationDistributions on each array are scaled to have identical mean.Applied in MAS 4.0 and MAS 5.0 but they perform the scaling after computing expression measure.11Slide12

Constant scaling: Underlying reasoning

1.1. Normalization12Slide13

1.1 Normalization

M-A plotAMM=013Slide14

1.1 Normalization

M-A plot shows the need for non-linear normalization. The normalization factor is a function of the expression level.14Slide15

1.1 Normalization

Fit by ‘Lowess’ function in S-PlusNormalized Log ratio:Replicate arraysThe same pool of sample is applied15Slide16

Non-linear scaling: Underlying reasoning

1.1 Normalizationlog relativeexpression level16Slide17

Suppose we know the green genes are non-differentially expressed genes,

Non-linear scaling: Underlying reasoning (cont’d)1.1 NormalizationThe problem is: we usually don’t know which genes are constantly expressed!!17Slide18

1.1. Normalization

Loess (Yang et al., 2002)Using all genes to fit a non-linear normalization curve at the M-A plot scale. (believe that most genes are constantly expressed)Perform normalization between arrays pairwisely.Has been extended to perform normalization globally without selecting a baseline array but then is time-consuming.18Slide19

1.1. Normalization

Invariant set (dChip)Select a baseline array (default is the one with median average intensity).For each “treatment” array, identify a set of genes that have ranks conserved between the baseline and treatment array. This set of rank-invariant genes are considered non-differentially expressed genes.Each array is normalized against the baseline array by fitting a non-linear normalization curve of invariant-gene set.Tseng et al., 200119Slide20

Advantage:

More robust than fitting with all genes as in loess. Especially when expression distribution in the arrays are very different.Disadvantage:The selection of baseline array is important.Invariant set (dChip)1.1. Normalization20Slide21

1.1. Normalization

Quantile normalization (RMA) (Irizarry2003)Given n array of length p, form X of dimension p×n where each array is a col21umn.2. Sort each column of X to give Xsort.3. Take the means across rows of Xsort and assign this mean to each element in the row to get Xsort.4. Get Xnormalized by rearranging each column of X

sort

to have the same ordering as original

X.

237

283

341

397

401

198

329

335

237

198

329

283

341

335

401

397

217.5

217.5

306

306

338

338

399

399

X

X

sort

X

sort

217.5

306

338

399

399

217.5

306

338

X

normalized

21Slide22

1.1. Normalization

Bolstad, B.M., Irizarry RA, Astrand, M, and Speed, TP (2003), A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance Bioinformatics. 19(2):185-193A careful comparison of different normalization methods and concluded that quantile normalization generally performs the best.22Slide23

1.2. Summarize Expression Index

There’re multiple probes for one gene (11 PM and 11 MM) in U133.How do we summarize the 24 intensity values to a meaningful expression intensity for the target gene? 23Slide24

MAS 4.0

For each probe set, (I: # of arrays, J: # of probes) PMij-MMij= i + ij, i=1,…,I, j=1,…,J i estimated by average differenceNegative expressionNoisy for low expressed genesNot account for probe affinity1.2. Summarize Expression Index24Slide25

dChip (DNA chips)

For each probe set, (I: # of arrays, J: # of probes) PMij=j + ij + ij + ij MMij=j + ij + ij PMij - MMij= ij + ij, i=1,…,I, j=1,…,J j = J, ij ~ N(0, 2)Account for probe affinity effect, j.

Outlier detection through multi-chip analysis

Recommended for more than 10 arrays

Multiplicative model:

PM

ij

- MM

ij

=

i

j

+ 

ij

(better)

Additive model:

PM

ij

- MM

ij

=

i

+ 

j

+ 

ij

1.2. Summarize Expression Index

Li and Wong (PNAS, 2001)

25Slide26

MAS 5.0

For each probe set, (I: # of arrays, J: # of probes) log(PMij-CTij)=log(i)+ij, i=1,…,I, j=1,…,J CTij=MMij if MMij<PMij if MMijPMij i estimated by a robust average (Tukey biweight).No more negative expressionTaking log adjusts for dependence of variance on the mean.less than PMij1.2. Summarize Expression Index26Slide27

RMA (Robust Multi-array Analysis)

For each probe set, (I: # of arrays, J: # of probes) log(T(PMij))= i + j + ij, i=1,…,I, j=1,…,J T is the transformation for background correction and normalization ij ~ N(0, 2)Log-scale additive modelSuggest not to use MMFit the linear model robustly (median polish)1.2. Summarize Expression IndexIrizarry et al. (NAR, 2003)27Slide28

1.25

g1.25g1.25g20g20g20gR2=0.85R2=0.95R2=0.97from Irizarry et al. (NAR, 2003)

Affymetrix Latin square data

28Slide29

from Irizarry et al. (NAR, 2003)

Affymetrix Latin square data29Slide30

Direct subtraction: PM-MM

MAS4.0, dChip, MAS5.0 Assume the following deterministic model: PM=O+N+S (O: optical noise, N: non-specifi binding) MM=O+N => PM-MM=S>0Is it true?1.3. Background Adjustment30Slide31

Yeast sample hybridized to human chip

If MM measures non-specific binding of PM well, PMMM. R2 only 0.5.MM does not measure background noise of PM86 HG-U95A human chips, human blood extractsTwo fork phenomenon at high abundance1/3 of probes have MM>PMMany MM>PM31Slide32

Reasons MM should not be used:

MM contain non-specific binding information but also include signal information and noiseThe non-specific binding mechanism not well-studied.MM is costly (take up half space of the array) Ignore MMdChip has an option for PM-only modelIn general, PM-only is preferred for both dChip or RMA methods.1.3. Background Adjustment32Slide33

95% of (MM>PM) have purine (A, G) in the middle base.

In the current protocol, only pyrimidines (C, T) have biotin-labeled florescence.Consider sequence informationNaef & Magnasco, 20031.3. Background Adjustment33Slide34

Fit a simple linear model:

C > G  T > ABoundary effect1.3. Background AdjustmentNaef & Magnasco, 200334Slide35

PM

CGTA

MM

G

C

A

T

labeling

Yes (+)

No (-)

Yes (+)

No(-)

Labeling impedes binding

Yes (-)

No

Yes (-)

No

Hydrogen bonds

3 (+)

3 (+)

2

2

Sequence specific brightness

High

average

average

Low

Some chemical explanation of the result:

1.3. Background Adjustment

See next page

35Slide36

From: Lodish et al. Fig 4-4

Double strandRemember from the first lecture:G-C has three hydrogen bonds. (stronger)A-T has two hydrogen bonds. (weaker)1.3. Background Adjustment36Slide37

GC-RMA h: a smooth (almost linear) function.: the sequence information weight computed form the simple linear model.O: optical noise, log-normal dist.N: non-specific binding1.3. Background AdjustmentWu et al., 2004 JASA37Slide38

AccuracyIn well-controlled experiment with spike-in genes (such as Latin Square data), accuracy of estimated log-fold changes compared to the underlying true log-fold changes are concerned.

(only available in data with spike-in genes)PrecisionIn data with replicates, the reproducibility (SD) of the same gene in replicates is concerned.(available in data with replicates)Criterion to compare diff. methods38Slide39

39Slide40

40Slide41

GC-RMA

41Slide42

Fee

GUIFlexibility to programming and mining

Audience

MAS 4.0

Commercial

Yes

No

Average Difference

Manufacturer default

dChip

Free

Yes

Some extra tools

Li-Wong model

Biologists

MAS 5.0

Commercial

Yes

No

Robust average of log difference

Manufacturer default

RMAExpress

Free

Yes

No

RMA

Biologists

Bioconductor

Free

Some

Best

All of above

Statistician, programmer

ArrayAssist

Commercial

Yes

No

RMA, GC-RMA

Biologists

42Slide43

Background

MethodsNormalizationMethodsPM correctionMethods

Summarization

Methods

none

rma/rma2

mas

quantiles

loess

contrasts

constant

invariantset

qspline

mas

pmonly

subtractmm

avgdiff

liwong

mas

medianpolish

playerout

Probe level analysis in Bioconductor

(affy package)

43Slide44

A Simple Case Study

http://www.affymetrix.com/analysis/download_center2.affxLatin Square Data59 HG-U95A arrays14 spike-in genes in 14 experimental groupsM, N, O, P are replicates and Q, R, S, T another replicates44Slide45

M

1521m99hpp_av06.CEL1521q99hpp_av06.CELQN

1521n99hpp_av06.CEL

1521r99hpp_av06.CEL

R

O

1521o99hpp_av06.CEL

1521s99hpp_av06.CEL

S

P

1521p99hpp_av06.CEL

1521t99hpp_av06.CEL

T

Take the following two replicate groups.

Use Bioconducotr to perform a simple evaluation of different probe analysis algorithms.

Note: This is only a simple demonstration. The evaluation result in this presentation is not conclusive.

A Simple Case Study

45Slide46

Average log intensities vs SD log intensities. (M, N, O, P)

A Simple Case StudyMAS5.0dChip(PM only)dChip(PM/MM)RMAGC-RMA(PM/MM)GC-RMA(PM only)46Slide47

A Simple Case Study

Average log intensities vs SD log intensities. (Q, R, S, T)MAS5.0dChip(PM only)dChip(PM/MM)RMAGC-RMA(PM/MM)GC-RMA(PM only)47Slide48

M, N, O, P

Q, R, S, TMAS50.8930

0.9002

dChip (PM/MM)

0.9604

0.9621

dChip (PM-only)

0.9940

0.9966

RMA

0.9978

0.9978

GC-RMA(PM/MM)

0.9988

0.9990

GC-RMA(PM-only)

0.9993

0.9994

Average pair-wise correlations

between replicates

A Simple Case Study

Replicate correlation performance:

GCRMA(PM-only)

>GC-RMA(PM/MM)> 

RMA

>  dChip(PM-only)

>>dChip(PM/MM)>>MAS5

48Slide49

A Simple Case Study

RMA greatly improves dChip(PM/MM) but dChip(PM-only) model generally seems a little better than RMA.Average replicate correlations of RMA (0.9978) is a little better than dChip(PM only) (0.9940 & 0.9966)dChip(PM only) suffers from a number of outlying genes in the model.Outlying genes that do not fit Li-Wong model49Slide50

Conclusion:

Technological advances have been made to have smaller probe size and better sequence selection algorithms to reduce # of probes in a probe set. This will enable more biologically meaningful genes on a slide and reduce the cost.Recent analysis advances have been focused on understanding and modelling hybridization mechanisms. This will allow a better use of MM probes or eventually suggest to remove MMs from the array.The probe analysis is relatively settled in the field. In the second lab session next Friday, we will introduce dChip and RMAexpress for Affymetrix probe analysis.50Slide51

2. cDNA probe level analysis

51Slide52

From Y. Chen

et al. (1997) cDNA Microarray Review52Slide53

48 grids in a 12x4 pattern.

Each grid has 12x16 features.Total 9216 features.Each pin prints 3 grids.Probe (array) printingcDNA Microarray Review53Slide54

Probe design and printing

cDNA Microarray Review54Slide55

From Y. Chen

et al. (1997) cDNA Microarray Review55Slide56

cDNA

GeneChipProbe preparationProbes are cDNA fragments, usually amplified by PCR and spotted by robot.

Probes are short oligos synthesized using a photolithographic approach.

colors

Two-color

(measures relative intensity)

One-color

(measures absolute intensity)

Gene representation

One probe per gene

11-16 probe pairs per gene

Probe length

Long, varying lengths

(hundreds to 1K bp)

25-mers

Density

Maximum of ~15000 probes.

38500 genes * 11 probes = 423500 probes

Comparison of cDNA array and GeneChip

cDNA Microarray Review

56Slide57

Advantage and disadvantage of

cDNA array and GeneChipcDNA microarrayAffymetrix GeneChipThe data can be noisy and with variable qualitySpecific and sensitive. Result very reproducible.Cross(non-specific) hybridization can often happen.

Hybridization more specific.

May need a RNA amplification procedure.

Can use small amount of RNA.

More difficulty in image analysis.

Image analysis and intensity extraction is easier.

Need to search the database for gene annotation.

More widely used. Better quality of gene annotation.

Cheap. (both initial cost and per slide cost)

Expensive (~$400 per array+labeling and hybridization)

Can be custom made for special species.

Only several popular species are available

Do not need to know the exact DNA sequence.

Need the DNA sequence for probe selection.

57Slide58

Identify spot area

: Each spot contains around 100100 pixels. Spot image may not be uniformly and roundly distributed. Some software (like ScanAlyze or ImaGene) have algorithms to “help” placing the grids and identify spot and background area locally.Still semi-automatic: a very time-consuming job.Extract intensities (data reduction) : Aim to extract the minimum most informative statistics for further analysis. Usually use the median signal minus the median background. Some spot quality indexes (e.g. Stdev or CV) will be computed. 2.1. Image Analysis58Slide59

ScanAlyze

2.1. Image Analysis59Slide60

Input the number of rows and columns in each sector; input the approximate location and distances between spots.

May need to tilt the gridsSome local adjustments may be needed.Once the spot grids are close enough to the real spot physical location, computer image algorithms will help to find the optimal spot area (spherical or irregular shapes) and background area.May take 10~30 minutes for an array. Usually the biologists will do it.2.1. Image Analysis60Slide61

http://www.techfak.uni-bielefeld.de/ags/ai/projects/microarray/

2.1. Image Analysis61Slide62

Result file

from image analysisSummarized intensities for further analysis: median(spot intensities)-median(background intensities)2.1. Image Analysis62Slide63

AffymetrixNormalization done across arrays

After normalization, the expression data matrix shows absolute expression intensities.cDNANormalization between two colors in an array.After normalization, the expression data matrix shows comparative expression intensities (log-ratios).2.2. Normalization63Slide64

Calibration

: apply the same samples on both dyes (E. Coli grown in glucose). Purple and orange represent two replicate slides.2.2. Normalization Same sample on both dyes. Each point is a gene. Orange is one array and purple is another array.64Slide65

Normalization:

General idea:Dye effect : Cy5 is usually more bleached than Cy3. Slide effect The normalization factor is slide dependent. Usually need to assume that most genes are not differentially expressed or up- and down-regulated genes roughly cancel out the expression effect.2.2. Normalization65Slide66

Normalization:

Current popular methods:House-keeping genes : Select a set of non-differentially expressed genes according to experiences. Then use these genes to normalize. Constant normalization factor : Use mean or median of each dye to normalize.ANOVA model (Churchill’s group)Average-intensity-dependent normalization:Robust nonlinear regression(Lowess) applied on whole genome. (Speed’s group)Select invariant genes computationally (rank-invariant method). Then apply Lowess. (Wong’s group)2.2. Normalization66Slide67

Loess Normalization:

Pin-wise normalization using all the genes. It requires the assumption that up- and down-regulated genes with similar average intensities (denoted A) are roughly cancelled out or otherwise most genes remain unchanged. AMFrom Dudoit et al. 20002.2. Normalization67Slide68

Rank Invariant Normalization:

Rank-invariant method (Schadt et al. 2001, Tseng et al. 2001):Idea: If a particular gene is up- or down- regulated, then its Cy5 rank among whole genome will significantly different from Cy3 rank. Iterative selection helps to select a more conserved invariant set when number of genes is large.2.2. Normalization68Slide69

Rank Invariant Normalization:

Blue points are invariant genes selected by rank-invariant method. Red curves are estimated by Lowess and extrapolation. Data: E. Coli. Chip, ~4000 genes, from Liao lab.2.2. Normalization69Slide70

Data Truncation

Data TruncationIn cDNA microarry, the intensity value is between 0~216=65536.Measurement of low intensity genes are not stable. Extremely highly expressed genes can saturate.For example, we can truncate genes with intensity smaller than 200 or larger than 65000.70Slide71

Approaches to assess expression level:

Single slide:Normal model (Chen et al. 1997)Gamma model with empirical Bayes approach (Newton et al. 2001)With replicate slides:Traditional t-test.ANOVA model (Kerr et al. 2000)Permutation t-test (Dudoit et al. 2000)Hierarchical structure:Linear hierarchical model (Tseng et al. 2001)2.3. Assess Expression Level71Slide72

Single slide analysis:

2.3. Assess Expression Level72Slide73

Case study: (Tseng et al. 2001

)125-gene project: each gene is spotted four timesCalibration: E. Coli grown in acetate v.s. actate C1S1~2 E. Coli grown in glucose v.s. glucose C2S1~4, C3S1~2, C4S1~3Comparative: E. Coli grown in acetate v.s. glucose R1S1~2, R2S1~24129-gene project: each gene is singly spottedCalibration: E. Coli grown in acetate v.s. actate C1S1~2, C2S1~2Comparative: E. Coli grown in acetate v.s. glucose R1S1~2, R2S1~22.3. Assess Expression Level73Slide74

Reversed transcription

& labelingHybridize onto different slidesHierarchical structure in experiment design2.3. Assess Expression Level74Slide75

2.3. Assess Expression Level

75Slide76

2.3. Assess Expression Level

Basics of Bayesian AnalysisMeaning how much we can say about  given the data 76Slide77

Baysian Hierarchical Model

2.3. Assess Expression Level

77Slide78

2.3. Assess Expression Level

78Slide79

How to specify the prior?

Empirical Bayes:2.3. Assess Expression Level79Slide80

Another version of EB:

2.3. Assess Expression Level80Slide81

((0.75, 0.67), (0.45, 0.51))

Getting prior distribution: (when we have calibration experiments)Calibration(normal vs normal)

Comparative

(cancer vs normal)

2.3. Assess Expression Level

81Slide82

1. Compute

2. 3. 4. 5. MCMC for hierarchical model:2.3. Assess Expression Level82Slide83

2.3. Assess Expression Level

95% probability interval of the posterior distribution of the underlying expression level.83Slide84

2.4. Experimental design

Biological variationTechnical variations: Amplification Labeling Hybridization Pin effect Scanning84Slide85

Calibration:

Use the same sample on both dyes for hybridization.Calibration experiments help to validate experiment quality and gene-specific variability.2.4.1 Calibration and replicateComparative:Tumor vs RefCalibration:Ref vs Ref85Slide86

Replicates: (replicate spots, slides)

Multiple-spotting helps to identify local contaminated spots but will reduce number of genes in the study.Multi-stage strategy: Use single-spotting to include as many genes as possible for pilot study. Identify a subset of interesting genes and then use multiple-spotting.Replicate spots and slides help to verify reproducibility on the spot and slide level.2.4.2. Calibration and replicate86Slide87

2.4.2. Calibration and replicate

Biological replicateFrom Yang, UCSF87Slide88

Technical replicate

2.4.2. Calibration and replicateFrom Yang, UCSF88Slide89

(iii) Reverse labelling:

Advantage:Cancel out linear normalization scaling and simplifies the analysis. However, the linear assumption is often not true.Help to cancel out gene-label interactions if it exists. Sample ASample B2.4.2. Calibration and replicate89Slide90

Different choices of reference sample:

Normal patient or time 0 sample in time course studyPool all samples or all normal samplesEmbryonic cellsCommercial kit2.4.3. Choice of reference sampleIdeally we want all genes expressed at a constant moderate level in reference sample.90Slide91

2.4.4. Design issue

From Yang, UCSF91Slide92

Design issues:

Reference designLoop designBalance design2.4.4. Design issueReference sample is redundantly measured many times.92Slide93

(c)

v samples with v+2 experimentsv samples with 2v experimentsSee Kerr et al. 200193Slide94

Conclusion of cDNA array

Affymetrix GeneChip is more preferred if available.Unlike GeneChip, cDNA array data is usually more noisy and careful quality control (replicates and calibration) is important. But occasionally custom arrays are needed for some specific research.Analysis of cDNA microarray is also applicable to other two-color technology such as array CGH and similar two-color oligo arrays.Conservative “Reference design” is usually more robust although it’s not statistically most efficient.94Slide95

3. Data preprocessing

95Slide96

3.1. Data Truncation and Transformation

TransformationLogarithmic transformation (most commonly used)-- tend to get an approximately normal distribution-- should avoid negative or 0 intensity before transformationSquare root transformation-- a variance-stabilizing transformation under Poisson model.Box-Cox transformation familyAffine transformationGeneralized-log transformationDetails see chapter 6.1 in Lee’s book; Log10 or Log2 transformation is the most common practice.96Slide97

3.2. Filtering

Filter is an important step in microarray analysis:Without filtering, many genes are irrelevant to the biological investigation and will add noise to the analysis. (among ~30,000 genes in the human genome, usually only around 6000 genes are expressed and varied in the experiment)But filtering out too many genes will run the risk to eliminate important biomarkers.Three common aspects of filtering:Genes of bad experimental quality.Genes that are not expressedGenes that do no fluctuate across experiments.97Slide98

3.2. Filtering

Filter out genes with bad quality in cDNA array: Outputs from imaging analysis usually have a quality index or flag to identify genes with bad quality image.Three common sources of bad quality probes:Problematic probes: probes with non-uniform intensities.Low-intensity probes: genes with low intensities are known to have bad reproducibility and hard to verify by RT-PCR. Normally genes with intensities less than 100 or 200 are filtered.Saturated probes: genes with intensities reaching scanner limit (saturation) should also be filtered.For Affymetrix and other platforms, each probe (set) also has a detection p-value, quality flag or present/absent call.98Slide99

3.2. Filtering

Filtering by quality index: different array platform and image analysis have different formatlow intensity99Slide100

Filtering by quality index:

3.2. FilteringArray 1Array 2Array SArray 1

Array 2

Array S

Gene 1

NA

Gene 2

49.4225

Gene 3

58.7938

Gene 4

196.236

Gene 5

146.344

Gene 6

93.5549

:

:

:

:

Gene G-2

768.63

Gene G-1

30.3535

Gene G

15.9003

342.061

267.247

72.2798

54.2583

69.6987

73.8338

163.73

197.419

136.412

140.536

131.405

96.128

:

:

:

:

763.445

936.445

NA

34.7477

12.5406

13.648

NA: not applicable

Missing values due to bad quality, low or saturated intensities

100Slide101

3.2. Filtering

Filter genes with low information content:Small standard deviation (stdev)Small coefficient of variation (CV: stdev/mean)stdev=6.45CV=0.29stdev=6.45CV=0.05325153020125115130120

Note: CV is more reasonable for original intensities. But for log-transformed intensities,

stdev

is enough

Why?

101Slide102

A simple gene filtering routine (I usually use) before down-stream analyses:Take log (base 2) transformation.

Delete genes with more than 20% missing values among all samples.Delete genes with average expression level less than, say α=7 (27=128). For Affymetrix and most other platforms, intensities less than 100-200 are simply noises.Delete genes with standard deviation smaller than, say β=0.4 (20.4=1.32, i.e. 32% fold change). Might adjust β so that the number of remaining genes are computationally manageable in downstream analysis. (e.g. around ~5000 genes)3.2. FilteringGene filtering102Slide103

3.2. Filtering

Sample filtering (detecting problematic slides)Compute correlation matrix of the samples:Arrays of replicates should have high correlation. (m,n,o,p are replicates and q,r,s,t are another set of replicates)A problematic array is often found to have low correlation with all the other arrays.Heatmap is usually plotted for better visualization.103Slide104

m,n,o,p

q,r,s,tWhite: high correlationDark gray: low correlation3.2. FilteringDiagnostic plot by correlation matrix104Slide105

3.3. Missing Value Imputation

Reasons of missing values in microarray: spotting problems (cDNA) dust finger prints poor hybridization inadequate resolution fabrication errors (e.g. scratches) image corruptionMany down-stream analysis require a complete data.“Imputation” is usually helpful.105Slide106

Array 1

Array 2Array S

Gene 1

NA

Gene 2

49.4225

Gene 3

58.7938

Gene 4

196.236

Gene 5

146.344

Gene 6

93.5549

:

:

:

:

Gene G-2

768.63

Gene G-1

30.3535

Gene G

15.9003

342.061

267.247

72.2798

54.2583

69.6987

73.8338

163.73

197.419

136.412

140.536

131.405

96.128

:

:

:

:

763.445

936.445

NA

34.7477

12.5406

13.648

It is common to have ~5% MVs in a study.

5000(genes)

50(arrays) 5%=12,500

3.3. Missing Value Imputation

106Slide107

Na

ïve approachesMissing values = 0 or 1 (arbitrary signal)missing values = row (gene) averageSmarter approaches have been proposed:K-nearest neighbors (KNN)Regression-based methods (OLS)Singular value decomposition (SVD)Local SVD (LSVD)Partial least square (PLS) More (Bayesian Principal Component Analysis, Least Square Adaptive, Local Lease Square)Assumption behind: Genes work cooperatively in groups. Genes with similar pattern will provide information in MV imputation.3.3. Missing Value ImputationExisting methods107Slide108

Arrays

Expression

?

randomly missing datum

choose

k

genes that are most

similar

to the gene with the missing value (MV)

estimate MV as the weighted mean of the neighbors

considerations:

number of neighbors (

k

)

distance metric

normalization step

3.3. Missing Value Imputation

KNN.e & KNN.c

108Slide109

parameter k

10 usually works (5-15)distance metriceuclidean distance (KNN.e)correlation-based distance (KNN.c)normalization?not necessary for euclidean neighborsrequired for correlation neighborsArraysExpression

?

3.3. Missing Value Imputation

KNN.e & KNN.c

109Slide110

regression-based approachKNN+OLS

algorithm:choose k neighbors (euclidean or correlation; normalize or not)the gene with the MV is regressed over the neighbor genes (one at a time, i.e. simple regression)for each neighbor, MV is predicted from the regression modelMV is imputed as the weighed average of the k predictions3.3. Missing Value ImputationOLS.e & OLS.c110Slide111

Arrays

Expression

?

randomly missing datum

y

1

=

a

1

+

b

1

x

1

y

2

=

a

2

+

b

2

x

2

y

= w

1

y

1

+ w

2

y

2

3.3. Missing Value Imputation

OLS.e & OLS.c

111Slide112

Algorithmset MVs to row average (need a starting point)

decompose expression matrix in orthogonal components, “eigengenes”.use the proportion, p, of eigengenes corresponding to largest eigenvalues to reconstruct the MVs from the original matrix (i.e. improve your estimate)use EM approach to iteratively imporove estimates of MVs until convergenceAssumption:The complete expression matrix can be well-decomposed by a smaller number of principle components.3.3. Missing Value ImputationSVD112Slide113

KNN+SVDchoose k neighbors (euclidean or correlation; normalize or not)

Perform SVD on the k nearest neighbors and get a prediction of the missing value.3.3. Missing Value ImputationLSVD.e & LSVD.c113Slide114

PLS: Select linear combinations of genes (PLS components) exhibiting high covariance with the gene having the MV.

The first linear combination of genes has the highest correlation with the target gene.The second linear combination of genes had the greatest correlation with the target gene in the orthogonal space of the first linear combination. MVs are then imputed by regressing the target gene onto the PLS components 3.3. Missing Value ImputationPLS114Slide115

3.3. Missing Value Imputation

Types of missing mechanism:Missing completely at random (MCAR)Missingness is independent of the observed values and their own unobserved values. Spot missing due to mis-printing or dust particle. Spot missing due to scratches.Missing at random (MAR)Missingness is independent of the unobserved data but depend on the observed data.Missing not at random (MNAR)MIssingness is dependent on the unobserved data1. Spots missing due to saturation or low expression.Currently imputation methods only work for MCAR, not MNAR. 115Slide116

Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes

Guy N. Brock1, John R. Shaffer2, Richard E. Blakesley3, Meredith J. Lotz3, George C. Tseng2,3,4§BMC Bioinformatics, 2008116Slide117

Data set

Full Dim.Used Dim.CategoryOrganism

Expression Profiles

Alizadeh (ALI)

13412 x 40

5635 x 40

multiple exposure

H. sapiens

diffuse large B-cell lymphoma

Alon (ALO)

2000 x 62

2000 x 62

multiple exposure

H. sapiens

colon cancer and normal colon tissue

Baldwin (BAL)

16814 x 39

6838 x 39

time series, non-cyclic

H. sapiens

epithelial cellular response to L. monocytogenes

Causton (CAU)

4682 x 45

4616 x 45

multiple exposure x time series

S. cerevisiae

response to changes in extracellular environment

Gasch (GAS)

6152 x 174

2986 x 155

multiple exposure x time series

S. cerevisiae

cellular response to DNA-damaging adgents

Golub (GOL)

7129 x 72

1994 x 72

multiple exposure

H. sapiens

acute lymphoblastic leukemia

Ross (ROS)

9706 x 60

2266 x 60

multiple exposure

H. sapiens

NCI60 cancer cell lines

Spellman, AFA (SP.AFA)

7681 x 18

4480 x 18

time series, cyclic

S. cerevisiae

cell-cycle genes

Spellman, ELU (SP.ELU)

7681 x 14

5766 x 14

time series, cyclic

S. cerevisiae

cell-cycle genes

9 data sets: multiple exposure, time series or both

7 methods were compared: KNN, OLS, LSA, LLS, PLS, SVD, BPCA

3.3. MV imputation comparative study

117Slide118

Global-based methods (PLS, SVD, BPCA): Estimate the global structure of the data to impute MV.

Neighbor-based methods (KNN, OLS, LSA, LLS): Borrow information from correlated genes (neighbors).Intuitively global-based methods require that dimension reduction of the data can be effectively performed.We define an entropy measure for a given data D to determine how well the dimension reduction of the data can be done: (i are the eigenvalues).Entropy low: the first few eigenvalues dominate and the data can be reduced to low-dimension effectively.3.3. MV imputation comparative study118Slide119

LRMSE is the performance measure, the lower the better.

KNN, OLS, LSA, LLS are neighbor-based methods and work better in low-entropy data sets.PLS and SVD are global-based methods and work better in high-entropy data sets.3.3. MV imputation comparative study119Slide120

Simulation II

Simulation III

Data set

Entropy

Optimal

EBS

Accuracy

Optimal

STS

Accuracy

BAL

0.819

LSA (38), LLS (12)

LSA (50)

76%

LSA (9), LLS (1)

LSA (10)

90%

CAU

0.838

LLS (45), LSA (5)

LSA (50)

10%

LLS (10)

LSA (10)

0%

ALO

0.872

LSA (50)

LSA (50)

100%

LSA (10)

LSA (10)

100%

GOL

0.876

LSA (50)

LSA (50)

100%

LSA (10)

LSA (10)

100%

SP.ELU

0.909

LLS (41), BPCA (9)

LSA (50)

0%

LLS (10)

BPCA (10)

0%

GAS

0.911

LSA (50)

LSA (50)

100%

LSA (10)

LSA (10)

100%

SP.AFA

0.94

LLS (40), BPCA (10)

LSA (50)

0%

LLS (9), BPCA (1)

BPCA (10)

10%

ROS

0.944

LSA (50)

LSA (50)

100%

LSA (10)

LSA (10)

100%

ALI

0.947

LSA (50)

LSA (50)

100%

LSA (10)

LSA (10)

100%

Overall

65%

Overall

67%

Three methods (LSA, LLS, BPCA) performed best but none dominated.

Performed two selection schemes (entropy-based scheme and self-training scheme) to select the best imputation method.

3.3. MV imputation comparative study

120