
DATASET DESCRIPTION PCA - PowerPoint Presentation

giovanna-bartolotta
Uploaded On 2016-06-23



Presentation Transcript

Slide1

DATASET DESCRIPTION

Dataset #1
RNA-Seq of neural cells (MiSeq) [2]
65 cells
Ground truth clusters: Group I (Neural Progenitors), Group II (Radial Glia), Group III (Newborn Neurons), Group IV (Maturing Neurons)
Feature selection: Top 500 genes from [2]
Best parameters: K = 4; d = 1.4; metric = euclidean; method = complete; I = 200; S = 200; n = 5; r = 0.7; m = 0.5

Dataset #2
RNA-Seq of neural cells (HiSeq) [2]
65 cells
Ground truth clusters: Group I (Neural Progenitors), Group II (Radial Glia), Group III (Newborn Neurons), Group IV (Maturing Neurons)
Feature selection: Top 500 genes from [2]
Best parameters: K = 4; d = 1.4; metric = cityblock; method = complete; I = 200; S = 200; n = 5; r = 0.4; m = 0.4

Dataset #3
RNA-Seq of mouse somatosensory cortex and hippocampal CA1 cells [6]
3005 cells
Ground truth clusters: Astrocytes_ependymal, Endothelial-mural, Interneurons, Microglia, Oligodendrocytes, Pyramidal CA1, Pyramidal SS, and 47 subtypes
Feature selection: Top 500 genes using ceftools published by [6]
Best parameters: K = 47; d = 1.2; metric = seuclidean; method = ward; I = 200; S = 200

Dataset #4
qPCR of mouse hematopoietic system [1]
327 cells
Ground truth clusters:
    HSC (Hematopoietic stem cells)
    CMP (Common myeloid progenitors)
    GMP (Granulocyte/monocyte progenitors)
    MEP (Megakaryocyte/erythroid progenitors)
    CLP (Common lymphoid progenitors)
    MPP (Multipotent progenitor cells)
Feature selection: Top 280 genes from [1]
Best parameters: K = 6; d = 1.2; metric = correlation; method = complete; I = 500; S = 500; n = 5; r = 0.6; m = 0.7

Dataset #5
RNA-Seq of mouse distal lung epithelium [4]
80 cells
Ground truth clusters: Clara (Scgb1a1), Ciliated (Foxj1), AT1 (Pdpn, Ager), AT2 (Sftpc, Sftpb), BP (alveolar bipotential progenitor)
Number of genes: 8,578
Feature selection: Top 8,578 genes from [4], then the first 10 PCs after applying PCA
Best parameters: K = 5; d = 1.2; metric = correlation; method = complete; I = 500; S = 500; n = 5; r = 0.7; m = 0.9

[Figure panel: PCA RESULTS]
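Dataset #5 keeps the first 10 principal components after PCA. The sketch below illustrates that projection step under stated assumptions: PCA is taken to mean projecting mean-centered data onto the leading eigenvectors of the sample covariance matrix, found here by power iteration with deflation. The function names (`center`, `covariance`, `top_component`, `pca_scores`) are hypothetical and do not come from the poster, and this pure-Python version is only practical for small matrices; a numerical library would be the usual choice for thousands of genes.

```python
import math
import random


def center(rows):
    """Subtract the per-column (per-gene) mean from every row (cell)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (p x p) of already-centered rows."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def mat_vec(m, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_component(cov, iters=500, seed=0):
    """Leading eigenvalue and unit eigenvector of cov via power iteration."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in cov]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance left in any direction
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return eigenvalue, v


def pca_scores(rows, n_components):
    """Project each row onto the first n_components principal components."""
    x = center(rows)
    cov = covariance(x)
    components = []
    for _ in range(n_components):
        lam, v = top_component(cov)
        components.append(v)
        # Deflation: remove the variance explained by the component just found.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    return [[sum(a * b for a, b in zip(row, v)) for v in components]
            for row in x]
```

For the setting described above, the call would be `pca_scores(expression_rows, 10)`, where `expression_rows` is a hypothetical cells-by-genes list of lists; the resulting score rows would then be passed to the clustering method.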

University of Connecticut

School of Engineering

METHODS

We selected five clustering methods (K-means, fuzzy c-means, hierarchical clustering, EM clustering, and SNN-Cliq [5]) as tools to reveal cell-type heterogeneity in six different datasets. The last of these, SNN-Cliq, was developed recently for clustering RNA-seq data. Some datasets come with their own rules for reducing the number of features. Although these rules already reduce dimensionality considerably, applying a feature extraction method such as PCA sometimes further improves the final performance of the algorithms. Each algorithm has parameters, which are listed in the table below. Results are reported for the best parameter setting found for each dataset.

Algorithm: Parameters

K-means
    K = number of clusters

Fuzzy c-means clustering (FCM)
    K = number of clusters
    d = degree of fuzziness

Hierarchical clustering (HCS)
    Metric = euclidean, seuclidean, cityblock, minkowski, chebychev, cosine, correlation, spearman
    Method = average, centroid, complete, median, single

EM clustering
    K = number of clusters
    S = number of initial seeds
    I = number of iterations

SNN-Cliq
    n = size of the nearest neighbor list
    r = density threshold of quasi-cliques
    m = threshold on the overlapping rate for merging
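To make the Metric and Method parameters of hierarchical clustering concrete, here is a minimal pure-Python sketch of agglomerative clustering with complete linkage, cut at K clusters. It is an illustration only, not the implementation used for the reported results: the function name `complete_linkage` and the toy points are hypothetical, and only two of the listed metrics (euclidean and cityblock) are included.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cityblock(a, b):
    """Sum of absolute coordinate differences (Manhattan distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


METRICS = {"euclidean": euclidean, "cityblock": cityblock}


def complete_linkage(points, k, metric="euclidean"):
    """Merge clusters bottom-up until k remain; return one label per point."""
    if k < 1:
        raise ValueError("k must be at least 1")
    dist = METRICS[metric]
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest
                # pairwise distance between their members.
                d = max(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

Switching `metric` changes how individual cells are compared, while the linkage rule (here fixed to complete) decides how distances between whole clusters are derived from those comparisons.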

CONCLUSION & FUTURE WORK

HC and model-based clustering methods (EM) performed well on most datasets; the other clustering methods had less consistent performance.

Best method and best parameters depend on dataset.

A limitation of most current methods is that they do not model the noise in the expression level estimates. Density-based clustering may be a good way of handling noise.

An additional advantage of model-based methods is that they can incorporate prior knowledge in the inference process; the value of incorporating such prior knowledge is currently under evaluation.

Further increases in accuracy may benefit from time-series data such as SCUBA [7].

With increased number of cells, scalability becomes a concern. In ongoing work we will explore scalable alignment-free clustering methods.

EVALUATION METRICS

Seven measures are used to evaluate the clustering algorithms; their definitions are given below.

INTRODUCTION

Recent RNA-seq technologies have facilitated the generation of single-cell transcriptome data. However, tools and computational strategies that work for bulk cell-population RNA-seq data cannot always be applied successfully to the study of gene expression at the single-cell level, so new analysis tools are needed. In this work, we look at the application of clustering methods to different RNA-seq datasets, in particular the identification of heterogeneous cell types from single-cell transcriptomes. Previous studies have clustered multiple tissue samples by their bulk expression profiles. Clustering methods can also be used to identify hidden tissue heterogeneity on the basis of expression profiles. There are two groups of clustering methods in this context [3]; the choice between them depends mainly on the availability of prior knowledge or expectations regarding the relationships between cells. Gene expression profiles are the input to a clustering method. Most of the time, the values are measured or estimated imprecisely because of biological or technical noise, which strongly affects the accuracy of the clustering result. We applied different clustering methods to different datasets and evaluated how well they handle noisy, high-dimensional data.

Application of Clustering to Identify Cell Types from Single-Cell mRNA Expression Data

Elham Sherafat, Ion Mandoiu

Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269

REFERENCES

[1] Guo, Guoji, et al. "Mapping cellular hierarchy by single-cell analysis of the cell surface repertoire." Cell Stem Cell 13.4 (2013): 492-505.
[2] Pollen, Alex A., et al. "Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex." Nature Biotechnology (2014).
[3] Stegle, Oliver, et al. "Computational and analytical challenges in single-cell transcriptomics." Nature Reviews Genetics 16.3 (2015): 133-145.
[4] Treutlein, B., et al. "Reconstructing lineage hierarchies of the distal lung epithelium using single-cell RNA-seq." Nature 509 (2014): 371-375.
[5] Xu, Chen, and Zhengchang Su. "Identification of cell types from single-cell transcriptomes using a novel clustering method." Bioinformatics (2015): btv088.
[6] Zeisel, Amit, et al. "Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq." Science 347.6226 (2015): 1138-1142.
[7] Marco, Eugenio, et al. "Bifurcation analysis of single-cell gene expression data reveals epigenetic landscape." Proceedings of the National Academy of Sciences 111.52 (2014): E5643-E5650.

Purity
U: set of ground truth classes; V: set of the computed clusters; N: total number of objects in the dataset

Adjusted Rand Index (AR)

Rand Index (RI)
RI = (TP + TN) / (TP + FP + FN + TN)

F1 Score
F1 Score = 2×TP / (2×TP + FP + FN)

Mirkin's index (MI)
It counts the number of disagreements in data pairs between two clusterings. It is the ratio of the number of disagreeing pairs to the total number of pairs. A lower value of Mirkin's index indicates better clustering.

Hubert's index (HI)
HI = RI - MI

Corr
Maximum weighted Pearson correlation between the average expression value of each class in the ground truth and in the computed clusters
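The pair-based definitions above can be sketched in a few lines of Python. The sketch assumes the usual pair-counting convention, which the poster does not spell out: a pair of cells is a TP if it shares both a ground truth class and a computed cluster, an FP if it shares only the cluster, an FN if it shares only the class, and a TN otherwise. Mirkin's index is computed as stated, as the fraction of disagreeing pairs (FP + FN). The purity function uses the standard definition (for each computed cluster in V, count its most frequent class from U, sum, and divide by N), which is also an assumption because the poster's formula is not in the transcript. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(truth, pred):
    """Count TP, FP, FN, TN over all unordered pairs of objects."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_class:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def pair_metrics(truth, pred):
    """Rand index, F1 score, Mirkin's index and Hubert's index."""
    tp, fp, fn, tn = pair_counts(truth, pred)
    total = tp + fp + fn + tn
    ri = (tp + tn) / total
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # guard: no co-clustered pairs at all
    mi = (fp + fn) / total                 # ratio of disagreeing pairs
    return {"RI": ri, "F1": f1, "MI": mi, "HI": ri - mi}


def purity(truth, pred):
    """Assumed standard purity: (1/N) * sum over clusters of the majority class count."""
    by_cluster = {}
    for t, p in zip(truth, pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(truth)
```

For example, with `truth = ["a", "a", "b", "b"]` and `pred = [0, 0, 0, 1]`, the six pairs split into TP = 1, FP = 2, FN = 1, TN = 2, so RI = 0.5, F1 = 0.4, MI = 0.5, HI = 0, and purity is 0.75.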
