SC1: A Web-based Single Cell RNA-Seq Analysis Pipeline

Uploaded On 2024-01-29

Marmar Moussa and Ion Mandoiu, Computer Science & Engineering Department, University of Connecticut

Presentation Transcript

1. SC1: A Web-based Single Cell RNA-Seq Analysis Pipeline
   Marmar Moussa, Ion Mandoiu
   Computer Science & Engineering Department, University of Connecticut

2. Outline
   - Motivation and challenges
   - Locality sensitive imputation
   - TF-IDF based feature selection & clustering
   - Cell cycle analysis
   - SC1 pipeline
   - Conclusions and ongoing work

3. Single Cell vs. Bulk RNA-Seq
   - Tissue samples are heterogeneous mixtures of cells
   - Bulk RNA-Seq averages expression over thousands to millions of cells
     - Conflates molecular and compositional changes
   - Single cell data captures sample heterogeneity…
     - But is very noisy!

4. Single Cell RNA-Seq Growth

5. 3’-end Sequencing w/ UMIs (10X Genomics)
   - Encapsulates up to 48,000 cells in 10 minutes

6. Primary Analysis
   - Produces UMI (cDNA molecule) counts

7. Challenges
   - Low RT efficiency & sequencing depth
     - Results in zero-inflated data due to drop-out effects
   (Hicks et al. 2015, http://biorxiv.org/content/early/2017/05/08/025528)

8. Challenges
   - Low RT efficiency & sequencing depth
     - Results in zero-inflated data due to drop-out effects
   [Figure: log Ptprc expression]

9. Challenges
   - Low RT efficiency & sequencing depth
   - PCR amplification bias
     - UMIs help
   (Ziegenhain et al. 2017, Mol. Cell 65(4), pp. 631–643.e4)

10. Challenges
   - Low RT efficiency & sequencing depth
   - PCR amplification bias
   - Cell quality
     - Live/dead
     - Stress response
     - Contaminant mRNA
     - Empty droplets
     - Multiple cells

11. Challenges
   - Low RT efficiency & sequencing depth
   - PCR amplification bias
   - Cell quality
   - Stochastic effects
     - Cells captured in different cell cycle phases
     - Transcriptional bursting hard to distinguish from technical artifacts

12. Challenges
   - Low RT efficiency & sequencing depth
   - PCR amplification bias
   - Cell quality
   - Stochastic effects
   - Cell capture bias
     - Capture rates may not be representative of population frequencies

13. Challenges
   - Low RT efficiency & sequencing depth
   - PCR amplification bias
   - Cell quality
   - Stochastic effects
   - Cell capture bias
   - Analysis tools lagging behind
     - Protocols still evolving rapidly
     - Many methods do not scale well with #cells

14. Outline
   - Motivation and challenges
   - Locality sensitive imputation
   - TF-IDF based feature selection & clustering
   - Cell cycle analysis
   - SC1 pipeline
   - Ongoing work

15. Imputation for scRNA-Seq Data
   - Can drop-outs be recovered by imputation?
   [Figure label: Ptprc=0]

16. Existing Imputation Methods
   - BISCUIT (Azizi et al., GCB 2017)
   - CIDR (Lin, Troup, & Ho, Genome Biol. 2017)
   - DrImpute (Kwak et al., bioRxiv 2017)
   - LSImpute (Moussa & Mandoiu, ISBRA 2018)
   - MAGIC (van Dijk et al., bioRxiv 2017)
   - netSmooth (Ronen & Akalin, F1000Res. 2018)
   - scImpute (Li & Li, Nat. Comm. 2018)
   - …

17. LSImpute
   - Step 1: Select pairs of cells with highest similarity (O(n) using Locality Sensitive Hashing)
   - Step 2: Group selected cells into clusters using spherical k-means
   - Step 3: For each cluster, replace zeros with median/mean expression of the gene within the cluster
   - Step 4: Collapse selected cells into cluster centroids and repeat until highest pair similarity drops below a given threshold
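Step 3 above can be sketched in a few lines. This is a minimal illustration, not the LSImpute implementation: the function name `impute_zeros_within_clusters` is hypothetical, the cluster labels are taken as given (Steps 1, 2, and 4 are not shown), and computing the median/mean over all cells of the cluster, zeros included, is an assumption the slide does not settle.

```python
from statistics import mean, median


def impute_zeros_within_clusters(expr, labels, use_median=True):
    """Replace zeros with a per-cluster summary of the same gene.

    expr[c][g] is the expression of gene g in cell c; labels[c] is the
    cluster id of cell c. Assumption: the summary is taken over ALL
    cells of the cluster, including those with zero expression.
    """
    summarize = median if use_median else mean

    # Group cell indices by cluster label.
    members = {}
    for cell, label in enumerate(labels):
        members.setdefault(label, []).append(cell)

    imputed = [list(row) for row in expr]  # leave the input untouched
    n_genes = len(expr[0]) if expr else 0
    for cells in members.values():
        for g in range(n_genes):
            value = summarize([expr[c][g] for c in cells])
            for c in cells:
                if expr[c][g] == 0:
                    imputed[c][g] = value
    return imputed
```

With the median, a gene that is zero in most cells of a cluster stays zero, which makes that variant the more conservative of the two.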

18. Evaluation on Subsampled Data
   - 209 somatosensory neurons isolated from the mouse dorsal root ganglion (Li et al., Cell Research 2016)
     - 31.5M reads/cell
     - 10,950 +/- 1,218 genes/cell
   - Read subsampling: 50k-20M reads
   - Ground truth: TPM values determined by running IsoEM2 (Mandric et al., Bioinformatics 2017) on the full set of reads

19. [Figure: gene detection fraction at 100k, 1M, and 10M reads; panels: Raw Data, DrImpute, scImpute, KNNImpute, LSImputeMed, LSImputeMean]

20. Accuracy on 10x Data
   - 638 MethA cells: 500k reads/cell, down-sampled to 50k reads/cell
   - Detection accuracy: Raw 0.97; DrImpute 0.95; LSImpute 0.974
   [Figure: true vs. down-sampled expression; panel labels: DrImpute, LSImpute]

21. Outline
   - Motivation and challenges
   - Locality sensitive imputation
   - TF-IDF based feature selection & clustering
   - Cell cycle analysis
   - SC1 pipeline
   - Conclusions and ongoing work

22. TF-IDF Transformation
   - Borrowed from information retrieval
   - Product of two factors:
     - Term frequency: how frequently a term occurs in a document
     - Inverse document frequency: how uncommon the term is in the document collection
   - For scRNA-Seq data:
     - For gene i in cell j with count fij: [equation not captured in transcript]
     - If gene i is detected in ni out of N cells: [equation not captured in transcript]
     - TF-IDF score: [equation not captured in transcript]
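Since the slide's equations did not survive, the sketch below shows one common TF-IDF variant applied to a genes-by-cells count matrix. The specific choices are assumptions, not taken from the slide: term frequency is the count fij divided by the largest count in cell j, and inverse document frequency is log2(N / ni). The function name `tf_idf` is hypothetical.

```python
import math


def tf_idf(counts):
    """TF-IDF scores for a genes x cells count matrix.

    counts[i][j] is the UMI count of gene i in cell j.
    Assumed variant: TF = f_ij / (max count in cell j),
    IDF = log2(N / n_i), score = TF * IDF.
    """
    n_genes = len(counts)
    n_cells = len(counts[0]) if n_genes else 0

    # Largest count observed in each cell (column maximum).
    cell_max = [max(counts[i][j] for i in range(n_genes)) for j in range(n_cells)]
    # n_i: number of cells in which gene i is detected.
    detected = [sum(1 for c in row if c > 0) for row in counts]

    scores = []
    for i, row in enumerate(counts):
        # Genes never detected get an IDF of 0 rather than a division by zero.
        idf = math.log2(n_cells / detected[i]) if detected[i] else 0.0
        scores.append([
            (row[j] / cell_max[j] if cell_max[j] else 0.0) * idf
            for j in range(n_cells)
        ])
    return scores
```

Under this variant a gene detected in every cell scores 0 everywhere, while a gene seen in few cells is up-weighted where it does occur, which is the property feature selection by high average TF-IDF relies on.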

23. TF-IDF Based Feature Selection
   [Figure: PBMC, 10X; genes with highest avg. TF-IDF (highest mean GMM in red). Cluster labels: 1: helper; 2: regulatory; 4: B cells; 5: Monocytes; 6: Natural Killer; 7, 8: naïve cytotoxic, cytotoxic, activated cytotoxic]

24. TF-IDF Based Clustering
   - Cells QC, Genes QC, Gap-Statistics Analysis
   - Data Transformation: Log2(x+1) or none
     - Feature Selection: PCA, tSNE, highly variable genes* or none
     - Clustering: Seurat (K-means)*, Seurat (SNN)*, GMM, K-means, Sph. K-means, HC (E/P), Louvain (E)
   - Data Transformation: TF-IDF
     - Feature Selection: High avg. TF-IDF score (Top) or Highly variable TF-IDF (Var)
     - Clustering: GMM, K-means, Sph. K-means, HC (E/P/C)
   - Data Binarization: Cutoff threshold per cell based on cell avg. TF-IDF (Bin)
     - Clustering: HC (E/P/C/J), Greedy (E/P/C/J), Louvain (E/P/C/J)
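The binarization branch can be illustrated as follows. This is a sketch under stated assumptions: the per-cell cutoff is taken to be the mean TF-IDF score over all genes in that cell, a gene is "on" when it scores strictly above the cutoff, and the "J" in the clustering options is read as Jaccard similarity. The names `binarize_by_cell_mean` and `jaccard` are hypothetical.

```python
def binarize_by_cell_mean(scores):
    """Turn a genes x cells TF-IDF matrix into one gene set per cell.

    scores[i][j] is the TF-IDF score of gene i in cell j. Assumption:
    the cutoff for cell j is its mean score over all genes, and a gene
    is kept only if it scores strictly above that cutoff.
    """
    n_genes = len(scores)
    n_cells = len(scores[0]) if n_genes else 0
    signatures = []
    for j in range(n_cells):
        column = [scores[i][j] for i in range(n_genes)]
        cutoff = sum(column) / n_genes
        signatures.append({i for i, s in enumerate(column) if s > cutoff})
    return signatures


def jaccard(a, b):
    """Jaccard similarity of two gene sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

The pairwise similarities between the resulting gene sets could then feed hierarchical, greedy, or Louvain clustering as listed on the slide.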

25. Evaluation on Synthetic PBMC Mixtures
   - FACS sorted blood cells of 7 types [Zheng et al., Nat. Comm. 2017]
   - 7-way mixtures, equal proportions (7000 cells/mix)

26. Clustering Accuracy on PBMC Mixture

27. Pancreatic Cells Dataset
   - 2045 Pancreatic cells of 7 types [Segerstolpe et al. 2016]
   - Annotated based on known markers (removed for clustering)
   - Capture proportions: 185 acinar cells, 886 alpha cells, 270 beta cells, 197 gamma cells, 114 delta cells, 386 ductal cells, and 7 epsilon cells

28. Clustering Accuracy on Pancreatic Cells

29. Outline
   - Motivation and challenges
   - Locality sensitive imputation
   - TF-IDF based feature selection & clustering
   - Cell cycle analysis
   - SC1 pipeline
   - Conclusions and ongoing work

30. Cell Cycle Analysis
   - Goal: disentangle cell type effects from cell cycle effects
   - Existing methods:
     - Cyclone: classifies cells into G1, S, and G2/M cell cycle phases
     - Oscope/reCAT: identify oscillatory genes/order cells based on cell cycle pseudo-time
     - ccRemover: attempts to subtract cell cycle signal; fails on mix of Jurkat & 293 cells

31. SC1CC Method
   - PCA of normalized RNA-Seq counts sub-matrix for cell cycle marker genes only
     - Captures global variation due to cell cycle
   - 3-component t-SNE transformation using first few PCs
     - Captures local cell similarity
   - Hierarchical clustering
     - Using average linkage, cosine similarity
   - Reorder leaves of HC dendrogram using the Optimal Leaf Ordering (OLO) algorithm
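The ordering part of this method can be approximated with NumPy and SciPy. The sketch below is a simplification, not the SC1CC code: it assumes both libraries are installed, computes PCA through an SVD of the centered matrix, and skips the t-SNE step so the example stays short and deterministic, clustering the PC scores directly. The function name `order_cells` and the default `n_pcs=3` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import pdist


def order_cells(marker_expr, n_pcs=3):
    """Order cells by hierarchical clustering with optimal leaf ordering.

    marker_expr: cells x marker-genes array of normalized expression.
    Simplification: the t-SNE step from the slide is omitted and the
    leading principal component scores are clustered directly.
    """
    x = np.asarray(marker_expr, dtype=float)
    x = x - x.mean(axis=0)  # center each gene before PCA

    # PCA via SVD: rows of u * s are the principal component scores.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    pcs = u[:, :n_pcs] * s[:n_pcs]

    # Average-linkage clustering on cosine distance; optimal_ordering=True
    # reorders the leaves so that adjacent leaves are as similar as possible.
    distances = pdist(pcs, metric="cosine")
    tree = linkage(distances, method="average", optimal_ordering=True)
    return leaves_list(tree)
```

The returned array is a permutation of the cell indices; reading cells in that order gives the leaf ordering that the smoothness scoring on the next slide operates on.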

32. Cluster Scoring
   - Dividing/non-dividing clusters identified based on gene smoothness scores: [equation not captured in transcript]
     - N is the number of genes
     - SCord(gi) is the serial correlation of gene i under the OLO order
     - SCrand(gi) is the serial correlation of gene i under a random order
   - Cell phase assigned based on max mean (normalized) expression of cell cycle marker genes
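The two ingredients of the smoothness score can be computed as below. This sketch assumes "serial correlation" means lag-1 autocorrelation and estimates the random-order value by averaging over shuffles; how the slide combines SCord and SCrand across the N genes is not recoverable from the transcript, so no combined score is shown. The names `serial_correlation` and `ordered_vs_random` are hypothetical.

```python
import random
from statistics import mean


def serial_correlation(values):
    """Lag-1 autocorrelation of a sequence (0.0 for a constant sequence)."""
    m = mean(values)
    denom = sum((v - m) ** 2 for v in values)
    if denom == 0:
        return 0.0
    num = sum((values[k] - m) * (values[k + 1] - m)
              for k in range(len(values) - 1))
    return num / denom


def ordered_vs_random(gene_values, order, n_shuffles=100, seed=0):
    """Return (SC under the given cell order, mean SC under random orders).

    gene_values[c] is one gene's expression in cell c; order is a
    permutation of cell indices, e.g. a dendrogram leaf ordering.
    """
    rng = random.Random(seed)
    ordered = [gene_values[c] for c in order]
    sc_ord = serial_correlation(ordered)

    shuffled = list(ordered)
    sc_rand = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        sc_rand.append(serial_correlation(shuffled))
    return sc_ord, mean(sc_rand)
```

A gene whose expression changes smoothly along the ordering yields a serial correlation well above its random-order value, which is the signal used to separate dividing from non-dividing clusters.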

33. Evaluation on Labeled Data
   - Undifferentiated human embryonic stem cells sorted by FACS into G1, S and G2/M phases

34. Outline
   - Motivation and challenges
   - Locality sensitive imputation
   - TF-IDF based feature selection & clustering
   - Cell cycle analysis
   - SC1 pipeline
   - Conclusions and ongoing work

35. SC1 Analysis Workflow
   https://sc1.engr.uconn.edu

36. Outline
   - Motivation and challenges
   - Locality sensitive imputation
   - TF-IDF based feature selection & clustering
   - Cell cycle analysis
   - SC1 pipeline
   - Conclusions and ongoing work

37. Conclusions
   - The range of single-cell applications continues to expand rapidly, fueled by advances in technology
   - Analysis tools lagging behind
   - Our contributions
     - Scalable imputation method based on LSH
     - TF-IDF based gene selection and clustering
     - Robust methods for cell cycle analysis
     - Web-based workflow for scRNA-Seq analysis and interactive visualization

38. Ongoing Work
   - Additional pipeline components in progress
     - Correction for mRNA contamination
     - Batch effect removal
     - Lineage inference
     - RNA velocity
     - Cell type matching
     - …

39. Acknowledgments
   Marmar Moussa

40. Questions?