Cankun Wang, Department of Agronomy, Horticulture & Plant Science

Uploaded On 2024-02-09

Presentation Transcript

1. Development of Computational Techniques for Identification of Regulatory DNA Motif
   - Cankun Wang, Department of Agronomy, Horticulture & Plant Science
   - 7/17/2019

2. Outline
   - Introduction
   - WTSA, a novel DNA motif identification program for ChIP-exo data
   - Applications of DNA motif identification on big biological data

3. Outline
   - Introduction
   - WTSA, a novel DNA motif identification program for ChIP-exo data
   - Applications of DNA motif identification on big biological data

4. Transcription regulation
   - Transcription factor binding sites (TFBSs) are key to the mediation of transcriptional regulation
   - Information on experimentally validated functional TFBSs is limited

5. DNA Binding Site Motif (DNA Motif)
   - DNA motifs are short, recurring patterns that are presumed to have a regulatory function
   - Usually 8 to 20 base pairs (bp) long

6. Representation of DNA Motif
   - Binding sites of a given transcription factor do not have to have the exact same sequence in every case
   - Example: Regulation by OXygen1 (ROX1) binding sites and sequence motif
   - Representation 1: counts of nucleotides at each position
   - Representation 2: information content, adjusted to background
   - PWM (position weight matrix) and motif logo: visual representation
   - Figure panels: consensus sequences; nucleotide codes table
   - Ref: D'haeseleer, What are DNA sequence motifs? Nature Biotechnology, 2006
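
The two representations on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slide's figures: the example sites are made up, the background is assumed to be uniform (0.25 per base), and the pseudocount of 0.5 is an arbitrary choice to avoid taking the logarithm of zero.

```python
import math
from collections import Counter

BASES = "ACGT"

def count_matrix(sites):
    """Count each nucleotide at each position of equal-length binding sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(site[i] for site in sites) for i in range(length)]

def information_content(counts, background=None, pseudocount=0.5):
    """Per-position information content in bits, relative to a background.

    Assumption: uniform background unless one is supplied.
    """
    background = background or {base: 0.25 for base in BASES}
    bits_per_position = []
    for column in counts:
        total = sum(column[base] for base in BASES) + 4 * pseudocount
        bits = 0.0
        for base in BASES:
            freq = (column[base] + pseudocount) / total
            bits += freq * math.log2(freq / background[base])
        bits_per_position.append(bits)
    return bits_per_position

# Hypothetical aligned binding sites, for illustration only.
sites = ["ACGT", "ACGA", "ACGT", "ACCT"]
counts = count_matrix(sites)
ic = information_content(counts)
for position, (column, bits) in enumerate(zip(counts, ic), start=1):
    print(position, dict(column), round(bits, 3))
```

Fully conserved positions get the highest information content, which is what makes them tall in a motif logo.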

7. Identification of DNA Motifs is difficult
   - Real DNA promoter sequence:
     TGTGAAAGACTGTTTTTTTGATCGTTTTGACAAAAATGGAAGTCCACAAAGTCCACATTGATTATTTGCACGGCGTCACACTTTGCTATCCCATAGTGATGTACTGCATGTATGCAAAGGACGTCAGATTACCGTGCAGTACAGTAAACGATTCCACTAATTTATTCCATGTCACTCTTTTCGCATCTTTGTACATTACCGCCAATTCTGTAACAGAGATCACACAAAGCGACGGTGGGGACTTTTTTTTCATATGCCTGACGGAGTTGACACTTGTAAGTTTTCAAC
   - Too many combinations to consider!
   - If the sequence length is 200 and the number of sequences is 30: ~10^65 combinations
   - At 10,000 combinations/sec: > 10^44 centuries

8. How have we identified DNA motifs in the past?
   - Timeline figure, 2001 to 2019. Tools: BioProspector, MEME, HOMER, BoBro 1.0, BoBro 2.0, rGADEM, ChIPMunk, DMINDA, DMINDA 2, DESSO, WTSA. Protocols: ChIP-chip, ChIP-seq, ChIP-exo
   - In recent years, chromatin immunoprecipitation (ChIP) technologies have provided an unprecedented opportunity to discover DNA motifs

9. In-vitro motif locating from ChIP-Seq
   - ChIP-sequencing (ChIP-seq) is one of the most popular methods used to analyze protein interactions with DNA
   - DNA-bound protein is mapped to the genome to generate the 'peaks'
   - These peak regions serve as potential binding sites for motif identification

10. ChIP-exo - A modification of ChIP-Seq
   - The ChIP-exo method adds a unique exonuclease (exo) digestion step

11. The ChIP-exo method has the best resolution
   - The ChIP-exo method obtains near base-pair resolution

12. Objective: Using ChIP-exo data to improve DNA motif prediction accuracy
   - ChIP-exo is more closely related to the actual DNA motif length
   - The ChIP-exo peak score can be used to enhance the DNA nucleotide signal
   - The method is called the Weighted Two-Stage Alignment tool (WTSA)
   - Figure: Distribution of weight scores for each nucleotide (axis: Score; example sequence: CTGTTTTTTTGATTCGTCGACAAAAATGGAAATTGATTATTTGATCGTTCGTCATACTTTGCT)

13. Outline
   - Introduction
   - WTSA, a novel DNA motif identification program for ChIP-exo data
   - Applications of DNA motif identification on big biological data

14. WTSA workflow
   - Data pre-processing
   - Weighted two-stage alignment
   - Expansion
   - Optimization evaluation
   - Matrix approximation & graph construction

15.

16. WTSA format
   - Similar to the FASTA format, the WTSA format begins with a single-line description, followed by lines of sequence data
   - The third line represents the weighted scores extracted from bedtools
   - Example records:
     >NC_000913.0:3073606-3074021
     CCGCCCGGCGTCCGGATTCATACAAAGCACGAACCACATTAC
     0,0,0,0,0,0,20.0,16.0,24.0,8.0,20.0,0,0,0,8.0,41.0,0,20.0,0,0,0
     >NC_000913.0:2565089-2565533
     GGCAACAACGCAGGGTTACAGCAGAAGATCACTGTGTTGGATA
     4.0,8.0,4.0,0,0,0,0,0,0,0,0,0,0,0,8.0,0,49.0,0,0,0,16.0,8.0,0,0……
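
A reader for this three-line layout could look like the sketch below. It is not the WTSA parser: the names `WtsaRecord` and `parse_wtsa` are hypothetical, each record is assumed to span exactly three lines, and the score line is assumed to hold one comma-separated weight per nucleotide (the records shown on the slide are truncated, so the example data here is made up).

```python
from typing import Iterable, Iterator, List, NamedTuple

class WtsaRecord(NamedTuple):
    """One record: description, sequence, per-nucleotide weights (hypothetical type)."""
    header: str
    sequence: str
    weights: List[float]

def parse_wtsa(lines: Iterable[str]) -> Iterator[WtsaRecord]:
    """Yield records from '>header' / sequence / comma-separated-weights triples."""
    cleaned = [line.strip() for line in lines if line.strip()]
    if len(cleaned) % 3 != 0:
        raise ValueError("expected three lines per record")
    for i in range(0, len(cleaned), 3):
        header, sequence, score_line = cleaned[i:i + 3]
        if not header.startswith(">"):
            raise ValueError(f"record does not start with '>': {header!r}")
        weights = [float(value) for value in score_line.split(",")]
        # Assumption: exactly one weight per nucleotide.
        if len(weights) != len(sequence):
            raise ValueError(f"{header}: {len(weights)} weights for {len(sequence)} bases")
        yield WtsaRecord(header[1:], sequence.upper(), weights)

# Made-up records in the same layout as the slide's example.
example = """\
>example_region:1-8
ACGTACGT
0,0,4.0,8.0,20.0,8.0,0,0
>example_region:20-25
TTGACA
0,16.0,41.0,16.0,0,0
"""
records = list(parse_wtsa(example.splitlines()))
for record in records:
    print(record.header, record.sequence, max(record.weights))
```

Checking that the weight count matches the sequence length catches truncated score lines early.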

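17. WTSA format
   - Calculate the match score based on a binomial distribution model
   - For two length-L segments s_i and s_j with k identical positions, the score is defined through B(.), the binomial distribution, with p = 0.25 [formula missing in transcript]
   - Integrate the weight score from ChIP-exo data, using the weight score of each nucleotide in the two segments [formula and symbols missing in transcript]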
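
The slide's formulas did not survive, so the following is only a sketch of one plausible reading. It assumes the match score is the negative log of the binomial upper-tail probability of seeing at least k identities in L positions with p = 0.25. The way the ChIP-exo weights are folded in (`weighted_match_score`) is purely illustrative and should not be taken as the WTSA formula.

```python
import math

def binomial_tail(k, length, p=0.25):
    """P(X >= k) for X ~ Binomial(length, p)."""
    return sum(
        math.comb(length, i) * p**i * (1 - p) ** (length - i)
        for i in range(k, length + 1)
    )

def match_score(seg_a, seg_b, p=0.25):
    """Assumed score: -log of the chance of >= k identities arising at random."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segments must have the same length")
    k = sum(a == b for a, b in zip(seg_a, seg_b))
    return -math.log(binomial_tail(k, len(seg_a), p))

def weighted_match_score(seg_a, seg_b, w_a, w_b, p=0.25):
    """Hypothetical weighting: add log(1 + mean ChIP-exo weight at identical positions)."""
    base = match_score(seg_a, seg_b, p)
    matched = [(x + y) / 2 for a, b, x, y in zip(seg_a, seg_b, w_a, w_b) if a == b]
    boost = sum(matched) / len(seg_a)
    return base + math.log1p(boost)

# Made-up segments and weights, for illustration only.
seg_a, seg_b = "ACGTACGTAC", "ACGTACGTTT"
w_a = [0, 0, 8.0, 20.0, 41.0, 20.0, 8.0, 0, 0, 0]
w_b = [0, 4.0, 8.0, 16.0, 49.0, 16.0, 8.0, 0, 0, 0]
print(match_score(seg_a, seg_b))
print(weighted_match_score(seg_a, seg_b, w_a, w_b))
```

With all weights at zero the weighted score reduces to the plain binomial score, so the ChIP-exo signal can only raise the score of a pair, never lower it.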

18. Construct a weighted graph G and find cliques
   - Vertices represent nucleotide start positions
   - Edges connect every pair of motif start positions with the largest scores
   - The edge weight is the score based on the similarity of the two sequence segments and the ChIP-exo weight scores
   - Build the approximation matrix: mark both elements among the top t scores across all the L-segment sequence alignments between two promoters
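
The graph step can be sketched as follows. This is a simplification under stated assumptions: vertices are hypothetical `(sequence_id, start)` tuples, the top t pairs are taken globally rather than per promoter pair, and cliques are enumerated with a plain Bron-Kerbosch recursion, which the slides do not name as WTSA's method.

```python
def top_t_edges(scores, t):
    """Keep the t highest-scoring vertex pairs from a {(u, v): score} mapping."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:t])

def build_graph(edges):
    """Turn an edge mapping into an undirected adjacency dictionary."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for vertex in list(candidates):
            neighbours = graph[vertex]
            expand(current | {vertex}, candidates & neighbours, excluded & neighbours)
            candidates = candidates - {vertex}
            excluded = excluded | {vertex}

    expand(set(), set(graph), set())
    return cliques

# Made-up pairwise scores between segment start positions in three sequences.
scores = {
    (("s1", 4), ("s2", 7)): 9.1,
    (("s1", 4), ("s3", 2)): 8.7,
    (("s2", 7), ("s3", 2)): 8.9,
    (("s1", 20), ("s2", 1)): 1.2,
    (("s2", 1), ("s3", 30)): 0.8,
}
graph = build_graph(top_t_edges(scores, t=3))
cliques = maximal_cliques(graph)
print(max(cliques, key=len))
```

A clique here is a set of start positions, at most one useful one per sequence, that are all mutually similar, which is the shape a shared motif would take in the graph.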

19. Motif expansion
   - The P-value is very close to a Poisson distribution
   - Approximate p(x) as [formula missing in transcript]
   - The P-value is obtained by simply summing up p(x) in its motif closures

20. Motif length adjustment
   - From the previous motif expansion, we have a temporary result with the default motif length = 10
   - Iterate over all motif instances and calculate the overlap frequency:
     Motif length:  10  13  14
     Frequency:      1   4   2
   - We set 13 as the optimized motif length and use it to perform motif expansion again
   - Figure labels: Motif 1, Motif 2, extend, Instance 1 to Instance 7
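
The final selection step on this slide amounts to picking the most frequent candidate length. The sketch below covers only that step. How the overlap frequencies are derived from the instances is not described in the transcript, and breaking ties in favour of the shorter length is an assumption.

```python
from collections import Counter

def optimized_motif_length(candidate_lengths):
    """Return the most frequent candidate length (shortest wins ties, by assumption)."""
    counts = Counter(candidate_lengths)
    best_count = max(counts.values())
    return min(length for length, count in counts.items() if count == best_count)

# Frequencies from the slide: length 10 once, 13 four times, 14 twice.
candidate_lengths = [10] + [13] * 4 + [14] * 2
best_length = optimized_motif_length(candidate_lengths)
print(best_length)
```

For the slide's frequencies this selects 13, matching the length the slide reports using for the second round of motif expansion.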

21. Motif optimization
   - An optimized motif is obtained; overlapping motif instances are removed

22. Experiment datasets
   - 3 datasets from publications in Escherichia coli using the ChIP-exo method:
     Transcription factor | BioProject ID | Publish date | # of bases | # of identified sites
     Fur                  | PRJNA238003   | 2014.9       | 193.4M     | 556
     Cra                  | PRJNA274571   | 2018.4       | 88.6M      | 387
     ArgR                 | PRJNA258521   | 2015.3       | 768.7M     | 462
   - We downloaded data for the Fur, Cra, ArgR, GadE, GadW, GadY, OxyR, UvrY, SoxR and SoxS TFs (10 datasets)
   - 3 were used for motif evaluation; the others had limited annotation and thus could not be used for evaluation

23. WTSA performance evaluation against published tools
   - List of tools:
     BioProspector, 2001, cited by 959
     BoBro, 2011, cited by 24
     MEME-ChIP, 2009, cited by 3363
     Homer, 2010, cited by 3150
     ChIPMunk, 2014, cited by 23
     rGADEM, 2011, cited by 42
     WTSA (with ChIP-exo weight data)
   - “We evaluated using ENCODE-ChIP-Seq data and showed that rGADEM was the best-performing tool.” (BMC Bioinformatics. 2016. Evaluating tools for transcription factor binding site prediction)
   - Comparison of tools using evaluation method xx, xy, yy

24. Evaluation method 1: profile level
   - Submit the motif result to TOMTOM, which compares DNA motif results against a database of known motifs
   - Save the E-value and Q-value for comparison
   - Figure: TOMTOM query result example
   - Ref: Discovering Gap Patterns from Protein Sequences through Pattern-Directed Aligned Pattern Clustering, IEEE, 2018

25. Performance comparison of results on TOMTOM profile level
   - WTSA provides stable prediction results on the -log2(E-value) and -log2(Q-value) metrics
   - WTSA outperforms all other methods on the Fur and ArgR TFs
   - MEME-ChIP performed slightly better than WTSA on the Cra TF

26. Evaluation method 2: TFBS level
   - Figure labels: True, Predicted; true positive (TP), false positive (FP), true negative (TN); overlapped > 25%
   - Ref: M Tompa, et al. Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotech, 2005
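
A site-level comparison with the slide's 25% overlap threshold might be implemented along these lines. This is a sketch, not the evaluation script used for the reported results: sites are assumed to be half-open `(start, end)` intervals on one sequence, and the overlap fraction is assumed to be measured against the length of the true site.

```python
def overlap_length(a, b):
    """Number of shared positions between two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def hits(true_site, predicted_site, min_fraction=0.25):
    """True if the prediction covers more than min_fraction of the true site."""
    true_length = true_site[1] - true_site[0]
    return overlap_length(true_site, predicted_site) > min_fraction * true_length

def site_level_metrics(true_sites, predicted_sites, min_fraction=0.25):
    """Count TP/FN over true sites and FP over predictions, then derive metrics."""
    tp = sum(any(hits(t, p, min_fraction) for p in predicted_sites) for t in true_sites)
    fn = len(true_sites) - tp
    fp = sum(not any(hits(t, p, min_fraction) for t in true_sites) for p in predicted_sites)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f_score = 2 * sensitivity * ppv / (sensitivity + ppv) if sensitivity + ppv else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "sensitivity": sensitivity, "PPV": ppv, "F": f_score}

# Made-up coordinates, for illustration only.
true_sites = [(10, 20), (50, 62)]
predicted_sites = [(12, 22), (100, 110)]
site_metrics = site_level_metrics(true_sites, predicted_sites)
print(site_metrics)
```

True negatives are left out because, at the site level, the set of non-sites is not well defined; sensitivity, positive prediction value and F-score need only TP, FP and FN.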

27. Performance comparison of results on TFBS level
   - WTSA achieved stable, high motif prediction performance in the TFBS-level F-score comparisons
   - The rGADEM program has the best TFBS-level F-score on the Cra TF data, while WTSA has the best positive prediction value on the Cra TF data

28. Evaluation method 3: Nucleotide level
   - Figure labels: True, Predicted; false positive (FP), true positive (TP), false negative (FN), false positive (FP)
   - Ref: M Tompa, et al. Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotech, 2005
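
At the nucleotide level, every base covered by a true or predicted site is counted individually. The sketch below assumes half-open `(start, end)` intervals on a single sequence and uses made-up coordinates. It illustrates the metric definitions, not the exact script behind the reported numbers.

```python
def covered_positions(sites):
    """Set of all positions covered by half-open (start, end) intervals."""
    return {position for start, end in sites for position in range(start, end)}

def nucleotide_level_metrics(true_sites, predicted_sites):
    """Per-base TP/FP/FN with sensitivity, positive prediction value and F-score."""
    true_positions = covered_positions(true_sites)
    predicted_positions = covered_positions(predicted_sites)
    tp = len(true_positions & predicted_positions)
    fp = len(predicted_positions - true_positions)
    fn = len(true_positions - predicted_positions)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    f_score = 2 * sensitivity * ppv / (sensitivity + ppv) if sensitivity + ppv else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "sensitivity": sensitivity, "PPV": ppv, "F": f_score}

# A prediction shifted by 5 bp against a 10 bp true site.
nucleotide_metrics = nucleotide_level_metrics([(10, 20)], [(15, 25)])
print(nucleotide_metrics)
```

Unlike the site-level view, a prediction that is shifted or too long is penalised here in proportion to the number of misplaced bases.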

29. Performance comparison of results on nucleotide level
   - WTSA achieved the highest F-scores, sensitivity and positive prediction value on all three datasets

30. Unique features of WTSA
   - A workflow to perform DNA motif analysis directly from ChIP-exo data
   - Binomial distribution and nucleotide-level weight scores from ChIP-exo for optimizing motif instances
   - Automatic optimization of the resulting DNA motif length
   - Improved performance in the comparison with popular existing tools

31. Outline
   - Introduction
   - WTSA, a novel DNA motif identification program for ChIP-exo data
   - Applications of DNA motif identification on big biological data

32. IRIS3 - Integrated Cell-type-specific Regulon Inference Server from Single-cell RNA-Seq
   - Understanding how gene expression programs are controlled requires identifying regulatory relationships between transcription factors (TFs) and target genes
   - A cell-type-specific regulon (CTS-R) is a group of genes co-regulated by the same transcription regulator in a specific cell type
   - To identify CTS-Rs, we developed the Integrated Cell-type-specific Regulon Inference Server from Single-cell RNA-Seq (IRIS3)

33. IRIS3 overall pipeline
   - Integrates WTSA

34. IRIS3 web server
   - Screenshot of an example cell-type-specific regulon
   - Gene list including marker genes
   - DNA motif details
   - A list of toolkit buttons for further analysis

35. IRIS3 web server functions
   - Screenshot of the heatmap and t-SNE plot from the previous regulon example

36. DESSO: Prediction of Regulatory Motifs from Human ChIP-Sequencing Data using a Deep Learning Framework
   - Established web server interface and test data

37. Tools developed by BMBL
   - BoBro command-line toolkit and DMINDA web server: motif identification, motif scanning, motif comparison, analyzing co-occurring motifs, regulon prediction, prokaryotic & eukaryotic database
   - WTSA: DNA motif identification on ChIP-exo data
   - DESSO: motif prediction using deep learning

38. Publications
   - Xia, Ye, Seth DeBolt, Qin Ma, Adam McDermaid, Cankun Wang, Nicole Shapiro, Tanja Woyke, and Nikos C. Kyrpides. “Improved Draft Genome Sequence of Bacillus Sp. Strain YF23, Which Has Plant Growth-Promoting Activity.” Edited by David Rasko. Microbiology Resource Announcements 8, no. 15 (April 11, 2019). https://doi.org/10.1128/MRA.00099-19.
   - Xia, Ye, Seth DeBolt, Qin Ma, Adam McDermaid, Cankun Wang, Nicole Shapiro, Tanja Woyke, and Nikos C. Kyrpides. “Improved Draft Genome Sequence of Pseudomonas Poae A2-S9, a Strain with Plant Growth-Promoting Activity.” Edited by Irene L. G. Newton. Microbiology Resource Announcements 8, no. 15 (April 11, 2019). https://doi.org/10.1128/MRA.00275-19.
   - Monier, Brandon, Adam McDermaid, Cankun Wang, Jing Zhao, Allison Miller, Anne Fennell, and Qin Ma. “IRIS-EDA: An Integrated RNA-Seq Interpretation System for Gene Expression Data Analysis.” PLOS Computational Biology 15, no. 2 (February 14, 2019): e1006792. https://doi.org/10.1371/journal.pcbi.1006792.
   - Wang, Yan, Sen Yang, Jing Zhao, Wei Du, Yanchun Liang, Cankun Wang, Fengfeng Zhou, Yuan Tian, and Qin Ma. “Using Machine Learning to Measure Relatedness Between Genes: A Multi-Features Model.” Scientific Reports 9, no. 1 (December 2018). https://doi.org/10.1038/s41598-019-40780-7.
   - Han, Siyu, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang, and Ying Li. “LncFinder: An Integrated Platform for Long Non-Coding RNA Identification Utilizing Sequence Intrinsic Composition, Structural Information and Physicochemical Property.” Briefings in Bioinformatics. Accessed November 24, 2018. https://doi.org/10.1093/bib/bby065.
   - McDermaid, Adam, Xin Chen, Yiran Zhang, Cankun Wang, Shaopeng Gu, Juan Xie, and Qin Ma. “A New Machine Learning-Based Framework for Mapping Uncertainty Analysis in RNA-Seq Read Alignment and Gene Expression Estimation.” Frontiers in Genetics 9 (2018). https://doi.org/10.3389/fgene.2018.00313.

39. Submission plan
   - WTSA: a weighted two-stage alignment tool for DNA motif identification on ChIP-exo data. (Bioinformatics or Nucleic Acids Research) Cankun Wang (first author)
   - IRIS3: Integrated cell-type-specific regulon inference server from single-cell RNA-Seq. (Genome Biology) Anjun Ma*, Cankun Wang* (co-first authors), Adam McDermaid, Bingqiang Liu, and Qin Ma
   - Prediction of Regulatory Motifs from Human ChIP-Sequencing Data using a Deep Learning Framework. (Nucleic Acids Research) Jinyu Yang, Anjun Ma, Adam D. Hoppe, Cankun Wang, Yang Li, Yan Wang, Bingqiang Liu, and Qin Ma

40. Acknowledgment
   - Committee: Dr. Anne Fennell, Dr. Qin Ma, Dr. Trevor Roiger
   - BMBL members: Anjun Ma, Juan Xie, Yuzhou Chang, Adam McDermaid, Jinyu Yang, Shaopeng Gu, Zhaoqian Liu, Jing Jiang, Junyi Chen, Weiliang Liu, Zichun Zhang, Minxuan Sun, Jennifer Xu

41. Thanks
   - Cankun Wang, 7/17/2019