An Introduction to Bioinformatics





Presentation Transcript

1. An Introduction to Bioinformatics and Its Application in Protein-DNA/Protein Interactions Research and Drug Discovery
    CMSC5719
    Dr. Leung, Kwong Sak, Professor of Computer Science and Engineering
    Mar 26, 2012

2. Outline
    I. Introduction to Bioinformatics
    II. Protein-DNA Interactions
    III. Drug Discovery
    IV. Discussion and Conclusion

3. I. Introduction to Bioinformatics
    - Bioinformatics
    - Research Areas
    - Biological Basics

4. Introduction: Bioinformatics
    - More and more crucial in life sciences and biomedical applications for analysis and new discoveries
    - Biology: huge noisy data; costly annotations; individual and specific
    - Informatics (e.g. Computer Science): curated and well-organized; effective and efficient analysis; generalized knowledge
    - Bioinformatics: bridging the two

5. Bioinformatics Research Areas
    Many (crossing) areas:
    - (Genome-scale) sequence analysis: sequence alignments, motif discovery, genome-wide association (to study diseases such as cancers)
    - Computational evolutionary biology: phylogenetics, evolution modeling
    - Analysis of gene regulation: gene expression analysis, alternative splicing, protein-DNA interactions, gene regulatory networks
    - Structural biology: drug discovery, protein folding, protein-protein interactions
    - Synthetic biology
    - High-throughput imaging analysis
    - …

6. Our Research Roadmap

7. Genome-wide Association…
    - Human DNA sequences with SNPs (single nucleotide polymorphisms; >5% variations); samples labelled Normal or Disease
    - Targets: SNPs that are associated with genetic diseases; diagnosis and healthcare for high-risk patients
    - Methods: feature selection; mutual information; non-linear integrals; Support Vector Machine (SVM)
    - Reference: KS Leung, KH Lee, (JF Wang), (Eddie YT Ng), Henry LY Chan, Stephen KW Tsui, Tony SK Mok, Chi-Hang Tse, Joseph JY Sung, "Data Mining on DNA Sequences of Hepatitis B Virus". IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2011

8. HBV Project (Example)…
    - HBV sequences; hepatitis B (Hep B)
    - Classes: Normal; Hep B → Cancer
    - Steps: feature selection; non-linear integral (problem modeling); optimization and classification; explicit diagnosis rules (if sites XX & YY are A & T, then …)
    - SNPs are not known and are to be discovered by alignments

9. Biological Basics
    - Cell, chromosome, DNA sequence, genome
    - Double strand: 5’...AGACTGCGGA...3’ paired with 3’...TCTGACGCCT...5’
    - Base pairs: A-T, C-G
    - Gene: ...AGACTGCGGA... (a string over the alphabet A, C, G, T)
    - Transcription: gene to RNA
    - Translation: RNA to protein (a string of amino acids)
    - Protein: regulatory functions; other functions: protein-protein, protein-ligand
    - Image sources: http://www.jeffdonofrio.net/DNA/DNA%20graphics/chromosome.gif and http://upload.wikimedia.org/wikipedia/commons/7/7a/Protein_conformation.jpg
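
The base-pairing rule on this slide (A-T, C-G) can be illustrated with a short Python sketch. The function names `complement` and `reverse_complement` are illustrative and do not come from the lecture; the example strand is the one shown on the slide.

```python
# Watson-Crick pairing as listed on the slide: A-T and C-G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand (same reading order)."""
    return "".join(PAIRS[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


top = "AGACTGCGGA"              # 5'...AGACTGCGGA...3' from the slide
print(complement(top))          # TCTGACGCCT, the 3'-to-5' strand on the slide
print(reverse_complement(top))  # TCCGCAGTCT
```

Reading the complement right to left gives the second strand in its own 5’ to 3’ direction, which is why sequence tools usually work with the reverse complement.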

10. Protein-ligand Interactions: Drug Discovery
    - Protein, other functions: protein-protein and protein-ligand interactions
    - Detailed in III. Drug Discovery

11. Transcriptional Regulation
    Binding for transcriptional regulation:
    - Transcription Factor (TF): the protein as the key
    - TF Binding Site (TFBS): the DNA segment as the key switch
    - Transcription rate (gene expression): the production rate
    - Diagram: DNA sequence with a TFBS (TATAAA) and a gene (ATGCTGCAACTG…); the binding domain (core) of the TF binds the TFBS; transcription yields RNA and translation yields protein
    - Detailed in II. Protein-DNA Interactions

12. II. Protein-DNA Interactions
    - Introduction
    - Approximate TF-TFBS rule discovery
    - Results and Analysis
    - Discussion
    Reference: Tak-Ming Chan, Ka-Chun Wong, Kin-Hong Lee, Man-Hon Wong, Chi-Kong Lau, Stephen Kwok-Wing Tsui, Kwong-Sak Leung, Discovering Approximate Associated Sequence Patterns for Protein-DNA Interactions. Bioinformatics, 2011, 27(4), pp. 471-478

13. Introduction
    - We focus on TF-TFBS bindings, which are primary protein-DNA interactions
    - Discover TF-TFBS binding relationships to understand gene regulation
    - Experimental data: 3D structures of TF-TFBS bindings are limited and expensive (Protein Data Bank, PDB); TF-TFBS binding sequences are widely available (Transfac DB)
    - Further bioengineering or biomedical applications: manipulate or predict TFBS and/or TF (esp. cancer targets) given either side
    Existing methods:
    - Motif discovery: works on either the protein (TF) or the DNA (TFBS) side; no linkage for a direct TF-TFBS relationship
    - One-to-one binding codes (R-A, E-C, K-G, Y-T?): no universal codes!
    - Machine learning: limited training data (limited 3D data) and not trivial to interpret or apply

14. Conservation
    - TFBSs and genes are merely A, C, G, Ts; the binding domains of TFs are merely amino acids (AAs)
    - What distinguishes them from the others? Functional sequences are less likely to change through evolution
    - Association rule mining: exploit the overrepresented and conserved sequence patterns (motifs) from large-scale protein-DNA interaction (TF-TFBS binding) sequence data
    - Biological mutations and experimental noise exist, hence approximate rules
    - Conservation: similar patterns across genes/species. Bioinformatics!

15. Motivations
    Problem introduction:
    - Input: given a set of TF-TFBS binding sequences (TF: hundreds of AAs; TFBS: tens of bps, depending on experiment resolution), discover the associated patterns of width w (potential interaction cores within binding distance)
    - Associated TF-TFBS binding sequence patterns (TF-TFBS rules): given binding sequence data (Transfac) ONLY, predict short TF-TFBS pairs verifiable in real 3D structures of protein-DNA interactions (PDB)!
    - Previous method: exact association rule mining based on exact counts (supports)
      - Cannot handle the sequence variations that are common in reality
      - Simple counts can happen by chance (no elaborate modeling)
    Motivations:
    - Approximation is critical! Small errors should be allowed!
    - Model "overrepresented" biologically (probabilistic model vs. counts/supports)!
    Reference: Kwong-Sak Leung, Ka-Chun Wong, Tak-Ming Chan, Man-Hon Wong, Kin-Hong Lee, Chi-Kong Lau, Stephen Kwok-Wing Tsui, Discovering Protein-DNA Binding Sequence Patterns Using Association Rule Mining. Nucleic Acids Research. 2010, 38(19), pp. 6324-6337

16. Overall Methodology
    A progressive approach:
    - TFBS side: use the available TFBS motifs C from Transfac DB, which are already approximate through their ambiguity code representation (TFBS side done!)
    - Group TF sequences with different motif C similarity thresholds TY = 0.0, 0.1, 0.3
    - TF side: approximate TF core motif discovery for T (instance set {t_{i,j}}) given W and E, using a customized algorithm (TF side done)
    - Associate T ({t_{i,j}}) with C

17. TF Side: Core TF Motif Discovery
    The customized algorithm:
    - Input: width W, (substitution) error E, and TF sequences S
    - Find W-patterns (with at least 1 hydrophilic amino acid) and their E-approximate matches
    - Iteratively find the optimal match set {t_{i,j}} based on the Bayesian scoring function f for motif discovery
    - Output the top K = 10 motifs, each with its instance set {t_{i,j}}
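
The "E-approximate matches" step can be sketched as a sliding-window search that accepts every width-W window within E substitutions (Hamming distance) of a pattern. This is a minimal illustration only: the function names and the toy sequences are assumptions, and the hydrophilic-residue filter and the Bayesian scoring function f of the customized algorithm are not reproduced here.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_matches(pattern, sequences, max_errors):
    """Return (sequence index, offset, window) for each window of len(pattern)
    that differs from pattern by at most max_errors substitutions."""
    width = len(pattern)
    hits = []
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - width + 1):
            window = seq[j:j + width]
            if hamming(pattern, window) <= max_errors:
                hits.append((i, j, window))
    return hits


# Made-up toy sequences (not real TF sequences), W = 5 and E = 1.
toy_tf_sequences = ["MKNRIAAQ", "TTNKIAAS", "GGNREAAL", "PLLLLLLP"]
for hit in approximate_matches("NRIAA", toy_tf_sequences, max_errors=1):
    print(hit)
# (0, 2, 'NRIAA')
# (1, 2, 'NKIAA')
# (2, 2, 'NREAA')
```

An exhaustive scan like this costs time proportional to the total sequence length times W per pattern, which is workable for tens of TF sequences of a few hundred residues each.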

18. Results and Analysis: Verification on Protein Data Bank (PDB)
    - PDB: the most representative database of experimentally determined protein-DNA 3D structure data
      * expensive and time consuming
      * most accurate evidence for verification
    - Check the approximate TF-TFBS rules T({t_{i,j}})-C
    - Approximate appearance in binding pairs from PDB 3D structure data: width W, bounded by E
    - TF side (R_TF): instance oriented, {t_{i,j}} evaluated
    - TFBS side (R_TF-TFBS): pattern oriented, C evaluated
    - Both ratios lie in [0,1]; the higher the better

19. Biological Verification
    - Recall the challenge: given sequence datasets of tens of TF sequences, each hundreds of AAs in length, grouped by TFBS consensus C (5~20 bp), predict W (= 5, 6) substrings ({t_{i,j}}) associated with C
    - PDB-verified example, rule NRIAA(NKIAA; NRAAA; NREAA; NRIAA)-TGACGTYA, which can be verified in actual 3D TF-TFBS binding structures as well as by homology modeling (by bio experts)!
    - Structure figure labels: NRIAA, NKIAA
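
The TFBS side of the rule, TGACGTYA, uses a nucleotide ambiguity code: under the standard IUPAC convention, Y stands for C or T. The sketch below shows how such a consensus could be matched against a DNA string. The function names and the example DNA string are illustrative assumptions, not taken from the lecture or from Transfac.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches_consensus(consensus: str, site: str) -> bool:
    """True if every base of site is allowed by the code at that position."""
    return len(consensus) == len(site) and all(
        base in IUPAC[code] for code, base in zip(consensus, site)
    )


def find_sites(consensus: str, dna: str) -> list:
    """Offsets of all windows in dna that match the consensus."""
    width = len(consensus)
    return [
        i for i in range(len(dna) - width + 1)
        if matches_consensus(consensus, dna[i:i + width])
    ]


print(matches_consensus("TGACGTYA", "TGACGTCA"))  # True  (Y matches C)
print(matches_consensus("TGACGTYA", "TGACGTGA"))  # False (Y does not match G)
print(find_sites("TGACGTYA", "CCTGACGTCAGG"))     # [2]
```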

20. Results and Analysis
    - One more verified example: M00217: ERKRR(ERKRR; ERQRR; ERRRR)-CACGTG
    - Structure shown: 1NKP

21. Results and Analysis: Quantitative Comparisons with Exact Rules
    - More informative (verified) rules: 110 vs. 76 for W=5; 88 vs. 6 for W=6
    - Improvement on exact ones (AVG R*: 29%, 46%; W=5)

22. Results and Analysis: Comparisons with MEME as TF-Side Discovery Tool
    - 73%-262% improvement on AVG R*
    - 33%-84% improvement on the R*>0 ratio
    - Customized TF core motif discovery is necessary

23. Discussion
    - For the first time, we generalize the exact TF-TFBS associated sequence patterns to approximate ones
    - The discovered approximate TF-TFBS rules show:
      - Competitive performance with respect to verification ratios (R*) on both TF and TF-TFBS aspects
      - A strong edge over exact rules and MEME results
      - The flexibility of specific positions in TF-TFBS binding (further biological verification with independent NCBI protein records!)

24. Discussion
    - A great and promising direction for further discovery of protein-DNA interactions
    Future work:
    - Formal models for whole associated TF-TFBS rules
    - Advanced search algorithms for motifs
    - Associating multiple short TF-TFBS rules
    - Handling uncertainty such as widths

25. III. Drug Discovery
    - Background: Docking vs. Synthesis
    - SmartGrow
    - Experimental Results
    - Discussion

26. Background: Drug discovery by computational docking

27. Computational Docking
    - Docking: translate and rotate the ligand; predict binding affinity
    - Tool: AutoDock Vina
    - Docking output (the Conformation column contained images):

      Rank   Free energy (kcal/mol)
      1      -7.0
      2      -6.1
      3      -6.0
      4      -5.9
      5      -5.9
      6      -5.8
      7      -5.8
      8      -5.7
      9      -5.6

28. Computational Docking
    - Search space
    - Blind docking
    - Catalytic site or allosteric site

29. Computational Synthesis
    - Virtual screening: 10^60 to 10^100 drug-like molecules
    - Synthesis strategy: grows an initial scaffold by adding fragments; synthesizes ligands that have higher binding affinities
    - Combinatorial optimization problem: selection of linker hydrogen atoms; placement of the fragment in 3D space; selection of fragments out of dozens
    - Single bond lengths: C-C 1.530 Å; N-N 1.425 Å; C-N 1.469 Å; O-O 1.469 Å
    - Genetic Algorithm (GA)

30. Computational Synthesis: AutoGrow (GA based)
    - Mutation
    - Crossover

31. Motivations: Disadvantages of AutoGrow
    Functionally:
    - Can only form single bonds; no way to form double bonds or join rings
    - No support for P and 2-letter elements, e.g. Cl, Br (examples: ATP, Etravirine)
    - Drug-like properties ignored: ligands become excessively large and not absorbable
    Computationally:
    - Extremely slow, > 3 h for one run on an 8-core PC
    - Fails to run under Windows

32. Objectives: Development of SmartGrow
    Functionally:
    - Ligand diversity: split, merging, ring joining
    - Support for P and Cl, Br
    - Druggability testing: Lipinski’s rule of five
    Computationally:
    - > 20% faster: C++ over Java
    - Cross platform: Linux and Windows
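
Lipinski's rule of five, named above as the druggability test, flags a compound as unlikely to be orally absorbed when it violates more than one of four criteria: molecular weight over 500 Da, logP over 5, more than 5 hydrogen-bond donors, or more than 10 hydrogen-bond acceptors. The sketch below is a generic check under that textbook definition. How SmartGrow applies the rule internally is not described in this transcript, and the class name, function names and descriptor values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ligand:
    name: str
    molecular_weight: float   # Da
    log_p: float              # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(ligand: Ligand) -> int:
    """Count how many of Lipinski's four criteria the ligand violates."""
    return sum([
        ligand.molecular_weight > 500,
        ligand.log_p > 5,
        ligand.h_bond_donors > 5,
        ligand.h_bond_acceptors > 10,
    ])


def passes_rule_of_five(ligand: Ligand, max_violations: int = 1) -> bool:
    """Accept ligands with at most max_violations violations."""
    return lipinski_violations(ligand) <= max_violations


# Hypothetical descriptor values, chosen only to exercise the checks.
small = Ligand("toy-small", 392.0, 3.1, 2, 6)
large = Ligand("toy-large", 683.0, 6.4, 3, 12)
print(passes_rule_of_five(small))  # True  (0 violations)
print(passes_rule_of_five(large))  # False (3 violations)
```

In a growth loop, a check of this kind would typically run on each candidate before docking, so that oversized ligands are discarded before the expensive scoring step.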

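33. Flowchart
    Recoverable labels from the flowchart figure:
    - Computational docking or scoring only
    - Number of generations
    - Weighted sum of molecular weights
    - Excessively large: Yes
    - Ranked energies (kcal/mol): 1: -7.0; 2: -6.1; 3: -6.0; 4: -5.9; 5: -5.9; 6: -5.8; 7: -5.8; 8: -5.7; 9: -5.6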

34. Genetic Operators
    - Mutation
    - Crossover
    (Figure: ligands labelled I, A, B, C, D)

35. Genetic Operators
    - Split
    - Merging
    (Figure: ligands labelled I, A, B, C)

36. Experimental Results
    - Data Preparation
    - Experiment Settings
    - Results and Comparisons

37. Data Preparation
    - 3 proteins from PDB (associated diseases: AD, AIDS, AIDS)
    - 8 unique initial ligands: 3 from PDB complexes, 5 from ZINC
    - Fragment library: small-fragment library provided by AutoGrow

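38. 18 Test Cases
    The three groups of six correspond to the three proteins (GSK3β, HIV RT, HIV PR), as the per-protein results on slides 41 to 43 confirm.

    Protein   Initial ligand   Free energy (kcal/mol)   Molecular weight (Da)
    GSK3β     TRS              -3.8                     122
    GSK3β     ZINC01019824     -6.9                     194
    GSK3β     ZINC08442219     -6.5                     224
    GSK3β     ZINC09365179     -8.3                     278
    GSK3β     ZINC18153302     -4.2                     142
    GSK3β     ZINC20030231     -5.6                     209
    HIV RT    T27              -13.9                    373
    HIV RT    ZINC01019824     -9.6                     194
    HIV RT    ZINC08442219     -9.1                     224
    HIV RT    ZINC09365179     -11.0                    278
    HIV RT    ZINC18153302     -5.8                     142
    HIV RT    ZINC20030231     -7.1                     209
    HIV PR    4DX              -4.0                     114
    HIV PR    ZINC01019824     -5.9                     194
    HIV PR    ZINC08442219     -5.6                     224
    HIV PR    ZINC09365179     -6.4                     278
    HIV PR    ZINC18153302     -4.1                     142
    HIV PR    ZINC20030231     -4.9                     209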

39. Fragment Library
    - Small-fragment library provided by AutoGrow
    - 46 fragments
    - 3 to 15 atoms per fragment
    - Average 9.6 atoms, standard deviation 2.8

40. Parameter Settings
    - Hardware: Dual Xeon Quad Core 2.4 GHz, 32 GB RAM, Ubuntu
    - AutoGrow: 18 test cases × 9 runs × 3.0 h = 486 h
    - SmartGrow: 18 test cases × 9 runs × 2.4 h = 388 h

    Parameter                    AutoGrow   SmartGrow
    Number of elitists           10         10
    Number of children           20         20
    Number of mutants            20         20
    Number of generations        8          24
    Docking frequency            1          3
    Max number of atoms          80         80
    Run time for one test case   3.0 h      2.4 h
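
The run-time totals on this slide follow from a simple product, which the sketch below recomputes from the per-run figures. With the rounded 2.4 h average, the SmartGrow product comes to 388.8 h, so the slide's 388 h was presumably truncated or derived from unrounded timings. The variable names are illustrative.

```python
test_cases = 18
runs_per_case = 9
hours_per_run = {"AutoGrow": 3.0, "SmartGrow": 2.4}  # averages from the slide

# Total compute time per tool: test cases x runs x hours per run.
total_hours = {
    tool: test_cases * runs_per_case * hours
    for tool, hours in hours_per_run.items()
}

# Relative reduction in per-run time of SmartGrow compared with AutoGrow.
reduction = 1 - hours_per_run["SmartGrow"] / hours_per_run["AutoGrow"]

for tool, hours in total_hours.items():
    print(f"{tool}: {hours:.1f} h")
print(f"Per-run time reduction: {reduction:.0%}")
```

The 20% reduction obtained from the averages matches the lower end of the "20% ~ 30% faster" range reported later in the deck.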

41. Results: GSK3β-ZINC01019824 (free energy in kcal/mol, molecular weight in Da)
    - Initial ligand: -6.9, 194
    - AutoGrow: -11.9, 572
    - SmartGrow: -11.2, 505

42. Results: HIV RT-ZINC08442219 (free energy in kcal/mol, molecular weight in Da)
    - Initial ligand: -9.1, 224
    - AutoGrow: -11.3, 433
    - SmartGrow: -11.8, 392

43. Results: HIV PR-ZINC20030231 (free energy in kcal/mol, molecular weight in Da)
    - Initial ligand: -4.9, 209
    - AutoGrow: -7.3, 683
    - SmartGrow: -7.5, 489

44. Results: GSK3β

45. Results: HIV RT

46. Results: HIV PR

47. Results: Execution Time
    (Figure label: 30%)

48. Results: Handling Phosphorus (P)
    - Initial ligand
    - Synthesized ligand by SmartGrow

49. Discussion: SmartGrow
    Functionally:
    - An efficient tool for computational synthesis of potent ligands for drug discovery
    - Enriched ligand diversity: split, merging, ring joining
    - Support for P and Cl, Br
    - Druggability testing: low molecular weight, < 500 Da
    Computationally:
    - 20% ~ 30% faster, avg. 2.4 h for one run
    - Cross platform: Linux and Windows

50. Future Improvements
    - Integrate SmartGrow into AutoDock Vina: receptor structure; uniform interface; file I/O reduced
    - ADMET: Absorption, Distribution, Metabolism, Excretion, and Toxicity
    - Parallelization by GPU hardware
    - Web interface
    - Real-life applications: Alzheimer’s Disease, HIV/AIDS, HBV (liver cancer)

51. IV. Discussion and Conclusion
    - Summary
    - Discussion

52. Summary
    In this lecture:
    - A brief introduction to Bioinformatics research problems
    - Discovering approximate protein-DNA interaction sequence patterns for better understanding of gene regulation (the essential control mechanisms of life)
    - Drug synthesis: synthesizing drug candidates and optimizing the conformations of 3D protein-ligand interactions effectively and efficiently with computers
    - Encouraging results have been achieved and promising directions have been pointed out

53. Discussion
    - Bioinformatics is becoming more and more important in life sciences and biomedical applications
    - Most computational fields (ranging from string algorithms to graphics) have applications in Bioinformatics
    - Still a long way to go (strong potential to explore):
      - Massive data are available but annotations are still limited
      - Lack of full knowledge of many biological mechanisms
      - Biological systems are very complicated and stochastic

54. The End. Thank you! Q&A

55. Appendix
    - II: Results and Analysis: Statistical Significance
    - III: The details for the 3 proteins and the 8 ligands used in the experiments

56. II: Results and Analysis: Statistical Significance (W=5)
    - Simulated on over 100,000 rules for each setting
    - The majority (64%-79%) of R_TF-TFBS values are statistically significant
    - For E=0, although 0.05 < p(R_TF ≥ 1) < 0.07, the majority (74%-82%) achieve the best possible p-values

57. III: 3 Proteins: Glycogen synthase kinase 3 beta (GSK3β)
    - Diseases: Alzheimer's disease (AD), Type-2 diabetes
    - Source: Protein Data Bank

58. 3 Proteins: HIV reverse transcriptase (HIV RT)
    - Disease: AIDS
    - Source: Protein Data Bank

59. 3 Proteins: HIV protease (HIV PR)
    - Disease: AIDS
    - Source: Protein Data Bank

60. 8 Unique Initial Ligands
    - TRS: 122 Da
    - T27: 373 Da
    - ZINC01019824: 194 Da
    - 4DX: 114 Da
    - ZINC08442219: 224 Da
    - ZINC09365179: 278 Da
    - ZINC20030231: 209 Da
    - ZINC18153302: 142 Da