Selecting genomics assays - PowerPoint Presentation

Uploaded On 2023-07-08

William Stafford Noble, Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington. ENCODE didn't study my favorite cell type. We can't do all possible genomics assays. ID: 1006739

Presentation Transcript

1. Selecting genomics assays. William Stafford Noble, Department of Genome Sciences, Department of Computer Science and Engineering, University of Washington.

2. ENCODE didn’t study my favorite cell type!

3. We can't do all possible genomics assays. [Figure: grid of cell types (Heart, Liver, Leukemia, …) by assay types (DNase1, H3K4me3, CTCF).] As of March 2016, the ENCODE and Roadmap Epigenomics Consortia have generated 3373 human data sets.

4. Which assays should we perform on a new cell type? A high-quality panel should measure all types of activity at minimal total cost. [Figure: grid of cell types by assay types, with a new cell type added.]

5. We are interested in choosing representative subsets: choosing which genomics assays to perform on a new cell type; selecting a representative set of existing genomics data sets; selecting a representative set of protein sequences.

6. An optimization framework for representative set selection. Choosing diverse panels of genomics assays. Removing redundancy in protein sequence data sets.

7. The team: Jeff Bilmes, Max Libbrecht, Kai Wei.

8. Submodular optimization provides a formal framework for selecting representative subsets. Continuous optimization : Convexity :: Discrete optimization : ? Options: Semidefiniteness, Submodularity, Differentiability, Perspicacity.

9. A set function assigns a real number to every subset of a given set. [Diagram labels: ground set, set function, quality of the subset.]

10. How long will it take to find the highest quality subset? $O(2^n)$
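To make the cost concrete, here is a minimal sketch of exhaustive search. It assumes a facility location quality function (introduced later in the talk) and a made-up 5-item similarity matrix; `SIM`, `facility_location`, and `best_subset_brute_force` are illustrative names, not from the slides.

```python
from itertools import combinations

def facility_location(sim, subset):
    """Sum, over all items, of the similarity to the closest chosen item."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)

def best_subset_brute_force(sim, k):
    """Score every size-k subset and return the best one with its value."""
    best = max(combinations(range(len(sim)), k),
               key=lambda s: facility_location(sim, s))
    return best, facility_location(sim, best)

# Toy symmetric similarity matrix (made-up values, not real assay data).
SIM = [
    [1.0,  0.75, 0.5,  0.25, 0.0],
    [0.75, 1.0,  0.25, 0.0,  0.0],
    [0.5,  0.25, 1.0,  0.0,  0.0],
    [0.25, 0.0,  0.0,  1.0,  0.125],
    [0.0,  0.0,  0.0,  0.125, 1.0],
]

best_pair, best_value = best_subset_brute_force(SIM, 2)
n_subsets = 2 ** len(SIM)  # number of subsets of the ground set, of any size
```

With only 5 items there are 32 subsets; the count doubles with every added item, which is why exhaustive search is not an option for realistic ground sets.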

11. Submodular set functions satisfy the property of diminishing returns: the gain from adding v to A ∪ B is at most the gain from adding v to A. $f(A \cup B \cup \{v\}) - f(A \cup B) \le f(A \cup \{v\}) - f(A)$
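The diminishing returns property can be checked exhaustively on a small ground set. The sketch below is illustrative only: the similarity matrix is made up, and `has_diminishing_returns` is a hypothetical helper that compares the gain of every item v on every nested pair of sets.

```python
from itertools import combinations

def facility_location(sim, subset):
    """Sum, over all items, of the similarity to the closest chosen item."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)

def powerset(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def has_diminishing_returns(f, ground):
    """True if gain(v | small) >= gain(v | big) for every small <= big and v outside big."""
    for big in powerset(ground):
        for small in powerset(big):
            for v in ground - big:
                gain_small = f(small | {v}) - f(small)
                gain_big = f(big | {v}) - f(big)
                if gain_big > gain_small + 1e-12:
                    return False
    return True

# Toy symmetric similarity matrix (made-up values).
SIM = [
    [1.0,  0.75, 0.5,  0.25, 0.0],
    [0.75, 1.0,  0.25, 0.0,  0.0],
    [0.5,  0.25, 1.0,  0.0,  0.0],
    [0.25, 0.0,  0.0,  1.0,  0.125],
    [0.0,  0.0,  0.0,  0.125, 1.0],
]
GROUND = frozenset(range(len(SIM)))

fl_is_submodular = has_diminishing_returns(lambda s: facility_location(SIM, s), GROUND)
square_is_submodular = has_diminishing_returns(lambda s: len(s) ** 2, GROUND)
```

Facility location passes the check on this matrix, while the squared set size fails it because each added item is worth more than the previous one.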

12. A greedy algorithm maximizes a submodular function within a constant factor of optimal.
    Initialize an empty representative set A.
    For 1…k: add the item v that most improves the quality of A.
    Guaranteed to produce a solution within a factor of 1-1/e ≈ 0.63 of optimal (Nemhauser Mathematical Programming 1978).
    Best possible approximation ratio unless P=NP (Feige SIAM 1998).
    Can run in practically O(k log n) time (Minoux Optimization Techniques 1978).
    Key result!
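The greedy loop above can be sketched in a few lines. This is the plain version, which re-evaluates every remaining item in each round; it is not the accelerated variant cited from Minoux. The 1-1/e guarantee applies when the set function is monotone as well as submodular, which holds for the facility location function used here on a made-up similarity matrix. All names are illustrative.

```python
def facility_location(sim, subset):
    """Sum, over all items, of the similarity to the closest chosen item."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)

def greedy_maximize(f, ground, k):
    """Pick k items, each time adding the one with the largest gain in f."""
    selected = []
    for _ in range(min(k, len(ground))):
        current = f(selected)
        remaining = [v for v in ground if v not in selected]
        best_item = max(remaining, key=lambda v: f(selected + [v]) - current)
        selected.append(best_item)
    return selected

# Toy symmetric similarity matrix (made-up values).
SIM = [
    [1.0,  0.75, 0.5,  0.25, 0.0],
    [0.75, 1.0,  0.25, 0.0,  0.0],
    [0.5,  0.25, 1.0,  0.0,  0.0],
    [0.25, 0.0,  0.0,  1.0,  0.125],
    [0.0,  0.0,  0.0,  0.125, 1.0],
]

panel = greedy_maximize(lambda s: facility_location(SIM, s), range(len(SIM)), 3)
panel_value = facility_location(SIM, panel)
```

On this toy matrix the first pick is the item most similar to everything else, and later picks go to items that are poorly represented so far.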

13. A submodular optimization framework for representative set selection. Choosing diverse panels of genomics assays. Removing redundancy in protein sequence data sets.

14. Problem: What panel of genomics assays should we perform on a new cell type? [Figure: data indexed by assay type, cell type, and genomic position; assay similarities and a budget (Budget = 3) feed a selection procedure that outputs a panel for the new cell type. Panel: H3K4me3, H3K79me2, H3K9me3.]

15. Estimate the similarity between pairs of assay types via correlation, computed over the cell types where both have been performed.
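A minimal sketch of this similarity estimate follows. It assumes each data set is a signal vector over the same genomic positions and that per-cell-type Pearson correlations are averaged; the averaging step and the data layout are assumptions, and the signal values are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant signal vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def assay_similarity(signals, assay_a, assay_b):
    """Mean correlation over cell types where both assays exist, or None if there are none."""
    shared = sorted(cell for (cell, assay) in signals
                    if assay == assay_a and (cell, assay_b) in signals)
    if not shared:
        return None
    total = sum(pearson(signals[cell, assay_a], signals[cell, assay_b])
                for cell in shared)
    return total / len(shared)

# Toy signals keyed by (cell type, assay type); values are invented.
SIGNALS = {
    ("Liver", "H3K4me3"): [1.0, 2.0, 3.0, 4.0],
    ("Liver", "H3K9ac"):  [2.0, 4.0, 6.0, 8.0],
    ("Heart", "H3K4me3"): [1.0, 2.0, 3.0, 4.0],
    ("Heart", "H3K9ac"):  [4.0, 3.0, 2.0, 1.0],
    ("Heart", "CTCF"):    [1.0, 2.0, 3.0, 4.0],
}

sim_two_cells = assay_similarity(SIGNALS, "H3K4me3", "H3K9ac")  # Liver and Heart
sim_one_cell = assay_similarity(SIGNALS, "H3K4me3", "CTCF")     # Heart only
sim_none = assay_similarity(SIGNALS, "H3K9ac", "DNase1")        # no shared cell type
```

Restricting to shared cell types means the similarity can be estimated even when the assay-by-cell-type grid is mostly empty.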

16. The facility location problem has a submodular objective

17. Facility location function = sum, over all items, of each item's similarity to its most-similar representative.

18. The quality of a panel of genomics assays is estimated via facility location. Similarity of two assay types: [formula not preserved; computed over the cell types where both have been performed]. Facility location objective function: [formula not preserved; defined over the set of all assay types].
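The slide's formulas did not survive extraction. One way to write them, consistent with the verbal definitions on the surrounding slides, is sketched below. The symbols V, A, C_{ab}, and x_{c,a} are assumed notation, and averaging the per-cell-type correlations is an assumption rather than something the transcript states.

```latex
% Assumed notation: V = set of all assay types, A \subseteq V = the chosen panel,
% C_{ab} = cell types where both assays a and b have been performed,
% x_{c,a} = signal of assay a in cell type c along the genome.
\[
  \operatorname{sim}(a, b) = \frac{1}{|C_{ab}|} \sum_{c \in C_{ab}}
      \operatorname{corr}\bigl(x_{c,a},\, x_{c,b}\bigr)
\]
\[
  f(A) = \sum_{v \in V} \max_{a \in A} \operatorname{sim}(v, a)
\]
```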

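19. Submodular selection of assays (SSA) selects diverse panels. [Figure: 1 MB genome browser view with a Genes track and signal tracks for H3K27ac, H3K9ac, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H2A.Z, H3K36me3, H4K20me1, H3K9me3, and H3K27me3; ► marks H3K4me3, H3K79me2, H3K36me3, H3K9me3, and H3K27me3.]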

20. SSA recapitulates the panel manually selected by the Roadmap Epigenomics consortium. [Figure label: Chosen by Roadmap.]

21. The submodular function measures the quality of each panel. H3K27ac is redundant.

Left table:
rank  assay      conditional gain  modular value  function value
1     H3K4me3    3.183             3.183          3.183
2     H3K79me2   0.971             2.403          4.154
3     H3K9me3    0.34              0.703          4.494
4     H3K27me3   0.246             0.862          4.739
5     H3K36me3   0.227             1.214          4.966
6     H3K4me1    0.179             1.864          5.145
7     H3K4me2    0.083             3.122          5.227
8     H3K9ac     0.074             3.128          5.301
9     H2A.Z      0.05              2.472          5.352
10    H4K20me1   0.006             1.412          5.357
11    H3K27ac    0                 2.615          5.357

Right table:
rank  assay      conditional gain  modular value  function value
0     H3K27ac    2.615             2.615          2.615
1     H3K79me2   0.836             2.403          3.451
2     H3K4me3    0.788             3.183          4.239
3     H3K9me3    0.34              0.703          4.579
4     H3K27me3   0.246             0.862          4.824
5     H3K36me3   0.227             1.214          5.051
6     H3K4me1    0.153             1.864          5.203
7     H3K4me2    0.083             3.122          5.286
8     H2A.Z      0.05              2.472          5.336
9     H3K9ac     0.015             3.128          5.352
10    H4K20me1   0.006             1.412          5.357

22. What happens if we start with H3K27ac? The resulting panels are lower quality.

Left table:
rank  assay      conditional gain  modular value  function value
1     H3K4me3    3.183             3.183          3.183
2     H3K79me2   0.971             2.403          4.154
3     H3K9me3    0.34              0.703          4.494
4     H3K27me3   0.246             0.862          4.739
5     H3K36me3   0.227             1.214          4.966
6     H3K4me1    0.179             1.864          5.145
7     H3K4me2    0.083             3.122          5.227
8     H3K9ac     0.074             3.128          5.301
9     H2A.Z      0.05              2.472          5.352
10    H4K20me1   0.006             1.412          5.357
11    H3K27ac    0                 2.615          5.357

Right table:
rank  assay      conditional gain  modular value  function value
0     H3K27ac    2.615             2.615          2.615
1     H3K79me2   0.836             2.403          3.451
2     H3K4me3    0.788             3.183          4.239
3     H3K9me3    0.34              0.703          4.579
4     H3K27me3   0.246             0.862          4.824
5     H3K36me3   0.227             1.214          5.051
6     H3K4me1    0.153             1.864          5.203
7     H3K4me2    0.083             3.122          5.286
8     H2A.Z      0.05              2.472          5.336
9     H3K9ac     0.015             3.128          5.352
10    H4K20me1   0.006             1.412          5.357
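The three numeric columns can be read as follows, judging from the values themselves: the conditional gain is the increase in the objective when the assay is added to the panel built so far, the modular value is the objective of that assay on its own, and the function value is the running objective. The sketch below reproduces this bookkeeping on a made-up similarity matrix, with an optional forced first pick to mimic starting from a fixed assay. `greedy_trace` and the item names are hypothetical.

```python
def greedy_trace(sim, names, first=None):
    """Return (name, conditional gain, modular value, function value) per greedy step."""
    n = len(sim)
    best = [0.0] * n  # best[i]: similarity of item i to its closest chosen item
    chosen, rows, total = [], [], 0.0

    def gain(v):
        # Increase in the facility location objective if v joins the panel.
        return sum(max(0.0, sim[i][v] - best[i]) for i in range(n))

    while len(chosen) < n:
        remaining = [v for v in range(n) if v not in chosen]
        if first is not None and not chosen:
            v = first  # forced starting item
        else:
            v = max(remaining, key=gain)
        g = gain(v)
        total += g
        modular = sum(sim[i][v] for i in range(n))  # objective of {v} alone
        rows.append((names[v], g, modular, total))
        chosen.append(v)
        best = [max(best[i], sim[i][v]) for i in range(n)]
    return rows

# Toy symmetric similarity matrix (made-up values, not the assay similarities).
SIM = [
    [1.0,  0.75, 0.5,  0.25, 0.0],
    [0.75, 1.0,  0.25, 0.0,  0.0],
    [0.5,  0.25, 1.0,  0.0,  0.0],
    [0.25, 0.0,  0.0,  1.0,  0.125],
    [0.0,  0.0,  0.0,  0.125, 1.0],
]
NAMES = ["a", "b", "c", "d", "e"]

greedy_rows = greedy_trace(SIM, NAMES)
forced_rows = greedy_trace(SIM, NAMES, first=1)  # start from item "b"
```

Both runs end at the same function value once every item is included, but the forced start gives a lower function value at small panel sizes, which is the pattern the two tables show.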

23. Selecting TFs in diverse regulatory pathways

24. How can we objectively evaluate the quality of a panel of genomics assays? Solution: Evaluate the panel's performance on the most common downstream applications of a panel: impute the results of assays outside the panel; identify functional elements (e.g. promoters, enhancers); use the panel for semi-automated genome annotation.

25. Evaluation involves holding out some data and comparing to the representative set.

26. Imputing missing data

27. Functional element prediction

28. Semi-automated genome annotation

29. The facility location objective function is correlated with the evaluation metrics. Panels of size 5.

30. SSA selects panels that are high quality according to all three evaluation metrics. [Figure, cell type GM12878. Panels: assay imputation, functional element prediction, annotation-based evaluation. Axes: panel size, normalized performance.]

31. SSA performs well on various cell types. Vertical axis is percentile over 40 random choices.

32. Problem variant: Which existing assays should you use for a computationally expensive analysis? Other variants: “Given these assays, which experiment should I do next?” “Which cell types should I use?” “What panel should I use for multiple cell types?” [Figure labels: existing data, missing data.]

33. Using information about the target cell type (SSA-past) usually increases performance

34. A submodular optimization framework for representative set selection. Choosing diverse panels of genomics assays. Removing redundancy in protein sequence data sets.

35. Many people want to find representative sets of protein sequences. >4,000 citations (Google Scholar).

36. A slightly smaller team: Jeff Bilmes, Max Libbrecht.

37. Existing solutions use a heuristic algorithm:
    Sort sequences in decreasing order of length.
    Initialize representative set A as empty.
    For each sequence s: if the most similar sequence to s in A has similarity < T (e.g. 40% identity), add s to A.
    Examples: CD-HIT (Li Bioinformatics 2006) and Pisces (Wang Bioinformatics 2003).
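The heuristic above can be sketched as follows. This follows only the steps listed on the slide and is not the implementation of CD-HIT or Pisces; in particular, `toy_identity` is a deliberately crude stand-in for alignment-based sequence identity, and the sequences are invented.

```python
def toy_identity(a, b):
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / shorter

def threshold_selection(seqs, similarity, threshold=0.4):
    """Keep a sequence only if it is below the threshold to every kept sequence."""
    representatives = []
    for s in sorted(seqs, key=len, reverse=True):  # longest first
        if all(similarity(s, r) < threshold for r in representatives):
            representatives.append(s)
    return representatives

# Invented sequences: the second is a truncated copy of the first.
SEQS = ["MKTAYIAKQR", "MKTAYIAKQ", "GSHMLEDPV"]
reps = threshold_selection(SEQS, toy_identity, threshold=0.4)
```

Because each sequence is compared only against representatives chosen earlier, the result depends on the processing order, which is one reason to consider an explicit objective function instead.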

38. We need a submodular estimate of the quality of a subset of protein sequences. We tested 29 other objective functions on a development set (21% of the data). Sum-redundancy objective function: [formula not preserved]. Approaches zero when no sequences are similar to each other.

39. Optimizing sum-redundancy effectively removes redundancy in the representative set. [Figure labels: CD-HIT 40%, CD-HIT 90%.]

40. Optimizing sum-redundancy chooses very few sequences with detectable similarity, at the cost of choosing a small number of closely related pairs. [Figure label: 40%.]

41. Selection using submodular optimization chooses proteins with diverse structures. SCOPe: “Structural Classification of Proteins—extended”. SCOPe assigns protein domain sequences to protein structure families. Quality of a subset = number of families it covers. [Figure: projection via t-SNE (Van der Maaten JMLR 2008), with family labels "Toxin's membrane translocation domains", "Bacterial photosystem II reaction center", "Leukocidin-like", "Aquaporin-like", "Family A G protein-coupled receptor-like", "Clc chloride channel", "Transmembrane beta-barrels; Porin", "Single transmembrane helix", "Voltage-gated potassium channels", "Transmembrane beta barrels; Ligand-gated protein channel".]
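The coverage measure defined above is simple to compute once each domain has a family label. In the sketch below the domain identifiers and the mapping `FAMILY_OF` are made up for illustration; only the family names are taken from the slide.

```python
def families_covered(subset, family_of):
    """Set of families that have at least one member in the subset."""
    return {family_of[domain] for domain in subset}

def coverage_quality(subset, family_of):
    """Quality of a subset = number of families it covers."""
    return len(families_covered(subset, family_of))

def uncovered_families(subset, family_of):
    """Families with no member in the subset."""
    return set(family_of.values()) - families_covered(subset, family_of)

# Hypothetical domain-to-family assignments.
FAMILY_OF = {
    "dom1": "Aquaporin-like",
    "dom2": "Aquaporin-like",
    "dom3": "Leukocidin-like",
    "dom4": "Single transmembrane helix",
}

redundant_quality = coverage_quality(["dom1", "dom2"], FAMILY_OF)
diverse_quality = coverage_quality(["dom1", "dom3"], FAMILY_OF)
still_uncovered = uncovered_families(["dom1", "dom3"], FAMILY_OF)
```

Two subsets of the same size can differ in quality under this measure, and the set of uncovered families is what the CD-HIT comparison later in the talk counts.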

42. Optimizing sum-redundancy results in representative sets with high structural diversity

43. Optimizing sum-redundancy results in representative sets with high structural diversity. Compared to CD-HIT with 40% threshold, we leave 27% fewer families uncovered.

44. Using a different objective and similarity measure results in better performance for small subsets. Facility-location objective function: [formula not preserved]. Rankprop similarity function: [formula not preserved].

45. Using a different objective and similarity measure results in better performance for small subsets. Facility-location objective function: [formula not preserved]. Rankprop similarity function: [formula not preserved] (Weston PNAS 2004).

46. A mixture of objective functions performs better than any single function

47. What data sets do you want to summarize?
