Slide1
An Ensemble SVM Model for the Accurate Prediction of Non-Canonical MicroRNA Targets
Asish Ghoshal¹, Ananth Grama¹, Saurabh Bagchi², Somali Chaterji¹
Purdue University, West Lafayette, IN
Slide2
MicroRNA (miRNA): “The genome’s guiding hand?”
miRNAs are approximately 22-nucleotide (nt) strands of RNA that base-pair with messenger RNA (mRNA) to cause mRNA degradation or translational repression
Can be thought of as biology’s dark matter: small regulatory RNAs that are abundant and encoded in the genome
Dysregulation of miRNA may contribute to diverse diseases
Canonical (i.e., exact) matches involve the miRNA’s seed region (nt 2-7) and the 3’ untranslated region (UTR) of mRNA, and were thought to be the only form of interaction
Recent high-throughput experimental studies have indicated a high preponderance of “non-canonical” miRNA targets
Slide3
Background
[Figure labels: microRNA, mRNA, Argonaute, RISC (complex)]
Slide4
We are attempting computational prediction of miRNA-mRNA interactions
Challenging because of:
  Large number of features in miRNAs and mRNAs
  Noisy and incomplete data sets from experimental approaches
  Wide variety of positive miRNA-mRNA interactions
Prior computational approaches are deterministic and rely on perfect complementarity of the miRNA and mRNA nucleotides
“Avishkar”: Computational miRNA target prediction
Slide5
Background: CLIP Technology and Non-Canonical miRNA-mRNA Interactions
CLIP (crosslinking and immunoprecipitation) is an experimental method to map the binding sites of RNA-binding proteins across the transcriptome. Proteins are crosslinked to RNA using ultraviolet light, and an antibody is used to specifically isolate the RNA-binding protein of interest together with its RNA interaction partners, which are then sequenced.
Slide6
Our Contributions
The Avishkar suite comprises 4 different models: a global linear SVM model, a global non-linear model, an ensemble linear model, and an ensemble non-linear model (radial basis function kernel)
Feature computation and transformation
  Spatial features at two resolutions
  Seed enrichment metric
The linear classifier had a moderate amount of bias, so we switched to a non-linear, kernel-SVM classifier. This achieves a TPR of 76% and an FPR of 20%, with the AUC (ROC curve) for the ensemble non-linear model being 20% higher than for the simple linear model.
This is an improvement of over 150% in the TPR over the best competitive protocol.
Since training kernel SVMs is computationally expensive, we provide a general-purpose and efficient implementation of Cascade SVM on top of Apache Spark: https://bitbucket.org/cellsandmachines/kernelsvmspark
Hans Peter Graf, Eric Cosatto, Leon Bottou, Igor Durdanovic, Vladimir Vapnik. Parallel Support Vector Machines: The Cascade SVM. NIPS 2005.
Slide7
Problem Statement
Predict if a miRNA targets an mRNA (segment)
[Figure: miRNAs and mRNA segments, with an edge marked “1 or 0 ?”]
Slide8
Problem Statement
Ideally we would want some experimentally verified edge labels to train on.
[Figure: miRNAs and mRNA segments; edges with ground truth labels (1, 0, 1) and edges whose labels have to be predicted (“1 or 0 ?”)]
Slide9
Problem Statement
We have labels on vertices from CLIP-Seq experiments, i.e., which mRNA segments were targeted.
We do not know which miRNA targeted a particular mRNA segment.
[Figure: miRNAs and mRNA segments; the mRNA segments carry labels (1, 0, 1, 0) and an edge is marked “1 or 0 ?”]
Slide10
Methods: Generating Ground Truth Data
Consider only the most highly expressed miRNAs
  20 in humans, 10 in mouse
Add an edge if
  the binding between a miRNA and mRNA segment is strong enough, i.e., ΔG is below a certain threshold, or
  there is at least a 6-mer seed match between the mRNA segment and miRNA.
[Figure: miRNAs and mRNA segments; the mRNA segments carry labels (1, 0, 1, 0)]
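Slide11
Methods: Generating Ground Truth Data
Label the edge 1 (or 0) if it is incident on an mRNA segment that has label 1 (or 0).
[Figure: miRNAs and mRNA segments; the mRNA segments carry labels (1, 0, 1, 0) and each edge carries the label of the segment it touches (1, 1, 0, 0, 0, 1)]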
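The two ground-truth steps (adding candidate edges, then labeling them) can be sketched in a few lines of Python. This is a minimal illustration, not the Avishkar code. It assumes that ΔG is supplied by an external RNA duplex-folding tool, that the threshold is a placeholder chosen by the user, and that a 6-mer seed match means the segment contains the reverse complement of miRNA nt 2-7. All function names are hypothetical.

```python
# Watson-Crick complements for RNA.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def seed_site(mirna, start=2, end=7):
    """mRNA sequence (5'->3') that pairs perfectly with miRNA nt start..end."""
    seed = mirna[start - 1:end]
    return "".join(COMPLEMENT[nt] for nt in reversed(seed))

def is_candidate_edge(mirna, segment, delta_g, delta_g_threshold):
    """Add an edge if binding is strong enough OR a 6-mer seed match exists.

    delta_g is assumed to be precomputed elsewhere; delta_g_threshold is a
    placeholder, not a value taken from the slides.
    """
    return delta_g < delta_g_threshold or seed_site(mirna) in segment

def label_edges(edges, segment_labels):
    """Give each (mirna_id, segment_id) edge the label of its mRNA segment."""
    return {(m, s): segment_labels[s] for m, s in edges}
```

Because an edge inherits the label of its segment, every candidate miRNA touching a positive segment is marked positive, even though only some of them may have been the actual binder.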
Slide12
Methods: Schematic of Features (Feature Engineering)
The alternating blue and green regions denote the thirteen consecutive windows around the miRNA target site (red). These are the windows where the average thermodynamic and sequence features are computed.
We hypothesize that the curves are different for the positive and negative samples.
The miRNA seed region is the hexameric sequence at positions 2-7 from the 5’ end of the miRNA.
Slide13
Methods: Capturing Spatial Interaction using Smooth Curves
Compute thermodynamic interaction profile upstream and downstream of the target region
  Data is collected for fixed-size windows on both sides of the target region
  Averaging is done within each window
Compute smooth curves to remove noise in the biological data set
  Used B-spline interpolation for smoothing out the points
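The windowing step can be sketched as follows. This is an illustration under stated assumptions: the per-position feature values (for example ΔG) are already available as a list, the target site is a half-open interval of positions, and the layout of upstream windows, the site itself, and downstream windows is a guess at how the thirteen windows are arranged. The function name and parameters are hypothetical.

```python
def window_averages(values, site_start, site_end, window, per_side):
    """Average `values` over consecutive windows around [site_start, site_end).

    Returns per_side upstream windows, the site itself, then per_side
    downstream windows. Windows are clipped to the sequence; a window that
    falls entirely outside it yields None.
    """
    bounds = []
    for j in range(per_side, 0, -1):           # upstream, farthest first
        bounds.append((site_start - j * window, site_start - (j - 1) * window))
    bounds.append((site_start, site_end))       # the target site itself
    for j in range(per_side):                   # downstream, nearest first
        bounds.append((site_end + j * window, site_end + (j + 1) * window))

    averages = []
    for lo, hi in bounds:
        chunk = values[max(lo, 0):max(hi, 0)]   # slicing clips the upper end
        averages.append(sum(chunk) / len(chunk) if chunk else None)
    return averages
```

With `per_side=6` the function returns thirteen values, one per window, which is the kind of short profile that a smoothing curve can then be fitted to.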
Slide14
Methods: Capturing Spatial Interaction using Smooth Curves
Compute interaction profiles for features
  Binding energy ΔG
  Accessibility ΔΔG
  Local AU content
Compute interaction profiles at two different resolutions
  Window size of 40 and using the entire miRNA: “site” curves
  Window size of 9 and only using the seed region of the miRNA: “seed” curves
Use coefficients of B-spline basis functions as features for classifier
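One way to turn a windowed profile into B-spline coefficients is a least-squares fit against a fixed basis. The sketch below is pure Python and makes several assumptions that are not in the slides: a cubic basis with 6 functions, clamped uniform knots on [0, 1], evenly spaced window positions, and a least-squares fit rather than exact interpolation. The function names are hypothetical.

```python
def clamped_knots(n_basis, degree):
    """Uniform knot vector on [0, 1] with repeated end knots."""
    n_interior = n_basis - degree - 1
    interior = [(j + 1) / (n_interior + 1) for j in range(n_interior)]
    return [0.0] * (degree + 1) + interior + [1.0] * (degree + 1)

def bspline_basis(i, k, t, knots):
    """Value at t of the i-th degree-k B-spline basis function (Cox-de Boor)."""
    if k == 0:
        if knots[i] <= t < knots[i + 1]:
            return 1.0
        # Close the last non-empty span on the right so t == 1.0 is covered.
        if t == knots[-1] and knots[i] < knots[i + 1] == knots[-1]:
            return 1.0
        return 0.0
    left = right = 0.0
    if knots[i + k] != knots[i]:
        left = ((t - knots[i]) / (knots[i + k] - knots[i])
                * bspline_basis(i, k - 1, t, knots))
    if knots[i + k + 1] != knots[i + 1]:
        right = ((knots[i + k + 1] - t) / (knots[i + k + 1] - knots[i + 1])
                 * bspline_basis(i + 1, k - 1, t, knots))
    return left + right

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def bspline_coefficients(values, n_basis=6, degree=3):
    """Least-squares B-spline coefficients for an evenly spaced profile."""
    knots = clamped_knots(n_basis, degree)
    ts = [j / (len(values) - 1) for j in range(len(values))]
    design = [[bspline_basis(i, degree, t, knots) for i in range(n_basis)]
              for t in ts]
    gram = [[sum(row[p] * row[q] for row in design) for q in range(n_basis)]
            for p in range(n_basis)]
    rhs = [sum(row[p] * y for row, y in zip(design, values))
           for p in range(n_basis)]
    return solve(gram, rhs)
```

The returned list has a fixed length regardless of noise in the individual windows, which is what makes the coefficients convenient as classifier features.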
Slide15
Methods: Capturing all Possible Non-Canonical Matches
Encode the alignment of the miRNA seed region with mRNA nucleotides using a string of 1s (matches), 2s (mismatches), 3s (gaps), and 4s (GU wobbles).
Compute the probability that the alignment string (e.g., 1111121) is not a random occurrence.
Define enrichment score for the seed match pattern as 1 – probability.
The “seed enrichment metric” captures, in a single numeric feature, the relative efficacy of various kinds of seed matches.
This in turn converts the categorical feature of seed match or non-canonical match into a numerical feature (probability value) that can be better handled by ML methods.
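The alignment encoding can be illustrated with a short, simplified sketch. It assumes an ungapped alignment, so code 3 (gap) is never produced, and it assumes the miRNA seed and the mRNA site are both written 5'->3' and pair antiparallel. The function name is hypothetical, and the enrichment probability itself is not computed here because the slides do not specify the null model.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}

def encode_seed_alignment(seed, site):
    """Encode an ungapped seed/site alignment as a string of 1s, 2s and 4s.

    seed: miRNA seed, 5'->3'.  site: mRNA site of the same length, 5'->3'.
    1 = Watson-Crick match, 2 = mismatch, 4 = GU wobble.
    """
    if len(seed) != len(site):
        raise ValueError("ungapped sketch needs equal-length seed and site")
    codes = []
    # The strands pair antiparallel, so walk the site from its 3' end.
    for mirna_nt, mrna_nt in zip(seed, reversed(site)):
        pair = (mirna_nt, mrna_nt)
        if pair in WATSON_CRICK:
            codes.append("1")
        elif pair in WOBBLE:
            codes.append("4")
        else:
            codes.append("2")
    return "".join(codes)
```

Supporting gaps would require a real pairwise alignment step before the encoding, which is left out of this sketch.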
Slide16
Methods: CLIP Data
Data set (species)   # Positive examples (seed, seedless)   # Negative examples   # mRNA   # miRNA   Positive target sites (3’ UTR / CDS / 5’ UTR)
HITS-CLIP (Mouse)    861,208 (6%, 94%)                      35,608,333            4,059    119       56% / 43% / 1%
PAR-CLIP (Human)     141,109 (8%, 92%)                      2,659,748             1,211    35        57% / 39% / 4%
Slide17
Methods: Our Classifier
Distributed linear SVM using gradient descent (Apache Spark)
9 Intel x86 nodes, 8 GB memory, 4 cores.
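For readers unfamiliar with training a linear SVM by gradient descent, here is a single-machine sketch using batch subgradient descent on the L2-regularized hinge loss. It is not the Spark implementation used in Avishkar, and the hyperparameter defaults are placeholders. In a distributed setting the inner loop over examples would typically run per partition, with the partial gradients summed before each update.

```python
def train_linear_svm(xs, ys, reg=0.01, step=0.1, epochs=200):
    """Batch subgradient descent on the regularized hinge loss.

    xs: list of feature lists; ys: labels in {-1, +1}.
    reg, step and epochs are illustrative defaults, not tuned values.
    """
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [reg * wi for wi in w]        # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:                     # example violates the margin
                for d in range(dim):
                    grad_w[d] -= y * x[d] / n
                grad_b -= y / n
        w = [wi - step * gi for wi, gi in zip(w, grad_w)]
        b -= step * grad_b
    return w, b

def predict_linear(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```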
Slide18
Methods: Validation Protocol used to Evaluate Avishkar
Slide19
Results: ROC Curves
Slide20
Results: Feature Importance
Slide21
Methods: Improving Classification Performance
Linear classifier suffers from high bias (large training error).
Use more complex learning model
  Kernel SVM
SVMs suffer from a widely recognized scalability problem in both memory use and compute time.
  Kernel SVM computational cost: O(n^3)
  Does not scale beyond a few thousand examples for feature vector of dimension ~ 150.
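A quick back-of-the-envelope calculation shows why memory is a problem as well as compute time. The sketch below defines the radial basis function kernel and estimates the size of a dense kernel (Gram) matrix, which has one entry per pair of examples. The example count is the sum of the positive and negative human PAR-CLIP examples from the CLIP data slide; the 8-byte entry size and the function names are assumptions.

```python
import math

def rbf_kernel(x, z, gamma):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def gram_matrix_bytes(n_examples, bytes_per_entry=8):
    """Memory for a dense n x n kernel matrix of double-precision entries."""
    return n_examples * n_examples * bytes_per_entry

# Human PAR-CLIP data set: positive plus negative examples.
n_human = 141_109 + 2_659_748
print(f"{gram_matrix_bytes(n_human) / 1e12:.1f} TB for a dense kernel matrix")
```

Practical solvers avoid materializing the full matrix, but the quadratic growth is still what limits a single kernel SVM to modest training set sizes.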
Slide22
Methods: Distributed Training
Cascading SVM [Graf et al. NIPS 2005]
Key ideas:
  Train on partitions of the whole data set and do this in parallel
  Merge SVs from each partition in a hierarchical manner
  Final step is serial and it is hoped that the number of SVs is reasonably small at that stage
Implemented on top of Apache Spark
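The merge schedule of the cascade can be sketched independently of any particular SVM solver. In the code below, `train` is a caller-supplied function that takes a list of examples and returns the subset that ends up as support vectors. The pairwise merging and the single pass (no feedback of the final support vectors into the first layer) are simplifying assumptions. `train_1d` is a toy stand-in for illustration only: for separable 1-D data with negatives to the left of positives, the hard-margin support vectors are simply the largest negative and the smallest positive point.

```python
def cascade_svm(partitions, train):
    """Single-pass cascade: train per partition, then merge SVs pairwise.

    partitions: list of example lists; train(examples) -> support vectors.
    The first layer is independent per partition, so it can run in parallel.
    """
    level = [train(part) for part in partitions]
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            # Pool the support vectors of two neighbours and retrain on them.
            pooled = [sv for svs in level[i:i + 2] for sv in svs]
            merged.append(train(pooled))
        level = merged
    return level[0]

def train_1d(examples):
    """Toy trainer for separable 1-D data given as (x, label) tuples."""
    neg = [e for e in examples if e[1] == -1]
    pos = [e for e in examples if e[1] == +1]
    svs = []
    if neg:
        svs.append(max(neg))   # negative point closest to the boundary
    if pos:
        svs.append(min(pos))   # positive point closest to the boundary
    return svs
```

Each layer only ever sees support vectors from the layer below, so the final serial step operates on far fewer examples than the full data set, provided the support vector sets stay small.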
Slide23
Distributed Training: Our Contributions
Use non-linear SVM for classification since we observe high bias when using linear SVM, leading to lower accuracy
But non-linear SVM does not scale well even with cascading SVM
Biological insight: miRNAs within an miRNA family share structural similarities
Therefore, we create a separate non-linear classifier for each miRNA family, and within each family we speed up the optimization using the Cascading approach
Training of the model for each miRNA family can occur in parallel
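The per-family ensemble amounts to grouping the training data by miRNA family, training one model per group, and routing each prediction to the model of the matching family. The sketch below shows only that bookkeeping. The `train` callback, which should return a callable model, and the tuple layout of the examples are hypothetical; any per-family trainer could be plugged in.

```python
from collections import defaultdict

def train_family_ensemble(examples, train):
    """Train one model per miRNA family.

    examples: iterable of (family, features, label).
    train(rows) -> model, where rows is a list of (features, label) and
    model is a callable mapping features to a prediction.
    """
    by_family = defaultdict(list)
    for family, features, label in examples:
        by_family[family].append((features, label))
    # The families are independent, so these calls can run in parallel.
    return {family: train(rows) for family, rows in by_family.items()}

def predict_with_ensemble(models, family, features):
    """Route the example to the classifier trained for its miRNA family."""
    return models[family](features)
```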
Slide24
Results: Misclassification Rate for Linear and Non-Linear Classifier
Five-fold cross-validation misclassification rate for different miRNA families.
Mean test error of the non-linear SVMs for each of the miRNA families is less than that of the corresponding linear models.
Benefit of non-linear SVM is more pronounced for larger miRNA families: e.g., for the let-7 and miR-320 families, the benefits are 50% and 69.9% over the linear model.
Insight: Non-linear model can remove the prediction bias inherent in linear model.
Slide25
Results
ROC curves for the ensemble linear model and ensemble non-linear model, obtained by varying the probability threshold for the output of the SVM.
The misclassification error, true positive rate, and false positive rate were computed using 5-fold stratified cross-validation for seedless sites in the human data set.
One possible operating region is at an FPR of 0.2, where the TPR for the linear model is 0.469, while the TPR for the non-linear model is 0.756.
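Varying the probability threshold to trace an ROC curve can be sketched as follows. The code assumes the classifier output has already been converted to a probability-like score per example and that both classes are present. The function names are hypothetical.

```python
def tpr_fpr(scores, labels, threshold):
    """True and false positive rates when predicting 1 for score >= threshold.

    labels are 1 (positive) or 0 (negative); both classes must be present.
    """
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if label == 1:
            tp += predicted
            fn += not predicted
        else:
            fp += predicted
            tn += not predicted
    return tp / (tp + fn), fp / (fp + tn)

def roc_points(scores, labels):
    """(FPR, TPR) pairs, one per distinct score used as the threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tpr, fpr = tpr_fpr(scores, labels, threshold)
        points.append((fpr, tpr))
    return points
```

An operating point such as the one quoted above corresponds to picking the threshold whose FPR is closest to 0.2 and reading off the TPR at that threshold.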
Slide26
Results
Slide27
Results
Does the classification performance improve due to clustering by miRNA family or due to the use of a more complex model?
Performance of linear classifier does not improve by clustering data.
Slide28
Take-Away Points
We have developed “Avishkar”, a machine-learning, support vector machine-based model, to predict both canonical and non-canonical miRNA-mRNA interactions
Avishkar extracts thermodynamic and sequence features and smooths them through curve fitting, in order to extract enriched features from CLIP data
We use non-linear SVM to minimize bias and scale it up to the large biological data sizes through a biologically-driven parallelization strategy
We achieve the best-in-class true positive rate, with an improvement of over 150% over the best competitive protocol
The code runs on Apache Spark and is open sourced on Bitbucket: https://bitbucket.org/cellsandmachines/avishkar
We also provide a general-purpose, scalable implementation of kernel SVM using Apache Spark, which can be used to solve large-scale non-linear binary classification problems: https://bitbucket.org/cellsandmachines/kernelsvmspark
Slide29
Thanks!
Slide30
EXTRA + Notes
Slide31
Slide32
Background: RNA interference