Slide1
Prediction of Hierarchical Classification of Transposable Elements using Machine Learning Approach
Avdesh Mishra, Manisha Panta, Md Tamjidul Hoque, Joel Atallah
Computer Science and Biological Sciences Department, University of New Orleans
Slide2
Presentation Overview
4/10/2018
Introduction
Data Collection
Machine Learning Methods for the Prediction of Hierarchical Categories
Feature Extraction
Hierarchical Classification Approaches
Results
Conclusion
Slide3
Transposable Elements
Transposable elements (TEs), or jumping genes, are DNA sequences that have the intrinsic capability to move within a host genome from one genomic location to another.
The new genomic location can be on either the same or a different chromosome.
TEs were first discovered by Barbara McClintock (a.k.a. the maize scientist) in 1948.
TEs play an important role in:
Modifying functionalities of genes
E.g. insertion of L1 type TEs in tumor suppressor genes could lead to cancer.
Hence, proper classification of the TEs identified in a genome is important to understand their particular roles in germline and somatic evolution.
Slide4
Illustration of TEs Taxonomy Proposed by Wicker et al.
[Figure: taxonomy tree with hierarchical class labels such as 1, 1.1 and 1.1.2]
Slide5
Data Collection
For our study, we collected pre-annotated DNA sequences of TEs.
The hierarchical annotations of TEs were performed based on Wicker’s taxonomy.
For the annotation of TEs, the repetitive DNA sequences were obtained from two different public repositories:
Repbase
PGSB
The Repbase repository contains TEs from different eukaryotic species.
PGSB is a compilation of plant repetitive sequences from different databases:
TREP
TIGR repeats
PlantSat
Genbank
                 PGSB    Repbase
Fasta Sequences  18680   34561
Slide6
Feature Extraction
Each TE in a dataset is represented by a set of k-mers, which are obtained by counting the frequency of each substring of length k.
E.g. for k=2, all combinations of (AA, AT, AG, AC….CC) in the sequence are extracted
Example sequence: CCGCAAAAGTTGTC
For k=2: AA = 2, CC = 2, TT = 2
For k=3: CCG = 1, CAA = 1, AAG = 1
For k=4: CCGC = 1, AAAA = 1, GTTG = 1
For each TE, k-mers with k sizes of 2, 3 and 4 were used as features.
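A minimal sketch of the k-mer feature extraction for k = 2, 3 and 4. It assumes every overlapping window of length k is counted and that the alphabet is A, C, G, T; the slides do not state the counting convention, so the counts it produces need not match every number in the illustration. The function names are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only the four standard bases become features


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_features(seq, ks=(2, 3, 4)):
    """Return a fixed-order count vector over all possible k-mers for each k."""
    features = []
    for k in ks:
        counts = kmer_counts(seq, k)
        # Enumerate all 4**k k-mers so every sequence maps to the same columns.
        for kmer in product(ALPHABET, repeat=k):
            features.append(counts["".join(kmer)])
    return features


example = "CCGCAAAAGTTGTC"
vector = kmer_features(example)  # 16 + 64 + 256 = 336 feature values
```

Enumerating all possible k-mers, rather than only the observed ones, keeps the feature columns aligned across sequences, which a classifier needs.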
Feature values were standardized such that the mean = 0 and standard deviation = 1.
Slide7
Hierarchical Classification Approaches
Classification of TEs can be treated as a hierarchical classification problem
The hierarchical classification can be represented by a directed acyclic graph or a tree
Hierarchical classification of TEs is performed based on top-down strategies
Two recent top-down strategies for the hierarchical classification of TEs are:
non-Leaf Local Classifier per Parent Node (nLLCPN)
Local Classifier per Parent Node and Branch (LCPNB)
Slide8
non-Leaf Local Classifier per Parent Node Approach
In nLLCPN, a multi-class classifier is implemented at each non-leaf node of the graph.
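The top-down rule above can be sketched as a walk from the root that queries one local classifier per non-leaf node. This is an illustration under assumptions: each classifier is modelled as a plain callable that returns either a child label or the node's own label (meaning "stop here"), and the lambdas below are hypothetical stand-ins for trained models.

```python
def predict_nllcpn(x, classifiers, root="root"):
    """Walk the hierarchy top-down, asking one local classifier per non-leaf node."""
    node = root
    path = []
    while node in classifiers:        # leaf nodes have no classifier
        label = classifiers[node](x)  # a child label, or the node itself to stop
        if label == node:
            break
        path.append(label)
        node = label
    return path


# Hypothetical stand-ins for trained local classifiers.
classifiers = {
    "root": lambda x: "2",
    "2": lambda x: "2.1",
    "2.1": lambda x: "2.1.1",
    "2.1.1": lambda x: "2.1.1.2",
}
full_path = predict_nllcpn("CCGCAAAAGTTGTC", classifiers)

# Same hierarchy, but the classifier at node 2.1 predicts the node itself.
stopping = dict(classifiers)
stopping["2.1"] = lambda x: "2.1"
partial_path = predict_nllcpn("CCGCAAAAGTTGTC", stopping)
```

In the second call the walk ends at an internal node, which is what allows a prediction that is not a leaf class.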
Example for the sequence …CCGCAAAAGTTGTC…:
At the root, it is classified as either 1 or 2.
At node 2, it is classified as either itself (2) or 2.1.
At node 2.1, it is classified as either itself (2.1) or 2.1.1.
At node 2.1.1, it is classified as 2.1.1.2.
Slide9
Local Classifier per Parent Node and Branch Approach
In LCPNB, a multi-class classifier is implemented at each non-leaf node of the graph and prediction probabilities are obtained for all the classes.
The path leading to final classification:
2 (0.6), 2.1 (1), 2.1.1 (0.8), 2.1.1.1 (0.4)
Average = (0.6 + 1 + 0.8 + 0.4)/4 = 0.7
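The averaging step above can be sketched as follows. The sketch assumes that every root-to-leaf path is scored by the mean of the probabilities along it and that the highest-scoring path is taken as the prediction; the slide shows only the averaging, so the selection rule is an assumption. The probabilities on the path 2, 2.1, 2.1.1, 2.1.1.1 come from the slide; the other values and the tree shape are hypothetical.

```python
def path_scores(children, probs, node="root", prefix=()):
    """Yield (path, mean probability) for every root-to-leaf path."""
    kids = children.get(node, [])
    if not kids:
        if prefix:
            labels = [label for label, _ in prefix]
            values = [p for _, p in prefix]
            yield labels, sum(values) / len(values)
        return
    for kid in kids:
        yield from path_scores(children, probs, kid, prefix + ((kid, probs[kid]),))


# Hypothetical tree; only the probabilities on the path to 2.1.1.1 are from the slide.
children = {
    "root": ["1", "2"],
    "2": ["2.1"],
    "2.1": ["2.1.1"],
    "2.1.1": ["2.1.1.1", "2.1.1.2"],
}
probs = {"1": 0.4, "2": 0.6, "2.1": 1.0, "2.1.1": 0.8, "2.1.1.1": 0.4, "2.1.1.2": 0.2}

best_path, best_score = max(path_scores(children, probs), key=lambda item: item[1])
```

With these numbers the best path ends at 2.1.1.1 with a score of about 0.7, matching the average worked out above.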
[Figure: class hierarchy with a prediction probability on each branch]
Slide10
Machine Learning Methods for the Prediction of Hierarchical Categories
We applied several machine learning methods at each non-leaf node of the directed acyclic graph.
Artificial Neural Network (ANN)
ExtraTree Classifier (ET)
Gradient Boosting Classifier (GBC)
Logistic Regression (LogReg)
Random Forest (RF)
Support Vector Machines (SVM)
Slide11
Machine Learning Methods for the Prediction of Hierarchical Categories
The state-of-the-art method implements an ANN with a single hidden layer consisting of 200 nodes as a multi-class classifier.
In contrast, in this study we propose an SVM-based multi-class classifier.
We implemented the SVM with an RBF kernel and optimized the cost and gamma parameters using a grid search approach for optimal performance.
Slide12
Performance Measures
Here, C_i and Z_i represent the sets of true and predicted classes for an instance i, respectively.
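A minimal sketch of set-based hierarchical measures over C_i and Z_i. It assumes the commonly used definitions hP = Σ|C_i ∩ Z_i| / Σ|Z_i|, hR = Σ|C_i ∩ Z_i| / Σ|C_i| and hF = 2·hP·hR / (hP + hR); these formulas are an assumption rather than a transcription of the slide, although the hF values reported in the result tables are consistent with the harmonic mean of hP and hR. The example sets are hypothetical.

```python
def hierarchical_scores(true_sets, pred_sets):
    """Return (hP, hR, hF) from per-instance sets of true and predicted classes."""
    overlap = sum(len(c & z) for c, z in zip(true_sets, pred_sets))
    n_pred = sum(len(z) for z in pred_sets)
    n_true = sum(len(c) for c in true_sets)
    hp = overlap / n_pred if n_pred else 0.0
    hr = overlap / n_true if n_true else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf


# Hypothetical instances: each set holds a class together with its ancestors.
true_sets = [{"2", "2.1", "2.1.1"}, {"1", "1.1"}]
pred_sets = [{"2", "2.1"}, {"1", "1.1", "1.1.2"}]
hp, hr, hf = hierarchical_scores(true_sets, pred_sets)
```

The first instance is predicted too shallow and the second too deep; both errors are penalized only partially because the shared ancestors still count as overlap.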
The performance of each classifier is evaluated using a 3-fold cross-validation strategy.
Slide13
Results
Table I – Comparative results of different machine learning approaches on the PGSB hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.
MIPS - nLLCPN  SVM          ANN          GBC          ExtraTree    Random Forest  LogReg
hP             88.21%       82.13%       86.75%       76.03%       76.98%         76%
hR             86.51%       85.51%       86.25%       78.94%       79.55%         78.89%
hF             0.873518029  0.837699065  0.864972486  0.774524643  0.782458818    0.774172489
MIPS - LCPNB   SVM          ANN          GBC          ExtraTree    Random Forest  LogReg
hP             87.34%       82.93%       86.11%       84.50%       84.12%         83.55%
hR             86.10%       83.44%       86.45%       85%          84.69%         84.21%
hF             0.867151847  0.831846433  0.862758219  0.847494297  0.844037783    0.838769007
Slide14
Results
Table II – Comparative results of different machine learning approaches on the Repbase hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.
Repbase - nLLCPN  SVM          ANN          GBC          ExtraTree    Random Forest  LogReg
hP                85.44%       80.27%       81.98%       76.02%       76.98%         75.99%
hR                86.64%       83.32%       84.04%       78.93%       79.55%         78.89%
hF                0.860347824  0.817704912  0.830022352  0.774524643  0.782458818    0.774172489
Repbase - LCPNB   SVM          ANN          GBC          ExtraTree    Random Forest  LogReg
hP                85.75%       80.57%       81.94%       76.95%       77.67%         76.12%
hR                87.05%       83.26%       84.59%       79.99%       80.27%         79.16%
hF                0.863959027  0.818944098  0.832277949  0.78444174   0.789473439    0.776128202
Slide15
Results
Fig. 1 – Hierarchical F-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the PGSB dataset.
Slide16
Results
Fig. 2 – Hierarchical F-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the Repbase dataset.
Slide17
Conclusion and Future Work
An advanced machine learning approach improves the prediction accuracy of hierarchical classification of TEs
Optimization of the cost and gamma parameters of support vector machine (SVM) with radial basis function (RBF) kernel leads to a better hierarchical classification of transposable elements
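The cost and gamma tuning mentioned above can be sketched as a grid search over an RBF-kernel SVM. This is a sketch under assumptions, not the configuration used in the study: it relies on scikit-learn, the power-of-two search ranges are hypothetical (the slides do not list the values tried), and the helper names are made up for illustration. The pipeline standardizes features to mean 0 and standard deviation 1 and uses 3-fold cross-validation, as described in the earlier slides.

```python
def power_of_two_grid(lo, hi, step=2):
    """Return [2**lo, 2**(lo + step), ...] for exponents up to and including hi."""
    return [2.0 ** e for e in range(lo, hi + 1, step)]


def make_param_grid():
    # Hypothetical search ranges; the slides do not state which values were tried.
    return {
        "svc__C": power_of_two_grid(-5, 15),
        "svc__gamma": power_of_two_grid(-15, 3),
    }


def build_svm_search(cv=3):
    """Build a grid search over an RBF SVM; scikit-learn is imported lazily."""
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Standardize features, then fit the RBF-kernel SVM.
    # probability=True exposes class probabilities, which path averaging needs.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    return GridSearchCV(model, make_param_grid(), cv=cv)
```

Given a feature matrix X and labels y for one non-leaf node, calling build_svm_search().fit(X, y) and then reading best_params_ would give the selected cost and gamma for that node's classifier.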
We plan to improve the classification accuracy through the following approaches:
Addition of biochemistry-related features
Implementing advanced machine learning techniques
Implementing novel hierarchical classification approaches
Slide18