Prediction of Hierarchical Classification of Transposable Elements using Machine Learning

Presentation Transcript

Slide1

Prediction of Hierarchical Classification of Transposable Elements using Machine Learning Approach

Avdesh Mishra, Manisha Panta, Md Tamjidul Hoque, Joel Atallah

Computer Science and Biological Sciences Department, University of New Orleans

Slide2

Presentation Overview

4/10/2018

Introduction

Data Collection

Machine Learning Methods for the Prediction of Hierarchical Categories

Feature Extraction

Hierarchical Classification Approaches

Results

Conclusion

Slide3

Transposable Elements

Transposable elements (TEs), or jumping genes, are DNA sequences that have an intrinsic capability to move within a host genome from one genomic location to another

The new location can be on the same chromosome or on a different one

TEs were first discovered by Barbara McClintock, a maize geneticist, in 1948

TEs play an important role in:

Modifying functionalities of genes

E.g. insertion of L1 type TEs in tumor suppressor genes could lead to cancer.

Hence, proper classification of the TEs identified in a genome is important for understanding their particular roles in germline and somatic evolution.

Slide4

Illustration of TEs Taxonomy Proposed by Wicker et al.

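[Figure: tree diagram of the TE taxonomy proposed by Wicker et al.; node labels visible in the transcript: 1, 1.1, 1.1.2]

Slide5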

Data Collection


For our study, we collected pre-annotated DNA sequences of TEs.

The hierarchical annotations of TEs were performed based on Wicker’s taxonomy.

For the annotation of TEs, the repetitive DNA sequences were obtained from two different public repositories:

Repbase

PGSB

The Repbase repository contains TEs from different eukaryotic species.

PGSB is a compilation of plant repetitive sequences from different databases:

TREP

TIGR repeats

PlantSat

Genbank

Dataset    FASTA sequences
PGSB       18680
Repbase    34561

Slide6

Feature Extraction


Each TE in a dataset is represented by a set of k-mers, obtained by counting how often each substring of length k occurs in its sequence

E.g. for k=2, the counts of all 16 dinucleotides (AA, AT, AG, AC, …, CC) in the sequence are extracted

For each TE, k-mers with k = 2, 3 and 4 were used as features.

[Figure: example sequence T C C G C A A A A G T G T C with sample k-mer counts]

For k=2: AA = 2, CC = 2, TT = 2

For k=3: CCG = 1, CAA = 1, AAG = 1

For k=4: CCGC = 1, AAAA = 1, GTTG = 1
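The k-mer features described on this slide, together with the standardization step that closes it, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code. It assumes overlapping windows (which reproduces the k=3 and k=4 counts shown for …CCGCAAAAGTTGTC…), raw counts rather than relative frequencies, and the population standard deviation. The function names `kmer_names`, `kmer_counts`, `features` and `standardize` are hypothetical.

```python
from collections import Counter
from itertools import product
import math

ALPHABET = "ACGT"


def kmer_names(k):
    """All 4**k possible k-mers over ACGT, in a fixed order."""
    return ["".join(p) for p in product(ALPHABET, repeat=k)]


def kmer_counts(seq, k):
    """Count overlapping substrings of length k (assumption: overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # k-mers containing characters outside ACGT are simply not reported.
    return [counts[name] for name in kmer_names(k)]


def features(seq, ks=(2, 3, 4)):
    """Concatenate the k-mer count vectors for every k in ks (16 + 64 + 256 = 336 values)."""
    vec = []
    for k in ks:
        vec.extend(kmer_counts(seq, k))
    return vec


def standardize(rows):
    """Column-wise z-score so each feature has mean 0 and standard deviation 1.

    Columns with zero variance are mapped to 0.0 to avoid division by zero.
    """
    n = len(rows)
    out_cols = []
    for col in zip(*rows):
        mean = sum(col) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
        out_cols.append([(x - mean) / std if std > 0 else 0.0 for x in col])
    return [list(r) for r in zip(*out_cols)]
```

In practice the standardization statistics would be computed on the training folds only and then applied to the held-out fold; the transcript does not say how the authors handled this.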

Feature values were standardized so that each feature has mean = 0 and standard deviation = 1

Slide7

Hierarchical Classification Approaches


Classification of TEs can be treated as a hierarchical classification problem

The class hierarchy can be represented by a directed acyclic graph or a tree

Hierarchical classification of TEs is performed based on top-down strategies

Two recent top-down strategies for the hierarchical classification of TEs are:

non-Leaf Local Classifier per Parent Node (nLLCPN)

Local Classifier per Parent Node and Branch (LCPNB)

Slide8

non-Leaf Local Classifier per Parent Node Approach


In nLLCPN, a multi-class classifier is implemented at each non-leaf node of the graph.
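A top-down descent of this kind can be sketched as follows. This is an illustrative toy, not the authors' implementation. The hierarchy `CHILDREN` is a hypothetical fragment rather than the full Wicker taxonomy, and each per-node classifier is a stub that returns a fixed label. The sketch assumes that a classifier at a non-leaf node may return either one of its children or the node's own label, in which case the descent stops there.

```python
from typing import Callable, Dict, List

# Hypothetical toy hierarchy: parent label -> child labels.
CHILDREN: Dict[str, List[str]] = {
    "root": ["1", "2"],
    "2": ["2.1"],
    "2.1": ["2.1.1"],
    "2.1.1": ["2.1.1.1", "2.1.1.2"],
}


def predict_top_down(
    seq: str,
    classifiers: Dict[str, Callable[[str], str]],
    children: Dict[str, List[str]],
    root: str = "root",
) -> List[str]:
    """Walk from the root, asking the local classifier at each non-leaf node.

    Returns the list of labels visited (root excluded). The walk stops at a
    leaf, or when a classifier returns its own node label ("itself").
    """
    node, path = root, []
    while node in children:  # leaves have no entry in `children`
        choice = classifiers[node](seq)
        if choice == node:  # sequence is assigned to the current node itself
            break
        if choice not in children[node]:
            raise ValueError(f"classifier at {node!r} returned unknown label {choice!r}")
        path.append(choice)
        node = choice
    return path


# Stub classifiers with fixed answers; real ones would be trained models.
STUBS: Dict[str, Callable[[str], str]] = {
    "root": lambda s: "2",
    "2": lambda s: "2.1",
    "2.1": lambda s: "2.1.1",
    "2.1.1": lambda s: "2.1.1.2",
}

print(predict_top_down("CCGCAAAAGTTGTC", STUBS, CHILDREN))
```

Because each decision is final, an error made near the root cannot be corrected further down the hierarchy.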

Example: the sequence …CCGCAAAAGTTGTC… is classified step by step, from the root down:

At the root, it is classified as either 1 or 2

At node 2, it is classified as either 2 itself or 2.1

At node 2.1, it is classified as either 2.1 itself or 2.1.1

At node 2.1.1, it is classified as 2.1.1.2

Slide9

Local Classifier per Parent Node and Branch Approach


In LCPNB, a multi-class classifier is implemented at each non-leaf node of the graph and prediction probabilities are obtained for all the classes.

The path leading to the final classification:

2 (0.6) → 2.1 (1) → 2.1.1 (0.8) → 2.1.1.1 (0.4)

Average = (0.6+1+0.8+0.4)/4 = 0.7
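The path average above can be turned into a selection rule. The sketch below is hedged and not the authors' code. It assumes that every root-to-leaf path is a candidate, that a path is scored by the mean of the class probabilities along it, and that the path with the highest mean wins. The tree `CHILDREN` and most values in `PROBS` are hypothetical; only the four probabilities on the path 2, 2.1, 2.1.1, 2.1.1.1 come from the slide.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy tree: parent label -> child labels.
CHILDREN: Dict[str, List[str]] = {
    "root": ["1", "2"],
    "2": ["2.1"],
    "2.1": ["2.1.1", "2.1.2"],
    "2.1.1": ["2.1.1.1", "2.1.1.2"],
}

# Probability predicted for each class by the classifier at its parent node.
# Values for 2, 2.1, 2.1.1 and 2.1.1.1 are from the slide; the rest are made up.
PROBS: Dict[str, float] = {
    "1": 0.4, "2": 0.6,
    "2.1": 1.0,
    "2.1.1": 0.8, "2.1.2": 0.2,
    "2.1.1.1": 0.4, "2.1.1.2": 0.2,
}


def leaf_paths(children: Dict[str, List[str]], root: str) -> Iterator[List[str]]:
    """Yield every root-to-leaf path as a list of labels (root excluded)."""
    stack: List[Tuple[str, List[str]]] = [(root, [])]
    while stack:
        node, path = stack.pop()
        kids = children.get(node, [])
        if not kids and path:
            yield path
        for kid in kids:
            stack.append((kid, path + [kid]))


def best_path(children, probs, root="root") -> Tuple[float, List[str]]:
    """Return (mean probability, path) for the highest-scoring root-to-leaf path."""
    scored = [
        (sum(probs[n] for n in path) / len(path), path)
        for path in leaf_paths(children, root)
    ]
    return max(scored)


score, path = best_path(CHILDREN, PROBS)
print(path, round(score, 3))
```

Unlike a greedy descent, this rule lets a strong lower-level prediction compensate for a weaker one higher up, since the whole path is scored at once.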

[Figure: taxonomy tree annotated with the prediction probability of each class at every non-leaf node]

Slide10

Machine Learning Methods for the Prediction of Hierarchical Categories

We applied several machine learning methods at each non-leaf node of the directed acyclic graph.

Artificial Neural Network (ANN)

ExtraTree Classifier (ET)

Gradient Boosting Classifier (GBC)

Logistic Regression (LogReg)

Random Forest (RF)

Support Vector Machines (SVM)

Slide11

Machine Learning Methods for the Prediction of Hierarchical Categories

The state-of-the-art method uses an ANN with a single hidden layer of 200 nodes as the multi-class classifier

In this study, by contrast, we propose an SVM-based multi-class classifier
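As a rough illustration of what an SVM-based multi-class classifier at one node could look like, the sketch below fits an RBF-kernel SVM with scikit-learn and tunes its cost (`C`) and `gamma` by grid search. It is not the authors' pipeline. The data come from `make_classification` as a synthetic stand-in for k-mer features, the grid values are arbitrary, and using 3 folds inside the search simply mirrors the 3-fold evaluation mentioned in the slides.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the k-mer feature matrix of one node.
X, y = make_classification(
    n_samples=150, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Standardize features, then fit an RBF-kernel SVM (multi-class handled by SVC).
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical grid; the slides do not state which values were searched.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.001, 0.01, 0.1],
}

search = GridSearchCV(model, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Putting the scaler inside the pipeline means it is refit on the training portion of every fold, so the held-out fold does not leak into the standardization.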

We implemented an SVM with an RBF kernel and optimized the cost and gamma parameters using a grid search.

Slide12

Performance Measures

Here, C_i and Z_i represent the sets of true and predicted classes for an instance i, respectively.
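The measures reported in the results are hP, hR and hF. Assuming these are the standard hierarchical precision, recall and F-measure, they can be written with the sets above as hP = Σ_i |Z_i ∩ C_i| / Σ_i |Z_i|, hR = Σ_i |Z_i ∩ C_i| / Σ_i |C_i|, and hF = 2 · hP · hR / (hP + hR). The tabulated hF values are consistent with this harmonic mean: for example, hP = 88.21% and hR = 86.51% give about 0.8735. A minimal Python sketch follows. The helper `ancestors_and_self`, which expands a dot-separated label into the label plus its ancestors, is a hypothetical convenience and not something stated in the slides.

```python
from typing import Iterable, Set, Tuple


def ancestors_and_self(label: str) -> Set[str]:
    """Expand a dot-separated label, e.g. '2.1.1' -> {'2', '2.1', '2.1.1'}."""
    parts = label.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def hierarchical_scores(
    true_sets: Iterable[Set[str]], pred_sets: Iterable[Set[str]]
) -> Tuple[float, float, float]:
    """Return (hP, hR, hF) summed over all instances.

    true_sets[i] plays the role of C_i and pred_sets[i] the role of Z_i.
    """
    true_sets, pred_sets = list(true_sets), list(pred_sets)
    overlap = sum(len(c & z) for c, z in zip(true_sets, pred_sets))
    n_pred = sum(len(z) for z in pred_sets)
    n_true = sum(len(c) for c in true_sets)
    hp = overlap / n_pred if n_pred else 0.0
    hr = overlap / n_true if n_true else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf


# One instance: true class 2.1.1.2, predicted class 2.1.1.1.
C = [ancestors_and_self("2.1.1.2")]
Z = [ancestors_and_self("2.1.1.1")]
print(hierarchical_scores(C, Z))
```

A prediction that is wrong only at the deepest level still earns credit for the three ancestor classes it shares with the true label, which is the point of using hierarchical rather than flat measures.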

The performance of each classifier is evaluated using a 3-fold cross-validation strategy.

Slide13

Results

Table I: Comparative results of different machine learning approaches on the PGSB hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

MIPS - nLLCPN
      SVM          ANN          GBC          ExtraTree    Random Forest  LogReg
hP    88.21%       82.13%       86.75%       76.03%       76.98%         76%
hR    86.51%       85.51%       86.25%       78.94%       79.55%         78.89%
hF    0.873518029  0.837699065  0.864972486  0.774524643  0.782458818    0.774172489

MIPS - LCPNB
      SVM          ANN          GBC          ExtraTree    Random Forest  LogReg
hP    87.34%       82.93%       86.11%       84.50%       84.12%         83.55%
hR    86.10%       83.44%       86.45%       85%          84.69%         84.21%
hF    0.867151847  0.831846433  0.862758219  0.847494297  0.844037783    0.838769007

Slide14

Results

Table II: Comparative results of different machine learning approaches on the Repbase hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

Repbase - nLLCPN
      SVM          ANN          GBC          ExtraTree    Random Forest  LogReg
hP    85.44%       80.27%       81.98%       76.02%       76.98%         75.99%
hR    86.64%       83.32%       84.04%       78.93%       79.55%         78.89%
hF    0.860347824  0.817704912  0.830022352  0.774524643  0.782458818    0.774172489

Repbase - LCPNB
      SVM          ANN          GBC          ExtraTree    Random Forest  LogReg
hP    85.75%       80.57%       81.94%       76.95%       77.67%         76.12%
hR    87.05%       83.26%       84.59%       79.99%       80.27%         79.16%
hF    0.863959027  0.818944098  0.832277949  0.78444174   0.789473439    0.776128202

Slide15

Results

Fig. 1: Hierarchical F-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the PGSB dataset.

Slide16

Results

Fig. 2: Hierarchical F-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the Repbase dataset.

Slide17

Conclusion and Future Work

Advanced machine learning approaches improve the prediction accuracy of hierarchical classification of TEs

Optimizing the cost and gamma parameters of a support vector machine (SVM) with a radial basis function (RBF) kernel leads to better hierarchical classification of transposable elements: SVM achieved the highest hF on both datasets under both nLLCPN and LCPNB (Tables I and II)

We plan to improve the classification accuracy through the following approaches:

Adding biochemistry-related features

Implementing advanced machine learning techniques

Implementing novel hierarchical classification approaches
