Uploaded On 2018-12-04

The Prediction of the Hierarchical Classification of Transposable Elements using a Machine Learning Approach

Introduction

Transposable elements (TEs), or jumping genes, are DNA sequences that have an intrinsic capability to move within a host genome from one genomic location to another. The new location can be on the same chromosome or on a different one. Since Barbara McClintock discovered transposable elements in maize in 1948, numerous research efforts have addressed the identification and classification of TEs and their effects on the genome. These studies show that TEs play a role in genome function and evolution: their presence can modify the functionality of genes and increase the size of the genome. Proper classification of the jumping genes identified in a genome is therefore important for understanding their particular role in germline and somatic evolution.

TEs are usually classified by their mode of transposition, the number and type of genes they contain, and their sequence similarity. For hierarchical classification, the unified system proposed by Wicker et al. has been popular. It organizes the classes in a tree with two top-level classes: Class I (retrotransposons) and Class II (DNA transposons). Class I is further divided into five orders (LTR, DIRS, PLE, LINE, SINE) and Class II into Subclass 1 and Subclass 2. Each order is divided into several superfamilies.

In this work, we studied publicly available hierarchical datasets. We developed a machine learning based method that uses support vector machines (SVMs) to improve the prediction of the hierarchical classification of TEs. To generate an effective classifier, we used k-mers as features, a common practice in bioinformatics. Our major contribution is identifying an appropriate advanced machine learning method for predicting the hierarchical classes of TEs. We also performed a comparative study of different machine learning methods on the datasets, comparing the proposed SVM with the existing methods based on neural networks. The comparative results indicate that the proposed method significantly outperforms the state-of-the-art methods.

Methods

FASTA sequence extraction: DNA sequences are available in public repositories of repetitive DNA sequences. The repositories used here are the Repbase and PGSB (Plant Genome and Systems Biology) repeat element databases.

Feature encoding: K-mers are often used as features in bioinformatics. Here, the frequency counts of substrings are used as features. For each TE collected from the two public repositories, all k-mers of size k = 2, 3, 4 are extracted, which gives 4^2 + 4^3 + 4^4 = 336 features in total. The dataset is then organized as a hierarchical dataset with one class per level for each TE. The dataset was taken from previously completed, publicly available work.

Hierarchical classification methods: The hierarchical classification methods follow the local approach. Two top-down strategies for the hierarchical classification of TEs are used: non-Leaf Local Classifier per Parent Node (nLLCPN) and Local Classifier per Parent Node and Branch (LCPNB). nLLCPN allows classification at non-leaf nodes: a multi-class classifier is attached to each internal node of the hierarchy and learns to distinguish among that node's subclasses. LCPNB allows possible mistakes at a higher level to be corrected, because the final classification is the path to a leaf node with the highest average probability.

Application of machine learning methods: Different machine learning methods are used to determine the best approach for the prediction. The following classifiers are used: Support Vector Machines, Gradient Boosting Classifier, Neural Networks, Extra Tree, Random Forest, LogReg, and Bagging. A 3-fold cross-validation strategy is used, and the average hierarchical f-measure (hF, the balanced mean of hierarchical precision and recall) over the 3 iterations is reported.
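The feature encoding described in the Methods can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, raw overlapping counts are assumed (the poster does not say whether counts are normalized), and windows containing symbols other than A, C, G, T are skipped as an additional assumption. It shows why k = 2, 3, 4 yields 336 features.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: only the four standard nucleotides are counted


def kmer_vocabulary(ks=(2, 3, 4)):
    """Return every possible k-mer over ACGT for each k, in a fixed order."""
    return ["".join(p) for k in ks for p in product(ALPHABET, repeat=k)]


def kmer_counts(sequence, ks=(2, 3, 4)):
    """Count overlapping k-mers in a sequence.

    Windows that contain other symbols (for example N) are skipped;
    this handling is an assumption, not something stated in the source.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(kmer_vocabulary(ks), 0)
    for k in ks:
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if window in counts:
                counts[window] += 1
    return counts


def feature_vector(sequence, ks=(2, 3, 4)):
    """Return the counts as a list ordered like kmer_vocabulary(ks)."""
    counts = kmer_counts(sequence, ks)
    return [counts[kmer] for kmer in kmer_vocabulary(ks)]


if __name__ == "__main__":
    vec = feature_vector("ACGTACGTNNAC")
    print(len(vec))  # 16 + 64 + 256 = 336 features
```

Each TE sequence maps to a vector of the same length regardless of sequence length, which is what lets a fixed-input classifier such as an SVM be trained on it.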

Motivation

TEs play an important role in modifying the functionality of genes. Hence, proper classification of the identified jumping genes (TEs) in a genome is important to understand their particular role in germline and somatic evolution. The existing machine learning method for hierarchical classification of transposable elements does not achieve a satisfactory f-measure (the balanced mean of precision and recall).

Acknowledgements

We gratefully acknowledge the Louisiana Board of Regents through two Board of Regents Support Funds: LEQSF(2016-19)-RD-B-07 and LEQSF(2017-20)-RD-A-26. Start-up funds from the University of New Orleans to Joel Atallah also provided support for this project.

Results and Discussion

Table I reports the total number of instances and the number of features extracted as k-mer frequencies. The PGSB dataset contains fewer TEs than the Repbase dataset. Table II presents the hP (hierarchical precision), hR (hierarchical recall) and hF (hierarchical f-measure) obtained by different machine learning methods on the PGSB and Repbase datasets for the two hierarchical classification approaches.

The state-of-the-art method uses an artificial neural network (ANN) for the hierarchical classification of TEs. We analyzed the performance of six classifiers: SVM, GBC, Random Forest, ExtraTree, ANN, and LogReg. For the PGSB dataset (Fig. 1 and Fig. 3) and the Repbase dataset (Fig. 2 and Fig. 4), the figures show higher precision, recall and f-measure for our proposed SVM under both classification methods (nLLCPN and LCPNB). With optimized parameters, the proposed SVM produced better balanced accuracy for both classification methods.
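The two classification methods compared here differ in how the final class is chosen. The following Python sketch illustrates the LCPNB rule as the Methods describe it (pick the root-to-leaf path with the highest average probability) next to a plain greedy top-down descent. The toy hierarchy, the probabilities, and all function names are hypothetical; in practice the probabilities would come from the classifier trained at each parent node.

```python
# Toy hierarchy (hypothetical): parent -> list of children.
HIERARCHY = {
    "root": ["ClassI", "ClassII"],
    "ClassI": ["LTR", "LINE"],
    "ClassII": ["Subclass1", "Subclass2"],
}

# Hypothetical outputs of the local classifier at each parent node.
LOCAL_PROBS = {
    "root": {"ClassI": 0.55, "ClassII": 0.45},
    "ClassI": {"LTR": 0.60, "LINE": 0.40},
    "ClassII": {"Subclass1": 0.95, "Subclass2": 0.05},
}


def leaf_paths(hierarchy, node="root", prefix=()):
    """Yield every root-to-leaf path as a tuple of labels (root excluded)."""
    children = hierarchy.get(node, [])
    if not children:
        yield prefix
        return
    for child in children:
        yield from leaf_paths(hierarchy, child, prefix + (child,))


def lcpnb_predict(hierarchy, local_probs):
    """Return (path, score) for the path with the highest mean probability."""
    best_path, best_score = None, -1.0
    for path in leaf_paths(hierarchy):
        parents = ("root",) + path[:-1]
        probs = [local_probs[p][c] for p, c in zip(parents, path)]
        score = sum(probs) / len(probs)
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score


def greedy_top_down(hierarchy, local_probs):
    """Follow the most probable child at each level, with no backtracking."""
    node, path = "root", []
    while hierarchy.get(node):
        probs = local_probs[node]
        node = max(hierarchy[node], key=probs.get)
        path.append(node)
    return tuple(path)


if __name__ == "__main__":
    print(greedy_top_down(HIERARCHY, LOCAL_PROBS))
    print(lcpnb_predict(HIERARCHY, LOCAL_PROBS))
```

In this toy case the greedy descent commits to ClassI at the first level, while path averaging selects the ClassII branch because its second-level classifier is far more confident. That is the sense in which LCPNB can correct a mistake made at a higher level.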

Avdesh Mishra, Manisha Panta, Md Tamjidul Hoque, Joel Atallah
Emails: amishra2@uno.edu, mpanta1@uno.edu, thoque@uno.edu, jattallah@uno.edu
Department of Computer Science & Department of Biological Sciences, University of New Orleans, New Orleans, LA, USA

Table I

                     PGSB             Repbase
FASTA sequences      18680            34561
Features             336              336
Classes per level    2 / 4 / 3 / 5    2 / 5 / 12 / 9

Table II

PGSB - nLLCPN
      SVM           ANN           GBC           ExtraTree     Random Forest   LogReg
hP    88.21%        82.13%        86.75%        76.03%        76.98%          76%
hR    86.51%        85.51%        86.25%        78.94%        79.55%          78.89%
hF    0.873518029   0.837699065   0.864972486   0.774524643   0.782458818     0.774172489

PGSB - LCPNB
      SVM           ANN           GBC           ExtraTree     Random Forest   LogReg
hP    87.34%        82.93%        86.11%        84.50%        84.12%          83.55%
hR    86.10%        83.44%        86.45%        85%           84.69%          84.21%
hF    0.867151847   0.831846433   0.862758219   0.847494297   0.844037783     0.838769007
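The hF values in Table II are the balanced (harmonic) mean of hP and hR. The Python sketch below shows one common set-based way to compute the three measures, in which each label is expanded to itself plus its ancestors. The poster does not spell out its exact formula, so this definition, the toy hierarchy, and the function names are assumptions. The last lines check the SVM nLLCPN entry of Table II against the harmonic mean of its hP and hR.

```python
def with_ancestors(label, parent_of):
    """Return the label together with all of its ancestors (root excluded)."""
    labels = set()
    while label is not None:
        labels.add(label)
        label = parent_of.get(label)
    return labels


def hierarchical_prf(true_labels, pred_labels, parent_of):
    """Set-based hierarchical precision, recall and f-measure (assumed form)."""
    overlap = pred_total = true_total = 0
    for true, pred in zip(true_labels, pred_labels):
        t = with_ancestors(true, parent_of)
        p = with_ancestors(pred, parent_of)
        overlap += len(t & p)
        pred_total += len(p)
        true_total += len(t)
    hp = overlap / pred_total
    hr = overlap / true_total
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf


def harmonic_mean(precision, recall):
    """Balanced mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Hypothetical two-level hierarchy: child -> parent (None marks a top-level class).
PARENT_OF = {
    "ClassI": None, "ClassII": None,
    "LTR": "ClassI", "LINE": "ClassI",
    "Subclass1": "ClassII", "Subclass2": "ClassII",
}

if __name__ == "__main__":
    truth = ["LTR", "LINE", "Subclass1"]
    preds = ["LTR", "LTR", "ClassII"]  # the last prediction stops at a non-leaf node
    print(hierarchical_prf(truth, preds, PARENT_OF))

    # SVM, nLLCPN, PGSB: hP = 88.21%, hR = 86.51%, reported hF = 0.873518029
    print(harmonic_mean(0.8821, 0.8651))
```

The recomputed value matches the reported hF up to the rounding of the displayed percentages. Under this definition a prediction in the right class but the wrong order still earns partial credit, which a flat f-measure would not give.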

Table and Figure Legends

Table I – Dataset statistics. PGSB is a public repository of available plant repeat sequences. Repbase is a public repository of repetitive DNA sequences from different eukaryotic species.

Table II – Comparative results of different machine learning approaches on the PGSB hierarchical dataset. nLLCPN is non-Leaf Local Classifier per Parent Node and LCPNB is Local Classifier per Parent Node and Branch.

Fig. 1 – Hierarchical f-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the PGSB dataset.

Fig. 2 – Hierarchical f-measure comparison between different machine learning approaches for the nLLCPN and LCPNB hierarchical classification methods on the Repbase dataset.

Fig. 3 – Comparison of hierarchical precision (hP) and hierarchical recall (hR) between the ANN and the proposed SVM for the two classification methods (nLLCPN and LCPNB) on the PGSB dataset.

Fig. 4 – Comparison of hierarchical precision (hP) and hierarchical recall (hR) between the ANN and the proposed SVM for the two classification methods (nLLCPN and LCPNB) on the Repbase dataset.

Conclusions and Future Work

Proper classification of transposable elements (TEs) is crucial for identifying their roles in the genome.

Machine learning enables rapid annotation of the likeliest class of a transposable element.

An advanced machine learning approach improves prediction accuracy in the hierarchical classification of TEs.

Optimizing the cost and gamma parameters of a support vector machine (SVM) with a radial basis function (RBF) kernel leads to better hierarchical classification of transposable elements.

In the future, we would like to explore different features, additional advanced machine learning techniques, and other hierarchical classification approaches.

References

[1] Nakano, Felipe Kenji, et al. "Top-down strategies for hierarchical classification of transposable elements with neural networks." Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017.

[2] Wicker, Thomas, et al. "A unified classification system for eukaryotic transposable elements." Nature Reviews Genetics 8.12 (2007): 973.

[3] Melsted, Pall, and Jonathan K. Pritchard. "Efficient counting of k-mers in DNA sequences using a bloom filter." BMC Bioinformatics 12.1 (2011): 333.