Accurate Identification of disordered protein residues using deep neural network

Uploaded On 2019-02-28


The 4th Annual Conference on Computational Biology and Bioinformatics. Speaker: Sumaiya Iqbal. Authors: Sumaiya Iqbal, Denson Smith, Md Tamjidul Hoque. Computer Science, University of New Orleans, New Orleans, LA 70148.



Presentation Transcript

Slide1

Accurate Identification of disordered protein residues using deep neural network

The 4th Annual Conference on Computational Biology and Bioinformatics
Speaker: Sumaiya Iqbal
Authors: Sumaiya Iqbal, Denson Smith, Md Tamjidul Hoque
Computer Science, University of New Orleans, New Orleans, LA 70148
Email: {siqbal1, dsmith8, thoque}@uno.edu

4/4/2016

Slide2

Introduction

What is protein disorder?
Significance of disordered protein identification
Why do we need a computational tool for disorder prediction?
Our contribution

Slide3

What is Protein Disorder?

A fully functional protein is usually one that is twisted and folded into a specific three-dimensional conformation.
However, proteins can misfold, or be unable to adopt a well-defined, stable three-dimensional (3D) structure in an isolated state and under different non-native environments.
Such proteins, or partial regions of proteins, are called intrinsically disordered proteins (IDPs) or intrinsically disordered regions (IDRs) [1, 2].
The coordinates of their backbone atoms have no specific equilibrium states, and thus they adopt dynamic structural ensembles.

Slide4

Significance of Disordered Protein Identification

IDPs and IDRs do not follow the well-known structure-function paradigm. Disordered proteins have a comparatively larger total surface for interaction with partners than ordered (structured) proteins.

IDPs perform essential biological functions via protein-protein, protein-nucleic acid and protein-ligand interactions:
Cell cycle control and cellular signal transduction
Transcriptional and translational regulation
Molecular assembly and protein modification, etc.

Slide5

Significance of Disordered Protein Identification (contd.)

Binding regions within IDRs and IDPs become biologically active through disorder-to-structure transitions, known as induced folding [3 – 9].
Such binding regions are associated with critical human diseases, including cancer, cardiovascular disease, amyloidoses, neurodegenerative diseases, and diabetes.
About 80% of the human proteins found in the available disorder databases, such as DisProt and IDEAL, contain at least one amyloidogenic region that is directly linked with diseases such as Parkinson’s disease, Alzheimer’s disease or type II diabetes.
Thus, identifying protein disorder is essential and assists in effective drug development.

Slide6

Why Do We Need Computational Tool for Disorder Prediction?

IDPs are abundant in nature and are increasing fast in number.
Experimental annotation of disordered residues (X-ray crystallography, NMR spectroscopy, ultraviolet circular dichroism) is progressing slowly and is costly in terms of both time and money.
Computational tools that predict disordered residues using characteristic features and machine learning algorithms offer a useful alternative for understanding the functions of protein disorder.
The well-known CASP competitions assess their performance biennially [10 – 12].

Slide7

Our Contribution

We propose a new disorder prediction tool, DisPredict3, that works from protein sequence alone and involves:
  Deep neural network (DNN)
  A new structural stability measuring feature (PSEE)
  An effective feature selection procedure (GFFS)
We investigate the interconnection of disorder and critical binding regions.
We explore the structural characteristics of disordered regions.

Slide8

Materials & method

Dataset preparation
Feature set preparation
Predictor model development

Slide9

Dataset Preparation

Sources: Protein Data Bank [13] and Database of Protein Disorder (DisProt) [14]
Filter: remove sequences having 25% similarity using Blastclust [21]
Purify: remove sequences with abnormal amino acids or unknown annotation
Training set: SL477 (Short-Long, 477 chains), DisProt v 5.0
Test set: DD73 (Disorder Dataset, 73 chains), DisProt v 5.1 – 6.02
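The purify step can be illustrated with a minimal Python sketch. It assumes that "abnormal amino acids" means any residue letter outside the 20 standard one-letter codes; the function name `purify` and the example chains are hypothetical and are not taken from the presentation.

```python
# Assumption: "abnormal" = any letter outside the 20 standard amino acids.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def purify(chains):
    """Keep only chains made entirely of standard amino acid letters.

    `chains` maps a chain identifier to its one-letter sequence.
    """
    return {
        chain_id: seq
        for chain_id, seq in chains.items()
        if seq and set(seq.upper()) <= STANDARD_AMINO_ACIDS
    }


# Hypothetical example chains (not from SL477 or DD73).
example = {"chainA": "AVEGK", "chainB": "AVXGK", "chainC": "MKUB"}
kept = purify(example)
```

Here only `chainA` survives, because the other two chains contain letters (X, U, B) outside the standard alphabet. The similarity filter is a separate step and is not covered by this sketch.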

Slide10

Feature Set Preparation

Feature (number of features): Description
Amino acid (1): Characterizes the specific amino acid type
Physical properties (7): Steric parameter, polarizability, volume, hydrophobicity, isoelectric point, helix probability and sheet probability
Position Specific Scoring Matrix (20): Sequence-alignment-based evolutionary information computed using PSI-BLAST [21]
Monogram, bigram (1, 20): Conserved amino acid subsequence information computed using PSSM
Secondary structure (3+3): Tendency toward three local secondary structure types (helix, beta and coil), predicted using SPINE X [23] and MetaSSPred [19]
Accessible surface area (1+1): Solvent accessible area information, predicted using SPINE X [23] and REGAd3p [20]
Backbone angle fluctuations (1, 1): Backbone torsion angle flexibility information, predicted using Davar [22]
Position Specific Estimated Energy (1): State of stability of the residue in the 3D conformation, computed using contact energy and relative exposure and expressed as energy
Terminal (1): Flexible terminal region residue indicator
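As a quick consistency check, the group sizes listed above can be tallied in Python. The sketch assumes that "(1, 20)", "(3+3)", "(1+1)" and "(1, 1)" each denote the sum of their two parts; the dictionary keys are shorthand introduced here for illustration.

```python
# Group sizes as listed on the slide; paired counts are summed (assumption).
feature_groups = {
    "amino_acid": 1,
    "physical_properties": 7,
    "pssm": 20,
    "monogram_bigram": 1 + 20,
    "secondary_structure": 3 + 3,
    "accessible_surface_area": 1 + 1,
    "backbone_angle_fluctuations": 1 + 1,
    "psee": 1,
    "terminal": 1,
}
total_features = sum(feature_groups.values())
```

Under this reading the groups add up to 61, the size of the candidate feature set given on the next slide.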

Slide11

Feature Set Preparation (Contd.)

Candidate feature set: 61 features

Step 2: Feature selection using the genetic forest feature selector (GFFS), which combines a Genetic Algorithm with feature importance from an Extra Tree Classifier. The candidate set is reduced from 61 features to 37 features.

Step 3: Include neighboring residue information by applying a sliding window of size 21, giving 21 × 37 features per residue.
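The sliding-window step can be sketched in plain Python as follows. This is an illustration only: the function name `window_features` is hypothetical, and padding positions beyond the chain termini with zeros is an assumption, since the slide does not say how the ends of the chain are handled.

```python
def window_features(per_residue, window=21):
    """Concatenate each residue's features with those of its neighbors.

    `per_residue` is a list with one feature list per residue.
    Positions outside the chain are filled with zeros (assumption).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so it can be centered")
    half = window // 2
    n_feat = len(per_residue[0]) if per_residue else 0
    pad = [0.0] * n_feat
    rows = []
    for i in range(len(per_residue)):
        row = []
        for j in range(i - half, i + half + 1):
            inside = 0 <= j < len(per_residue)
            row.extend(per_residue[j] if inside else pad)
        rows.append(row)
    return rows


# Hypothetical chain of 5 residues with 37 selected features each.
selected = [[float(i)] * 37 for i in range(5)]
windowed = window_features(selected)
```

Each row of `windowed` then holds 21 × 37 = 777 values, with the residue's own 37 features in the middle block.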

Slide12

Predictor Model Development

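DisPredict3

Input: primary protein sequence (AVEGK…)

Deep neural network:
3 hidden layers
150 hidden nodes
Exponential activation function
Learning rate 0.1

Output: per-residue class (D = disordered, O = ordered) and disorder probability:
A  D  0.961054
V  D  0.866233
E  O  0.477779
G  O  0.372149
K  O  0.277364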
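The per-residue output format can be illustrated with a short Python sketch that turns disorder probabilities into D/O labels. Reading D as disordered and O as ordered, and the cutoff of 0.5, are assumptions that happen to be consistent with the five example residues; the threshold DisPredict3 actually uses is not stated on the slide, and `label_residues` is a hypothetical helper.

```python
def label_residues(sequence, probabilities, threshold=0.5):
    """Pair each residue with 'D' (disordered) or 'O' (ordered).

    The default threshold of 0.5 is an assumption, not the tool's setting.
    """
    if len(sequence) != len(probabilities):
        raise ValueError("one probability per residue is required")
    return [
        (aa, "D" if p >= threshold else "O", p)
        for aa, p in zip(sequence, probabilities)
    ]


# Probabilities taken from the example output on the slide.
rows = label_residues(
    "AVEGK", [0.961054, 0.866233, 0.477779, 0.372149, 0.277364]
)
for aa, label, p in rows:
    print(aa, label, f"{p:.6f}")
```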

Slide13

RESULTS

Performance measures
State-of-the-art methods to compare
Binary classification performance and comparisons
Probability prediction performance and comparisons

Slide14

Performance measures

Name of metric: Definition
True positive (TP): Number of correctly predicted disordered residues
True negative (TN): Number of correctly predicted ordered residues
False positive (FP): Number of ordered residues incorrectly predicted as disordered
False negative (FN): Number of disordered residues incorrectly predicted as ordered
Balanced accuracy (ACC): (TP / (TP + FN) + TN / (TN + FP)) / 2
Precision (PPV): TP / (TP + FP)
Matthews correlation coefficient (MCC): (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Area under curve (AUC): Area under the receiver operating characteristic curve

Table 1: Name and definition of performance measuring parameters
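The count-based measures can be computed with a few lines of Python. The sketch below assumes the standard definitions of balanced accuracy, precision and the Matthews correlation coefficient; the function names are illustrative, and the example counts are those listed for DisPredict2 on DD73 later in this transcript.

```python
import math


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def precision(tp, fp):
    """Fraction of predicted disordered residues that are truly disordered."""
    return tp / (tp + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Counts listed for DisPredict2 on the DD73 test set.
tp, tn, fp, fn = 3521, 7005, 588, 1223
acc = balanced_accuracy(tp, tn, fp, fn)
ppv = precision(tp, fp)
coef = mcc(tp, tn, fp, fn)
```

With these counts, the balanced accuracy and precision round to 0.832 and 0.857, matching the values reported for DisPredict2.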

Slide15

State-of-the-art Methods to Compare

MFDp [15]
  Uses support vector machine with linear kernel and combined outputs of 3 predictors
SPINE-D [16]
  Uses artificial neural network (2 hidden layers and 51 hidden nodes)
DisPredict [17]
  Uses optimized support vector machine with radial basis function as kernel
DisPredict2 [18]
  Uses optimized support vector machine with radial basis function as kernel
  Uses optimized threshold and new feature selection method

Slide16

State-of-the-art Methods to Compare (contd.)

DisPredict [17]
  Uses support vector machine with radial basis function as kernel
  Uses optimized parameters for SVM
DisPredict2 [18]
  Uses support vector machine with radial basis function as kernel
  Uses optimized parameters for SVM
  Uses optimized threshold for disorder vs. order separation
  Uses new structural stability measuring feature (PSEE)

Slide17

Binary Classification Performance and Comparisons

Method        TP     TN     FP     FN     ACC     Precision   MCC
DisPredict3   3952   6588   1005   752    0.858   0.818       0.698
DisPredict2   3521   7005   588    1223   0.832   0.857       0.680
DisPredict    3675   6706   886    1069   0.829   0.806       0.663
SPINE-D       3777   6436   1156   967    0.822   0.765       0.639
MFDp          3703   6645   947    1041   0.828   0.796       0.658

Table 2: Binary disorder classification performance comparison on DD73 dataset.

The most useful measure for evaluation of binary predictor

Slide18

Probability Prediction Performance and Comparisons

Figure 1: Binary disorder classification performance comparison on DD73 dataset.

Slide19

Discussions

Dynamic structural characteristics of protein disorder
Case studies
Future works and conclusions

Slide20

Structural Characteristics of Disordered Region (PSEE vs. Exposure)

Figure 2. Correlation between PSEE and relative exposure of ordered regions (blue circles) and disordered regions (red diamonds). The vertical dashed line marks the average PSEE of all ordered and disordered regions, and the horizontal dash-dotted line separates the regions with more and less than 25% exposure.

1st Quadrant (Disorder Dominated): high relative exposure, less negative energy. Energetically unstable area, preferred by disordered regions. Our PSEE feature is useful.

3rd Quadrant (Order Dominated): low relative exposure, high negative energy. Energetically stable area, preferred by ordered regions. Our PSEE feature is useful.
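The quadrant reading of the plot can be written down as a small Python sketch. It assumes that PSEE lies on the horizontal axis and exposure (as a fraction between 0 and 1) on the vertical axis, which follows from the vertical and horizontal dividing lines in the caption; the function name and the example values are hypothetical.

```python
def quadrant(psee, exposure, mean_psee, exposure_cutoff=0.25):
    """Return the plot quadrant (1 to 4) for one region.

    Quadrant 1: less negative energy and high exposure (disorder dominated).
    Quadrant 3: more negative energy and low exposure (order dominated).
    """
    right = psee > mean_psee            # less negative, less stable energy
    upper = exposure > exposure_cutoff  # more exposed
    if right and upper:
        return 1
    if not right and upper:
        return 2
    if not right and not upper:
        return 3
    return 4


# Hypothetical regions, with an assumed average PSEE of -1.0.
q_disorder_like = quadrant(psee=-0.5, exposure=0.60, mean_psee=-1.0)
q_order_like = quadrant(psee=-2.0, exposure=0.10, mean_psee=-1.0)
```

The same function applies to the coil-probability plot if `exposure` is replaced by coil probability and the cutoff is set to 0.5.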

We see that disorder can show order-like characteristics!

Slide21

Structural Characteristics of Disordered Region (PSEE vs. Coil tendency)

Figure 3. Correlation between PSEE and coil probability of ordered regions (blue circles) and disordered regions (red diamonds). The vertical dashed line marks the average PSEE of all ordered and disordered regions, and the horizontal dash-dotted line separates the regions with more and less than 50% coil tendency.

1st Quadrant (Disorder Dominated): high coil probability, less negative energy. Energetically unstable area, preferred by disordered regions. Our PSEE feature is useful.

3rd Quadrant (Order Dominated): low coil probability, high negative energy. Energetically stable area, preferred by ordered regions. Our PSEE feature is useful.

We see that disorder can show order-like characteristics!

Slide22

Why Overlap?

Noisy Annotation

Human host defense cathelicidin LL-37 and its smallest antimicrobial peptide
PDB: 2KO6A
DisProt: DP0004_C002

Figure 4. Identical protein chain from PDB and DisProt, showing a completely structured (ordered) state in PDB and an unstructured (disordered) state in DisProt.

Slide23

Why Overlap?

Case Study 1: disorder → order

Human disordered proteins contain amyloidogenic regions that undergo disorder (without amyloid formation) to order (with amyloid formation) transitions.

Beta-2-microglobulin (Homo sapiens)

Figure 5: Disorder probability plot for beta-2-microglobulin.

Location: 21 – 119
Disorder probability: mean = 0.324, standard deviation = 0.181

Slide24

Why Overlap? (contd.)

Case Study 2: disorder → order

Human disordered proteins contain amyloidogenic regions that undergo disorder (without amyloid formation) to order (with amyloid formation) transitions.

Lysozyme C (Homo sapiens)

Figure 6: Disorder probability plot for Lysozyme C.

Location: 19 – 148
Disorder probability: mean = 0.352, standard deviation = 0.205
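Summary statistics like those in the two case studies can be obtained from a per-residue disorder probability profile as sketched below. The helper `region_stats` and the example probabilities are hypothetical, locations are taken to be 1-indexed and inclusive, and the use of the population standard deviation is an assumption, since the slides do not say which form was used.

```python
from statistics import mean, pstdev


def region_stats(probabilities, start, end):
    """Mean and population standard deviation over residues start..end.

    Positions are 1-indexed and inclusive, as in "Location: 19 - 148".
    """
    if not (1 <= start <= end <= len(probabilities)):
        raise ValueError("location lies outside the sequence")
    region = probabilities[start - 1:end]
    return mean(region), pstdev(region)


# Hypothetical per-residue disorder probabilities (not from the case studies).
probs = [0.9, 0.8, 0.2, 0.4, 0.3, 0.1, 0.7]
region_mean, region_sd = region_stats(probs, 3, 6)
```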

Slide25

Conclusions and Future Works

DisPredict3 gives competitive output in predicting disordered protein residues.
The parameters of the DNN can be tuned further for better performance.
The new feature (PSEE) and the feature selection steps of DisPredict3 give two-fold benefits:
  The feature space dimension becomes lower with the selected features
  Relevant features characterize the disorder better
It is possible to identify the critical binding regions within disordered regions that undergo disorder-to-structure transitions using a disorder predictor.
Thus it will be interesting to further investigate the usability of the tool for binding region or induced folding region prediction.

Slide26

Acknowledgement

Bioinformatics and Machine Learning Lab
http://cs.uno.edu/~tamjid/
This work is supported by the Louisiana Board of Regents (BoR) through the Board of Regents Support Fund, LEQSF (2013-16)-RD-A-19. We gratefully acknowledge BoR.

Slide27

THANK YOU
