The 4th Annual Conference on Computational Biology and Bioinformatics. Speaker: Sumaiya Iqbal. Authors: Sumaiya Iqbal, Denson Smith, Md Tamjidul Hoque. Computer Science, University of New Orleans, New Orleans, LA 70148.
Slide1
Accurate Identification of disordered protein residues using deep neural network
The 4th Annual Conference on Computational Biology and Bioinformatics
Speaker: Sumaiya Iqbal
Authors: Sumaiya Iqbal, Denson Smith, Md Tamjidul Hoque
Computer Science, University of New Orleans, New Orleans, LA 70148
Email: {siqbal1, dsmith8, thoque}@uno.edu
4/4/2016
Slide2
Introduction
What is protein disorder?
Significance of disordered protein identification
Why do we need a computational tool for disorder prediction?
Our contribution
Slide3
What is Protein Disorder?
A fully functional protein is usually one that is appropriately twisted and folded into a specific three-dimensional (3D) conformation.
However, proteins can misfold and can be unable to adopt well-defined, stable 3D structures in an isolated state and under different non-native environments.
These proteins, or partial regions of proteins, are called intrinsically disordered proteins (IDPs) or disordered regions in proteins (IDRs) [1, 2].
The coordinates of their backbone atoms have no specific equilibrium states, and thus they adopt dynamic structural ensembles.
Slide4
Significance of Disordered Protein Identification
IDPs or IDRs do not follow the well-known structure-function paradigm.
Disordered proteins have a comparatively larger total surface for interaction with partners than ordered (structured) proteins.
IDPs perform essential biological functions via protein-protein, protein-nucleic acid and protein-ligand interactions:
  Cell cycle control and cellular signal transduction
  Transcriptional and translational regulation
  Molecular assembly and protein modification, etc.
Slide5
Significance of Disordered Protein Identification (contd.)
Binding regions within IDRs and IDPs become biologically active through disorder to structure transitions, known as induced folding [3 – 9].
Such binding regions are associated with critical human diseases, including cancer, cardiovascular disease, amyloidoses, neurodegenerative diseases, and diabetes.
About 80% of the human proteins found in the available disorder databases, such as DisProt and IDEAL, contain at least one amyloidogenic region; such regions are directly linked with diseases such as Parkinson’s disease, Alzheimer’s disease or type II diabetes.
Thus, identifying protein disorder is essential and assists in effective drug development.
Slide6
Why Do We Need a Computational Tool for Disorder Prediction?
IDPs are abundant in nature, and the number of known IDPs is increasing fast.
The experimental annotation of disordered residues (X-ray crystallography, NMR spectroscopy, ultraviolet circular dichroism) is progressing slowly.
  It is costly in terms of both time and money.
Computational tools that predict disordered residues using characteristic features and machine learning algorithms offer a useful alternative for understanding the functions of protein disorder.
The well-known biennial CASP competitions assess their performance [10 – 12].
Slide7
Our Contribution
We propose a new disorder prediction tool, DisPredict3, that works from protein sequence alone and involves:
  A deep neural network (DNN)
  A new structural stability measuring feature (PSEE)
  An effective feature selection procedure (GFFS)
We investigate the interconnection of disorder and critical binding regions.
We explore the structural characteristics of disordered regions.
Slide8
Materials & Methods
Dataset preparation
Feature set preparation
Predictor model development
Slide9
Dataset Preparation
Sources: Protein Data Bank [13]; Database of Protein Disorder [14] (v 5.0) and (v 5.1 – 6.02)
Filter: remove sequences having 25% similarity using Blastclust [21]
Purify: remove sequences with abnormal amino acids or unknown annotation
Training set: SL477 (Short-Long, 477 chains)
Test set: DD73 (Disorder Dataset, 73 chains)
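The purify step can be sketched as a simple filter. This is a minimal illustration, assuming that "abnormal amino acids" means any letter outside the 20 standard one-letter codes; the function name `purify` and the example chains are hypothetical and are not taken from SL477 or DD73. The similarity filter is not shown.

```python
# Hypothetical sketch of the "Purify" step: keep only chains made of the
# 20 standard amino acids (an assumption about what "abnormal" means).
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def purify(chains):
    """Return the chains whose sequences contain only standard residues."""
    return {
        name: seq
        for name, seq in chains.items()
        if seq and set(seq.upper()) <= STANDARD_AMINO_ACIDS
    }


# Made-up example chains; X and B are non-standard or ambiguous codes.
example = {"chain_a": "AVEGK", "chain_b": "AVXGK", "chain_c": "MBKL"}
kept = purify(example)
```

Here only `chain_a` survives the filter. Chains with unknown disorder annotation would need a second check that depends on how the annotation is stored.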
Slide10
Feature Set Preparation
Feature category (number of features): description
Amino acid (1): Characterizes specific amino acid type
Physical Properties (7): Steric parameter, polarizability, volume, hydrophobicity, isoelectric point, helix and sheet probability
Position Specific Scoring Matrix (20): Sequence alignment based evolutionary information computed using PSI-BLAST [21]
Monogram, Bigram (1, 20): Conserved amino acid subsequence information computed using PSSM
Secondary Structure (3+3): Three different local secondary structure (helix, beta and coil) like tendency, predicted using SPINE X [23] and MetaSSPred [19]
Accessible Surface Area (1+1): Solvent accessible area information, predicted using SPINE X [23] and REGAd3p [20]
Backbone Angle Fluctuations (1, 1): Backbone torsion angle flexibility information, predicted using Davar [22]
Position Specific Estimated Energy (1): State of stability of the residue in the 3D conformation, computed using contact energy and relative exposure expressed as energy
Terminal (1): Flexible terminal region residue indicator
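As a quick consistency check, the per-category counts can be tallied. The sketch below assumes that "(1, 20)", "(3+3)", "(1+1)" and "(1, 1)" are each read as sums; the dictionary keys are shorthand labels chosen for illustration.

```python
# Feature groups and counts as listed on the slide. Reading "(1, 20)",
# "(3+3)", "(1+1)" and "(1, 1)" as sums is an assumption.
feature_counts = {
    "amino acid": 1,
    "physical properties": 7,
    "PSSM": 20,
    "monogram + bigram": 1 + 20,
    "secondary structure": 3 + 3,
    "accessible surface area": 1 + 1,
    "backbone angle fluctuations": 1 + 1,
    "PSEE": 1,
    "terminal": 1,
}
total_features = sum(feature_counts.values())
```

Under that reading the total is 61 features per residue, the number that enters the feature selection step.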
Slide11
Feature Set Preparation (Contd.)
Step 2: Feature selection using the genetic forest feature selector (GFFS)
  GFFS diagram: Genetic Algorithm and Extra Tree Classifier, linked by "candidate feature set" and "feature importance"
  61 features → 37 features
Step 3: Include neighboring residue information by applying a sliding window of size 21
  37 features → 21 × 37 features
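The windowing in Step 3 can be sketched as follows. This is a minimal illustration in plain Python, assuming the window is centered on each residue and that positions beyond the chain termini are zero-padded; the padding scheme and the function name `window_features` are assumptions, not details given on the slide.

```python
def window_features(per_residue, window=21):
    """Concatenate each residue's features with those of its neighbors.

    per_residue: non-empty list of equal-length feature lists, one per residue.
    Positions that fall outside the chain are zero-padded (assumption).
    """
    half = window // 2
    n_feat = len(per_residue[0])
    padding = [0.0] * n_feat
    windowed = []
    for i in range(len(per_residue)):
        row = []
        for j in range(i - half, i + half + 1):
            inside = 0 <= j < len(per_residue)
            row.extend(per_residue[j] if inside else padding)
        windowed.append(row)
    return windowed


# Toy chain of 5 residues with 37 made-up features each.
toy = [[float(i + 1)] * 37 for i in range(5)]
wide = window_features(toy)
```

Each row of `wide` has 21 × 37 = 777 values, with the residue's own features in the middle block.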
Slide12
Predictor Model Development
Input: Primary Protein Sequence (AVEGK…)
DisPredict3 (deep neural network):
  3 hidden layers
  150 hidden nodes
  Exponential activation function
  Learning rate 0.1
Output (residue, predicted class, disorder probability):
  A  D  0.961054
  V  D  0.866233
  E  O  0.477779
  G  O  0.372149
  K  O  0.277364
  …
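The per-residue output can be read as a probability plus a class label. The sketch below assumes that "D" means disordered and "O" means ordered, and that the label comes from a probability cutoff; the value 0.5 is a hypothetical threshold chosen only because it reproduces the example labels, and the cutoff actually used by DisPredict3 is not stated on the slide.

```python
def label_residues(sequence, probabilities, threshold=0.5):
    """Pair each residue with 'D' (disordered) or 'O' (ordered).

    threshold is a hypothetical cutoff, not a published DisPredict3 value.
    """
    if len(sequence) != len(probabilities):
        raise ValueError("sequence and probabilities must have equal length")
    return [
        (aa, "D" if p >= threshold else "O", p)
        for aa, p in zip(sequence, probabilities)
    ]


# Probabilities copied from the example output on the slide.
labels = label_residues(
    "AVEGK", [0.961054, 0.866233, 0.477779, 0.372149, 0.277364]
)
```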
Slide13
RESULTS
Performance measures
State-of-the-art methods to compare
Binary classification performance and comparisons
Probability prediction performance and comparisons
Slide14
Performance measures
Name of metric: Definition
True positive (TP): Number of correctly predicted disordered residues
True negative (TN): Number of correctly predicted ordered residues
False positive (FP): Number of incorrectly predicted disordered residues
False negative (FN): Number of incorrectly predicted ordered residues
Balanced accuracy (ACC)
Precision (PPV)
Matthews correlation coefficient (MCC)
Area under curve (AUC): Area under the receiver operating characteristic curve
Table 1: Name and definition of performance measuring parameters
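The formulas for ACC, PPV and MCC did not survive in this transcript. The sketch below uses the standard textbook definitions, which are assumed (not confirmed) to be the ones on the original slide; the function names are illustrative.

```python
import math


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def precision(tp, fp):
    """Fraction of predicted-disordered residues that are truly disordered."""
    return tp / (tp + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient (0.0 if the denominator is zero)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Counts reported for DisPredict2 on DD73 (Table 2).
tp, tn, fp, fn = 3521, 7005, 588, 1223
acc = balanced_accuracy(tp, tn, fp, fn)
ppv = precision(tp, fp)
coef = mcc(tp, tn, fp, fn)
```

With these counts, the balanced accuracy and precision round to 0.832 and 0.857, the values listed for DisPredict2 in Table 2.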
Slide15
State-of-the-art Methods to Compare
MFDp [15]
  Uses a support vector machine with a linear kernel and the combined outputs of 3 predictors
SPINE-D [16]
  Uses an artificial neural network (2 hidden layers and 51 hidden nodes)
DisPredict [17]
  Uses an optimized support vector machine with a radial basis function kernel
DisPredict2 [18]
  Uses an optimized support vector machine with a radial basis function kernel
  Uses an optimized threshold and a new feature selection method
Slide16
State-of-the-art Methods to Compare (contd.)
DisPredict [17]
  Uses a support vector machine with a radial basis function kernel
  Uses optimized parameters for the SVM
DisPredict2 [18]
  Uses a support vector machine with a radial basis function kernel
  Uses optimized parameters for the SVM
  Uses an optimized threshold for disorder vs. order separation
  Uses a new structural stability measuring feature (PSEE)
Slide17
Binary Classification Performance and Comparisons
Method       TP    TN    FP    FN    ACC    Precision  MCC
DisPredict3  3952  6588  1005  752   0.858  0.818      0.698
DisPredict2  3521  7005  588   1223  0.832  0.857      0.680
DisPredict   3675  6706  886   1069  0.829  0.806      0.663
SPINE-D      3777  6436  1156  967   0.822  0.765      0.639
MFDp         3703  6645  947   1041  0.828  0.796      0.658
Table 2: Binary disorder classification performance comparison on DD73 dataset.
Slide callout: The most useful measure for evaluation of binary predictor
Slide18
Probability Prediction Performance and Comparisons
Figure 1: Binary disorder classification performance comparison on DD73 dataset.
Slide19
Discussions
Dynamic structural characteristics of protein disorder
Case studies
Future works and conclusions
Slide20
Structural Characteristics of Disordered Region (PSEE vs. Exposure)
Figure 2. Correlation between PSEE and relative exposure of ordered regions (blue circle) and disordered regions (red diamond). The vertical dashed line marks the average PSEE of all ordered and disordered regions, and the horizontal dash-dotted line separates the ordered and disordered regions with more and less than 25% exposure.
1st Quadrant (Disorder Dominated): high relative exposure, less negative energy. Energetically unstable area, preferred by disordered regions. Our PSEE feature is useful.
3rd Quadrant (Order Dominated): low relative exposure, high negative energy. Energetically stable area, preferred by ordered regions. Our PSEE feature is useful.
We see that disorder can show order-like characteristics!
Slide21
Structural Characteristics of Disordered Region (PSEE vs. Coil tendency)
Figure 3. Correlation between PSEE and coil probability of ordered regions (blue circle) and disordered regions (red diamond). The vertical dashed line marks the average PSEE of all ordered and disordered regions, and the horizontal dash-dotted line separates the ordered and disordered regions with more and less than 50% coil tendency.
1st Quadrant (Disorder Dominated): high coil probability, less negative energy. Energetically unstable area, preferred by disordered regions. Our PSEE feature is useful.
3rd Quadrant (Order Dominated): low coil probability, high negative energy. Energetically stable area, preferred by ordered regions. Our PSEE feature is useful.
We see that disorder can show order-like characteristics!
Slide22
Why Overlap?
Noisy Annotation
Human host defense cathelicidin LL-37 and its smallest antimicrobial peptide
PDB: 2KO6A
DisProt: DP0004_C002
Figure 4. Identical protein chain from PDB and DisProt, showing a completely structured (ordered) state in PDB and an unstructured (disordered) state in DisProt.
Slide23
Why Overlap?
Case Study 1: disorder to order
Human disordered proteins contain amyloidogenic regions that undergo disorder (without amyloid formation) to order (with amyloid formation) transitions.
Beta-2-microglobulin (Homo sapiens)
Figure 5: Disorder probability plot for beta-2-microglobulin.
Location: 21 – 119
Disorder probability: mean = 0.324, standard deviation = 0.181
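Summary statistics of this kind can be computed from a per-residue probability list and a residue range. The sketch below is illustrative only: it assumes 1-based inclusive residue numbering and the population standard deviation (the slides do not say which estimator was used), and the probabilities are made-up toy values, not predictor output.

```python
from statistics import mean, pstdev


def region_summary(probabilities, start, end):
    """Mean and standard deviation of disorder probability over
    residues start..end (1-based, inclusive)."""
    region = probabilities[start - 1:end]
    return mean(region), pstdev(region)


# Made-up per-residue probabilities for a 6-residue chain.
toy_probs = [0.9, 0.8, 0.2, 0.4, 0.3, 0.1]
avg, sd = region_summary(toy_probs, 3, 5)
```

For a real case study, `probabilities` would hold the predicted disorder probability of every residue in the chain, and `start` and `end` would be the reported location, for example 21 and 119.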
Slide24
Why Overlap? (contd.)
Case Study 2: disorder to order
Human disordered proteins contain amyloidogenic regions that undergo disorder (without amyloid formation) to order (with amyloid formation) transitions.
Lysozyme C (Homo sapiens)
Figure 6: Disorder probability plot for Lysozyme C.
Location: 19 – 148
Disorder probability: mean = 0.352, standard deviation = 0.205
Slide25
Conclusions and Future Works
DisPredict3 gives competitive output in predicting disordered protein residues.
The parameters of the DNN can be tuned further for better performance.
The new feature (PSEE) and the feature selection steps of DisPredict3 give two-fold benefits:
  The feature space dimension becomes lower with the selected features.
  Relevant features characterize the disorder better.
It is possible to identify the critical binding regions within disordered regions that undergo disorder to structure transitions using a disorder predictor.
Thus, it will be interesting to further investigate the usability of the tool for predicting binding regions or induced folding regions.
Slide26
Acknowledgement
Bioinformatics and Machine Learning Lab
http://cs.uno.edu/~tamjid/
This work is supported by the Louisiana Board of Regents (BoR) through the Board of Regents Support Fund, LEQSF (2013-16)-RD-A-19. We gratefully acknowledge BoR.
Slide27
THANK YOU