Slide1
Classification of Prostate Cancer Patients into Indolent and Aggressive Using Machine Learning
Presented By: Yashwanth Karthik Kumar Mamidi
Committee Members: Dr. Minhaz Zibran, Dr. Christopher M Summa
Mentors/Supervisors: Dr. Md Tamjidul Hoque, Dr. Chindo Hicks
MS Thesis Defense Presentation
Slide2
Contents
Introduction
Background
Objective and Hypothesis
Classification Algorithm (Gleason Score)
Prostate Adenocarcinoma
Project Design and Execution
Results
Conclusions and Future Work
References
Slide3
Introduction
The prostate is a gland located below the bladder and in front of the rectum [1].
Prostate cancer is most common among older men and rare in men younger than 40.
Symptoms:
Problems passing urine.
Low back pain.
Pain with ejaculation.
Blood in the urine or semen.
Ref:
[1] M. T. H. Information, "Prostate Cancer," https://medlineplus.gov/prostatecancer.html, U.S. National Library of Medicine.
Slide4
Background
Prostate cancer is the most commonly diagnosed cancer and the second leading cause of cancer-related death in men in the US [1].
In 2019, an estimated 174,650 men were diagnosed with prostate cancer and 31,620 died of the disease in the US (ACS) [2].
Prostate-specific antigen (PSA) screening has enabled early detection of prostate cancer and a reduction in mortality rates [3].
However, PSA screening has had the unintended consequence of overdiagnosis and overtreatment [3].
A critical problem faced by clinicians and oncologists is stratifying patients into those with indolent and those with aggressive tumors using the Gleason score [4].
Ref:
[1] S. Rodney, T. T. Shah, H. R. Patel, and M. Arya, "Key papers in prostate cancer," Expert Rev Anticancer Ther, vol. 14, pp. 1379-84, Nov 2014.
[2] A. C. Society, "Cancer Facts & Figures 2019," Myriad Prolaris prostate cancer, https://prolaris.com/2019/04/17/2019-prostate-cancer-statistics/, 2019.
[3] A. Lavi and M. Cohen, "[PROSTATE CANCER EARLY DETECTION USING PSA - CURRENT TRENDS AND RECENT UPDATES]," Harefuah, vol. 156, pp. 185-188, Mar 2017.
[4] K. Lin, J. M. Croswell, H. Koenig, C. Lam, and A. Maltz, "U.S. Preventive Services Task Force Evidence Syntheses, formerly Systematic Evidence Reviews," in Prostate-Specific Antigen-Based Screening for Prostate Cancer: An Evidence Update for the U.S. Preventive Services Task Force, ed Rockville (MD): Agency for Healthcare Research and Quality (US), 2011.
Slide5
Classification Algorithm (Gleason Score)
The most important predictor of mortality is the Gleason Score [1, 2].
Gleason score ranges from 6 to 10.
Ref:
[1] J. I. Epstein, W. C. Allsbrook, Jr., M. B. Amin, and L. L. Egevad, "The 2005 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma," Am J Surg Pathol, vol. 29, pp. 1228-42, Sep 2005.
[2] J. I. Epstein, L. Egevad, M. B. Amin, B. Delahunt, J. R. Srigley, and P. A. Humphrey, "The 2014 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma: Definition of Grading Patterns and Proposal for a New Grading System," Am J Surg Pathol, vol. 40, pp. 244-52, Feb 2016.
Slide6
Objective
To develop and apply machine learning to stratify patients into indolent and aggressive groups with high accuracy.
Hypothesis
Patients diagnosed as indolent or aggressive under the current classification protocol (Gleason score) should show measurable differences that distinguish the two patient groups.
We hypothesize that a machine learning model trained on existing information (GS: 6, 8, 9, 10) can correctly classify patients with GS: 7.
Slide7
PROSTATE ADENOCARCINOMA
Downloaded and extracted data from GDC-TCGA [1].
Gene-probes: 60483
Samples: 547
Indolent (190): GS: 6 and 7 (3+4)
Aggressive (305): GS: 7 (4+3), 8, 9, 10
Normal (52)
We kept only the gene-probes with values greater than 0 in at least ~30% (165) of the samples.
Resulting gene-probes: 34956.
Ref:
[1] N. C. Institute, "GDC Data Portal," https://portal.gdc.cancer.gov/.
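The ~30% expression filter can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the thesis: the function name `filter_probes`, the dictionary layout, and the toy probe IDs and values are all assumptions made for the example.

```python
import math


def filter_probes(expr, min_fraction=0.30):
    """Keep probes whose value is > 0 in at least min_fraction of the samples.

    expr maps a probe ID to its list of per-sample expression values.
    """
    kept = {}
    for probe, values in expr.items():
        n_expressed = sum(1 for v in values if v > 0)
        if n_expressed >= min_fraction * len(values):
            kept[probe] = values
    return kept


# Toy matrix with 10 samples per probe (hypothetical probe IDs and values).
toy = {
    "probe_A": [5, 0, 3, 0, 0, 2, 0, 1, 0, 0],   # expressed in 4 of 10 samples
    "probe_B": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],   # expressed in 1 of 10 samples
    "probe_C": [7, 8, 9, 6, 5, 4, 3, 2, 1, 1],   # expressed in all 10 samples
}
filtered = filter_probes(toy)

# With 547 samples, a 30% cutoff rounds up to the 165 samples quoted on the slide.
cutoff = math.ceil(0.30 * 547)
```

Here `filtered` keeps `probe_A` and `probe_C` and drops `probe_B`.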
Slide8
Assumptions
Model-1
We assume that classification by Gleason score correctly classifies all Prostate Cancer (PCa) patients. We used differential expression analysis and ran classification algorithms to check the classification accuracy and misclassification rates.
Model-2
From Model-1, we observed that a few samples were misclassified, and we assume that the samples with Gleason score 7 are the ones misclassified by primary and secondary GS. We used differential expression analysis and ran classification algorithms on samples with Gleason score 6, 8, 9, and 10 to check the classification accuracy and misclassification rates.
Slide9
A bar plot of the unnormalized library sizes of all samples was used to assess data quality.
The data were normalized using TMM (trimmed mean of M-values) before applying machine learning [1].
[Figure: bar plot of library size (y-axis) per sample (x-axis)]
Ref:
[1] M. D. Robinson and A. Oshlack, "A scaling normalization method for differential expression analysis of RNA-seq data," Genome Biology, vol. 11, p. R25, 2010.
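To make "library size" concrete, the sketch below computes the total count per sample and a plain counts-per-million (CPM) scaling. This is deliberately not TMM: TMM refines library-size scaling with a trimmed-mean factor and is available in tools such as the edgeR package. The function names and the toy counts are assumptions for illustration only.

```python
def library_sizes(counts):
    """Total read count per sample; counts[probe] is a list of per-sample counts."""
    n_samples = len(next(iter(counts.values())))
    return [sum(row[j] for row in counts.values()) for j in range(n_samples)]


def counts_per_million(counts):
    """Scale every count by its sample's library size (simple CPM, not TMM)."""
    sizes = library_sizes(counts)
    return {
        probe: [1e6 * c / size for c, size in zip(row, sizes)]
        for probe, row in counts.items()
    }


# Hypothetical raw counts for 3 probes in 2 samples; sample 2 is sequenced 4x deeper.
toy_counts = {
    "probe_A": [10, 40],
    "probe_B": [30, 120],
    "probe_C": [60, 240],
}
sizes = library_sizes(toy_counts)
cpm = counts_per_million(toy_counts)
```

After scaling, each probe has the same CPM in both samples, which is the point of normalizing for sequencing depth before comparing expression.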
Slide10
Project Design and Execution
Slide11
Fold change is defined as the ratio between two quantities.
It is used to analyze multiple measurements of a biological system taken at different times.
A change described by the ratio between two points is easier to interpret than the difference.
[Figure: volcano plot]
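A short sketch of the log-fold change (LFC) that appears later in the results. It assumes LFC is the base-2 logarithm of the ratio of group means; the pseudocount, the function name, and the toy values are assumptions added for the example and are not taken from the slides.

```python
import math


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 of the ratio of group means; the pseudocount avoids division by zero."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2((mean_a + pseudocount) / (mean_b + pseudocount))


tumor = [7.0, 7.0, 7.0]    # hypothetical expression values, mean 7
normal = [1.0, 1.0, 1.0]   # hypothetical expression values, mean 1

lfc_up = log2_fold_change(tumor, normal)     # (7 + 1) / (1 + 1) = 4, log2(4) = 2
lfc_down = log2_fold_change(normal, tumor)   # same magnitude, opposite sign
```

Working on the log scale makes up- and down-regulation symmetric around zero, which is why volcano plots use LFC on the x-axis.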
Slide12
Model-1
TCGA-Data: T = 495 and N = 52
Indolent Vs Normal: I = 190 and N = 52
Aggressive Vs Normal: A = 305 and N = 52
Probes associated with 2 diseases
Indolent Vs Aggressive (Feature Selection)
Machine Learning
Classifiers Used:
Stochastic Gradient Descent (SGD)
Sequential Minimal Optimization (SMO)
SimpleLogistic
Logistic Model Trees (LMT)
MultiClassClassifier
Slide13
Samples with GS: 6, 7 (3+4) Vs GS: 8 - 10, 7 (4+3)
Applied Differential Expression (D.E)
Indolent Vs Normal: 18,215 probes (p-value < 0.05)
Aggressive Vs Normal: 21,042 probes (p-value < 0.05)
16,317 common probes are removed.
6,623 unique probes
Indolent Vs Aggressive: 2074 probes with p-value < 0.05
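The probe counts on this slide are consistent with reading "unique probes" as the probes that are significant in exactly one of the two comparisons against Normal: 18,215 + 21,042 - 2 x 16,317 = 6,623. A minimal sketch of that set operation, with hypothetical probe IDs and a hypothetical function name:

```python
def unique_probes(indolent_vs_normal, aggressive_vs_normal):
    """Probes significant in exactly one comparison (symmetric difference)."""
    return set(indolent_vs_normal) ^ set(aggressive_vs_normal)


# Toy probe IDs (hypothetical); "p2" and "p3" are common to both lists.
toy_unique = unique_probes({"p1", "p2", "p3"}, {"p2", "p3", "p4", "p5"})

# Count check against the numbers quoted on the slide.
n_indolent, n_aggressive, n_common = 18215, 21042, 16317
n_unique = n_indolent + n_aggressive - 2 * n_common
```

Removing the common probes leaves candidates specific to one group, which then feed the Indolent Vs Aggressive comparison.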
Slide14
PCA representation
Normalized data
2074 probes
495 samples (Normals removed)
All samples except Normals
Indolent: Red
Aggressive: Blue
Slide15
Model-2
TCGA-Data (removed samples with GS: 7)
Indolent Vs Normal
Aggressive Vs Normal
Unique Probes
Indolent Vs Aggressive
Machine Learning
Classify GS = 7 samples
Classifiers Used:
Stochastic Gradient Descent (SGD)
Sequential Minimal Optimization (SMO)
SimpleLogistic
Logistic Model Trees (LMT)
MultiClassClassifier
Slide16
Samples with GS: 6 Vs GS: 8 - 10
Applied Differential Expression (D.E)
Indolent Vs Normal: 15,105 probes (p-value < 0.05)
Aggressive Vs Normal: 20,712 probes (p-value < 0.05)
12,985 common probes are removed.
9,848 unique probes
Indolent Vs Aggressive: 3513 probes with p-value < 0.05
Slide17
PCA representation
Normalized data (samples without GS: 7)
3513 gene probes
249 samples (Normals removed)
All samples with GS: 6, 8, 9, 10
Indolent: Red
Aggressive: Blue
Slide18
Results
[Figure: accuracy versus LFC values. LFC: log-fold change]
Slide19
[Figure: accuracy versus LFC values. LFC: log-fold change]
Slide20
PCA representation
Normalized data (samples with GS: 7 (3+4 Vs 4+3))
3513 probes
246 samples (Normals removed)
Indolent: Red
Aggressive: Blue
Slide21
Number of Samples: 249
[Figure: accuracy versus LFC values. LFC: log-fold change]
Slide22
[Figure: accuracy versus LFC values. LFC: log-fold change]
Slide23
PCA of all samples
[Figure panels: LMT, MultiClassClassifier, SGD, SimpleLogistic, SMO]
Indolent: Red
Aggressive: Blue
Slide24
PCA of all the samples with GS: 6 and 8 - 10
[Figure panels: LMT, MultiClassClassifier, SGD, SimpleLogistic, SMO]
Indolent: Red
Aggressive: Blue
Slide25
Performance of various Classifiers (Samples with Gleason Score: 6, 7, 8, 9, and 10)
Model type and Description          Sensitivity  Specificity  Accuracy  Precision  F1 Score  MCC      Balanced Accuracy
Support Vector Machine (SVM)        0.91351      0.68800      0.85657   0.89655    0.90495   0.61332  0.80076
Logistic Regression (LogReg)        0.84865      0.67200      0.80404   0.88451    0.86621   0.50225  0.76032
Random Decision Forest (RDF)        0.92432      0.51200      0.82020   0.84864    0.88486   0.48733  0.71286
Extra Tree Classifier (ETC)         0.92703      0.48800      0.81616   0.84275    0.88288   0.84275  0.71946
Gradient Boosting Classifier (GBC)  0.91081      0.54400      0.81818   0.85533    0.88220   0.49032  0.71941
K nearest neighbor (KNN)            0.85676      0.59200      0.78990   0.86141    0.85908   0.44642  0.72438
eXtreme Gradient Boosting (XGBC)    0.90210      0.66892      0.85051   0.88761    0.88914   0.60723  0.79830
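All seven metrics in the table can be derived from the four confusion-matrix counts. The sketch below assumes the standard definitions. The slides do not report the counts themselves, so the values of `tp`, `fn`, `tn`, and `fp` used here are hypothetical; they were chosen only because they reproduce the SVM row to five decimals.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "precision": precision,
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical counts (not from the slides), consistent with the SVM row.
m = binary_metrics(tp=338, fn=32, tn=86, fp=39)
for name, value in m.items():
    print(f"{name}: {value:.5f}")
```

Balanced accuracy and MCC are worth reading alongside plain accuracy here, because the low specificity values show that the classifiers favor one class.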
Slide26
Genetic Algorithm
A genetic algorithm is a stochastic search algorithm that gradually refines solutions through a process modeled on natural selection, rather than through a manually designed search strategy [1].
It is used to search a space of potential solutions to find one that solves the problem.
Features and Fitness values
Gene-probes: 1020 (Samples with GS: 6, 7, 8, 9, 10), Fitness value: 1.6140201
Gene-probes: 1681 (Samples with GS: 6, 8, 9, 10), Fitness value: 1.74722
Ref:
[1] Z. Piserchia, Koenig, Daniel, "Applications of Genetic Algorithm in Bioinformatics," eScholarship, 2018.
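A minimal sketch of a genetic algorithm applied to feature selection, where each individual is a bit mask over features. The slides do not describe the fitness function or the operators used in the thesis, so everything here is an assumption for illustration: the toy fitness, tournament selection, one-point crossover, bit-flip mutation, elitism, and all parameter values.

```python
import random


def genetic_search(fitness, n_bits, pop_size=20, generations=30,
                   mutation_rate=0.05, seed=0):
    """Evolve bit masks (1 = feature kept) toward higher fitness."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]

    def pick():
        # Tournament selection: the fitter of two random individuals.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]                  # elitism: carry over the best mask
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, n_bits)    # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with a small per-bit probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


# Toy fitness (an assumption): reward features 0-4, penalize all others.
INFORMATIVE = {0, 1, 2, 3, 4}


def toy_fitness(mask):
    return sum(1 if i in INFORMATIVE else -1 for i, g in enumerate(mask) if g)


best_mask, history = genetic_search(toy_fitness, n_bits=12)
```

Because the best mask is copied unchanged into each new generation, the best fitness recorded in `history` never decreases. In a real feature-selection setting the fitness would typically be a cross-validated classifier score on the selected probes.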
Slide27
Stacking
Stacking is an ensemble method that obtains information from multiple different models and aggregates it to form a new model [1, 6].
It aims to minimize the generalization error and so yield more accurate results.
Two stages of learners in stacking:
Base classifiers: more than one classifier is used.
Meta-classifier: only one classifier is used.
The generalization error is reduced by combining the prediction probabilities from the base classifiers using a meta-classifier.
Ref:
[1] S. Iqbal and M. Hoque, "PBRpredict-Suite: A Suite of Models to Predict Peptide Recognition Domain Residues from Protein Sequence," Bioinformatics (Oxford, England), vol. 34, 05/03 2018.
[2] A. Mishra, P. Pokhrel, and M. Hoque, "StackDPPred: A Stacking based Prediction of DNA-binding Protein from Sequence," Bioinformatics, vol. 35, 07/19 2018.
[3] Q. Hu, C. Merchante, A. N. Stepanova, J. M. Alonso, and S. Heber, "A Stacking-Based Approach to Identify Translated Upstream Open Reading Frames in Arabidopsis Thaliana," in Bioinformatics Research and Applications, Cham, 2015, pp. 138-149.
[4] S. Gattani, A. Mishra, and M. T. Hoque, "StackCBPred: A stacking based prediction of protein-carbohydrate binding sites from sequence," Carbohydr Res, vol. 486, p. 107857, Dec 1 2019.
[5] D. S. M. A. Aditi S. Kuchi, Dr. Minhaz F. Zibran, Dr. Mahdi Abdelguerfi, Dr. Md Tamjidul Hoque, "Detection of Sand Boils from Images using Machine Learning Approaches," ScholarWorks@UNO.
[6] M. Flot, A. Mishra, A. S. Kuchi, and M. T. Hoque, "StackSSSPred: A Stacking-Based Prediction of Supersecondary Structure from Sequence," Methods in molecular biology (Clifton, N.J.), vol. 1958, pp. 101-122, 2019.
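The two-stage idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the thesis implementation: the base learners are simple nearest-centroid classifiers on different feature subsets (stand-ins for the LogReg, KNN, SVM, and XGBC models reported in the results), the meta-classifier is a small logistic regression, and the data are synthetic. Base probabilities are produced out-of-fold so the meta-classifier does not train on predictions the base learners made for their own training samples.

```python
import math
import random


def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


def fit_centroid(X, y, features):
    """Base learner: nearest class centroid on a subset of features."""
    def centroid(label):
        rows = [x for x, t in zip(X, y) if t == label]
        return [sum(r[f] for r in rows) / len(rows) for f in features]
    c0, c1 = centroid(0), centroid(1)

    def predict_proba(x):
        d0 = sum((x[f] - c) ** 2 for f, c in zip(features, c0))
        d1 = sum((x[f] - c) ** 2 for f, c in zip(features, c1))
        return sigmoid(d0 - d1)      # closer to class 1 gives a value above 0.5
    return predict_proba


def fit_logistic(Z, y, lr=0.5, epochs=300):
    """Meta-classifier: logistic regression trained by batch gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, t in zip(Z, y):
            err = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b) - t
            grad_w = [g + err * zi for g, zi in zip(grad_w, z)]
            grad_b += err
        w = [wi - lr * g / len(Z) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(Z)
    return lambda z: sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)


def fit_stacking(X, y, feature_sets, k=5):
    # Stage 1: out-of-fold base probabilities become the meta-features.
    Z = [None] * len(X)
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        bases = [fit_centroid([X[i] for i in train], [y[i] for i in train], fs)
                 for fs in feature_sets]
        for i in range(fold, len(X), k):
            Z[i] = [base(X[i]) for base in bases]
    # Stage 2: a single meta-classifier combines the base probabilities.
    meta = fit_logistic(Z, y)
    bases = [fit_centroid(X, y, fs) for fs in feature_sets]   # refit on all data
    return lambda x: meta([base(x) for base in bases])


# Synthetic data (hypothetical): class 1 is shifted by +2 in both features.
rng = random.Random(0)
y = [i % 2 for i in range(100)]
X = [[rng.gauss(2.0 * t, 1.0), rng.gauss(2.0 * t, 1.0)] for t in y]
model = fit_stacking(X, y, feature_sets=[[0], [1], [0, 1]])
accuracy = sum((model(x) > 0.5) == bool(t) for x, t in zip(X, y)) / len(X)
```

The returned `model` maps one sample to a probability of class 1. Note that `accuracy` here is measured on the training data, so it only checks that the pipeline fits; an honest estimate would use held-out samples.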
Slide28
Performance of Stacking Methods (Samples with Gleason Score: 6, 7, 8, 9, and 10)
Model type and Description                                Sensitivity  Specificity  Accuracy  Precision  F1 Score  MCC      Balanced Accuracy
LogReg, KNN, SVM as Base, SVM as Meta-classifier          0.99198      0.85124      0.95758   0.95373    0.97248   0.88337  0.92161
LogReg, SVM, KNN, XGBC as Base, XGBC as Meta-classifier   0.95989      0.83471      0.92929   0.94723    0.95352   0.80618  0.87295
LogReg, KNN, SVM as Base, XGBC as Meta-classifier         0.972        0.8368       0.93131   0.949      0.961     0.8369   0.90690
Slide29
Performance of Stacking Methods (Samples with Gleason Score: 6, 8, 9, and 10)
Model type and Description                                Sensitivity  Specificity  Accuracy  Precision  F1 Score  MCC      Balanced Accuracy
LogReg, KNN, SVM as Base, SVM as Meta-classifier          0.98182      0.89655      0.97189   0.98630    0.98405   0.86558  0.93918
LogReg, SVM, KNN, XGBC as Base, XGBC as Meta-classifier   0.95127      0.7911       0.93812   0.96976    0.95991   0.71929  0.90869
LogReg, KNN, SVM as Base, XGBC as Meta-classifier         0.95909      0.79310      0.94779   0.97235    0.96568   0.72100  0.91513
Slide30
PCA of All Samples
[Figure panels: Before Classification, After Classification]
Indolent: Red
Aggressive: Blue
Slide31
PCA of All Samples except GS: 7
[Figure panels: Before Classification, After Classification]
Indolent: Red
Aggressive: Blue
Slide32
Conclusions and Future Work
Results show that samples with GS: 7 (3+4 and 4+3) are misclassified more often than samples with GS: 6, 8, 9, 10.
The current classification protocol could misclassify PCa patients, whereas ML algorithms can accurately classify most PCa patients as indolent or aggressive.
We implemented a stacking-based machine learning technique to increase prediction accuracy.
The highest accuracy obtained from individual classifiers is ~86%; using stacking, we improved the accuracy to ~96%.
To improve accuracy further, mutation-based or methylation-based analysis can be implemented.
Slide33
References:
M. T. H. Information, "Prostate Cancer," https://medlineplus.gov/prostatecancer.html, U.S. National Library of Medicine.
S. Rodney, T. T. Shah, H. R. Patel, and M. Arya, "Key papers in prostate cancer," Expert Rev Anticancer Ther, vol. 14, pp. 1379-84, Nov 2014.
A. C. Society, "Cancer Facts & Figures 2019," Myriad Prolaris prostate cancer, https://prolaris.com/2019/04/17/2019-prostate-cancer-statistics/, 2019.
A. Lavi and M. Cohen, "[PROSTATE CANCER EARLY DETECTION USING PSA - CURRENT TRENDS AND RECENT UPDATES]," Harefuah, vol. 156, pp. 185-188, Mar 2017.
K. Lin, J. M. Croswell, H. Koenig, C. Lam, and A. Maltz, "U.S. Preventive Services Task Force Evidence Syntheses, formerly Systematic Evidence Reviews," in Prostate-Specific Antigen-Based Screening for Prostate Cancer: An Evidence Update for the U.S. Preventive Services Task Force, ed Rockville (MD): Agency for Healthcare Research and Quality (US), 2011.
J. I. Epstein, W. C. Allsbrook, Jr., M. B. Amin, and L. L. Egevad, "The 2005 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma," Am J Surg Pathol, vol. 29, pp. 1228-42, Sep 2005.
J. I. Epstein, L. Egevad, M. B. Amin, B. Delahunt, J. R. Srigley, and P. A. Humphrey, "The 2014 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma: Definition of Grading Patterns and Proposal for a New Grading System," Am J Surg Pathol, vol. 40, pp. 244-52, Feb 2016.
Slide34
N. C. Institute, "GDC Data Portal," https://portal.gdc.cancer.gov/.
M. D. Robinson and A. Oshlack, "A scaling normalization method for differential expression analysis of RNA-seq data," Genome Biology, vol. 11, p. R25, 2010.
S. Iqbal and M. Hoque, "PBRpredict-Suite: A Suite of Models to Predict Peptide Recognition Domain Residues from Protein Sequence," Bioinformatics (Oxford, England), vol. 34, 05/03 2018.
A. Mishra, P. Pokhrel, and M. Hoque, "StackDPPred: A Stacking based Prediction of DNA-binding Protein from Sequence," Bioinformatics, vol. 35, 07/19 2018.
Q. Hu, C. Merchante, A. N. Stepanova, J. M. Alonso, and S. Heber, "A Stacking-Based Approach to Identify Translated Upstream Open Reading Frames in Arabidopsis Thaliana," in Bioinformatics Research and Applications, Cham, 2015, pp. 138-149.
S. Gattani, A. Mishra, and M. T. Hoque, "StackCBPred: A stacking based prediction of protein-carbohydrate binding sites from sequence," Carbohydr Res, vol. 486, p. 107857, Dec 1 2019.
Slide35
D. S. M. A. Aditi S. Kuchi, Dr. Minhaz F. Zibran, Dr. Mahdi Abdelguerfi, Dr. Md Tamjidul Hoque, "Detection of Sand Boils from Images using Machine Learning Approaches," ScholarWorks@UNO.
M. Flot, A. Mishra, A. S. Kuchi, and M. T. Hoque, "StackSSSPred: A Stacking-Based Prediction of Supersecondary Structure from Sequence," Methods in molecular biology (Clifton, N.J.), vol. 1958, pp. 101-122, 2019.
Z. Piserchia, Koenig, Daniel, "Applications of Genetic Algorithm in Bioinformatics," eScholarship, 2018.
Slide36
Thank You
Any Questions?