Classification of Prostate Cancer Patients into Indolent and Aggressive Using Machine Learning

Uploaded On 2020-10-06

Presentation Transcript

Slide1

Classification of Prostate Cancer Patients into Indolent and Aggressive Using Machine Learning

Presented By: Yashwanth Karthik Kumar Mamidi

Committee Members: Dr. Minhaz Zibran, Dr. Christopher M Summa

Mentors/Supervisors: Dr. Md Tamjidul Hoque, Dr. Chindo Hicks

MS Thesis Defense Presentation

Slide2

Contents

Introduction

Background

Objective and Hypothesis

Classification Algorithm (Gleason Score)

Prostate Adenocarcinoma

Project Design and Execution

Results

Conclusions and Future Work

References

Slide3

Introduction

The prostate is a gland located below the bladder and in front of the rectum [1].

Prostate cancer is most common among older men and rare in men younger than 40.

Symptoms:

Problems passing urine.

Low back pain.

Pain with ejaculation.

Blood in the urine or semen.

Ref:

[1] M. T. H. Information, "Prostate Cancer," https://medlineplus.gov/prostatecancer.html, U.S. National Library of Medicine.

Slide4

Background

Prostate cancer is the most commonly diagnosed cancer and the second leading cause of cancer-related death in men in the US [1].

In 2019, an estimated 174,650 men were diagnosed with prostate cancer and 31,620 died of the disease in the US (ACS) [2].

Prostate-specific antigen (PSA) screening has enabled early detection of prostate cancer and a reduction in mortality rates [3].

However, PSA screening has had the unintended consequence of overdiagnosis and overtreatment [3].

A critical problem faced by clinicians and oncologists is stratifying patients into those with indolent and those with aggressive tumors using Gleason score [4].

Ref:

[1] S. Rodney, T. T. Shah, H. R. Patel, and M. Arya, "Key papers in prostate cancer," Expert Rev Anticancer Ther, vol. 14, pp. 1379-84, Nov 2014.

[2] A. C. Society, "Cancer Facts & Figures 2019," Myriad Prolaris prostate cancer, https://prolaris.com/2019/04/17/2019-prostate-cancer-statistics/, 2019.

[3] A. Lavi and M. Cohen, "[PROSTATE CANCER EARLY DETECTION USING PSA - CURRENT TRENDS AND RECENT UPDATES]," Harefuah, vol. 156, pp. 185-188, Mar 2017.

[4] K. Lin, J. M. Croswell, H. Koenig, C. Lam, and A. Maltz, "U.S. Preventive Services Task Force Evidence Syntheses, formerly Systematic Evidence Reviews," in Prostate-Specific Antigen-Based Screening for Prostate Cancer: An Evidence Update for the U.S. Preventive Services Task Force, ed Rockville (MD): Agency for Healthcare Research and Quality (US), 2011.

Slide5

Classification Algorithm (Gleason Score)

The most important predictor of mortality is the Gleason Score [1, 2].

Gleason score ranges from 6 to 10.

Ref:

[1] J. I. Epstein, W. C. Allsbrook, Jr., M. B. Amin, and L. L. Egevad, "The 2005 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma," Am J Surg Pathol, vol. 29, pp. 1228-42, Sep 2005.

[2] J. I. Epstein, L. Egevad, M. B. Amin, B. Delahunt, J. R. Srigley, and P. A. Humphrey, "The 2014 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma: Definition of Grading Patterns and Proposal for a New Grading System," Am J Surg Pathol, vol. 40, pp. 244-52, Feb 2016.

Slide6

Objective

To develop and apply machine learning to stratify patients into indolent and aggressive groups with high accuracy.

Hypothesis

Patients diagnosed as indolent or aggressive under the current classification protocol (Gleason Score) may show measurable changes that distinguish the two patient groups.

We hypothesize that a machine learning model trained on existing information (GS: 6, 8, 9, 10) can correctly classify patients with GS: 7.

Slide7

PROSTATE ADENOCARCINOMA

Downloaded and extracted data from GDC-TCGA [1].

Gene-probes: 60483

Samples: 547

Indolent = 190 (GS: 6 and 7 (3+4))

Aggressive = 305 (GS: 7 (4+3), 8, 9, 10)

Normal = 52

We kept only the gene-probes with values greater than 0 in at least ~30% (165) of the samples.

Resulting gene-probes: 34956.

Ref:

[1] N. C. Institute, "GDC Data Portal," https://portal.gdc.cancer.gov/.
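The grouping rule and the probe filter on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `gleason_group` and `filter_probes` and the dict-of-lists data layout are hypothetical and do not come from the thesis code.

```python
def gleason_group(primary, secondary):
    """Map a (primary, secondary) Gleason pattern pair to the study's two groups.

    Rule taken from the slide: GS 6 and 7 (3+4) are indolent;
    GS 7 (4+3), 8, 9 and 10 are aggressive.
    """
    score = primary + secondary
    if score == 6 or (primary, secondary) == (3, 4):
        return "indolent"
    if (primary, secondary) == (4, 3) or 8 <= score <= 10:
        return "aggressive"
    raise ValueError(f"Gleason pattern {primary}+{secondary} is not covered by the slide")


def filter_probes(expression, min_fraction=0.30):
    """Keep probes whose value is > 0 in at least `min_fraction` of the samples.

    `expression` is a hypothetical layout: probe id -> list of values, one per sample.
    """
    kept = []
    for probe_id, values in expression.items():
        n_expressed = sum(1 for value in values if value > 0)
        if n_expressed >= min_fraction * len(values):
            kept.append(probe_id)
    return kept


# Toy data: four samples, three probes.
toy = {"probe_a": [0, 0, 0, 1], "probe_b": [1, 2, 0, 0], "probe_c": [0, 0, 0, 0]}
kept_probes = filter_probes(toy)
```

With 547 samples, a 30% cutoff is 164.1, so a probe needs at least 165 non-zero values, which matches the "(165)" quoted on the slide.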

Slide8

Assumptions

Model-1

We assume that the Gleason score correctly classifies all Prostate Cancer (PCa) patients. We used differential expression analysis and ran classification algorithms to check the classification accuracy and misclassification rates.

Model-2

In Model-1, we observed that a few samples were misclassified, and we assume that samples with Gleason score 7 are misclassified when split by primary and secondary GS. We used differential expression analysis and ran classification algorithms on samples with Gleason score 6, 8, 9, and 10 to check the classification accuracy and misclassification rates.

Slide9

A barplot of the unnormalized library sizes of all samples is used to assess data quality.

The data were normalized using TMM before applying machine learning [1].

Ref:

[1] M. D. Robinson and A. Oshlack, "A scaling normalization method for differential expression analysis of RNA-seq data," Genome Biology, vol. 11, p. R25, 2010.

[Figure: barplot of library size per sample]
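To make "library size" concrete, the sketch below computes it per sample and applies plain counts-per-million scaling. Note that this is not TMM: TMM additionally estimates a per-sample scaling factor from a trimmed mean of log expression ratios, which is omitted here. The function names and the dict-of-lists layout are assumptions made for illustration.

```python
def library_sizes(counts):
    """Total read count per sample. `counts` maps sample id -> raw counts per probe."""
    return {sample: sum(values) for sample, values in counts.items()}


def counts_per_million(counts):
    """Scale each sample by its own library size (simple CPM, not TMM)."""
    sizes = library_sizes(counts)
    scaled = {}
    for sample, values in counts.items():
        size = sizes[sample]
        # An empty library would divide by zero, so it is left as all zeros.
        scaled[sample] = [value * 1e6 / size if size else 0.0 for value in values]
    return scaled


# Two samples with the same composition but a tenfold difference in depth.
raw = {"s1": [1, 3], "s2": [10, 30]}
sizes = library_sizes(raw)
cpm = counts_per_million(raw)
```

After scaling, the two toy samples have identical profiles, which is the effect a library-size barplot is meant to motivate.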

Slide10

Project Design and Execution


Slide11

Fold change is defined as the ratio between two quantities.

It is used to analyze multiple measurements of a biological system taken at different times.

A change described by the ratio between two points is easier to interpret than the difference.

Volcano Plot
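As a small worked example of the fold-change definition, the sketch below computes a log2 fold change between two groups and the usual volcano-plot coordinates (log2 fold change on the x-axis, -log10 of the p-value on the y-axis). The pseudocount and the function names are assumptions for illustration; the slides do not state how zero values were handled.

```python
import math
from statistics import mean


def log2_fold_change(group_a, group_b, pseudocount=1.0):
    """log2 of mean(group_b) over mean(group_a); the pseudocount avoids division by zero."""
    return math.log2((mean(group_b) + pseudocount) / (mean(group_a) + pseudocount))


def volcano_point(lfc, p_value):
    """Coordinates of one probe on a volcano plot: (log2 fold change, -log10 p-value)."""
    return (lfc, -math.log10(p_value))


# A probe with mean 1 in one group and mean 3 in the other: (3 + 1) / (1 + 1) = 2-fold.
lfc = log2_fold_change([1, 1], [3, 3])
point = volcano_point(lfc, 0.01)
```

A doubling gives an LFC of 1 and a halving gives -1, which is why the log scale makes up- and down-regulation symmetric.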

Slide12

Model-1

TCGA-Data (T = 495 and N = 52)

Indolent Vs Normal (I = 190 and N = 52)

Aggressive Vs Normal (A = 305 and N = 52)

Probes associated with 2 diseases

Indolent Vs Aggressive (Feature Selection)

Machine Learning

Classifiers Used:

Stochastic Gradient Descent (SGD)

Sequential Minimal Optimization (SMO)

SimpleLogistic

Logistic Model Trees (LMT)

MultiClassClassifier

Slide13

Samples with GS: 6, 7 (3+4) Vs 8-10, 7 (4+3)

Applied Differential Expression (D.E)

Indolent Vs Normal: 18,215 probes (p-value < 0.05)

Aggressive Vs Normal: 21,042 probes (p-value < 0.05)

16,317 common probes are removed.

6,623 unique probes

Indolent Vs Aggressive: 2074 probes with p-value < 0.05
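The probe counts on this slide are consistent with reading "unique probes" as the probes that are significant in exactly one of the two comparisons, i.e. a set symmetric difference. That reading is an assumption; the sketch below only checks that the arithmetic works out and shows the set operation on hypothetical probe ids.

```python
def unique_probes(indolent_vs_normal, aggressive_vs_normal):
    """Probes significant in exactly one comparison (symmetric difference)."""
    return set(indolent_vs_normal) ^ set(aggressive_vs_normal)


# Counts quoted on the slide.
n_indolent, n_aggressive, n_common = 18215, 21042, 16317

# Removing the common probes from each list and pooling what is left:
n_unique = (n_indolent - n_common) + (n_aggressive - n_common)

# Toy illustration with made-up probe ids.
toy_unique = unique_probes({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
```

Here 1,898 probes remain from the indolent comparison and 4,725 from the aggressive comparison, giving the 6,623 unique probes reported.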

Slide14

PCA representation

Normalized data: 2074 probes, 495 samples (Normals removed)

All samples except Normals

Indolent: Red

Aggressive: Blue

Slide15

Model-2

TCGA-Data (Removed Samples - GS: 7)

Indolent Vs Normal

Aggressive Vs Normal

Unique Probes

Indolent Vs Aggressive

Machine Learning

Classify GS = 7 samples

Classifiers Used:

Stochastic Gradient Descent (SGD)

Sequential Minimal Optimization (SMO)

SimpleLogistic

Logistic Model Trees (LMT)

MultiClassClassifier

Slide16

Samples with GS: 6 Vs 8 -10

Applied Differential Expression (D.E)

Indolent Vs Normal:

15,105-probes (p-value < 0.05)

Aggressive Vs Normal:

20,712-probes (p-value < 0.05)

12,985 common probes are removed.

9,848-unique probes

Indolent Vs Aggressive:

3513- probes with p-value <0.05


Slide17

PCA representation

Normalized data (samples without GS: 7): 3513 gene probes, 249 samples (Normals removed)

All samples with GS: 6, 8, 9, 10

Indolent: Red

Aggressive: Blue

Slide18

Results

[Figure: Accuracy versus LFC values. LFC: Log-fold change]

Slide19

[Figure: Accuracy versus LFC values. LFC: Log-fold change]

Slide20

PCA representation

Normalized data (samples with GS: 7 (3+4 Vs 4+3)): 3513 probes, 246 samples (Normals removed)

Indolent: Red

Aggressive: Blue

Slide21

Number of Samples: 249

[Figure: Accuracy versus LFC values. LFC: Log-fold change]

Slide22

[Figure: Accuracy versus LFC values. LFC: Log-fold change]

Slide23

PCA of all samples

[Figure panels: LMT, MultiClassClassifier, SGD, SimpleLogistic, SMO]

Indolent: Red

Aggressive: Blue

Slide24

PCA of all the samples with GS: 6 and 8-10

[Figure panels: LMT, MultiClassClassifier, SGD, SimpleLogistic, SMO]

Indolent: Red

Aggressive: Blue

Slide25

Performance of various Classifiers (Samples with Gleason Score: 6, 7, 8, 9, and 10)

Model type and Description          Sensitivity  Specificity  Accuracy  Precision  F1 Score  MCC      Balanced Accuracy
Support Vector Machine (SVM)        0.91351      0.68800      0.85657   0.89655    0.90495   0.61332  0.80076
Logistic Regression (LogReg)        0.84865      0.67200      0.80404   0.88451    0.86621   0.50225  0.76032
Random Decision Forest (RDF)        0.92432      0.51200      0.82020   0.84864    0.88486   0.48733  0.71286
Extra Tree Classifier (ETC)         0.92703      0.48800      0.81616   0.84275    0.88288   0.84275  0.71946
Gradient Boosting Classifier (GBC)  0.91081      0.54400      0.81818   0.85533    0.88220   0.49032  0.71941
K nearest neighbor (KNN)            0.85676      0.59200      0.78990   0.86141    0.85908   0.44642  0.72438
eXtreme Gradient Boosting (XGBC)    0.90210      0.66892      0.85051   0.88761    0.88914   0.60723  0.79830
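For readers who want to see how the seven columns relate, the sketch below derives them from a binary confusion matrix. The counts used in the example (TP=338, FN=32, TN=86, FP=39) are not reported in the slides; they are one set of counts inferred to be consistent with the SVM row, and the function name is hypothetical.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts.

    Assumes each class and each predicted label occurs at least once,
    so none of the ratios below divides by zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    balanced_accuracy = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
        "balanced_accuracy": balanced_accuracy,
    }


# Inferred counts (an assumption) that reproduce the SVM row to five decimals.
svm_like = binary_metrics(tp=338, fn=32, tn=86, fp=39)
```

Balanced accuracy is the mean of sensitivity and specificity, which is why it sits well below plain accuracy whenever specificity is low.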

Slide26

Genetic Algorithm

It is a stochastic search algorithm, which gradually refines solutions through natural selection rather than manually designing a search strategy [1].

It is used to search a space of potential solutions to find one which solves the problem.

Features and Fitness values

Gene-probes: 1020 (Samples with GS: 6, 7, 8, 9, 10), Fitness value: 1.6140201

Gene-probes: 1681 (Samples with GS: 6, 8, 9, 10), Fitness value: 1.74722

Ref:

[1] Z. Piserchia, Koenig, Daniel, "Applications of Genetic Algorithm in Bioinformatics," eScholarship, 2018.
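A minimal genetic-algorithm sketch for feature selection is given below, assuming a bit-mask encoding where each bit says whether a gene-probe is kept. The fitness function here is a toy stand-in; the slides do not define how their fitness values were computed, and all names and parameter defaults are hypothetical.

```python
import random


def genetic_search(fitness, n_features, pop_size=20, generations=30,
                   mutation_rate=0.05, seed=0):
    """Evolve bit masks (1 = feature kept) towards higher fitness."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]        # selection: keep the fitter half
        children = [ranked[0][:]]                # elitism: carry the best mask forward
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)   # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


# Toy fitness (an assumption): reward three "informative" features, penalize mask size.
INFORMATIVE = (0, 2, 5)


def toy_fitness(mask):
    return sum(mask[i] for i in INFORMATIVE) - 0.1 * sum(mask)


best_mask = genetic_search(toy_fitness, n_features=10)
```

In a real setting the fitness would typically be a cross-validated classifier score on the selected probes, which makes each generation much more expensive than in this toy.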

Slide27

Stacking

Stacking is an ensemble method that obtains information from multiple different models and aggregates it to form a new model [1, 6].

This reduces the generalization error and yields more accurate results.

Two stages of learners in stacking:

Base Classifiers: more than one classifier is used.

Meta Classifier: only one classifier is used.

The generalization error is reduced by combining the prediction probabilities from the base classifiers using a meta-classifier.

Ref:

[1] S. Iqbal and M. Hoque, "PBRpredict-Suite: A Suite of Models to Predict Peptide Recognition Domain Residues from Protein Sequence," Bioinformatics (Oxford, England), vol. 34, 05/03 2018.

[2] A. Mishra, P. Pokhrel, and M. Hoque, "StackDPPred: A Stacking based Prediction of DNA-binding Protein from Sequence," Bioinformatics, vol. 35, 07/19 2018.

[3] Q. Hu, C. Merchante, A. N. Stepanova, J. M. Alonso, and S. Heber, "A Stacking-Based Approach to Identify Translated Upstream Open Reading Frames in Arabidopsis Thaliana," in Bioinformatics Research and Applications, Cham, 2015, pp. 138-149.

[4] S. Gattani, A. Mishra, and M. T. Hoque, "StackCBPred: A stacking based prediction of protein-carbohydrate binding sites from sequence," Carbohydr Res, vol. 486, p. 107857, Dec 1 2019.

[5] D. S. M. A. Aditi S. Kuchi, Dr. Minhaz F. Zibran, Dr. Mahdi Abdelguerfi, Dr. Md Tamjidul Hoque, "Detection of Sand Boils from Images using Machine Learning Approaches," ScholarWorks@UNO.

[6] M. Flot, A. Mishra, A. S. Kuchi, and M. T. Hoque, "StackSSSPred: A Stacking-Based Prediction of Supersecondary Structure from Sequence," Methods in molecular biology (Clifton, N.J.), vol. 1958, pp. 101-122, 2019.
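The two-stage structure described above can be sketched with scikit-learn's `StackingClassifier`. This is an illustration under assumptions, not the thesis implementation: the data are synthetic (`make_classification`), the hyperparameters are defaults or arbitrary, and the choice of LogReg, KNN, and SVM as base learners with an SVM meta-classifier simply mirrors one configuration named in the results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for an expression matrix (samples x probes) with binary labels.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Stage 1: several base classifiers, each producing class probabilities.
base_learners = [
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ("svm", make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))),
]

# Stage 2: a single meta-classifier trained on the base learners'
# out-of-fold predicted probabilities (cv=5 controls the folds).
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=SVC(random_state=0),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
```

Training the meta-classifier on out-of-fold predictions, rather than on predictions for data the base learners have already seen, is what keeps the second stage from simply memorizing the first stage's training fit.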

Slide28

Performance of Stacking Methods (Samples with Gleason Score: 6, 7, 8, 9, and 10)

Model type and Description                               Sensitivity  Specificity  Accuracy  Precision  F1 Score  MCC      Balanced Accuracy
LogReg, KNN, SVM as Base, SVM as Meta-classifier         0.99198      0.85124      0.95758   0.95373    0.97248   0.88337  0.92161
LogReg, SVM, KNN, XGBC as Base, XGBC as Meta-classifier  0.95989      0.83471      0.92929   0.94723    0.95352   0.80618  0.87295
LogReg, KNN, SVM as Base, XGBC as Meta-classifier        0.972        0.8368       0.93131   0.949      0.961     0.8369   0.90690

Slide29

Performance of Stacking Methods (Samples with Gleason Score: 6, 8, 9, and 10)

Model type and Description                               Sensitivity  Specificity  Accuracy  Precision  F1 Score  MCC      Balanced Accuracy
LogReg, KNN, SVM as Base, SVM as Meta-classifier         0.98182      0.89655      0.97189   0.98630    0.98405   0.86558  0.93918
LogReg, SVM, KNN, XGBC as Base, XGBC as Meta-classifier  0.95127      0.7911       0.93812   0.96976    0.95991   0.71929  0.90869
LogReg, KNN, SVM as Base, XGBC as Meta-classifier        0.95909      0.79310      0.94779   0.97235    0.96568   0.72100  0.91513

Slide30


PCA of All Samples

Indolent: Red

Aggressive: Blue

Before Classification

After Classification

Slide31

PCA of All Samples except GS: 7

Indolent: Red

Aggressive: Blue

Before Classification

After Classification

Slide32

Conclusions and Future Work

Results show that samples with GS: 7 (3+4 and 4+3) are misclassified more often than samples with GS: 6, 8, 9, 10.

The current classification protocol can misclassify PCa patients, while ML algorithms can accurately classify most PCa patients as indolent or aggressive.

We implemented a stacking-based machine learning technique to increase prediction accuracy.

The highest accuracy obtained from individual classifiers is ~86%; using stacking, we improved the accuracy to ~96%.

To improve accuracy further, mutation-based or methylation-based analysis can be implemented.

Slide33

References:

M. T. H. Information, "Prostate Cancer," https://medlineplus.gov/prostatecancer.html, U.S. National Library of Medicine.

S. Rodney, T. T. Shah, H. R. Patel, and M. Arya, "Key papers in prostate cancer," Expert Rev Anticancer Ther, vol. 14, pp. 1379-84, Nov 2014.

A. C. Society, "Cancer Facts & Figures 2019," Myriad Prolaris prostate cancer, https://prolaris.com/2019/04/17/2019-prostate-cancer-statistics/, 2019.

A. Lavi and M. Cohen, "[PROSTATE CANCER EARLY DETECTION USING PSA - CURRENT TRENDS AND RECENT UPDATES]," Harefuah, vol. 156, pp. 185-188, Mar 2017.

K. Lin, J. M. Croswell, H. Koenig, C. Lam, and A. Maltz, "U.S. Preventive Services Task Force Evidence Syntheses, formerly Systematic Evidence Reviews," in Prostate-Specific Antigen-Based Screening for Prostate Cancer: An Evidence Update for the U.S. Preventive Services Task Force, ed Rockville (MD): Agency for Healthcare Research and Quality (US), 2011.

J. I. Epstein, W. C. Allsbrook, Jr., M. B. Amin, and L. L. Egevad, "The 2005 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma," Am J Surg Pathol, vol. 29, pp. 1228-42, Sep 2005.

J. I. Epstein, L. Egevad, M. B. Amin, B. Delahunt, J. R. Srigley, and P. A. Humphrey, "The 2014 International Society of Urological Pathology (ISUP) Consensus Conference on Gleason Grading of Prostatic Carcinoma: Definition of Grading Patterns and Proposal for a New Grading System," Am J Surg Pathol, vol. 40, pp. 244-52, Feb 2016.

Slide34

N. C. Institute, "GDC Data Portal," https://portal.gdc.cancer.gov/.

M. D. Robinson and A. Oshlack, "A scaling normalization method for differential expression analysis of RNA-seq data," Genome Biology, vol. 11, p. R25, 2010.

S. Iqbal and M. Hoque, "PBRpredict-Suite: A Suite of Models to Predict Peptide Recognition Domain Residues from Protein Sequence," Bioinformatics (Oxford, England), vol. 34, 05/03 2018.

A. Mishra, P. Pokhrel, and M. Hoque, "StackDPPred: A Stacking based Prediction of DNA-binding Protein from Sequence," Bioinformatics, vol. 35, 07/19 2018.

Q. Hu, C. Merchante, A. N. Stepanova, J. M. Alonso, and S. Heber, "A Stacking-Based Approach to Identify Translated Upstream Open Reading Frames in Arabidopsis Thaliana," in Bioinformatics Research and Applications, Cham, 2015, pp. 138-149.

S. Gattani, A. Mishra, and M. T. Hoque, "StackCBPred: A stacking based prediction of protein-carbohydrate binding sites from sequence," Carbohydr Res, vol. 486, p. 107857, Dec 1 2019.

Slide35

D. S. M. A. Aditi S. Kuchi, Dr. Minhaz F. Zibran, Dr. Mahdi Abdelguerfi, Dr. Md Tamjidul Hoque, "Detection of Sand Boils from Images using Machine Learning Approaches," ScholarWorks@UNO.

M. Flot, A. Mishra, A. S. Kuchi, and M. T. Hoque, "StackSSSPred: A Stacking-Based Prediction of Supersecondary Structure from Sequence," Methods in molecular biology (Clifton, N.J.), vol. 1958, pp. 101-122, 2019.

Z. Piserchia, Koenig, Daniel, "Applications of Genetic Algorithm in Bioinformatics," eScholarship, 2018.

Slide36

Thank You

Any Questions?