Fig. 1. Transmembrane helix.

Over the years, a number of different methods have been developed for predicting the topology of TMH proteins. In general, these methods need to predict the following items: (i) the type of each residue (e.g., helix, loop, etc.), (ii) the TMH segments, and (iii) their orientation. The various methods developed differ in the number of distinct steps that they use to predict the above items. Some methods predict each item individually, others utilize predictors that combine some of these steps, and others predict all three items in a single step.

The residue types are predicted either by relying on the fact that membrane segments contain primarily hydrophobic residues (e.g., TopPred [29]) or by utilizing machine-learning approaches (e.g., neural networks, support vector machines) using as features the amino acid sequence of the protein or evolutionary information in the form of sequence profiles (e.g., PHDhtm [25], MEMSAT3 [10], SVMTop [20]). The segments are identified by using simple hydrophobicity plots [16] to ascertain probable helical segments and then applying various rules based on the expected lengths of the TMH segments to either accept, reject, or break long segments [29, 20, 32]. The segment orientation is often determined by relying on the fact that the regions between TMH segments that are positively charged tend to reside in the intracellular regions of the membrane (positive-inside rule [29]).

The approaches that combine segment identification with orientation determination (e.g., MEMSAT3) employ dynamic programming methods to determine the different segments of a TMH protein and its orientation relative to the cytoplasm. Finally, the approaches that predict all of the above items in a single step utilize hidden Markov models (HMM) that capture the different structural components of a TMH protein (e.g., TMH segment, inside loop, outside loop, signal peptide, etc.) as separate modules. These models are trained on either the amino acid sequence of the proteins (e.g., TMHMM [26] and HMMTOP [27]) or on sequence profiles (e.g., Phobius [17]) and predict the topology of a TMH protein by determining its most probable path through that model using Viterbi decoding [22].

This paper focuses on improving the accuracy of HMM-based approaches by combining them with an SVM-based approach that predicts the types
of each residue. Specifically, we developed a TMH topology prediction algorithm, called TOPTMH, that solves the residue-type prediction, segment identification, and orientation determination in three distinct steps. The type of each residue is annotated via an SVM-based approach utilizing a window-based encoding of the residues' profile information and a second order exponential kernel function [24, 23, 12]. The segments are identified by using a pair of HMMs that model the different structural components of TMH proteins. The first HMM uses as input the SVM predictions for each residue, whereas the second HMM uses as input hydropathy information as measured by a recently introduced hydrophobicity scale [8]. Finally, the orientation of the predicted segments is determined by applying the positive-inside rule.

The advantages of this approach are three-fold. First, by using a discriminative approach to learn a residue-type prediction model, the accuracy of these predictions is higher than that obtained (indirectly) by the HMM model. Second, by encoding the protein sequences via the SVM predictions, whose signal is significantly higher than that of the raw sequence profile, the demands imposed during HMM parameter estimation are substantially reduced, allowing the HMM to better focus on learning how to correctly identify the different segments. Third, combining the outputs of the HMM models trained on the SVM predictions and on the hydrophobicity scores allows TOPTMH to correctly identify the TMH segments that have an amino acid composition that is similar to that of signal peptides.

We experimentally evaluated the performance of TOPTMH on three widely used datasets. Our evaluation was performed in two phases. First, we evaluated the gains obtained by TOPTMH by comparing it against an approach that uses a rule-based scheme to identify the TMH segments from the SVM predictions and another that uses just a single HMM model trained on the SVM predictions. Our evaluation showed that the HMM-based segment identification outperforms the rule-based approach by at least 50% in terms of the $Q_{ok}$ score, which measures per-segment accuracy, and that by combining both the SVM- and the hydrophobicity-based HMM models, a further 3%–19% improvement can be obtained. Second, we evaluated its performance by comparing it against Phobius [17] and MEMSAT3
[10]. Our evaluation showed that TOPTMH outperforms both of them across the different datasets. We also evaluated the performance of TOPTMH on an independent static benchmark [14]. The results on this blind evaluation showed that TOPTMH achieves the highest scores on high-resolution sequences ($Q_2$ score of 84% and $Q_{ok}$ score of 86%) against existing state-of-the-art systems while achieving low signal peptide error.

2 Background and Definitions

2.1 Transmembrane Helical Proteins

The structure of a typical TMH protein is shown in Figure 1. It consists of a series of helical segments passing through the cell's membrane (bilipid layer) separated by loop segments that are either on the inside or the outside side of [...]

3.1 Residue Annotation Step

We developed an SVM-based TMH residue annotation approach that uses features obtained from the protein's PSSM. Its overall structure is similar to that used by existing methods for SVM-based structural and functional annotation of protein residues using position specific scoring matrices (e.g., secondary structure for globular proteins [12], solvent accessible surface area [24], disorder prediction [24], and DNA-binding [24]).

TOPTMH formulates the residue annotation problem as a binary classification problem whose goal is to predict if a residue belongs to a helix state or not. For each residue $i$ of a protein sequence $X$, the input to the SVM is a $(2w+1)$-length subsequence (wmer) of $X$ centered at position $i$. Each wmer is represented by a vector $x_i$ of length $(2w+1) \times 20$ that is obtained by concatenating the rows of the PSSM for each position of the wmer. This wmer-based input is used for both training and prediction. The parameter $w$ determines the length of the local environment around the $i$th sequence position used while building and applying the model, and its optimal value is determined experimentally.

TOPTMH uses SVMlight [9] to learn the actual SVM model and utilizes the second order exponential function (soe) [12] as its kernel function. The soe kernel has been shown to produce better results than the traditional radial basis function (rbf) kernel for various sequence annotation prediction problems [12, 24, 23]. For a sequence, these predictions are available as a web service called MONSTER¹. In the context of TOPTMH, the soe kernel function is given by

$$K_{soe}(x_i, y_j) = \exp\left(1 + \frac{K_2(x_i, y_j)}{\sqrt{K_2(x_i, x_i)\, K_2(y_j, y_j)}}\right), \tag{1}$$

where $x_i$ and $y_j$ are the vector representations of two wmers, $K_2$ is given by

$$K_2(x_i, y_j) = \langle x_i, y_j \rangle + \langle x_i, y_j \rangle^2, \tag{2}$$

and $\langle x_i, y_j \rangle$ denotes the dot-product of the $x_i$ and $y_j$ vectors.

3.2 Segment Identification Step

In order to determine the best approach for identifying the TMH segments we developed and studied three different approaches. The first approach utilizes a simple scheme based on empirical rules and the other two predict the topology by employing hidden Markov models (HMM) [22]. The first HMM-based approach uses a single HMM based solely on the SVM scores, whereas the second uses two HMMs, one based on SVM scores and one based on hydrophobicity scales.

Rule-Based. The rule-based segment identification approach post-processes the SVM-based residue annotations and identifies the segments by applying some [...]

¹ http://bio.dtc.umn.edu/monster

Fig. 2. The layout of the HMM model used in TOPTMH.

The HMM models were built using the UMDHMM [11] package (version 1.02), which was modified to take as input annotated protein sequences. The threading of a sequence through the HMM model was done using the Viterbi [22] algorithm.

HMM Based on SVM Scores (HMM-SVM). This approach builds an HMM model that only takes into account the per-residue SVM scores produced by the annotation step. To construct the training set, the SVM score for each residue is computed. Since HMMs are primarily designed to operate on finite size alphabets, the raw SVM scores are discretized into a finite number of bins with each bin corresponding to a distinct symbol. The final training set for the HMM corresponds to a set of proteins with known TMH topology represented as sequences of SVM-score based bins. A similar SVM-based prediction followed by discretization is performed when this model is used to predict the topology of a test protein. We discretized the SVM scores into equal-size intervals, and assigned all residues with scores ≤ −3 and ≥ 3 into the first and last bin, respectively.

HMM Based on SVM Scores and Hydrophobicity Scores (HMM-SVM+HP). This model builds a pair of HMM models, one based on SVM scores (HMM-SVM) and one based on the hydrophobicity values (HMM-HP) of known TMH sequences, and combines the topology predictions from both HMM models. This approach was motivated by the fact that in certain cases, the SVM-based residue annotation may fail to identify certain hydrophobic TMH segments. This is further discussed in Section 5.

Table 1. Discretization of Hydrophobicity values.

Label  Amino Acids        HP Values (h)
1      R, E, K, D         2.5 and above
2      N, H, P, Q         1.0 to 2.5
3      T, Y, G, S         0.1 to 0.9
4      F, V, C, A, M, W   −0.4 to 0.1
5      I, L               −0.5 and below

HP Values denotes a range of hydrophobicity values decided based on [8].

The HMM-SVM model is identical to that described in the previous section. The HMM-HP model is built by first encoding the amino acids of each TMH protein as a sequence of discretized hydrophobicity values. Table 1 shows the scheme used to discretize the hydrophobicity values for each amino acid. Both the HMM-SVM and HMM-HP models are used independently to predict the TMH segments. The final set of predictions consists of the segments predicted by HMM-SVM and those segments predicted by HMM-HP that do not overlap with any of the segments of HMM-SVM. Two segments are considered to overlap if they have more than five residues in common. Since this approach combines both the SVM- and HP-based HMM models, we will refer to it as HMM-SVM+HP.

3.3 Orientation Determination Step

Once the TMH segments have been identified, their orientation relative to the N-terminus is determined by applying the positive-inside rule [29] using the technique introduced in THUMBUP [32]. In this approach, each protein is first coded into a binary sequence by assigning a one to the first protein residue and to all the arginine and lysine residues, and a zero to the remaining residues. Then, a score is computed for each loop by adding the values of its 15 neighboring residues on each side. If the total score for odd-numbered loops is greater than or equal to that of even-numbered loops, the N-terminus is on the inside of the membrane; otherwise it is on the outside.

4 Experimental Design

4.1 Datasets

We evaluated the prediction performance of the TOPTMH method on datasets used by the Phobius and MEMSAT3 methods and by participating in the static benchmark [13]. The datasets obtained from the Phobius study included a set of 247 transmembrane proteins and a set of 45 transmembrane proteins that contained signal peptide residues with transmembrane helix segments. We will denote the first dataset as TM-Only and the second as TM-SP. The dataset [...]
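The last two stages of the pipeline, combining the HMM-SVM and HMM-HP segments (Section 3.2) and applying the positive-inside rule (Section 3.3), can be sketched in Python as follows. This is a minimal illustration rather than the TOPTMH implementation: the function names, the 0-based inclusive `(start, end)` segment representation, and in particular the reading of "15 neighboring residues on each side" as the loop residues lying within 15 positions of an adjacent TMH segment are all assumptions made for the example.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # hypothetical representation: inclusive, 0-based (start, end)


def shared_residues(a: Segment, b: Segment) -> int:
    """Number of residue positions that two inclusive segments have in common."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def combine_segments(svm_segs: List[Segment], hp_segs: List[Segment],
                     max_shared: int = 5) -> List[Segment]:
    """Keep every HMM-SVM segment plus the HMM-HP segments that overlap none of them.

    Two segments overlap when they share more than `max_shared` residues.
    """
    extra = [h for h in hp_segs
             if all(shared_residues(h, s) <= max_shared for s in svm_segs)]
    return sorted(svm_segs + extra)


def n_terminus_inside(seq: str, segments: List[Segment], flank: int = 15) -> bool:
    """Positive-inside rule on a binary coding of the sequence (assumed reading)."""
    # 1 for the first residue and for every arginine (R) or lysine (K), else 0.
    code = [1 if (i == 0 or aa in "RK") else 0 for i, aa in enumerate(seq)]
    segs = sorted(segments)

    # Loops are the stretches before, between, and after the TMH segments.
    loops, prev_end = [], -1
    for start, end in segs:
        loops.append((prev_end + 1, start - 1))
        prev_end = end
    loops.append((prev_end + 1, len(seq) - 1))

    totals = [0, 0]  # [odd-numbered loops, even-numbered loops]
    for k, (lo, hi) in enumerate(loops, start=1):
        if lo > hi:
            continue  # empty loop
        idx = set()
        if k > 1:  # residues just after the preceding TMH segment
            idx.update(range(lo, min(hi, lo + flank - 1) + 1))
        if k <= len(segs):  # residues just before the following TMH segment
            idx.update(range(max(lo, hi - flank + 1), hi + 1))
        totals[(k - 1) % 2] += sum(code[i] for i in idx)
    return totals[0] >= totals[1]


if __name__ == "__main__":
    merged = combine_segments([(10, 30)], [(12, 28), (50, 70)])
    print(merged)
    print(n_terminus_inside("MKKRR" + "A" * 20 + "GGGGG", [(5, 24)]))
```

Note that under the stated rule an HMM-HP segment sharing five or fewer residues with an HMM-SVM segment is still added, so the merged list may contain segments that touch or slightly overlap.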
[...] $Q_{2T}^{\%obs}$, $Q_{2T}^{\%prd}$, and $Q_2$. $Q_{2T}^{\%obs}$ is the percentage of observed TMH residues that are predicted correctly (helix recall), $Q_{2T}^{\%prd}$ is the percentage of predicted TMH residues that are predicted correctly (helix precision), and $Q_2$ is the percentage of correctly predicted residues (both helix and non-helix).

The per-segment evaluation measures the ability of a method to correctly identify the actual TMH segments. We used three per-segment metrics denoted by $Q_{htm}^{\%obs}$, $Q_{htm}^{\%prd}$, and $Q_{ok}$. $Q_{htm}^{\%obs}$ is the percentage of observed TMH segments that are predicted correctly (TMH segment recall), $Q_{htm}^{\%prd}$ is the percentage of predicted TMH segments that are predicted correctly (TMH segment precision), and $Q_{ok}$ is the percentage of proteins for which all the TMH segments are predicted correctly. Note that $Q_{ok}$ is a very strict metric as each protein contributes either a zero or a one. In the above metrics, a predicted TMH segment is considered to be correctly identified if there is an overlap of ten residues between the predicted and observed helix segments². In addition, a predicted helix segment is counted only once. This is illustrated by considering the following examples:

Obs1:  TTTTTTTTTTTTTTTT------TTTTTTTTTTTTT
Pred1: -----TTTTTTTTTTTTTTTTTTTTTTTTTTT---
Obs2:  ---TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT--
Pred2: TTTTTTTTTTTTTT------TTTTTTTTTTTTTTT

In this example, Obs1 and Pred1 are the observed and predicted TMH segments for a particular protein sequence. During evaluation, the second segment of the Obs1 sequence will not be considered as correctly predicted, since the only segment predicted in Pred1 is already accounted for by the first segment of the Obs1 sequence. On the other hand, the second segment of the Pred2 sequence will be considered as incorrectly predicted, since the first segment of Pred2 is already matched to the only segment in the Obs2 sequence.

Although the per-residue measures capture the accuracy with which a method predicts the annotation label for a residue, they are not able to assess the ability of the method to identify the TMH segments separated by loop regions of different lengths. Hence, TMH prediction algorithms are mostly evaluated using per-segment metrics.

5 Results

5.1 Residue Annotation Performance

The performance achieved by the SVM-based residue annotation for different values of $w$ is shown in Table 2. This table shows the per-residue performance metrics ($Q_2$, $Q_{2T}^{\%obs}$ and $Q_{2T}^{\%prd}$) for a subset of the TM-Only dataset. We observe that in terms of the various metrics, the performance achieved for different [...]

² Earlier techniques used an overlap of only three [3] or five [17] residues, which is too short and can artificially inflate the performance of a scheme.

Table 3. TMH Segment Identification Performance.

               TM-SP                            TM-Only
Per-Residue Scores
Method         Q_2        Q_2T^%obs  Q_2T^%prd  Q_2        Q_2T^%obs  Q_2T^%prd
Raw-SVM        96.73      71.10      86.60      90.64      84.30      83.10
Rule           95.16      59.56      95.89      89.19      79.65      87.36
HMM-SVM-D5     96.28      76.39      84.87      89.40      85.54      82.25
HMM-SVM-D7     96.45      76.85      87.72      89.34      85.61      82.23
HMM-SVM-D12    96.24      77.56      84.45      89.31      86.13      81.35
HMM-SVM-D7+HP  97.08      84.80      88.50      89.46      86.21      82.04

Per-Segment Scores
Method         Q_ok       Q_htm^%obs Q_htm^%prd Q_ok       Q_htm^%obs Q_htm^%prd
Raw-SVM        35.55      85.23      70.09      38.86      94.34      74.33
Rule           64.44      75.00      100.00     70.85      92.88      94.96
HMM-SVM-D5     64.44      84.09      87.05      71.66      95.39      93.73
HMM-SVM-D7     71.11      85.23      92.59      72.06      95.63      93.52
HMM-SVM-D12    60.00      85.22      85.22      70.04      95.80      92.87
HMM-SVM-D7+HP  84.44      93.18      93.18      73.68      96.12      93.33

[...] to predict correctly large contiguous portions of each helical segment. On the other hand, the per-segment performance achieved by the other segment identification approaches is considerably higher. Both the rule- and HMM-based approaches are able to significantly improve over Raw-SVM for both the TM-SP and TM-Only datasets. Among them, the approaches based on HMM-SVM outperform the rule-based approach by 2%–12%, even though the latter achieved the highest $Q_{htm}^{\%prd}$ scores (100% and 96.44% for TM-SP and TM-Only, respectively).

The overall best $Q_{ok}$ results were obtained by the HMM-SVM-D7+HP approach. In particular, the $Q_{ok}$ values achieved by HMM-SVM-D7+HP are 19% and 3% better than those of the next best performing scheme (HMM-SVM-D7) on the TM-SP and TM-Only datasets, respectively. The large performance advantage of HMM-SVM-D7+HP over HMM-SVM-D7 on the TM-SP dataset is primarily due to increases in recall ($Q_{htm}^{\%obs}$). HMM-SVM-D7+HP achieves a $Q_{htm}^{\%obs}$ of 93.18% compared to the 85.23% achieved by HMM-SVM-D7. A possible explanation for the relatively poor performance of HMM-SVM-D7 is that, due to the signal peptide segments present in some of the sequences in the TM-SP dataset, the SVM model fails to identify some of the TMH residues. However, these residues can be correctly identified when hydrophobicity scores are considered, and as such the combined HMM-SVM-D7+HP approach leads to better overall results.
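The per-segment scores reported in Table 3 depend on the counting rules given in the evaluation section: a ten-residue overlap is required, and each predicted segment is counted only once. The Python sketch below shows one way to compute them from helix/non-helix label strings in which `T` marks a helix residue. It is an illustration under assumptions, not the benchmark's scoring code: the function names are hypothetical, matching is done greedily in sequence order, and a protein is counted toward $Q_{ok}$ only when every observed segment is matched and no predicted segment is left unmatched.

```python
import re
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # inclusive, 0-based (start, end)


def helix_segments(labels: str) -> List[Segment]:
    """Extract the maximal runs of 'T' from a label string such as '--TTTT--'."""
    return [(m.start(), m.end() - 1) for m in re.finditer(r"T+", labels)]


def count_matches(obs: List[Segment], prd: List[Segment], min_overlap: int = 10) -> int:
    """Greedy one-to-one matching; each predicted segment is used at most once."""
    used = set()
    matched = 0
    for o in obs:
        for j, p in enumerate(prd):
            if j in used:
                continue
            shared = min(o[1], p[1]) - max(o[0], p[0]) + 1
            if shared >= min_overlap:
                used.add(j)
                matched += 1
                break
    return matched


def per_segment_scores(pairs: List[Tuple[str, str]],
                       min_overlap: int = 10) -> Dict[str, float]:
    """Compute segment recall, segment precision, and Q_ok over (observed, predicted) pairs."""
    n_obs = n_prd = n_hit = n_ok = 0
    for obs_labels, prd_labels in pairs:
        obs, prd = helix_segments(obs_labels), helix_segments(prd_labels)
        hit = count_matches(obs, prd, min_overlap)
        n_obs += len(obs)
        n_prd += len(prd)
        n_hit += hit
        # Assumed reading of Q_ok: all observed found and no spurious predictions.
        n_ok += int(hit == len(obs) == len(prd))
    return {
        "Q_htm_obs": 100.0 * n_hit / n_obs if n_obs else 0.0,
        "Q_htm_prd": 100.0 * n_hit / n_prd if n_prd else 0.0,
        "Q_ok": 100.0 * n_ok / len(pairs) if pairs else 0.0,
    }


if __name__ == "__main__":
    # Two proteins patterned after the Obs1/Pred1 and Obs2/Pred2 examples.
    pairs = [
        ("T" * 16 + "-" * 6 + "T" * 13, "-" * 5 + "T" * 27 + "-" * 3),
        ("-" * 3 + "T" * 30 + "-" * 2, "T" * 14 + "-" * 6 + "T" * 15),
    ]
    print(per_segment_scores(pairs))
```

In the first pair the single long predicted helix is matched to the first observed helix and cannot be reused for the second; in the second pair the second predicted helix has no observed helix left to match. Neither protein counts toward $Q_{ok}$, which illustrates why that metric is so strict.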
[...] with both correct topology and location than TOPTMH (147 vs 134). We believe that this is primarily because the binary classification of the protein sequences into helix and non-helix residues meant that TOPTMH was not able to effectively differentiate between inside and outside loops and thus could not match the performance of MEMSAT3.

TOPTMH Performance on the Static Benchmark. The performance of TOPTMH on the static benchmark is shown in Table 6. The TOPTMH results shown in this table correspond to the results obtained using the HMM-SVM-D7+HP topology prediction approach. From these results we see that TOPTMH achieved the highest $Q_{ok}$ score of 86% for the high-resolution sequences and the highest $Q_2$ scores of 84% and 90% for the high- and low-resolution sequences, respectively. Moreover, TOPTMH has performed about 7% better in TMH prediction than both MEMSAT3 and Phobius. Note that even though HMMTOP2 achieved $Q_{htm}^{\%obs}$ and $Q_{htm}^{\%prd}$ scores that were higher than the corresponding scores achieved by TOPTMH, its $Q_{ok}$ score is lower than that achieved by TOPTMH. This is because, even though HMMTOP2 identified more TMH segments in total than TOPTMH, it was not as successful in predicting proteins for which all of the TMH segments were identified correctly.

Table 6. TMH Benchmark Results.

             High Resolution Accuracy                                          Low Resolution Accuracy
             Per-segment                      Per-residue                      Per-segment                      Per-residue
Method       Q_ok       Q_htm^%obs Q_htm^%prd Q_2        Q_2T^%obs  Q_2T^%prd  Q_ok       Q_htm^%obs Q_htm^%prd Q_2        Q_2T^%obs  Q_2T^%prd
TOPTMH       86         95         96         84         75         90         66         92         88         90         84         80
PHDpsihtm08  84         99         98         80         76         83         67         95         94         89         87         77
HMMTOP2      83         99         99         80         69         89         66         94         93         90         85         83
MEMSAT3      80         98         97         83         78         88         63         92         87         88         86         76
Phobius      80         92         93         80         69         84         65         90         88         90         81         79
DAS          79         99         96         72         48         94         39         93         81         86         65         85
TopPred2     75         90         90         77         64         83         48         84         79         88         74         71
TMHMM1       71         90         90         80         68         81         72         91         92         90         83         80
SOSUI        71         88         86         75         66         74         49         88         86         88         79         72
PHDhtm07     69         83         81         78         76         82         56         85         86         87         83         75

Results for TOPTMH and MEMSAT3 were obtained by collecting predictions for the test set of the TMH static benchmark [13] and submitting the results to the benchmark server. Phobius [17] predictions were collected by loading the benchmark test sequences to the Phobius web server [13] and submitting the output to the benchmark server. All the other results were provided by the TMH static benchmark evaluation web-site.

15. T. Klabunde and G. Hessler. Drug design strategies for targeting G-protein-coupled receptors. ChemBioChem, 3:928–944, 2002.
16. J. Kyte and R. F. Doolittle. A simple method for displaying the hydropathic character of a protein. Journal of Molecular Biology, 157(1):105–132, 1982.
17. L. Käll, A. Krogh, and E. L. L. Sonnhammer. A combined transmembrane topology and signal peptide prediction method. Journal of Molecular Biology, 338:1027–1036, 2004.
18. L. Käll and E. L. L. Sonnhammer. Reliability of transmembrane predictions in whole-genome data. FEBS Lett., 532(3):415–418, 2002.
19. J. Liu and B. Rost. Comparing function and structure between entire proteomes. Protein Sci., 10:1970–1979, 2001.
20. Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung, Ping-Chiang Lyu, and Wen-Lian Hsu. Enhanced membrane protein topology prediction using a hierarchical classification method and a new scoring function. J Proteome Res, 7(2):487–496, Feb 2008.
21. Amit Oberai, Yungok Ihm, Sanguk Kim, and James U Bowie. A limited universe of membrane protein families and folds. Protein Sci, 15(7):1723–1734, Jul 2006.
22. L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, volume 77, pages 257–286, 1989.
23. Huzefa Rangwala and George Karypis. frmsdpred: Predicting local rmsd between structural fragments using sequence information. Proteins, Feb 2008.
24. Huzefa Rangwala, Christopher Kauffman, and George Karypis. A generalized framework for protein sequence annotation. In Proceedings of the NIPS Workshop on Machine Learning in Computational Biology, 2007.
25. B. Rost, P. Fariselli, and R. Casadio. Topology prediction for helical transmembrane proteins at 86% accuracy. Protein Sci, 5(8):1704–1718, Aug 1996.
26. E. L. L. Sonnhammer, G. von Heijne, and A. Krogh. A hidden Markov model for predicting transmembrane helices in protein sequences. In Proceedings of the Sixth International Conference on Intelligent Systems for Molecular Biology, pages 175–82, 1998.
27. G. E. Tusnády and I. Simon. Principles governing amino acid composition of integral membrane proteins: application to topology prediction. Journal of Molecular Biology, 283(2):489–506, 1998.
28. G. E. Tusnády and I. Simon. The HMMTOP transmembrane topology prediction server. Bioinformatics, 17(9):849–850, 2001.
29. G. von Heijne. Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. Journal of Molecular Biology, 225(2):487–494, 1992.
30. Gunnar von Heijne. Formation of transmembrane helices in vivo - is hydrophobicity all that matters? The Journal of General Physiology, 129(5):353–356, 2007.
31. E. Wallin and G. von Heijne. Genome-wide analysis of integral membrane proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci, 7(4):1029–38, 1998.
32. H. Zhou and Y. Zhou. Predicting the topology of transmembrane helical proteins using mean burial propensity and a hidden-Markov-model-based method. Protein Sci, 12:1547–1555, 2003.