Journal of Machine Learning Research 5 (2004) 1435–1455
Submitted 1/04; Revised 7/04; Published 11/04

Fast String Kernels using Inexact Matching for Protein Sequences

Christina Leslie (CLESLIE@CS.COLUMBIA.EDU)
Center for Computational Learning Systems
Columbia University
New York, NY 10115, USA

Rui Kuang (RKUANG@CS.COLUMBIA.EDU)
Department of Computer Science
Columbia University
New York, NY 10027, USA

Editor: Kristin Bennett

©2004 Christina Leslie and Rui Kuang.

Abstract

We describe several families of k-mer based string kernels related to the recently presented mismatch kernel and designed for use with support vector machines (SVMs) for classification of protein sequence data. These new kernels (restricted gappy kernels, substitution kernels, and wildcard kernels) are based on feature spaces indexed by k-length subsequences ("k-mers") from the string alphabet $\Sigma$. However, for all kernels we define here, the kernel value $K(x,y)$ can be computed in $O(c_K(|x|+|y|))$ time, where the constant $c_K$ depends on the parameters of the kernel but is independent of the size $|\Sigma|$ of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant $c_K = k^{m+1}|\Sigma|^m$ of the $(k,m)$-mismatch kernel. We compute the kernels efficiently using a trie data structure and relate our new kernels to the recently described transducer formalism. In protein classification experiments on two benchmark SCOP data sets, we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models, and we investigate the dependence of the kernels on parameter choice.

Keywords: kernel methods, string kernels, computational biology

1. Introduction

The original work on string kernels, that is, kernel functions defined on the set of sequences from an alphabet $\Sigma$ rather than on a vector space (Cristianini and Shawe-Taylor, 2000), came from the field of computational biology and was motivated by algorithms for aligning DNA and protein sequences. Pairwise alignment algorithms, in particular the Smith-Waterman algorithm for optimal local alignment and the Needleman-Wunsch algorithm for optimal global alignment (Waterman et al., 1991), model the evolutionary process of mutations (insertions, deletions, and residue substitutions relative to an ancestral sequence) and give natural sequence similarity scores related to evolutionary distance. However, standard pairwise alignment scores do not represent valid kernels (Vert et al., 2004), and the first string kernels to be defined, namely dynamic alignment kernels based on pair hidden Markov models by Watkins (1999) and convolution kernels introduced by Haussler (1999), had to develop new technical machinery to translate ideas from alignment algorithms into a kernel framework. More recently, there has also been interest in the development of string kernels for use with support vector machine classifiers (SVMs) and other kernel methods in fields outside computational biology, such as text processing and speech recognition. For example, the gappy n-gram kernel developed by Lodhi et al. (2002) implemented a dynamic alignment kernel for text classification.

A practical disadvantage of all these string kernels is their computational expense. In general, these kernels rely on dynamic programming algorithms for which the computation of each kernel value $K(x,y)$ is quadratic in the length of the input sequences $x$ and $y$, that is, $O(|x||y|)$ with constant factor that depends on the parameters of the kernel.

The recently presented k-spectrum (gap-free k-gram) kernel and the $(k,m)$-mismatch kernel provide an alternative model for string kernels for biological sequences and were designed, in particular, for the application of SVM protein classification. These kernels use counts of common occurrences of short k-length subsequences, called k-mers, rather than notions of pairwise sequence alignment, as the basis for sequence comparison. The k-mer idea still captures a biologically-motivated model of sequence similarity, in that sequences that diverge through evolution are still likely to contain short subsequences that match or almost match.

Leslie et al. (2002a) introduced a linear time ($O(k(|x|+|y|))$) implementation of the k-spectrum kernel, using exact matches of k-mer patterns only, based on a trie data structure. Later, the $(k,m)$-mismatch kernel (Leslie et al., 2002b) extended the k-mer based kernel framework and achieved improved performance on the protein classification task by incorporating the biologically important notion of character mismatches (residue substitutions). Using a mismatch tree data structure, the complexity of the kernel calculation was shown to be $O(c_K(|x|+|y|))$, with $c_K = k^{m+1}|\Sigma|^m$ for k-grams with up to $m$ mismatches from alphabet $\Sigma$. A different extension of the k-mer framework was presented by Vishwanathan and Smola (2002), who computed the weighted sum of exact-matching k-spectrum kernels for different $k$ by using suffix trees and suffix links, allowing elimination of the constant factor in the spectrum kernel for a compute time of $O(|x|+|y|)$.

In this paper, we extend the k-mer based kernel framework in new ways by presenting several novel families of string kernels for use with SVMs for classification of protein sequence data. These kernels (restricted gappy kernels, substitution kernels, and wildcard kernels) are again based on feature spaces indexed by k-length subsequences from the string alphabet $\Sigma$ (or the alphabet augmented by a wildcard character) and use biologically-inspired models of inexact matching. Thus the new kernels are closely related both to the $(k,m)$-mismatch kernel and the gappy k-gram string kernels used in text classification. However, for all kernels we define here, the kernel value $K(x,y)$ can be computed in $O(c_K(|x|+|y|))$ time, where the constant $c_K$ depends on the parameters of the kernel but is independent of the size $|\Sigma|$ of the alphabet. Our efficient computation uses a recursive function based on a trie data structure and is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant; a similar trie-based sequence search strategy has been used, for example, in work of Sagot (1998) for motif discovery. The restricted gappy kernels we present here can be seen as a fast approximation of the gappy k-gram kernel of Lodhi et al. (2002), where by using our k-mer based computation, we avoid dynamic programming and the resultant quadratic compute time; we note that the gappy k-gram kernel can also be seen as a special case of a dynamic alignment kernel (Watkins, 1999), giving a link between this work and some of the kernels we define.

Cortes et al. (2002) have recently presented a transducer formalism for defining rational string kernels, and all the k-mer based kernels can be naturally described in this framework. We relate our new kernels to the transducer formalism and give transducers corresponding to our newer kernels. We note that Cortes et al. (2002) also describe the original convolution kernels of Haussler (1999) within their framework, suggesting that the transducer formalism is a natural unifying framework for describing many string kernels in the literature. However, the complexity for kernel computation using the standard rational kernel algorithm is $O(c_T|x||y|)$, where $c_T$ is a constant that depends on the transducer, leading again to quadratic rather than linear dependence on sequence length.

Finally, we report protein classification experiments on two benchmark data sets based on the SCOP database (Murzin et al., 1995), where we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models. We also use kernel alignment scores (Cristianini et al., 2001) to investigate how different the various inexact matching models are from each other, and to what extent they depend on kernel parameters. Moreover, we show that by using linear combinations of different kernels, we can improve performance over the best individual kernel.

The current paper is an extended version of the original paper presenting these inexact string matching kernels (Leslie and Kuang, 2003). We have added the second set of SCOP experiments to allow more investigation of the dependence of SVM performance on kernel parameter choices. We have also included the kernel alignment results to explore differences between kernels and parameter choices and the advantage of combining these kernels.

2. Definitions of Feature Maps and String Kernels

Below, we review the definition of mismatch kernels (Leslie et al., 2002b) and present three new families of string kernels: restricted gappy kernels, substitution kernels, and wildcard kernels.

In each case, the kernel is defined via an explicit feature map from the space of all finite sequences from an alphabet $\Sigma$ to a vector space indexed by the set of k-length subsequences from $\Sigma$ or, in the case of wildcard kernels, $\Sigma$ augmented by a wildcard character. For protein sequences, $\Sigma$ is the alphabet of $|\Sigma| = 20$ amino acids. We refer to a k-length contiguous subsequence occurring in an input sequence as an instance k-mer (also called a k-gram in the literature). The mismatch kernel feature map obtains inexact matching of instance k-mers from the input sequence to k-mer features by allowing a restricted number of mismatches; the new kernels achieve inexact matching by allowing a restricted number of gaps, by enforcing a probabilistic threshold on character substitutions, or by permitting a restricted number of matches to wildcard characters. These models for inexact matching have all been used in the computational biology literature in other contexts, in particular for sequence pattern discovery in DNA and protein sequences (Sagot, 1998) and probabilistic models for sequence evolution (Henikoff and Henikoff, 1992, Schwartz and Dayhoff, 1978, Altschul et al., 1990).

2.1 Spectrum and Mismatch Kernels

In previous work, we defined the $(k,m)$-mismatch kernel via a feature map $\Phi^{\mathrm{Mismatch}}_{(k,m)}$ to the $|\Sigma|^k$-dimensional vector space indexed by the set of k-mers from $\Sigma$. For a fixed k-mer $\alpha = a_1 a_2 \ldots a_k$, with each $a_i$ a character in $\Sigma$, the $(k,m)$-neighborhood generated by $\alpha$ is the set of all k-length sequences $\beta$ from $\Sigma$ that differ from $\alpha$ by at most $m$ mismatches. We denote this set by $N_{(k,m)}(\alpha)$.

For a k-mer $\alpha$, the feature map is defined as

$$\Phi^{\mathrm{Mismatch}}_{(k,m)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in \Sigma^k},$$

where $\phi_\beta(\alpha) = 1$ if $\beta$ belongs to $N_{(k,m)}(\alpha)$, and $\phi_\beta(\alpha) = 0$ otherwise. For a sequence $x$ of any length, we extend the map additively by summing the feature vectors for all the k-mers in $x$:

$$\Phi^{\mathrm{Mismatch}}_{(k,m)}(x) = \sum_{k\text{-mers } \alpha \text{ in } x} \Phi^{\mathrm{Mismatch}}_{(k,m)}(\alpha).$$

Each instance of a k-mer contributes to all coordinates in its mismatch neighborhood, and the $\beta$-coordinate of $\Phi^{\mathrm{Mismatch}}_{(k,m)}(x)$ is just a count of all instances of the k-mer $\beta$ occurring with up to $m$ mismatches in $x$. The $(k,m)$-mismatch kernel $K_{(k,m)}$ is then given by the inner product of feature vectors:

$$K^{\mathrm{Mismatch}}_{(k,m)}(x,y) = \langle \Phi^{\mathrm{Mismatch}}_{(k,m)}(x), \Phi^{\mathrm{Mismatch}}_{(k,m)}(y) \rangle.$$

For $m = 0$, we obtain the k-spectrum (Leslie et al., 2002a) or k-gram kernel (Lodhi et al., 2002).

2.2 Restricted Gappy Kernels

For the $(g,k)$-gappy string kernel, $g \ge k$, we use the same $|\Sigma|^k$-dimensional feature space, indexed by the set of k-mers from $\Sigma$, but we define our feature map based on gappy matches of g-mers to k-mer features. For a fixed g-mer $\alpha = a_1 a_2 \ldots a_g$ (each $a_i \in \Sigma$), let $G_{(g,k)}(\alpha)$ be the set of all the k-length subsequences occurring in $\alpha$ (with up to $g-k$ gaps). Then we define the gappy feature map on $\alpha$ as

$$\Phi^{\mathrm{Gap}}_{(g,k)}(\alpha) = (\phi_\beta(\alpha))_{\beta \in \Sigma^k},$$

where $\phi_\beta(\alpha) = 1$ if $\beta$ belongs to $G_{(g,k)}(\alpha)$, and $\phi_\beta(\alpha) = 0$ otherwise. In other words, each instance g-mer contributes to the set of k-mer features that occur (in at least one way) as subsequences with up to $g-k$ gaps in the g-mer. Now we extend the feature map to arbitrary finite sequences $x$ by summing the feature vectors for all the g-mers in $x$:

$$\Phi^{\mathrm{Gap}}_{(g,k)}(x) = \sum_{g\text{-mers } \alpha \in x} \Phi^{\mathrm{Gap}}_{(g,k)}(\alpha).$$

The kernel $K^{\mathrm{Gap}}_{(g,k)}(x,y)$ is defined as before by taking the inner product of feature vectors for $x$ and $y$.

Alternatively, given an instance g-mer, we may wish to count the number of occurrences of each k-length subsequence and weight each occurrence by the number of gaps. Following Lodhi et al. (2002), we can define for g-mer $\alpha$ and k-mer feature $\beta = b_1 b_2 \ldots b_k$ the weighting

$$\phi^{\lambda}_{\beta}(\alpha) = \frac{1}{\lambda^{k}} \sum_{\substack{1 \le i_1 < i_2 < \cdots < i_k \le g \\ a_{i_j} = b_j \text{ for } j = 1, \ldots, k}} \lambda^{i_k - i_1 + 1},$$

where the multiplicative factor satisfies $0 < \lambda \le 1$. We can then obtain a weighted version of the gappy kernel $K^{\mathrm{WeightedGap}}_{(g,k,\lambda)}$ from the feature map:

$$\Phi^{\mathrm{WeightedGap}}_{(g,k,\lambda)}(x) = \sum_{g\text{-mers } \alpha \in x} (\phi^{\lambda}_{\beta}(\alpha))_{\beta \in \Sigma^k}.$$

Here, the weighting $\lambda^{i_k - i_1 + 1} / \lambda^{k}$ penalizes a gappy occurrence of a k-mer by a factor $\lambda$ raised to the number of internal gaps. This feature map is related to the gappy k-gram kernel defined by Lodhi et al. (2002) but enforces the following restriction: here, only those k-character subsequences that occur with at most $g-k$ gaps, rather than all gappy occurrences, contribute to the corresponding k-mer feature. When restricted to input sequences of length $g$, our feature map coincides with that of the usual gappy k-gram kernel. Note, however, that for our kernel, a gappy k-mer instance (occurring with at most $g-k$ gaps) is counted in all (overlapping) g-mers that contain it, whereas in Lodhi et al. (2002), a gappy k-mer instance is only counted once. If we wish to approximate the gappy k-gram kernel, we can define a small variation of our restricted gappy kernel where one only counts a gappy k-mer instance if its first character occurs in the first position of a g-mer window. That is, the modified feature map is defined on each g-mer $\alpha$ by coordinate functions

$$\tilde{\phi}^{\lambda}_{\beta}(\alpha) = \frac{1}{\lambda^{k}} \sum_{\substack{1 = i_1 < i_2 < \cdots < i_k \le g \\ a_{i_j} = b_j \text{ for } j = 1, \ldots, k}} \lambda^{i_k - i_1 + 1},$$

$0 < \lambda \le 1$, and is extended to longer sequences by adding feature vectors for g-mers. This modified feature map now gives a "truncation" of the usual gappy k-gram kernel.

In Section 3, we show that our restricted gappy kernel has $O(c(g,k)(|x|+|y|))$ computation time, where the constant $c(g,k)$ depends on the size of $g$ and $k$, while the original gappy k-gram kernel has complexity $O(k(|x||y|))$. Note in particular that we do not compute the standard gappy k-gram kernel on every pair of g-grams from $x$ and $y$, which would necessarily be quadratic in sequence length since there are $O(|x||y|)$ such pairs. We will see that for reasonable choices of $g$ and $k$, we obtain much faster computation time, while in experimental results reported in Section 5, we still obtain good classification performance.

2.3 Substitution Kernels

The substitution kernel is again similar to the mismatch kernel, except that we replace the combinatorial definition of a mismatch neighborhood with a similarity neighborhood based on a probabilistic model of character substitutions. In computational biology, it is standard to compute pairwise alignment scores for protein sequences using a substitution matrix (Henikoff and Henikoff, 1992, Schwartz and Dayhoff, 1978, Altschul et al., 1990) that gives pairwise scores $s(a,b)$ derived from estimated evolutionary substitution probabilities. In one scoring system (Schwartz and Dayhoff, 1978), the scores $s(a,b)$ are based on estimates of conditional substitution probabilities $P(a|b) = p(a,b)/q(b)$, where $p(a,b)$ is the probability that $a$ and $b$ co-occur in an alignment of closely related proteins, $q(a)$ is the background frequency of amino acid $a$, and $P(a|b)$ represents the probability of a mutation into $a$ during a fixed evolutionary time interval given that the ancestor amino acid was $b$. We define the mutation neighborhood $M_{(k,\sigma)}(\alpha)$ of a k-mer $\alpha = a_1 a_2 \ldots a_k$ as follows:

$$M_{(k,\sigma)}(\alpha) = \Big\{ \beta = b_1 b_2 \ldots b_k \in \Sigma^k : -\sum_{i=1}^{k} \log P(a_i | b_i) < \sigma \Big\}.$$

Mathematically, we can define $\sigma = \sigma(N)$ such that $\max_{\alpha \in \Sigma^k} |M_{(k,\sigma)}(\alpha)| < N$, so we have theoretical control over the maximum size of the mutation neighborhoods. In practice, choosing $\sigma$ to allow an appropriate amount of mutation while restricting neighborhood size may require experimentation and cross-validation.

Now we define the substitution feature map analogously to the mismatch feature map:

$$\Phi^{\mathrm{Sub}}_{(k,\sigma)}(x) = \sum_{k\text{-mers } \alpha \text{ in } x} (\phi_\beta(\alpha))_{\beta \in \Sigma^k},$$
where $\phi_\beta(\alpha) = 1$ if $\beta$ belongs to the mutation neighborhood $M_{(k,\sigma)}(\alpha)$, and $\phi_\beta(\alpha) = 0$ otherwise.

2.4 Wildcard Kernels

Finally, we can augment the alphabet $\Sigma$ with a wildcard character denoted by $*$, and we map to a feature space indexed by the set $W$ of k-length subsequences from $\Sigma \cup \{*\}$ having at most $m$ occurrences of the character $*$. The feature space has dimension $\sum_{i=0}^{m} \binom{k}{i} |\Sigma|^{k-i}$.

A k-mer $\alpha$ matches a subsequence $\beta$ in $W$ if all non-wildcard entries of $\beta$ are equal to the corresponding entries of $\alpha$ (wildcards match all characters). The wildcard feature map is given by

$$\Phi^{\mathrm{Wildcard}}_{(k,m,\lambda)}(x) = \sum_{k\text{-mers } \alpha \text{ in } x} (\phi_\beta(\alpha))_{\beta \in W},$$

where $\phi_\beta(\alpha) = \lambda^{j}$ if $\alpha$ matches pattern $\beta$ containing $j$ wildcard characters, $\phi_\beta(\alpha) = 0$ if $\alpha$ does not match $\beta$, and $0 < \lambda \le 1$.

Other variations of the wildcard idea, including specialized weightings and use of groupings of related characters, are described by Eskin et al. (2003).

3. Efficient Computation

All the k-mer based kernels we define above can be efficiently computed using a trie (retrieval tree) data structure, similar to the mismatch tree approach previously presented (Leslie et al., 2002b). In this framework, k-mer features correspond to paths from the root to the leaf nodes of the tree, and the data structure is used to organize a traversal of all the inexact matching instance patterns in the data that contribute to each k-mer feature count. We will describe the computation of the gappy kernel in most detail, since the other kernels are easier adaptations of the mismatch kernel computation. For simplicity, we explain how to compute a single kernel value $K(x,y)$ for a pair of input sequences; computation of the full kernel matrix in one traversal of the data structure is a straightforward extension.

3.1 (g,k)-Gappy Kernel Computation

For the $(g,k)$-gappy kernel, we represent our feature space as a rooted tree of depth $k$ where each internal node has $|\Sigma|$ branches and each branch is labeled with a symbol from $\Sigma$. In this depth $k$ trie, each leaf node represents a fixed k-mer in feature space by concatenating the branch symbols along the path from root to leaf, and each internal node represents the prefix for the set of k-mer features in the subtree below it.

Using a depth-first traversal of this tree, we maintain at each node that we visit a set of pointers to all g-mer instances in the input sequences that contain a subsequence (with gaps) that matches the current prefix pattern; we also store, for each g-mer instance, an index pointing to the last position we have seen so far in the g-mer. At the root, we store pointers to all g-mer instances, and for each instance, the stored index is 0, indicating that we have not yet seen any characters in the g-mer. As we pass from a parent node to a child node along a branch labeled with symbol $a$, we process each of the parent's instances by scanning ahead to find the next occurrence of symbol $a$ in each g-mer. If such a character exists, we pass the g-mer to the child node along with its updated index; otherwise, we drop the instance and do not pass it to the child. Thus at each node of depth $d$, we have effectively performed a greedy gapped alignment of g-mers from the input sequences to the current d-length prefix, allowing insertion of up to $g-k$ gaps into the prefix sequence to obtain each alignment. When we encounter a node with an empty list of pointers (no valid occurrences of the current prefix), we do not need to search below it in the tree; in fact, unless there is a valid g-mer instance from each of $x$ and $y$, we do not have to process the subtree. When we reach a leaf node, we sum the contributions of all instances occurring in each source sequence to obtain feature values for $x$ and $y$ corresponding to the current k-mer, and we update the kernel by adding the product of these feature values. Since we are performing a depth-first traversal, we can accomplish the algorithm with a recursive function and do not have to store the full trie in memory. Figure 1 shows expansion down a path during the recursive traversal.

Figure 1: Trie traversal for gappy kernel. Expansion along a path from root to leaf during traversal of the trie for the $(5,3)$-gappy kernel, showing only the instance 5-mers for a single sequence $x = abaabab$. Each node stores its valid 5-mer instances and the index to the last match for each instance. Instances at the leaf node contribute to the kernel for 3-mer feature $abb$.

The computation at the leaf node depends on which version of the gappy kernel one uses. For the unweighted feature map, we obtain the feature values of $x$ and $y$ corresponding to the current k-mer by counting the g-mer instances at the leaf coming from $x$ and from $y$, respectively; the product of these counts gives the contribution to the kernel for this k-mer feature. For the $\lambda$-weighted gappy feature map, we need a count of all
alignments of each valid g-mer instance against the k-mer feature allowing up to $g-k$ gaps. This can be computed with a simple dynamic programming routine (similar to the Needleman-Wunsch algorithm), where we sum over a restricted set of paths, as shown in Figure 2. The complexity is $O(k(g-k))$, since we fill a restricted trellis of $(k+1)(g-k+1)$ squares. Note that when we align a subsequence $b_{i_1} b_{i_2} \ldots b_{i_k}$ against a k-mer $a_1 a_2 \ldots a_k$, we only penalize interior gaps corresponding to non-consecutive indices in $1 \le i_1 < i_2 < \cdots < i_k \le g$. Therefore, the multiplicative gap cost is 1 in the zeroth and last rows of the trellis and $\lambda$ in the other rows.

Figure 2: Dynamic programming at the leaf node. The trellis in (A) shows the restricted paths for aligning a g-mer against a k-mer, with insertion of up to $g-k$ gaps in the k-mer, for $g = 5$ and $k = 3$. The basic recursion for summing path weights is $S(i,j) = m(a_i, b_j)\,S(i-1,j-1) + g(i)\,S(i,j-1)$, where $m(a,b) = 1$ if $a$ and $b$ match, 0 if they are different, and the gap penalty $g(i) = 1$ for $i = 0, k$ and $g(i) = \lambda$ for other rows. That is, except for the top and bottom rows, every time we move to the right, we introduce an additional internal gap and incur a multiplicative penalty of $\lambda$; when we move diagonally, we see whether the corresponding characters match or not. Trellis (B) shows the example of aligning $ababb$ against 3-mer $abb$. An explanation of how to interpret trellis diagrams for dynamic programming can be found in Durbin et al. (1998).

In order to determine the worst case complexity of the kernel computation, we estimate the traversal time (which can be bounded by the total number of g-mer instances that are processed at the leaf nodes multiplied by the maximum number of times an instance is operated on as it is passed from root to leaf) plus the processing time at the leaf nodes needed to compute the kernel update.

Each g-mer instance in the input data can contribute to $\binom{g}{k}$ k-mer features, which we can write as $O(g^{g-k})$ if $g-k$ is smaller than $k$ and $O(g^{k})$ otherwise. For simplicity, we assume the more typical former case, and we let the reader make simple adjustments in the latter case. Therefore, at most $O(g^{g-k}(|x|+|y|))$ g-mer instances are processed at leaf nodes in the traversal. Since we iterate through at most $g$ positions of each g-mer instance as we pass from root to leaf, the traversal time is $O(g^{g-k+1}(|x|+|y|))$. The total processing time at leaf nodes is $O(g^{g-k}(|x|+|y|))$ for the unweighted gappy kernel and $O(k(g-k)\,g^{g-k}(|x|+|y|))$ for the weighted gappy kernel. Therefore, in both cases, we have total complexity of the form $O(c(g,k)(|x|+|y|))$, with $c(g,k) = O((g-k)\,g^{g-k+1})$ for the more expensive kernel. Further discussion of the complexity argument and pseudocode for the algorithm can be found in Shawe-Taylor and Cristianini (2004).

Note that with the definition of the gappy feature maps given above, a gappy k-character subsequence occurring with $c \le g-k$ gaps is counted in each of the $g-(k+c)+1$ g-length windows that contain it. To obtain feature maps that count a gappy k-character subsequence only once, we can make minor variations to the algorithm by requiring that the first character of a gappy k-mer occurs in the first position of the g-length window in order to contribute to the corresponding k-mer feature.

3.2 (k,σ)-Substitution Kernel Computation

For the substitution kernel, computation is very similar to the mismatch kernel algorithm. We use a depth $k$ trie to represent the feature space. We store, at each depth $d$ node that we visit, a set of pointers to all k-mer instances $\alpha$ in the input data whose d-length prefixes have current mutation score $-\sum_{i=1}^{d} \log P(a_i | b_i) < \sigma$ of the current prefix pattern $b_1 b_2 \ldots b_d$, and we store the current mutation score for each k-mer instance. As we pass from a parent node at depth $d$ to a child node at depth $d+1$ along a branch labeled with symbol $b$, we process each k-mer $\alpha$ by adding $-\log P(a_{d+1} | b)$ to the mutation score and pass it to the child if and only if the score is still less than $\sigma$. As before, we update the kernel at the leaf node by computing the contribution of the corresponding k-mer feature.

The number of leaf nodes visited in the traversal is $O(N(|x|+|y|))$, where the constant is the maximum mutation neighborhood size, $N = \max_{\alpha \in \Sigma^k} |M_{(k,\sigma)}(\alpha)|$. We can choose $\sigma$ sufficiently small to get any desired bound on $N$, but it is difficult to estimate how to set the parameter $\sigma$ to obtain good SVM classification performance except by empirical results. Total complexity for the kernel value computation is $O(kN(|x|+|y|))$.

3.3 (k,m)-Wildcard Kernel Computation

Computation of the wildcard kernel is again very similar to the mismatch kernel algorithm. We use a depth $k$ trie with branches labeled by characters in $\Sigma \cup \{*\}$, and we prune (do not traverse) subtrees corresponding to prefix patterns with greater than $m$
wildcard characters. At each node of depth $d$ we maintain pointers to all k-mer instances in the input sequences whose d-length prefixes match the current d-length prefix pattern (with wildcards) represented by the path down from the root.

Each k-mer instance in the data matches at most $\sum_{i=0}^{m} \binom{k}{i} = O(k^{m})$ k-length patterns having up to $m$ wildcards. Thus the number of leaf nodes visited in the traversal is $O(k^{m}(|x|+|y|))$, and total complexity for the kernel value computation is $O(k^{m+1}(|x|+|y|))$.

3.4 Comparison with Mismatch Kernel Complexity

For the $(k,m)$-mismatch kernel, the size of the mismatch neighborhood of an instance k-mer is $O(k^{m}|\Sigma|^{m})$, so total kernel value computation is $O(k^{m+1}|\Sigma|^{m}(|x|+|y|))$. All the other kernels presented here have running time $O(c_K(|x|+|y|))$, where the constant $c_K$ depends on the parameters of the kernel but not on the size of the alphabet $\Sigma$. Therefore, we have an improved constant term for larger alphabets (such as the alphabet of 20 amino acids). In Section 5, we show that these new, faster kernels have performance comparable to the mismatch kernel in protein classification experiments.

4. Transducer Representation

Cortes et al. (2002) recently showed that many known string kernels can be associated with and constructed from weighted finite state transducers with input alphabet $\Sigma$. We briefly outline their transducer formalism and give transducers for some of our newly defined kernels. For simplicity, we only describe transducers over the probability semiring $\mathbb{R}_+ = [0, \infty)$, with regular addition and multiplication.

Following the development in Cortes et al. (2002), a weighted finite state transducer over $\mathbb{R}_+$ is defined by a finite input alphabet $\Sigma$, a finite output alphabet $\Delta$, a finite set of states $Q$, a set of input
states $I \subseteq Q$, a set of output states $F \subseteq Q$, a finite set of transitions $E \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times (\Delta \cup \{\epsilon\}) \times \mathbb{R}_+ \times Q$, an initial weight function $\lambda : I \to \mathbb{R}_+$, and a final weight function $\rho : F \to \mathbb{R}_+$. Here, the symbol $\epsilon$ represents the empty string. The transducer can be represented by a weighted directed graph with nodes indexed by $Q$ and each transition $e \in E$ corresponding to a directed edge from its origin state $p[e]$ to its destination state $n[e]$ and labeled by the input symbol $i[e]$ it accepts, the output symbol $o[e]$ it emits, and the weight $w[e]$ it assigns. We write the label as $i[e]:o[e]/w[e]$ (abbreviated as $i[e]:o[e]$ if the weight is 1).

For a path $\pi = e_1 e_2 \ldots e_k$ of consecutive transitions (directed path in graph), the weight for the path is $w[\pi] = w[e_1] \cdot w[e_2] \cdots w[e_k]$, and we denote $p[\pi] = p[e_1]$ and $n[\pi] = n[e_k]$. We write $\Sigma^* = \bigcup_{k \ge 0} \Sigma^k$ for the set of all strings over $\Sigma$. For an input string $x \in \Sigma^*$ and output string $z \in \Delta^*$, we denote by $P(I, x, z, F)$ the set of paths from initial states $I$ to final states $F$ that accept string $x$ and emit string $z$. A transducer $T$ is called regulated if for any pair of input and output strings $(x, z)$, the output weight $[[T]](x,z)$ that $T$ assigns to the pair is well-defined. The output weight is given by:

$$[[T]](x,z) = \sum_{\pi \in P(I,x,z,F)} \lambda(p[\pi])\, w[\pi]\, \rho(n[\pi]).$$

A key observation from Cortes et al. (2002) is that there is a general method for defining a string kernel from a weighted transducer $T$. Let $\psi : \mathbb{R}_+ \to \mathbb{R}$ be a semiring morphism (for us, it will simply be inclusion), and denote by $T^{-1}$ the transducer obtained from $T$ by transposing the input and output labels of each transition. Then if the composed transducer $S = T \circ T^{-1}$ is regulated, one obtains a rational string kernel for alphabet $\Sigma$ via

$$K(x,y) = \psi([[S]](x,y)) = \sum_{z} \psi([[T]](x,z))\, \psi([[T]](y,z)),$$

where the sum is over all strings $z \in \Delta^*$ (where $\Delta$ is the output alphabet for $T$) or equivalently, over all output strings that can be emitted by $T$. Therefore, we can think of $T$ as defining a feature map indexed by all possible output strings $z \in \Delta^*$ for $T$.

Using this construction, Cortes et al. showed that the k-gram counter transducer $T_k$ corresponds to the k-gram or k-spectrum kernel, and the gappy k-gram counter transducer $T_{k,\lambda}$ gives the unrestricted gappy k-gram kernel from Lodhi et al. (2002). Figure 3(a) shows the diagram of the 3-gram transducer $T_3$, and Figure 3(b) gives the gappy 3-gram transducer $T_{3,\lambda}$. Our $(g,k,\lambda)$-gappy kernel $K^{\mathrm{WeightedGap}}_{(g,k,\lambda)}$ can be obtained from the composed transducer $T = T_{k,\lambda} \circ T_g$ using the $T \circ T^{-1}$ construction. (In all our examples, we use $\lambda(s) = 1$ for every initial state $s$ and $\rho(\cdot) = 1$ for every final state.)

For the $(k,m)$-wildcard kernel, we set the output alphabet to be $\Delta = \Sigma \cup \{*\}$ and define a transducer with $m+1$ final states, as indicated in the figure. The $m+1$ final states correspond to destinations of paths that emit k-grams with $0, 1, \ldots, m$ wildcard characters, respectively. The $(3,1)$-wildcard transducer is shown in Figure 4.

The $(k,\sigma)$-substitution kernel does not appear to fall exactly into this framework, though if we threshold individual substitution probabilities independently rather than threshold the product probability over all positions in the k-mer, we can define a transducer that generates a similar kernel. Starting with the k-gram transducer, we add additional transitions (between "consecutive" states of the k-gram) of the form $a:b$ for those pairs of symbols with $-\log P(a|b) < \sigma_o$. Now there will be a (unique) path in the transducer that accepts k-mer $\alpha = a_1 a_2 \ldots a_k$ and emits $\beta = b_1 b_2 \ldots b_k$ if and only if every substitution satisfies $-\log P(a_i | b_i) < \sigma_o$.

Figure 3: The k-gram and gappy k-gram transducers. The diagrams show the 3-gram transducer (a) and the gappy 3-gram transducer (b) for a two-letter alphabet. Here, the edge label $a:\epsilon/\lambda$, for example, means "accept symbol $a$, output the empty symbol $\epsilon$, multiply the weight by $\lambda$".

Figure 4: The (k,m)-wildcard transducer. The diagram shows the $(3,1)$-wildcard transducer for a two-letter alphabet.

5. Experiments

We tested all the new string kernels with SVM classifiers on two benchmark data sets (Jaakkola et al., 1999, Weston et al., 2003), both designed for the remote protein homology detection problem, in order to compare to results with the mismatch kernel reported previously (Leslie et al., 2002b) and other recent kernel representations for protein sequence data. We also present results to explore how parameter choices for the new kernels affect SVM classification performance. The benchmarks are based on different versions of the SCOP database (Murzin et al., 1995), an expert-curated database of protein domains with known 3D structure, organized hierarchically into folds, superfamilies, and
families. Protein domain sequences belonging to different families but the same superfamily are considered to be remote homologs in SCOP. In these experiments, remote homology is simulated by holding out all members of a target SCOP family from a given superfamily as a test set, while examples chosen from the remaining families in the same superfamily form the positive training set. The negative test and training examples are chosen from disjoint sets of folds outside the target family's fold, so that negative test and negative training sets are unrelated to each other and to the positive examples. More details of the experimental set-up can be found in Jaakkola et al. (1999).

While in principle, we can define and test inexact matching string kernels for a wide range of parameters, in practice, only a small parameter range is biologically motivated for use in the remote protein homology detection problem. For the exact matching k-spectrum kernel (Leslie et al., 2002a), the only interesting parameter choices are $k = 3$ and $k = 4$, since exact occurrences of k-mers of length $k \ge 5$ in remotely homologous proteins are so rare that the spectrum kernel would mostly be 0 off the diagonal. By incorporating inexact matching such as mismatches or gaps, we can use slightly longer subsequence instances and allow a few mismatches or gaps; however, using very long subsequences, or allowing a great amount of mutation of subsequence instances in our inexact matching scheme, would not capture biologically realistic sequence similarity. For example, for the gappy kernel, we expect that length $(g,k) = (6,4)$, that is, 4-mers allowing up to 2 gaps, would be a useful parameter choice, while allowing many more gaps and hence a longer g-length window would be less useful. We test a range of parameter choices that seem reasonable in the experiments below.

In the first and larger SCOP benchmark data set, based on SCOP version 1.37, we compare to the Fisher kernel of Jaakkola et al. (1999) in addition to our previous mismatch kernel. In the Fisher kernel method, the feature vectors are derived from profile HMMs trained on the positive training examples. The feature vector for sequence $x$ is the gradient of the log likelihood function $\log P(x|\theta)$ defined by the model and evaluated at the maximum likelihood estimate for model parameters: $\Phi(x) = \nabla_\theta \log P(x|\theta)|_{\theta = \theta_0}$. The Fisher kernel was the best performing method on this data set prior to the mismatch-SVM approach, whose performance is as good as Fisher-SVM and better than all other standard methods tried (Leslie et al., 2002b). We note that in this data set, additional positive training sequences were pulled in from the non-redundant protein sequence database using an iterative training method for the profile HMMs. The presence of these additional "domain homologs" makes the learning task easier for all methods.

We also include a second set of experiments to further investigate the dependence of SVM performance on parameter choices for the new kernels. This second data set is based on SCOP version 1.59 and contains only sequences from the SCOP database; no domain homologs are added. The experiments are similar to those described by Liao and Noble (2002) but use a more recent version of SCOP. In this data set, the positive training sets are quite small, and the learning task is more difficult in this setting. In particular, there is not enough positive training data to train profile HMMs in these experiments, so we do not report Fisher kernel results (which are not competitive in this setting).

There is another successful feature representation for protein classification, the SVM-pairwise method presented in Liao and Noble (2002). Here one uses an empirical kernel map based on pairwise Smith-Waterman (Waterman et al., 1991) alignment scores

$$\Phi(x) = (d(x_1, x), \ldots, d(x_m, x)),$$

where $x_i$, $i = 1 \ldots m$, are the training sequences and $d(x_i, x)$ is the E-value for the alignment score between $x$ and $x_i$. We have previously shown (Leslie et al., 2004) that the mismatch kernel used with an SVM classifier is competitive with SVM-pairwise on the SCOP benchmark presented in Liao and Noble (2002), so we do not repeat the SVM-pairwise experiments for the very similar benchmark here.

In both experiments, we normalized the kernel by

$$k(x,y) \leftarrow \frac{k(x,y)}{\sqrt{k(x,x)}\,\sqrt{k(y,y)}}.$$

All methods are evaluated using the receiver operating characteristic (ROC) score, which is the area under the receiver operating curve; this curve plots the rate of true positives as a function of the rate of false positives as the threshold for the classifier varies (Gribskov and Robinson, 1996). Perfect ranking of all positives above all negatives gives an ROC score of 1, while a random classifier has an expected score close to 0.5. We also use the ROC-50 score, which is the normalized area under the receiver operating curve up to the first 50 false positives; this score focuses on the top of the ranking of the test examples produced by the classifier and is more informative when there are very few positive examples in the test set. Use of ROC-50 scores (or other ROC-N scores) is the most standard way of evaluating performance of homology detection methods in bioinformatics (Gribskov and Robinson, 1996).

Finally, we use kernel alignment scores (Cristianini et al., 2001) on the second SCOP data set to investigate the empirical differences between the different inexact matching models for protein sequence data, and we investigate methods for combining kernels to improve SVM performance.

5.1 SCOP Experiments with Domain Homologs: Comparison with Fisher and Mismatch Kernels

We first present experimental results for the new kernels on the larger of the two SCOP data sets, the Fisher-SCOP benchmark introduced by Jaakkola et al. (1999) that contains domain homologs for additional positive training data, and compare SVM classification performance to both the mismatch kernel and the Fisher kernel.

We tested the $(g,k)$-gappy kernel with parameter choices $(g,k) = (6,4)$, $(7,4)$, $(8,5)$, $(8,6)$, and $(9,6)$. Among them $(g,k) = (6,4)$ yielded the best results, though other choices of parameters had quite similar performance (data not shown). We also tested the alternative weighted gappy kernel, where the contribution of an instance g-mer to a k-mer feature is a weighted sum of all the possible matches of the k-mer to subsequences in the g-mer with multiplicative gap penalty $\lambda$ ($0 < \lambda \le 1$). We used gap penalty $\lambda = 1.0$ and $\lambda = 0.5$ with the $(6,4)$ weighted gappy kernel. We found that $\lambda = 0.5$ weighting slightly weakened performance (results not shown). In Figure 5, we see that unweighted and weighted ($\lambda = 1.0$) gappy kernels have comparable results to the $(5,1)$-mismatch kernel and Fisher kernel.

Figure 5: Comparison of Mismatch-SVM, Fisher-SVM and Gappy-SVM. The graph plots the total number of families for which a given method exceeds an ROC score threshold (a) or ROC-50 score threshold (b). [Plot not reproduced. Vertical axis: number of families; horizontal axis: ROC (a) or ROC-50 (b) score; curves: (5,1)-Mismatch-SVM, Fisher-SVM, (6,4)-Gap-SVM (weight = 0.5), (6,4)-Gap-SVM (weight = 1.0).]

We tested the substitution kernels with $(k,\sigma) = (4, 6.0)$. Here, $\sigma = 6.0$ was chosen so that the members of a mutation neighborhood of a particular 4-mer would typically have only one position with a substitution, and such substitutions would have fairly high probability. Therefore, the mutation neighborhoods were much smaller than, for example, $(4,1)$-mismatch neighborhoods. The results are shown in Figure 6. Again, the substitution kernel has comparable performance with mismatch-SVM and Fisher-SVM, though results are perhaps slightly weaker for more difficult test families.

Figure 6: Comparison of mismatch-SVM, Fisher-SVM and substitution-SVM. The graph plots the total number of families for which a given method exceeds an ROC score threshold (a) or ROC-50 score threshold (b). [Plot not reproduced. Vertical axis: number of families; horizontal axis: ROC (a) or ROC-50 (b) score; curves: (5,1)-Mismatch-SVM, Fisher-SVM, (4,6)-Substitution-SVM.]

In order to compare with the $(5,1)$-mismatch kernel, we tested wildcard kernels with parameters $(k,m,\lambda) = (5,1,1.0)$ and $(k,m,\lambda) = (5,1,0.5)$. Results are shown in Figure 7. The wildcard kernel with $\lambda = 1.0$ seems to perform as well or almost as well as the $(5,1)$-mismatch kernel and Fisher kernel, while enforcing a penalty on wildcard characters of $\lambda = 0.5$ seems to weaken performance somewhat.

If we compare results for the best-performing parameter choices that we tried from each kernel family (the $(5,1)$-mismatch kernel, the $(5,1,1.0)$-wildcard kernel, the $(6,4)$-gappy kernel with $\lambda = 1.0$, and the $(4,6.0)$-substitution kernel), then a signed rank test with Bonferroni correction for multiple comparisons (Henikoff and Henikoff, 1992, Salzberg, 1997) and a p-value cut-off of 0.05 finds no significant differences between the four kernels, either on the basis of ROC or ROC-50 scores.

5.2 SCOP Experiments without Domain Homologs: Dependence on Parameters

In the second set of SCOP experiments, we take advantage of the smaller data set from Weston et al. (2003) to generate kernels corresponding to a wider range of parameter values, so that we

[Plot not reproduced. Vertical axis: number of families; horizontal axis: ROC or ROC-50 score; curves: (5,1)-Mismatch-SVM, Fisher-SVM, (5,1,1.0)-Wildcard-SVM, (5,1,0.5)-Wildcard-SVM.]
ROC50\n(a)(b)Figure7:Comparisonofmismatch-SVM,Fisher-SVMandwildcard-SVM.ThegraphplotsthetotalnumberoffamiliesforwhichagivenmethodexceedsanROCscorethreshold(a)orROC-50scorethreshold(b).canexplorehowparameterchoicesaffectSVMclassicationperformance.Wealsousekernelalignmentandkernel-targetalignmentscores(Cristianinietal.,2001)toinvestigatedifferencesbetweendifferentkernelmodels.Notethatthisdatasetcontainsnodomainhomologs,andthusthesmallamountofpositivetrainingdatamakestheexperimentsmoredifcult.Inexperimentswiththegappykernel,wechoseparametervalues(g;k)=(6;4)(7;5)and(8;6)andsetthegappenaltytol=1:0,thepreferredchoicefromthepreviousexperiments.Thechoice(g;k)=(6;4)stillproducedthebestclassicationresults,whichwereslightlybutnotsignicantlyweakerthanthoseof(5;1)-mismatchkernel.TheresultsareshowninFigure8.Performancedeterioratesaslargervaluesofthegparameterarechosenwiththenumberofgapsheldxed.Gap(5,7)\nGap(6,8)\n010203040506000.20.40.60.81Number of familiesROC50(5,1)-Mismatch-SVM ROC50\n(6,4)-Gap-SVM(Weight=1.0) ROC50\n(7,5)-Weight-Gap-SVM (Weight=1.0) ROC50\n(8,6)-Weight-Gap-SVM (Weight=1.0) ROC50\n(a)(b)Figure8:Dependenceonparametersforthegappykernel.ThegraphplotsthetotalnumberoffamiliesforwhichagivenmethodexceedsanROC(a)orROC-50(b)scorethreshold.Thesubstitutionkernelwastestedwithparameterchoices(k;s)=(4;6:0);(5;7:5)and(6;9:0)Allofthesethreekernelsgaveslightlystrongerperformancethanthe(5;1)-mismatchkernel,andresultsforthedifferentparameterchoiceswereremarkablysimilar,asshowninFigure9.Thus,moresothanforotherinexactmatchingmodels,thesubstitutionkernelperformanceseemsstable1449 LESLIEANDKUANGaswevarykwhilesisadjustedadditively;however,asweseebelow,theGrammatricesproducedbythesedifferentchoicesofkernelsareinfactquitedifferent.(6,9)-Substitution\n 0 10 20 30 40 50 60 0 0.2 0.4 0.6 0.8 1No. 
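The comparison of Gram matrices mentioned here relies on the empirical kernel alignment of Cristianini et al. (2001), that is, the cosine of the angle between two Gram matrices viewed as vectors, computed on normalized kernels. The following is a minimal sketch of those two computations in plain Python; it is not the code used in the experiments, and the function names (`normalize`, `frobenius_inner`, `alignment`, `target_matrix`) and the toy matrices are illustrative assumptions.

```python
import math

def normalize(K):
    """Kernel normalization: K[i][j] / (sqrt(K[i][i]) * sqrt(K[j][j]))."""
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]

def frobenius_inner(K1, K2):
    """Inner product of two matrices viewed as flat vectors."""
    return sum(a * b for row1, row2 in zip(K1, K2) for a, b in zip(row1, row2))

def alignment(K1, K2):
    """Empirical kernel alignment: cosine of the angle between Gram matrices."""
    return frobenius_inner(K1, K2) / math.sqrt(
        frobenius_inner(K1, K1) * frobenius_inner(K2, K2))

def target_matrix(y):
    """Target Gram matrix y y^T for a list of +1/-1 labels."""
    return [[yi * yj for yj in y] for yi in y]

# Toy data (assumption): four examples, two per class.
labels = [1, 1, -1, -1]
target = target_matrix(labels)
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

print(alignment(target, target))    # a kernel equal to the target
print(alignment(identity, target))  # a kernel that ignores the labels
print(normalize([[4.0, 2.0], [2.0, 9.0]]))
```

In this toy case the identity kernel, which carries no label information, has kernel-target alignment 0.5, while a kernel equal to the target has alignment 1.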
Figure 9: Dependence on parameters for the substitution kernel. The graph plots the total number of families for which a given method exceeds an ROC (a) or ROC-50 (b) score threshold. The three parameter choices give almost identical results.

We tested the wildcard kernel with $(k,m,\lambda) = (5,1,1.0)$ and $(5,2,1.0)$. We observed a significant improvement in performance when we allowed up to 2 wildcards instead of 1 with $k = 5$. The $(5,2,1.0)$-wildcard kernel gave the best results among all kernel families and parameters that we tried, though several other kernel choices gave very similar performance. The results are shown in Figure 10. Intuitively, it is clear that allowing 1 mismatch is closer to permitting 2 wildcards than to permitting a single wildcard: two $k$-mers that are identical except in two positions have intersecting $(k,1)$-mismatch neighborhoods and hence their $(k,1)$-mismatch feature vectors have non-zero inner product; similarly, such a pair of $k$-mers have non-orthogonal $(k,2)$-wildcard feature vectors but orthogonal $(k,1)$-wildcard feature vectors.

Figure 10: Dependence on parameters for the wildcard kernel. The graph plots the total number of families for which a given method exceeds an ROC (a) or ROC-50 (b) score threshold. In the graph, the curve of the $(5,2,1.0)$-wildcard kernel clearly outperforms the $(5,1,1.0)$-wildcard kernel.
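The orthogonality argument above can be checked on a toy example. The sketch below enumerates the features of a single $k$-mer by brute force, which is only practical for tiny inputs and is not the trie-based algorithm of the paper. The four-letter alphabet, the two 5-mers, and the helper names (`mismatch_neighborhood`, `wildcard_features`, `dot`) are assumptions made for illustration; wildcard features are weighted by $\lambda^j$ for a pattern with $j$ wildcards.

```python
from itertools import combinations, product

ALPHABET = "ACDE"  # toy alphabet (assumption); protein sequences use 20 letters

def mismatch_neighborhood(kmer, m, alphabet=ALPHABET):
    """All k-length strings that differ from kmer in at most m positions."""
    neighbors = set()
    for n_sub in range(m + 1):
        for positions in combinations(range(len(kmer)), n_sub):
            for letters in product(alphabet, repeat=n_sub):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                neighbors.add("".join(chars))
    return neighbors

def wildcard_features(kmer, m, lam=1.0):
    """Map each pattern with at most m wildcards matching kmer to lam**j."""
    features = {}
    for j in range(m + 1):
        for positions in combinations(range(len(kmer)), j):
            chars = list(kmer)
            for pos in positions:
                chars[pos] = "*"
            features["".join(chars)] = lam ** j
    return features

def dot(f, g):
    """Inner product of two sparse feature vectors stored as dicts."""
    return sum(value * g[key] for key, value in f.items() if key in g)

# Two 5-mers that are identical except at positions 1 and 3.
a, b = "ACDEA", "AEDCA"

shared = mismatch_neighborhood(a, 1) & mismatch_neighborhood(b, 1)
print(sorted(shared))                                        # shared (5,1)-mismatch features
print(dot(wildcard_features(a, 1), wildcard_features(b, 1)))  # (5,1)-wildcard
print(dot(wildcard_features(a, 2), wildcard_features(b, 2)))  # (5,2)-wildcard
```

For this pair, the two $(5,1)$-mismatch neighborhoods share two strings, the $(5,1)$-wildcard inner product is zero, and the $(5,2)$-wildcard inner product is non-zero because both 5-mers match the pattern `A*D*A`.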
Kernel                  Kernel Alignment   ROC     ROC-50
(5,1)-mismatch          0.0982             0.875   0.416
(6,4)-gappy             0.1428             0.851   0.387
(7,5)-gappy             0.0269             0.825   0.315
(8,6)-gappy             0.0090             0.782   0.242
(4,6.0)-substitution    0.1643             0.876   0.441
(5,7.5)-substitution    0.0369             0.865   0.428
(6,9.0)-substitution    0.0170             0.871   0.442
(5,1,1.0)-wildcard      0.0310             0.816   0.304
(5,2,1.0)-wildcard      0.1565             0.881   0.447

Table 1: Mean ROC and ROC-50 scores over 54 target families.

Kernel           (5,1)-mismatch  (6,4)-gappy  (4,6)-subst  (6,9)-subst  (5,1)-wildcard  (5,2)-wildcard
(5,1)-mismatch   1.000           0.923        0.812        0.947        0.968           0.864
(6,4)-gappy                      1.000        0.915        0.742        0.775           0.955
(4,6)-subst                                   1.000        0.591        0.622           0.942
(6,9)-subst                                                1.000        0.991           0.626
(5,1)-wildcard                                                          1.000           0.669
(5,2)-wildcard                                                                          1.000

Table 2: Pairwise kernel alignment scores over the full SCOP data set.

In Table 1, we summarize the mean ROC and ROC-50 scores across the 54 target families for all the string kernel families and parameter values chosen. The table also shows mean training set kernel-target alignment scores across the experiments. Kernel alignment was introduced by Cristianini et al. (2001) as a measure of similarity between pairs of kernels or between a kernel and a target function. The empirical kernel alignment score between two kernels is defined as the value $\frac{\langle K_1, K_2 \rangle}{\sqrt{\langle K_1, K_1 \rangle \langle K_2, K_2 \rangle}}$, where $K_1$ and $K_2$ are the Gram matrices for the kernels on the sample data, and $\langle \cdot, \cdot \rangle$ is the Euclidean inner product when the Gram matrices are viewed as vectors (Hilbert-Schmidt inner product). Thus the alignment score is simply the cosine of the angle between the two vectors representing Gram matrices. The empirical kernel-target alignment is the kernel alignment for a Gram matrix and the target $yy^{\top}$, where $y$ is the column vector of labels. Table 1 shows that for the gappy and wildcard kernels, high kernel-target alignment scores do seem to correlate with good SVM classification performance. However, for the substitution kernels, the kernel-target alignment is low for larger values of $k$ while performance remains strong.

In Table 2, we show the pairwise kernel alignment scores between normalized kernels on the full SCOP data set of 7329 sequences. In some cases, the alignment scores between kernels of the same family with different parameters can be quite low, for example the $(5,1,1.0)$-wildcard kernel and the $(5,2,1.0)$-wildcard kernel. Surprisingly, the $(6,9)$-substitution kernel Gram matrix is very similar to the $(5,1,1.0)$-wildcard kernel Gram matrix when compared by alignment score, even though their SVM performance is somewhat different, showing that the score gives only a rough measure of kernel similarity. The $(6,4)$-gappy kernel, $(4,6)$-substitution kernel and $(5,2,1.0)$-wildcard kernel form a second group of well-aligned Gram matrices. The $(5,1)$-mismatch kernel seems to be in between these two groups in terms of kernel alignment. Clearly, all the models of inexact matching are fairly similar, but there do appear to be several significantly different Gram matrices in this set that all successfully represent the data for the purposes of SVM learning.

Kernel                        ROC     ROC-50
$K_{eq}$ (equal weights)      0.907   0.520
$K_{opt}$ ($\sigma = 0.01$)   0.901   0.502

Table 3: Mean ROC and ROC-50 scores of linearly combined kernels. Here $K_{opt} = \sum_{i=1}^{N} \alpha_i K_i$, where $N$ is the number of kernels, $\alpha$ is the optimal vector for the best alignment with target $yy^{\top}$, and the regularization parameter depends on $\sigma$ as described in the text.

Since different kernels capture somewhat different notions of sequence similarity, we consider whether a convex combination of kernels $K(\alpha) = \sum_{i=1}^{N} \alpha_i K_i$, with $\alpha_i \ge 0$ for $i = 1, \ldots, N$, can outperform individual kernels. We consider two schemes for choosing such a linear combination. In the first approach, we simply assign equal weights $\alpha_i = 1/N$ for all $i$ to obtain a new kernel $K_{eq}$. For a second approach, we follow Kandola et al. (2002), who proposed a general method for learning the $\alpha_i$ by solving an optimization problem to maximize the kernel alignment between the Gram matrix of $K(\alpha)$ and the target $yy^{\top}$,

$$A(S, K(\alpha), yy^{\top}) = \frac{y^{\top} K(\alpha)\, y}{|y| \, \|K(\alpha)\|},$$

yielding a new kernel $K_{opt}$ (with $|y|$ read as the number of labeled examples, so that $|y| = \|yy^{\top}\|$ for $\pm 1$ labels and the expression agrees with the alignment score defined above). Here, one introduces a regularization parameter $\lambda$ to constrain $\|\alpha\|$ and prevent over-alignment; the optimization then amounts to a quadratic programming problem that can be solved through standard methods.

We now pick 6 kernels with relatively good performance and low pairwise kernel alignment as components for the new kernel, namely $(5,1)$-mismatch, $(6,4)$-gappy, $(4,6.0)$-substitution, $(5,7.5)$-substitution, $(6,9.0)$-substitution and $(5,2,1.0)$-wildcard, and repeat the second set of SCOP experiments with these two linear combination kernels. For $K_{opt}$ we use a regularization parameter of the form $\lambda = \frac{\sigma}{N^2} \sum_{i,j} \langle K_i, K_j \rangle$, where $\langle \cdot, \cdot \rangle$ is the Hilbert-Schmidt inner product between matrices. We found that performance varied slightly but significantly as we varied $\sigma = 0.001, 0.01, 0.1, 1, 10, 100, 1000$ (results not shown); since the experiments do not contain a cross-validation set, we simply report the performance of the best parameter choice ($\sigma = 0.01$) with the caveat that this result may be somewhat optimistic. We report the mean ROC and ROC-50 scores across 54 experiments for the simple case $K_{eq}$ and the optimal alignment case $K_{opt}$ in Table 3.

We found that $K_{opt}$ with the best regularization parameter choice does achieve significant improvement over the best individual kernel (indeed, almost all regularization parameters that we tried displayed some advantage over the best individual kernel); however, the simple weighting used in $K_{eq}$ slightly outperformed $K_{opt}$ in these experiments. Interestingly, for most of the 54 experiments, $K_{opt}$ ($\sigma = 0.01$) had non-zero weights only for the two best performing kernels, the $(4,6.0)$-substitution and $(5,2,1.0)$-wildcard kernels, with the weight for the latter about an order of magnitude smaller than that of the former. These results suggest that some of the kernels are complementary to each other and that combining them can help improve performance, though it appears that optimal alignment does not outperform a simple uniform weighting scheme for combining kernels.

6. Discussion

We have presented a number of different $k$-mer based string kernels that capture a notion of inexact matching (through use of gaps, probabilistic substitutions, and wildcards) but maintain fast computation time. Using a recursive function based on a trie data structure, we show that for all our
new kernels, the time to compute a kernel value $K(x,y)$ is $O(c_K(|x| + |y|))$, where the constant $c_K$ depends on the parameters of the kernel but not on the size of the alphabet $\Sigma$. Thus we improve on the constant factor involved in computation of the previously presented mismatch kernel, in which $|\Sigma|$ as well as $k$ and $m$ control the size of the mismatch neighborhood and hence the constant $c_K$.

We also show how many of our kernels can be obtained through the recently presented transducer formalism of rational $T \circ T^{-1}$ kernels and give the transducer $T$ for several examples. This connection gives an intuitive understanding of the kernel definitions and could inspire new string kernels.

Finally, we present results on two benchmark SCOP data sets for the remote protein homology detection problem and show that many of the new, faster kernels achieve performance comparable to the mismatch kernel. We also investigate how kernel performance depends on parameter choice for the different inexact matching models. Intuitively, it is clear that the only biologically reasonable choices involve short $k$-mer features, since as we allow $k$ to grow, we cannot permit sufficient inexact matching without also introducing noise. However, within these constraints, our results demonstrate the somewhat different behavior of the various kernel families.

We note that Vishwanathan and Smola (2002) used counting statistics and a suffix tree construction to eliminate the constant factor of $k$ in computation time for the exact-matching spectrum kernel (Leslie et al., 2002a). It may be possible to extend this technique to the fast inexact-matching kernels presented here.

A promising direction for applied work in this area is combining string kernel representations with semi-supervised approaches for leveraging the abundant unlabeled protein sequence data (sequences whose 3D structure is unknown) available in sequence databases. One recent approach is presented by Weston et al. (2003), where string kernels are used as a base kernel representation, and unlabeled sequence data together with a dissimilarity measure between sequence examples are used to build cluster kernels that modify the base kernel for a richer representation. In more recent work (Kuang et al., 2004), we define $k$-mer based string kernels for probabilistic sequence profiles (Gribskov et al., 1987),
which also give a richer representation of sequences by estimating position-specific residue emission probabilities from unlabeled data. These profile-based string kernels provide another promising semi-supervised approach for kernel representation of protein sequence data.

Acknowledgments

We would like to thank Eleazar Eskin, Risi Kondor and William Stafford Noble for helpful discussions and Corinna Cortes, Patrick Haffner and Mehryar Mohri for explaining their transducer formalism to us. This work is supported by an Award in Informatics from the PhRMA Foundation, NIH grant LM07276-02, and NSF grant ITR-0312706.

References

S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. A basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.

C. Cortes, P. Haffner, and M. Mohri. Rational kernels. Neural Information Processing Systems 16, 2002.

N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge, 2000.

N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. Kandola. On kernel-target alignment. In Neural Information Processing Systems, volume 15, 2001.

R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis. Cambridge UP, 1998.

E. Eskin, W. S. Noble, Y. Singer, and S. Snir. A unified approach for sequence prediction using sparse sequence models. Technical report, Hebrew University, 2003.

M. Gribskov, A. D. McLachlan, and D. Eisenberg. Profile analysis: Detection of distantly related proteins. PNAS, pages 4355–4358, 1987.

M. Gribskov and N. L. Robinson. Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Computers and Chemistry, 20(1):25–33, 1996.

D. Haussler. Convolution kernels on discrete structure. Technical report, UC Santa Cruz, 1999.

S. Henikoff and J. G. Henikoff. Amino acid substitution matrices from protein blocks. PNAS, 89:10915–10919, 1992.

T. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, pages 149–158. AAAI Press, 1999.

J. Kandola, J. Shawe-Taylor, and N. Cristianini. Optimizing kernel alignment over combinations of kernels. NeuroCOLT Technical Report NC-TR-2002-121, http://www.neurocolt.org, 2002.

R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, and C. Leslie. Profile-based string kernels for remote homology detection and motif extraction. In Computational Systems Bioinformatics, 2004.

C. Leslie, E. Eskin, A. Cohen, J. Weston, and W. S. Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics, 2004.

C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. Proceedings of the Pacific Biocomputing Symposium, 2002a.

C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein classification. Neural Information Processing Systems 16, 2002b.

C. Leslie and R. Kuang. Fast kernels for inexact string matching. Proceedings of COLT/Kernel Workshop, 2003.

C. Liao and W. S. Noble. Combining pairwise sequence similarity and support vector machines for remote protein homology detection. Proceedings of the Sixth Annual International Conference on Research in Computational Molecular Biology, 2002.

H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Journal of Machine Learning Research, 2:419–444, 2002.
A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. SCOP: A structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology, 247:536–540, 1995.

M. Sagot. Spelling approximate or repeated motifs using a suffix tree. Lecture Notes in Computer Science, 1380:111–127, 1998.

S. L. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1:317–328, 1997.

R. M. Schwartz and M. O. Dayhoff. Matrices for detecting distant relationships. In Atlas of Protein Sequence and Structure, pages 353–358, Silver Spring, MD, 1978. National Biomedical Research Foundation.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

J.-P. Vert, H. Saigo, and T. Akutsu. Kernel Methods in Computational Biology, chapter Local alignment kernels for biological sequences. MIT Press, 2004.

S. V. N. Vishwanathan and A. Smola. Fast kernels for string and tree matching. Neural Information Processing Systems 16, 2002.

M. S. Waterman, J. Joyce, and M. Eggert. Computer alignment of sequences, chapter Phylogenetic Analysis of DNA Sequences. Oxford, 1991.

C. Watkins. Dynamic alignment kernels. Technical report, UL Royal Holloway, 1999.

J. Weston, C. Leslie, D. Zhou, A. Elisseeff, and W. S. Noble. Cluster kernels for semi-supervised protein classification. Neural Information Processing Systems 17, 2003.