Journal of Machine Learning Research 6 (2005) 615–637. Submitted 2/05; Published 4/05.

Learning Multiple Tasks with Kernel Methods

Theodoros Evgeniou (THEODOROS.EVGENIOU@INSEAD.EDU), Technology Management, INSEAD, 77300 Fontainebleau, France

Charles A. Micchelli (CAM@MATH.ALBANY.EDU), Department of Mathematics and Statistics, State University of New York, The University at Albany, 1400 Washington Avenue, Albany, NY 12222, USA

Massimiliano Pontil (M.PONTIL@CS.UCL.AC.UK), Department of Computer Science, University College London, Gower Street, London WC1E, UK

Editor: John Shawe-Taylor

Abstract

We study the problem of learning many related tasks simultaneously using kernel methods and regularization. The standard single-task kernel methods, such as support vector machines and regularization networks, are extended to the case of multi-task learning. Our analysis shows that the problem of estimating many task functions with regularization can be cast as a single task learning problem if a family of multi-task kernel functions we define is used. These kernels model relations among the tasks and are derived from a novel form of regularizers. Specific kernels that can be used for multi-task learning are provided and experimentally tested on two real data sets. In agreement with past empirical work on multi-task learning, the experiments show that learning multiple related tasks simultaneously using the proposed approach can significantly outperform standard single-task learning, particularly when there are many related tasks but few data per task.

Keywords: multi-task learning, kernels, vector-valued functions, regularization, learning algorithms

1. Introduction

Past empirical work has shown that, when there are multiple related learning tasks, it is beneficial to learn them simultaneously instead of independently as is typically done in practice (Bakker and Heskes, 2003; Caruana, 1997; Heskes, 2000; Thrun and Pratt, 1997). However, there has been insufficient research on the theory of multi-task learning and on developing multi-task learning methods. A key goal of this paper is to extend the single-task kernel learning methods which have been successfully used in recent years to multi-task learning. Our analysis establishes that the problem of estimating many task functions with regularization can be linked to a single task learning problem provided a family of multi-task kernel functions we define is used. For this purpose, we use kernels for vector-valued functions recently developed by Micchelli and Pontil (2005).
We elaborate on these ideas within a practical context and present experiments of the proposed kernel-based multi-task learning methods on two real data sets.

Multi-task learning is important in a variety of practical situations. For example, in finance and economics forecasting, predicting the value of many possibly related indicators simultaneously is often required (Greene, 2002); in marketing, modeling the preferences of many individuals, for example with similar demographics, simultaneously is common practice (Allenby and Rossi, 1999; Arora, Allenby, and Ginter, 1998); in bioinformatics, we may want to study tumor prediction from multiple micro-array data sets or analyze data from multiple related diseases. It is therefore important to extend the existing kernel-based learning methods, such as SVM and RN, that have been widely used in recent years, to the case of multi-task learning. In this paper we shall demonstrate experimentally that the proposed multi-task kernel-based methods lead to significant performance gains.

The paper is organized as follows. In Section 2 we briefly review the standard framework for single-task learning using kernel methods. We then extend this framework to multi-task learning for the case of learning linear functions in Section 3. Within this framework we develop a general multi-task learning formulation, in the spirit of SVM and RN type methods, and propose some specific multi-task learning methods as special cases. We describe experiments comparing two of the proposed multi-task learning methods to their standard single-task counterparts in Section 4. Finally, in Section 5 we discuss extensions of the results of Section 3 to non-linear models for multi-task learning, summarize our findings, and suggest future research directions.

1.1 Past Related Work

The empirical evidence that multi-task learning can lead to significant performance improvement (Bakker and Heskes, 2003; Caruana, 1997; Heskes, 2000; Thrun and Pratt, 1997) suggests that this area of machine learning should receive more development. The simultaneous estimation of multiple statistical models was considered within the econometrics and statistics literature (Greene, 2002; Zellner, 1962; Srivastava and Dwivedi, 1971) prior to the interest in multi-task learning in the machine learning community. Task relationships have typically been modeled through the assumption that the error terms (noise) for the regressions estimated simultaneously, often called "Seemingly Unrelated Regressions", are correlated (Greene, 2002). Alternatively, extensions of regularization type methods, such as ridge regression, to the case of multi-task learning have also been proposed. For example, Brown and Zidek (1980) consider the case of regression and propose an extension of the standard ridge regression to multivariate ridge regression. Breiman and Friedman (1998) propose the curds & whey method, where the relations between the various tasks are modeled in a post-processing fashion.

The problem of multi-task learning has been considered within the statistical learning and machine learning communities under the name "learning to learn" (see Baxter, 1997; Caruana, 1997; Thrun and Pratt, 1997). An extension of the VC-dimension notion and of the basic generalization bounds of SLT for single-task learning (Vapnik, 1998) to the case of multi-task learning has been developed in (Baxter, 1997, 2000) and (Ben-David and Schuller, 2003). In (Baxter, 2000) the problem of bias learning is considered, where the goal is to choose an optimal hypothesis space from a family of hypothesis spaces. In (Baxter, 2000) the notion of the "extended VC dimension" (for a family of hypothesis spaces) is defined and is used to derive generalization bounds on the average error of $T$ tasks learned, which is shown to decrease at best as $\frac{1}{T}$.
In (Baxter, 1997) the same setup was used to answer the question "how much information is needed per task in order to learn $T$ tasks" instead of "how many examples are needed for each task in order to learn $T$ tasks", and the theory is developed using Bayesian and information theory arguments instead of VC dimension ones. In (Ben-David and Schuller, 2003) the extended VC dimension was used to derive tighter bounds that hold for each task (not just the average error among tasks as considered in (Baxter, 2000)) in the case that the learning tasks are related in a particular way defined there.

More recent work considers learning multiple tasks in a semi-supervised setting (Ando and Zhang, 2004) and the problem of feature selection with SVM across the tasks (Jebara, 2004).

Finally, a number of approaches for learning multiple tasks are Bayesian, where a probability model capturing the relations between the different tasks is estimated simultaneously with the models' parameters for each of the individual tasks. In (Allenby and Rossi, 1999; Arora, Allenby, and Ginter, 1998) a hierarchical Bayes model is estimated. First, it is assumed a priori that the parameters of the $T$ functions to be learned are all sampled from an unknown Gaussian distribution. Then, an iterative Gibbs sampling based approach is used to simultaneously estimate both the individual functions and the parameters of the Gaussian distribution. In this model relatedness between the tasks is captured by this Gaussian distribution: the smaller the variance of the Gaussian, the more related the tasks are. Finally, (Bakker and Heskes, 2003; Heskes, 2000) suggest a similar hierarchical model. In (Bakker and Heskes, 2003) a mixture of Gaussians for the "upper level" distribution is used instead of a single Gaussian. This leads to clustering the tasks, one cluster for each Gaussian in the mixture.

In this paper we will not follow a Bayesian or a statistical approach. Instead, our goal is to develop multi-task learning methods and theory as an extension of widely used kernel learning methods developed within SLT or Regularization Theory, such as SVM and RN. We show that, when a particular type of kernel is used, the regularized multi-task learning method we propose is equivalent to a single-task learning one. The work here improves upon the ideas discussed in (Evgeniou and Pontil, 2004; Micchelli and Pontil, 2005b).

One of our aims is to show experimentally that the multi-task learning methods we develop here significantly improve upon their single-task counterpart, for example SVM. Therefore, to emphasize and clarify this point, we only compare the standard (single-task) SVM with a proposed multi-task version of SVM. Our experiments show the benefits of multi-task learning and indicate that multi-task kernel learning methods are superior to their single-task counterpart. An exhaustive comparison of any single-task kernel methods with their multi-task version is beyond the scope of this work.

2. Background and Notation

In this section we briefly review the basic setup for single-task learning using regularization in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ with kernel $K$. For more detailed accounts see (Evgeniou, Pontil, and Poggio, 2000; Shawe-Taylor and Cristianini, 2004; Schölkopf and Smola, 2002; Vapnik, 1998; Wahba, 1990) and references therein.

2.1 Single-Task Learning: A Brief Review

In the standard single-task learning setup we are given $m$ examples $\{(x_i, y_i) : i \in \mathbb{N}_m\}$ (we use the notation $\mathbb{N}_m$ for the set $\{1, \ldots, m\}$) sampled i.i.d. from an unknown probability distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. The input space $\mathcal{X}$ is typically a subset of $\mathbb{R}^d$, the $d$-dimensional Euclidean space, and the output space $\mathcal{Y}$ is a subset of $\mathbb{R}$. For example, in binary classification $\mathcal{Y}$ is chosen to be $\{-1, 1\}$.
The goal is to learn a function $f$ with small expected error $E[L(y, f(x))]$, where the expectation is taken with respect to $P$ and $L$ is a prescribed loss function such as the square error $(y - f(x))^2$. To this end, a common approach within SLT and regularization theory is to learn $f$ as the minimizer in $\mathcal{H}_K$ of the functional

$$\frac{1}{m} \sum_{j \in \mathbb{N}_m} L(y_j, f(x_j)) + \gamma \|f\|_K^2 \qquad (1)$$

where $\|f\|_K^2$ is the norm of $f$ in $\mathcal{H}_K$. When $\mathcal{H}_K$ consists of linear functions $f(x) = w'x$, with $w, x \in \mathbb{R}^d$, we minimize

$$\frac{1}{m} \sum_{j \in \mathbb{N}_m} L(y_j, w'x_j) + \gamma\, w'w \qquad (2)$$

where all vectors are column vectors, we use the notation $A'$ for the transpose of a matrix $A$, and $w$ is a $d \times 1$ matrix. The positive constant $\gamma$ is called the regularization parameter and controls the trade-off between the error we make on the $m$ examples (the training error) and the complexity (smoothness) of the solution as measured by the norm in the RKHS. Machines of this form have been motivated in the framework of statistical learning theory (Vapnik, 1998). Learning methods such as RN and SVM are particular cases of these machines for certain choices of the loss function $L$ (Evgeniou, Pontil, and Poggio, 2000).

Under rather general conditions (Evgeniou, Pontil, and Poggio, 2000; Micchelli and Pontil, 2005b; Wahba, 1990) the solution of Equation (1) is of the form

$$f(x) = \sum_{j \in \mathbb{N}_m} c_j K(x_j, x) \qquad (3)$$

where $\{c_j : j \in \mathbb{N}_m\}$ is a set of real parameters and $K$ is a kernel, such as a homogeneous polynomial kernel of degree $r$, $K(x, t) = (x't)^r$, $x, t \in \mathbb{R}^d$. The kernel $K$ has the property that, for $x \in \mathcal{X}$, $K(x, \cdot) \in \mathcal{H}_K$ and, for $f \in \mathcal{H}_K$, $\langle f, K(x, \cdot)\rangle_K = f(x)$, where $\langle \cdot, \cdot\rangle_K$ is the inner product in $\mathcal{H}_K$ (Aronszajn, 1950). In particular, for $x, t \in \mathcal{X}$, $K(x, t) = \langle K(x, \cdot), K(t, \cdot)\rangle_K$, implying that the $m \times m$ matrix $(K(x_i, x_j) : i, j \in \mathbb{N}_m)$ is symmetric and positive semi-definite for any set of inputs $\{x_j : j \in \mathbb{N}_m\}$.

The result in Equation (3) is known as the representer theorem. Although it is simple to prove, it is remarkable as it makes the variational problem (1) amenable for computations. In particular, if $L$ is convex, the unique minimizer of functional (1) can be found by replacing $f$ by the right hand side of Equation (3) in Equation (1) and then optimizing with respect to the parameters $\{c_j : j \in \mathbb{N}_m\}$.

A popular way to define the space $\mathcal{H}_K$ is based on the notion of a feature map $\Phi: \mathcal{X} \to \mathcal{W}$, where $\mathcal{W}$ is a Hilbert space with inner product denoted by $\langle \cdot, \cdot \rangle_{\mathcal{W}}$. Such a feature map gives rise to the linear space of all functions $f: \mathcal{X} \to \mathbb{R}$ defined, for $x \in \mathcal{X}$ and $w \in \mathcal{W}$, as $f(x) = \langle w, \Phi(x)\rangle_{\mathcal{W}}$ with norm $\langle w, w \rangle_{\mathcal{W}}$. It can be shown that this space is (modulo an isometry) the RKHS $\mathcal{H}_K$ with kernel $K$ defined, for $x, t \in \mathcal{X}$, as $K(x, t) = \langle \Phi(x), \Phi(t)\rangle_{\mathcal{W}}$. Therefore, the regularization functional (1) becomes

$$\frac{1}{m} \sum_{j \in \mathbb{N}_m} L(y_j, \langle w, \Phi(x_j)\rangle_{\mathcal{W}}) + \gamma \langle w, w \rangle_{\mathcal{W}}. \qquad (4)$$

Again, any minimizer of this functional has the form

$$w = \sum_{j \in \mathbb{N}_m} c_j \Phi(x_j) \qquad (5)$$

which is consistent with Equation (3).
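To make the representer theorem concrete for the square loss, here is a minimal sketch (ours, not part of the paper; all function names are illustrative). Replacing $f$ in (1) by the expansion (3) and setting the gradient to zero shows that the coefficients solving the linear system $(K + \gamma m I)c = y$ minimize the functional.

```python
import numpy as np

def poly_kernel(X, Z, r=2):
    # Homogeneous polynomial kernel of degree r: K(x, t) = (x't)^r.
    return (X @ Z.T) ** r

def fit_rn(X, y, gamma, kernel=poly_kernel):
    # Regularization network with the square loss: the minimizer of (1)
    # has the form (3), with coefficients solving (K + gamma*m*I) c = y.
    m = len(y)
    return np.linalg.solve(kernel(X, X) + gamma * m * np.eye(m), y)

def predict(X_train, c, X_new, kernel=poly_kernel):
    # f(x) = sum_j c_j K(x_j, x), Equation (3).
    return kernel(X_new, X_train) @ c

# Toy usage: a noisy degree-2 target.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(50)
c = fit_rn(X, y, gamma=0.01)
print(predict(X, c, X[:3]))
```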
2.2 Multi-Task Learning: Notation

For multi-task learning we have $n$ tasks and, corresponding to the $\ell$-th task, there are available $m$ examples $\{(x_{i\ell}, y_{i\ell}) : i \in \mathbb{N}_m\}$ sampled from a distribution $P_\ell$ on $\mathcal{X}_\ell \times \mathcal{Y}_\ell$. Thus, the total data available is $\{(x_{i\ell}, y_{i\ell}) : i \in \mathbb{N}_m, \ell \in \mathbb{N}_n\}$. The goal is to learn all $n$ functions $f_\ell: \mathcal{X}_\ell \to \mathcal{Y}_\ell$ from the available examples. In this paper we mainly discuss the case that the tasks have a common input space, that is, $\mathcal{X}_\ell = \mathcal{X}$ for all $\ell$, and briefly comment on the more general case in Section 5.1.

There are various special cases of this setup which occur in practice. Typically, the input space $\mathcal{X}_\ell$ is independent of $\ell$. Even more so, the input data $x_{i\ell}$ may be independent of $\ell$ for every sample $i$. This happens in marketing applications of preference modeling (Allenby and Rossi, 1999; Arora, Allenby, and Ginter, 1998), where the same choice panel questions are given to many individual consumers, each individual provides his/her own preferences, and we assume that there is some commonality among the preferences of the individuals. On the other hand, there are practical circumstances where the output data $y_{i\ell}$ is independent of $\ell$. For example, this occurs in the problem of integrating information from heterogeneous databases (Ben-David, Gehrke, and Schuller, 2002). In other cases one has neither possibility, that is, the spaces $\mathcal{X}_\ell \times \mathcal{Y}_\ell$ are different. This is, for example, the machine vision case of learning to recognize a face by first learning to recognize parts of the face, such as eyes, mouth, and nose (Heisele et al., 2002). Each of these tasks can be learned using images of different size (or different representations).

We now turn to the extension of the theory and methods for single-task learning using the regularization based kernel methods briefly reviewed above to the case of multi-task learning. In the following section we will consider the case that the functions $f_\ell$ are all linear and postpone the discussion of non-linear multi-task learning to Section 5.

3. A Framework for Multi-Task Learning: The Linear Case

Throughout this section we assume that $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \mathbb{R}$ and that the functions $f_\ell$ are linear, that is, $f_\ell(x) = u_\ell' x$ with $u_\ell \in \mathbb{R}^d$. We propose to estimate the vector of parameters $u = (u_\ell : \ell \in \mathbb{N}_n) \in \mathbb{R}^{dn}$ as the minimizer of a regularization function

$$R(u) = \frac{1}{nm} \sum_{\ell \in \mathbb{N}_n} \sum_{j \in \mathbb{N}_m} L(y_{j\ell}, u_\ell' x_{j\ell}) + \gamma J(u) \qquad (6)$$

where $\gamma$ is a positive parameter and $J$ is a homogeneous quadratic function of $u$, that is,

$$J(u) = u' E u \qquad (7)$$

with $E$ a $dn \times dn$ matrix which captures the relations between the tasks. From now on we assume that the matrix $E$ is symmetric and positive definite, unless otherwise stated. We briefly comment on the case that $E$ is positive semidefinite below.

For a certain choice of $J$ (or, equivalently, of the matrix $E$), the regularization function (6) learns the tasks independently using the regularization method (1). In particular, for $J(u) = \sum_{\ell \in \mathbb{N}_n} \|u_\ell\|^2$ the function (6) decouples, that is, $R(u) = \frac{1}{n}\sum_{\ell \in \mathbb{N}_n} r_\ell(u_\ell)$ where $r_\ell(u_\ell) = \frac{1}{m}\sum_{j \in \mathbb{N}_m} L(y_{j\ell}, u_\ell' x_{j\ell}) + \gamma\|u_\ell\|^2$, meaning that the task parameters are learned independently.
On the other hand, if we choose $J(u) = \sum_{\ell, q \in \mathbb{N}_n} \|u_\ell - u_q\|^2$, we can "force" the task parameters to be close to each other: the task parameters $u_\ell$ are learned jointly by minimizing (6).

Note that function (6) depends on $dn$ parameters, whose number can be very large if the number of tasks $n$ is large. Our analysis below establishes that the multi-task learning method (6) is equivalent to a single-task learning method as in (2) for an appropriate choice of a multi-task kernel, given in Equation (10) below. As we shall see, the input space of this kernel is the original $d$-dimensional space of the data plus an additional dimension which records the task the data belongs to. For this purpose, we take the feature space point of view and write all functions $f_\ell$ in terms of the same feature vector $w \in \mathbb{R}^p$ for some $p \in \mathbb{N}$, $p \geq dn$. That is, for each $\ell$ we write

$$f_\ell(x) = w' B_\ell x, \quad x \in \mathbb{R}^d, \ \ell \in \mathbb{N}_n \qquad (8)$$

or, equivalently,

$$u_\ell = B_\ell' w, \quad \ell \in \mathbb{N}_n \qquad (9)$$

for some $p \times d$ matrix $B_\ell$ yet to be specified. We also define the $p \times dn$ feature matrix $B = [B_\ell : \ell \in \mathbb{N}_n]$ formed by concatenating the $n$ matrices $B_\ell$, $\ell \in \mathbb{N}_n$.

Note that, since the vector $u_\ell$ in Equation (9) is arbitrary, to ensure that there exists a solution $w$ to this equation it is necessary that the matrix $B_\ell$ is of full rank $d$ for each $\ell \in \mathbb{N}_n$. Moreover, we assume that the feature matrix $B$ is of full rank $dn$ as well. If this is not the case, the functions $f_\ell$ are linearly related. For example, if we choose $B_\ell = B_0$ for every $\ell \in \mathbb{N}_n$, where $B_0$ is a prescribed $p \times d$ matrix, Equation (8) tells us that all tasks are the same task, that is, $f_1 = f_2 = \cdots = f_n$. In particular, if $p = d$ and $B_0 = I_d$, the function (11) (see below) implements a single-task learning problem as in Equation (2), with all the $mn$ data from the $n$ tasks treated as if they all come from the same task.

Said in other words, we view the vector-valued function $f = (f_\ell : \ell \in \mathbb{N}_n)$ as the real-valued function $(x, \ell) \mapsto w' B_\ell x$ on the input space $\mathbb{R}^d \times \mathbb{N}_n$ whose squared norm is $w'w$. The Hilbert space of all such real-valued functions has the reproducing kernel given by the formula

$$K((x, \ell), (t, q)) = x' B_\ell' B_q t, \quad x, t \in \mathbb{R}^d, \ \ell, q \in \mathbb{N}_n. \qquad (10)$$

We call this kernel a linear multi-task kernel since it is bilinear in $x$ and $t$ for fixed $\ell$ and $q$.

Using this linear feature map representation, we wish to convert the regularization function (6) to a function of the form (2), namely,

$$S(w) = \frac{1}{nm} \sum_{\ell \in \mathbb{N}_n} \sum_{j \in \mathbb{N}_m} L(y_{j\ell}, w' B_\ell x_{j\ell}) + \gamma\, w'w, \quad w \in \mathbb{R}^p. \qquad (11)$$

This transformation relates the matrix $E$ defining the homogeneous quadratic function $J(u)$ used in (6) and the feature matrix $B$. We describe this relationship in the proposition below.

Proposition 1 If the feature matrix $B$ is full rank and we define the matrix $E$ in Equation (7) to be $E = (B'B)^{-1}$ then we have that

$$S(w) = R(B'w), \quad w \in \mathbb{R}^p. \qquad (12)$$

Conversely, if we choose a symmetric and positive definite matrix $E$ in Equation (7) and $T$ is a square root of $E$, then for the choice $B = T'E^{-1}$ Equation (12) holds true.
PROOF. We first prove the first part of the proposition. Since Equation (9) requires that the feature vector $w$ is common to all vectors $u_\ell$ and those are arbitrary, the feature matrix $B$ must be of full rank $dn$ and, so, the matrix $E$ above is well defined. This matrix has the property that $BEB' = I_p$, this being the $p \times p$ identity matrix. Consequently, we have that

$$w'w = J(B'w), \quad w \in \mathbb{R}^p \qquad (13)$$

and Equation (12) follows.

In the other direction, we have to find a matrix $B$ such that $BEB' = I_p$. To this end, we express $E$ in the form $E = TT'$ where $T$ is a $dn \times p$ matrix, $p \geq dn$. This may be done in various ways since $E$ is positive definite. For example, with $p = dn$ we can find a $dn \times dn$ matrix $T$ by using the eigenvalues and eigenvectors of $E$. With this representation of $E$ we can choose our features to be $B = VT'E^{-1}$ where $V$ is an arbitrary $p \times p$ orthogonal matrix. This fact follows because $BEB' = I_p$. In particular, if we choose $V = I_p$ the result follows.

Note that this proposition requires that $B$ is of full rank because $E$ is positive definite. As an example, consider the case that $B_\ell$ is a $dn \times d$ matrix all of whose $d \times d$ blocks are zero except for the $\ell$-th block, which is equal to $I_d$. This choice means that we are learning all tasks independently, that is, $J(u) = \sum_{\ell \in \mathbb{N}_n} \|u_\ell\|^2$, and Proposition 1 confirms that $E = I_{dn}$.

We conjecture that if the matrix $B$ is not full rank, the equivalence between functions (11) and (6) stated in Proposition 1 still holds true provided the matrix $E$ is given by the pseudoinverse of the matrix $(B'B)$ and we minimize the latter function on the linear subspace $\mathcal{S}$ spanned by the eigenvectors of $E$ which have a positive eigenvalue. For example, in the above case where $B_\ell = B_0$ for all $\ell \in \mathbb{N}_n$ we have that $\mathcal{S} = \{(u_\ell : \ell \in \mathbb{N}_n) : u_1 = u_2 = \cdots = u_n\}$. This observation would also extend to the circumstance where there are arbitrary linear relations amongst the task functions. Indeed, we can impose such linear relations on the features directly to achieve this relation amongst the task functions. We discuss a specific example of this setup in Section 3.1.3. However, we leave a complete analysis of the positive semidefinite case to a future occasion.

The main implication of Proposition 1 is the equivalence between functions (6) and (11) when $E$ is positive definite. In particular, this proposition implies that when the matrices $B$ and $E$ are linked as stated in the proposition, the unique minimizers $w$ of (11) and $u$ of (6) are related by the equations $u_\ell = B_\ell' w$. Since functional (11) is like a single task regularization functional (2), by the representer theorem (see Equation (5)) its minimizer has the form

$$w = \sum_{j \in \mathbb{N}_m} \sum_{\ell \in \mathbb{N}_n} c_{j\ell} B_\ell x_{j\ell}.$$

This implies that the optimal task functions are

$$f_q(x) = \sum_{j \in \mathbb{N}_m} \sum_{\ell \in \mathbb{N}_n} c_{j\ell} K((x_{j\ell}, \ell), (x, q)), \quad x \in \mathbb{R}^d, \ q \in \mathbb{N}_n \qquad (14)$$

where the kernel is defined in Equation (10). Note that these equations hold for any choice of the matrices $B_\ell$, $\ell \in \mathbb{N}_n$.

Having defined the kernel in (10), we can now use standard single-task learning methods to learn multiple tasks simultaneously (we only need to define the appropriate kernel for the input data $(x, \ell)$). Specific choices of the loss function $L$ in Equation (11) lead to different learning methods.

Example 1 In regularization networks (RN) we choose the square loss $L(y, z) = (y - z)^2$, $y, z \in \mathbb{R}$ (see, for example, Evgeniou, Pontil, and Poggio, 2000). In this case the parameters $c_{j\ell}$ in Equation (14) are obtained by solving the system of linear equations

$$\sum_{q \in \mathbb{N}_n} \sum_{j \in \mathbb{N}_m} K((x_{jq}, q), (x_{i\ell}, \ell))\, c_{jq} = y_{i\ell}, \quad i \in \mathbb{N}_m, \ \ell \in \mathbb{N}_n. \qquad (15)$$

When the kernel $K$ is defined by Equation (10) this is a form of multi-task ridge regression.

Example 2 In support vector machines (SVM) for binary classification (Vapnik, 1998) we choose the hinge loss, namely $L(y, z) = (1 - yz)_+$, where $(x)_+ = \max(0, x)$ and $y \in \{-1, 1\}$. In this case, the minimization of function (11) can be rewritten in the usual form

Problem 3.1

$$\min \Big\{ \sum_{\ell \in \mathbb{N}_n} \sum_{i \in \mathbb{N}_m} \xi_{i\ell} + \gamma \|w\|^2 \Big\} \qquad (16)$$

subject, for all $i \in \mathbb{N}_m$ and $\ell \in \mathbb{N}_n$, to the constraints that

$$y_{i\ell}\, w' B_\ell x_{i\ell} \geq 1 - \xi_{i\ell} \qquad (17)$$
$$\xi_{i\ell} \geq 0.$$

Following the derivation in Vapnik (1998), the dual of this problem is given by

Problem 3.2

$$\max_{c_{i\ell}} \Big\{ \sum_{\ell \in \mathbb{N}_n} \sum_{i \in \mathbb{N}_m} c_{i\ell} - \frac{1}{2} \sum_{\ell, q \in \mathbb{N}_n} \sum_{i, j \in \mathbb{N}_m} c_{i\ell}\, y_{i\ell}\, c_{jq}\, y_{jq}\, K((x_{i\ell}, \ell), (x_{jq}, q)) \Big\} \qquad (18)$$

subject, for all $i \in \mathbb{N}_m$ and $\ell \in \mathbb{N}_n$, to the constraints that

$$0 \leq c_{i\ell} \leq \frac{1}{2\gamma}.$$

We now study particular examples, some of which we also test experimentally in Section 4.
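As an illustration of Example 1, the following sketch (ours; all names are illustrative) builds the Gram matrix of the linear multi-task kernel (10) from the feature matrices $B_\ell$ and solves a ridge-regularized version of system (15), adding $\gamma nm$ to the diagonal of the Gram matrix as is standard for the square loss (this is our formulation, not a claim about the paper's display); predictions then follow Equation (14).

```python
import numpy as np

def multitask_gram(X_list, B_list):
    # Gram matrix of the linear multi-task kernel (10),
    # K((x,l),(t,q)) = x' B_l' B_q t, over all (example, task) pairs.
    # Since phi(x, l) = B_l x is an explicit feature map, K is an
    # ordinary inner product of the stacked features.
    Phi = np.vstack([X @ B.T for X, B in zip(X_list, B_list)])
    return Phi @ Phi.T

def fit_multitask_rn(X_list, y_list, B_list, gamma):
    # Multi-task regularization network (square loss, Example 1):
    # solve (K + gamma*N*I) c = y, where N = nm is the total sample count.
    K = multitask_gram(X_list, B_list)
    y = np.concatenate(y_list)
    return np.linalg.solve(K + gamma * len(y) * np.eye(len(y)), y)

def predict_task(x, q, c, X_list, B_list):
    # f_q(x) = sum_{j,l} c_{jl} K((x_{jl}, l), (x, q)), Equation (14).
    k = np.concatenate([X @ (B.T @ (B_list[q] @ x))
                        for X, B in zip(X_list, B_list)])
    return k @ c

# Toy usage with random feature matrices (purely illustrative).
rng = np.random.default_rng(0)
n, m, d, p = 3, 10, 4, 12
B_list = [rng.standard_normal((p, d)) for _ in range(n)]
X_list = [rng.standard_normal((m, d)) for _ in range(n)]
y_list = [rng.standard_normal(m) for _ in range(n)]
c = fit_multitask_rn(X_list, y_list, B_list, gamma=0.1)
print(predict_task(rng.standard_normal(d), 0, c, X_list, B_list))
```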
3.1 Examples of Linear Multi-Task Kernels

We discuss some examples of the above framework which are valuable for applications. These cases arise from different choices of the matrices $B_\ell$ that we used above to model task relatedness or, equivalently, by directly choosing the function $J$ in Equation (6). Notice that a particular case of the regularizer $J$ in Equation (7) is given by

$$J(u) = \sum_{\ell, q \in \mathbb{N}_n} u_\ell' u_q\, G_{\ell q} \qquad (19)$$

where $G = (G_{\ell q} : \ell, q \in \mathbb{N}_n)$ is a positive definite matrix. Proposition 1 implies that the kernel has the form

$$K((x, \ell), (t, q)) = x' t\, (G^{-1})_{\ell q}. \qquad (20)$$

Indeed, $J$ can be written as $(u, Eu)$ where $E$ is the $n \times n$ block matrix whose $(\ell, q)$ block is the $d \times d$ matrix $G_{\ell q} I_d$, and the result follows. The examples we discuss are with kernels of the form (20).

3.1.1 A Useful Example

In our first example we choose $B_\ell$ to be the $(n+1)d \times d$ matrix whose $d \times d$ blocks are all zero except for the 1st and the $(\ell+1)$-th block, which are equal to $\sqrt{1-\lambda}\, I_d$ and $\sqrt{\lambda n}\, I_d$ respectively, where $\lambda \in (0, 1)$ and $I_d$ is the $d$-dimensional identity matrix. That is,

$$B_\ell' = [\sqrt{1-\lambda}\, I_d, \underbrace{0, \ldots, 0}_{\ell-1}, \sqrt{\lambda n}\, I_d, \underbrace{0, \ldots, 0}_{n-\ell}] \qquad (21)$$

where here $0$ stands for the $d \times d$ matrix all of whose entries are zero. Using Equation (10), the kernel is given by

$$K((x, \ell), (t, q)) = (1 - \lambda + \lambda n\, \delta_{\ell q})\, x' t, \quad \ell, q \in \mathbb{N}_n, \ x, t \in \mathbb{R}^d. \qquad (22)$$

A direct computation shows that

$$E_{\ell q} = ((B'B)^{-1})_{\ell q} = \frac{1}{n}\left(\frac{\delta_{\ell q}}{\lambda} - \frac{1-\lambda}{n\lambda}\right) I_d$$

where $E_{\ell q}$ is the $(\ell, q)$-th $d \times d$ block of the matrix $E$. By Proposition 1 we have that

$$J(u) = \frac{1}{n}\left(\sum_{\ell \in \mathbb{N}_n} \|u_\ell\|^2 + \frac{1-\lambda}{\lambda} \sum_{\ell \in \mathbb{N}_n} \Big\|u_\ell - \frac{1}{n}\sum_{q \in \mathbb{N}_n} u_q\Big\|^2\right). \qquad (23)$$

This regularizer enforces a trade-off between a desirable small size for the per-task parameters and closeness of each of these parameters to their average. This trade-off is controlled by the coupling parameter $\lambda$. If $\lambda$ is small the task parameters are related (close to their average), whereas if $\lambda = 1$ the tasks are learned independently.

The model of minimizing (11) with the regularizer (24) was proposed by Evgeniou and Pontil (2004) in the context of support vector machines (SVM's). In this case the above regularizer trades off large margin of each per-task SVM with closeness of each SVM to the average SVM. In Section 4 we will present numerical experiments showing the good performance of this multi-task SVM compared to both independent per-task SVM's (that is, $\lambda = 1$ in Equation (22)) and previous multi-task learning methods.

We note in passing that an alternate form for the function $J$ is

$$J(u) = \min\Big\{\frac{1}{\lambda n}\sum_{\ell \in \mathbb{N}_n} \|u_\ell - u_0\|^2 + \frac{1}{1-\lambda}\|u_0\|^2 : u_0 \in \mathbb{R}^d\Big\}. \qquad (24)$$

It was this formula which originated our interest in multi-task learning in the context of regularization; see (Evgeniou and Pontil, 2004) for a discussion.

Moreover, if we replace the identity matrix $I_d$ in Equation (21) by a (any) $d \times d$ matrix $A$, we obtain the kernel

$$K((x, \ell), (t, q)) = (1 - \lambda + \lambda n\, \delta_{\ell q})\, x' Q t, \quad \ell, q \in \mathbb{N}_n, \ x, t \in \mathbb{R}^d \qquad (25)$$

where $Q = A'A$. In this case the norm in Equations (23) and (24) is replaced by $\|\cdot\|_{Q^{-1}}$.

3.1.2 Task Clustering Regularization

The regularizer in Equation (24) implements the idea that the task parameters $u_\ell$ are all related to each other in the sense that each $u_\ell$ is close to an "average parameter" $u_0$. Our second example extends this idea to different groups of tasks, that is, we assume that the task parameters can be put together in different groups so that the parameters in the $k$-th group are all close to an average parameter $u_{0k}$. More precisely, we consider the regularizer

$$J(u) = \min\Big\{\sum_{k \in \mathbb{N}_c}\Big(\sum_{\ell \in \mathbb{N}_n} \rho_k^{(\ell)} \|u_\ell - u_{0k}\|^2 + \rho\,\|u_{0k}\|^2\Big) : u_{0k} \in \mathbb{R}^d, k \in \mathbb{N}_c\Big\} \qquad (26)$$

where $\rho_k^{(\ell)} \geq 0$, $\rho > 0$, and $c$ is the number of clusters. Our previous example corresponds to $c = 1$, $\rho = \frac{1}{1-\lambda}$ and $\rho_1^{(\ell)} = \frac{1}{\lambda n}$.

A direct computation shows that $J(u) = \sum_{\ell, q \in \mathbb{N}_n} u_\ell' u_q\, G_{\ell q}$, where the elements of the matrix $G = (G_{\ell q} : \ell, q \in \mathbb{N}_n)$ are given by

$$G_{\ell q} = \sum_{k \in \mathbb{N}_c}\left(\rho_k^{(\ell)}\delta_{\ell q} - \frac{\rho_k^{(\ell)}\rho_k^{(q)}}{\rho + \sum_{r \in \mathbb{N}_n}\rho_k^{(r)}}\right).$$

If $\rho_k^{(\ell)}$ has the property that for any $\ell$ there is a cluster $k$ such that $\rho_k^{(\ell)} > 0$, then $G$ is positive definite. Then $J$ is positive definite and by Equation (20) the kernel is given by $K((x, \ell), (t, q)) = (G^{-1})_{\ell q}\, x' t$. In particular, if $\rho_h^{(\ell)} = \delta_{h k(\ell)}$, with $k(\ell)$ the cluster task $\ell$ belongs to, the matrix $G$ is invertible and takes the simple form

$$(G^{-1})_{\ell q} = \delta_{\ell q} + \frac{1}{\rho}\,\theta_{\ell q} \qquad (27)$$

where $\theta_{\ell q} = 1$ if tasks $\ell$ and $q$ belong to the same cluster and zero otherwise. In particular, if $c = 1$ and we set $\rho = \frac{1-\lambda}{\lambda n}$, the kernel $K((x, \ell), (t, q)) = (\delta_{\ell q} + \frac{1}{\rho})\, x' t$ is the same (modulo a constant) as the kernel in Equation (22).
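The kernels (22) and (27) are simple enough to write down directly. The sketch below (ours; names are illustrative) builds their Gram matrices for a data set in which each example carries a task index: at $\lambda = 1$ kernel (22) vanishes across tasks (independent learning, up to the factor $n$), while as $\lambda \to 0$ it no longer distinguishes tasks (one common task).

```python
import numpy as np

def gram_22(X, tasks, lam, n):
    # Gram matrix of kernel (22): K((x,l),(t,q)) = (1 - lam + lam*n*delta_{lq}) x' t.
    # X: N x d data; tasks: length-N task indices in {0,...,n-1}; n: number of tasks.
    tasks = np.asarray(tasks)
    same_task = tasks[:, None] == tasks[None, :]
    return (1.0 - lam + lam * n * same_task) * (X @ X.T)

def gram_27(X, tasks, cluster_of, rho):
    # Gram matrix of the task-clustering kernel obtained from (27):
    # K((x,l),(t,q)) = (delta_{lq} + theta_{lq}/rho) x' t, where
    # theta_{lq} = 1 iff tasks l and q belong to the same cluster.
    tasks = np.asarray(tasks)
    clusters = np.asarray([cluster_of[l] for l in tasks])
    same_task = (tasks[:, None] == tasks[None, :]).astype(float)
    same_cluster = (clusters[:, None] == clusters[None, :]).astype(float)
    return (same_task + same_cluster / rho) * (X @ X.T)
```

Either Gram matrix can then be handed to any standard single-task solver, for instance the dual SVM of Problem 3.2 or the ridge system of Example 1.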
3.1.3 Graph Regularization

In our third example we choose an $n \times n$ symmetric matrix $A$ all of whose entries are in the unit interval, and consider the regularizer

$$J(u) = \frac{1}{2}\sum_{\ell, q \in \mathbb{N}_n} \|u_\ell - u_q\|^2 A_{\ell q} = \sum_{\ell, q \in \mathbb{N}_n} u_\ell' u_q\, L_{\ell q} \qquad (28)$$

where $L = D - A$ with $D_{\ell q} = \delta_{\ell q}\sum_{h \in \mathbb{N}_n} A_{\ell h}$. The matrix $A$ could be the weight matrix of a graph with $n$ vertices and $L$ the graph Laplacian (Chung, 1997). The equation $A_{\ell q} = 0$ means that tasks $\ell$ and $q$ are not related, whereas $A_{\ell q} = 1$ means a strong relation.

The quadratic function (28) is only positive semidefinite, since $J(u) = 0$ whenever all the components of $u$ are independent of $\ell$. To identify those vectors $u$ for which $J(u) = 0$ we express the Laplacian $L$ in terms of its eigenvalues and eigenvectors. Thus, we have that

$$L_{\ell q} = \sum_{k \in \mathbb{N}_n} \sigma_k v_{k\ell} v_{kq} \qquad (29)$$

where the matrix $V = (v_{k\ell})$ is orthogonal, $\sigma_1 = \cdots = \sigma_r = 0 < \sigma_{r+1} \leq \cdots \leq \sigma_n$ are the eigenvalues of $L$, and $r \geq 1$ is the multiplicity of the zero eigenvalue. The number $r$ can be expressed in terms of the number of connected components of the graph; see, for example, (Chung, 1997). Substituting the expression (29) for $L$ in the right hand side of (28), we obtain that

$$J(u) = \sum_{k \in \mathbb{N}_n} \sigma_k \Big\|\sum_{\ell \in \mathbb{N}_n} u_\ell v_{k\ell}\Big\|^2.$$

Therefore, we conclude that $J$ is positive definite on the space

$$\mathcal{S} = \Big\{u : u \in \mathbb{R}^{dn},\ \sum_{\ell \in \mathbb{N}_n} u_\ell v_{k\ell} = 0,\ k \in \mathbb{N}_r\Big\}.$$

Clearly, the dimension of $\mathcal{S}$ is $d(n - r)$. $\mathcal{S}$ gives us a Hilbert space of vector-valued linear functions

$$\mathcal{H} = \big\{u(x) = (u_\ell' x : \ell \in \mathbb{N}_n) : u \in \mathcal{S}\big\}$$

and the reproducing kernel of $\mathcal{H}$ is given by

$$K((x, \ell), (t, q)) = L^+_{\ell q}\, x' t \qquad (30)$$

where $L^+$ is the pseudoinverse of $L$, that is,

$$L^+_{\ell q} = \sum_{k=r+1}^{n} \sigma_k^{-1} v_{k\ell} v_{kq}.$$

The verification of these facts is straightforward and we do not elaborate on the details. We can use this observation to assert that on the space $\mathcal{S}$ the regularization function (6) corresponding to the Laplacian has a unique minimum, and it is given in the form of a representer theorem for kernel (30).

4. Experiments

As discussed in the introduction, we conducted experiments to compare the (standard) single-task version of a kernel machine, in this case SVM, to a multi-task version developed above. We tested two multi-task versions of SVM: a) we considered the simple case that the matrix $Q$ in Equation (25) is the identity matrix, that is, we use the multi-task kernel (22), and b) we estimated the matrix $Q$ in (25) by running PCA on the previously learned task parameters. Specifically, we first initialize $Q$ to be the identity matrix. We then iterate as follows:

1. We estimate the parameters $u_\ell$ using (25) and the current estimate of the matrix $Q$ (which, for the first iteration, is the identity matrix).

2. We run PCA on these estimates, and select only the top principal components (corresponding to the largest eigenvalues of the empirical correlation matrix of the estimated $u_\ell$). In particular, we only select the eigenvectors so that the sum of the corresponding eigenvalues (total "energy" kept) is at least 90% of the sum of all the eigenvalues (not using the remaining eigenvalues once we reach this 90% threshold). We then use the covariance of these principal components as our estimate of the matrix $Q$ in (25) for the next iteration.

We can repeat steps (1) and (2) until all eigenvalues are needed to reach the 90% energy threshold, which typically happens in 4-5 iterations for the experiments below. We can then pick the estimated $u_\ell$ after the iteration that led to the best validation error. We emphasize that this is simply a heuristic; we do not have a theoretical justification for it (a sketch of the iteration is given below). Developing a theory as well as other methods for estimating the matrix $Q$ is an open question.

Notice that instead of using PCA we could directly use for the matrix $Q$ simply the covariance of the estimated $u_\ell$ of the previous iteration. However, doing so is sensitive to estimation errors of the $u_\ell$ and leads (as we also observed experimentally; we don't show the results here for simplicity) to poorer performance.
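The following is a sketch of this iteration as we read it; it is not the authors' code. `fit_tasks`, the solver for step 1, is an abstract placeholder for a multi-task SVM trained with kernel (25), and the reconstruction of $Q$ from the retained components is our interpretation of "the covariance of these principal components".

```python
import numpy as np

def estimate_Q_by_pca(fit_tasks, X_list, y_list, n_iter=5, energy=0.90):
    # Iterative PCA heuristic of Section 4 (our reading of it).
    # fit_tasks(X_list, y_list, Q) must return the d x n matrix of task
    # parameters u_l estimated with kernel (25) under the current Q.
    d = X_list[0].shape[1]
    Q = np.eye(d)                                  # start from the identity
    for _ in range(n_iter):
        U = fit_tasks(X_list, y_list, Q)           # step 1: estimate the u_l
        w, V = np.linalg.eigh(np.cov(U))           # step 2: PCA, ascending order
        w, V = w[::-1], V[:, ::-1]                 # sort eigenvalues descending
        k = int(np.searchsorted(np.cumsum(w) / w.sum(), energy)) + 1
        if k == d:                                 # all eigenvalues needed:
            break                                  # the stopping rule in the text
        Q = (V[:, :k] * w[:k]) @ V[:, :k].T        # covariance of kept components
        Q += 1e-8 * np.eye(d)                      # keep Q invertible (our addition)
    return Q
```

In the experiments, the $u_\ell$ from the iteration with the best validation error would be retained; that bookkeeping is omitted here for brevity.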
One of the key questions we considered is: how does multi-task learning perform relative to single-task as the number of data per task and the number of tasks change? This question is also motivated by a typical situation in practice, where it may be easy to have data from many related tasks, but it may be difficult to have many data per task. This could often be the case, for example, in analyzing customer data for marketing, where we may have data about many customers (tens of thousands) but only a few samples per customer (only tens) (Allenby and Rossi, 1999; Arora, Allenby, and Ginter, 1998). It can also be the case for biological data, where we may have data about many related diseases (for example, types of cancer), but only a few samples per disease (Rifkin et al., 2003). As noted by other researchers in (Baxter, 1997, 2000; Ben-David, Gehrke, and Schuller, 2002; Ben-David and Schuller, 2003), one should expect that multi-task learning helps more, relative to single task, when we have many tasks but only few data per task, while when we have many data per task single-task learning may be as good.

We performed experiments with two real data sets. One was on customer choice data, and the other was on school exams used by (Bakker and Heskes, 2003; Heskes, 2000), which we use here also for comparison with (Bakker and Heskes, 2003; Heskes, 2000). We discuss these experiments next.

4.1 Customer Data Experiments

We tested the proposed methods using a real data set capturing choices among products made by many individuals.[1] The goal is to estimate a function for each individual modeling the preferences of the individual based on the choices he/she has made. This function is used in practice to predict what product each individual will choose among future choices. We modeled this problem as a classification one along the lines of (Evgeniou, Boussios, and Zacharia, 2002). Therefore, the goal is to estimate a classification function for each individual.

We have data from 200 individuals, and for each individual we have 120 data points. The data are three dimensional (the products were described using three attributes, such as color, price, size, etc.), each feature taking only discrete values (for example, the color can be only blue, or black, or red, etc.). To handle the discrete valued attributes, we transformed them into binary ones, eventually having 20-dimensional binary data. We consider each individual as a different "task". Therefore we have 200 classification tasks and 120 20-dimensional data points for each task, for a total of 24000 data points. We consider a linear SVM classification for each task; trials with non-linear (polynomial of degree 2 and 3) SVM did not improve performance for this data set.

To test how multi-task compares to single task as the number of data per task and/or the number of tasks changes, we ran experiments with varying numbers of data per task and numbers of tasks. In particular, we considered 50, 100, and 200 tasks, splitting the 200 tasks into 4 groups of 50 or 2 groups of 100 (or one group of 200), and then taking the average performance among the 4 groups, the 2 groups (and the 1 group). For each task we split the 120 points into 20, 30, 60, 90 training points, and 100, 90, 60, 30 test points respectively.

Given the limited number of data per task, we chose the regularization parameter $\gamma$ for the single-task SVM among only a few values (0.1, 1, 10) using the actual test error.[2] On the other hand, the multi-task learning regularization parameter $\gamma$ and the parameter $\lambda$ in (22) were chosen using a validation set consisting of one (training) data point per task, which we then included back in the training data for the final training after the parameter selection. The parameters $\lambda$ and $\gamma$ used when we estimated the matrix $Q$ through PCA were the same as when we used the identity matrix as $Q$. We note that one of the advantages of multi-task learning is that, since the data are typically from many tasks, parameters such as the regularization parameter $\gamma$ can be practically chosen using only a few validation data (proportionally to all the data available) without practically "losing" many data for parameter selection, which may be a further important practical reason for multi-task learning. The parameter $\lambda$ was chosen among the values (0, 0.2, 0.4, 0.6, 0.8), value 1 corresponding to training one SVM per task. Below we also record results indicating how the test performance is affected by the parameter $\lambda$.

We display all the results in Table 1. Notice that the performance of the single-task SVM does not change as the number of tasks increases, as expected.

[1] The data are proprietary and were provided to the authors by Research International Inc.; they are available upon request.
[2] This led to some overfitting of the single task SVM; however, it only gave our competitor an advantage over our approach.
Tasks   Data   One SVM   Indiv SVM   Identity   PCA
 50      20     41.97      29.86       28.72     29.16
100      20     41.41      29.86       28.30     29.26
200      20     40.08      29.86       27.79     28.53
 50      30     40.73      26.84       25.53     25.65
100      30     40.66      26.84       25.25     24.79
200      30     39.43      26.84       25.16     24.13
 50      60     40.33      22.84       22.06     21.08
100      60     40.02      22.84       22.27     20.79
200      60     39.74      22.84       21.86     20.00
 50      90     38.51      19.84       19.68     18.45
100      90     38.97      19.84       19.34     18.08
200      90     38.77      19.84       19.27     17.53

Table 1: Comparison of methods as the number of data per task and the number of tasks changes. "One SVM" stands for training one SVM with all the data from all the tasks, "Indiv SVM" stands for training each task independently, "Identity" stands for the multi-task SVM with the identity matrix, and "PCA" is the multi-task SVM using the PCA approach. Misclassification errors are reported. Best performance(s) at the 5% significance level is in bold.

We also note that when we use one SVM for all the tasks (treating the data as if they come from the same task) we get a very poor performance: between 38 and 42 percent test error for the (data, tasks) cases considered. From these results we draw the following conclusions:

- When there are few data per task (20, 30, or 60), both multi-task SVMs significantly outperform the single-task SVM.

- As the number of tasks increases, the advantage of multi-task learning increases: for example, for 20 data per task the improvement in performance relative to the single-task SVM is 1.14, 1.56, and 2.07 percent for the 50, 100, and 200 tasks respectively.

- When we have many data per task (90), the simple multi-task SVM does not provide any advantage relative to the single-task SVM. However, the PCA based multi-task SVM significantly outperforms the other two methods.

- When there are few data per task, the simple multi-task SVM performs better than the PCA multi-task SVM. It may be that in this case the PCA multi-task SVM overfits the data.

The last two observations indicate that it is important to have a good estimate of the matrix $Q$ in (25) for the multi-task learning method that uses $Q$. Achieving this is currently an open question that can be approached, for example, using convex optimization techniques; see, for example, (Lanckriet et al., 2004; Micchelli and Pontil, 2005b).

To explore the second point further, we show in Figure 1 the change in performance for the identity matrix based multi-task SVM relative to the single-task SVM in the case of 20 data per task. We use $\lambda = 0.6$ as before. We notice the following:
- When there are only a few tasks (for example, fewer than 20 in this case), multi-task learning can hurt the performance relative to single-task. Notice that this depends on the parameter $\lambda$ used. For example, setting $\lambda$ close to 1 leads to using a single-task SVM. Hence our experimental findings indicate that for few tasks one should use either a single-task SVM or a multi-task one with the parameter $\lambda$ selected near 1.

- As the number of tasks increases, performance improves, surpassing the performance of the single-task SVM after 20 tasks in this case.

As discussed in (Baxter, 1997, 2000; Ben-David, Gehrke, and Schuller, 2002; Ben-David and Schuller, 2003), an important theoretical question is to study the effects of adding additional tasks on the generalization performance. What our experiments show is that, for few tasks, it may be inappropriate to follow a multi-task approach if a small $\lambda$ is used, but as the number of tasks increases the performance relative to single-task learning improves. Therefore one should choose the parameter $\lambda$ depending on the number of tasks, much like one should choose the regularization parameter $\gamma$ depending on the number of data.

We tested the effects of the parameter $\lambda$ in Equation (22) on the performance of the proposed approach. In Figure 2 we plot the test error for the simple multi-task learning method using the identity matrix (kernel (22)) for the case of 20 data per task when there are 200 tasks (third row in Table 1), or 10 tasks (for which the single-task SVM outperforms the multi-task SVM for $\lambda = 0.6$, as shown in Figure 1). The parameter $\lambda$ varies from 0 (one SVM for all tasks) to 1 (one SVM per task). Notice that for the 200 tasks the error drops and then increases, having a flat minimum between $\lambda = 0.4$ and 0.6. Moreover, for any $\lambda$ between 0.2 and 1 we get a better performance than the single-task SVM. The same behavior holds for the 10 tasks, except that now the range of $\lambda$'s for which the multi-task approach outperforms the single-task one is smaller: only for $\lambda$ between 0.7 and 1. Hence, for a few tasks multi-task learning can still help if a large enough $\lambda$ is used. However, as we noted above, it is an open question as to how to choose the parameter $\lambda$ in practice, other than using a validation set.

Figure 1: The horizontal axis is the number of tasks used. The vertical axis is the total test misclassification error among the tasks. There are 20 training points per task. We also show the performance of a single-task SVM (dashed line) which, of course, does not change as the number of tasks increases.

Figure 2: The horizontal axis is the parameter $\lambda$ for the simple multi-task method with the identity matrix kernel (22). The vertical axis is the total test misclassification error among the tasks. There are 20 training points and 100 test points per task. Left is for 10 tasks, and right is for 200 tasks.

Figure 3: Performance on the school data. The horizontal axis is the parameter $\lambda$ for the simple multi-task method with the identity matrix, while the vertical axis is the explained variance (percentage) on the test data. The solid line is the performance of the proposed approach while the dashed line is the best performance reported in (Bakker and Heskes, 2003).

4.2 School Data Experiment

We also tested the proposed approach using the "school data" from the Inner London Education Authority, available at multilevel.ioe.ac.uk/intro/datasets.html. This experiment is also discussed in (Evgeniou and Pontil, 2004), where some of the ideas of this paper were first presented.
We selected this data set so that we can also compare our method directly with the work of Bakker and Heskes (2003), where a number of multi-task learning methods are applied to this data set. The data consist of examination records of 15362 students from 139 secondary schools. The goal is to predict the exam scores of the students based on the following inputs: year of the exam, gender, VR band, ethnic group, percentage of students eligible for free school meals in the school, percentage of students in VR band one in the school, gender of the school (i.e. male, female, mixed), and school denomination. We represented the categorical variables using binary (dummy) variables, so the total number of inputs for each student in each of the schools was 27. Since the goal is to predict the exam scores of the students, we ran regression using the SVM $\varepsilon$-loss function (Vapnik, 1998) for the proposed multi-task learning method. We considered each school to be "one task". Therefore, we had 139 tasks in total. We made 10 random splits of the data into training (75% of the data, hence around 70 students per school on average) and test (the remaining 25% of the data, hence around 40 students per school on average) data, and we measured the generalization performance using the explained variance of the test data in order to have a direct comparison with (Bakker and Heskes, 2003), where this error measure is used. The explained variance is defined in (Bakker and Heskes, 2003) to be the total variance of the data minus the sum-squared error on the test set, as a percentage of the total data variance, which is a percentage version of the standard $R^2$ error measure for regression on the test data. Finally, we used a simple linear kernel for each of the tasks.

The results for this experiment are shown in Figure 3. We set the regularization parameter $\gamma$ to be 1 and used a linear kernel for simplicity. We used the simple multi-task learning method proposed with the identity matrix. We let the parameter $\lambda$ vary to see its effects. For comparison we also report the performance of the task clustering method described in (Bakker and Heskes, 2003), the dashed line in the figure.

The results show again the advantage of learning all tasks (for all schools) simultaneously instead of learning them one by one. Indeed, learning each task separately in this case hurts performance a lot. Moreover, even the simple identity matrix based approach significantly outperforms the Bayesian method of (Bakker and Heskes, 2003), which in turn is better than other methods as compared in (Bakker and Heskes, 2003). Note, however, that for this data set one SVM for all tasks performs the best, which is also similar to using a small enough $\lambda$ (any $\lambda$ between 0 and 0.7 in this case). Hence, it appears that this particular data set may come from a single task (despite this observation, we use this data set for direct comparison with (Bakker and Heskes, 2003)). This result also indicates that when the tasks are the same task, using the proposed multi-task learning method does not hurt as long as a small enough $\lambda$ is used. Notice that for this data set the performance does not change significantly for $\lambda$ between 0 and 0.7, which shows that, as for the customer data above, the proposed method is not very sensitive to $\lambda$. A theoretical study of the sensitivity of our approach to the choice of the parameter $\lambda$ is an open research direction, which may also lead to a better understanding of the effects of increasing the number of tasks on the generalization performance, as discussed in (Baxter, 1997, 2000; Ben-David and Schuller, 2003).

5. Discussion and Conclusions

In this final section we outline the extensions of the ideas presented above to non-linear functions, discuss some open problems on multi-task learning, and draw our conclusions.
5.1 Nonlinear Multi-Task Kernels

We discuss a non-linear extension of the multi-task learning methods presented above. This gives us an opportunity to provide a wide variety of multi-task kernels which may be useful for applications. Our presentation builds upon earlier work on learning vector-valued functions (Micchelli and Pontil, 2005), which developed the theory of RKHS of functions whose range is a Hilbert space.

As in the linear case, we view the vector-valued function $f = (f_\ell : \ell \in \mathbb{N}_n)$ as a real-valued function on the input space $\mathcal{X} \times \mathbb{N}_n$. We express $f_\ell$ in terms of the feature maps $\Phi_\ell: \mathcal{X} \to \mathcal{W}$, $\ell \in \mathbb{N}_n$, where $\mathcal{W}$ is a Hilbert space with inner product $\langle\cdot,\cdot\rangle$. That is, we have that

$$f_\ell(x) = \langle w, \Phi_\ell(x)\rangle, \quad x \in \mathcal{X}, \ \ell \in \mathbb{N}_n.$$

The vector $w$ is computed by minimizing the single-task functional

$$S(w) = \frac{1}{nm}\sum_{\ell \in \mathbb{N}_n}\sum_{j \in \mathbb{N}_m} L(y_{j\ell}, \langle w, \Phi_\ell(x_{j\ell})\rangle) + \gamma\langle w, w\rangle, \quad w \in \mathcal{W}. \qquad (31)$$

By the representer theorem, the minimizer of the functional $S$ has the form in Equation (14), where the multi-task kernel is given by the formula

$$K((x, \ell), (t, q)) = \langle\Phi_\ell(x), \Phi_q(t)\rangle, \quad x, t \in \mathcal{X}, \ \ell, q \in \mathbb{N}_n. \qquad (32)$$

In Section 3 we discussed this approach in the case that $\mathcal{W}$ is a finite dimensional Euclidean space and $\Phi_\ell$ is the linear map $\Phi_\ell(x) = B_\ell x$, thereby obtaining the linear multi-task kernel (10). In order to generalize this case it is useful to recall a result of Schur which states that the elementwise product of two positive semidefinite matrices is also positive semidefinite (Aronszajn, 1950, p. 358). This implies that the elementwise product of two kernels is a kernel. Consequently, we conclude that, for any $r \in \mathbb{N}$,

$$K((x, \ell), (t, q)) = (x' B_\ell' B_q t)^r \qquad (33)$$

is a polynomial multi-task kernel. More generally, we have the following lemma.

Lemma 2 If $G$ is a kernel on $\mathcal{T} \times \mathcal{T}$ and, for every $\ell \in \mathbb{N}_n$, there are prescribed mappings $z_\ell: \mathcal{X} \to \mathcal{T}$ such that

$$K((x, \ell), (t, q)) = G(z_\ell(x), z_q(t)), \quad x, t \in \mathcal{X}, \ \ell, q \in \mathbb{N}_n, \qquad (34)$$

then $K$ is a multi-task kernel.

PROOF. We note that for every $\{c_{i\ell} : i \in \mathbb{N}_m, \ell \in \mathbb{N}_n\} \subseteq \mathbb{R}$ and $\{x_{i\ell} : i \in \mathbb{N}_m, \ell \in \mathbb{N}_n\} \subseteq \mathcal{X}$ we have that

$$\sum_{i,j \in \mathbb{N}_m}\sum_{\ell,q \in \mathbb{N}_n} c_{i\ell}\, c_{jq}\, G(z_\ell(x_{i\ell}), z_q(x_{jq})) = \sum_{i\ell}\sum_{jq} c_{i\ell}\, c_{jq}\, G(\tilde{z}_{i\ell}, \tilde{z}_{jq}) \geq 0$$

where we have defined $\tilde{z}_{i\ell} = z_\ell(x_{i\ell})$, and the last inequality follows from the hypothesis that $G$ is a kernel.

For the special case that $\mathcal{T} = \mathbb{R}^p$, $z_\ell(x) = B_\ell x$ with $B_\ell$ a $p \times d$ matrix, $\ell \in \mathbb{N}_n$, and $G: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ the homogeneous polynomial kernel, $G(t, s) = (t's)^r$, the lemma confirms that the function (33) is a multi-task kernel. Similarly, when $G$ is chosen to be a Gaussian kernel, we conclude that

$$K((x, \ell), (t, q)) = \exp(-\beta\|B_\ell x - B_q t\|^2)$$

is a multi-task kernel for every $\beta > 0$.
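As a concrete instance of Lemma 2, the following short sketch (ours; all names are illustrative) evaluates the Gaussian multi-task kernel just derived, with the linear maps $z_\ell(x) = B_\ell x$ playing the role of the prescribed mappings in the lemma.

```python
import numpy as np

def gaussian_multitask_kernel(x, l, t, q, B_list, beta=1.0):
    # K((x,l),(t,q)) = exp(-beta * ||B_l x - B_q t||^2), the Gaussian
    # multi-task kernel obtained from Lemma 2 with z_l(x) = B_l x.
    # B_list is a list of p x d matrices, one per task.
    diff = B_list[l] @ x - B_list[q] @ t
    return np.exp(-beta * (diff @ diff))
```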
Lemma 2 also allows us to generalize multi-task learning to the case that each task function has a different input domain, a situation which is important in applications; see, for example, (Ben-David, Gehrke, and Schuller, 2002) for a discussion. To this end, we specify sets $\mathcal{X}_\ell$, $\ell \in \mathbb{N}_n$, and functions $g_\ell: \mathcal{X}_\ell \to \mathbb{R}$, and note that multi-task learning can be placed into the above framework by defining the input space $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$. We are interested in the functions $f_\ell(x) = g_\ell(P_\ell x)$, where $x = (x_1, \ldots, x_n)$ and $P_\ell: \mathcal{X} \to \mathcal{X}_\ell$ is defined, for every $x \in \mathcal{X}$, by $P_\ell(x) = x_\ell$. Let $G$ be a kernel on $\mathcal{T} \times \mathcal{T}$ and choose $z_\ell(\cdot) = \phi_\ell(P_\ell(\cdot))$, where the $\phi_\ell: \mathcal{X}_\ell \to \mathcal{T}$ are some prescribed functions. Then by Lemma 2 the kernel defined by Equation (34) can be used to represent the functions $g_\ell$. In particular, in the case of linear functions, we choose $\mathcal{X}_\ell = \mathbb{R}^{d_\ell}$, where $d_\ell \in \mathbb{N}$, $\mathcal{T} = \mathbb{R}^p$, $p \in \mathbb{N}$, $G(t, s) = t's$, and $z_\ell = D_\ell P_\ell$, where $D_\ell$ is a $p \times d_\ell$ matrix. In this case, the multi-task kernel is given by

$$K((x, \ell), (t, q)) = x' P_\ell' D_\ell' D_q P_q t$$

which is of the form in Equation (10) for $B_\ell = D_\ell P_\ell$, $\ell \in \mathbb{N}_n$.

We note that ideas related to those presented in this section appear in (Girosi, 2003).

5.2 Conclusion and Future Work

We developed a framework for multi-task learning in the context of regularization in reproducing kernel Hilbert spaces. This naturally extends standard single-task kernel learning methods, such as SVM and RN. The framework allows us to model relationships between the tasks and to learn the task parameters simultaneously. For this purpose, we showed that multi-task learning can be seen as single-task learning if a particular family of kernels, which we called multi-task kernels, is used. We also characterized the non-linear multi-task kernels.

Within the proposed framework, we defined particular linear multi-task kernels that correspond to particular choices of regularizers which model relationships between the function parameters. For example, in the case of SVM, appropriate choices of this kernel/regularizer implemented a trade-off between large margin of each per-task individual SVM and closeness of each SVM to linear combinations of the individual SVMs, such as their average.

We tested some of the proposed methods using real data. The experimental results show that the proposed multi-task learning methods can lead to significant performance improvements relative to the single-task learning methods, especially when many tasks with few data each are learned.

A number of research questions can be studied starting from the framework and methods we developed here. We close by commenting on some issues which stem out of the main theme of this paper.

Learning a multi-task kernel. The kernel in Equation (22) is perhaps the simplest nontrivial example of a multi-task kernel. This kernel is a convex combination of two kernels, the first of which corresponds to learning independent tasks and the second of which is a rank one kernel corresponding to learning all tasks as the same task. Thus this kernel linearly combines two opposite models to form a more flexible one. Our experimental results above indicate the value of this approach provided the parameter $\lambda$ is chosen appropriately for the application at hand.
Recent work by Micchelli and Pontil (2004) shows that, under rather general conditions, the optimal convex combination of kernels can be learned by minimizing the functional in Equation (1) with respect to $K$ and $f \in \mathcal{H}_K$, where $K$ is a kernel in the convex set of kernels; see also (Lanckriet et al., 2004). Indeed, in our specific case we can show, along the lines in (Micchelli and Pontil, 2004), that the regularizer (24) is convex in $\lambda$ and $u$. This approach is rather general and can be adapted also for learning the matrix $Q$ in the kernel in Equation (25), which in our experiments we estimated by our "ad hoc" PCA approach.

Bounds on the generalization error. Yet another important question is how to bound the generalization error for multi-task learning. Recently developed bounds relying on the notion of algorithmic stability or Rademacher complexity should be easily applicable to our context. This should highlight the role played by the matrices $B_\ell$ in Equation (10). Intuitively, if $B_\ell = B_0$ we should have a simple (low-complexity) model, whereas if the $B_\ell$ are orthogonal we should have a more complex model. More specifically, this analysis should say how the generalization error, when using the kernel (22), depends on $\lambda$.

Computational considerations. A drawback of our proposed multi-task kernel method is that its computational complexity is $O(p(mn))$, which is worse than the complexity of solving $n$ independent kernel methods, this being $nO(p(m))$. The function $p$ depends on the loss function used and, typically, $p(m) = m^a$ with $a$ a positive constant; for example, for the square loss $a = 3$. Future work will focus on the study of efficient decomposition methods for solving the multi-task SVM or RN. This decomposition should exploit the structure provided by the matrices $B_\ell$ in the kernel (10). For example, if we use the kernel (22) and the tasks share the same input examples, it is possible to show that the linear system of $mn$ Equations (15) can be reduced to solving $n + 1$ systems of $m$ equations, which is essentially the same as solving $n$ independent ridge regression problems.

Multi-task feature selection. Continuing on the discussion above, we observe that if we restrict the matrix $Q$ to be diagonal, then learning $Q$ corresponds to a form of feature selection across tasks. Other feature selection formulations, where the tasks may share only some of their features, should also be possible. See also the recent work by Jebara (2004) for related work in this direction.

Online multi-task learning. An interesting problem deserving investigation is the question of how to learn a set of tasks online, where at each instance of time a set of examples for a new task is sampled. This problem is valuable in applications where an environment is explored and new data/tasks are provided during this exploration. For example, the environment could be a market of customers in our application above, or a set of scenes in computer vision which contain different objects we want to recognize.

Multi-task learning extensions. Finally, it would be interesting to extend the framework presented here to other learning problems beyond classification and regression. Two examples which come to mind are kernel density estimation (see, for example, Vapnik, 1998) and one-class SVM (Tax and Duin, 1999).
Acknowledgments

The authors would like to thank Phillip Cartwright and Simon Trusler from Research International for their help with this data set.

References

G. M. Allenby and P. E. Rossi. Marketing models of consumer heterogeneity. Journal of Econometrics, 89, p. 57–78, 1999.

R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Technical Report RC23462, IBM T. J. Watson Research Center, 2004.

N. Arora, G. M. Allenby, and J. Ginter. A hierarchical Bayes model of primary and secondary demand. Marketing Science, 17, 1, p. 29–44, 1998.

N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 686, pp. 337–404, 1950.

B. Bakker and T. Heskes. Task clustering and gating for Bayesian multi-task learning. Journal of Machine Learning Research, 4:83–99, 2003.

J. Baxter. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28, pp. 7–39, 1997.

J. Baxter. A model for inductive bias learning. Journal of Artificial Intelligence Research, 12, p. 149–198, 2000.

S. Ben-David, J. Gehrke, and R. Schuller. A theoretical framework for learning from a pool of disparate data sources. Proceedings of Knowledge Discovery and Data Mining (KDD), 2002.

S. Ben-David and R. Schuller. Exploiting task relatedness for multiple task learning. Proceedings of Computational Learning Theory (COLT), 2003.

L. Breiman and J. H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society, Series B, 1998.

P. J. Brown and J. V. Zidek. Adaptive multivariate ridge regression. The Annals of Statistics, Vol. 8, No. 1, p. 64–74, 1980.

R. Caruana. Multi-task learning. Machine Learning, 28, p. 41–75, 1997.

F. R. K. Chung. Spectral Graph Theory. CBMS Series, AMS, Providence, 1997.

T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1–50, 2000.

T. Evgeniou, C. Boussios, and G. Zacharia. Generalized robust conjoint estimation. Marketing Science, 2005 (forthcoming).

T. Evgeniou and M. Pontil. Regularized multi-task learning. Proceedings of the 10th Conference on Knowledge Discovery and Data Mining, Seattle, WA, August 2004.
F. Girosi. Demographic Forecasting. PhD Thesis, Harvard University, 2003.

W. Greene. Econometric Analysis. Prentice Hall, fifth edition, 2002.

B. Heisele, T. Serre, M. Pontil, T. Vetter, and T. Poggio. Categorization by learning and combining object parts. In Advances in Neural Information Processing Systems 14, Vancouver, Canada, Vol. 2, 1239–1245, 2002.

T. Heskes. Empirical Bayes for learning to learn. Proceedings of ICML-2000, ed. Langley, P., pp. 367–374, 2000.

T. Jebara. Multi-task feature and kernel selection for SVMs. International Conference on Machine Learning (ICML), July 2004.

M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 1993.

G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research, 5, pp. 27–72, 2004.

G. R. G. Lanckriet, T. De Bie, N. Cristianini, M. I. Jordan, and W. S. Noble. A framework for genomic data fusion and its application to membrane protein prediction. Technical Report CSD-03-1273, Division of Computer Science, University of California, Berkeley, 2003.

O. L. Mangasarian. Nonlinear Programming. Classics in Applied Mathematics. SIAM, 1994.

C. A. Micchelli and M. Pontil. Learning the kernel via regularization. Research Note RN/04/11, Dept. of Computer Science, UCL, September 2004.

C. A. Micchelli and M. Pontil. On learning vector-valued functions. Neural Computation, 17, pp. 177–204, 2005.

C. A. Micchelli and M. Pontil. Kernels for multi-task learning. Proc. of the 18th Conf. on Neural Information Processing Systems, 2005.

R. Rifkin, S. Mukherjee, P. Tamayo, S. Ramaswamy, C. Yeang, M. Angelo, M. Reich, T. Poggio, E. Lander, T. Golub, and J. Mesirov. An analytical method for multi-class molecular cancer classification. SIAM Review, Vol. 45, No. 4, p. 706–723, 2003.

J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

B. Schölkopf and A. J. Smola. Learning with Kernels. The MIT Press, Cambridge, MA, USA, 2002.

D. L. Silver and R. E. Mercer. The parallel transfer of task knowledge using dynamic learning rates based on a measure of relatedness. Connection Science, 8, p. 277–294, 1996.

V. Srivastava and T. Dwivedi. Estimation of seemingly unrelated regression equations: A brief survey. Journal of Econometrics, 10, p. 15–32, 1971.

D. M. J. Tax and R. P. W. Duin. Support vector domain description. Pattern Recognition Letters, 20(11–13), pp. 1191–1199, 1999.

S. Thrun and L. Pratt. Learning to Learn. Kluwer Academic Publishers, November 1997.

S. Thrun and J. O'Sullivan. Clustering learning tasks and the selective cross-task transfer of knowledge. In Learning to Learn, S. Thrun and L. Y. Pratt, Eds., Kluwer Academic Publishers, 1998.

V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, 1990.

A. Zellner. An efficient method for estimating seemingly unrelated regression equations and tests for aggregation bias. Journal of the American Statistical Association, 57, p. 348–368, 1962.