Preference Learning with Gaussian Processes

Wei Chu (chuwei@gatsby.ucl.ac.uk)
Zoubin Ghahramani (zoubin@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, University College London, London, WC1N 3AR, UK

In this paper, we propose a probabilistic kernel approach to preference learning based on Gaussian processes. A new likelihood function is proposed to capture the preference relations in the Bayesian framework. The generalized formulation is also applicable to label preference learning.

The problem size of the approach proposed here for preference learning remains linear in the number of training instances, rather than growing quadratically. The approach provides a general Bayesian framework for model adaptation and probabilistic prediction. The results of numerical experiments, compared against those of the constraint classification approach of Har-Peled et al. (2002), verify the usefulness of our algorithm.

The paper is organized as follows. In Section 2 we describe the probabilistic approach to preference learning over instances in detail. In Section 3 we generalize this framework to learn label preferences. In Section 4 we empirically study the performance of our algorithm on three learning tasks, and we conclude in Section 5.

2. Learning Instance Preferences

Consider a set of n distinct instances x_i and a set of m observed pairwise preference relations on the instances, denoted

    D = { v_k ≻ u_k : k = 1, ..., m },    (1)

where v_k ≻ u_k means that the instance v_k is preferred to the instance u_k. For example, the pair (v_k, u_k) could be two options provided by an automated agent for routing (Fiechter & Rogers, 2000), while the user may decide to take the route v_k rather than the route u_k by his/her own judgement.

2.1. Bayesian Framework

The main idea is to assume that there is an unobservable latent function value f(x_i) associated with each training sample x_i, and that the function values preserve the preference relations observed in the dataset. We impose a Gaussian process prior on these latent function values, and employ an appropriate likelihood function to learn from the pairwise preferences between samples. The Bayesian framework is described in more detail in the following.

2.1.1. Prior Probability

The latent function values {f(x_i)} are assumed to be a realization of random variables in a zero-mean Gaussian process (Williams & Rasmussen, 1996). The Gaussian process can then be fully specified by the covariance matrix. The covariance between the latent function values corresponding to the inputs x_i and x_j can be defined by any Mercer kernel function (Schölkopf & Smola, 2002). A simple example is the Gaussian kernel, defined as

    K(x_i, x_j) = exp( -(κ/2) Σ_{ℓ=1}^{d} (x_i^ℓ - x_j^ℓ)² ),    (2)

where κ > 0 and x^ℓ denotes the ℓ-th element of x. Thus the prior probability of these latent function values is a multivariate Gaussian

    P(f) = (2π)^{-n/2} |Σ|^{-1/2} exp( -(1/2) f^T Σ^{-1} f ),    (3)

where f = [f(x_1), f(x_2), ..., f(x_n)]^T and Σ is the n × n covariance matrix whose ij-th element is the covariance function K(x_i, x_j) defined as in (2).

2.1.2. Likelihood

A new likelihood function is proposed to capture the preference relations in (1), defined as follows for the ideally noise-free case:

    P_ideal(v_k ≻ u_k | f(v_k), f(u_k)) = 1 if f(v_k) ≥ f(u_k), and 0 otherwise.    (4)

This requires that the latent function values of the instances be consistent with their preference relations. To allow some tolerance for noise in the inputs or in the preference relations, we can assume that the latent function values are contaminated with Gaussian noise.¹ The Gaussian noise has zero mean and unknown variance σ²; N(δ; µ, σ²) is used to denote a Gaussian random variable δ with mean µ and variance σ². The likelihood function (4) then becomes

    P(v_k ≻ u_k | f(v_k), f(u_k)) = Φ(z_k),    (5)

where z_k = ( f(v_k) - f(u_k) ) / (√2 σ) and Φ(z) = ∫_{-∞}^{z} N(γ; 0, 1) dγ.

¹ In principle, any distribution rather than a Gaussian can be assumed for the noise on the latent functions.

In optimization-based approaches to machine learning, the quantity -ln P(v_k ≻ u_k | f(v_k), f(u_k)) is usually referred to as the loss function, i.e. ℓ_k = -ln Φ(z_k). The derivatives of the loss function with respect to the f(x_i) are needed in Bayesian methods. The first- and second-order derivatives of the loss function can be written as

    ∂ℓ_k/∂f(x_i) = -( s_k(x_i) / (√2 σ) ) · N(z_k)/Φ(z_k),    (6)

    ∂²ℓ_k/∂f(x_i)∂f(x_j) = ( s_k(x_i) s_k(x_j) / (2σ²) ) · [ N²(z_k)/Φ²(z_k) + z_k N(z_k)/Φ(z_k) ],    (7)

where s_k(x_i) is an indicator function which is +1 if x_i = v_k, -1 if x_i = u_k, and 0 otherwise. The likelihood is the joint probability of observing the preference relations given the latent function values, which can be evaluated as a product of the likelihood function (5), i.e.

    P(D | f) = ∏_{k=1}^{m} P(v_k ≻ u_k | f(v_k), f(u_k)).    (8)
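The likelihood (5) and its derivatives can be implemented directly. The sketch below is our own illustration in Python/NumPy, not the paper's implementation; the function names are ours, and the ratio N(z)/Φ(z) is computed in the log domain for numerical stability. A finite-difference check confirms the first-order derivative.

```python
import numpy as np
from scipy.stats import norm

def pref_loglik(f_v, f_u, sigma):
    """Log of the preference likelihood (5): ln Phi(z_k)."""
    z = (f_v - f_u) / (np.sqrt(2.0) * sigma)
    return norm.logcdf(z)

def pref_loss_grad(f_v, f_u, sigma):
    """First-order derivatives (6) of the loss -ln Phi(z_k) with
    respect to f(v_k) and f(u_k); s_k is +1 for v_k and -1 for u_k."""
    z = (f_v - f_u) / (np.sqrt(2.0) * sigma)
    ratio = np.exp(norm.logpdf(z) - norm.logcdf(z))  # N(z)/Phi(z), stable form
    g = ratio / (np.sqrt(2.0) * sigma)
    return -g, +g  # d loss / d f(v_k), d loss / d f(u_k)

def pref_loss_curv(f_v, f_u, sigma):
    """Second-order coefficient from (7): [N^2/Phi^2 + z N/Phi] / (2 sigma^2),
    which is always nonnegative."""
    z = (f_v - f_u) / (np.sqrt(2.0) * sigma)
    ratio = np.exp(norm.logpdf(z) - norm.logcdf(z))
    return (ratio * ratio + z * ratio) / (2.0 * sigma ** 2)

# finite-difference check of the first-order derivative (6)
fv, fu, s, eps = 0.3, -0.1, 0.5, 1e-6
num = -(pref_loglik(fv + eps, fu, s) - pref_loglik(fv - eps, fu, s)) / (2 * eps)
ana, _ = pref_loss_grad(fv, fu, s)
assert abs(num - ana) < 1e-5
```

Note the sign structure: increasing f(v_k) raises the likelihood of the observed preference, so the loss gradient with respect to f(v_k) is negative, and the two gradients sum to zero.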
2.1.3. Posterior Probability

Based on Bayes' theorem, the posterior probability can then be written as

    P(f | D) = P(f) P(D | f) / P(D),    (9)

where the prior probability P(f) is defined as in (3), the likelihood function is defined as in (5), and the normalization factor is P(D) = ∫ P(f) P(D | f) df. The Bayesian framework we described above is conditional on the model parameters, including the kernel parameter κ in the covariance function (2) that controls the kernel shape, and the noise level σ in the likelihood function (5). These parameters can be collected into θ, the hyperparameter vector. The normalization factor P(D), more exactly P(D | θ), is known as the evidence for the hyperparameters. In the next section, we discuss techniques for hyperparameter learning.

2.2. Model Selection

In a full Bayesian treatment, the hyperparameters θ must be integrated over θ-space for prediction. Monte Carlo methods (e.g. Neal, 1996) can be adopted here to approximate the integral effectively. However, these might be computationally prohibitive in practice. Alternatively, we consider model selection by determining an optimal setting for θ. The optimal values of the hyperparameters can simply be inferred by maximizing the evidence, i.e. θ* = arg max_θ P(D | θ). A popular idea for computing the evidence is to approximate the posterior distribution P(f | D) as a Gaussian, so that the evidence can be calculated by an explicit formula. In this section, we apply the Laplace approximation (MacKay, 1994) in evidence evaluation. The evidence can be calculated analytically after applying the Laplace approximation at the maximum a posteriori (MAP) estimate, and gradient-based optimization methods can then be employed to implement model adaptation by maximizing the evidence.

2.2.1. Maximum A Posteriori Estimate

The MAP estimate of the latent function values refers to the mode of the posterior distribution, i.e. f_MAP = arg max_f P(f | D), which is equivalent to the minimizer of the following functional:

    S(f) = Σ_{k=1}^{m} ℓ_k( f(v_k), f(u_k) ) + (1/2) f^T Σ^{-1} f.    (10)

Lemma 1. The minimization of the functional S(f), defined as in (10), is a convex programming problem.

Proof. The Hessian matrix of S(f) can be written as Σ^{-1} + Λ, where Λ is an n × n matrix whose ij-th entry is Σ_{k=1}^{m} ∂²ℓ_k/∂f(x_i)∂f(x_j). From Mercer's theorem (Schölkopf & Smola, 2002), the covariance matrix Σ is positive semidefinite.² The matrix Λ can be shown to be positive semidefinite too, as follows. Let w denote a column vector [w_1, ..., w_n]^T, and assume that the pair (v_k, u_k) in the k-th preference relation is associated with the i-th and j-th samples. By exploiting the property of the second-order derivatives (7), we have

    w^T Λ w = Σ_{k=1}^{m} c_k (w_i − w_j)²,  with  c_k = (1/2σ²) [ N²(z_k)/Φ²(z_k) + z_k N(z_k)/Φ(z_k) ],

and each c_k ≥ 0 because N(z)/Φ(z) + z > 0 for all z. So w^T Λ w ≥ 0 holds for any w, and Λ is positive semidefinite. Therefore the Hessian matrix is a positive semidefinite matrix. This proves the lemma.

² Practically, we can insert a small "jitter" term in the diagonal entries of the matrix Σ to make it positive definite.

The Newton-Raphson formula can be used to find the solution of this convex problem. As ∂S(f)/∂f = 0 at the MAP estimate, we have

    f_MAP = Σ ( −∂ Σ_{k} ℓ_k / ∂f ) |_{f = f_MAP}.    (11)

2.2.2. Evidence Approximation

The Laplace approximation of S(f) refers to carrying out the Taylor expansion at the MAP point and retaining the terms up to second order (MacKay, 1994). This is equivalent to approximating the posterior distribution P(f | D) as a Gaussian distribution centered on f_MAP with covariance matrix (Σ^{-1} + Λ_MAP)^{-1}, where Λ_MAP denotes the matrix Λ at the MAP estimate. The evidence can then be computed as an explicit expression, i.e.

    P(D | θ) ≈ exp( −S(f_MAP) ) |I + Σ Λ_MAP|^{-1/2},    (12)

where I is an n × n identity matrix. The quantity (12) is a convenient yardstick for model selection.

2.2.3. Gradient Descent

Grid search can be used to find the optimal hyperparameter values, but such an approach is very expensive when a large number of hyperparameters are involved. For example, automatic relevance determination (ARD) parameters could be embedded into the covariance function (2) as a means of feature selection.³ The ARD Gaussian kernel can be defined as

    K(x_i, x_j) = exp( -(1/2) Σ_{ℓ=1}^{d} κ_ℓ (x_i^ℓ - x_j^ℓ)² ),    (13)

³ The techniques of automatic relevance determination were originally proposed by MacKay (1994) and Neal (1996) in the context of Bayesian neural networks, as a hierarchical prior over the weights.
where κ_ℓ > 0 is the ARD parameter for the ℓ-th feature that controls the contribution of this feature to the modelling. The number of hyperparameters increases to d + 1 in this case. Gradient-based optimization methods are regarded as suitable tools to determine the values of these hyperparameters, as the gradients of the logarithm of the evidence (12) with respect to the hyperparameters θ can be derived analytically. We usually collect {ln κ, ln σ} as the set of variables to tune; this definition of the tunable variables is helpful in converting the constrained optimization problem into an unconstrained one. The gradients of ln P(D | θ) with respect to these variables can be derived as follows:

    ∂ ln P(D|θ)/∂ ln κ = κ [ (1/2) f_MAP^T Σ^{-1} (∂Σ/∂κ) Σ^{-1} f_MAP − (1/2) tr( (I + Σ Λ_MAP)^{-1} (∂Σ/∂κ) Λ_MAP ) ],    (14)

    ∂ ln P(D|θ)/∂ ln σ = σ [ Σ_{k=1}^{m} ∂ ln Φ(z_k)/∂σ |_{f_MAP} − (1/2) tr( (I + Σ Λ_MAP)^{-1} Σ (∂Λ/∂σ)|_{MAP} ) ].    (15)

Then gradient-descent methods can be employed to search for the maximizer of the log evidence.

2.3. Prediction

Now let us take a test pair (r, s) on which the preference relation is unknown. The zero-mean latent variables [f(r), f(s)]^T have correlations with the zero-mean random variables f of the training samples.⁴ The correlations are defined by the covariance function (2), so that we have the prior joint multivariate Gaussian distribution

    P( f, f(r), f(s) ) = N( 0, [ Σ  k_r  k_s ; k_r^T  K(r,r)  K(r,s) ; k_s^T  K(s,r)  K(s,s) ] ),

where k_r = [K(r, x_1), ..., K(r, x_n)]^T and k_s = [K(s, x_1), ..., K(s, x_n)]^T. So the conditional distribution P(f(r), f(s) | f) is a Gaussian too. The predictive distribution of (f(r), f(s)) can be computed as an integral over f-space, which can be written as

    P( f(r), f(s) | D ) = ∫ P( f(r), f(s) | f ) P( f | D ) df.    (16)

The posterior distribution P(f | D) can be approximated as a Gaussian by the Laplace approximation. The predictive distribution (16) can finally be simplified as a Gaussian N(µ*, Σ*) with mean

    µ* = K_t^T Σ^{-1} f_MAP,    (17)

and covariance

    Σ* = Σ_t − K_t^T (Σ + Λ_MAP^{-1})^{-1} K_t,    (18)

where K_t = [k_r, k_s] and Σ_t is the 2 × 2 prior covariance matrix of (f(r), f(s)). The predictive preference P(r ≻ s | D) can then be evaluated by the integral

    P(r ≻ s | D) = ∫∫ P(r ≻ s | f(r), f(s)) P(f(r), f(s) | D) df(r) df(s)
                 = Φ( (µ*_r − µ*_s) / √( 2σ² + Σ*_{11} + Σ*_{22} − 2Σ*_{12} ) ).    (19)

⁴ The latent variables f(r) and f(s) are assumed to be distinct from the latent variables f of the training samples.
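The MAP estimation (10)-(11) and the Laplace evidence (12) can be sketched compactly. The following is our own minimal NumPy/SciPy illustration under simplifying assumptions (full Newton steps without line search, a small toy dataset, function names of our choosing); it is not the authors' released implementation.

```python
import numpy as np
from scipy.stats import norm

def kernel(X, Y, kappa):
    # Gaussian kernel (2): K(x_i, x_j) = exp(-(kappa/2) ||x_i - x_j||^2)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * kappa * d2)

def fit_map(X, prefs, kappa=1.0, sigma=0.5, jitter=1e-6, iters=50):
    """Newton-Raphson minimization of S(f) in (10).
    `prefs` is a list of index pairs (v, u) meaning x_v is preferred to x_u."""
    n = len(X)
    Sigma = kernel(X, X, kappa) + jitter * np.eye(n)  # jittered covariance
    f = np.zeros(n)
    for _ in range(iters):
        g = np.zeros(n)          # gradient of the total loss, from (6)
        Lam = np.zeros((n, n))   # Hessian of the total loss, from (7)
        for v, u in prefs:
            z = (f[v] - f[u]) / (np.sqrt(2.0) * sigma)
            r = np.exp(norm.logpdf(z) - norm.logcdf(z))   # N(z)/Phi(z)
            c1 = r / (np.sqrt(2.0) * sigma)
            c2 = (r * r + z * r) / (2.0 * sigma ** 2)
            g[v] -= c1; g[u] += c1
            Lam[v, v] += c2; Lam[u, u] += c2
            Lam[v, u] -= c2; Lam[u, v] -= c2
        # Newton step on S(f) = loss(f) + 0.5 f^T Sigma^{-1} f
        H = Lam + np.linalg.inv(Sigma)
        step = np.linalg.solve(H, g + np.linalg.solve(Sigma, f))
        f -= step
        if np.abs(step).max() < 1e-10:
            break
    return f, Sigma, Lam

def log_evidence(f, Sigma, Lam, prefs, sigma=0.5):
    """Log of the Laplace evidence (12): -S(f_MAP) - 0.5 ln|I + Sigma Lam|."""
    loss = -sum(norm.logcdf((f[v] - f[u]) / (np.sqrt(2.0) * sigma))
                for v, u in prefs)
    S = loss + 0.5 * f @ np.linalg.solve(Sigma, f)
    _, logdet = np.linalg.slogdet(np.eye(len(f)) + Sigma @ Lam)
    return -S - 0.5 * logdet

# toy data: preferences follow the single input coordinate
X = np.array([[0.0], [0.3], [0.6], [1.0]])
prefs = [(3, 0), (3, 1), (3, 2), (2, 0), (2, 1), (1, 0)]
f_map, Sigma, Lam = fit_map(X, prefs)
assert f_map[3] > f_map[2] > f_map[1] > f_map[0]  # MAP respects the preferences
```

Since the problem is convex (Lemma 1), the Newton iteration converges reliably from f = 0 on small problems; `log_evidence` then gives the model-selection yardstick that a gradient-descent loop over {ln κ, ln σ} would maximize.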
2.4. Discussion

The evidence evaluation (12) involves solving a convex programming problem (10) and then computing the determinant of an n × n matrix, which costs CPU time O(n³), where n is the number of distinct instances in the preference pairs for training, which is potentially much smaller than the number of preference pairs. Active learning can be applied to learn on very large datasets efficiently (Brinker, 2004). The fast training algorithms for Gaussian processes (Csató & Opper, 2002) can also be adapted to the setting of preference learning for speedup. Lawrence et al. (2002) proposed a greedy selection criterion, rooted in information-theoretic principles, for sparse representation of Gaussian processes. In the Bayesian framework we have described, the expected informativeness of a new pairwise preference relation can be measured as the change in entropy of the posterior distribution of the latent functions upon the inclusion of this preference. A promising approach to active learning is to select from the data pool the sample with the highest expected information gain. This is a direction for future work.

3. Learning Label Preferences

Preference relations can be defined over the labels of instances instead of over the instances themselves. In this case, each instance is associated with a predefined set of labels, and the preference relations over the label set are defined. This learning task is also known as label ranking. The preferences of each training sample can be presented in the form of a directed graph, known as a preference graph (Dekel et al., 2004; Aiolli & Sperduti, 2004), where the labels are the graph vertices. The preference graph can be decomposed into a set of pairwise preference relations over the label set for each sample. In Figure 1, we present three popular examples as an illustration.

[Figure 1. Graphs of label preferences, where an edge from node a to node b indicates that label a is preferred to label b. (a) Standard multiclass classification, where 3 is the correct label. (b) The case of ordinal regression, where 3 is the correct ordinal scale. (c) A multi-layer graph in a hierarchical multiclass setting that specifies three levels of labels.]

Suppose that we are provided with a training dataset { x_i, E_i }_{i=1}^{n}, where x_i is a sample for training and E_i = { (c_{i,g}, e_{i,g}) }_{g=1}^{g_i} is the set of directed edges in the preference graph of x_i; here c_{i,g} is the initial label vertex of the g-th edge, e_{i,g} is the terminal label, and g_i is the number of edges. Each sample can have a different preference graph over the labels. The Bayesian framework for instance preferences can be generalized to learn label preferences, similarly to Gaussian processes for multiclass classification (Williams & Barber, 1998). We introduce L distinct Gaussian processes, one for each predefined label, and the label preferences of the samples are preserved by the latent function values in these Gaussian processes via the likelihood function (5). The prior probability of these latent functions becomes a product of multivariate Gaussians, i.e.

    P( {f_a} ) = ∏_{a=1}^{L} (2π)^{-n/2} |Σ_a|^{-1/2} exp( -(1/2) f_a^T Σ_a^{-1} f_a ),    (20)

where f_a = [f_a(x_1), f_a(x_2), ..., f_a(x_n)]^T, L is the number of labels, and Σ_a is the covariance matrix defined by the kernel function as in (2). The observed edges (c_{i,g}, e_{i,g}) require the corresponding function values to satisfy f_{c_{i,g}}(x_i) ≥ f_{e_{i,g}}(x_i). Using the likelihood function for pairwise preferences (5), the likelihood of observing these preference graphs can be computed as

    P( D | {f_a} ) = ∏_{i=1}^{n} ∏_{g=1}^{g_i} Φ( z_{i,g} ),    (21)

where z_{i,g} = ( f_{c_{i,g}}(x_i) − f_{e_{i,g}}(x_i) ) / (√2 σ). The posterior probability can then be written as

    P( {f_a} | D ) = (1 / P(D | θ)) ∏_{a=1}^{L} P(f_a) · P( D | {f_a} ),    (22)

where P(D | θ) is the model evidence. Approximate Bayesian methods can be applied to infer the optimal hyperparameters. We applied the Laplace approximation again to evaluate the evidence. The MAP estimate is equivalent to the solution of the following optimization problem:

    min_{ {f_a} } S( {f_a} ) = Σ_{i=1}^{n} Σ_{g=1}^{g_i} −ln Φ( z_{i,g} ) + (1/2) Σ_{a=1}^{L} f_a^T Σ_a^{-1} f_a.    (23)

Like (10), this is also a convex programming problem. At the MAP estimate we have f_a = Σ_a ( −∂ Σ_{i,g} ℓ_{i,g} / ∂f_a ) |_{MAP}. The evidence P(D), more exactly P(D | θ), with the Laplace approximation (MacKay, 1994), can accordingly be approximated as the following expression:

    P(D | θ) ≈ exp( −S( {f_a}_MAP ) ) |I + Σ̃ Λ̃_MAP|^{-1/2},    (24)

where I is an nL × nL identity matrix, Λ̃ is the nL × nL matrix of second-order derivatives of Σ_{i,g} ℓ_{i,g}, and Σ̃ is an nL × nL block-diagonal matrix with the blocks Σ_1, ..., Σ_L. The gradients with respect to θ can be derived as in (14)-(15) accordingly. The optimal hyperparameters can be discovered by a gradient-descent optimization package.

During prediction, the test case x_t is associated with L latent functions f_a(x_t) for the predefined labels, respectively. The correlations between f_a(x_t) and f_a are defined by the kernel function as in (2), i.e. k_a = [K_a(x_t, x_1), K_a(x_t, x_2), ..., K_a(x_t, x_n)]^T.⁵ The mean of the predictive distribution P(f_a(x_t) | D) can be approximated as E[f_a(x_t)] = k_a^T Σ_a^{-1} f_a^MAP, where f_a^MAP is the MAP estimate used in (24). The label preference can then be decided by

    arg sort_a E[f_a(x_t)].    (26)

⁵ In the current work, we simply constrained all the covariance functions to use the same kernel parameters.

This label preference learning method can be applied to tackle ordinal regression using the preference graph in Figure 1(b). Such an approach is different from our previous work on ordinal regression (Chu & Ghahramani, 2004). Both methods use a Gaussian process prior and implement ordering information by inequalities. However, the above approach needs L processes, while the approach in Chu and Ghahramani (2004) uses only a single Gaussian process. For ordinal regression problems, the approach in Chu and Ghahramani (2004) seems more natural.

4. Numerical Experiments

In the implementation of our Gaussian process algorithm for preference learning,⁶ gradient-descent methods have been employed to maximize the approximated evidence for model adaptation. We started from initial values of the hyperparameters to infer the optimal ones.⁷ We also implemented the constraint classification method of Har-Peled et al. (2002) using support vector machines for comparison purposes (denoted CC-SVM below). 5-fold cross validation was used to determine the optimal values of the model parameters (the Gaussian kernel parameter and the regularization factor) involved in the CC-SVM formulation, and the test error was obtained using the optimal model parameters for each formulation. The initial search was done on a 7 × 7 coarse grid linearly spaced by 1.0 in the log space of the two parameters, followed by a fine search on a 9 × 9 uniform grid linearly spaced by 0.2 in that space.

We begin this section by comparing the generalization performance of our algorithm against the CC-SVM approach on five datasets of instance preferences. Then we empirically study the scaling properties of the two algorithms on an information retrieval dataset. We also apply our algorithm to several classification and label ranking tasks to verify its usefulness.

4.1. Instance Preference

We first compared the performance of our algorithm against the CC-SVM approach on the task of learning instance preferences. We collected five benchmark datasets that have been used for metric regression problems.⁸ The target values were used to decide the preference relations between pairs of instances. For each dataset, we randomly selected a number of training pairs, as specified in Table 1, and 20000 pairs for testing. The selection was repeated 20 times independently. The Gaussian kernel (2) was used for both CC-SVM and our algorithm. In the CC-SVM algorithm, each preference relation is transformed into a pair of new samples with labels +1 and −1, respectively. We report the test results in Table 1, along with the results of our algorithm using the ARD Gaussian kernel (13). The GP algorithm gives significantly better test results than the CC-SVM approach on three of the five datasets. The ARD kernel yields better performance on the Boston Housing dataset and comparable results on the other datasets.
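The experimental protocol just described, deriving preference pairs from regression targets and measuring the fraction of misordered test pairs, can be sketched as follows. This is our illustrative code under one plausible reading of the protocol (pairs sampled with replacement, ties skipped); the helper names are ours, not part of the paper's toolchain.

```python
import numpy as np

def make_pref_pairs(y, n_pairs, rng):
    """Turn regression targets into pairwise preferences, as in Section 4.1:
    for a sampled pair of instances, the one with the larger target is preferred."""
    pairs = []
    while len(pairs) < n_pairs:
        i, j = rng.integers(len(y), size=2)
        if y[i] == y[j]:
            continue  # skip ties; they carry no preference information
        pairs.append((i, j) if y[i] > y[j] else (j, i))
    return pairs  # each (v, u) means instance v is preferred to u

def pref_error_rate(f, pairs):
    """Fraction of pairs whose latent values disagree with the observed
    preference direction (the 'Error Rate' reported in the tables)."""
    wrong = sum(1 for v, u in pairs if f[v] <= f[u])
    return wrong / len(pairs)

rng = np.random.default_rng(0)
y = rng.normal(size=200)                 # synthetic regression targets
train = make_pref_pairs(y, 1000, rng)
assert all(y[v] > y[u] for v, u in train)
assert pref_error_rate(y, train) == 0.0  # the targets themselves rank perfectly
```

Feeding a model's predicted latent values into `pref_error_rate` in place of `y` gives the preference error rate used throughout the experiments.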
⁶ The source code, written in ANSI C, can be found at http://www.gatsby.ucl.ac.uk/chuwei/plgp.htm.

⁷ In numerical experiments, the initial values of the hyperparameters were usually chosen as σ = 1.0 and κ = 1/d, where d is the input dimension. We suggest trying more starting points in practice and then choosing the best model by the evidence.

⁸ These regression datasets are available at http://www.liacc.up.pt/ltorgo/Regression/DataSets.html.

Table 1. Test results on the five datasets for preference learning. "Error Rate" is the percentage of incorrect preference predictions, averaged over 20 trials, along with the standard deviation. m is the number of training pairs and d is the input dimension. CC-SVM and GP denote the CC-SVM approach and our algorithm using the Gaussian kernel; GP-ARD denotes our algorithm using the ARD Gaussian kernel. We use boldface to indicate the lowest error rate. Asterisks indicate the cases where CC-SVM is significantly worse than GP; a p-value threshold of 0.01 in the Wilcoxon rank sum test was used to decide this.

[Table 1 rows: five datasets with m = 100 (d = 27), m = 300 (d = 60), m = 500 (d = 6, Machine CPU), m = 700 (d = 13, Boston Housing), and m = 1000 training pairs.]

Hersh et al. (1994) generated the OHSUMED corpus for information retrieval, in which the relevance levels of documents with respect to a given textual query were assessed by human experts using three rank scales: definitely relevant, possibly relevant, or not relevant. In our experiment to study scaling properties, we used the results of Query 3, which contain 201 references taken from the whole database (99 definitely relevant, 59 possibly relevant, and 43 irrelevant). The bag-of-words representation was used to translate these documents into vectors of term frequency (TF) components scaled by inverse document frequencies (IDF). We used the Rainbow software released by McCallum (1996) to scan the titles and abstracts of these documents to compute the TF-IDF vectors. In the preprocessing, we skipped the terms in the stoplist,⁹ and restricted ourselves to terms that appear in at least 3 of the 201 documents, so each document is represented by its TF-IDF vector with 576 distinct elements. To account for different document lengths, we normalized the length of each document vector to unity. The preference relation between a pair of documents can be determined by their rank scales. We randomly selected subsets of pairs of documents (having different rank scales) of sizes from 200 up to 1000 for training, and then tested on the remaining pairs. At each size, the random selection was carried out 20 times. The linear kernel was used for both CC-SVM and our algorithm.

⁹ The stoplist is the SMART system's list of 524 common words, like "the" and "of".

The test results of the two algorithms are presented in the left graph of Figure 2. The performances of the two algorithms are very competitive on this application. In the right graph of Figure 2, the circles present the CPU time consumed to solve (10) and evaluate the evidence (12) once in our algorithm, while the crosses present the CPU time for solving the quadratic programming problem once in the CC-SVM approach.

[Figure 2. The left graph presents the test error rates on preference relations of the OHSUMED dataset at different training data sizes; the crosses indicate the average values over the 20 trials and the vertical lines indicate the standard deviation. The right graph presents the CPU time in seconds consumed by the two algorithms.]

We observed that the computational cost of the CC-SVM approach depends on the number of preference pairs in training, with a scaling exponent of about 2.2, whereas the overhead of our algorithm is almost independent of the number of training preference pairs. As discussed in Section 2.4, the complexity of our algorithm depends mainly on the number of distinct instances involved in the training data. Since the number of pairwise preferences for training is usually much larger than the number of instances, this computational advantage is one of the merits of our algorithm over CC-SVM-like algorithms.

4.2. Classification

Next, we selected five benchmark datasets for multiclass classification used by Wu et al. (2004), and applied both CC-SVM and our algorithm to these tasks.¹⁰ All the datasets contain 300 training samples and 500 test samples. The partitions were repeated 20 times for each dataset. The numbers of classes and features of the five datasets are recorded in Table 2, denoted by L and d respectively. In our algorithm, the preference graph of each training sample contains L − 1 edges, as depicted in Figure 1(a), and the predicted class can be determined by arg max_a E[f_a(x_t)], where E[f_a(x_t)] is defined as in (26). In the CC-SVM algorithm, each training sample is transformed into 2(L − 1) new samples that represent the L − 1 pairwise preferences (Har-Peled et al., 2002). The Gaussian kernel (2) was used for both CC-SVM and our algorithm. We report the test results in Table 2, along with the results of SVM with pairwise coupling cited from Table 2 of Wu et al. (2004). Our GP approach is very competitive with pairwise coupling SVM and the CC-SVM algorithm on class label prediction, and significantly better than the CC-SVM algorithm in preference prediction on two of the five datasets.

¹⁰ These classification datasets are maintained at www.csie.ntu.edu.tw/cjlin/papers/svmprob/data.

Table 2. Test results on the five datasets for standard multiclass classification. L is the number of classes and d denotes the number of input features. "Label Error Rate" denotes the percentage of incorrect predictions of class labels, averaged over 20 trials. "Pref Error Rate" denotes the percentage of incorrect predictions of preference relations, averaged over 20 trials along with the standard deviation. PW denotes the results of SVM with pairwise coupling cited from Wu et al. (2004). CC-SVM and GP denote the CC-SVM approach and our algorithm using the Gaussian kernel. We use boldface to indicate the lowest error rate. Asterisks indicate the cases where CC-SVM is significantly worse than GP; a p-value threshold of 0.01 in the Wilcoxon rank sum test was used to decide the statistical significance.
[Table 2 rows: the five datasets, including Waveform (label error rates 16.23 and 16.76) and Satimage (14.23 and 15.21).]

4.3. Label Ranking

To test on label ranking tasks, we used the decision-theoretic setting related to expected utility theory described by Fürnkranz and Hüllermeier (2003). An agent attempts to take one action from a set of alternative actions with the purpose of maximizing the expected utility under uncertainty about the world state. The expected utility of an action a is given by E(a) = Σ_s P(s) U(s, a), where P(s) is the probability of being in state s and U(s, a) ∈ [0, 1] is the utility of taking action a in state s. In our experiment, the set of samples, corresponding to a set of probability vectors over the world states, was randomly generated according to a uniform distribution. We fixed the number of world states/features at 10 and the number of samples at 50, but varied the number of actions/labels L from 2 to 10. The utility matrix was generated at random by drawing independent and uniformly distributed entries in [0, 1]. At each label size, we independently repeated this procedure 20 times. The two algorithms employed the linear kernel to learn the underlying utility matrix. In our algorithm, the preference graph of each training sample contains L − 1 edges, and the label preference for test samples was decided by (26). In the CC-SVM algorithm, each training sample was transformed into 2(L − 1) new samples with augmented features. The preference test error rates and the averaged Spearman rank correlations are presented in Figure 3. The rank correlation coefficient for each test case is defined as

    1 − 6 Σ_{a=1}^{L} (p_a − p̂_a)² / ( L(L² − 1) ),

where p_a is the true rank of label a and p̂_a is the predicted rank.

[Figure 3. The left graph presents the preference test error rates of the two algorithms on the label ranking tasks with different numbers of labels, while the right graph presents the averaged rank correlation coefficients. The crosses indicate the average values over the 20 trials and the vertical lines indicate the standard deviations.]

On this application, our GP algorithm is clearly superior to the CC-SVM approach in generalization capacity, especially when the number of labels becomes large. A potential reason for this observation is that learning within the augmented input space, whose dimension grows with the number of labels, becomes much harder.

5. Conclusions

In this paper we proposed a nonparametric Bayesian approach to preference learning over instances or labels. The formulation for learning label preferences is also applicable to many multiclass learning tasks. In both formulations, the problem size remains linear in the number of distinct samples in the training preference pairs. Existing fast algorithms for Gaussian processes can be adapted to tackle large datasets. Experimental results on benchmark datasets show that the generalization performance of our algorithm is competitive with, and often better than, the constraint classification approach with support vector machines.

Acknowledgments

This work was supported by the National Institutes of Health and its National Institute of General Medical Sciences division under Grant Number 1 P01 GM63208.

References

Aiolli, F., & Sperduti, A. (2004). Learning preferences for multiclass problems. Advances in Neural Information Processing Systems 17.

Bahamonde, A., Bayón, G. F., Díez, J., Quevedo, J. R., Luaces, O., del Coz, J. J., Alonso, J., & Goyache, F. (2004). Feature subset selection for learning preferences: A case study. Proceedings of the 21st International Conference on Machine Learning (pp. 49-56).

Brinker, K. (2004). Active learning of label ranking functions. Proceedings of the 21st International Conference on Machine Learning (pp. 129-136).

Chu, W., & Ghahramani, Z. (2004). Gaussian processes for ordinal regression (Technical Report). Gatsby Computational Neuroscience Unit, University College London. http://www.gatsby.ucl.ac.uk/chuwei/paper/gpor.pdf.

Csató, L., & Opper, M. (2002). Sparse online Gaussian processes. Neural Computation, 641-668.

Dekel, O., Keshet, J., & Singer, Y. (2004). Log-linear models for label ranking. Proceedings of the 21st International Conference on Machine Learning (pp. 209-216).

Doyle, J. (2004). Prospects for preferences. Computational Intelligence, 111-136.

Fiechter, C.-N., & Rogers, S. (2000). Learning subjective functions with large margins. Proceedings of the 17th International Conference on Machine Learning (pp. 287-294).

Fürnkranz, J., & Hüllermeier, E. (2003). Pairwise preference learning and ranking. Proceedings of the 14th European Conference on Machine Learning (pp. 145-156).

Fürnkranz, J., & Hüllermeier, E. (2005). Preference learning. Künstliche Intelligenz. In press.

Har-Peled, S., Roth, D., & Zimak, D. (2002). Constraint classification: A new approach to multiclass classification and ranking. Advances in Neural Information Processing Systems 15.

Herbrich, R., Graepel, T., Bollmann-Sdorra, P., & Obermayer, K. (1998). Learning preference relations for information retrieval. Proceedings of the ICML Workshop on Text Categorization and Machine Learning (pp. 80-84).

Hersh, W., Buckley, C., Leone, T., & Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. Proceedings of the 17th Annual ACM SIGIR Conference (pp. 192-201).

Lawrence, N. D., Seeger, M., & Herbrich, R. (2002). Fast sparse Gaussian process methods: The informative vector machine. Advances in Neural Information Processing Systems 15 (pp. 609-616).

MacKay, D. J. C. (1994). Bayesian methods for backpropagation networks. Models of Neural Networks III, 211-254.

McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/mccallum/bow.

Neal, R. M. (1996). Bayesian learning for neural networks. Lecture Notes in Statistics. Springer.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: The MIT Press.

Williams, C. K. I., & Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1342-1351.

Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. Advances in Neural Information Processing Systems (pp. 598-604). MIT Press.

Wu, T.-F., Lin, C.-J., & Weng, R. C. (2004). Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 975-1005.