
Factorization Machines

Steffen Rendle
Department of Reasoning for Intelligence
The Institute of Scientific and Industrial Research
Osaka University, Japan
rendle@ar.sanken.osaka-u.ac.jp

Abstract—In this paper, we introduce Factorization Machines (FM) which are a new model class that combines the advantages of Support Vector Machines (SVM) with factorization models. Like SVMs, FMs are a general predictor working with any real valued feature vector. In contrast to SVMs, FMs model all interactions between variables using factorized parameters. Thus they are able to estimate interactions even in problems with huge sparsity (like recommender systems) where SVMs fail. We show that the model equation of FMs can be calculated in linear time and thus FMs can be optimized directly. So unlike nonlinear SVMs, a transformation into the dual form is not necessary and the model parameters can be estimated directly without the need of any support vector in the solution. We show the relationship to SVMs and the advantages of FMs for parameter estimation in sparse settings.
On the other hand there are many different factorization models like matrix factorization, parallel factor analysis or specialized models like SVD++, PITF or FPMC. The drawback of these models is that they are not applicable for general prediction tasks but work only with special input data. Furthermore their model equations and optimization algorithms are derived individually for each task. We show that FMs can mimic these models just by specifying the input data (i.e. the feature vectors). This makes FMs easily applicable even for users without expert knowledge in factorization models.

Index Terms—factorization machine; sparse data; tensor factorization; support vector machine

I. INTRODUCTION

Support Vector Machines are one of the most popular predictors in machine learning and data mining. Nevertheless, in settings like collaborative filtering, SVMs play no important role and the best models are either direct applications of standard matrix/tensor factorization models like PARAFAC [1] or specialized models using factorized parameters [2], [3], [4]. In this paper, we show that the only reason why standard SVM predictors are not successful in these tasks is that they cannot learn reliable parameters ('hyperplanes') in complex (non-linear) kernel spaces under very sparse data. On the other hand, the drawback of tensor factorization models, and even more so of specialized factorization models, is that (1) they are not applicable to standard prediction data (e.g. a real valued feature vector in R^n) and (2) specialized models are usually derived individually for a specific task, requiring effort in modelling and in the design of a learning algorithm.

In this paper, we introduce a new predictor, the Factorization Machine (FM), that is a general predictor like SVMs but is also able to estimate reliable parameters under very high sparsity. The factorization machine models all nested variable interactions (comparable to a polynomial kernel in SVM), but uses a factorized parametrization instead of a dense parametrization like in SVMs. We show that the model equation of FMs can be computed in linear time and that it depends only on a linear number of parameters. This allows direct optimization and storage of model parameters without the need of storing any training data (e.g. support vectors) for prediction. In contrast to this, non-linear SVMs are usually optimized in the dual form and computing a prediction (the model equation) depends on parts of the training data (the support vectors). We also show that FMs subsume many of the most successful approaches for the task of collaborative filtering, including biased MF, SVD++ [2], PITF [3] and FPMC [4].

In total, the advantages of our proposed FM are:
1) FMs allow parameter estimation under very sparse data where SVMs fail.
2) FMs have linear complexity, can be optimized in the primal and do not rely on support vectors like SVMs. We show that FMs scale to large datasets like Netflix with 100 millions of training instances.
3) FMs are a general predictor that can work with any real valued feature vector. In contrast to this, other state-of-the-art factorization models work only on very restricted input data. We will show that just by defining the feature vectors of the input data, FMs can mimic state-of-the-art models like biased MF, SVD++, PITF or FPMC.
II. PREDICTION UNDER SPARSITY

The most common prediction task is to estimate a function y: R^n -> T from a real valued feature vector x in R^n to a target domain T (e.g. T = R for regression or T = {+, -} for classification). In supervised settings, it is assumed that a training dataset D = {(x^(1), y^(1)), (x^(2), y^(2)), ...} of examples for the target function y is given. We also investigate the ranking task, where the function y with target T = R can be used to score feature vectors x and sort them according to their score. Scoring functions can be learned with pairwise training data [5], where a feature tuple (x^(A), x^(B)) in D means that x^(A) should be ranked higher than x^(B). As the pairwise ranking relation is antisymmetric, it is sufficient to use only positive training instances.

In this paper, we deal with problems where x is highly sparse, i.e. almost all of the elements x_i of a vector x are zero. Let m(x) be the number of non-zero elements in the feature vector x and let \bar{m}_D be the average number of non-zero elements m(x) over all vectors x in D. Huge sparsity (\bar{m}_D << n) appears in many real-world data like feature vectors of event transactions (e.g. purchases in recommender systems) or text analysis (e.g. bag-of-words approach). One reason for huge sparsity is that the underlying problem deals with large categorical variable domains.

Fig. 1. Example of sparse real valued feature vectors x that are created from the transactions of Example 1. Every row represents a feature vector x^(i) with its corresponding target y^(i). The first 4 columns (blue) represent indicator variables for the active user; the next 5 (red) are indicator variables for the active item. The next 5 columns (yellow) hold additional implicit indicators (i.e. other movies the user has rated). One feature (green) represents the time in months. The last 5 columns (brown) have indicators for the last movie the user has rated before the active one. The rightmost column is the target, here the rating.

Example 1: Assume we have the transaction data of a movie review system. The system records which user u in U rates a movie (item) i in I at a certain time t in R with a rating r in {1, 2, 3, 4, 5}. Let the users U and items I be:

U = {Alice (A), Bob (B), Charlie (C), ...}
I = {Titanic (TI), Notting Hill (NH), Star Wars (SW), Star Trek (ST), ...}

Let the observed data S be:

S = {(A, TI, 2010-1, 5), (A, NH, 2010-2, 3), (A, SW, 2010-4, 1),
     (B, SW, 2009-5, 4), (B, ST, 2009-8, 5),
     (C, TI, 2009-9, 1), (C, SW, 2009-12, 5)}

An example of a prediction task using this data is to estimate a function \hat{y} that predicts the rating behaviour of a user for an item at a certain point in time.

Figure 1 shows one example of how feature vectors can be created from S for this task.^1 Here, first there are |U| binary indicator variables (blue) that represent the active user of a transaction; there is always exactly one active user in each transaction (u, i, t, r) in S, e.g. user Alice in the first one (x^(1)_A = 1). The next |I| binary indicator variables (red) hold the active item; again there is always exactly one active item (e.g. x^(1)_TI = 1). The feature vectors in Figure 1 also contain indicator variables (yellow) for all the other movies the user has ever rated. For each user, these variables are normalized such that they sum up to 1; e.g. Alice has rated Titanic, Notting Hill and Star Wars. Additionally, the example contains a variable (green) holding the time in months starting from January 2009. And finally the vector contains an indicator for the last movie (brown) the user has rated before (s)he rated the active one, e.g. for x^(2), Alice rated Titanic before she rated Notting Hill. In Section V, we show how factorization machines using such feature vectors as input data are related to specialized state-of-the-art factorization models.

^1 To simplify readability, we will use categorical levels (e.g. Alice (A)) instead of numbers (e.g. 1) to identify elements in vectors wherever it makes sense (e.g. we write x_A or x_Alice instead of x_1).

We will use this example data throughout the paper for illustration. However, please note that FMs are general predictors like SVMs and thus are applicable to any real valued feature vectors and are not restricted to recommender systems.
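The feature construction in Figure 1 is purely mechanical, so it may help to see it spelled out. The following is a minimal sketch, not code from the paper: the function name build_feature_vector, the dense numpy layout and the restriction to the four named movies are assumptions made for illustration (Fig. 1 reserves further columns for additional movies), and the month values are derived by counting from January 2009.

```python
import numpy as np

users = ["A", "B", "C"]                   # Alice, Bob, Charlie
items = ["TI", "NH", "SW", "ST"]          # Titanic, Notting Hill, Star Wars, Star Trek

# Observed transactions S of Example 1: (user, item, month counted from Jan 2009, rating).
S = [("A", "TI", 13, 5), ("A", "NH", 14, 3), ("A", "SW", 16, 1),
     ("B", "SW", 5, 4), ("B", "ST", 8, 5),
     ("C", "TI", 9, 1), ("C", "SW", 12, 5)]

# All movies each user has ever rated (used for the normalized implicit block).
rated_by = {u: [i for (uu, i, _, _) in S if uu == u] for u in users}

def build_feature_vector(u, i, t, last_item=None):
    """One row of Fig. 1: user block | item block | implicit block | time | last movie."""
    user_block = [float(uu == u) for uu in users]
    item_block = [float(ii == i) for ii in items]
    implicit = [1.0 / len(rated_by[u]) if ii in rated_by[u] else 0.0 for ii in items]
    last_block = [float(ii == last_item) for ii in items]
    return np.array(user_block + item_block + implicit + [float(t)] + last_block)

# Second transaction: Alice rates Notting Hill in month 14, having rated Titanic before.
x2 = build_feature_vector("A", "NH", 14, last_item="TI")
y2 = 3.0   # the corresponding target rating
```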
III. FACTORIZATION MACHINES (FM)

In this section, we introduce factorization machines. We discuss the model equation in detail and show briefly how to apply FMs to several prediction tasks.

A. Factorization Machine Model

1) Model Equation: The model equation for a factorization machine of degree d = 2 is defined as:

\hat{y}(x) := w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j    (1)

where the model parameters that have to be estimated are:

w_0 \in R,  w \in R^n,  V \in R^{n \times k}    (2)

and \langle \cdot, \cdot \rangle is the dot product of two vectors of size k:

\langle v_i, v_j \rangle := \sum_{f=1}^{k} v_{i,f} \, v_{j,f}    (3)

A row v_i within V describes the i-th variable with k factors. k \in N_0^+ is a hyperparameter that defines the dimensionality of the factorization.

A 2-way FM (degree d = 2) captures all single and pairwise interactions between variables:
- w_0 is the global bias.
- w_i models the strength of the i-th variable.
- \hat{w}_{i,j} := \langle v_i, v_j \rangle models the interaction between the i-th and j-th variable. Instead of using an own model parameter w_{i,j} \in R for each interaction, the FM models the interaction by factorizing it. We will see later on that this is the key point which allows high quality parameter estimates of higher-order interactions (d >= 2) under sparsity.

2) Expressiveness: It is well known that for any positive definite matrix W there exists a matrix V such that W = V V^t, provided that k is sufficiently large. This shows that a FM can express any interaction matrix W if k is chosen large enough. Nevertheless, in sparse settings typically a small k should be chosen because there is not enough data to estimate complex interactions W. Restricting k, and thus the expressiveness of the FM, leads to better generalization and thus improved interaction matrices under sparsity.

3) Parameter Estimation Under Sparsity: In sparse settings, there is usually not enough data to estimate interactions between variables directly and independently. Factorization machines can estimate interactions well even in these settings because they break the independence of the interaction parameters by factorizing them. In general this means that the data for one interaction also helps to estimate the parameters of related interactions. We will make the idea clearer with an example from the data in Figure 1. Assume we want to estimate the interaction between Alice (A) and Star Trek (ST) for predicting the target y (here the rating). Obviously, there is no case x in the training data where both variables x_A and x_ST are non-zero, and thus a direct estimate would lead to no interaction (w_{A,ST} = 0). But with the factorized interaction parameters \langle v_A, v_ST \rangle we can estimate the interaction even in this case. First of all, Bob and Charlie will have similar factor vectors v_B and v_C because both have similar interactions with Star Wars (v_SW) for predicting ratings, i.e. \langle v_B, v_SW \rangle and \langle v_C, v_SW \rangle have to be similar. Alice (v_A) will have a different factor vector from Charlie (v_C) because she has different interactions with the factors of Titanic and Star Wars for predicting ratings. Next, the factor vectors of Star Trek are likely to be similar to those of Star Wars because Bob has similar interactions for both movies for predicting y. In total, this means that the dot product (i.e. the interaction) of the factor vectors of Alice and Star Trek will be similar to the one of Alice and Star Wars, which also makes sense intuitively.

4) Computation: Next, we show how to make FMs applicable from a computational point of view. The complexity of a straightforward computation of eq. (1) is O(k n^2) because all pairwise interactions have to be computed. But with a reformulation it drops to linear runtime.

Lemma 3.1: The model equation of a factorization machine (eq. (1)) can be computed in linear time O(kn).

Proof: Due to the factorization of the pairwise interactions, there is no model parameter that directly depends on two variables (e.g. a parameter with an index (i, j)). So the pairwise interactions can be reformulated:

\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j
  = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \langle v_i, v_j \rangle x_i x_j - \frac{1}{2} \sum_{i=1}^{n} \langle v_i, v_i \rangle x_i x_i
  = \frac{1}{2} \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{f=1}^{k} v_{i,f} v_{j,f} x_i x_j - \sum_{i=1}^{n} \sum_{f=1}^{k} v_{i,f} v_{i,f} x_i x_i \right)
  = \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i=1}^{n} v_{i,f} x_i \right) \left( \sum_{j=1}^{n} v_{j,f} x_j \right) - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right)
  = \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right)

This equation has only linear complexity in both k and n, i.e. its computation is in O(kn).

Moreover, under sparsity most of the elements in x are 0 (i.e. m(x) is small) and thus the sums only have to be computed over the non-zero elements. Thus in sparse applications, the computation of the factorization machine is in O(k \bar{m}_D), e.g. \bar{m}_D = 2 for typical recommender system data like in MF approaches (see Section V-A).
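To make the reformulation of Lemma 3.1 concrete, here is a small numpy sketch (illustrative only, not LIBFM; the function names fm_predict_naive and fm_predict_linear and the dense array layout are assumptions) that computes eq. (1) both directly in O(k n^2) and via the O(kn) form, so the two can be checked against each other.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Eq. (1) computed directly: O(k n^2) pairwise interactions."""
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y

def fm_predict_linear(x, w0, w, V):
    """Eq. (1) via Lemma 3.1: O(k n), and O(k m(x)) on sparse x."""
    vx = V.T @ x                        # sum_i v_{i,f} x_i, one value per factor f
    sq = (V ** 2).T @ (x ** 2)          # sum_i v_{i,f}^2 x_i^2
    return w0 + w @ x + 0.5 * np.sum(vx ** 2 - sq)

rng = np.random.default_rng(0)
n, k = 12, 4
x = rng.random(n)
x[rng.random(n) < 0.7] = 0.0            # make the input sparse
w0, w, V = 0.1, rng.normal(size=n), 0.01 * rng.normal(size=(n, k))
assert np.isclose(fm_predict_naive(x, w0, w, V), fm_predict_linear(x, w0, w, V))
```

On sparse data the products V.T @ x and (V ** 2).T @ (x ** 2) only need the non-zero entries of x, which is where the O(k m(x)) cost comes from.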
B. Factorization Machines as Predictors

FMs can be applied to a variety of prediction tasks. Among them are:
- Regression: \hat{y}(x) can be used directly as the predictor and the optimization criterion is e.g. the minimal least square error on D.
- Binary classification: the sign of \hat{y}(x) is used and the parameters are optimized for hinge loss or logit loss.
- Ranking: the vectors x are ordered by the score of \hat{y}(x) and optimization is done over pairs of instance vectors (x^(a), x^(b)) in D with a pairwise classification loss (e.g. like in [5]).

In all these cases, regularization terms like L2 are usually added to the optimization objective to prevent overfitting.

C. Learning Factorization Machines

As we have shown, FMs have a closed model equation that can be computed in linear time. Thus, the model parameters (w_0, w and V) of FMs can be learned efficiently by gradient descent methods, e.g. stochastic gradient descent (SGD), for a variety of losses, among them square, logit or hinge loss. The gradient of the FM model is:

\frac{\partial}{\partial \theta} \hat{y}(x) =
\begin{cases}
1, & \text{if } \theta \text{ is } w_0 \\
x_i, & \text{if } \theta \text{ is } w_i \\
x_i \sum_{j=1}^{n} v_{j,f} x_j - v_{i,f} x_i^2, & \text{if } \theta \text{ is } v_{i,f}
\end{cases}    (4)

The sum \sum_{j=1}^{n} v_{j,f} x_j is independent of i and thus can be precomputed (e.g. when computing \hat{y}(x)). In general, each gradient can then be computed in constant time O(1), and all parameter updates for a case (x, y) can be done in O(kn), or O(k m(x)) under sparsity. We provide a generic implementation, LIBFM^2, that uses SGD and supports both element-wise and pairwise losses.

^2 http://www.libfm.org

D. d-way Factorization Machine

The 2-way FM described so far can easily be generalized to a d-way FM:

\hat{y}(x) := w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{l=2}^{d} \sum_{i_1=1}^{n} \cdots \sum_{i_l = i_{l-1}+1}^{n} \left( \prod_{j=1}^{l} x_{i_j} \right) \left( \sum_{f=1}^{k_l} \prod_{j=1}^{l} v^{(l)}_{i_j, f} \right)    (5)

where the interaction parameters for the l-th interaction are factorized by the PARAFAC model [1] with the model parameters:

V^{(l)} \in R^{n \times k_l},  k_l \in N_0^+    (6)

The straightforward complexity for computing eq. (5) is O(k_d n^d). But with the same arguments as in Lemma 3.1, one can show that it can be computed in linear time.

E. Summary

FMs model all possible interactions between values in the feature vector x using factorized interactions instead of fully parametrized ones. This has two main advantages:
1) The interactions between values can be estimated even under high sparsity. Especially, it is possible to generalize to unobserved interactions.
2) The number of parameters as well as the time for prediction and learning is linear. This makes direct optimization using SGD feasible and allows optimizing against a variety of loss functions.

In the remainder of this paper, we will show the relationships between factorization machines and support vector machines as well as matrix, tensor and specialized factorization models.
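Before turning to those comparisons, the learning procedure of Section III-C can be sketched concretely. The following is a minimal SGD step for squared loss with L2 regularization, assuming dense numpy arrays and a single shared regularization constant; the name fm_sgd_step and these simplifications are illustrative and do not describe LIBFM's actual implementation.

```python
import numpy as np

def fm_sgd_step(x, y, w0, w, V, lr=0.01, reg=0.001):
    """One SGD update for squared loss; parameters: w0 scalar, w (n,), V (n, k)."""
    vx = V.T @ x                                    # precompute sum_j v_{j,f} x_j (eq. (4))
    y_hat = w0 + w @ x + 0.5 * np.sum(vx ** 2 - (V ** 2).T @ (x ** 2))
    err = y_hat - y                                 # d/d y_hat of 0.5 * (y_hat - y)^2
    w0 -= lr * err                                  # gradient w.r.t. w0 is 1
    nz = np.nonzero(x)[0]                           # only non-zero features contribute
    w[nz] -= lr * (err * x[nz] + reg * w[nz])       # gradient w.r.t. w_i is x_i
    for i in nz:
        grad_vi = x[i] * vx - V[i] * x[i] ** 2      # x_i * sum_j v_{j,f} x_j - v_{i,f} x_i^2
        V[i] -= lr * (err * grad_vi + reg * V[i])
    return w0, w, V
```

Restricting the update loop to the non-zero indices of x is what gives the O(k m(x)) per-example cost stated above.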
IV. FMs VS. SVMs

A. SVM model

The model equation of an SVM [6] can be expressed as the dot product between the transformed input x and model parameters w: \hat{y}(x) = \langle \phi(x), w \rangle, where \phi is a mapping from the feature space R^n into a more complex space F. The mapping \phi is related to the kernel by:

K: R^n \times R^n \to R,  K(x, z) = \langle \phi(x), \phi(z) \rangle

In the following, we discuss the relationships of FMs and SVMs by analyzing the primal form of the SVMs.^3

^3 In practice, SVMs are solved in the dual form and the mapping \phi is not performed explicitly. Nevertheless, the primal and dual have the same solution (optimum), so all our arguments about the primal also hold for the dual form.

1) Linear kernel: The most simple kernel is the linear kernel: K_l(x, z) := 1 + \langle x, z \rangle, which corresponds to the mapping \phi(x) := (1, x_1, ..., x_n). And thus the model equation of a linear SVM can be rewritten as:

\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i,   w_0 \in R, w \in R^n    (7)

It is obvious that a linear SVM (eq. (7)) is identical to a FM of degree d = 1 (eq. (5)).

2) Polynomial kernel: The polynomial kernel allows the SVM to model higher interactions between variables. It is defined as K(x, z) := (\langle x, z \rangle + 1)^d. E.g. for d = 2 this corresponds to the following mapping:

\phi(x) := (1, \sqrt{2} x_1, ..., \sqrt{2} x_n, x_1^2, ..., x_n^2, \sqrt{2} x_1 x_2, ..., \sqrt{2} x_1 x_n, \sqrt{2} x_2 x_3, ..., \sqrt{2} x_{n-1} x_n)    (8)

And so, the model equation for polynomial SVMs can be rewritten as:

\hat{y}(x) = w_0 + \sqrt{2} \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} w^{(2)}_{i,i} x_i^2 + \sqrt{2} \sum_{i=1}^{n} \sum_{j=i+1}^{n} w^{(2)}_{i,j} x_i x_j    (9)

where the model parameters are: w_0 \in R, w \in R^n, W^{(2)} \in R^{n \times n} (symmetric matrix).

Comparing a polynomial SVM (eq. (9)) to a FM (eq. (1)), one can see that both model all nested interactions up to degree d = 2. The main difference between SVMs and FMs is the parametrization: all interaction parameters w_{i,j} of SVMs are completely independent, e.g. w_{i,j} and w_{i,l}. In contrast to this, the interaction parameters of FMs are factorized and thus \langle v_i, v_j \rangle and \langle v_i, v_l \rangle depend on each other as they overlap and share parameters (here v_i).

Fig. 2. FMs succeed in estimating 2-way variable interactions in very sparse problems where SVMs fail (see Sections III-A3 and IV-B for details). [Plot: Netflix rating prediction error for a Factorization Machine and a Support Vector Machine.]
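The parametrization difference just described can be made tangible with a few lines of numpy. The snippet below is an illustrative sketch (the sizes n and k are made up, not from the paper): it builds the full interaction matrix W = V V^t implied by a FM, in which every interaction involving feature i shares the row v_i, and contrasts the number of free interaction parameters with the dense symmetric W^(2) of a degree-2 polynomial SVM.

```python
import numpy as np

n, k = 1000, 8                          # illustrative sizes, not from the paper
rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(n, k))

# FM: every pairwise interaction weight is <v_i, v_j>, so all interactions
# involving feature i share the same row v_i of V.
W_fm = V @ V.T                          # implied interaction matrix, rank <= k

fm_interaction_params = n * k                    # free entries of V
svm_interaction_params = n * (n + 1) // 2        # independent entries of symmetric W^(2)
print(fm_interaction_params, svm_interaction_params)   # 8000 vs 500500
```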
B. Parameter Estimation Under Sparsity

In the following, we will show why linear and polynomial SVMs fail for very sparse problems. We show this for the example of collaborative filtering with user and item indicator variables (see the first two groups (blue and red) in the example of Figure 1). Here, the feature vectors are sparse and only two elements are non-zero (the active user u and the active item i).

1) Linear SVM: For this kind of data x, the linear SVM model (eq. (7)) is equivalent to:

\hat{y}(x) = w_0 + w_u + w_i    (10)

because x_j = 1 if and only if j = u or j = i. This model corresponds to one of the most basic collaborative filtering models where only the user and item biases are captured. As this model is very simple, the parameters can be estimated well even under sparsity. However, the empirical prediction quality typically is low (see Figure 2).

2) Polynomial SVM: With the polynomial kernel, the SVM can capture higher-order interactions (here between users and items). In our sparse case with m(x) = 2, the model equation for SVMs is equivalent to:

\hat{y}(x) = w_0 + \sqrt{2} (w_u + w_i) + w^{(2)}_{u,u} + w^{(2)}_{i,i} + \sqrt{2} w^{(2)}_{u,i}

First of all, w_u and w^{(2)}_{u,u} express the same, i.e. one can drop one of them (e.g. w^{(2)}_{u,u}). Now the model equation is the same as for the linear case but with an additional user-item interaction w^{(2)}_{u,i}. In typical collaborative filtering (CF) problems, for each interaction parameter w^{(2)}_{u,i} there is at most one observation (u, i) in the training data, and for cases (u', i') in the test data there are usually no observations at all in the training data. For example, in Figure 1 there is just one observation for the interaction (Alice, Titanic) and none for the interaction (Alice, Star Trek). That means the maximum margin solution for the interaction parameters w^{(2)}_{u,i} of all test cases (u, i) is 0 (e.g. w^{(2)}_{A,ST} = 0). And thus the polynomial SVM can make no use of any 2-way interaction for predicting test examples; so the polynomial SVM only relies on the user and item biases and cannot provide better estimations than a linear SVM.

For SVMs, estimating higher-order interactions is not only an issue in CF but in all scenarios where the data is hugely sparse. For a reliable estimate of the parameter w^{(2)}_{i,j} of a pairwise interaction (i, j), there must be 'enough' cases x in D where x_i != 0 and x_j != 0. As soon as either x_i = 0 or x_j = 0, the case x cannot be used for estimating the parameter w^{(2)}_{i,j}. To summarize, if the data is too sparse, i.e. there are too few or even no cases for (i, j), SVMs are likely to fail.

C. Summary

1) The dense parametrization of SVMs requires direct observations for the interactions, which are often not given in sparse settings. Parameters of FMs can be estimated well even under sparsity (see Section III-A3).
2) FMs can be learned directly in the primal. Non-linear SVMs are usually learned in the dual.
3) The model equation of FMs is independent of the training data. Prediction with SVMs depends on parts of the training data (the support vectors).

V. FMs VS. OTHER FACTORIZATION MODELS

There is a variety of factorization models, ranging from standard models for m-ary relations over categorical variables (e.g. MF, PARAFAC) to specialized models for specific data and tasks (e.g. SVD++, PITF, FPMC). Next, we show that FMs can mimic many of these models just by using the right input data (i.e. feature vectors x).

A. Matrix and Tensor Factorization

Matrix factorization (MF) is one of the most studied factorization models (e.g. [7], [8], [2]). It factorizes a relationship between two categorical variables (e.g. U and I). The standard approach to deal with categorical variables is to define binary indicator variables for each level of U and I (e.g. see Fig. 1, first (blue) and second (red) group)^4:

n := |U \cup I|,   x_j := \delta(j = i \lor j = u)    (11)

^4 To shorten notation, we address elements in x (e.g. x_j) and the parameters both by numbers (e.g. j in {1, ..., n}) and by categorical levels (e.g. j in (U \cup I)). That means we implicitly assume a bijective mapping from numbers to categorical levels.

A FM using this feature vector x is identical to the matrix factorization model [2] because x_j is only non-zero for u and i, so all other biases and interactions drop:

\hat{y}(x) = w_0 + w_u + w_i + \langle v_u, v_i \rangle    (12)

With the same argument, one can see that for problems with more than two categorical variables, FMs include a nested parallel factor analysis model (PARAFAC) [1].
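Eq. (11) and (12) say that biased MF falls out of a FM once the feature vector contains exactly one user indicator and one item indicator. A small numpy sketch (illustrative only; the index layout with the user block before the item block and the helper name fm_predict are assumptions) makes this reduction explicit by checking that the generic FM prediction from Section III equals w_0 + w_u + w_i + <v_u, v_i> on such an input.

```python
import numpy as np

n_users, n_items, k = 4, 5, 3
n = n_users + n_items                    # feature space: user block, then item block
rng = np.random.default_rng(1)
w0, w, V = 0.2, rng.normal(size=n), rng.normal(size=(n, k))

def fm_predict(x, w0, w, V):
    """FM model equation (1), computed via the linear-time form of Lemma 3.1."""
    vx = V.T @ x
    return w0 + w @ x + 0.5 * np.sum(vx ** 2 - (V ** 2).T @ (x ** 2))

u, i = 2, 1                              # active user u and active item i
x = np.zeros(n)
x[u] = 1.0                               # user indicator (blue block)
x[n_users + i] = 1.0                     # item indicator (red block)

mf = w0 + w[u] + w[n_users + i] + V[u] @ V[n_users + i]   # eq. (12): biased MF
assert np.isclose(fm_predict(x, w0, w, V), mf)
```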
B. SVD++

For the task of rating prediction (i.e. regression), Koren improves the matrix factorization model to the SVD++ model [2]. A FM can mimic this model by using the following input data x (like in the first three groups of Figure 1):

n := |U \cup I \cup L|,   x_j := \begin{cases} 1, & \text{if } j = i \lor j = u \\ \frac{1}{\sqrt{|N_u|}}, & \text{if } j \in N_u \\ 0, & \text{else} \end{cases}

where N_u is the set of all movies the user has ever rated.^5 A FM (d = 2) would behave as follows using this data:

\hat{y}(x) = \underbrace{w_0 + w_u + w_i + \langle v_u, v_i \rangle + \frac{1}{\sqrt{|N_u|}} \sum_{l \in N_u} \langle v_i, v_l \rangle}_{\text{SVD++}}
  + \frac{1}{\sqrt{|N_u|}} \sum_{l \in N_u} \left( w_l + \langle v_u, v_l \rangle + \frac{1}{\sqrt{|N_u|}} \sum_{l' \in N_u, \, l' > l} \langle v_l, v_{l'} \rangle \right)

where the first part is exactly the same as the SVD++ model. But the FM also contains some additional interactions between the user and the movies in N_u, as well as basic effects for the movies in N_u and interactions between pairs of movies in N_u.

^5 To distinguish elements in N_u from elements in I, they are transformed with any bijective function \omega: I \to L into a space L with L \cap I = \emptyset.

C. PITF for Tag Recommendation

The problem of tag prediction is defined as ranking tags for a given user and item combination. That means there are three categorical domains involved: users U, items I and tags T. In the ECML/PKDD Discovery Challenge about tag recommendation, a model based on factorizing pairwise interactions (PITF) achieved the best score [3]. We will show how a FM can mimic this model. A factorization machine with binary indicator variables for the active user u, item i and tag t results in the following model:

n := |U \cup I \cup T|,   x_j := \delta(j = i \lor j = u \lor j = t)    (13)

\Rightarrow \hat{y}(x) = w_0 + w_u + w_i + w_t + \langle v_u, v_i \rangle + \langle v_u, v_t \rangle + \langle v_i, v_t \rangle

As this model is used for ranking between two tags t_A, t_B within the same user/item combination (u, i) [3], both the optimization and the prediction always work on differences between scores for the cases (u, i, t_A) and (u, i, t_B). Thus, with optimization for pairwise ranking (like in [5], [3]), the FM model is equivalent to:

\hat{y}(x) := w_t + \langle v_u, v_t \rangle + \langle v_i, v_t \rangle    (14)

Now the original PITF model [3] and the FM model with binary indicators (eq. (14)) are almost identical. The only differences are that (i) the FM model has a bias term w_t for t and (ii) the factorization parameters for the tags (v_t) between the (u, t)- and (i, t)-interaction are shared in the FM model but individual in the original PITF model. Besides this theoretical analysis, Figure 3 shows empirically that both models also achieve comparable prediction quality for this task.

Fig. 3. Recommendation quality of a FM compared to the winning PITF model [3] of the ECML/PKDD Discovery Challenge 2009. [Plot: prediction quality on Task 2 of the challenge against the number of model parameters, for the Factorization Machine and Pairwise Interaction TF (PITF).]

D. Factorized Personalized Markov Chains (FPMC)

The FPMC model [4] tries to rank products in an online shop based on the last purchases (at time t - 1) of the user u. Again, just by feature generation a factorization machine (d = 2) behaves similarly:

n := |U \cup I \cup L|,   x_j := \begin{cases} 1, & \text{if } j = i \lor j = u \\ \frac{1}{|B^u_{t-1}|}, & \text{if } j \in B^u_{t-1} \\ 0, & \text{else} \end{cases}    (15)

where B^u_t \subseteq L is the set ('basket') of all items a user u has purchased at time t (for details see [4]). Then:

\hat{y}(x) = w_0 + w_u + w_i + \langle v_u, v_i \rangle + \frac{1}{|B^u_{t-1}|} \sum_{l \in B^u_{t-1}} \langle v_i, v_l \rangle
  + \frac{1}{|B^u_{t-1}|} \sum_{l \in B^u_{t-1}} \left( w_l + \langle v_u, v_l \rangle + \frac{1}{|B^u_{t-1}|} \sum_{l' \in B^u_{t-1}, \, l' > l} \langle v_l, v_{l'} \rangle \right)

Like for tag recommendation, this model is used and optimized for ranking (here ranking items i) and thus only score differences between (u, i_A, t) and (u, i_B, t) are used in the prediction and optimization criterion [4]. Thus, all additive terms that do not depend on i vanish and the FM model equation is equivalent to:

\hat{y}(x) = w_i + \langle v_u, v_i \rangle + \frac{1}{|B^u_{t-1}|} \sum_{l \in B^u_{t-1}} \langle v_i, v_l \rangle    (16)

Now one can see that the original FPMC model [4] and the FM model are almost identical and differ only in the additional item bias w_i and the sharing of the factorization parameters of the FM model for the items in both the (u, i)- and (i, l)-interaction.
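The mimicry in Sections V-B to V-D is again pure feature engineering. As an illustrative sketch (the helper name svdpp_style_features and the block layout are assumptions, not from the paper), the following builds the SVD++-style input of Section V-B: one user indicator, one item indicator, and a block over all movies in N_u normalized by 1/sqrt(|N_u|); the FPMC-style input of eq. (15) is obtained the same way with the basket B^u_{t-1} and a 1/|B^u_{t-1}| normalization.

```python
import numpy as np

def svdpp_style_features(u, i, rated_by, n_users, n_items):
    """Feature vector of Section V-B: user block | item block | implicit block.
    rated_by[u] is N_u, the set of all movies user u has ever rated."""
    x = np.zeros(n_users + 2 * n_items)
    x[u] = 1.0                                   # active user indicator
    x[n_users + i] = 1.0                         # active item indicator
    norm = 1.0 / np.sqrt(len(rated_by[u]))       # 1 / sqrt(|N_u|)
    for l in rated_by[u]:
        x[n_users + n_items + l] = norm          # implicit-feedback block (space L)
    return x

# Example 1: Alice (user 0) has rated Titanic (0), Notting Hill (1) and Star Wars (2);
# build the vector for predicting her rating of Star Trek (item 3).
rated_by = {0: [0, 1, 2]}
x = svdpp_style_features(u=0, i=3, rated_by=rated_by, n_users=3, n_items=5)
```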
E. Summary

1) Standard factorization models like PARAFAC or MF are not general prediction models like factorization machines. Instead, they require that the feature vector is partitioned into m parts and that in each part exactly one element is 1 and the rest are 0.
2) There are many proposals for specialized factorization models designed for a single task. We have shown that factorization machines can mimic many of the most successful factorization models (including MF, PARAFAC, SVD++, PITF, FPMC) just by feature extraction, which makes FMs easily applicable in practice.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have introduced factorization machines. FMs bring together the generality of SVMs with the benefits of factorization models. In contrast to SVMs, (1) FMs are able to estimate parameters under huge sparsity, (2) the model equation is linear and depends only on the model parameters and thus (3) they can be optimized directly in the primal. The expressiveness of FMs is comparable to the one of polynomial SVMs. In contrast to tensor factorization models like PARAFAC, FMs are a general predictor that can handle any real valued vector. Moreover, simply by using the right indicators in the input feature vector, FMs are identical or very similar to many of the specialized state-of-the-art models that are applicable only to a specific task, among them biased MF, SVD++, PITF and FPMC.

REFERENCES

[1] R. A. Harshman, "Foundations of the PARAFAC procedure: models and conditions for an 'exploratory' multimodal factor analysis," UCLA Working Papers in Phonetics, pp. 1-84, 1970.
[2] Y. Koren, "Factorization meets the neighborhood: a multifaceted collaborative filtering model," in KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2008, pp. 426-434.
[3] S. Rendle and L. Schmidt-Thieme, "Pairwise interaction tensor factorization for personalized tag recommendation," in WSDM '10: Proceedings of the Third ACM International Conference on Web Search and Data Mining. New York, NY, USA: ACM, 2010, pp. 81-90.
[4] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme, "Factorizing personalized Markov chains for next-basket recommendation," in WWW '10: Proceedings of the 19th International Conference on World Wide Web. New York, NY, USA: ACM, 2010, pp. 811-820.
[5] T. Joachims, "Optimizing search engines using clickthrough data," in KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2002, pp. 133-142.
[6] V. N. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag New York, Inc., 1995.
[7] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola, "Maximum-margin matrix factorization," in Advances in Neural Information Processing Systems 17. MIT Press, 2005, pp. 1329-1336.
[8] R. Salakhutdinov and A. Mnih, "Bayesian probabilistic matrix factorization using Markov chain Monte Carlo," in Proceedings of the International Conference on Machine Learning, vol. 25, 2008.