ABSTRACT
Online advertising allows advertisers to only bid and pay for measurable user responses, such as clicks on ads. As a consequence, click prediction systems are central to most online advertising systems. With over 750 million daily active users …
classifiers and diverse online learning algorithms. In the context of linear classification we go on to evaluate the impact of feature transforms and data freshness. Inspired by the practical lessons learned, particularly around data freshness and online learning, we present a model architecture that incorporates an online learning layer, whilst producing fairly compact models. Section 4 describes a key component required for the online learning layer, the online joiner, an experimental piece of infrastructure that can generate a live stream of real-time training data. Lastly we present ways to trade accuracy for memory and compute time and to cope with massive amounts of training data. In Section 5 we describe practical ways to keep memory and latency contained for massive scale applications, and in Section 6 we delve into the tradeoff between training data volume and accuracy.

2. EXPERIMENTAL SETUP
In order to achieve rigorous and controlled experiments, we prepared offline training data by selecting an arbitrary week of the 4th quarter of 2013. In order to maintain the same training and testing data under different conditions, we prepared offline training data which is similar to that observed online. We partition the stored offline data into training and testing and use them to simulate the streaming data for online training and prediction. The same training/testing data are used as testbed for all the experiments in the paper.

Evaluation metrics: Since we are most concerned with the impact of the factors to the machine learning model, we use the accuracy of prediction instead of metrics directly related to profit and revenue. In this work, we use Normalized Entropy (NE) and calibration as our major evaluation metrics.

Normalized Entropy, or more accurately, Normalized Cross-Entropy, is equivalent to the average log loss per impression divided by what the average log loss per impression would be if a model predicted the background click-through rate (CTR) for every impression. In other words, it is the predictive log loss normalized by the entropy of the background CTR. The background CTR is the average empirical CTR of the training data set. It would perhaps be more descriptive to refer to the metric as the Normalized Logarithmic Loss. The lower the value is, the better is the prediction made by the model. The reason for this normalization is that the closer the background CTR is to either 0 or 1, the easier it is to achieve a better log loss. Dividing by the entropy of the background CTR makes the NE insensitive to the background CTR. Assume a given training data set has N examples with labels y_i \in \{-1, +1\} and estimated probability of click p_i, where i = 1, 2, ..., N. Denote the average empirical CTR as p. Then

NE = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1+y_i}{2}\log(p_i) + \frac{1-y_i}{2}\log(1-p_i)\right)}{-\left(p\log(p) + (1-p)\log(1-p)\right)}    (1)

NE is essentially a component in calculating Relative Information Gain (RIG), and RIG = 1 - NE.

Figure 1: Hybrid model structure. Input features are transformed by means of boosted decision trees. The output of each individual tree is treated as a categorical input feature to a sparse linear classifier. Boosted decision trees prove to be very powerful feature transforms.

Calibration is the ratio of the average estimated CTR and empirical CTR. In other words, it is the ratio of the number of expected clicks to the number of actually observed clicks. Calibration is a very important metric since accurate and well-calibrated prediction of CTR is essential to the success of online bidding and auction. The less the calibration differs from 1, the better the model is. We only report calibration in the experiments where it is non-trivial.

Note that Area-Under-ROC (AUC) is also a pretty good metric for measuring ranking quality without considering calibration. In a realistic environment, we expect the prediction to be accurate instead of merely getting the optimal ranking order, to avoid potential under-delivery or over-delivery. NE measures the goodness of predictions and implicitly reflects calibration. For example, if a model over-predicts by 2x and we apply a global multiplier of 0.5 to fix the calibration, the corresponding NE will also be improved even though AUC remains the same. See [12] for an in-depth study on these metrics.

3. PREDICTION MODEL STRUCTURE
In this section we present a hybrid model structure: the concatenation of boosted decision trees and of a probabilistic sparse linear classifier, illustrated in Figure 1. In Section 3.1 we show that decision trees are very powerful input feature transformations that significantly increase the accuracy of probabilistic linear classifiers. In Section 3.2 we show how fresher training data leads to more accurate predictions. This motivates the idea to use an online learning method to train the linear classifier. In Section 3.3 we compare a number of online learning variants for two families of probabilistic linear classifiers.

The online learning schemes we evaluate are based on the SGD algorithm.

Table 1: Logistic Regression (LR) and boosted decision trees (Trees) make a powerful combination. We evaluate them by their Normalized Entropy (NE) relative to that of the Trees only model.

  Model Structure | NE (relative to Trees only)
  LR + Trees      | 96.58%
  LR only         | 99.43%
  Trees only      | 100% (reference)
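To make these metric definitions concrete, here is a minimal Python sketch (our own illustrative code with toy numbers, not from the paper) that computes Normalized Entropy as in Equation (1) and calibration for a small batch of predictions:

```python
import math

def normalized_entropy(labels, preds):
    """NE per Eq. (1): average log loss per impression divided by the
    entropy of the background CTR. Labels are in {-1, +1}."""
    n = len(labels)
    # Average log loss per impression.
    ll = -sum((1 + y) / 2 * math.log(p) + (1 - y) / 2 * math.log(1 - p)
              for y, p in zip(labels, preds)) / n
    # Background CTR: the average empirical CTR of the data set.
    ctr = sum(1 for y in labels if y == 1) / n
    background = -(ctr * math.log(ctr) + (1 - ctr) * math.log(1 - ctr))
    return ll / background

def calibration(labels, preds):
    """Ratio of expected clicks (sum of predictions) to observed clicks."""
    expected = sum(preds)
    observed = sum(1 for y in labels if y == 1)
    return expected / observed

labels = [+1, -1, -1, -1]       # one click in four impressions
preds = [0.6, 0.2, 0.1, 0.3]    # estimated click probabilities

print(normalized_entropy(labels, preds))  # < 1: better than background CTR
print(calibration(labels, preds))         # 1.2: over-predicting by 20%
```

Note that RIG follows immediately as `1 - normalized_entropy(labels, preds)`.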
Figure 2: Prediction accuracy as a function of the delay between training and test set in days. Accuracy is expressed as Normalized Entropy relative to the worst result, obtained for the trees-only model with a delay of 6 days.

Table 1 shows that the LR and Tree models used in isolation have comparable prediction accuracy (LR is a bit better), but that it is their combination that yields an accuracy leap. The gain in prediction accuracy is significant; for reference, the majority of feature engineering experiments only manage to decrease Normalized Entropy by a fraction of a percentage.

3.2 Data freshness
Click prediction systems are often deployed in dynamic environments where the data distribution changes over time. We study the effect of training data freshness on predictive performance. To do this we train a model on one particular day and test it on consecutive days. We run these experiments both for a boosted decision tree model, and for a logistic regression model with tree-transformed input features.

In this experiment we train on one day of data, evaluate on the six consecutive days, and compute the normalized entropy on each. The results are shown in Figure 2. Prediction accuracy clearly degrades for both models as the delay between training and test set increases. For both models it can be seen that NE can be reduced by approximately 1% by going from training weekly to training daily.

These findings indicate that it is worth retraining on a daily basis. One option would be to have a recurring daily job that retrains the models, possibly in batch. The time needed to retrain boosted decision trees varies, depending on factors such as number of examples for training, number of trees, number of leaves in each tree, CPU, memory, etc. It may take more than 24 hours to build a boosting model with hundreds of trees from hundreds of millions of instances with a single-core CPU. In a practical case, the training can be done within a few hours via sufficient concurrency in a multi-core machine with a large amount of memory for holding the whole training set. In the next section we consider an alternative. The boosted decision trees can be trained daily or every couple of days, but the linear classifier can be trained in near real-time by using some flavor of online learning.

3.3 Online linear classifier
In order to maximize data freshness, one option is to train the linear classifier online, that is, directly as the labelled ad impressions arrive. In the upcoming Section 4 we describe a piece of infrastructure that could generate real-time training data. In this section we evaluate several ways of setting learning rates for SGD-based online learning for logistic regression. We then compare the best variant to online learning for the BOPR model.

In terms of (6), we explore the following choices:

1. Per-coordinate learning rate: The learning rate for feature i at iteration t is set to
   \eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{j=1}^{t} \nabla_{j,i}^{2}}}.
   Here \alpha and \beta are two tunable parameters (proposed in [8]).
2. Per-weight square root learning rate:
   \eta_{t,i} = \frac{\alpha}{\sqrt{n_{t,i}}},
   where n_{t,i} is the total number of training instances with feature i up to iteration t.
3. Per-weight learning rate:
   \eta_{t,i} = \frac{\alpha}{n_{t,i}}.
4. Global learning rate:
   \eta_{t,i} = \frac{\alpha}{\sqrt{t}}.
5. Constant learning rate:
   \eta_{t,i} = \alpha.

The first three schemes set learning rates individually per feature. The last two use the same rate for all features. All the tunable parameters are optimized by grid search (optima detailed in Table 2). We lower bound the learning rates by 0.00001 for continuous learning. We train and test LR models on the same data with the above learning rate schemes. The experiment results are shown in Figure 3. From the above result, SGD with per-coordinate learning rate achieves the best prediction accuracy, with a NE almost 5% lower than when using the per-weight learning rate, …

… the cumulative loss reduction attributable to a feature. In each tree node construction, a best feature is selected and split to maximize the squared error reduction. Since a feature can be used in multiple trees, the Boosting Feature Importance for each feature is determined by summing the total reduction for a specific feature across all trees. Typically, a small number of features contributes the majority of explanatory power while the remaining features have only a marginal contribution. We see this same pattern when plotting the number of features versus their cumulative feature importance in Figure 6.
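As an illustration of scheme 1 above, here is a minimal sketch (our own toy code and data, not the paper's implementation) of SGD for logistic regression with a per-coordinate learning rate: each weight accumulates the sum of its squared gradients and scales its own step by it.

```python
import math
from collections import defaultdict

def train_per_coordinate(examples, alpha=0.1, beta=1.0, epochs=5):
    """SGD for logistic regression with per-coordinate learning rates:
    eta_{t,i} = alpha / (beta + sqrt(sum of squared gradients of coordinate i))."""
    w = defaultdict(float)   # sparse weight vector
    g2 = defaultdict(float)  # per-coordinate sum of squared gradients
    for _ in range(epochs):
        for x, y in examples:  # x: dict feature -> value, y in {-1, +1}
            z = sum(w[i] * v for i, v in x.items())
            p = 1.0 / (1.0 + math.exp(-z))   # predicted click probability
            err = p - (1 if y == 1 else 0)   # gradient of log loss w.r.t. z
            for i, v in x.items():
                g = err * v
                g2[i] += g * g
                eta = alpha / (beta + math.sqrt(g2[i]))
                w[i] -= eta * g
    return w

# Toy data: feature "a" co-occurs with clicks, "b" with non-clicks.
data = [({"a": 1.0}, +1), ({"b": 1.0}, -1)] * 20
w = train_per_coordinate(data)
print(w["a"] > 0 and w["b"] < 0)  # learned weights separate the two features
```

A frequently updated coordinate accumulates a large squared-gradient sum and so takes smaller steps, while a rare feature keeps a larger effective learning rate, which is the point of the per-coordinate scheme.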
Figure 6: Boosting feature importance. The X-axis corresponds to the number of features. We draw feature importance in log scale on the left-hand side primary y-axis, while the cumulative feature importance is shown on the right-hand side secondary y-axis.

From the above result, we can see that the top 10 features are responsible for about half of the total feature importance, while the last 300 features contribute less than 1% of feature importance. Based on this finding, we further experiment with keeping only the top 10, 20, 50, 100 and 200 features, and evaluate how the performance is affected. The result of the experiment is shown in Figure 7. From the figure, we can see that the normalized entropy has a similar diminishing-return property as we include more features.

In the following, we will do some study on the usefulness of historical and contextual features. Due to the sensitive nature of the data and the company policy, we are not able to reveal the details of the actual features we use. Some example contextual features can be local time of day, day of week, etc. Historical features can be the cumulative number of clicks on an ad, etc.

5.3 Historical features
The features used in the Boosting model can be categorized into two types: contextual features and historical features. The value of contextual features depends exclusively on current information regarding the context in which an ad is to be shown, such as the device used by the user or the current page that the user is on. On the contrary, the historical features depend on previous interactions for the ad or user, for example the click-through rate of the ad in the last week, or the average click-through rate of the user.
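The feature-selection procedure sketched above — sum each feature's loss reduction across trees, sort, and keep the top k — can be written compactly. The following is illustrative only; the importance scores are made up to mimic the skewed pattern of Figure 6:

```python
def top_k_by_importance(importance, k):
    """Sort features by (summed-across-trees) importance and keep the top k."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def cumulative_share(importance):
    """Cumulative fraction of total importance covered by the top-k features."""
    scores = sorted(importance.values(), reverse=True)
    total = sum(scores)
    shares, running = [], 0.0
    for s in scores:
        running += s
        shares.append(running / total)
    return shares

# Made-up scores: a few features dominate, the tail is marginal.
imp = {"f1": 50.0, "f2": 25.0, "f3": 12.0, "f4": 6.0, "f5": 4.0, "f6": 3.0}
print(top_k_by_importance(imp, 2))  # ['f1', 'f2']
print(cumulative_share(imp)[1])     # 0.75: top 2 features cover 75%
```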
Figure 7: Results for Boosting model with top features. We draw calibration on the left-hand side primary y-axis, while the normalized entropy is shown on the right-hand side secondary y-axis.

In this part, we study how the performance of the system depends on the two types of features. First, we check the relative importance of the two types of features. We do so by sorting all features by importance, then calculating the percentage of historical features among the first k most important features. The result is shown in Figure 8.

Figure 8: Results for historical feature percentage. The X-axis corresponds to the number of features. The Y-axis gives the percentage of historical features among the top k most important features.

From the result, we can see that historical features provide considerably more explanatory power than contextual features. The top 10 features ordered by importance are all historical features. Among the top 20 features, there are only 2 contextual features, despite historical features occupying roughly 75% of the features in this data set. To better understand the comparative value of the features from each type in aggregate, we train two Boosting models with only contextual features and only historical features, then compare the two models with the complete model with all features. The result is shown in Table 4. From the table, we can again verify that in aggregate historical features play a larger role than contextual features.
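A related practical detail, discussed next, is re-calibration after negative downsampling: a model trained on data where only a fraction w of negative examples is kept predicts in the downsampled space and must be mapped back for live traffic. A minimal sketch of the re-calibration q = p / (p + (1 - p)/w), with illustrative numbers of our own:

```python
def recalibrate(p, w):
    """Map a prediction p made in the negative-downsampling space back to
    the original space; w is the negative downsampling rate."""
    return p / (p + (1.0 - p) / w)

# With a 0.01 downsampling rate, a prediction of roughly 10% in the
# sampled space maps back to roughly 0.1% in the original space.
print(recalibrate(0.10, 0.01))  # ~0.00111
print(recalibrate(0.5, 1.0))    # 0.5: no downsampling, no change
```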
Figure 11: Experiment result for negative downsampling. The X-axis corresponds to different negative downsampling rates. We draw calibration on the left-hand side primary y-axis, while the normalized entropy is shown on the right-hand side secondary y-axis.

If a model is trained on a data set with negative downsampling, it also calibrates the prediction in the downsampling space. For example, if the average CTR before sampling is 0.1% and we do a 0.01 negative downsampling, the empirical CTR will become roughly 10%. We need to re-calibrate the model for live traffic experiments and get back to the 0.1% prediction with

q = \frac{p}{p + (1-p)/w}

where p is the prediction in the downsampling space and w the negative downsampling rate.

7. DISCUSSION
We have presented some practical lessons from experimenting with Facebook ads data. This has inspired a promising hybrid model architecture for click prediction.

- Data freshness matters. It is worth retraining at least daily. In this paper we have gone further and discussed various online learning schemes. We also presented infrastructure that allows generating real-time training data.
- Transforming real-valued input features with boosted decision trees significantly increases the prediction accuracy of probabilistic linear classifiers. This motivates a hybrid model architecture that concatenates boosted decision trees and a sparse linear classifier.
- Best online learning method: LR with per-coordinate learning rate, which ends up being comparable in performance with BOPR, and performs better than all other LR SGD schemes under study. (Table 4, Fig. 12)
- We have described tricks to keep memory and latency contained in massive scale machine learning applications.
- We have presented the tradeoff between the number of boosted decision trees and accuracy. It is advantageous to keep the number of trees small to keep computation and memory contained.
- Boosted decision trees give a convenient way of doing feature selection by means of feature importance. One can aggressively reduce the number of active features whilst only moderately hurting prediction accuracy.
- We have analyzed the effect of using historical features in combination with context features. For ads and users with history, these features provide superior predictive performance compared to context features.
- Finally, we have discussed ways of subsampling the training data, both uniformly but also, more interestingly, in a biased way where only the negative examples are subsampled.

8. REFERENCES
[1] R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, and S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. In SIGMOD, 2013.
[2] L. Bottou. Online algorithms and stochastic approximations. In D. Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998. Revised, Oct. 2012.
[3] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, volume 24, 2012.
[4] B. Edelman, M. Ostrovsky, and M. Schwarz. Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. In American Economic Review, volume 97, pages 242-259, 2007.
[5] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189-1232, 2001.
[6] L. Golab and M. T. Ozsu. Processing sliding window multi-joins in continuous queries over data streams. In VLDB, pages 500-511, 2003.
[7] T. Graepel, J. Quiñonero Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In ICML, pages 13-20, 2010.
[8] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica. Ad click prediction: a view from the trenches. In KDD, 2013.
[9] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: Estimating the click-through rate for new ads. In WWW, pages 521-530, 2007.
[10] A. Thusoo, S. Antony, N. Jain, R. Murthy, Z. Shao, D. Borthakur, J. Sarma, and H. Liu. Data warehousing and analytics infrastructure at Facebook. In SIGMOD, pages 1013-1020, 2010.
[11] H. R. Varian. Position auctions. In International Journal of Industrial Organization, volume 25, pages 1163-1178, 2007.
[12] J. Yi, Y. Chen, J. Li, S. Sett, and T. W. Yan. Predictive model performance: Offline and online evaluations. In KDD, pages 1294-1302, 2013.