Practical Lessons from Predicting Clicks on Ads at Facebook

ABSTRACT: Online advertising allows advertisers to only bid and pay for measurable user responses, such as clicks on ads. As a consequence, click prediction systems are central to most online advertising systems. With over 750 million daily active users...

...classifiers and diverse online learning algorithms. In the context of linear classification we go on to evaluate the impact of feature transforms and data freshness. Inspired by the practical lessons learned, particularly around data freshness and online learning, we present a model architecture that incorporates an online learning layer, whilst producing fairly compact models. Section 4 describes a key component required for the online learning layer, the online joiner, an experimental piece of infrastructure that can generate a live stream of real-time training data. Lastly we present ways to trade accuracy for memory and compute time and to cope with massive amounts of training data. In Section 5 we describe practical ways to keep memory and latency contained for massive-scale applications, and in Section 6 we delve into the tradeoff between training data volume and accuracy.

2. EXPERIMENTAL SETUP

In order to achieve rigorous and controlled experiments, we prepared offline training data by selecting an arbitrary week of the 4th quarter of 2013. In order to maintain the same training and testing data under different conditions, we prepared offline training data which is similar to that observed online. We partition the stored offline data into training and testing sets and use them to simulate the streaming data for online training and prediction. The same training/testing data are used as the testbed for all the experiments in the paper.

Evaluation metrics: Since we are most concerned with the impact of the various factors on the machine learning model, we use the accuracy of prediction instead of metrics directly related to profit and revenue. In this work, we use Normalized Entropy (NE) and calibration as our major evaluation metrics.

Normalized Entropy, or more accurately Normalized Cross-Entropy, is equivalent to the average log loss per impression divided by what the average log loss per impression would be if a model predicted the background click-through rate (CTR) for every impression. In other words, it is the predictive log loss normalized by the entropy of the background CTR. The background CTR is the average empirical CTR of the training data set. It would perhaps be more descriptive to refer to the metric as the Normalized Logarithmic Loss. The lower the value, the better the model's prediction. The reason for this normalization is that the closer the background CTR is to either 0 or 1, the easier it is to achieve a better log loss. Dividing by the entropy of the background CTR makes the NE insensitive to the background CTR. Assume a given training data set has N examples with labels y_i \in \{-1, +1\} and estimated probability of click p_i, where i = 1, 2, ..., N, and denote the average empirical CTR by p. Then

NE = \frac{-\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1+y_i}{2}\log(p_i) + \frac{1-y_i}{2}\log(1-p_i)\right)}{-\left(p\log(p) + (1-p)\log(1-p)\right)}    (1)

NE is essentially a component in calculating Relative Information Gain (RIG), and RIG = 1 - NE.

Figure 1: Hybrid model structure. Input features are transformed by means of boosted decision trees. The output of each individual tree is treated as a categorical input feature to a sparse linear classifier. Boosted decision trees prove to be very powerful feature transforms.

Calibration is the ratio of the average estimated CTR to the empirical CTR. In other words, it is the ratio of the number of expected clicks to the number of actually observed clicks. Calibration is a very important metric since accurate and well-calibrated prediction of CTR is essential to the success of online bidding and auction. The less the calibration differs from 1, the better the model is. We only report calibration in the experiments where it is non-trivial.

Note that Area-Under-ROC (AUC) is also a pretty good metric for measuring ranking quality without considering calibration. In a realistic environment, we expect the prediction to be accurate instead of merely getting the optimal ranking order, in order to avoid potential under-delivery or over-delivery. NE measures the goodness of predictions and implicitly reflects calibration. For example, if a model over-predicts by 2x and we apply a global multiplier of 0.5 to fix the calibration, the corresponding NE will also improve even though AUC remains the same. See [12] for an in-depth study of these metrics.
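Both metrics are short to state in code. Below is a minimal NumPy sketch following the conventions above (labels in {-1, +1}, predictions as click probabilities); the function names are our own, not from the paper.

```python
import numpy as np

def normalized_entropy(y, p):
    """Normalized Cross-Entropy, Eq. (1): average log loss per impression
    divided by the entropy of the background (average empirical) CTR."""
    y, p = np.asarray(y), np.asarray(p)
    p = np.clip(p, 1e-12, 1 - 1e-12)      # guard against log(0)
    clicks = (1 + y) / 2                  # map {-1, +1} labels to {0, 1}
    ctr = clicks.mean()                   # background CTR
    avg_logloss = -np.mean(clicks * np.log(p) + (1 - clicks) * np.log(1 - p))
    background_entropy = -(ctr * np.log(ctr) + (1 - ctr) * np.log(1 - ctr))
    return avg_logloss / background_entropy

def calibration(y, p):
    """Ratio of expected clicks (sum of predictions) to observed clicks."""
    y, p = np.asarray(y), np.asarray(p)
    return p.sum() / ((1 + y) / 2).sum()
```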
3. PREDICTION MODEL STRUCTURE

In this section we present a hybrid model structure: the concatenation of boosted decision trees and of a probabilistic sparse linear classifier, illustrated in Figure 1. In Section 3.1 we show that decision trees are very powerful input feature transformations that significantly increase the accuracy of probabilistic linear classifiers. In Section 3.2 we show how fresher training data leads to more accurate predictions. This motivates the idea of using an online learning method to train the linear classifier. In Section 3.3 we compare a number of online learning variants for two families of probabilistic linear classifiers. The online learning schemes we evaluate are based on the SGD algorithm applied to sparse linear classifiers.

Table 1: Logistic Regression (LR) and boosted decision trees (Trees) make a powerful combination. We evaluate them by their Normalized Entropy (NE) relative to that of the Trees-only model.

  Model Structure | NE (relative to Trees only)
  LR + Trees      | 96.58%
  LR only         | 99.43%
  Trees only      | 100% (reference)

Figure 2: Prediction accuracy as a function of the delay between training and test set in days. Accuracy is expressed as Normalized Entropy relative to the worst result, obtained for the trees-only model with a delay of 6 days.

Table 1 shows that the LR and Tree models used in isolation have comparable prediction accuracy (LR is a bit better), but that it is their combination that yields an accuracy leap. The gain in prediction accuracy is significant; for reference, the majority of feature engineering experiments only manage to decrease Normalized Entropy by a fraction of a percent.
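The transform in Figure 1 is straightforward to prototype with off-the-shelf tools. The sketch below (our own illustration using scikit-learn, not the paper's production system) treats the index of the leaf each example reaches in each boosted tree as a categorical feature, one-hot encodes it, and feeds it to a logistic regression, mirroring the LR + Trees row of Table 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for ad impression data: real-valued features, click labels.
X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) Fit boosted decision trees on the raw input features.
gbdt = GradientBoostingClassifier(n_estimators=100, max_leaf_nodes=8)
gbdt.fit(X_train, y_train)

# 2) apply() gives, per example, the index of the leaf reached in each
#    tree: one categorical feature per tree, as in Figure 1.
leaves_train = gbdt.apply(X_train)[:, :, 0]
leaves_test = gbdt.apply(X_test)[:, :, 0]

# 3) One-hot encode the leaf indices and train a sparse linear classifier.
enc = OneHotEncoder(handle_unknown="ignore")
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.fit_transform(leaves_train), y_train)

ctr_pred = lr.predict_proba(enc.transform(leaves_test))[:, 1]
```

In this arrangement the trees can be retrained relatively infrequently while the linear layer on top is kept fresh, which is the subject of the next two subsections.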
3.2 Data freshness

Click prediction systems are often deployed in dynamic environments where the data distribution changes over time. We study the effect of training data freshness on predictive performance. To do this we train a model on one particular day and test it on consecutive days. We run these experiments both for a boosted decision tree model and for a logistic regression model with tree-transformed input features. In this experiment we train on one day of data, evaluate on the six consecutive days, and compute the normalized entropy on each. The results are shown in Figure 2.

Prediction accuracy clearly degrades for both models as the delay between training and test set increases. For both models it can be seen that NE can be reduced by approximately 1% by going from training weekly to training daily. These findings indicate that it is worth retraining on a daily basis. One option would be to have a recurring daily job that retrains the models, possibly in batch. The time needed to retrain boosted decision trees varies, depending on factors such as the number of training examples, the number of trees, the number of leaves in each tree, CPU, memory, etc. It may take more than 24 hours to build a boosting model with hundreds of trees from hundreds of millions of instances on a single-core CPU. In a practical case, the training can be done within a few hours via sufficient concurrency on a multi-core machine with enough memory to hold the whole training set. In the next section we consider an alternative: the boosted decision trees can be trained daily or every couple of days, but the linear classifier can be trained in near real-time by using some flavor of online learning.

3.3 Online linear classifier

In order to maximize data freshness, one option is to train the linear classifier online, that is, directly as the labelled ad impressions arrive. In the upcoming Section 4 we describe a piece of infrastructure that could generate real-time training data. In this section we evaluate several ways of setting learning rates for SGD-based online learning for logistic regression. We then compare the best variant to online learning for the BOPR model. In terms of (6), we explore the following choices:

1. Per-coordinate learning rate: the learning rate for feature i at iteration t is set to
   \eta_{t,i} = \frac{\alpha}{\beta + \sqrt{\sum_{j=1}^{t} \nabla_{j,i}^{2}}},
   where \alpha and \beta are two tunable parameters (proposed in [8]).
2. Per-weight square root learning rate: \eta_{t,i} = \frac{\alpha}{\sqrt{n_{t,i}}}, where n_{t,i} is the total number of training instances with feature i up to iteration t.
3. Per-weight learning rate: \eta_{t,i} = \frac{\alpha}{n_{t,i}}.
4. Global learning rate: \eta_{t,i} = \frac{\alpha}{\sqrt{t}}.
5. Constant learning rate: \eta_{t,i} = \alpha.

The first three schemes set learning rates individually per feature. The last two use the same rate for all features. All the tunable parameters are optimized by grid search (optima are detailed in Table 2). We lower bound the learning rates by 0.00001 for continuous learning. We train and test LR models on the same data with the above learning rate schemes. The experiment results are shown in Figure 3: SGD with a per-coordinate learning rate achieves the best prediction accuracy, with a NE almost 5% lower than when using the per-weight learning rate.
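To make scheme 1 concrete, here is a minimal NumPy sketch of online logistic regression with the per-coordinate learning rate (our own illustration, not the paper's production code); the `alpha`, `beta`, and `min_lr` parameters correspond to the tunable parameters and the 0.00001 lower bound described above.

```python
import numpy as np

def sgd_per_coordinate(examples, dim, alpha=0.1, beta=1.0, min_lr=1e-5):
    """Online LR with eta_{t,i} = alpha / (beta + sqrt(sum_j grad_{j,i}^2)).

    examples: iterable of (x, y) pairs with x a feature vector of length
    dim and y in {-1, +1}; min_lr is the lower bound on learning rates.
    """
    w = np.zeros(dim)        # model weights
    grad_sq = np.zeros(dim)  # running sum of squared per-coordinate gradients
    for x, y in examples:
        p = 1.0 / (1.0 + np.exp(-w @ x))   # predicted click probability
        g = (p - (1 + y) / 2) * x          # gradient of the log loss w.r.t. w
        grad_sq += g * g
        eta = np.maximum(alpha / (beta + np.sqrt(grad_sq)), min_lr)
        w -= eta * g                       # per-coordinate SGD step
    return w
```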
5.2 Boosting feature importance

In order to measure the importance of a feature, we use a statistic that aims to capture the cumulative loss reduction attributable to a feature. In each tree node construction, a best feature is selected and split to maximize the squared error reduction. Since a feature can be used in multiple trees, the Boosting Feature Importance for each feature is determined by summing the total reduction for that specific feature across all trees.

Typically, a small number of features contributes the majority of explanatory power while the remaining features have only a marginal contribution. We see this same pattern when plotting the number of features versus their cumulative feature importance in Figure 6.

Figure 6: Boosting feature importance. The x-axis corresponds to the number of features. We draw feature importance on a log scale on the left-hand side primary y-axis, while the cumulative feature importance is shown with the right-hand side secondary y-axis.

From the above result, we can see that the top 10 features are responsible for about half of the total feature importance, while the last 300 features contribute less than 1% of feature importance. Based on this finding, we further experiment with keeping only the top 10, 20, 50, 100 and 200 features, and evaluate how the performance is affected. The results of the experiment are shown in Figure 7. From the figure, we can see that the normalized entropy has a similar diminishing-returns property as we include more features.

Figure 7: Results for the Boosting model with top features. We draw calibration on the left-hand side primary y-axis, while the normalized entropy is shown with the right-hand side secondary y-axis.

In the following, we will study the usefulness of historical and contextual features. Due to the sensitive nature of the data and company policy, we are not able to reveal the details of the actual features we use. Some example contextual features can be local time of day, day of week, etc. Historical features can be the cumulative number of clicks on an ad, etc.

5.3 Historical features

The features used in the Boosting model can be categorized into two types: contextual features and historical features. The value of contextual features depends exclusively on current information regarding the context in which an ad is to be shown, such as the device used by the user or the current page that the user is on. On the contrary, historical features depend on previous interactions for the ad or user, for example the click-through rate of the ad in the last week, or the average click-through rate of the user.

In this part, we study how the performance of the system depends on the two types of features. First we check the relative importance of the two types. We do so by sorting all features by importance, then calculating the percentage of historical features among the top k most important features. The result is shown in Figure 8.

Figure 8: Results for historical feature percentage. The x-axis corresponds to the number of features. The y-axis gives the percentage of historical features among the top k most important features.

From the result, we can see that historical features provide considerably more explanatory power than contextual features. The top 10 features ordered by importance are all historical features. Among the top 20 features, there are only 2 contextual features, despite historical features occupying roughly 75% of the features in this data set. To better understand the comparative value of the features from each type in aggregate, we train two Boosting models with only contextual features and only historical features, and then compare the two models with the complete model with all features. The result is shown in Table 4. From the table, we can again verify that in aggregate historical features play a larger role than contextual features.

Figure 11: Experiment result for negative downsampling. The x-axis corresponds to different negative downsampling rates. We draw calibration on the left-hand side primary y-axis, while the normalized entropy is shown with the right-hand side secondary y-axis.

If a model is trained on a data set with negative downsampling, it also calibrates the prediction in the downsampling space. For example, if the average CTR before sampling is 0.1% and we do a 0.01 negative downsampling, the empirical CTR will become roughly 10%. We need to re-calibrate the model for live traffic experiments and get back to the 0.1% prediction with

q = \frac{p}{p + (1-p)/w}

where p is the prediction in the downsampling space and w the negative downsampling rate.
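A one-line sketch of this re-calibration, with the worked example from the text (the function name is ours):

```python
def recalibrate(p, w):
    """Map a prediction p made in negative-downsampled space back to the
    original space; w is the negative downsampling rate (e.g. 0.01)."""
    return p / (p + (1 - p) / w)

# With a true CTR of 0.1%, downsampling negatives at w = 0.01 yields an
# empirical CTR near 10%; mapping such a prediction back recovers ~0.1%:
print(recalibrate(0.091, 0.01))  # ~0.0010
```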
7. DISCUSSION

We have presented some practical lessons from experimenting with Facebook ads data. This has inspired a promising hybrid model architecture for click prediction.

- Data freshness matters. It is worth retraining at least daily. In this paper we have gone further and discussed various online learning schemes. We also presented infrastructure that allows generating real-time training data.
- Transforming real-valued input features with boosted decision trees significantly increases the prediction accuracy of probabilistic linear classifiers. This motivates a hybrid model architecture that concatenates boosted decision trees and a sparse linear classifier.
- Best online learning method: LR with per-coordinate learning rate, which ends up being comparable in performance with BOPR, and performs better than all other LR SGD schemes under study (Table 4, Fig. 12).
- We have described tricks to keep memory and latency contained in massive-scale machine learning applications.
- We have presented the tradeoff between the number of boosted decision trees and accuracy. It is advantageous to keep the number of trees small to keep computation and memory contained.
- Boosted decision trees give a convenient way of doing feature selection by means of feature importance. One can aggressively reduce the number of active features whilst only moderately hurting prediction accuracy.
- We have analyzed the effect of using historical features in combination with context features. For ads and users with history, these features provide superior predictive performance compared to context features.
- Finally, we have discussed ways of subsampling the training data, both uniformly and, more interestingly, in a biased way where only the negative examples are subsampled.

8. REFERENCES

[1] R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, and S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. In SIGMOD, 2013.
[2] L. Bottou. Online algorithms and stochastic approximations. In D. Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998. Revised October 2012.
[3] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, volume 24, 2012.
[4] B. Edelman, M. Ostrovsky, and M. Schwarz. Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. In American Economic Review, volume 97, pages 242-259, 2007.
[5] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189-1232, 1999.
[6] L. Golab and M. T. Özsu. Processing sliding window multi-joins in continuous queries over data streams. In VLDB, pages 500-511, 2003.
[7] T. Graepel, J. Quiñonero Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In ICML, pages 13-20, 2010.
[8] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica. Ad click prediction: a view from the trenches. In KDD, 2013.
[9] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: Estimating the click-through rate for new ads. In WWW, pages 521-530, 2007.
[10] A. Thusoo, S. Antony, N. Jain, R. Murthy, Z. Shao, D. Borthakur, J. Sarma, and H. Liu. Data warehousing and analytics infrastructure at Facebook. In SIGMOD, pages 1013-1020, 2010.
[11] H. R. Varian. Position auctions. In International Journal of Industrial Organization, volume 25, pages 1163-1178, 2007.
[12] J. Yi, Y. Chen, J. Li, S. Sett, and T. W. Yan. Predictive model performance: Offline and online evaluations. In KDD, pages 1294-1302, 2013.