Practical Lessons from Predicting Clicks on Ads at Facebook

ABSTRACT

Online advertising allows advertisers to only bid and pay for measurable user responses, such as clicks on ads. As a consequence, click prediction systems are central to most online advertising systems. With over 750 million daily active users ...

classifiers and diverse online learning algorithms. In the context of linear classification we go on to evaluate the impact of feature transforms and data freshness. Inspired by the practical lessons learned, particularly around data freshness and online learning, we present a model architecture that incorporates an online learning layer, whilst producing fairly compact models. Section 4 describes a key component required for the online learning layer, the online joiner, an experimental piece of infrastructure that can generate a live stream of real-time training data. Lastly we present ways to trade accuracy for memory and compute time and to cope with massive amounts of training data. In Section 5 we describe practical ways to keep memory and latency contained for massive scale applications, and in Section 6 we delve into the tradeoff between training data volume and accuracy.

2. EXPERIMENTAL SETUP

In order to achieve rigorous and controlled experiments, we prepared offline training data by selecting an arbitrary week of the 4th quarter of 2013. In order to maintain the same training and testing data under different conditions, we prepared offline training data which is similar to that observed online. We partition the stored offline data into training and testing parts and use them to simulate the streaming data for online training and prediction. The same training/testing data are used as testbed for all the experiments in the paper.

Evaluation metrics: Since we are most concerned with the impact of the factors to the machine learning model, we use the accuracy of prediction instead of metrics directly related to profit and revenue. In this work, we use Normalized Entropy (NE) and calibration as our major evaluation metrics.

Normalized Entropy, or more accurately Normalized Cross-Entropy, is equivalent to the average log loss per impression divided by what the average log loss per impression would be if a model predicted the background click-through rate (CTR) for every impression. In other words, it is the predictive log loss normalized by the entropy of the background CTR. The background CTR is the average empirical CTR of the training data set. It would perhaps be more descriptive to refer to the metric as the Normalized Logarithmic Loss. The lower the value is, the better is the prediction made by the model. The reason for this normalization is that the closer the background CTR is to either 0 or 1, the easier it is to achieve a better log loss. Dividing by the entropy of the background CTR makes the NE insensitive to the background CTR. Assume a given training data set has N examples with labels y_i ∈ {-1, +1} and estimated probability of click p_i, where i = 1, 2, ..., N. Denote the average empirical CTR as p. Then

    NE = ( -(1/N) Σ_{i=1}^{N} [ ((1+y_i)/2) log(p_i) + ((1-y_i)/2) log(1-p_i) ] )
         / ( -( p log(p) + (1-p) log(1-p) ) )                                      (1)

NE is essentially a component in calculating Relative Information Gain (RIG), and RIG = 1 - NE.

Figure 1: Hybrid model structure. Input features are transformed by means of boosted decision trees. The output of each individual tree is treated as a categorical input feature to a sparse linear classifier. Boosted decision trees prove to be very powerful feature transforms.

Calibration is the ratio of the average estimated CTR to the empirical CTR. In other words, it is the ratio of the number of expected clicks to the number of actually observed clicks. Calibration is a very important metric since accurate and well-calibrated prediction of CTR is essential to the success of online bidding and auction. The less the calibration differs from 1, the better the model is. We only report calibration in the experiments where it is non-trivial.

Note that Area-Under-ROC (AUC) is also a pretty good metric for measuring ranking quality without considering calibration. In a realistic environment, we expect the prediction to be accurate instead of merely getting the optimal ranking order, to avoid potential under-delivery or over-delivery. NE measures the goodness of predictions and implicitly reflects calibration. For example, if a model overpredicts by 2x and we apply a global multiplier of 0.5 to fix the calibration, the corresponding NE will also be improved even though AUC remains the same. See [12] for an in-depth study on these metrics.

3. PREDICTION MODEL STRUCTURE

In this section we present a hybrid model structure: the concatenation of boosted decision trees and of a probabilistic sparse linear classifier, illustrated in Figure 1. In Section 3.1 we show that decision trees are very powerful input feature transformations that significantly increase the accuracy of probabilistic linear classifiers. In Section 3.2 we show how fresher training data leads to more accurate predictions. This motivates the idea to use an online learning method to train the linear classifier. In Section 3.3 we compare a number of online learning variants for two families of probabilistic linear classifiers. The online learning schemes we evaluate are based on the SGD algorithm.

Table 1: Logistic Regression (LR) and boosted decision trees (Trees) make a powerful combination. We evaluate them by their Normalized Entropy (NE) relative to that of the Trees-only model.

    Model Structure    NE (relative to Trees only)
    LR + Trees         96.58%
    LR only            99.43%
    Trees only         100% (reference)

Figure 2: Prediction accuracy as a function of the delay between training and test set in days. Accuracy is expressed as Normalized Entropy relative to the worst result, obtained for the trees-only model with a delay of 6 days.

Table 1 shows that the LR and Tree models used in isolation have comparable prediction accuracy (LR is a bit better), but that it is their combination that yields an accuracy leap. The gain in prediction accuracy is significant; for reference, the majority of feature engineering experiments only manage to decrease Normalized Entropy by a fraction of a percentage.

3.2 Data freshness

Click prediction systems are often deployed in dynamic environments where the data distribution changes over time. We study the effect of training data freshness on predictive performance. To do this we train a model on one particular day and test it on consecutive days. We run these experiments both for a boosted decision tree model, and for a logistic regression model with tree-transformed input features. In this experiment we train on one day of data, evaluate on the six consecutive days, and compute the normalized entropy on each. The results are shown in Figure 2.

Prediction accuracy clearly degrades for both models as the delay between training and test set increases. For both models it can be seen that NE can be reduced by approximately 1% by going from training weekly to training daily. These findings indicate that it is worth retraining on a daily basis. One option would be to have a recurring daily job that retrains the models, possibly in batch. The time needed to retrain boosted decision trees varies, depending on factors such as the number of examples for training, number of trees, number of leaves in each tree, cpu, memory, etc. It may take more than 24 hours to build a boosting model with hundreds of trees from hundreds of millions of instances on a single-core cpu. In a practical case, the training can be done within a few hours via sufficient concurrency on a multi-core machine with a large amount of memory for holding the whole training set. In the next section we consider an alternative: the boosted decision trees can be trained daily or every couple of days, but the linear classifier can be trained in near real-time by using some flavor of online learning.

3.3 Online linear classifier

In order to maximize data freshness, one option is to train the linear classifier online, that is, directly as the labelled ad impressions arrive. In the upcoming Section 4 we describe a piece of infrastructure that could generate real-time training data. In this section we evaluate several ways of setting learning rates for SGD-based online learning for logistic regression. We then compare the best variant to online learning for the BOPR model. In terms of (6), we explore the following choices:

1. Per-coordinate learning rate: The learning rate for feature i at iteration t is set to η_{t,i} = α / (β + sqrt(Σ_{j=1}^{t} ∇_{j,i}²)). α and β are two tunable parameters (proposed in [8]).

2. Per-weight square root learning rate: η_{t,i} = α / sqrt(n_{t,i}), where n_{t,i} is the total number of training instances with feature i till iteration t.

3. Per-weight learning rate: η_{t,i} = α / n_{t,i}.

4. Global learning rate: η_{t,i} = α / sqrt(t).

5. Constant learning rate: η_{t,i} = α.

The first three schemes set learning rates individually per feature. The last two use the same rate for all features. All the tunable parameters are optimized by grid search (optima detailed in Table 2). We lower bound the learning rates by 0.00001 for continuous learning. We train and test LR models on the same data with the above learning rate schemes. The experiment results are shown in Figure 3. From the above result, SGD with per-coordinate learning rate achieves the best prediction accuracy, with a NE almost 5% lower than when using the per-weight learning rate.

5.2 Boosting feature importance

Boosting Feature Importance aims to capture the cumulative loss reduction attributable to a feature. In each tree node construction, a best feature is selected and split to maximize the squared error reduction. Since a feature can be used in multiple trees, the Boosting Feature Importance for each feature is determined by summing the total reduction for a specific feature across all trees. Typically, a small number of features contributes the majority of explanatory power while the remaining features have only a marginal contribution. We see this same pattern when plotting the number of features versus their cumulative feature importance in Figure 6.
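As an illustration of this aggregation, the sketch below sums each feature's squared-error reduction across all trees and reports the cumulative share of importance. The split records and feature names are hypothetical, not taken from the paper's data.

```python
from collections import defaultdict

# Hypothetical per-split records: (feature, squared-error reduction).
# In a real boosted model these would be collected from every node of every tree.
splits = [
    ("ad_ctr_last_week", 40.0), ("user_ctr", 25.0), ("ad_ctr_last_week", 15.0),
    ("device", 6.0), ("time_of_day", 3.0), ("user_ctr", 8.0), ("page_id", 3.0),
]

# Boosting Feature Importance: total reduction per feature, summed across all trees.
importance = defaultdict(float)
for feature, reduction in splits:
    importance[feature] += reduction

# Sort descending and compute each feature's cumulative share of total importance.
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
total = sum(importance.values())
cumulative, share = 0.0, []
for feature, imp in ranked:
    cumulative += imp
    share.append((feature, cumulative / total))

for feature, cum in share:
    print(f"{feature:>18s}  cumulative {cum:.0%}")
```

In a production boosted model the reductions would be accumulated during tree construction; here they are supplied directly to keep the sketch self-contained.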
Figure 6: Boosting feature importance. X-axis corresponds to the number of features. We draw feature importance in log scale on the left-hand side primary y-axis, while the cumulative feature importance is shown with the right-hand side secondary y-axis.

From the above result, we can see that the top 10 features are responsible for about half of the total feature importance, while the last 300 features contribute less than 1% feature importance. Based on this finding, we further experiment with only keeping the top 10, 20, 50, 100 and 200 features, and evaluate how the performance is affected. The result of the experiment is shown in Figure 7. From the figure, we can see that the normalized entropy has a similar diminishing-return property as we include more features.

In the following, we will do some study on the usefulness of historical and contextual features. Due to the sensitive nature of the data and the company policy, we are not able to reveal the detail on the actual features we use. Some example contextual features can be local time of day, day of week, etc. Historical features can be the cumulative number of clicks on an ad, etc.

5.3 Historical features

The features used in the Boosting model can be categorized into two types: contextual features and historical features. The value of contextual features depends exclusively on current information regarding the context in which an ad is to be shown, such as the device used by the users or the current page that the user is on. On the contrary, the historical features depend on previous interaction for the ad or user, for example the click-through rate of the ad in the last week, or the average click-through rate of the user.
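To make the historical type concrete, a feature such as an ad's click-through rate over the last week can be derived from an impression log. A minimal sketch, assuming a toy log of (ad_id, date, clicked) records (all values invented):

```python
from datetime import date, timedelta

# Hypothetical impression log: (ad_id, date, clicked).
log = [
    ("ad_1", date(2013, 11, 4), True),
    ("ad_1", date(2013, 11, 5), False),
    ("ad_1", date(2013, 11, 6), False),
    ("ad_1", date(2013, 10, 1), True),   # too old, falls outside the 7-day window
    ("ad_2", date(2013, 11, 6), False),
]

def ad_ctr_last_week(log, ad_id, today):
    """Historical feature: empirical CTR of an ad over the 7 days before `today`."""
    window_start = today - timedelta(days=7)
    impressions = [clicked for aid, d, clicked in log
                   if aid == ad_id and window_start <= d < today]
    if not impressions:
        return 0.0  # back off to a default for ads with no history
    return sum(impressions) / len(impressions)

print(ad_ctr_last_week(log, "ad_1", date(2013, 11, 8)))  # 1 click / 3 impressions
```

A contextual feature, by contrast, would be read directly off the current request (device, page, local time) and needs no such log aggregation.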
Figure 7: Results for Boosting model with top features. We draw calibration on the left-hand side primary y-axis, while the normalized entropy is shown with the right-hand side secondary y-axis.

In this part, we study how the performance of the system depends on the two types of features. Firstly we check the relative importance of the two types of features. We do so by sorting all features by importance, then calculating the percentage of historical features in the first k important features. The result is shown in Figure 8.

Figure 8: Results for historical feature percentage. X-axis corresponds to the number of features. Y-axis gives the percentage of historical features in the top k important features.

From the result, we can see that historical features provide considerably more explanatory power than contextual features. The top 10 features ordered by importance are all historical features. Among the top 20 features, there are only 2 contextual features, despite historical features occupying roughly 75% of the features in this dataset. To better understand the comparative value of the features from each type in aggregate, we train two Boosting models with only contextual features and only historical features, then compare the two models with the complete model with all features. The result is shown in Table 4. From the table, we can again verify that in aggregate historical features play a larger role than contextual features.
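The percentage computation described above (sort all features by importance, then take the fraction of historical features among the top k) can be sketched as follows, with made-up feature names and importance values:

```python
# Hypothetical features: (name, importance, is_historical).
features = [
    ("ad_ctr_last_week", 55.0, True),
    ("user_ctr", 33.0, True),
    ("ad_clicks_total", 12.0, True),
    ("device", 6.0, False),
    ("time_of_day", 3.0, False),
    ("page_id", 2.0, False),
]

def historical_fraction_top_k(features, k):
    """Fraction of historical features among the k most important ones."""
    ranked = sorted(features, key=lambda f: f[1], reverse=True)[:k]
    return sum(1 for _, _, hist in ranked if hist) / len(ranked)

for k in (3, 6):
    print(k, historical_fraction_top_k(features, k))
```

Sweeping k from 1 to the number of features yields a curve of the kind shown in Figure 8.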
Figure 11: Experiment result for negative downsampling. The X-axis corresponds to different negative downsampling rates. We draw calibration on the left-hand side primary y-axis, while the normalized entropy is shown with the right-hand side secondary y-axis.

If a model is trained on a data set with negative downsampling, it also calibrates the prediction in the downsampling space. For example, if the average CTR before sampling is 0.1% and we do a 0.01 negative downsampling, the empirical CTR will become roughly 10%. We need to re-calibrate the model for live traffic experiments and get back to the 0.1% prediction with

    q = p / (p + (1 - p)/w)

where p is the prediction in the downsampling space and w the negative downsampling rate.

7. DISCUSSION

We have presented some practical lessons from experimenting with Facebook ads data. This has inspired a promising hybrid model architecture for click prediction.

- Data freshness matters. It is worth retraining at least daily. In this paper we have gone further and discussed various online learning schemes. We also presented infrastructure that allows generating real-time training data.

- Transforming real-valued input features with boosted decision trees significantly increases the prediction accuracy of probabilistic linear classifiers. This motivates a hybrid model architecture that concatenates boosted decision trees and a sparse linear classifier.

- Best online learning method: LR with per-coordinate learning rate, which ends up being comparable in performance with BOPR, and performs better than all other LR SGD schemes under study (Table 4, Fig. 12).

- We have described tricks to keep memory and latency contained in massive scale machine learning applications.

- We have presented the tradeoff between the number of boosted decision trees and accuracy. It is advantageous to keep the number of trees small to keep computation and memory contained.

- Boosted decision trees give a convenient way of doing feature selection by means of feature importance. One can aggressively reduce the number of active features whilst only moderately hurting prediction accuracy.

- We have analyzed the effect of using historical features in combination with context features. For ads and users with history, these features provide superior predictive performance compared to context features.

- Finally, we have discussed ways of subsampling the training data, both uniformly and, more interestingly, in a biased way where only the negative examples are subsampled.

8. REFERENCES

[1] R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, and S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. In SIGMOD, 2013.
[2] L. Bottou. Online algorithms and stochastic approximations. In D. Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998. Revised, Oct 2012.
[3] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, volume 24, 2012.
[4] B. Edelman, M. Ostrovsky, and M. Schwarz. Internet advertising and the generalized second price auction: Selling billions of dollars worth of keywords. In American Economic Review, volume 97, pages 242-259, 2007.
[5] J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189-1232, 1999.
[6] L. Golab and M. T. Ozsu. Processing sliding window multi-joins in continuous queries over data streams. In VLDB, pages 500-511, 2003.
[7] T. Graepel, J. Quiñonero Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In ICML, pages 13-20, 2010.
[8] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica. Ad click prediction: a view from the trenches. In KDD, 2013.
[9] M. Richardson, E. Dominowska, and R. Ragno. Predicting clicks: Estimating the click-through rate for new ads. In WWW, pages 521-530, 2007.
[10] A. Thusoo, S. Antony, N. Jain, R. Murthy, Z. Shao, D. Borthakur, J. Sarma, and H. Liu. Data warehousing and analytics infrastructure at Facebook. In SIGMOD, pages 1013-1020, 2010.
[11] H. R. Varian. Position auctions. In International Journal of Industrial Organization, volume 25, pages 1163-1178, 2007.
[12] J. Yi, Y. Chen, J. Li, S. Sett, and T. W. Yan. Predictive model performance: Offline and online evaluations. In KDD, pages 1294-1302, 2013.
