Hellerstein UC Berkeley Caleb Welton Greenplum ABSTRACT As massive data acquisition and storage becomes increas ingly a64256ordable a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis In this paper we h ID: 5223
Download Pdf The PPT/PDF document "MAD Skills New Analysis Practices for Bi..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.
housecankeeppacetodayonlybybeing\magnetic":attractingallthedatasourcesthatcropupwithinanorganizationregardlessofdataqualityniceties.Agile:DataWarehousingorthodoxyisbasedonlong-range,carefuldesignandplanning.Givengrowingnumbersofdatasourcesandincreasinglysophisticatedandmission-criticaldataanalyses,amodernwarehousemustinsteadallowanalyststoeasilyingest,digest,produceandadaptdataatarapidpace.Thisrequiresadatabasewhosephysicalandlogicalcontentscanbeincontinuousrapidevolution.Deep:Moderndataanalysesinvolveincreasinglyso-phisticatedstatisticalmethodsthatgowellbeyondtherollupsanddrilldownsoftraditionalBI.Moreover,an-alystsoftenneedtoseeboththeforestandthetreesinrunningthesealgorithms{theywanttostudyenor-mousdatasetswithoutresortingtosamplesandex-tracts.Themoderndatawarehouseshouldservebothasadeepdatarepositoryandasasophisticatedalgo-rithmicruntimeengine.AsnotedbyVarian,thereisagrowingpremiumonan-alystswithMADskillsindataanalysis.Theseareoftenhighlytrainedstatisticians,whomayhavestrongsoftwareskillsbutwouldtypicallyratherfocusondeepdataanaly-sisthandatabasemanagement.Theyneedtobecomple-mentedbyMADapproachestodatawarehousedesignanddatabasesysteminfrastructure.Thesegoalsraiseinterest-ingchallengesthataredierentthanthetraditionalfocusinthedatawarehousingresearchandindustry.1.1ContributionsInthispaper,wedescribetechniquesandexperienceswehavedevelopedinourdevelopmentofMADanalyticsforFoxAudienceNetwork,usingalargeinstallationoftheGreen-plumDatabasesystem.Wediscussourdatabasedesignmethodologythatfocusesonenablinganagileyetorganizedapproachtodataanalysis(Section4).Wepresentanumberofdata-parallelstatisticalalgorithmsdevelopedforthisset-ting,whichfocusonmodelingandcomparingthedensitiesofdistributions.TheseincludespecicmethodslikeOrdinaryLeastSquares,ConjugateGradiant,andMann-WhitneyUTesting,aswellasgeneralpurposetechniqueslikematrixmultiplicationandBootstrapping(Section5).Finally,were ectoncriticaldatabasesystemfeaturesthatenableagiledesignand exiblealgorithmdevelopment,includinghigh-performancedataingress/egress,heterogeneousstoragefa-cilities,and exibleprogrammingviabothextensibleSQLandMapReduceinterfacestoasinglesystem(Section6).Underlyingourdiscussionarechallengestoanumberofpointsofconventionalwisdom.Indesigningandanalyzingdatawarehouses,weadvocatethetheme\ModelLess,It-erateMore".Thischallengesdatawarehousingorthodoxy,andarguesforashiftinthelocusofpowerfromDBAsto-wardanalysts.Wedescribetheneedforuniedsystemsthatembraceandintegrateawidevarietyofdata-intensiveprogrammingstyles,sinceanalystscomefrommanywalksoflife.ThisinvolvesmovingbeyondreligiousdebatesabouttheadvantagesofSQLoverMapReduce,orRoverJava,tofocusonevolvingasingleparalleldata owenginethatcansupportadiversityofprogrammingstyles,totacklesub-stantivestatisticalanalytics.Finally,wearguethatmanydatasourcesandstorageformatscanandshouldbeknit-tedtogetherbytheparalleldata owengine.Thispointstowardmore uidintegrationorconsolidationoftradition-allydiversetoolsincludingtraditionalrelationaldatabases,columnstores,ETLtools,anddistributedlesystems.2.BACKGROUND:IFYOU'RENOTMADDataanalyticsisnotanewarea.Inthissectionwede-scribestandardpracticeandrelatedworkinBusinessIntel-ligenceandlarge-scaledataanalysis,tosetthecontextforourMADapproach.2.1OLAPandDataCubesDataCubesandOn-LineAnalyticProcessing(OLAP)werepopularizedinthe1990's,leadingtointensecommer-cialdevelopmentandsignicantacademicresearch.TheSQLCUBEBYextensiontranslatedthebasicideaofOLAPintoarelationalsetting[8].BItoolspackagethesesum-mariesintofairlyintuitive\cross-tabs"visualizations.Bygroupingonfewdimensions,theanalystseesacoarse\roll-up"bar-chart;bygroupingonmoredimensionsthey\drilldown"intonergraineddetail.Statisticiansusethephrasedescriptivestatisticsforthiskindofanalysis,andtradition-allyapplytheapproachtotheresultsofanexperimentalstudy.Thisfunctionalityisusefulforgainingintuitionaboutaprocessunderlyingtheexperiment.Forexample,byde-scribingtheclickstreamatawebsiteonecangetbetterinu-itionaboutunderlyingpropertiesoftheuserpopulation.Bycontrast,inferentialorinductivestatisticstrytodi-rectlycapturetheunderlyingpropertiesofthepopulation.Thisincludesttingmodelsandparameterstodataandcomputinglikelihoodfunctions.InferentialstatisticsrequiremorecomputationthanthesimplesummariesprovidedbyOLAP,butprovidemoreprobabilisticpowerthatcanbeusedfortaskslikeprediction(e.g.,\whichuserswouldbelikelytoclickonthisnewad?"),causalityanalysis(\whatfeaturesofapageresultinuserrevisits?"),anddistribu-tionalcomparison(e.g.,\howdothebuyingpatternsoftruckownersdierfromsedanowners?")Inferentialap-proachesarealsomorerobusttooutliersandotherpar-ticularsofagivendataset.WhileOLAPandDataCubesremainusefulforintuition,theuseofinferentialstatisticshasbecomeanimperativeinmanyimportantautomatedorsemi-automatedbusinessprocessestoday,includingadplacement,websiteoptimization,andcustomerrelationshipmanagement.2.2DatabasesandStatisticalPackagesBItoolsprovidefairlylimitedstatisticalfunctionality.Itisthereforestandardpracticeinmanyorganizationstoex-tractportionsofadatabaseintodesktopsoftwarepackages:statisticalpackagelikeSAS,MatlaborR,spreadsheetslikeExcel,orcustomcodewritteninlanguageslikeJava.Therearevariousproblemswiththisapproach.First,copyingoutalargedatabaseextractisoftenmuchlesse-cientthanpushingcomputationtothedata;itiseasytogetordersofmagnitudeperformancegainsbyrunningcodeinthedatabase.Second,moststatpackagesrequiretheirdatatotinRAM.Forlargedatasets,thismeanssamplingthedatabasetoformanextract,whichlosesdetail.Inmodernsettingslikeadvertisingplacement,microtargetingrequiresanunderstandingofevensmallsubpopulations.Samplesandsynopsescanlosethe\longtail"inadataset,andthatisincreasinglywherethecompetitionforeectivenesslies. supervisedandunsupervisedlearning,neuralnetworkandnaturallanguageprocessing.Thesearenottechniquestra-ditionallyaddressedbyanRDBMS,andtheimplementationinHadoopresultsinlargedatamigrationeortsforspecicsingle-purposestudies.Theavailabilityofmachinelearningmethodsdirectlywithinthewarehousewouldoerasigni-cantsavingsintime,training,andsystemmanagement,andisoneofthegoalsoftheworkdescribedhere.4.MADDATABASEDESIGNTraditionalDataWarehousephilosophyrevolvesaroundadisciplinedapproachtomodelinginformationandprocessesinanenterprise.InthewordsofwarehousingadvocateBillInmon,itisan\architectedenvironment"[13].Thisviewofwarehousingisatoddswiththemagnetismandagilitydesiredinmanynewanalysissettings,aswedescribebelow.4.1NewRequirementsAsdatasavvypeople,analystsintroduceanewsetofre-quirementstoadatabaseenvironment.Theyhaveadeepunderstandingoftheenterprisedataandtendtobeearlyadoptersofnewdatasources.Inthesamewaythatsys-temsengineersalwayswantthelatest-and-greatesthardwaretechnologies,analystsarealwayshungryfornewsourcesofdata.Whennewdata-generatingbusinessprocessesarelaunched,analystsdemandthenewdataimmediately.Thesedesiresforspeedandbreadthofdataraiseten-sionswithDataWarehousingorthodoxy.Inmondescribesthetraditionalview:Thereisnopointinbringingdata...intothedatawarehouseenvironmentwithoutintegratingit.Ifthedataarrivesatthedatawarehouseinanunintegratedstate,itcannotbeusedtosupportacorporateviewofdata.Andacorporateviewofdataisoneoftheessencesofthearchitectedenvironment.[13]Unfortunately,thechallengeofperfectlyintegratinganewdatasourceintoan\architected"warehouseisoftensub-stantial,andcanholdupaccesstodataformonths{orinmanycases,forever.Thearchitecturalviewintroducesfric-tionintoanalytics,repelsdatasourcesfromthewarehouse,andasaresultproducesshallowincompletewarehouses.ItistheoppositeoftheMADideal.Giventhegrowingsophisticationofanalystsandthegrow-ingvalueofanalytics,wetaketheviewthatitismuchmoreimportanttoprovideagilitytoanalyststhantoaspiretoanelusiveidealoffullintegration.Infact,analystsserveaskeydatamagnetsinanorganization,scoutingforinterest-ingdatathatshouldbepartofthecorporatebigpicture.Theycanalsoactasanearlywarningsystemfordataqual-ityissues.Fortheprivilegeofbeingthersttoseethedata,theyaremoretolerantofdirtydata,andwillacttoapplypressureonoperationaldataproducersupstreamofthewarehousetorectifythedatabeforeitarrives.AnalyststypicallyhavemuchhigherstandardsforthedatathanatypicalbusinessunitworkingwithBItools.Theyareun-dauntedbybig, atfacttablesthatholdcompletedatasets,scorningsamplesandaggregates,whichcanbothmasker-rorsandloseimportantfeaturesinthetailsofdistributions.Henceitisourexperiencethatagoodrelationshipwiththeanalyticsteamisanexcellentpreventativemeasurefordatamanagementissueslateron.Feedingtheirappetitesandrespondingtotheirconcernsimprovestheoverallhealthofthewarehouse.Ultimately,theanalystsproducenewdataproductsthatarevaluabletotheenterprise.Theyarenotjustconsumers,butproducersofenterprisedata.Thisrequirestheware-housetobepreparedto\productionalize"thedatagener-atedbytheanalystsintostandardbusinessreportingtools.Itisalsouseful,whenpossible,toleverageasinglepar-allelcomputingplatform,andpushasmuchfunctionalityaspossibleintoit.Thislowersthecostofoperations,andeasestheevolutionofsoftwarefromanalystexperimentstoproductioncodethataectsoperationalbehavior.Forex-ample,thelifecycleofanadplacementalgorithmmightstartinaspeculativeanalytictask,andendasacustomerfac-ingfeature.Ifitisadata-drivenfeature,itisbesttohavethatentirelifecyclefocusedinasingledevelopmentenviron-mentonthefullenterprisedataset.InthisrespectweagreewithacentraltenetofDataWarehouseorthodoxy:thereistangiblebenettogettinganorganization'sdataintoonerepository.Wedieronhowtoachievethatgoalinausefulandsophisticatedmanner.Insum,ahealthybusinessshouldnotassumeanarchi-tecteddatawarehouse,butratheranevolvingstructurethatiteratesthroughacontinuingcycleofchange:1.Thebusinessperformsanalyticstoidentifyareasofpotentialimprovement.2.Thebusinesseitherreactstoorignoresthisanalysis.3.Areactionresultsinnewordierentbusinesspractices{perhapsnewprocessesoroperationalsystems{thattypicallygeneratenewdatasets.4.Analystsincorporatenewdatasetsintotheirmodels.5.Thebusinessagainasksitself\Howcanweimprove?"Ahealthy,competitivebusinesswilllooktoincreasethepaceofthiscycle.TheMADapproachwedescribenextisadesignpatternforkeepingupwiththatincreasingpace.4.2GettingMoreMADThecentralphilosophyinMADdatamodelingistogettheorganization'sdataintothewarehouseassoonaspossi-ble.Secondarytothatgoal,thecleaningandintegrationofthedatashouldbestagedintelligently.Toturnthesethemesintopractice,weadvocatethethree-layerapproach.AStagingschemashouldbeusedwhenloadingrawfacttablesorlogs.Onlyengineersandsomeanalystsarepermittedtomanipulatedatainthisschema.TheProductionDataWarehouseschemaholdstheaggre-gatesthatservemostusers.Moresophisticateduserscom-fortableinalarge-scaleSQLenvironmentaregivenaccesstothisschema.AseparateReportingschemaismaintainedtoholdspecialized,staticaggregatesthatsupportreportingtoolsandcasualusers.Itshouldbetunedtoproviderapidaccesstomodestamountsofdata.Thesethreelayersarenotphysicallyseparated.Userswiththecorrectpermissionsareabletocross-joinbetweenlayersandschemas.IntheFANmodel,theStagingschemaholdsrawactionlogs.Analystsaregivenaccesstotheselogsforresearchpurposesandtoencouragealaboratoryapproachtodataanalysis.Questionsthatstartattheeventlogleveloftenbecomebroaderinscope,allowingcustomaggregates.CommunicationbetweentheresearchersandtheDBAsuncoverscommonquestionsandoftenresultsinaggregatesthatwereoriginallypersonalizedforananalyst Thiswillproduceasquarematrixandavectorbasedonthesizeoftheindependentvariablevector.Thenalcal-culationiscomputedbyinvertingthesmallAmatrixandmultiplyingbythevectortoderivethecoecients.Additionally,calculationofthecoecientofdetermina-tionR2canbecalculatedconcurrentlybySSR=b01 nXyi2TSS=Xy2i1 nXyi2R2=SSR TSSInthefollowingSQLquery,wecomputethecoecients,aswellasthecomponentsofthecoecientofdetermi-nation:CREATEVIEWolsASSELECTpseudo_inverse(A)*basbeta_star,(transpose(b)*(pseudo_inverse(A)*b)-sum_y2/count)--SSR/(sum_yy-sumy2/n)--TSSasr_squaredFROM(SELECTsum(transpose(d.vector)*d.vector)asA,sum(d.vector*y)asb,sum(y)^2assum_y2,sum(y^2)assum_yy,count(*)asnFROMdesignd)ols_aggs;Notetheuseofauser-denedfunctionforvectortranspose,anduser-denedaggregatesforsummationof(multidimen-sional)arrayobjects.ThearrayAisasmallin-memorymatrixthatwetreatasasingleobject;thepseudo-inversefunctionimplementsthetextbookMoore-Penrosepseudoin-verseofthematrix.Alloftheabovecanbeecientlycalculatedinasinglepassofthedata.Forconvenience,weencapsulatedthisyetfurtherviatwouser-denedaggregatefunctions:SELECTols_coef(d.y,d.vector),ols_r2(d.y,d.vector)FROMdesignd;PriortotheimplementationofthisfunctionalitywithintheDBMS,oneGreenplumcustomerwasaccustomedtocal-culatingtheOLSbyexportingdataandimportingthedataintoRforcalculation,aprocessthattookseveralhourstocomplete.Theyreportedsignicantperformanceimprove-mentwhentheymovedtorunningtheregressionwithintheDBMS.Mostofthebenetderivedfromrunningtheanaly-sisinparallelclosetothedatawithminimaldatamovement.5.2.2ConjugateGradientInthissubsectionwedevelopadata-parallelimplementa-tionoftheConjugateGradiantmethodforsolvingasystemoflinearequations.WecanusethistoimplementSup-portVectorMachines,astate-of-the-arttechniqueforbinaryclassication.Binaryclassiersareacommontoolinmod-ernadplacement,usedtoturncomplexmulti-dimensionaluserfeaturesintosimplebooleanlabelslike\isacaren-thusiast"thatcanbecombinedintoenthusiastcharts.InadditiontoservingasabuildingblockforSVMs,theConju-gateGradiantmethodallowsustooptimizealargeclassoffunctionsthatcanbeapproximatedbysecondorderTaylorexpansions.Toamathematician,thesolutiontothematrixequationAx=bissimplewhenitexists:x=A1b.AsnotedinSection5.1,wecannotassumewecanndA1.IfmatrixAisnnsymmetricandpositivedenite(SPD),wecanusetheConjugateGradientmethod.Thismethodrequiresneitherdf(y)norA1andconvergesinnomorethanninterations.Ageneraltreatmentisgivenin[17].HereweoutlinethesolutiontoAx=basanextremumoff(x)=1 2x0Ax+b0x+c.Broadly,wehaveanestimate^xtooursolutionx.Since^xisonlyanestimate,r0=A^xbisnon-zero.Subtractingthiserrorr0fromtheestimateallowsustogenerateaseriespi=ri1fA^xbgoforthogonalvectors.Thesolutionwillbex=Piipiforidenedbelow.Weendatthepointkrkk2forasuitable.Thereareseveralupdatealgorithms,wehavewrittenoursinmatrixnotation.r0=bA^x0;0=r00r0 v00Av0;v0=r0;i=0Beginiterationoveri.i=r0iri v0iAvixi+1=xi+iviri+1=riiAvicheckkri+1k2vi+1=ri+1+r0i+1ri+1 r0iriviToincorporatethismethodintothedatabase,westored(vi;xi;ri;i)asarowandinsertedrowi+1inonepass.Thisrequiredtheconstructionoffunctionsupdate alpha(r i,p i,A),update x(x i,alpha i,v i),update r(x i,alpha i,v i,A),andupdate v(r i,alpha i,v i,A).Thoughthefunctioncallswereredundant(forinstance,update v()alsorunstheupdateofri+1),thisallowedustoinsertonefullrowatatime.Anexternaldriverprocessthenchecksthevalueofribeforeproceeding.Uponconvergence,itisrudimentarytocomputex.ThepresenceoftheconjugategradientmethodenablesevenmoresophisticatedtechniqueslikeSupportVectorMa-chines(SVM).Attheircore,SVMsseektomaximizethedistancebetweenasetofpointsandacandiatehyperplane.Thisdistanceisdenotedbythemagnitudeofthenormalvectorskwk2.Mostmethodsincorporatetheintegersf0;1gaslabelsc,sotheproblembecomesargmaxw;bf(w)=1 2kwk2;subjecttoc0wb0:ThismethodappliestothemoregeneralissueofhighdimensionalfunctionsunderaTaylorexpansionfx0(x)f(x0)+df(x)(xx0)+1 2(xxo)0d2f(x)(xx0)Withagoodinitialguessforxandthecommonassumptionofcontinuityoff(),weknowthethematrixwillbeSPDnearx.See[17]fordetails.5.3FunctionalsBasicstatisticsarenotnewtorelationaldatabases{mostsupportmeans,variancesandsomeformofquantiles.Butmodelingandcomparativestatisticsarenottypicallybuilt-infunctionality.Inthissectionweprovidedata-parallel implementationsofanumberofcomparativestatisticsex-pressedinSQL.Intheprevioussection,scalarsorvectorsweretheatomicunit.Hereaprobabilitydensityfunctionisthefounda-tionalobject.ForinstancetheNormal(Gaussian)densityf(x)=e(x)2=22isconsideredbymathematiciansasasingle\entity"withtwoattributes:themeanandvari-ance.Acommonstatisticalquestionistoseehowwelladatasettstoatargetdensityfunction.Thezscoreofadatumxisgivenz(x)=(x) =p nandiseasytoobtaininstandardSQL.SELECTx.value,(x.value-d.mu)*d.n/d.sigmaASz_scoreFROMx,designd5.3.1MannWhitneyUTestRankandorderstatisticsarequiteamenabletorelationaltreatments,sincetheirmainpurposeistoevaluateasetofdata,ratherthenonedatumatatime.Thenextexampleillustratesthenotionofcomparingtwoentiresetsofdatawithouttheoverheadofdescribingaparameterizeddensity.TheMann-WhitneyUTest(MWU)isapopularsubsti-tuteforStudent'st-testinthecaseofnon-parametricdata.ThegeneralideaittotaketwopopulationsAandBanddecideiftheyarefromthesameunderlyingpopulationbyexaminingtherankorderinwhichmembersofAandBshowupinageneralordering.Thecartoonisthatifmem-bersofAareatthe\front"ofthelineandmembersofBareatthe\back"oftheline,thenAandBaredier-entpopulations.Inanadvertisingsetting,click-throughratesforwebadstendtodefysimpleparametricmodelslikeGaussiansorlog-normaldistributions.Butitisoftenusefultocompareclick-throughratedistributionsfordier-entadcampaigns,e.g.,tochooseonewithabettermedianclick-through.MWUaddressesthistask.GivenatableTwithcolumnsSAMPLE ID,VALUE,rownum-bersareobtainedandsummedviaSQLwindowingfunc-tions.CREATEVIEWRASSELECTsample_id,avg(value)ASsample_avgsum(rown)ASrank_sum,count(*)ASsample_n,sum(rown)-count(*)*(count(*)+1)ASsample_usFROM(SELECTsample_id,row_number()OVER(ORDERBYvalueDESC)ASrown,valueFROMT)ASorderedGROUPBYsample_idAssumingtheconditionoflargesamplesizes,forinstancegreaterthen5,000,thenormalapproximationcanbejusti-ed.UsingthepreviousviewR,thenalreportedstatisticsaregivenbySELECTr.sample_u,r.sample_avg,r.sample_n(r.sample_u-a.sum_u/2)/sqrt(a.sum_u*(a.sum_n+1)/12)ASz_scoreFROMRasr,(SELECTsum(sample_u)ASsum_u,sum(sample_n)ASsum_nFROMR)ASaGROUPBYr.sample_u,r.sample_avg,r.sample_n,a.sum_n,a.sum_uTheendresultisasmallsetofnumbersthatdescribearelationshipoffunctions.Thissimpleroutinecanbeen-capsulatedbystoredproceduresandmadeavailabletotheanalystsviaasimpleSELECTmann whitney(value)FROMtablecall,elevatingthevocabularyofthedatabasetremendously.5.3.2LogLikelihoodRatiosLikelihoodratiosareusefulforcomparingasubpopula-tiontoanoverallpopulationonaparticularattributed.Asanexampleinadvertising,considertwoattributesofusers:beveragechoice,andfamilystatus.Onemightwanttoknowwhethercoeeattractsnewparentsmorethanthegeneralpopulation.Thisisacaseofhavingtwodensity(ormass)functionsforthesamedatasetX.Denoteonedistributionasnullhypothesisf0andtheotherasalternatefA.Typically,f0andfAaredierentparameterizationsofthesamedensity.Forinstance,N(0;0)andN(A;A).ThelikelihoodLunderfiisgivenbyLfi=L(Xjfi)=Ykfi(xk):Thelog-likelihoodratioisgivenbythequotient2log(Lf0=LfA).Takingthelogallowsustousethewell-known2approxi-mationforlargen.Also,theproductsturnnicelyintosumsandanRDBMScanhandleiteasilyinparallel.LLR=2XklogfA(xk)2Xklogf0(xk):Thiscalculationdistributesnicelyiffi:R!R,whichmostdo.Iffi:Rn!R,thencaremustbetakeninmanagingthevectorsasdistributedobjects.SupposethevaluesareintableTandthefunctionfA()hasbeenwrittenasauser-denedfunctionf llk(xnumeric,paramnumeric).ThentheentireexperimentiscanbeperformedwiththecallSELECT2*sum(log(f_llk(T.value,d.alt_param)))-2*sum(log(f_llk(T.value,d.null_param)))ASllrFROMT,designASdThisrepresentsasignicantgainin exibilityandsophisti-cationforanyRDBMS.Example:TheMultinomialDistributionThemultinomialdistributionextendsthebinomialdis-tribution.ConsiderarandomvariableXwithkdiscreteoutcomes.Thesehaveprobabilitiesp=(p1;:::;pk).Inntrials,thejointprobabilitydistributionisgivenbyP(Xjp)= n(n1;:::;nk)!pn11pnkk:Toobtainpi,weassumeatableoutcomewithcolumnoutcomerepresentingthebasepopulation.CREATEVIEWBASSELECToutcome,outcome_count/sum(outcome_count)over()ASpFROM(SELECToutcome,count(*)::numericASoutcome_countFROMinputGROUPBYoutcome)ASaInthecontextofmodelselection,itisoftenconvenienttocomparethesamedatasetundertwodierentmultinomialdistributions.LLR=2logP(Xjp) P(Xj~p)=2log nn1;:::;nkpn11pnkk nn1;:::;nk~pn11~pnkk!=2Xinilog~piXinilogpi:OrinSQL: discussedinthecontextofdataintegration[18].ButthefocusinaMADwarehousecontextisonmassivelyparallelaccesstoledatathatlivesonalocalhigh-speednetwork.Greenplumimplementsfullyparallelaccessforbothload-ingandqueryprocessingoverexternaltablesviaatechniquecalledScatter/GatherStreaming.Theideaissimilartotra-ditionalshared-nothingdatabaseinternals[7],butrequirescoordinationwithexternalprocessesto\feed"alltheDBMSnodesinparallel.Asthedataisstreamedintothesystemitcanbelandedindatabasetablesforsubsequentaccess,oruseddirectlyasapurelyexternaltablewithparallelI/O.Usingthistechnology,Greenplumcustomershavereportingloadingspeedsofafully-mirrored,productiondatabaseinexcessoffourterabytesperhourwithnegligibleimpactonconcurrentdatabaseoperations.6.1.1ETLandELTTraditionaldatawarehousingissupportedbycustomtoolsfortheExtract-Transform-Load(ETL)task.Inrecentyears,thereisincreasingpressuretopushtheworkoftransforma-tionintotheDBMS,toenableparallelexecutionviaSQLtransformationscripts.ThisapproachhasbeendubbedELTsincetransformationisdoneafterloading.TheELTapproachbecomesevenmorenaturalwithexternaltables.Transformationqueriescanbewrittenagainstexternalta-bles,removingtheneedtoeverloaduntransformeddata.Thiscanspeedupthedesignloopfortransformationssub-stantially{especiallywhencombinedwithSQL'sLIMITclauseasa\poorman'sOnlineAggregation"[11]todebugtrans-formations.InadditiontotransformationswritteninSQL,Green-plumsupportsMapReducescriptingintheDBMS,whichcanrunovereitherexternaldataviaScatter/Gather,orin-databasetables(Section6.3).Thisallowsprogrammerstowritetransformationscriptsinthedata ow-styleprogram-mingusedbymanyETLtools,whilerunningatscaleusingtheDBMS'facilitiesforparallelism.6.2DataEvolution:StorageandPartitioningThedatalifecycleinaMADwarehouseincludesdatainvariousstates.Whenadatasourceisrstbroughtintothesystem,analystswilltypicallyiterateoveritfrequentlywithsignicantanalysisandtransformation.Astransfor-mationsandtabledenitionsbegintosettleforaparticulardatasource,theworkloadlooksmoreliketraditionalEDWsettings:frequentappendstolarge\fact"tables,andocca-sionalupdatesto\detail"tables.Thismaturedataislikelytobeusedforad-hocanalysisaswellasforstandardre-portingtasks.Asdatainthe\fact"tablesagesovertime,itmaybeaccessedlessfrequentlyoreven\rolledo"toanexternalarchive.Notethatallthesestagesco-occurinasinglewarehouseatagiventime.HenceagoodDBMSforMADanalyticsneedstosupportmultiplestoragemechanisms,targetedatdierentstagesofthedatalifecycle.Intheearlystage,externaltablespro-videalightweightapproachtoexperimentwithtransforma-tions.Detailtablesareoftenmodestinsizeandundergoperiodicupdates;theyarewellservedbytraditionaltrans-actionalstoragetechniques.Append-mostlyfacttablescanbebetterservedbycompressedstorage,whichcanhandleappendsandreadseciently,attheexpenseofmakingup-datesslower.Itshouldbepossibletorollthisdataoofthewarehouseasitages,withoutdisruptingongoingprocessing.Greenplumprovidesmultiplestorageengines,witharichSQLpartitioningspecicationtoapplythem exiblyacrossandwithintables.Asmentionedabove,Greenplumincludesexternaltablesupport.Greenplumalsoprovidesatradi-tional\heap"storageformatfordatathatseesfrequentupdates,andahighly-compressed\append-only"(AO)ta-blefeaturefordatathatisnotgoingtobeupdated;bothareintegratedwithinatransactionalframework.Green-plumAOstorageunitscanhaveavarietyofcompressionmodes.Atoneextreme,withcompressiono,bulkloadsrunveryquickly.Alternatively,themostaggressivecompressionmodesaretunedtouseaslittlespaceaspossible.Thereisalsoamiddlegroundwith\medium"compressiontoprovideimprovedtablescantimeattheexpenseofslightlyslowerloads.InarecentversionGreenplumalsoadds\column-store"partitioningofappend-onlytables,akintoideasintheliterature[20].Thiscanimprovecompression,anden-suresthatqueriesoverlargearchivaltablesonlydoI/Oforthecolumnstheyneedtosee.ADBAshouldbeabletospecifythestoragemechanismtobeusedina exibleway.Greenplumsupportsmanywaystopartitiontablesinordertoincreasequeryanddataloadperformance,aswellastoaidinmanaginglargedatasets.Thetop-mostlayerofpartitioningisadistributionpolicyspeciedviaaDISTRIBUTEDBYclauseintheCREATETABLEstatementthatdetermineshowtherowsofatablearedistributedacrosstheindividualnodesthatcompriseaGreenplumcluster.Whilealltableshaveadistributionpolicy,userscanoptionallyspecifyapartitioningpolicyforatable,whichseparatesthedatainthetableintoparti-tionsbyrangeorlist.Arangepartitioningpolicyletsusersspecifyanordered,non-overlappingsetofpartitionsforapartitioningcolumn,whereeachpartitionhasaSTARTandENDvalue.Alistpartitioningpolicyletsusersspecifyasetofpartitionsforacollectionofcolumns,whereeachparti-tioncorrespondstoaparticularvalue.Forexample,asalestablemaybehash-distributedoverthenodesbysales id.Oneachnode,therowsarefurtherpartitionedbyrangeintoseparatepartitionsforeachmonth,andeachofthesepartitionsissubpartitionedintothreeseparatesalesregions.Notethatthepartitioningstructureiscompletelymutable:ausercanaddnewpartitionsordropexistingpartitionsorsubpartitionsatanypoint.Partitioningisimportantforanumberofreasons.First,thequeryoptimizerisawareofthepartitioningstructure,andcananalyzepredicatestoperformpartitionexclusion:scanningonlyasubsetofthepartitionsinsteadoftheentiretable.Second,eachpartitionofatablecanhaveadierentstorageformat,tomatchtheexpectedworkload.Atypicalarrangementistopartitionbyatimestampeld,andhaveolderpartitionsbestoredinahighly-compressedappend-onlyformatwhilenewer,\hotter"partitionsarestoredinamoreupdate-friendlyformattoaccommodateauditingup-dates.Third,itenablesatomicpartitionexchange.Ratherthaninsertingdataarowatatime,ausercanuseETLorELTtostagetheirdatatoatemporarytable.Afterthedataisscrubbedandtransformed,theycanusetheALTERTABLE...EXCHANGEPARTITIONcommandtobindthetemporarytableasanewpartitionofanexistingtableinaquickatomicoperation.Thiscapabilitymakespartitioningparticularlyusefulforbusinessesthatperformbulkdataloadsonadaily,weekly,ormonthlybasis,especiallyiftheydroporarchiveolderdatatokeepsomexedsize\window"ofdataonline inthewarehouse.Thesameideaalsoallowsuserstodophysicalmigrationoftablesandstorageformatmodica-tionsinawaythatmostlyisolatesproductiontablesfromloadingandtransformationoverheads.6.3MADProgrammingAlthoughMADdesignfavorsquickimportandfrequentiterationovercarefulmodeling,itisnotintendedtore-jectstructureddatabasesperse.AsmentionedinSec-tion4,thestructureddatamanagementfeaturesofaDBMScanbeveryusefulfororganizingexperimentalresults,trialdatasets,andexperimentalwork ows.Infact,shopsthatusetoolslikeHadooptypicallyhaveDBMSsinaddition,and/orevolvelightdatabasesystemslikeHive.ButaswealsonoteinSection4,itisadvantageoustounifythestruc-turedenvironmentwiththeanalysts'favoriteprogrammingenvironments.Dataanalystscomefrommanywalksoflife.SomeareexpertsinSQL,butmanyarenot.AnalyststhatcomefromascienticormathematicalbackgroundaretypicallytrainedinstatisticalpackageslikeR,SAS,orMatlab.Thesearememory-bound,single-machinesolutions,buttheypro-videconvenientabstractionsformathprogramming,andac-cesstolibrariescontaininghundredsofstatisticalroutines.OtheranalystshavefacilitywithtraditionalprogramminglanguageslikeJava,Perl,andPython,buttypicallydonotwanttowriteparallelorI/O-centriccode.ThekindofdatabaseextensibilitypioneeredbyPostgres[19]isnolongeranexoticDBMSfeature{itisakeytomoderndataanalytics,enablingcodetorunclosetothedata.Tobeinvitingtoavarietyofprogrammers,agoodDBMSexten-sibilityinterfaceshouldaccommodatemultiplelanguages.PostgreSQLhasbecomequitepowerfulinthisregard,sup-portingawiderangeofextensionlanguagesincludingR,PythonandPerl.Greenplumtakestheseinterfacesanden-ablesthemtorundata-parallelonacluster.Thisdoesnotprovideautomaticparallelismofcourse:developersmustthinkthroughhowtheircodeworksinadata-parallelenvi-ronmentwithoutsharedmemory,aswedidinSection5.Inadditiontoworklikeourstoimplementstatisticalmeth-odsinextensibleSQL,thereisagroundswellofeorttoim-plementmethodswiththeMapReduceprogrammingparadigmpopularizedbyGoogle[4]andHadoop.Fromtheperspec-tiveofprogramminglanguagedesign,MapReduceandmod-ernSQLarequitesimilartakesonparallelism:botharedata-parallelprogrammingmodelsforshared-nothingarchi-tecturesthatprovideextensionhooks(\upcalls")tointer-ceptindividualtuplesorsetsoftupleswithinadata ow.Butasaculturalphenomenon,MapReducehascapturedtheinterestofmanydevelopersinterestedinrunninglarge-scaleanalysesonBigData,andiswidelyviewedasamoreat-tractiveprogrammingenvironmentthanSQL.AMADdatawarehouseneedstoattracttheseprogrammers,andallowthemtoenjoythefamiliarityofMapReduceprogramminginacontextthatbothintegrateswiththerestofthedataintheenterprise,andoersmoresophisticatedtoolsforman-agingdataproducts.GreenplumapproachedthischallengebyimplementingaMapReduceprogramminginterfacewhoseruntimeengineisthesamequeryexecutorusedforSQL[9].UserswriteMapandReducefunctionsinfamiliarlanguageslikePython,Perl,orR,andconnectthemupintoMapReducescriptsviaasimplecongurationle.Theycanthenexecutethesescriptsviaacommandlineinterfacethatpassesthecongu-rationandMapReducecodetotheDBMS,returningoutputtoacongurablelocation:commandline,les,orDBMStables.TheonlyrequiredDBMSinteractionisthespeci-cationofanIPaddressfortheDBMS,andauthenticationcredentials(user/password,PGPkeys,etc.)Hencedevelop-erswhoareusedtotraditionalopensourcetoolscontinuetousetheirfavoritecodeeditors,sourcecodemanagement,andshellprompts;theydonotneedtolearnaboutdatabaseutilities,SQLsyntax,schemadesign,etc.TheGreenplumexecutoraccesseslesforMapReducejobsviathesameScatter/GathertechniquethatitusesforexternaltablesinSQL.Inaddition,GreenplumMapReducescriptsinteroperatewithallthefeaturesofthedatabase,andviceversa.MapReducescriptscanusedatabasetablesorviewsastheirinputs,and/orstoretheirresultsasdatabasetablesthatcanbedirectlyaccessedviaSQL.Hencecom-plexpipelinescanevolvethatincludesomestagesinSQL,andsomeinMapReducesyntax.Executioncanbedoneen-tirelyondemand{runningtheSQLandMapReducestagesinapipeline{orviamaterializationofstepsalongthewayeitherinsideoroutsidethedatabase.Programmersofdier-entstripescaninteroperateviafamiliarinterfaces:databasetablesandviews,orMapReduceinputstreams,incorporat-ingavarietyoflanguagesfortheMapandReducefunctions,andforSQLextensionfunctions.ThiskindofinteroperabilitybetweenprogrammingmetaphorsiscriticalforMADanalytics.Itattractsanalysts{andhencedata{tothewarehouse.Itprovidesagilitytode-velopersbyfacilitatingfamiliarprogramminginterfacesandenablinginteroperabilityamongprogrammingstyles.Fi-nally,itallowsanalyststododeepdevelopmentusingthebesttoolsofthetrade,includingmanydomainspecicmod-uleswrittenfortheimplementationlanguages.InexperiencewithavarietyofGreenplumcustomers,wehavefoundthatdeveloperscomfortablewithbothSQLandMapReducewillchooseamongthem exiblyfordierenttasks.Forexample,MapReducehasprovedmoreconve-nientforwritingETLscriptsonleswheretheinputor-derisknownandshouldbeexploitedinthetransformation.MapReducealsomakesiteasytospecifytransformationsthattakeoneinputandproducemultipleoutputs{thisisalsocommoninETLsettingsthat\shred"inputrecordsandproduceastreamofoutputtupleswithmixedformats.SQL,surprisingly,hasbeenmoreconvenientthanMapRe-ducefortasksinvolvinggraphdatalikeweblinksandso-cialnetworks,sincemostofthealgorithmsinthatsetting(PageRank,ClusteringCoecients,etc.)canbecodedcom-pactlyas\self-joins"ofalinktable.7.DIRECTIONSANDREFLECTIONSTheworkinthispaperresultedfromafairlyquick,it-erativediscussionamongdata-centricpeoplewithvaryingjobdescriptionsandtraining.Theprocessofarrivingatthepaper'slessonsechoedthelessonsthemselves.Wedidnotdesignadocumentupfront,butinstead\gotMAD":webroughtmanydatapointstogether,fosteredquickiterationamongmultipleparties,andtriedtodigdeeplyintodetails.AsinMADanalysis,weexpecttoarriveatnewquestionsandnewconclusionsasmoredataisbroughttolight.Afewoftheissueswearecurrentlyconsideringincludethefollowing: