MAD Skills: New Analysis Practices for Big Data

Jeffrey Cohen (Greenplum), Brian Dolan (Fox)

Hellerstein (UC Berkeley), Caleb Welton (Greenplum)

ABSTRACT

As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we h...


warehouse can keep pace today only by being "magnetic": attracting all the data sources that crop up within an organization regardless of data quality niceties.

Agile: Data Warehousing orthodoxy is based on long-range, careful design and planning. Given growing numbers of data sources and increasingly sophisticated and mission-critical data analyses, a modern warehouse must instead allow analysts to easily ingest, digest, produce and adapt data at a rapid pace. This requires a database whose physical and logical contents can be in continuous rapid evolution.

Deep: Modern data analyses involve increasingly sophisticated statistical methods that go well beyond the rollups and drilldowns of traditional BI. Moreover, analysts often need to see both the forest and the trees in running these algorithms: they want to study enormous datasets without resorting to samples and extracts. The modern data warehouse should serve both as a deep data repository and as a sophisticated algorithmic runtime engine.

As noted by Varian, there is a growing premium on analysts with MAD skills in data analysis. These are often highly trained statisticians, who may have strong software skills but would typically rather focus on deep data analysis than database management. They need to be complemented by MAD approaches to data warehouse design and database system infrastructure. These goals raise interesting challenges that differ from the traditional focus in data warehousing research and industry.

1.1 Contributions

In this paper, we describe techniques and experience gained in developing MAD analytics for Fox Audience Network, using a large installation of the Greenplum Database system. We discuss our database design methodology that focuses on enabling an agile yet organized approach to data analysis (Section 4). We present a number of data-parallel statistical algorithms developed for this setting, which focus on modeling and comparing the densities of distributions. These include specific methods like Ordinary Least Squares, Conjugate Gradient, and Mann-Whitney U Testing, as well as general purpose techniques like matrix multiplication and Bootstrapping (Section 5). Finally, we reflect on critical database system features that enable agile design and
flexible algorithm development, including high-performance data ingress/egress, heterogeneous storage facilities, and flexible programming via both extensible SQL and MapReduce interfaces to a single system (Section 6).

Underlying our discussion are challenges to a number of points of conventional wisdom. In designing and analyzing data warehouses, we advocate the theme "Model Less, Iterate More". This challenges data warehousing orthodoxy, and argues for a shift in the locus of power from DBAs toward analysts. We describe the need for unified systems that embrace and integrate a wide variety of data-intensive programming styles, since analysts come from many walks of life. This involves moving beyond religious debates about the advantages of SQL over MapReduce, or R over Java, to focus on evolving a single parallel dataflow engine that can support a diversity of programming styles, to tackle substantive statistical analytics. Finally, we argue that many data sources and storage formats can and should be knitted together by the parallel dataflow engine. This points toward more fluid integration or consolidation of traditionally diverse tools including traditional relational databases, column stores, ETL tools, and distributed file systems.

2. BACKGROUND: IF YOU'RE NOT MAD

Data analytics is not a new area. In this section we describe standard practice and related work in Business Intelligence and large-scale data analysis, to set the context for our MAD approach.

2.1 OLAP and Data Cubes

Data Cubes and On-Line Analytic Processing (OLAP) were popularized in the 1990s, leading to intense commercial development and significant academic research. The SQL CUBE BY extension translated the basic idea of OLAP into a relational setting [8]. BI tools package these summaries into fairly intuitive "cross-tabs" visualizations. By grouping on few dimensions, the analyst sees a coarse "roll-up" bar-chart; by grouping on more dimensions they "drill down" into
finer grained detail. Statisticians use the phrase descriptive statistics for this kind of analysis, and traditionally apply the approach to the results of an experimental study. This functionality is useful for gaining intuition about a process underlying the experiment. For example, by describing the clickstream at a website one can get better intuition about underlying properties of the user population.

By contrast, inferential or inductive statistics try to directly capture the underlying properties of the population. This includes fitting models and parameters to data and computing likelihood functions. Inferential statistics require more computation than the simple summaries provided by OLAP, but provide more probabilistic power that can be used for tasks like prediction (e.g., "which users would be likely to click on this new ad?"), causality analysis ("what features of a page result in user revisits?"), and distributional comparison (e.g., "how do the buying patterns of truck owners differ from sedan owners?"). Inferential approaches are also more robust to outliers and other particulars of a given dataset. While OLAP and Data Cubes remain useful for intuition, the use of inferential statistics has become an imperative in many important automated or semi-automated business processes today, including ad placement, website optimization, and customer relationship management.

2.2 Databases and Statistical Packages

BI tools provide fairly limited statistical functionality. It is therefore standard practice in many organizations to extract portions of a database into desktop software packages: statistical packages like SAS, Matlab or R, spreadsheets like Excel, or custom code written in languages like Java.

There are various problems with this approach. First, copying out a large database extract is often much less efficient than pushing computation to the data; it is easy to get orders of magnitude performance gains by running code in the database. Second, most stat packages require their data to
fit in RAM. For large datasets, this means sampling the database to form an extract, which loses detail. In modern settings like advertising placement, microtargeting requires an understanding of even small subpopulations. Samples and synopses can lose the "long tail" in a dataset, and that is increasingly where the competition for effectiveness lies.

supervised and unsupervised learning, neural network and natural language processing. These are not techniques traditionally addressed by an RDBMS, and the implementation in Hadoop results in large data migration efforts for specific single-purpose studies. The availability of machine learning methods directly within the warehouse would offer a significant savings in time, training, and system management, and is one of the goals of the work described here.

4. MAD DATABASE DESIGN

Traditional Data Warehouse philosophy revolves around a disciplined approach to modeling information and processes in an enterprise. In the words of warehousing advocate Bill Inmon, it is an "architected environment" [13]. This view of warehousing is at odds with the magnetism and agility desired in many new analysis settings, as we describe below.

4.1 New Requirements

As data savvy people, analysts introduce a new set of requirements to a database environment. They have a deep understanding of the enterprise data and tend to be early adopters of new data sources. In the same way that systems engineers always want the latest-and-greatest hardware technologies, analysts are always hungry for new sources of data. When new data-generating business processes are launched, analysts demand the new data immediately.

These desires for speed and breadth of data raise tensions with Data Warehousing orthodoxy. Inmon describes the traditional view:

"There is no point in bringing data ... into the data warehouse environment without integrating it. If the data arrives at the data warehouse in an unintegrated state, it cannot be used to support a corporate view of data. And a corporate view of data is one of the essences of the architected environment." [13]

Unfortunately, the challenge of perfectly integrating a new data source into an "architected" warehouse is often substantial, and can hold up access to data for months or, in many cases, forever. The architectural view introduces friction into analytics, repels data sources from the warehouse, and as a result produces shallow incomplete warehouses. It is the opposite of the MAD ideal.

Given the growing sophistication of analysts and the growing value of analytics, we take the view that it is much more important to provide agility to analysts than to aspire to an elusive ideal of full integration. In fact, analysts serve as key data magnets in an organization, scouting for interesting data that should be part of the corporate big picture. They can also act as an early warning system for data quality issues. For the privilege of being the first to see the data, they are more tolerant of dirty data, and will act to apply pressure on operational data producers upstream of the warehouse to rectify the data before it arrives. Analysts typically have much higher standards for the data than a typical business unit working with BI tools. They are undaunted by big, flat fact tables that hold complete data sets, scorning samples and aggregates, which can both mask errors and lose important features in the tails of distributions.

Hence it is our experience that a good relationship with the analytics team is an excellent preventative measure for data management issues later on. Feeding their appetites and responding to their concerns improves the overall health of the warehouse.

Ultimately, the analysts produce new data products that are valuable to the enterprise. They are not just consumers, but producers of enterprise data. This requires the warehouse to be prepared to "productionalize" the data generated by the analysts into standard business reporting tools.

It is also useful, when possible, to leverage a single parallel computing platform, and push as much functionality as possible into it. This lowers the cost of operations, and eases the evolution of software from analyst experiments to production code that affects operational behavior. For example, the lifecycle of an ad placement algorithm might start in a speculative analytic task, and end as a customer facing feature. If it is a data-driven feature, it is best to have that entire lifecycle focused in a single development environment on the full enterprise dataset. In this respect we agree with a central tenet of Data Warehouse orthodoxy: there is tangible benefit to getting an organization's data into one repository. We differ on how to achieve that goal in a useful and sophisticated manner.

In sum, a healthy business should not assume an architected data warehouse, but rather an evolving structure that iterates through a continuing cycle of change:

1. The business performs analytics to identify areas of potential improvement.
2. The business either reacts to or ignores this analysis.
3. A reaction results in new or different business practices (perhaps new processes or operational systems) that typically generate new data sets.
4. Analysts incorporate new data sets into their models.
5. The business again asks itself "How can we improve?"

A healthy, competitive business will look to increase the pace of this cycle. The MAD approach we describe next is a design pattern for keeping up with that increasing pace.

4.2 Getting More MAD

The central philosophy in MAD data modeling is to get the organization's data into the warehouse as soon as possible. Secondary to that goal, the cleaning and integration of the data should be staged intelligently.

To turn these themes into practice, we advocate the three-layer approach. A Staging schema should be used when loading raw fact tables or logs. Only engineers and some analysts are permitted to manipulate data in this schema. The Production Data Warehouse schema holds the aggregates that serve most users. More sophisticated users comfortable in a large-scale SQL environment are given access to this schema. A separate Reporting schema is maintained to hold specialized, static aggregates that support reporting tools and casual users. It should be tuned to provide rapid access to modest amounts of data.

These three layers are not physically separated. Users with the correct permissions are able to cross-join between layers and schemas. In the FAN model, the Staging schema holds raw action logs. Analysts are given access to these logs for research purposes and to encourage a laboratory approach to data analysis. Questions that start at the event log level often become broader in scope, allowing custom aggregates. Communication between the researchers and the DBAs uncovers common questions and often results in aggregates that were originally personalized for an analyst

This will produce a square matrix and a vector based on the size of the independent variable vector. The
final calculation is computed by inverting the small $A$ matrix and multiplying by the vector to derive the coefficients $\beta^*$.

Additionally, the coefficient of determination $R^2$ can be calculated concurrently by

$$SSR = b'\beta^* - \frac{1}{n}\Big(\sum y_i\Big)^2, \qquad TSS = \sum y_i^2 - \frac{1}{n}\Big(\sum y_i\Big)^2, \qquad R^2 = \frac{SSR}{TSS}$$

In the following SQL query, we compute the coefficients $\beta^*$, as well as the components of the coefficient of determination:

    CREATE VIEW ols AS
    SELECT pseudo_inverse(A) * b as beta_star,
           (transpose(b) * (pseudo_inverse(A) * b)
              - sum_y2/n)            -- SSR
           / (sum_yy - sum_y2/n)     -- TSS
           as r_squared
    FROM (
      SELECT sum(transpose(d.vector) * d.vector) as A,
             sum(d.vector * y) as b,
             sum(y)^2 as sum_y2,
             sum(y^2) as sum_yy,
             count(*) as n
      FROM design d
    ) ols_aggs;

Note the use of a user-defined function for vector transpose, and user-defined aggregates for summation of (multidimensional) array objects. The array A is a small in-memory matrix that we treat as a single object; the pseudo-inverse function implements the textbook Moore-Penrose pseudoinverse of the matrix.

All of the above can be efficiently calculated in a single pass of the data. For convenience, we encapsulated this yet further via two user-defined aggregate functions:

    SELECT ols_coef(d.y, d.vector), ols_r2(d.y, d.vector)
    FROM design d;

Prior to the implementation of this functionality within the DBMS, one Greenplum customer was accustomed to calculating the OLS by exporting data and importing the data into R for calculation, a process that took several hours to complete. They reported significant performance improvement when they moved to running the regression within the DBMS. Most of the benefit derived from running the analysis in parallel close to the data with minimal data movement.

5.2.2 Conjugate Gradient

In this subsection we develop a data-parallel implementation of the Conjugate Gradient method for solving a system of linear equations. We can use this to implement Support Vector Machines, a state-of-the-art technique for binary classification. Binary classifiers are a common tool in modern ad placement, used to turn complex multi-dimensional user features into simple boolean labels like "is a car enthusiast" that can be combined into enthusiast charts. In addition to serving as a building block for SVMs, the Conjugate Gradient method allows us to optimize a large class of functions that can be approximated by second order Taylor expansions.

To a mathematician, the solution to the matrix equation $Ax = b$ is simple when it exists: $x = A^{-1}b$. As noted in Section 5.1, we cannot assume we can find $A^{-1}$. If matrix $A$ is $n \times n$ symmetric and positive definite (SPD), we can use the Conjugate Gradient method. This method requires neither $df(y)$ nor $A^{-1}$ and converges in no more than $n$ iterations. A general treatment is given in [17]. Here we outline the solution to $Ax = b$ as an extremum of $f(x) = \frac{1}{2}x'Ax - b'x + c$. Broadly, we have an estimate $\hat{x}$ to our solution $x$. Since $\hat{x}$ is only an estimate, $r_0 = A\hat{x} - b$ is non-zero. Subtracting this error $r_0$ from the estimate allows us to generate a series $p_i = r_{i-1} - \{A\hat{x} - b\}$ of orthogonal vectors. The solution will be $x = \sum_i \alpha_i p_i$ for $\alpha_i$ defined below. We end at the point $\|r_k\|_2 < \epsilon$ for a suitable $\epsilon$. There are several update algorithms; we have written ours in matrix notation.

$$r_0 = b - A\hat{x}_0, \qquad \alpha_0 = \frac{r_0' r_0}{v_0' A v_0}, \qquad v_0 = r_0, \qquad i = 0$$

Begin iteration over $i$.
$$\alpha_i = \frac{r_i' r_i}{v_i' A v_i}$$
$$x_{i+1} = x_i + \alpha_i v_i$$
$$r_{i+1} = r_i - \alpha_i A v_i$$
$$\text{check } \|r_{i+1}\|_2 < \epsilon$$
$$v_{i+1} = r_{i+1} + \frac{r_{i+1}' r_{i+1}}{r_i' r_i} v_i$$

To incorporate this method into the database, we stored $(v_i, x_i, r_i, \alpha_i)$ as a row and inserted row $i+1$ in one pass. This required the construction of functions update_alpha(r_i, p_i, A), update_x(x_i, alpha_i, v_i), update_r(x_i, alpha_i, v_i, A), and update_v(r_i, alpha_i, v_i, A). Though the function calls were redundant (for instance, update_v() also runs the update of $r_{i+1}$), this allowed us to insert one full row at a time. An external driver process then checks the value of $r_i$ before proceeding. Upon convergence, it is rudimentary to compute $x$.

The presence of the conjugate gradient method enables even more sophisticated techniques like Support Vector Machines (SVM). At their core, SVMs seek to maximize the distance between a set of points and a candidate hyperplane. This distance is denoted by the magnitude of the normal vectors $\|w\|_2$. Most methods incorporate the integers $\{0, 1\}$ as labels $c$, so the problem becomes

$$\arg\max_{w,b} f(w) = \frac{1}{2}\|w\|^2, \quad \text{subject to } c'w - b \ge 0.$$

This method applies to the more general issue of high dimensional functions under a Taylor expansion

$$f_{x_0}(x) \approx f(x_0) + df(x)(x - x_0) + \frac{1}{2}(x - x_0)'\, d^2 f(x)\,(x - x_0)$$

With a good initial guess for $x$ and the common assumption of continuity of $f(\cdot)$, we know the matrix will be SPD near $x$. See [17] for details.

5.3 Functionals

Basic statistics are not new to relational databases: most support means, variances and some form of quantiles. But modeling and comparative statistics are not typically built-in functionality. In this section we provide data-parallel implementations of a number of comparative statistics expressed in SQL.

In the previous section, scalars or vectors were the atomic unit. Here a probability density function is the foundational object. For instance the Normal (Gaussian) density $f(x) \propto e^{-(x-\mu)^2 / 2\sigma^2}$ is considered by mathematicians as a single "entity" with two attributes: the mean and variance. A common statistical question is to see how well a data set fits to a target density function. The z-score of a datum $x$ is given by $z(x) = \frac{x - \mu}{\sigma / \sqrt{n}}$ and is easy to obtain in standard SQL.

    SELECT x.value, (x.value - d.mu) * sqrt(d.n) / d.sigma AS z_score
    FROM x, design d

5.3.1 Mann-Whitney U Test

Rank and order statistics are quite amenable to relational treatments, since their main purpose is to evaluate a set of data, rather than one datum at a time. The next example illustrates the notion of comparing two entire sets of data without the overhead of describing a parameterized density.

The Mann-Whitney U Test (MWU) is a popular substitute for Student's t-test in the case of non-parametric data. The general idea is to take two populations A and B and decide if they are from the same underlying population by examining the rank order in which members of A and B show up in a general ordering. The cartoon is that if members of A are at the "front" of the line and members of B are at the "back" of the line, then A and B are different populations. In an advertising setting, click-through rates for web ads tend to defy simple parametric models like Gaussians or log-normal distributions. But it is often useful to compare click-through rate distributions for different ad campaigns, e.g., to choose one with a better median click-through. MWU addresses this task.

Given a table T with columns SAMPLE_ID, VALUE, row numbers are obtained and summed via SQL windowing functions.

    CREATE VIEW R AS
    SELECT sample_id, avg(value) AS sample_avg,
           sum(rown) AS rank_sum, count(*) AS sample_n,
           sum(rown) - count(*) * (count(*) + 1) / 2 AS sample_u
    FROM (SELECT sample_id,
                 row_number() OVER (ORDER BY value DESC) AS rown,
                 value
          FROM T) AS ordered
    GROUP BY sample_id

Assuming the condition of large sample sizes, for instance greater than 5,000, the normal approximation can be justified. Using the previous view R, the final reported statistics are given by

    SELECT r.sample_u, r.sample_avg, r.sample_n,
           (r.sample_u - a.sum_u / 2) /
             sqrt(a.sum_u * (a.sum_n + 1) / 12) AS z_score
    FROM R as r,
         (SELECT sum(sample_u) AS sum_u, sum(sample_n) AS sum_n
          FROM R) AS a
    GROUP BY r.sample_u, r.sample_avg, r.sample_n, a.sum_n, a.sum_u

The end result is a small set of numbers that describe a relationship of functions. This simple routine can be encapsulated by stored procedures and made available to the analysts via a simple SELECT mann_whitney(value) FROM table call, elevating the vocabulary of the database tremendously.

5.3.2 Log-Likelihood Ratios

Likelihood ratios are useful for comparing a subpopulation to an overall population on a particular attribute. As an example in advertising, consider two attributes of users: beverage choice, and family status. One might want to know whether coffee attracts new parents more than the general population.

This is a case of having two density (or mass) functions for the same data set $X$. Denote one distribution as null hypothesis $f_0$ and the other as alternate $f_A$. Typically, $f_0$ and $f_A$ are different parameterizations of the same density, for instance $N(\mu_0, \sigma_0)$ and $N(\mu_A, \sigma_A)$. The likelihood $L$ under $f_i$ is given by

$$L_{f_i} = L(X \mid f_i) = \prod_k f_i(x_k).$$

The log-likelihood ratio is given by the quotient $-2 \log (L_{f_0} / L_{f_A})$. Taking the log allows us to use the well-known $\chi^2$ approximation for large $n$. Also, the products turn nicely into sums and an RDBMS can handle it easily in parallel.

$$LLR = 2 \sum_k \log f_A(x_k) - 2 \sum_k \log f_0(x_k).$$

This calculation distributes nicely if $f_i : \mathbb{R} \to \mathbb{R}$, which most do. If $f_i : \mathbb{R}^n \to \mathbb{R}$, then care must be taken in managing the vectors as distributed objects. Suppose the values are in table T and the function $f_A(\cdot)$ has been written as a user-defined function f_llk(x numeric, param numeric). Then the entire experiment can be performed with the call

    SELECT 2 * sum(log(f_llk(T.value, d.alt_param))) -
           2 * sum(log(f_llk(T.value, d.null_param))) AS llr
    FROM T, design AS d

This represents a significant gain in flexibility and sophistication for any RDBMS.

Example: The Multinomial Distribution

The multinomial distribution extends the binomial distribution. Consider a random variable $X$ with $k$ discrete outcomes. These have probabilities $p = (p_1, \ldots, p_k)$. In $n$ trials, the joint probability distribution is given by

$$P(X \mid p) = \binom{n}{n_1, \ldots, n_k} p_1^{n_1} \cdots p_k^{n_k}.$$

To obtain $p_i$, we assume a table input with column outcome representing the base population.

    CREATE VIEW B AS
    SELECT outcome,
           outcome_count / sum(outcome_count) over () AS p
    FROM (SELECT outcome, count(*)::numeric AS outcome_count
          FROM input
          GROUP BY outcome) AS a

In the context of model selection, it is often convenient to compare the same data set under two different multinomial distributions.

$$LLR = -2 \log \frac{P(X \mid p)}{P(X \mid \tilde{p})} = -2 \log \frac{\binom{n}{n_1, \ldots, n_k} p_1^{n_1} \cdots p_k^{n_k}}{\binom{n}{n_1, \ldots, n_k} \tilde{p}_1^{n_1} \cdots \tilde{p}_k^{n_k}} = 2 \sum_i n_i \log \tilde{p}_i - 2 \sum_i n_i \log p_i.$$

Or in SQL:

discussed in the context of data integration [18]. But the focus in a MAD warehouse context is on massively parallel access to file data that lives on a local high-speed network.

Greenplum implements fully parallel access for both loading and query processing over external tables via a technique called Scatter/Gather Streaming. The idea is similar to traditional shared-nothing database internals [7], but requires coordination with external processes to "feed" all the DBMS nodes in parallel. As the data is streamed into the system it can be landed in database tables for subsequent access, or used directly as a purely external table with parallel I/O. Using this technology, Greenplum customers have reported loading speeds for a fully-mirrored, production database in excess of four terabytes per hour with negligible impact on concurrent database operations.

6.1.1 ETL and ELT

Traditional data warehousing is supported by custom tools for the Extract-Transform-Load (ETL) task. In recent years, there is increasing pressure to push the work of transformation into the DBMS, to enable parallel execution via SQL transformation scripts. This approach has been dubbed ELT since transformation is done after loading. The ELT approach becomes even more natural with external tables. Transformation queries can be written against external tables, removing the need to ever load untransformed data. This can speed up the design loop for transformations substantially, especially when combined with SQL's LIMIT clause as a "poor man's Online Aggregation" [11] to debug transformations.

In addition to transformations written in SQL, Greenplum supports MapReduce scripting in the DBMS, which can run over either external data via Scatter/Gather, or in-database tables (Section 6.3). This allows programmers to write transformation scripts in the dataflow-style programming used by many ETL tools, while running at scale using the DBMS' facilities for parallelism.

6.2 Data Evolution: Storage and Partitioning

The data lifecycle in a MAD warehouse includes data in various states. When a data source is first brought into the system, analysts will typically iterate over it frequently with significant analysis and transformation. As transformations and table definitions begin to settle for a particular data source, the workload looks more like traditional EDW settings: frequent appends to large "fact" tables, and occasional updates to "detail" tables. This mature data is likely to be used for ad-hoc analysis as well as for standard reporting tasks. As data in the "fact" tables ages over time, it may be accessed less frequently or even "rolled off" to an external archive. Note that all these stages co-occur in a single warehouse at a given time.

Hence a good DBMS for MAD analytics needs to support multiple storage mechanisms, targeted at different stages of the data lifecycle. In the early stage, external tables provide a lightweight approach to experiment with transformations. Detail tables are often modest in size and undergo periodic updates; they are well served by traditional transactional storage techniques. Append-mostly fact tables can be better served by compressed storage, which can handle appends and reads efficiently, at the expense of making updates slower. It should be possible to roll this data off of the warehouse as it ages, without disrupting ongoing processing.

Greenplum provides multiple storage engines, with a rich SQL partitioning specification to apply them flexibly across and within tables. As mentioned above, Greenplum includes external table support. Greenplum also provides a traditional "heap" storage format for data that sees frequent updates, and a highly-compressed "append-only" (AO) table feature for data that is not going to be updated; both are integrated within a transactional framework. Greenplum AO storage units can have a variety of compression modes. At one extreme, with compression off, bulk loads run very quickly. Alternatively, the most aggressive compression modes are tuned to use as little space as possible. There is also a middle ground with "medium" compression to provide improved table scan time at the expense of slightly slower loads. In a recent version Greenplum also adds "column-store" partitioning of append-only tables, akin to ideas in the literature [20]. This can improve compression, and ensures that queries over large archival tables only do I/O for the columns they need to see.

A DBA should be able to specify the storage mechanism to be used in a
flexible way. Greenplum supports many ways to partition tables in order to increase query and data load performance, as well as to aid in managing large data sets. The top-most layer of partitioning is a distribution policy specified via a DISTRIBUTED BY clause in the CREATE TABLE statement that determines how the rows of a table are distributed across the individual nodes that comprise a Greenplum cluster. While all tables have a distribution policy, users can optionally specify a partitioning policy for a table, which separates the data in the table into partitions by range or list. A range partitioning policy lets users specify an ordered, non-overlapping set of partitions for a partitioning column, where each partition has a START and END value. A list partitioning policy lets users specify a set of partitions for a collection of columns, where each partition corresponds to a particular value. For example, a sales table may be hash-distributed over the nodes by sales_id. On each node, the rows are further partitioned by range into separate partitions for each month, and each of these partitions is subpartitioned into three separate sales regions. Note that the partitioning structure is completely mutable: a user can add new partitions or drop existing partitions or subpartitions at any point.

Partitioning is important for a number of reasons. First, the query optimizer is aware of the partitioning structure, and can analyze predicates to perform partition exclusion: scanning only a subset of the partitions instead of the entire table. Second, each partition of a table can have a different storage format, to match the expected workload. A typical arrangement is to partition by a timestamp
field, and have older partitions be stored in a highly-compressed append-only format while newer, "hotter" partitions are stored in a more update-friendly format to accommodate auditing updates. Third, it enables atomic partition exchange. Rather than inserting data a row at a time, a user can use ETL or ELT to stage their data to a temporary table. After the data is scrubbed and transformed, they can use the ALTER TABLE ... EXCHANGE PARTITION command to bind the temporary table as a new partition of an existing table in a quick atomic operation. This capability makes partitioning particularly useful for businesses that perform bulk data loads on a daily, weekly, or monthly basis, especially if they drop or archive older data to keep some fixed size "window" of data online in the warehouse. The same idea also allows users to do physical migration of tables and storage format modifications in a way that mostly isolates production tables from loading and transformation overheads.

6.3 MAD Programming

Although MAD design favors quick import and frequent iteration over careful modeling, it is not intended to reject structured databases per se. As mentioned in Section 4, the structured data management features of a DBMS can be very useful for organizing experimental results, trial datasets, and experimental workflows. In fact, shops that use tools like Hadoop typically have DBMSs in addition, and/or evolve light database systems like Hive. But as we also note in Section 4, it is advantageous to unify the structured environment with the analysts' favorite programming environments.

Data analysts come from many walks of life. Some are experts in SQL, but many are not. Analysts that come from a scientific or mathematical background are typically trained in statistical packages like R, SAS, or Matlab. These are memory-bound, single-machine solutions, but they provide convenient abstractions for math programming, and access to libraries containing hundreds of statistical routines. Other analysts have facility with traditional programming languages like Java, Perl, and Python, but typically do not want to write parallel or I/O-centric code.

The kind of database extensibility pioneered by Postgres [19] is no longer an exotic DBMS feature; it is a key to modern data analytics, enabling code to run close to the data. To be inviting to a variety of programmers, a good DBMS extensibility interface should accommodate multiple languages. PostgreSQL has become quite powerful in this regard, supporting a wide range of extension languages including R, Python and Perl. Greenplum takes these interfaces and enables them to run data-parallel on a cluster. This does not provide automatic parallelism of course: developers must think through how their code works in a data-parallel environment without shared memory, as we did in Section 5.

In addition to work like ours to implement statistical methods in extensible SQL, there is a groundswell of effort to implement methods with the MapReduce programming paradigm popularized by Google [4] and Hadoop. From the perspective of programming language design, MapReduce and modern SQL are quite similar takes on parallelism: both are data-parallel programming models for shared-nothing architectures that provide extension hooks ("upcalls") to intercept individual tuples or sets of tuples within a dataflow. But as a cultural phenomenon, MapReduce has captured the interest of many developers interested in running large-scale analyses on Big Data, and is widely viewed as a more attractive programming environment than SQL. A MAD data warehouse needs to attract these programmers, and allow them to enjoy the familiarity of MapReduce programming in a context that both integrates with the rest of the data in the enterprise, and offers more sophisticated tools for managing data products.

Greenplum approached this challenge by implementing a MapReduce programming interface whose runtime engine is the same query executor used for SQL [9]. Users write Map and Reduce functions in familiar languages like Python, Perl, or R, and connect them up into MapReduce scripts via a simple configuration file. They can then execute these scripts via a command line interface that passes the configuration and MapReduce code to the DBMS, returning output to a configurable location: command line, files, or DBMS tables. The only required DBMS interaction is the specification of an IP address for the DBMS, and authentication credentials (user/password, PGP keys, etc.). Hence developers who are used to traditional open source tools continue to use their favorite code editors, source code management, and shell prompts; they do not need to learn about database utilities, SQL syntax, schema design, etc.

The Greenplum executor accesses files for MapReduce jobs via the same Scatter/Gather technique that it uses for external tables in SQL. In addition, Greenplum MapReduce scripts interoperate with all the features of the database, and vice versa. MapReduce scripts can use database tables or views as their inputs, and/or store their results as database tables that can be directly accessed via SQL. Hence complex pipelines can evolve that include some stages in SQL, and some in MapReduce syntax. Execution can be done entirely on demand (running the SQL and MapReduce stages in a pipeline) or via materialization of steps along the way either inside or outside the database. Programmers of different stripes can interoperate via familiar interfaces: database tables and views, or MapReduce input streams, incorporating a variety of languages for the Map and Reduce functions, and for SQL extension functions.

This kind of interoperability between programming metaphors is critical for MAD analytics. It attracts analysts, and hence data, to the warehouse. It provides agility to developers by facilitating familiar programming interfaces and enabling interoperability among programming styles. Finally, it allows analysts to do deep development using the best tools of the trade, including many domain specific modules written for the implementation languages.

In experience with a variety of Greenplum customers, we have found that developers comfortable with both SQL and MapReduce will choose among them flexibly for different tasks. For example, MapReduce has proved more convenient for writing ETL scripts on files where the input order is known and should be exploited in the transformation. MapReduce also makes it easy to specify transformations that take one input and produce multiple outputs; this is also common in ETL settings that "shred" input records and produce a stream of output tuples with mixed formats. SQL, surprisingly, has been more convenient than MapReduce for tasks involving graph data like web links and social networks, since most of the algorithms in that setting (PageRank, Clustering Coefficients, etc.) can be coded compactly as "self-joins" of a link table.

7. DIRECTIONS AND REFLECTIONS

The work in this paper resulted from a fairly quick, iterative discussion among data-centric people with varying job descriptions and training. The process of arriving at the paper's lessons echoed the lessons themselves. We did not design a document up front, but instead "got MAD": we brought many datapoints together, fostered quick iteration among multiple parties, and tried to dig deeply into details. As in MAD analysis, we expect to arrive at new questions and new conclusions as more data is brought to light. A few of the issues we are currently considering include the following: