/
MAD Skills New Analysis Practices for Big Data Jeffrey Cohen Greenplum Brian Dolan Fox Audience Network Mark Dunlap Evergreen Technologies Joseph M
MAD Skills New Analysis Practices for Big Data Jeffrey Cohen Greenplum Brian Dolan Fox Audience Network Mark Dunlap Evergreen Technologies Joseph M

MAD Skills New Analysis Practices for Big Data Jeffrey Cohen Greenplum Brian Dolan Fox Audience Network Mark Dunlap Evergreen Technologies Joseph M - PDF document

myesha-ticknor
myesha-ticknor . @myesha-ticknor
Follow
130 views | Public

MAD Skills New Analysis Practices for Big Data Jeffrey Cohen Greenplum Brian Dolan Fox Audience Network Mark Dunlap Evergreen Technologies Joseph M - Description

Hellerstein UC Berkeley Caleb Welton Greenplum ABSTRACT As massive data acquisition and storage becomes increas ingly a64256ordable a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis In this paper we h ID: 5223 Download Pdf

Tags :

Uncharted 4 Hellerstein Berkeley Caleb

Share:

Link:

Embed:

Please download the presentation from below link :


Download Pdf - The PPT/PDF document "MAD Skills New Analysis Practices for Bi..." is the property of its rightful owner. Permission is granted to download and print the materials on this web site for personal, non-commercial use only, and to display it on your personal computer provided you do not modify the materials and that you retain all copyright notices contained in the materials. By downloading content from our website, you accept the terms of this agreement.

Presentation on theme: "MAD Skills New Analysis Practices for Big Data Jeffrey Cohen Greenplum Brian Dolan Fox Audience Network Mark Dunlap Evergreen Technologies Joseph M"— Presentation transcript

housecankeeppacetodayonlybybeing\magnetic":attractingallthedatasourcesthatcropupwithinanorganizationregardlessofdataqualityniceties.Agile:DataWarehousingorthodoxyisbasedonlong-range,carefuldesignandplanning.Givengrowingnumbersofdatasourcesandincreasinglysophisticatedandmission-criticaldataanalyses,amodernwarehousemustinsteadallowanalyststoeasilyingest,digest,produceandadaptdataatarapidpace.Thisrequiresadatabasewhosephysicalandlogicalcontentscanbeincontinuousrapidevolution.Deep:Moderndataanalysesinvolveincreasinglyso-phisticatedstatisticalmethodsthatgowellbeyondtherollupsanddrilldownsoftraditionalBI.Moreover,an-alystsoftenneedtoseeboththeforestandthetreesinrunningthesealgorithms{theywanttostudyenor-mousdatasetswithoutresortingtosamplesandex-tracts.Themoderndatawarehouseshouldservebothasadeepdatarepositoryandasasophisticatedalgo-rithmicruntimeengine.AsnotedbyVarian,thereisagrowingpremiumonan-alystswithMADskillsindataanalysis.Theseareoftenhighlytrainedstatisticians,whomayhavestrongsoftwareskillsbutwouldtypicallyratherfocusondeepdataanaly-sisthandatabasemanagement.Theyneedtobecomple-mentedbyMADapproachestodatawarehousedesignanddatabasesysteminfrastructure.Thesegoalsraiseinterest-ingchallengesthataredi erentthanthetraditionalfocusinthedatawarehousingresearchandindustry.1.1ContributionsInthispaper,wedescribetechniquesandexperienceswehavedevelopedinourdevelopmentofMADanalyticsforFoxAudienceNetwork,usingalargeinstallationoftheGreen-plumDatabasesystem.Wediscussourdatabasedesignmethodologythatfocusesonenablinganagileyetorganizedapproachtodataanalysis(Section4).Wepresentanumberofdata-parallelstatisticalalgorithmsdevelopedforthisset-ting,whichfocusonmodelingandcomparingthedensitiesofdistributions.Theseincludespeci cmethodslikeOrdinaryLeastSquares,ConjugateGradiant,andMann-WhitneyUTesting,aswellasgeneralpurposetechniqueslikematrixmultiplicationandBootstrapping(Section5).Finally,were ectoncriticaldatabasesystemfeaturesthatenableagiledesignand exiblealgorithmdevelopment,includinghigh-performancedataingress/egress,heterogeneousstoragefa-cilities,and exibleprogrammingviabothextensibleSQLandMapReduceinterfacestoasinglesystem(Section6).Underlyingourdiscussionarechallengestoanumberofpointsofconventionalwisdom.Indesigningandanalyzingdatawarehouses,weadvocatethetheme\ModelLess,It-erateMore".Thischallengesdatawarehousingorthodoxy,andarguesforashiftinthelocusofpowerfromDBAsto-wardanalysts.Wedescribetheneedforuni edsystemsthatembraceandintegrateawidevarietyofdata-intensiveprogrammingstyles,sinceanalystscomefrommanywalksoflife.ThisinvolvesmovingbeyondreligiousdebatesabouttheadvantagesofSQLoverMapReduce,orRoverJava,tofocusonevolvingasingleparalleldata owenginethatcansupportadiversityofprogrammingstyles,totacklesub-stantivestatisticalanalytics.Finally,wearguethatmanydatasourcesandstorageformatscanandshouldbeknit-tedtogetherbytheparalleldata owengine.Thispointstowardmore uidintegrationorconsolidationoftradition-allydiversetoolsincludingtraditionalrelationaldatabases,columnstores,ETLtools,anddistributed lesystems.2.BACKGROUND:IFYOU'RENOTMADDataanalyticsisnotanewarea.Inthissectionwede-scribestandardpracticeandrelatedworkinBusinessIntel-ligenceandlarge-scaledataanalysis,tosetthecontextforourMADapproach.2.1OLAPandDataCubesDataCubesandOn-LineAnalyticProcessing(OLAP)werepopularizedinthe1990's,leadingtointensecommer-cialdevelopmentandsigni cantacademicresearch.TheSQLCUBEBYextensiontranslatedthebasicideaofOLAPintoarelationalsetting[8].BItoolspackagethesesum-mariesintofairlyintuitive\cross-tabs"visualizations.Bygroupingonfewdimensions,theanalystseesacoarse\roll-up"bar-chart;bygroupingonmoredimensionsthey\drilldown"into nergraineddetail.Statisticiansusethephrasedescriptivestatisticsforthiskindofanalysis,andtradition-allyapplytheapproachtotheresultsofanexperimentalstudy.Thisfunctionalityisusefulforgainingintuitionaboutaprocessunderlyingtheexperiment.Forexample,byde-scribingtheclickstreamatawebsiteonecangetbetterinu-itionaboutunderlyingpropertiesoftheuserpopulation.Bycontrast,inferentialorinductivestatisticstrytodi-rectlycapturetheunderlyingpropertiesofthepopulation.Thisincludes ttingmodelsandparameterstodataandcomputinglikelihoodfunctions.InferentialstatisticsrequiremorecomputationthanthesimplesummariesprovidedbyOLAP,butprovidemoreprobabilisticpowerthatcanbeusedfortaskslikeprediction(e.g.,\whichuserswouldbelikelytoclickonthisnewad?"),causalityanalysis(\whatfeaturesofapageresultinuserrevisits?"),anddistribu-tionalcomparison(e.g.,\howdothebuyingpatternsoftruckownersdi erfromsedanowners?")Inferentialap-proachesarealsomorerobusttooutliersandotherpar-ticularsofagivendataset.WhileOLAPandDataCubesremainusefulforintuition,theuseofinferentialstatisticshasbecomeanimperativeinmanyimportantautomatedorsemi-automatedbusinessprocessestoday,includingadplacement,websiteoptimization,andcustomerrelationshipmanagement.2.2DatabasesandStatisticalPackagesBItoolsprovidefairlylimitedstatisticalfunctionality.Itisthereforestandardpracticeinmanyorganizationstoex-tractportionsofadatabaseintodesktopsoftwarepackages:statisticalpackagelikeSAS,MatlaborR,spreadsheetslikeExcel,orcustomcodewritteninlanguageslikeJava.Therearevariousproblemswiththisapproach.First,copyingoutalargedatabaseextractisoftenmuchlesse-cientthanpushingcomputationtothedata;itiseasytogetordersofmagnitudeperformancegainsbyrunningcodeinthedatabase.Second,moststatpackagesrequiretheirdatato tinRAM.Forlargedatasets,thismeanssamplingthedatabasetoformanextract,whichlosesdetail.Inmodernsettingslikeadvertisingplacement,microtargetingrequiresanunderstandingofevensmallsubpopulations.Samplesandsynopsescanlosethe\longtail"inadataset,andthatisincreasinglywherethecompetitionfore ectivenesslies. supervisedandunsupervisedlearning,neuralnetworkandnaturallanguageprocessing.Thesearenottechniquestra-ditionallyaddressedbyanRDBMS,andtheimplementationinHadoopresultsinlargedatamigratione ortsforspeci csingle-purposestudies.Theavailabilityofmachinelearningmethodsdirectlywithinthewarehousewouldo erasigni -cantsavingsintime,training,andsystemmanagement,andisoneofthegoalsoftheworkdescribedhere.4.MADDATABASEDESIGNTraditionalDataWarehousephilosophyrevolvesaroundadisciplinedapproachtomodelinginformationandprocessesinanenterprise.InthewordsofwarehousingadvocateBillInmon,itisan\architectedenvironment"[13].Thisviewofwarehousingisatoddswiththemagnetismandagilitydesiredinmanynewanalysissettings,aswedescribebelow.4.1NewRequirementsAsdatasavvypeople,analystsintroduceanewsetofre-quirementstoadatabaseenvironment.Theyhaveadeepunderstandingoftheenterprisedataandtendtobeearlyadoptersofnewdatasources.Inthesamewaythatsys-temsengineersalwayswantthelatest-and-greatesthardwaretechnologies,analystsarealwayshungryfornewsourcesofdata.Whennewdata-generatingbusinessprocessesarelaunched,analystsdemandthenewdataimmediately.Thesedesiresforspeedandbreadthofdataraiseten-sionswithDataWarehousingorthodoxy.Inmondescribesthetraditionalview:Thereisnopointinbringingdata...intothedatawarehouseenvironmentwithoutintegratingit.Ifthedataarrivesatthedatawarehouseinanunintegratedstate,itcannotbeusedtosupportacorporateviewofdata.Andacorporateviewofdataisoneoftheessencesofthearchitectedenvironment.[13]Unfortunately,thechallengeofperfectlyintegratinganewdatasourceintoan\architected"warehouseisoftensub-stantial,andcanholdupaccesstodataformonths{orinmanycases,forever.Thearchitecturalviewintroducesfric-tionintoanalytics,repelsdatasourcesfromthewarehouse,andasaresultproducesshallowincompletewarehouses.ItistheoppositeoftheMADideal.Giventhegrowingsophisticationofanalystsandthegrow-ingvalueofanalytics,wetaketheviewthatitismuchmoreimportanttoprovideagilitytoanalyststhantoaspiretoanelusiveidealoffullintegration.Infact,analystsserveaskeydatamagnetsinanorganization,scoutingforinterest-ingdatathatshouldbepartofthecorporatebigpicture.Theycanalsoactasanearlywarningsystemfordataqual-ityissues.Fortheprivilegeofbeingthe rsttoseethedata,theyaremoretolerantofdirtydata,andwillacttoapplypressureonoperationaldataproducersupstreamofthewarehousetorectifythedatabeforeitarrives.AnalyststypicallyhavemuchhigherstandardsforthedatathanatypicalbusinessunitworkingwithBItools.Theyareun-dauntedbybig, atfacttablesthatholdcompletedatasets,scorningsamplesandaggregates,whichcanbothmasker-rorsandloseimportantfeaturesinthetailsofdistributions.Henceitisourexperiencethatagoodrelationshipwiththeanalyticsteamisanexcellentpreventativemeasurefordatamanagementissueslateron.Feedingtheirappetitesandrespondingtotheirconcernsimprovestheoverallhealthofthewarehouse.Ultimately,theanalystsproducenewdataproductsthatarevaluabletotheenterprise.Theyarenotjustconsumers,butproducersofenterprisedata.Thisrequirestheware-housetobepreparedto\productionalize"thedatagener-atedbytheanalystsintostandardbusinessreportingtools.Itisalsouseful,whenpossible,toleverageasinglepar-allelcomputingplatform,andpushasmuchfunctionalityaspossibleintoit.Thislowersthecostofoperations,andeasestheevolutionofsoftwarefromanalystexperimentstoproductioncodethata ectsoperationalbehavior.Forex-ample,thelifecycleofanadplacementalgorithmmightstartinaspeculativeanalytictask,andendasacustomerfac-ingfeature.Ifitisadata-drivenfeature,itisbesttohavethatentirelifecyclefocusedinasingledevelopmentenviron-mentonthefullenterprisedataset.InthisrespectweagreewithacentraltenetofDataWarehouseorthodoxy:thereistangiblebene ttogettinganorganization'sdataintoonerepository.Wedi eronhowtoachievethatgoalinausefulandsophisticatedmanner.Insum,ahealthybusinessshouldnotassumeanarchi-tecteddatawarehouse,butratheranevolvingstructurethatiteratesthroughacontinuingcycleofchange:1.Thebusinessperformsanalyticstoidentifyareasofpotentialimprovement.2.Thebusinesseitherreactstoorignoresthisanalysis.3.Areactionresultsinnewordi erentbusinesspractices{perhapsnewprocessesoroperationalsystems{thattypicallygeneratenewdatasets.4.Analystsincorporatenewdatasetsintotheirmodels.5.Thebusinessagainasksitself\Howcanweimprove?"Ahealthy,competitivebusinesswilllooktoincreasethepaceofthiscycle.TheMADapproachwedescribenextisadesignpatternforkeepingupwiththatincreasingpace.4.2GettingMoreMADThecentralphilosophyinMADdatamodelingistogettheorganization'sdataintothewarehouseassoonaspossi-ble.Secondarytothatgoal,thecleaningandintegrationofthedatashouldbestagedintelligently.Toturnthesethemesintopractice,weadvocatethethree-layerapproach.AStagingschemashouldbeusedwhenloadingrawfacttablesorlogs.Onlyengineersandsomeanalystsarepermittedtomanipulatedatainthisschema.TheProductionDataWarehouseschemaholdstheaggre-gatesthatservemostusers.Moresophisticateduserscom-fortableinalarge-scaleSQLenvironmentaregivenaccesstothisschema.AseparateReportingschemaismaintainedtoholdspecialized,staticaggregatesthatsupportreportingtoolsandcasualusers.Itshouldbetunedtoproviderapidaccesstomodestamountsofdata.Thesethreelayersarenotphysicallyseparated.Userswiththecorrectpermissionsareabletocross-joinbetweenlayersandschemas.IntheFANmodel,theStagingschemaholdsrawactionlogs.Analystsaregivenaccesstotheselogsforresearchpurposesandtoencouragealaboratoryapproachtodataanalysis.Questionsthatstartattheeventlogleveloftenbecomebroaderinscope,allowingcustomaggregates.CommunicationbetweentheresearchersandtheDBAsuncoverscommonquestionsandoftenresultsinaggregatesthatwereoriginallypersonalizedforananalyst Thiswillproduceasquarematrixandavectorbasedonthesizeoftheindependentvariablevector.The nalcal-culationiscomputedbyinvertingthesmallAmatrixandmultiplyingbythevectortoderivethecoecients .Additionally,calculationofthecoecientofdetermina-tionR2canbecalculatedconcurrentlybySSR=b0 �1 nXyi2TSS=Xy2i�1 nXyi2R2=SSR TSSInthefollowingSQLquery,wecomputethecoecients ,aswellasthecomponentsofthecoecientofdetermi-nation:CREATEVIEWolsASSELECTpseudo_inverse(A)*basbeta_star,(transpose(b)*(pseudo_inverse(A)*b)-sum_y2/count)--SSR/(sum_yy-sumy2/n)--TSSasr_squaredFROM(SELECTsum(transpose(d.vector)*d.vector)asA,sum(d.vector*y)asb,sum(y)^2assum_y2,sum(y^2)assum_yy,count(*)asnFROMdesignd)ols_aggs;Notetheuseofauser-de nedfunctionforvectortranspose,anduser-de nedaggregatesforsummationof(multidimen-sional)arrayobjects.ThearrayAisasmallin-memorymatrixthatwetreatasasingleobject;thepseudo-inversefunctionimplementsthetextbookMoore-Penrosepseudoin-verseofthematrix.Alloftheabovecanbeecientlycalculatedinasinglepassofthedata.Forconvenience,weencapsulatedthisyetfurtherviatwouser-de nedaggregatefunctions:SELECTols_coef(d.y,d.vector),ols_r2(d.y,d.vector)FROMdesignd;PriortotheimplementationofthisfunctionalitywithintheDBMS,oneGreenplumcustomerwasaccustomedtocal-culatingtheOLSbyexportingdataandimportingthedataintoRforcalculation,aprocessthattookseveralhourstocomplete.Theyreportedsigni cantperformanceimprove-mentwhentheymovedtorunningtheregressionwithintheDBMS.Mostofthebene tderivedfromrunningtheanaly-sisinparallelclosetothedatawithminimaldatamovement.5.2.2ConjugateGradientInthissubsectionwedevelopadata-parallelimplementa-tionoftheConjugateGradiantmethodforsolvingasystemoflinearequations.WecanusethistoimplementSup-portVectorMachines,astate-of-the-arttechniqueforbinaryclassi cation.Binaryclassi ersareacommontoolinmod-ernadplacement,usedtoturncomplexmulti-dimensionaluserfeaturesintosimplebooleanlabelslike\isacaren-thusiast"thatcanbecombinedintoenthusiastcharts.InadditiontoservingasabuildingblockforSVMs,theConju-gateGradiantmethodallowsustooptimizealargeclassoffunctionsthatcanbeapproximatedbysecondorderTaylorexpansions.Toamathematician,thesolutiontothematrixequationAx=bissimplewhenitexists:x=A�1b.AsnotedinSection5.1,wecannotassumewecan ndA�1.IfmatrixAisnnsymmetricandpositivede nite(SPD),wecanusetheConjugateGradientmethod.Thismethodrequiresneitherdf(y)norA�1andconvergesinnomorethanninterations.Ageneraltreatmentisgivenin[17].HereweoutlinethesolutiontoAx=basanextremumoff(x)=1 2x0Ax+b0x+c.Broadly,wehaveanestimate^xtooursolutionx.Since^xisonlyanestimate,r0=A^x�bisnon-zero.Subtractingthiserrorr0fromtheestimateallowsustogenerateaseriespi=ri�1�fA^x�bgoforthogonalvectors.Thesolutionwillbex=Pi ipifor ide nedbelow.Weendatthepointkrkk2forasuitable.Thereareseveralupdatealgorithms,wehavewrittenoursinmatrixnotation.r0=b�A^x0; 0=r00r0 v00Av0;v0=r0;i=0Beginiterationoveri. i=r0iri v0iAvixi+1=xi+ iviri+1=ri� iAvicheckkri+1k2vi+1=ri+1+r0i+1ri+1 r0iriviToincorporatethismethodintothedatabase,westored(vi;xi;ri; i)asarowandinsertedrowi+1inonepass.Thisrequiredtheconstructionoffunctionsupdate alpha(r i,p i,A),update x(x i,alpha i,v i),update r(x i,alpha i,v i,A),andupdate v(r i,alpha i,v i,A).Thoughthefunctioncallswereredundant(forinstance,update v()alsorunstheupdateofri+1),thisallowedustoinsertonefullrowatatime.Anexternaldriverprocessthenchecksthevalueofribeforeproceeding.Uponconvergence,itisrudimentarytocomputex.ThepresenceoftheconjugategradientmethodenablesevenmoresophisticatedtechniqueslikeSupportVectorMa-chines(SVM).Attheircore,SVMsseektomaximizethedistancebetweenasetofpointsandacandiatehyperplane.Thisdistanceisdenotedbythemagnitudeofthenormalvectorskwk2.Mostmethodsincorporatetheintegersf0;1gaslabelsc,sotheproblembecomesargmaxw;bf(w)=1 2kwk2;subjecttoc0w�b0:ThismethodappliestothemoregeneralissueofhighdimensionalfunctionsunderaTaylorexpansionfx0(x)f(x0)+df(x)(x�x0)+1 2(x�xo)0d2f(x)(x�x0)Withagoodinitialguessforxandthecommonassumptionofcontinuityoff(),weknowthethematrixwillbeSPDnearx.See[17]fordetails.5.3FunctionalsBasicstatisticsarenotnewtorelationaldatabases{mostsupportmeans,variancesandsomeformofquantiles.Butmodelingandcomparativestatisticsarenottypicallybuilt-infunctionality.Inthissectionweprovidedata-parallel implementationsofanumberofcomparativestatisticsex-pressedinSQL.Intheprevioussection,scalarsorvectorsweretheatomicunit.Hereaprobabilitydensityfunctionisthefounda-tionalobject.ForinstancetheNormal(Gaussian)densityf(x)=e(x�)2=22isconsideredbymathematiciansasasingle\entity"withtwoattributes:themeanandvari-ance.Acommonstatisticalquestionistoseehowwelladataset tstoatargetdensityfunction.Thez�scoreofadatumxisgivenz(x)=(x�) =p nandiseasytoobtaininstandardSQL.SELECTx.value,(x.value-d.mu)*d.n/d.sigmaASz_scoreFROMx,designd5.3.1Mann­WhitneyUTestRankandorderstatisticsarequiteamenabletorelationaltreatments,sincetheirmainpurposeistoevaluateasetofdata,ratherthenonedatumatatime.Thenextexampleillustratesthenotionofcomparingtwoentiresetsofdatawithouttheoverheadofdescribingaparameterizeddensity.TheMann-WhitneyUTest(MWU)isapopularsubsti-tuteforStudent'st-testinthecaseofnon-parametricdata.ThegeneralideaittotaketwopopulationsAandBanddecideiftheyarefromthesameunderlyingpopulationbyexaminingtherankorderinwhichmembersofAandBshowupinageneralordering.Thecartoonisthatifmem-bersofAareatthe\front"ofthelineandmembersofBareatthe\back"oftheline,thenAandBaredi er-entpopulations.Inanadvertisingsetting,click-throughratesforwebadstendtodefysimpleparametricmodelslikeGaussiansorlog-normaldistributions.Butitisoftenusefultocompareclick-throughratedistributionsfordi er-entadcampaigns,e.g.,tochooseonewithabettermedianclick-through.MWUaddressesthistask.GivenatableTwithcolumnsSAMPLE ID,VALUE,rownum-bersareobtainedandsummedviaSQLwindowingfunc-tions.CREATEVIEWRASSELECTsample_id,avg(value)ASsample_avgsum(rown)ASrank_sum,count(*)ASsample_n,sum(rown)-count(*)*(count(*)+1)ASsample_usFROM(SELECTsample_id,row_number()OVER(ORDERBYvalueDESC)ASrown,valueFROMT)ASorderedGROUPBYsample_idAssumingtheconditionoflargesamplesizes,forinstancegreaterthen5,000,thenormalapproximationcanbejusti- ed.UsingthepreviousviewR,the nalreportedstatisticsaregivenbySELECTr.sample_u,r.sample_avg,r.sample_n(r.sample_u-a.sum_u/2)/sqrt(a.sum_u*(a.sum_n+1)/12)ASz_scoreFROMRasr,(SELECTsum(sample_u)ASsum_u,sum(sample_n)ASsum_nFROMR)ASaGROUPBYr.sample_u,r.sample_avg,r.sample_n,a.sum_n,a.sum_uTheendresultisasmallsetofnumbersthatdescribearelationshipoffunctions.Thissimpleroutinecanbeen-capsulatedbystoredproceduresandmadeavailabletotheanalystsviaasimpleSELECTmann whitney(value)FROMtablecall,elevatingthevocabularyofthedatabasetremendously.5.3.2Log­LikelihoodRatiosLikelihoodratiosareusefulforcomparingasubpopula-tiontoanoverallpopulationonaparticularattributed.Asanexampleinadvertising,considertwoattributesofusers:beveragechoice,andfamilystatus.Onemightwanttoknowwhetherco eeattractsnewparentsmorethanthegeneralpopulation.Thisisacaseofhavingtwodensity(ormass)functionsforthesamedatasetX.Denoteonedistributionasnullhypothesisf0andtheotherasalternatefA.Typically,f0andfAaredi erentparameterizationsofthesamedensity.Forinstance,N(0;0)andN(A;A).ThelikelihoodLunderfiisgivenbyLfi=L(Xjfi)=Ykfi(xk):Thelog-likelihoodratioisgivenbythequotient�2log(Lf0=LfA).Takingthelogallowsustousethewell-known2approxi-mationforlargen.Also,theproductsturnnicelyintosumsandanRDBMScanhandleiteasilyinparallel.LLR=2XklogfA(xk)�2Xklogf0(xk):Thiscalculationdistributesnicelyiffi:R!R,whichmostdo.Iffi:Rn!R,thencaremustbetakeninmanagingthevectorsasdistributedobjects.SupposethevaluesareintableTandthefunctionfA()hasbeenwrittenasauser-de nedfunctionf llk(xnumeric,paramnumeric).ThentheentireexperimentiscanbeperformedwiththecallSELECT2*sum(log(f_llk(T.value,d.alt_param)))-2*sum(log(f_llk(T.value,d.null_param)))ASllrFROMT,designASdThisrepresentsasigni cantgainin exibilityandsophisti-cationforanyRDBMS.Example:TheMultinomialDistributionThemultinomialdistributionextendsthebinomialdis-tribution.ConsiderarandomvariableXwithkdiscreteoutcomes.Thesehaveprobabilitiesp=(p1;:::;pk).Inntrials,thejointprobabilitydistributionisgivenbyP(Xjp)= n(n1;:::;nk)!pn11pnkk:Toobtainpi,weassumeatableoutcomewithcolumnoutcomerepresentingthebasepopulation.CREATEVIEWBASSELECToutcome,outcome_count/sum(outcome_count)over()ASpFROM(SELECToutcome,count(*)::numericASoutcome_countFROMinputGROUPBYoutcome)ASaInthecontextofmodelselection,itisoftenconvenienttocomparethesamedatasetundertwodi erentmultinomialdistributions.LLR=�2logP(Xjp) P(Xj~p)=�2log �nn1;:::;nkpn11pnkk �nn1;:::;nk~pn11~pnkk!=2Xinilog~pi�Xinilogpi:OrinSQL: discussedinthecontextofdataintegration[18].ButthefocusinaMADwarehousecontextisonmassivelyparallelaccessto ledatathatlivesonalocalhigh-speednetwork.Greenplumimplementsfullyparallelaccessforbothload-ingandqueryprocessingoverexternaltablesviaatechniquecalledScatter/GatherStreaming.Theideaissimilartotra-ditionalshared-nothingdatabaseinternals[7],butrequirescoordinationwithexternalprocessesto\feed"alltheDBMSnodesinparallel.Asthedataisstreamedintothesystemitcanbelandedindatabasetablesforsubsequentaccess,oruseddirectlyasapurelyexternaltablewithparallelI/O.Usingthistechnology,Greenplumcustomershavereportingloadingspeedsofafully-mirrored,productiondatabaseinexcessoffourterabytesperhourwithnegligibleimpactonconcurrentdatabaseoperations.6.1.1ETLandELTTraditionaldatawarehousingissupportedbycustomtoolsfortheExtract-Transform-Load(ETL)task.Inrecentyears,thereisincreasingpressuretopushtheworkoftransforma-tionintotheDBMS,toenableparallelexecutionviaSQLtransformationscripts.ThisapproachhasbeendubbedELTsincetransformationisdoneafterloading.TheELTapproachbecomesevenmorenaturalwithexternaltables.Transformationqueriescanbewrittenagainstexternalta-bles,removingtheneedtoeverloaduntransformeddata.Thiscanspeedupthedesignloopfortransformationssub-stantially{especiallywhencombinedwithSQL'sLIMITclauseasa\poorman'sOnlineAggregation"[11]todebugtrans-formations.InadditiontotransformationswritteninSQL,Green-plumsupportsMapReducescriptingintheDBMS,whichcanrunovereitherexternaldataviaScatter/Gather,orin-databasetables(Section6.3).Thisallowsprogrammerstowritetransformationscriptsinthedata ow-styleprogram-mingusedbymanyETLtools,whilerunningatscaleusingtheDBMS'facilitiesforparallelism.6.2DataEvolution:StorageandPartitioningThedatalifecycleinaMADwarehouseincludesdatainvariousstates.Whenadatasourceis rstbroughtintothesystem,analystswilltypicallyiterateoveritfrequentlywithsigni cantanalysisandtransformation.Astransfor-mationsandtablede nitionsbegintosettleforaparticulardatasource,theworkloadlooksmoreliketraditionalEDWsettings:frequentappendstolarge\fact"tables,andocca-sionalupdatesto\detail"tables.Thismaturedataislikelytobeusedforad-hocanalysisaswellasforstandardre-portingtasks.Asdatainthe\fact"tablesagesovertime,itmaybeaccessedlessfrequentlyoreven\rolledo "toanexternalarchive.Notethatallthesestagesco-occurinasinglewarehouseatagiventime.HenceagoodDBMSforMADanalyticsneedstosupportmultiplestoragemechanisms,targetedatdi erentstagesofthedatalifecycle.Intheearlystage,externaltablespro-videalightweightapproachtoexperimentwithtransforma-tions.Detailtablesareoftenmodestinsizeandundergoperiodicupdates;theyarewellservedbytraditionaltrans-actionalstoragetechniques.Append-mostlyfacttablescanbebetterservedbycompressedstorage,whichcanhandleappendsandreadseciently,attheexpenseofmakingup-datesslower.Itshouldbepossibletorollthisdatao ofthewarehouseasitages,withoutdisruptingongoingprocessing.Greenplumprovidesmultiplestorageengines,witharichSQLpartitioningspeci cationtoapplythem exiblyacrossandwithintables.Asmentionedabove,Greenplumincludesexternaltablesupport.Greenplumalsoprovidesatradi-tional\heap"storageformatfordatathatseesfrequentupdates,andahighly-compressed\append-only"(AO)ta-blefeaturefordatathatisnotgoingtobeupdated;bothareintegratedwithinatransactionalframework.Green-plumAOstorageunitscanhaveavarietyofcompressionmodes.Atoneextreme,withcompressiono ,bulkloadsrunveryquickly.Alternatively,themostaggressivecompressionmodesaretunedtouseaslittlespaceaspossible.Thereisalsoamiddlegroundwith\medium"compressiontoprovideimprovedtablescantimeattheexpenseofslightlyslowerloads.InarecentversionGreenplumalsoadds\column-store"partitioningofappend-onlytables,akintoideasintheliterature[20].Thiscanimprovecompression,anden-suresthatqueriesoverlargearchivaltablesonlydoI/Oforthecolumnstheyneedtosee.ADBAshouldbeabletospecifythestoragemechanismtobeusedina exibleway.Greenplumsupportsmanywaystopartitiontablesinordertoincreasequeryanddataloadperformance,aswellastoaidinmanaginglargedatasets.Thetop-mostlayerofpartitioningisadistributionpolicyspeci edviaaDISTRIBUTEDBYclauseintheCREATETABLEstatementthatdetermineshowtherowsofatablearedistributedacrosstheindividualnodesthatcompriseaGreenplumcluster.Whilealltableshaveadistributionpolicy,userscanoptionallyspecifyapartitioningpolicyforatable,whichseparatesthedatainthetableintoparti-tionsbyrangeorlist.Arangepartitioningpolicyletsusersspecifyanordered,non-overlappingsetofpartitionsforapartitioningcolumn,whereeachpartitionhasaSTARTandENDvalue.Alistpartitioningpolicyletsusersspecifyasetofpartitionsforacollectionofcolumns,whereeachparti-tioncorrespondstoaparticularvalue.Forexample,asalestablemaybehash-distributedoverthenodesbysales id.Oneachnode,therowsarefurtherpartitionedbyrangeintoseparatepartitionsforeachmonth,andeachofthesepartitionsissubpartitionedintothreeseparatesalesregions.Notethatthepartitioningstructureiscompletelymutable:ausercanaddnewpartitionsordropexistingpartitionsorsubpartitionsatanypoint.Partitioningisimportantforanumberofreasons.First,thequeryoptimizerisawareofthepartitioningstructure,andcananalyzepredicatestoperformpartitionexclusion:scanningonlyasubsetofthepartitionsinsteadoftheentiretable.Second,eachpartitionofatablecanhaveadi erentstorageformat,tomatchtheexpectedworkload.Atypicalarrangementistopartitionbyatimestamp eld,andhaveolderpartitionsbestoredinahighly-compressedappend-onlyformatwhilenewer,\hotter"partitionsarestoredinamoreupdate-friendlyformattoaccommodateauditingup-dates.Third,itenablesatomicpartitionexchange.Ratherthaninsertingdataarowatatime,ausercanuseETLorELTtostagetheirdatatoatemporarytable.Afterthedataisscrubbedandtransformed,theycanusetheALTERTABLE...EXCHANGEPARTITIONcommandtobindthetemporarytableasanewpartitionofanexistingtableinaquickatomicoperation.Thiscapabilitymakespartitioningparticularlyusefulforbusinessesthatperformbulkdataloadsonadaily,weekly,ormonthlybasis,especiallyiftheydroporarchiveolderdatatokeepsome xedsize\window"ofdataonline inthewarehouse.Thesameideaalsoallowsuserstodophysicalmigrationoftablesandstorageformatmodi ca-tionsinawaythatmostlyisolatesproductiontablesfromloadingandtransformationoverheads.6.3MADProgrammingAlthoughMADdesignfavorsquickimportandfrequentiterationovercarefulmodeling,itisnotintendedtore-jectstructureddatabasesperse.AsmentionedinSec-tion4,thestructureddatamanagementfeaturesofaDBMScanbeveryusefulfororganizingexperimentalresults,trialdatasets,andexperimentalwork ows.Infact,shopsthatusetoolslikeHadooptypicallyhaveDBMSsinaddition,and/orevolvelightdatabasesystemslikeHive.ButaswealsonoteinSection4,itisadvantageoustounifythestruc-turedenvironmentwiththeanalysts'favoriteprogrammingenvironments.Dataanalystscomefrommanywalksoflife.SomeareexpertsinSQL,butmanyarenot.Analyststhatcomefromascienti cormathematicalbackgroundaretypicallytrainedinstatisticalpackageslikeR,SAS,orMatlab.Thesearememory-bound,single-machinesolutions,buttheypro-videconvenientabstractionsformathprogramming,andac-cesstolibrariescontaininghundredsofstatisticalroutines.OtheranalystshavefacilitywithtraditionalprogramminglanguageslikeJava,Perl,andPython,buttypicallydonotwanttowriteparallelorI/O-centriccode.ThekindofdatabaseextensibilitypioneeredbyPostgres[19]isnolongeranexoticDBMSfeature{itisakeytomoderndataanalytics,enablingcodetorunclosetothedata.Tobeinvitingtoavarietyofprogrammers,agoodDBMSexten-sibilityinterfaceshouldaccommodatemultiplelanguages.PostgreSQLhasbecomequitepowerfulinthisregard,sup-portingawiderangeofextensionlanguagesincludingR,PythonandPerl.Greenplumtakestheseinterfacesanden-ablesthemtorundata-parallelonacluster.Thisdoesnotprovideautomaticparallelismofcourse:developersmustthinkthroughhowtheircodeworksinadata-parallelenvi-ronmentwithoutsharedmemory,aswedidinSection5.Inadditiontoworklikeourstoimplementstatisticalmeth-odsinextensibleSQL,thereisagroundswellofe orttoim-plementmethodswiththeMapReduceprogrammingparadigmpopularizedbyGoogle[4]andHadoop.Fromtheperspec-tiveofprogramminglanguagedesign,MapReduceandmod-ernSQLarequitesimilartakesonparallelism:botharedata-parallelprogrammingmodelsforshared-nothingarchi-tecturesthatprovideextensionhooks(\upcalls")tointer-ceptindividualtuplesorsetsoftupleswithinadata ow.Butasaculturalphenomenon,MapReducehascapturedtheinterestofmanydevelopersinterestedinrunninglarge-scaleanalysesonBigData,andiswidelyviewedasamoreat-tractiveprogrammingenvironmentthanSQL.AMADdatawarehouseneedstoattracttheseprogrammers,andallowthemtoenjoythefamiliarityofMapReduceprogramminginacontextthatbothintegrateswiththerestofthedataintheenterprise,ando ersmoresophisticatedtoolsforman-agingdataproducts.GreenplumapproachedthischallengebyimplementingaMapReduceprogramminginterfacewhoseruntimeengineisthesamequeryexecutorusedforSQL[9].UserswriteMapandReducefunctionsinfamiliarlanguageslikePython,Perl,orR,andconnectthemupintoMapReducescriptsviaasimplecon guration le.Theycanthenexecutethesescriptsviaacommandlineinterfacethatpassesthecon gu-rationandMapReducecodetotheDBMS,returningoutputtoacon gurablelocation:commandline, les,orDBMStables.TheonlyrequiredDBMSinteractionisthespeci -cationofanIPaddressfortheDBMS,andauthenticationcredentials(user/password,PGPkeys,etc.)Hencedevelop-erswhoareusedtotraditionalopensourcetoolscontinuetousetheirfavoritecodeeditors,sourcecodemanagement,andshellprompts;theydonotneedtolearnaboutdatabaseutilities,SQLsyntax,schemadesign,etc.TheGreenplumexecutoraccesses lesforMapReducejobsviathesameScatter/GathertechniquethatitusesforexternaltablesinSQL.Inaddition,GreenplumMapReducescriptsinteroperatewithallthefeaturesofthedatabase,andviceversa.MapReducescriptscanusedatabasetablesorviewsastheirinputs,and/orstoretheirresultsasdatabasetablesthatcanbedirectlyaccessedviaSQL.Hencecom-plexpipelinescanevolvethatincludesomestagesinSQL,andsomeinMapReducesyntax.Executioncanbedoneen-tirelyondemand{runningtheSQLandMapReducestagesinapipeline{orviamaterializationofstepsalongthewayeitherinsideoroutsidethedatabase.Programmersofdi er-entstripescaninteroperateviafamiliarinterfaces:databasetablesandviews,orMapReduceinputstreams,incorporat-ingavarietyoflanguagesfortheMapandReducefunctions,andforSQLextensionfunctions.ThiskindofinteroperabilitybetweenprogrammingmetaphorsiscriticalforMADanalytics.Itattractsanalysts{andhencedata{tothewarehouse.Itprovidesagilitytode-velopersbyfacilitatingfamiliarprogramminginterfacesandenablinginteroperabilityamongprogrammingstyles.Fi-nally,itallowsanalyststododeepdevelopmentusingthebesttoolsofthetrade,includingmanydomainspeci cmod-uleswrittenfortheimplementationlanguages.InexperiencewithavarietyofGreenplumcustomers,wehavefoundthatdeveloperscomfortablewithbothSQLandMapReducewillchooseamongthem exiblyfordi erenttasks.Forexample,MapReducehasprovedmoreconve-nientforwritingETLscriptson leswheretheinputor-derisknownandshouldbeexploitedinthetransformation.MapReducealsomakesiteasytospecifytransformationsthattakeoneinputandproducemultipleoutputs{thisisalsocommoninETLsettingsthat\shred"inputrecordsandproduceastreamofoutputtupleswithmixedformats.SQL,surprisingly,hasbeenmoreconvenientthanMapRe-ducefortasksinvolvinggraphdatalikeweblinksandso-cialnetworks,sincemostofthealgorithmsinthatsetting(PageRank,ClusteringCoecients,etc.)canbecodedcom-pactlyas\self-joins"ofalinktable.7.DIRECTIONSANDREFLECTIONSTheworkinthispaperresultedfromafairlyquick,it-erativediscussionamongdata-centricpeoplewithvaryingjobdescriptionsandtraining.Theprocessofarrivingatthepaper'slessonsechoedthelessonsthemselves.Wedidnotdesignadocumentupfront,butinstead\gotMAD":webroughtmanydatapointstogether,fosteredquickiterationamongmultipleparties,andtriedtodigdeeplyintodetails.AsinMADanalysis,weexpecttoarriveatnewquestionsandnewconclusionsasmoredataisbroughttolight.Afewoftheissueswearecurrentlyconsideringincludethefollowing: