MAD Skills: New Analysis Practices for Big Data

Jeffrey Cohen (Greenplum), Brian Dolan (Fox Audience Network), Mark Dunlap (Evergreen Technologies), Joseph M. Hellerstein (UC Berkeley), Caleb Welton (Greenplum)

ABSTRACT

As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we h[...]

[...] warehouse can keep pace today only by being "magnetic": attracting all the data sources that crop up within an organization regardless of data quality niceties.

Agile: Data Warehousing orthodoxy is based on long-range, careful design and planning. Given growing numbers of data sources and increasingly sophisticated and mission-critical data analyses, a modern warehouse must instead allow analysts to easily ingest, digest, produce and adapt data at a rapid pace. This requires a database whose physical and logical contents can be in continuous rapid evolution.

Deep: Modern data analyses involve increasingly sophisticated statistical methods that go well beyond the rollups and drilldowns of traditional BI. Moreover, analysts often need to see both the forest and the trees in running these algorithms: they want to study enormous datasets without resorting to samples and extracts. The modern data warehouse should serve both as a deep data repository and as a sophisticated algorithmic runtime engine.

As noted by Varian, there is a growing premium on analysts with MAD skills in data analysis. These are often highly trained statisticians, who may have strong software skills but would typically rather focus on deep data analysis than database management. They need to be complemented by MAD approaches to data warehouse design and database system infrastructure. These goals raise interesting challenges that are different from the traditional focus in data warehousing research and industry.

1.1 Contributions

In this paper, we describe techniques and experiences from our development of MAD analytics for Fox Audience Network, using a large installation of the Greenplum Database system. We discuss our database design methodology that focuses on enabling an agile yet organized approach to data analysis (Section 4). We present a number of data-parallel statistical algorithms developed for this setting, which focus on modeling and comparing the densities of distributions. These include specific methods like Ordinary Least Squares, Conjugate Gradient, and Mann-Whitney U Testing, as well as general purpose techniques like matrix multiplication and Bootstrapping (Section 5). Finally, we reflect on critical database system features that enable agile design and
flexible algorithm development, including high-performance data ingress/egress, heterogeneous storage facilities, and flexible programming via both extensible SQL and MapReduce interfaces to a single system (Section 6).

Underlying our discussion are challenges to a number of points of conventional wisdom. In designing and analyzing data warehouses, we advocate the theme "Model Less, Iterate More". This challenges data warehousing orthodoxy, and argues for a shift in the locus of power from DBAs toward analysts. We describe the need for unified systems that embrace and integrate a wide variety of data-intensive programming styles, since analysts come from many walks of life. This involves moving beyond religious debates about the advantages of SQL over MapReduce, or R over Java, to focus on evolving a single parallel dataflow engine that can support a diversity of programming styles, to tackle substantive statistical analytics. Finally, we argue that many data sources and storage formats can and should be knitted together by the parallel dataflow engine. This points toward more fluid integration or consolidation of traditionally diverse tools including traditional relational databases, column stores, ETL tools, and distributed file systems.

2. BACKGROUND: IF YOU'RE NOT MAD

Data analytics is not a new area. In this section we describe standard practice and related work in Business Intelligence and large-scale data analysis, to set the context for our MAD approach.

2.1 OLAP and Data Cubes

Data Cubes and On-Line Analytic Processing (OLAP) were popularized in the 1990's, leading to intense commercial development and significant academic research. The SQL CUBE BY extension translated the basic idea of OLAP into a relational setting [8]. BI tools package these summaries into fairly intuitive "cross-tabs" visualizations. By grouping on few dimensions, the analyst sees a coarse "roll-up" bar-chart; by grouping on more dimensions they "drill down" into
finer grained detail. Statisticians use the phrase descriptive statistics for this kind of analysis, and traditionally apply the approach to the results of an experimental study. This functionality is useful for gaining intuition about a process underlying the experiment. For example, by describing the clickstream at a website one can get better intuition about underlying properties of the user population.

By contrast, inferential or inductive statistics try to directly capture the underlying properties of the population. This includes fitting models and parameters to data and computing likelihood functions. Inferential statistics require more computation than the simple summaries provided by OLAP, but provide more probabilistic power that can be used for tasks like prediction (e.g., "which users would be likely to click on this new ad?"), causality analysis ("what features of a page result in user revisits?"), and distributional comparison (e.g., "how do the buying patterns of truck owners differ from sedan owners?"). Inferential approaches are also more robust to outliers and other particulars of a given dataset. While OLAP and Data Cubes remain useful for intuition, the use of inferential statistics has become an imperative in many important automated or semi-automated business processes today, including ad placement, website optimization, and customer relationship management.

2.2 Databases and Statistical Packages

BI tools provide fairly limited statistical functionality. It is therefore standard practice in many organizations to extract portions of a database into desktop software packages: statistical packages like SAS, Matlab or R, spreadsheets like Excel, or custom code written in languages like Java.

There are various problems with this approach. First, copying out a large database extract is often much less efficient than pushing computation to the data; it is easy to get orders of magnitude performance gains by running code in the database. Second, most stat packages require their data to
fit in RAM. For large datasets, this means sampling the database to form an extract, which loses detail. In modern settings like advertising placement, microtargeting requires an understanding of even small subpopulations. Samples and synopses can lose the "long tail" in a data set, and that is increasingly where the competition for effectiveness lies.

[...] supervised and unsupervised learning, neural network and natural language processing. These are not techniques traditionally addressed by an RDBMS, and the implementation in Hadoop results in large data migration efforts for specific single-purpose studies. The availability of machine learning methods directly within the warehouse would offer a significant savings in time, training, and system management, and is one of the goals of the work described here.

4. MAD DATABASE DESIGN

Traditional Data Warehouse philosophy revolves around a disciplined approach to modeling information and processes in an enterprise. In the words of warehousing advocate Bill Inmon, it is an "architected environment" [13]. This view of warehousing is at odds with the magnetism and agility desired in many new analysis settings, as we describe below.

4.1 New Requirements

As data savvy people, analysts introduce a new set of requirements to a database environment. They have a deep understanding of the enterprise data and tend to be early adopters of new data sources. In the same way that systems engineers always want the latest-and-greatest hardware technologies, analysts are always hungry for new sources of data. When new data-generating business processes are launched, analysts demand the new data immediately.

These desires for speed and breadth of data raise tensions with Data Warehousing orthodoxy. Inmon describes the traditional view:

    There is no point in bringing data ... into the data warehouse environment without integrating it. If the data arrives at the data warehouse in an unintegrated state, it cannot be used to support a corporate view of data. And a corporate view of data is one of the essences of the architected environment. [13]

Unfortunately, the challenge of perfectly integrating a new data source into an "architected" warehouse is often substantial, and can hold up access to data for months, or in many cases, forever. The architectural view introduces friction into analytics, repels data sources from the warehouse, and as a result produces shallow incomplete warehouses. It is the opposite of the MAD ideal.

Given the growing sophistication of analysts and the growing value of analytics, we take the view that it is much more important to provide agility to analysts than to aspire to an elusive ideal of full integration. In fact, analysts serve as key data magnets in an organization, scouting for interesting data that should be part of the corporate big picture. They can also act as an early warning system for data quality issues. For the privilege of being the first to see the data, they are more tolerant of dirty data, and will act to apply pressure on operational data producers upstream of the warehouse to rectify the data before it arrives. Analysts typically have much higher standards for the data than a typical business unit working with BI tools. They are undaunted by big, flat fact tables that hold complete data sets, scorning samples and aggregates, which can both mask errors and lose important features in the tails of distributions.

Hence it is our experience that a good relationship with the analytics team is an excellent preventative measure for data management issues later on. Feeding their appetites and responding to their concerns improves the overall health of the warehouse.

Ultimately, the analysts produce new data products that are valuable to the enterprise. They are not just consumers, but producers of enterprise data. This requires the warehouse to be prepared to "productionalize" the data generated by the analysts into standard business reporting tools.

It is also useful, when possible, to leverage a single parallel computing platform, and push as much functionality as possible into it. This lowers the cost of operations, and eases the evolution of software from analyst experiments to production code that affects operational behavior. For example, the lifecycle of an ad placement algorithm might start in a speculative analytic task, and end as a customer facing feature. If it is a data-driven feature, it is best to have that entire lifecycle focused in a single development environment on the full enterprise dataset. In this respect we agree with a central tenet of Data Warehouse orthodoxy: there is tangible benefit to getting an organization's data into one repository. We differ on how to achieve that goal in a useful and sophisticated manner.

In sum, a healthy business should not assume an architected data warehouse, but rather an evolving structure that iterates through a continuing cycle of change:

1. The business performs analytics to identify areas of potential improvement.
2. The business either reacts to or ignores this analysis.
3. A reaction results in new or different business practices, perhaps new processes or operational systems, that typically generate new data sets.
4. Analysts incorporate new data sets into their models.
5. The business again asks itself "How can we improve?"

A healthy, competitive business will look to increase the pace of this cycle. The MAD approach we describe next is a design pattern for keeping up with that increasing pace.

4.2 Getting More MAD

The central philosophy in MAD data modeling is to get the organization's data into the warehouse as soon as possible. Secondary to that goal, the cleaning and integration of the data should be staged intelligently.

To turn these themes into practice, we advocate the three-layer approach. A Staging schema should be used when loading raw fact tables or logs. Only engineers and some analysts are permitted to manipulate data in this schema. The Production Data Warehouse schema holds the aggregates that serve most users. More sophisticated users comfortable in a large-scale SQL environment are given access to this schema. A separate Reporting schema is maintained to hold specialized, static aggregates that support reporting tools and casual users. It should be tuned to provide rapid access to modest amounts of data.

These three layers are not physically separated. Users with the correct permissions are able to cross-join between layers and schemas. In the FAN model, the Staging schema holds raw action logs. Analysts are given access to these logs for research purposes and to encourage a laboratory approach to data analysis. Questions that start at the event log level often become broader in scope, allowing custom aggregates. Communication between the researchers and the DBAs uncovers common questions and often results in aggregates that were originally personalized for an analyst [...]

[...] This will produce a square matrix and a vector based on the size of the independent variable vector. The
final calculation is computed by inverting the small $A$ matrix and multiplying by the vector to derive the coefficients $\beta^*$.

Additionally, the coefficient of determination $R^2$ can be calculated concurrently by

\[ SSR = b'\beta^* - \frac{1}{n}\Big(\sum y_i\Big)^2 \]
\[ TSS = \sum y_i^2 - \frac{1}{n}\Big(\sum y_i\Big)^2 \]
\[ R^2 = \frac{SSR}{TSS} \]

In the following SQL query, we compute the coefficients $\beta^*$, as well as the components of the coefficient of determination:

    CREATE VIEW ols AS
    SELECT pseudo_inverse(A) * b AS beta_star,
           (transpose(b) * (pseudo_inverse(A) * b) - sum_y2/n)  -- SSR
           / (sum_yy - sum_y2/n)                                -- TSS
           AS r_squared
    FROM (
      SELECT sum(transpose(d.vector) * d.vector) AS A,
             sum(d.vector * y) AS b,
             sum(y)^2 AS sum_y2,
             sum(y^2) AS sum_yy,
             count(*) AS n
      FROM design d
    ) ols_aggs;

Note the use of a user-defined function for vector transpose, and user-defined aggregates for summation of (multidimensional) array objects. The array A is a small in-memory matrix that we treat as a single object; the pseudo-inverse function implements the textbook Moore-Penrose pseudoinverse of the matrix.

All of the above can be efficiently calculated in a single pass of the data. For convenience, we encapsulated this yet further via two user-defined aggregate functions:

    SELECT ols_coef(d.y, d.vector), ols_r2(d.y, d.vector)
    FROM design d;

Prior to the implementation of this functionality within the DBMS, one Greenplum customer was accustomed to calculating the OLS by exporting data and importing the data into R for calculation, a process that took several hours to complete. They reported significant performance improvement when they moved to running the regression within the DBMS. Most of the benefit derived from running the analysis in parallel close to the data with minimal data movement.

5.2.2 Conjugate Gradient

In this subsection we develop a data-parallel implementation of the Conjugate Gradient method for solving a system of linear equations. We can use this to implement Support Vector Machines, a state-of-the-art technique for binary classification. Binary classifiers are a common tool in modern ad placement, used to turn complex multi-dimensional user features into simple boolean labels like "is a car enthusiast" that can be combined into enthusiast charts. In addition to serving as a building block for SVMs, the Conjugate Gradient method allows us to optimize a large class of functions that can be approximated by second order Taylor expansions.

To a mathematician, the solution to the matrix equation $Ax = b$ is simple when it exists: $x = A^{-1}b$. As noted in Section 5.1, we cannot assume we can find $A^{-1}$. If matrix $A$ is $n \times n$ symmetric and positive definite (SPD), we can use the Conjugate Gradient method. This method requires neither $df(y)$ nor $A^{-1}$ and converges in no more than $n$ iterations. A general treatment is given in [17]. Here we outline the solution to $Ax = b$ as an extremum of $f(x) = \frac{1}{2}x'Ax + b'x + c$. Broadly, we have an estimate $\hat{x}$ to our solution $x^*$. Since $\hat{x}$ is only an estimate, $r_0 = A\hat{x} - b$ is non-zero. Subtracting this error $r_0$ from the estimate allows us to generate a series $p_i = r_{i-1} - \{A\hat{x} - b\}$ of orthogonal vectors. The solution will be $x^* = \sum_i \alpha_i p_i$ for $\alpha_i$ defined below. We end at the point $\|r_k\|_2 < \epsilon$ for a suitable $\epsilon$. There are several update algorithms; we have written ours in matrix notation.

\[ r_0 = b - A\hat{x}_0, \quad \alpha_0 = \frac{r_0' r_0}{v_0' A v_0}, \quad v_0 = r_0, \quad i = 0 \]

Begin iteration over $i$.
\[ \alpha_i = \frac{r_i' r_i}{v_i' A v_i} \]
\[ x_{i+1} = x_i + \alpha_i v_i \]
\[ r_{i+1} = r_i - \alpha_i A v_i \]
\[ \text{check } \|r_{i+1}\|_2 < \epsilon \]
\[ v_{i+1} = r_{i+1} + \frac{r_{i+1}' r_{i+1}}{r_i' r_i} v_i \]

To incorporate this method into the database, we stored $(v_i, x_i, r_i, \alpha_i)$ as a row and inserted row $i+1$ in one pass. This required the construction of functions update_alpha(r_i, p_i, A), update_x(x_i, alpha_i, v_i), update_r(x_i, alpha_i, v_i, A), and update_v(r_i, alpha_i, v_i, A). Though the function calls were redundant (for instance, update_v() also runs the update of $r_{i+1}$), this allowed us to insert one full row at a time. An external driver process then checks the value of $r_i$ before proceeding. Upon convergence, it is rudimentary to compute $x^*$.

The presence of the conjugate gradient method enables even more sophisticated techniques like Support Vector Machines (SVM). At their core, SVMs seek to maximize the distance between a set of points and a candidate hyperplane. This distance is denoted by the magnitude of the normal vectors $\|w\|_2$. Most methods incorporate the integers $\{0, 1\}$ as labels $c$, so the problem becomes

\[ \operatorname{argmax}_{w,b} f(w) = \frac{1}{2}\|w\|^2, \quad \text{subject to } c'w - b \ge 0. \]

This method applies to the more general issue of high dimensional functions under a Taylor expansion

\[ f_{x_0}(x) \approx f(x_0) + df(x)(x - x_0) + \frac{1}{2}(x - x_0)' \, d^2 f(x) \, (x - x_0) \]

With a good initial guess for $x^*$ and the common assumption of continuity of $f(\cdot)$, we know the matrix will be SPD near $x^*$. See [17] for details.

5.3 Functionals

Basic statistics are not new to relational databases: most support means, variances and some form of quantiles. But modeling and comparative statistics are not typically built-in functionality. In this section we provide data-parallel implementations of a number of comparative statistics expressed in SQL.

In the previous section, scalars or vectors were the atomic unit. Here a probability density function is the foundational object. For instance the Normal (Gaussian) density $f(x) = e^{-(x-\mu)^2 / 2\sigma^2}$ is considered by mathematicians as a single "entity" with two attributes: the mean and variance. A common statistical question is to see how well a data set fits to a target density function. The z-score of a datum $x$ is given by $z(x) = \frac{x - \mu}{\sigma / \sqrt{n}}$ and is easy to obtain in standard SQL.

    SELECT x.value, (x.value - d.mu) * sqrt(d.n) / d.sigma AS z_score
    FROM x, design d

5.3.1 Mann-Whitney U Test

Rank and order statistics are quite amenable to relational treatments, since their main purpose is to evaluate a set of data, rather than one datum at a time. The next example illustrates the notion of comparing two entire sets of data without the overhead of describing a parameterized density.

The Mann-Whitney U Test (MWU) is a popular substitute for Student's t-test in the case of non-parametric data. The general idea is to take two populations A and B and decide if they are from the same underlying population by examining the rank order in which members of A and B show up in a general ordering. The cartoon is that if members of A are at the "front" of the line and members of B are at the "back" of the line, then A and B are different populations. In an advertising setting, click-through rates for web ads tend to defy simple parametric models like Gaussians or log-normal distributions. But it is often useful to compare click-through rate distributions for different ad campaigns, e.g., to choose one with a better median click-through. MWU addresses this task.

Given a table T with columns SAMPLE_ID, VALUE, row numbers are obtained and summed via SQL windowing functions.

    CREATE VIEW R AS
    SELECT sample_id, avg(value) AS sample_avg,
           sum(rown) AS rank_sum, count(*) AS sample_n,
           sum(rown) - count(*) * (count(*) + 1) / 2 AS sample_u
    FROM (SELECT sample_id,
                 row_number() OVER (ORDER BY value DESC) AS rown,
                 value
          FROM T) AS ordered
    GROUP BY sample_id

Assuming the condition of large sample sizes, for instance greater than 5,000, the normal approximation can be justified. Using the previous view R, the final reported statistics are given by

    SELECT r.sample_u, r.sample_avg, r.sample_n,
           (r.sample_u - a.sum_u / 2) / sqrt(a.sum_u * (a.sum_n + 1) / 12) AS z_score
    FROM R AS r,
         (SELECT sum(sample_u) AS sum_u, sum(sample_n) AS sum_n FROM R) AS a
    GROUP BY r.sample_u, r.sample_avg, r.sample_n, a.sum_n, a.sum_u

The end result is a small set of numbers that describe a relationship of functions. This simple routine can be encapsulated by stored procedures and made available to the analysts via a simple SELECT mann_whitney(value) FROM table call, elevating the vocabulary of the database tremendously.

5.3.2 Log-Likelihood Ratios

Likelihood ratios are useful for comparing a subpopulation to an overall population on a particular attribute. As an example in advertising, consider two attributes of users: beverage choice, and family status. One might want to know whether coffee attracts new parents more than the general population.

This is a case of having two density (or mass) functions for the same data set $X$. Denote one distribution as null hypothesis $f_0$ and the other as alternate $f_A$. Typically, $f_0$ and $f_A$ are different parameterizations of the same density. For instance, $N(\mu_0, \sigma_0)$ and $N(\mu_A, \sigma_A)$. The likelihood $L$ under $f_i$ is given by

\[ L_{f_i} = L(X \mid f_i) = \prod_k f_i(x_k). \]

The log-likelihood ratio is given by the quotient $-2 \log(L_{f_0} / L_{f_A})$. Taking the log allows us to use the well-known $\chi^2$ approximation for large $n$. Also, the products turn nicely into sums and an RDBMS can handle it easily in parallel.

\[ LLR = 2 \sum_k \log f_A(x_k) - 2 \sum_k \log f_0(x_k). \]

This calculation distributes nicely if $f_i : \mathbb{R} \to \mathbb{R}$, which most do. If $f_i : \mathbb{R}^n \to \mathbb{R}$, then care must be taken in managing the vectors as distributed objects. Suppose the values are in table T and the function $f_A(\cdot)$ has been written as a user-defined function f_llk(x numeric, param numeric). Then the entire experiment can be performed with the call

    SELECT 2 * sum(log(f_llk(T.value, d.alt_param)))
         - 2 * sum(log(f_llk(T.value, d.null_param))) AS llr
    FROM T, design AS d

This represents a significant gain in flexibility and sophistication for any RDBMS.

Example: The Multinomial Distribution

The multinomial distribution extends the binomial distribution. Consider a random variable $X$ with $k$ discrete outcomes. These have probabilities $p = (p_1, \ldots, p_k)$. In $n$ trials, the joint probability distribution is given by

\[ P(X \mid p) = \binom{n}{n_1, \ldots, n_k} p_1^{n_1} \cdots p_k^{n_k}. \]

To obtain $p_i$, we assume a table outcome with column outcome representing the base population.

    CREATE VIEW B AS
    SELECT outcome,
           outcome_count / sum(outcome_count) over () AS p
    FROM (SELECT outcome, count(*)::numeric AS outcome_count
          FROM input
          GROUP BY outcome) AS a

In the context of model selection, it is often convenient to compare the same data set under two different multinomial distributions.

\[ LLR = -2 \log \frac{P(X \mid p)}{P(X \mid \tilde{p})} = -2 \log \left( \frac{\binom{n}{n_1, \ldots, n_k} p_1^{n_1} \cdots p_k^{n_k}}{\binom{n}{n_1, \ldots, n_k} \tilde{p}_1^{n_1} \cdots \tilde{p}_k^{n_k}} \right) = 2 \sum_i n_i \log \tilde{p}_i - 2 \sum_i n_i \log p_i. \]

Or in SQL: [...]

[...] discussed in the context of data integration [18]. But the focus in a MAD warehouse context is on massively parallel access to file data that lives on a local high-speed network.

Greenplum implements fully parallel access for both loading and query processing over external tables via a technique called Scatter/Gather Streaming. The idea is similar to traditional shared-nothing database internals [7], but requires coordination with external processes to "feed" all the DBMS nodes in parallel. As the data is streamed into the system it can be landed in database tables for subsequent access, or used directly as a purely external table with parallel I/O. Using this technology, Greenplum customers have reported loading speeds for a fully-mirrored, production database in excess of four terabytes per hour with negligible impact on concurrent database operations.

6.1.1 ETL and ELT

Traditional data warehousing is supported by custom tools for the Extract-Transform-Load (ETL) task. In recent years, there is increasing pressure to push the work of transformation into the DBMS, to enable parallel execution via SQL transformation scripts. This approach has been dubbed ELT since transformation is done after loading. The ELT approach becomes even more natural with external tables. Transformation queries can be written against external tables, removing the need to ever load untransformed data. This can speed up the design loop for transformations substantially, especially when combined with SQL's LIMIT clause as a "poor man's Online Aggregation" [11] to debug transformations.

In addition to transformations written in SQL, Greenplum supports MapReduce scripting in the DBMS, which can run over either external data via Scatter/Gather, or in-database tables (Section 6.3). This allows programmers to write transformation scripts in the dataflow-style programming used by many ETL tools, while running at scale using the DBMS' facilities for parallelism.

6.2 Data Evolution: Storage and Partitioning

The data lifecycle in a MAD warehouse includes data in various states. When a data source is first brought into the system, analysts will typically iterate over it frequently with significant analysis and transformation. As transformations and table definitions begin to settle for a particular data source, the workload looks more like traditional EDW settings: frequent appends to large "fact" tables, and occasional updates to "detail" tables. This mature data is likely to be used for ad-hoc analysis as well as for standard reporting tasks. As data in the "fact" tables ages over time, it may be accessed less frequently or even "rolled off" to an external archive. Note that all these stages co-occur in a single warehouse at a given time.

Hence a good DBMS for MAD analytics needs to support multiple storage mechanisms, targeted at different stages of the data lifecycle. In the early stage, external tables provide a lightweight approach to experiment with transformations. Detail tables are often modest in size and undergo periodic updates; they are well served by traditional transactional storage techniques. Append-mostly fact tables can be better served by compressed storage, which can handle appends and reads efficiently, at the expense of making updates slower. It should be possible to roll this data off of the warehouse as it ages, without disrupting ongoing processing.

Greenplum provides multiple storage engines, with a rich SQL partitioning specification to apply them flexibly across and within tables. As mentioned above, Greenplum includes external table support. Greenplum also provides a traditional "heap" storage format for data that sees frequent updates, and a highly-compressed "append-only" (AO) table feature for data that is not going to be updated; both are integrated within a transactional framework. Greenplum AO storage units can have a variety of compression modes. At one extreme, with compression off, bulk loads run very quickly. Alternatively, the most aggressive compression modes are tuned to use as little space as possible. There is also a middle ground with "medium" compression to provide improved table scan time at the expense of slightly slower loads. In a recent version Greenplum also adds "column-store" partitioning of append-only tables, akin to ideas in the literature [20]. This can improve compression, and ensures that queries over large archival tables only do I/O for the columns they need to see.

A DBA should be able to specify the storage mechanism to be used in a
flexible way. Greenplum supports many ways to partition tables in order to increase query and data load performance, as well as to aid in managing large data sets. The top-most layer of partitioning is a distribution policy specified via a DISTRIBUTED BY clause in the CREATE TABLE statement that determines how the rows of a table are distributed across the individual nodes that comprise a Greenplum cluster. While all tables have a distribution policy, users can optionally specify a partitioning policy for a table, which separates the data in the table into partitions by range or list. A range partitioning policy lets users specify an ordered, non-overlapping set of partitions for a partitioning column, where each partition has a START and END value. A list partitioning policy lets users specify a set of partitions for a collection of columns, where each partition corresponds to a particular value. For example, a sales table may be hash-distributed over the nodes by sales_id. On each node, the rows are further partitioned by range into separate partitions for each month, and each of these partitions is subpartitioned into three separate sales regions. Note that the partitioning structure is completely mutable: a user can add new partitions or drop existing partitions or subpartitions at any point.

Partitioning is important for a number of reasons. First, the query optimizer is aware of the partitioning structure, and can analyze predicates to perform partition exclusion: scanning only a subset of the partitions instead of the entire table. Second, each partition of a table can have a different storage format, to match the expected workload. A typical arrangement is to partition by a timestamp
field, and have older partitions be stored in a highly-compressed append-only format while newer, "hotter" partitions are stored in a more update-friendly format to accommodate auditing updates. Third, it enables atomic partition exchange. Rather than inserting data a row at a time, a user can use ETL or ELT to stage their data to a temporary table. After the data is scrubbed and transformed, they can use the ALTER TABLE ... EXCHANGE PARTITION command to bind the temporary table as a new partition of an existing table in a quick atomic operation. This capability makes partitioning particularly useful for businesses that perform bulk data loads on a daily, weekly, or monthly basis, especially if they drop or archive older data to keep some fixed size "window" of data online in the warehouse. The same idea also allows users to do physical migration of tables and storage format modifications in a way that mostly isolates production tables from loading and transformation overheads.

6.3 MAD Programming

Although MAD design favors quick import and frequent iteration over careful modeling, it is not intended to reject structured databases per se. As mentioned in Section 4, the structured data management features of a DBMS can be very useful for organizing experimental results, trial datasets, and experimental workflows. In fact, shops that use tools like Hadoop typically have DBMSs in addition, and/or evolve light database systems like Hive. But as we also note in Section 4, it is advantageous to unify the structured environment with the analysts' favorite programming environments.

Data analysts come from many walks of life. Some are experts in SQL, but many are not. Analysts that come from a scientific or mathematical background are typically trained in statistical packages like R, SAS, or Matlab. These are memory-bound, single-machine solutions, but they provide convenient abstractions for math programming, and access to libraries containing hundreds of statistical routines. Other analysts have facility with traditional programming languages like Java, Perl, and Python, but typically do not want to write parallel or I/O-centric code.

The kind of database extensibility pioneered by Postgres [19] is no longer an exotic DBMS feature: it is a key to modern data analytics, enabling code to run close to the data. To be inviting to a variety of programmers, a good DBMS extensibility interface should accommodate multiple languages. PostgreSQL has become quite powerful in this regard, supporting a wide range of extension languages including R, Python and Perl. Greenplum takes these interfaces and enables them to run data-parallel on a cluster. This does not provide automatic parallelism of course: developers must think through how their code works in a data-parallel environment without shared memory, as we did in Section 5.

In addition to work like ours to implement statistical methods in extensible SQL, there is a groundswell of effort to implement methods with the MapReduce programming paradigm popularized by Google [4] and Hadoop. From the perspective of programming language design, MapReduce and modern SQL are quite similar takes on parallelism: both are data-parallel programming models for shared-nothing architectures that provide extension hooks ("upcalls") to intercept individual tuples or sets of tuples within a dataflow. But as a cultural phenomenon, MapReduce has captured the interest of many developers interested in running large-scale analyses on Big Data, and is widely viewed as a more attractive programming environment than SQL. A MAD data warehouse needs to attract these programmers, and allow them to enjoy the familiarity of MapReduce programming in a context that both integrates with the rest of the data in the enterprise, and offers more sophisticated tools for managing data products.

Greenplum approached this challenge by implementing a MapReduce programming interface whose runtime engine is the same query executor used for SQL [9]. Users write Map and Reduce functions in familiar languages like Python, Perl, or R, and connect them up into MapReduce scripts via a simple configuration file. They can then execute these scripts via a command line interface that passes the configuration and MapReduce code to the DBMS, returning output to a configurable location: command line, files, or DBMS tables. The only required DBMS interaction is the specification of an IP address for the DBMS, and authentication credentials (user/password, PGP keys, etc.). Hence developers who are used to traditional open source tools continue to use their favorite code editors, source code management, and shell prompts; they do not need to learn about database utilities, SQL syntax, schema design, etc.

The Greenplum executor accesses files for MapReduce jobs via the same Scatter/Gather technique that it uses for external tables in SQL. In addition, Greenplum MapReduce scripts interoperate with all the features of the database, and vice versa. MapReduce scripts can use database tables or views as their inputs, and/or store their results as database tables that can be directly accessed via SQL. Hence complex pipelines can evolve that include some stages in SQL, and some in MapReduce syntax. Execution can be done entirely on demand, running the SQL and MapReduce stages in a pipeline, or via materialization of steps along the way either inside or outside the database. Programmers of different stripes can interoperate via familiar interfaces: database tables and views, or MapReduce input streams, incorporating a variety of languages for the Map and Reduce functions, and for SQL extension functions.

This kind of interoperability between programming metaphors is critical for MAD analytics. It attracts analysts, and hence data, to the warehouse. It provides agility to developers by facilitating familiar programming interfaces and enabling interoperability among programming styles. Finally, it allows analysts to do deep development using the best tools of the trade, including many domain specific modules written for the implementation languages.

In experience with a variety of Greenplum customers, we have found that developers comfortable with both SQL and MapReduce will choose among them flexibly for different tasks. For example, MapReduce has proved more convenient for writing ETL scripts on files where the input order is known and should be exploited in the transformation. MapReduce also makes it easy to specify transformations that take one input and produce multiple outputs; this is also common in ETL settings that "shred" input records and produce a stream of output tuples with mixed formats. SQL, surprisingly, has been more convenient than MapReduce for tasks involving graph data like web links and social networks, since most of the algorithms in that setting (PageRank, Clustering Coefficients, etc.) can be coded compactly as "self-joins" of a link table.

7. DIRECTIONS AND REFLECTIONS

The work in this paper resulted from a fairly quick, iterative discussion among data-centric people with varying job descriptions and training. The process of arriving at the paper's lessons echoed the lessons themselves. We did not design a document up front, but instead "got MAD": we brought many datapoints together, fostered quick iteration among multiple parties, and tried to dig deeply into details. As in MAD analysis, we expect to arrive at new questions and new conclusions as more data is brought to light. A few of the issues we are currently considering include the following: [...]
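
As an illustration of the remark in Section 6.3 that graph algorithms such as clustering coefficients can be coded compactly as self-joins of a link table, the following is a minimal sketch only, not the authors' implementation. It assumes an undirected simple graph (no self-loops, no duplicate edges), uses Python's built-in sqlite3 module as a single-node stand-in for a parallel DBMS, and the table name link and the function name clustering_coefficients are hypothetical names chosen for this example.

```python
import sqlite3


def clustering_coefficients(edges):
    """Return {node: local clustering coefficient} for an undirected simple graph.

    `edges` is an iterable of (a, b) pairs; the table and column names below
    are illustrative assumptions, not taken from the paper.
    """
    edges = list(edges)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE link (src INTEGER, dst INTEGER)")
    # Store each undirected edge in both directions so the self-joins
    # can walk an edge either way.
    conn.executemany(
        "INSERT INTO link VALUES (?, ?)",
        [(a, b) for a, b in edges] + [(b, a) for a, b in edges],
    )
    rows = conn.execute(
        """
        WITH deg AS (
            SELECT src AS node, COUNT(*) AS d
            FROM link
            GROUP BY src
        ),
        tri AS (
            -- Closed walks node -> x -> y -> node via two self-joins;
            -- each triangle through a node is seen in both orientations.
            SELECT a.src AS node, COUNT(*) / 2 AS t
            FROM link a
            JOIN link b ON a.dst = b.src
            JOIN link c ON b.dst = c.src AND c.dst = a.src
            GROUP BY a.src
        )
        SELECT deg.node,
               CASE WHEN deg.d < 2 THEN 0.0
                    ELSE 2.0 * COALESCE(tri.t, 0) / (deg.d * (deg.d - 1))
               END AS cc
        FROM deg
        LEFT JOIN tri ON tri.node = deg.node
        ORDER BY deg.node
        """
    ).fetchall()
    conn.close()
    return dict(rows)


if __name__ == "__main__":
    # One triangle (1, 2, 3) plus a pendant node 4 attached to node 3.
    print(clustering_coefficients([(1, 2), (2, 3), (1, 3), (3, 4)]))
```

The whole computation is two aggregations over the same link table, which is the sense in which such algorithms reduce to self-joins. Nodes with fewer than two neighbors are reported as 0.0 here, which is a convention chosen for the sketch.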
