Blink and It's Done: Interactive Queries on Very Large Data

Sameer Agarwal, UC Berkeley, sameerag@cs.berkeley.edu


Aurojit Panda, UC Berkeley, apanda@cs.berkeley.edu
Barzan Mozafari, MIT CSAIL, barzan@csail.mit.edu
Anand P. Iyer, UC Berkeley, api@cs.berkeley.edu
Samuel Madden, MIT CSAIL, madden@csail.mit.edu
Ion Stoica, UC Berkeley, istoica@cs.berkeley.edu



while minimizing response time (or error). More details about the optimization problem formulation can be found in [4].

For backward compatibility, BlinkDB seamlessly integrates with the Hive/Hadoop/HDFS [1] stack. BlinkDB can also run on Shark (Hive on Spark) [6], a framework that is backward compatible with Hive, both at the storage and language layers, and uses Spark [10] to reliably cache datasets in memory. As a result, a BlinkDB query that runs on samples stored in memory can take seconds rather than minutes.

BlinkDB is open source¹ and several on-line service companies have already expressed interest in using it.

¹ http://blinkdb.cs.berkeley.edu

In this demo, we will show BlinkDB running on 100 Amazon EC2 nodes, providing interactive query performance over a 10 TB dataset of browser sessions from an Internet company (similar to the Sessions table above). We will show a collection of queries focused on identifying problems in these log files. We will run BlinkDB as well as unmodified Hive and Shark and show that our system can provide bounded, approximate answers in a fraction of the time of the other systems. We will also allow attendees to issue their own queries to explore the data and BlinkDB's performance.

2. SYSTEM OVERVIEW

In this section, we describe the settings and assumptions in which BlinkDB is designed to operate, and provide an overview of its design and key components.

2.1 Setting and Assumptions

BlinkDB is designed to operate like a data warehouse, with one large "fact" table. This table may need to be joined with other "dimension" tables using foreign keys. In practice, dimension tables are significantly smaller and usually fit in the aggregate memory of the cluster. BlinkDB only creates stratified samples for the "fact" table and for the join columns of larger dimension tables.

Furthermore, since our workload is targeted at ad-hoc queries, rather than assuming that exact queries are known a priori, we assume that query templates (i.e., the set of columns used in WHERE and GROUP-BY clauses) remain fairly stable over time. We make use of this assumption when choosing which samples to create. This assumption has been empirically observed in a variety of real-world production workloads [3, 5] and is also true of the query trace we use for our primary evaluation (a 2-year query trace from Conviva Inc). We do not assume any prior knowledge of the specific values or predicates used in these clauses.

Finally, in this demonstration, we focus on a small set of aggregation operators: COUNT, SUM, MEAN, MEDIAN/QUANTILE. However, we support closed-form error estimates for any combination of these basic aggregates as well as any algebraic function that is mean-like and asymptotically normal, as described in [11].

2.2 Architecture

Fig. 1 shows the overall architecture of BlinkDB, which extends Hive [1] and Shark [6]. Shark is backwards compatible with Hive, and runs on Spark, a cluster computing framework that can cache inputs and intermediate data in memory. BlinkDB adds two major components: (1) a component to create and maintain samples, and (2) a component for predicting the query response time and accuracy and selecting a sample that best satisfies given constraints.

Figure 1: BlinkDB architecture.

2.2.1 Sample Creation and Maintenance

This component is responsible for creating and maintaining a set of uniform and stratified samples. We use uniform samples over the entire dataset to handle queries on columns with relatively uniform distributions, and stratified samples (on one or more columns) to handle queries on columns with less uniform distributions.

Samples are created and updated based on statistics collected from the underlying data (e.g., histograms) and historical query templates. BlinkDB creates and maintains a set of uniform samples and multiple sets of stratified samples. Sets of columns on which stratified samples should be built are decided using an optimization framework [4], which picks sets of column(s) that (i) are most useful for evaluating query templates in the workload, and (ii) exhibit the greatest skew, i.e., have distributions where rare values are likely to be excluded in a uniform sample. The set of samples is updated both with the arrival of new data and when the workload changes.

2.2.2 Run-time Sample Selection

To execute a query, we first select an optimal set of sample(s) that meet its accuracy or response time constraints. Such sample(s) are chosen using a combination of pre-computed statistics and by dynamically running the query on smaller samples to estimate the query's selectivity and complexity. This estimate helps the query optimizer pick an execution plan as well as the "best" sample(s) to run the query on, i.e., one(s) that can satisfy the user's error or response time constraints.

2.3 An Example

To illustrate how BlinkDB operates, consider the example shown in Figure 2. The table consists of five columns: SessionID, Genre, OS, City, and URL.

Figure 2: An example showing the samples for a table with five columns, and a given query workload.

Figure 2 shows a set of query templates and their relative frequencies. Given these templates and a storage budget, BlinkDB creates several samples based on the query templates and statistics about the data. These samples are organized in sample families, where each family contains multiple samples of different granularities. In our example, BlinkDB decides to create two sample families of stratified samples: one on City, and another one on (OS, URL).
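
The idea of a sample family stratified on City can be illustrated with a small sketch. This is not BlinkDB's implementation: the per-group cap, the row weights, the helper names (`stratified_sample`, `sample_family`, `estimate_count`), and the toy data (including the city names) are all assumptions made for illustration. The sketch keeps at most `cap` rows per distinct City value, so a rare city that a uniform sample could easily miss is always represented, and it builds several samples with different caps to stand in for the "different granularities" of one family.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, cap, rng):
    """Keep at most `cap` rows per distinct value of key(row).

    Each kept row is paired with a weight (group size / rows kept) so that
    weighted aggregates can be scaled back to the full table.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    sample = []
    for members in groups.values():
        kept = members if len(members) <= cap else rng.sample(members, cap)
        weight = len(members) / len(kept)
        sample.extend((row, weight) for row in kept)
    return sample


def sample_family(rows, key, caps, seed=0):
    """Build one stratified sample per cap; together they form a 'family'."""
    rng = random.Random(seed)
    return {cap: stratified_sample(rows, key, cap, rng) for cap in sorted(caps)}


def estimate_count(sample, predicate):
    """Estimate COUNT(*) over the full table from a weighted sample."""
    return sum(weight for row, weight in sample if predicate(row))


# Toy, heavily skewed Sessions table: 9,980 rows for one city, 20 for another.
rows = [
    {
        "SessionID": i,
        "Genre": "western",
        "OS": "Linux",
        "City": "New York" if i < 9980 else "Galena",
        "URL": "example.com",
    }
    for i in range(10000)
]

# A family stratified on City, with three granularities (hypothetical caps).
family = sample_family(rows, key=lambda r: r["City"], caps=[10, 100, 1000])

for cap, sample in family.items():
    rare = estimate_count(sample, lambda r: r["City"] == "Galena")
    print(cap, len(sample), rare)
```

A family on (OS, URL) would use the same functions with `key=lambda r: (r["OS"], r["URL"])`. Smaller caps give smaller samples and faster queries at the cost of wider error bars, which is the trade-off the run-time sample selection step has to navigate.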