Druid: A Real-time Analytical Data Store

Fangjin Yang, Metamarkets Group, Inc. (fangjin@metamarkets.com)
Eric Tschetter (echeddar@gmail.com)
Xavier Léauté, Metamarkets Group, Inc. (xavier@metamarkets.com)
Nelson Ray (ncray86@gmail.com)
Gian Merlino, Metamarkets Group, Inc. (gian@metamarkets.com)
Deep Ganguli, Metamarkets Group, Inc. (deep@metamarkets.com)

ABSTRACT

Druid is an open source¹ data store designed for real-time exploratory analytics on large data sets. The system combines a column-oriented storage layout, a distributed, shared-nothing architecture, and an advanced indexing structure to allow for the arbitrary exploration of billion-row tables with sub-second latencies. In this paper, we describe Druid's architecture, and detail how it supports fast aggregations, flexible filters, and low latency data ingestion.

Categories and Subject Descriptors

H.2.4 [Database Management]: Systems—Distributed databases

Keywords

distributed; real-time; fault-tolerant; highly available; open source; analytics; column-oriented; OLAP

1. INTRODUCTION

In recent years, the proliferation of internet technology has created a surge in machine-generated events. Individually, these events contain minimal useful information and are of low value. Given the time and resources required to extract meaning from large collections of events, many companies were willing to discard this data instead. Although infrastructure has been built to handle event-based data (e.g. IBM's Netezza [37], HP's Vertica [5], and EMC's Greenplum [29]), these products are largely sold at high price points and are only targeted towards those companies who can afford the offering.

A few years ago, Google introduced MapReduce [11] as their mechanism of leveraging commodity hardware to index the internet and analyze logs. The Hadoop [36] project soon followed and was largely patterned after the insights that came out of the original MapReduce paper. Hadoop is currently deployed in many organizations to store and analyze large amounts of log data. Hadoop has contributed much to helping companies convert their low-value
¹ http://druid.io/ and https://github.com/metamx/druid

SIGMOD'14, June 22–27, 2014, Snowbird, UT, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2376-5/14/06. http://dx.doi.org/10.1145/2588555.2595631

event streams into high-value aggregates for a variety of applications such as business intelligence and A-B testing.

As with many great systems, Hadoop has opened our eyes to a new space of problems. Specifically, Hadoop excels at storing and providing access to large amounts of data; however, it does not make any performance guarantees around how quickly that data can be accessed. Furthermore, although Hadoop is a highly available system, performance degrades under heavy concurrent load. Lastly, while Hadoop works well for storing data, it is not optimized for ingesting data and making that data immediately readable.

Early on in the development of the Metamarkets product, we ran into each of these issues and came to the realization that Hadoop is a great back-office, batch processing, and data warehousing system. However, as a company that has product-level guarantees around query performance and data availability in a highly concurrent environment (1000+ users), Hadoop wasn't going to meet our needs. We explored different solutions in the space, and after trying both Relational Database Management Systems and NoSQL architectures, we came to the conclusion that there was nothing in the open source world that could be fully leveraged for our requirements. We ended up creating Druid, an open source, distributed, column-oriented, real-time analytical data store. In many ways, Druid shares similarities with other OLAP systems [30, 35, 22], interactive query systems [28], main-memory databases [14], as well as widely known distributed data stores [7, 12, 23]. The distribution and query model also borrow ideas from current generation search infrastructure [25, 3, 4].

This paper describes the architecture of Druid, explores the various design decisions made in creating an always-on production system that powers a hosted service, and attempts to help inform anyone who faces a similar problem about a potential method of solving it. Druid is deployed in production at several technology companies².

² http://druid.io/druid.html

The structure of the paper is as follows: we first describe the problem in Section 2. Next, we detail the system architecture from the point of view of how data flows through the system in Section 3. We then discuss how and why data gets converted into a binary format in Section 4. We briefly describe the query API in Section 5 and present performance results in Section 6. Lastly, we leave off with our lessons from running Druid in production in Section 7, and related work in Section 8.

2. PROBLEM DEFINITION

Druid was originally designed to solve problems around ingesting and exploring large quantities of transactional events (log data). This form of timeseries data is commonly found in OLAP workflows, and the nature of the data tends to be very append-heavy.
For example, consider the data shown in Table 1.

Timestamp            | Page          | Username | Gender | City          | Characters Added | Characters Removed
2011-01-01T01:00:00Z | Justin Bieber | Boxer    | Male   | San Francisco | 1800             | 25
2011-01-01T01:00:00Z | Justin Bieber | Reach    | Male   | Waterloo      | 2912             | 42
2011-01-01T02:00:00Z | Ke$ha         | Helz     | Male   | Calgary       | 1953             | 17
2011-01-01T02:00:00Z | Ke$ha         | Xeno     | Male   | Taiyuan       | 3194             | 170

Table 1: Sample Druid data for edits that have occurred on Wikipedia.

Table 1 contains data for edits that have occurred on Wikipedia. Each time a user edits a page in Wikipedia, an event is generated that contains metadata about the edit. This metadata is comprised of 3 distinct components. First, there is a timestamp column indicating when the edit was made. Next, there are a set of dimension columns indicating various attributes about the edit such as the page that was edited, the user who made the edit, and the location of the user. Finally, there are a set of metric columns that contain values (usually numeric) that can be aggregated, such as the number of characters added or removed in an edit.

Our goal is to rapidly compute drill-downs and aggregates over this data. We want to answer questions like "How many edits were made on the page Justin Bieber from males in San Francisco?" and "What is the average number of characters that were added by people from Calgary over the span of a month?". We also want queries over any arbitrary combination of dimensions to return with sub-second latencies.

The need for Druid was motivated by the fact that existing open source Relational Database Management Systems (RDBMS) and NoSQL key/value stores were unable to provide a low latency data ingestion and query platform for interactive applications [40]. In the early days of Metamarkets, we were focused on building a hosted dashboard that would allow users to arbitrarily explore and visualize event streams. The data store powering the dashboard needed to return queries fast enough that the data visualizations built on top of it could provide users with an interactive experience.

In addition to the query latency needs, the system had to be multi-tenant and highly available. The Metamarkets product is used in a highly concurrent environment. Downtime is costly and many businesses cannot afford to wait if a system is unavailable in the face of software upgrades or network failure. Downtime for startups, who often lack proper internal operations management, can determine business success or failure.

Finally, another challenge that Metamarkets faced in its early days was to allow users and alerting systems to be able to make business decisions in "real-time". The time from when an event is created to when that event is queryable determines how fast interested parties are able to react to potentially catastrophic situations in their systems. Popular open source data warehousing systems such as Hadoop were unable to provide the sub-second data ingestion latencies we required.

The problems of data exploration, ingestion, and availability span multiple industries. Since Druid was open sourced in October 2012, it has been deployed as a video, network monitoring, operations monitoring, and online advertising analytics platform at multiple companies.
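To make the event model and the drill-down questions above concrete, the following is a minimal sketch (plain Python, not Druid code) that represents the Table 1 events as timestamp/dimension/metric records and computes one of the example aggregates:

    # Each event has a timestamp, a set of dimensions, and a set of metrics.
    events = [
        {"timestamp": "2011-01-01T01:00:00Z", "page": "Justin Bieber", "username": "Boxer",
         "gender": "Male", "city": "San Francisco", "added": 1800, "removed": 25},
        {"timestamp": "2011-01-01T01:00:00Z", "page": "Justin Bieber", "username": "Reach",
         "gender": "Male", "city": "Waterloo", "added": 2912, "removed": 42},
        {"timestamp": "2011-01-01T02:00:00Z", "page": "Ke$ha", "username": "Helz",
         "gender": "Male", "city": "Calgary", "added": 1953, "removed": 17},
        {"timestamp": "2011-01-01T02:00:00Z", "page": "Ke$ha", "username": "Xeno",
         "gender": "Male", "city": "Taiyuan", "added": 3194, "removed": 170},
    ]

    # "How many edits were made on the page Justin Bieber from males in San Francisco?"
    count = sum(1 for e in events
                if e["page"] == "Justin Bieber"
                and e["gender"] == "Male"
                and e["city"] == "San Francisco")
    print(count)  # 1 for the sample rows above

Druid's job is to answer this kind of filtered aggregate over billions of rows, over any combination of dimensions, with sub-second latency.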
3. ARCHITECTURE

A Druid cluster consists of different types of nodes, and each node type is designed to perform a specific set of tasks. We believe this design separates concerns and reduces the complexity of the overall system. The different node types operate fairly independently of each other and there is minimal interaction among them. Hence, intra-cluster communication failures have minimal impact on data availability.

To solve complex data analysis problems, the different node types come together to form a fully working system. The name Druid comes from the Druid class in many role-playing games: it is a shape-shifter, capable of taking on many different forms to fulfill various different roles in a group. The composition of and flow of data in a Druid cluster are shown in Figure 1.

Figure 1: An overview of a Druid cluster and the flow of data through the cluster.

3.1 Real-time Nodes

Real-time nodes encapsulate the functionality to ingest and query event streams. Events indexed via these nodes are immediately available for querying. The nodes are only concerned with events for some small time range and periodically hand off immutable batches of events they have collected over this small time range to other nodes in the Druid cluster that are specialized in dealing with batches of immutable events.

Real-time nodes leverage Zookeeper [19] for coordination with the rest of the Druid cluster. The nodes announce their online state and the data they serve in Zookeeper.

Real-time nodes maintain an in-memory index buffer for all incoming events. These indexes are incrementally populated as events are ingested and the indexes are also directly queryable. Druid behaves as a row store for queries on events that exist in this JVM heap-based buffer. To avoid heap overflow problems, real-time nodes persist their in-memory indexes to disk either periodically or after some maximum row limit is reached. This persist process converts data stored in the in-memory buffer to the column-oriented storage format described in Section 4. Each persisted index is immutable and real-time nodes load persisted indexes into off-heap memory such that they can still be queried. This process is described in detail in [33] and is illustrated in Figure 2.

On a periodic basis, each real-time node will schedule a background task that searches for all locally persisted indexes. The task merges these indexes together and builds an immutable block of data that contains all the events that have been ingested by a real-time node for some span of time. We refer to this block of data as a "segment". During the handoff stage, a real-time node uploads this segment to a permanent backup storage, typically a distributed file system such as S3 [12] or HDFS [36], which Druid refers to as "deep storage". The ingest, persist, merge, and handoff steps are fluid; there is no data loss during any of the processes.

Figure 3 illustrates the operations of a real-time node. The node starts at 13:37 and will only accept events for the current hour or the next hour. When events are ingested, the node announces that it is serving a segment of data for an interval from 13:00 to 14:00. Every 10 minutes (the persist period is configurable), the node will flush and persist its in-memory buffer to disk. Near the end of the hour, the node will likely see events for 14:00 to 15:00. When this occurs, the node prepares to serve data for the next hour and creates a new in-memory index. The node then announces that it is also serving a segment from 14:00 to 15:00. The node does not immediately merge persisted indexes from 13:00 to 14:00; instead it waits for a configurable window period for straggling events from 13:00 to 14:00 to arrive. This window period minimizes the risk of data loss from delays in event delivery. At the end of the window period, the node merges all persisted indexes from 13:00 to 14:00 into a single immutable segment and hands the segment off. Once this segment is loaded and queryable somewhere else in the Druid cluster, the real-time node flushes all information about the data it collected for 13:00 to 14:00 and unannounces that it is serving this data.
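The ingest-persist-merge-handoff cycle described above can be summarized with a short sketch. This is illustrative Python pseudocode, not the actual real-time node implementation; to_columnar, merge, and the period constants simply mirror the steps and configurable periods named in the text:

    import time

    PERSIST_PERIOD = 10 * 60    # flush the in-memory buffer every 10 minutes
    WINDOW_PERIOD = 10 * 60     # wait for straggling events before merging

    def to_columnar(rows):
        # stand-in for the persist step: pivot row dicts into per-column arrays
        return {k: [r[k] for r in rows] for k in rows[0]} if rows else {}

    def merge(indexes):
        # stand-in for the merge step: concatenate the per-column arrays
        merged = {}
        for idx in indexes:
            for col, values in idx.items():
                merged.setdefault(col, []).extend(values)
        return merged

    def run_realtime_interval(events, deep_storage_upload):
        """Ingest events for one time interval, persist periodically, then
        merge the persisted indexes and hand off a single immutable segment."""
        in_memory = []          # heap-based, row-oriented, directly queryable
        persisted = []          # immutable, column-oriented (off-heap in Druid)
        last_persist = time.time()
        for event in events:
            in_memory.append(event)
            if time.time() - last_persist >= PERSIST_PERIOD:
                persisted.append(to_columnar(in_memory))
                in_memory, last_persist = [], time.time()
        time.sleep(WINDOW_PERIOD)           # simplified window period for stragglers
        persisted.append(to_columnar(in_memory))
        segment = merge(persisted)
        deep_storage_upload(segment)        # handoff; the cluster then loads it from deep storage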
Figure 2: Real-time nodes buffer events to an in-memory index, which is regularly persisted to disk. On a periodic basis, persisted indexes are then merged together before getting handed off. Queries will hit both the in-memory and persisted indexes.

3.1.1 Availability and Scalability

Real-time nodes are a consumer of data and require a corresponding producer to provide the data stream. Commonly, for data durability purposes, a message bus such as Kafka [21] sits between the producer and the real-time node as shown in Figure 4. Real-time nodes ingest data by reading events from the message bus. The time from event creation to event consumption is ordinarily on the order of hundreds of milliseconds.

The purpose of the message bus in Figure 4 is two-fold. First, the message bus acts as a buffer for incoming events. A message bus such as Kafka maintains positional offsets indicating how far a consumer (a real-time node) has read in an event stream. Consumers can programmatically update these offsets. Real-time nodes update this offset each time they persist their in-memory buffers to disk. In a fail and recover scenario, if a node has not lost disk, it can reload all persisted indexes from disk and continue reading events from the last offset it committed. Ingesting events from a recently committed offset greatly reduces a node's recovery time. In practice, we see nodes recover from such failure scenarios in a few seconds.

The second purpose of the message bus is to act as a single endpoint from which multiple real-time nodes can read events. Multiple real-time nodes can ingest the same set of events from the bus, creating a replication of events. In a scenario where a node completely fails and loses disk, replicated streams ensure that no data is lost. A single ingestion endpoint also allows for data streams to be partitioned such that multiple real-time nodes each ingest a portion of a stream. This allows additional real-time nodes to be seamlessly added. In practice, this model has allowed one of the largest production Druid clusters to consume raw data at approximately 500 MB/s (150,000 events/s or 2 TB/hour).
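As an illustration of the recovery property described above, here is a hedged sketch of the consume-persist-commit loop. It is written against a generic consumer interface rather than the actual Kafka client API; read, commit_offset, and persist_to_disk are stand-ins for whatever the message bus client and the node's persist path provide:

    def consume(bus, persist_batch_size=500_000):
        """Read events from a message bus, persist periodically, and only
        commit the read offset after a successful persist. After a crash,
        the node replays from the last committed offset, so persisted data
        is not re-read and unpersisted data is not lost."""
        buffer = []
        while True:
            event, offset = bus.read()          # stand-in for the client poll call
            buffer.append(event)
            if len(buffer) >= persist_batch_size:
                persist_to_disk(buffer)         # durable, column-oriented persist
                bus.commit_offset(offset)       # only now is the offset advanced
                buffer = []

    def persist_to_disk(rows):
        # stand-in: in a real node this writes an immutable columnar index
        pass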
Figure 3: The node starts, ingests data, persists, and periodically hands data off. This process repeats indefinitely. The time periods between different real-time node operations are configurable.

Figure 4: Multiple real-time nodes can read from the same message bus. Each node maintains its own offset.

3.2 Historical Nodes

Historical nodes encapsulate the functionality to load and serve the immutable blocks of data (segments) created by real-time nodes. In many real-world workflows, most of the data loaded in a Druid cluster is immutable and hence, historical nodes are typically the main workers of a Druid cluster. Historical nodes follow a shared-nothing architecture and there is no single point of contention among the nodes. The nodes have no knowledge of one another and are operationally simple; they only know how to load, drop, and serve immutable segments.

Similar to real-time nodes, historical nodes announce their online state and the data they are serving in Zookeeper. Instructions to load and drop segments are sent over Zookeeper and contain information about where the segment is located in deep storage and how to decompress and process the segment. Before a historical node downloads a particular segment from deep storage, it first checks a local cache that maintains information about what segments already exist on the node. If information about a segment is not present in the cache, the historical node will proceed to download the segment from deep storage. This process is shown in Figure 5. Once processing is complete, the segment is announced in Zookeeper. At this point, the segment is queryable. The local cache also allows historical nodes to be quickly updated and restarted. On startup, the node examines its cache and immediately serves whatever data it finds.

Figure 5: Historical nodes download immutable segments from deep storage. Segments must be loaded in memory before they can be queried.

Historical nodes can support read consistency because they only deal with immutable data. Immutable data blocks also enable a simple parallelization model: historical nodes can concurrently scan and aggregate immutable blocks without blocking.

3.2.1 Tiers

Historical nodes can be grouped in different tiers, where all nodes in a given tier are identically configured. Different performance and fault-tolerance parameters can be set for each tier. The purpose of tiered nodes is to enable higher or lower priority segments to be distributed according to their importance. For example, it is possible to spin up a "hot" tier of historical nodes that have a high number of cores and large memory capacity. The "hot" cluster can be configured to download more frequently accessed data. A parallel "cold" cluster can also be created with much less powerful backing hardware. The "cold" cluster would only contain less frequently accessed segments.

3.2.2 Availability

Historical nodes depend on Zookeeper for segment load and unload instructions. Should Zookeeper become unavailable, historical nodes are no longer able to serve new data or drop outdated data; however, because queries are served over HTTP, historical nodes are still able to respond to query requests for the data they are currently serving. This means that Zookeeper outages do not impact current data availability on historical nodes.

3.3 Broker Nodes

Broker nodes act as query routers to historical and real-time nodes. Broker nodes understand the metadata published in Zookeeper about what segments are queryable and where those segments are located. Broker nodes route incoming queries such that the queries hit the right historical or real-time nodes. Broker nodes also merge partial results from historical and real-time nodes before returning a final consolidated result to the caller.
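A minimal sketch of the broker's scatter/gather behavior, assuming a simple count aggregation and an in-memory view of which node serves which segment (the real broker learns this from Zookeeper; interval pruning and partial-result caching are omitted here):

    def broker_query(query, segment_locations, send):
        """Scatter a query to the nodes serving the relevant segments and
        gather the per-segment partial results into one consolidated result.

        segment_locations: dict mapping segment id -> node address
        send(node, query, segments): returns partial row counts per segment
        """
        # 1. Map the query to the segments it covers (interval pruning omitted).
        segments = list(segment_locations)

        # 2. Group segments by the node that serves them.
        by_node = {}
        for seg in segments:
            by_node.setdefault(segment_locations[seg], []).append(seg)

        # 3. Scatter the query, then merge partial results (a simple sum here).
        total = 0
        for node, segs in by_node.items():
            partials = send(node, query, segs)   # e.g. {"seg_1": 120, "seg_2": 87}
            total += sum(partials.values())
        return {"rows": total}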
3.3.1 Caching

Broker nodes contain a cache with an LRU [31, 20] invalidation strategy. The cache can use local heap memory or an external distributed key/value store such as Memcached [16]. Each time a broker node receives a query, it first maps the query to a set of segments. Results for certain segments may already exist in the cache and there is no need to recompute them. For any results that do not exist in the cache, the broker node will forward the query to the correct historical and real-time nodes. Once historical nodes return their results, the broker will cache these results on a per-segment basis for future use. This process is illustrated in Figure 6. Real-time data is never cached and hence requests for real-time data will always be forwarded to real-time nodes. Real-time data is perpetually changing and caching the results is unreliable.

The cache also acts as an additional level of data durability. In the event that all historical nodes fail, it is still possible to query results if those results already exist in the cache.

3.3.2 Availability

In the event of a total Zookeeper outage, data is still queryable. If broker nodes are unable to communicate to Zookeeper, they use their last known view of the cluster and continue to forward queries to real-time and historical nodes. Broker nodes make the assumption that the structure of the cluster is the same as it was before the outage. In practice, this availability model has allowed our Druid cluster to continue serving queries for a significant period of time while we diagnosed Zookeeper outages.

3.4 Coordinator Nodes

Druid coordinator nodes are primarily in charge of data management and distribution on historical nodes. The coordinator nodes tell historical nodes to load new data, drop outdated data, replicate data, and move data to load balance. Druid uses a multi-version concurrency control swapping protocol for managing immutable segments in order to maintain stable views. If any immutable segment contains data that is wholly obsoleted by newer segments, the outdated segment is dropped from the cluster. Coordinator nodes undergo a leader-election process that determines a single node that runs the coordinator functionality. The remaining coordinator nodes act as redundant backups.

A coordinator node runs periodically to determine the current state of the cluster. It makes decisions by comparing the expected state of the cluster with the actual state of the cluster at the time of the run. As with all Druid nodes, coordinator nodes maintain a Zookeeper connection for current cluster information. Coordinator nodes also maintain a connection to a MySQL database that contains additional operational parameters and configurations. One of the key pieces of information located in the MySQL database is a table that contains a list of all segments that should be served by historical nodes. This table can be updated by any service that creates segments, for example, real-time nodes. The MySQL database also contains a rule table that governs how segments are created, destroyed, and replicated in the cluster.

3.4.1 Rules

Rules govern how historical segments are loaded and dropped from the cluster. Rules indicate how segments should be assigned to different historical node tiers and how many replicates of a segment should exist in each tier. Rules may also indicate when segments should be dropped entirely from the cluster. Rules are usually set for a period of time. For example, a user may use rules to load the most recent one month's worth of segments into a "hot" cluster, the most recent one year's worth of segments into a "cold" cluster, and drop any segments that are older.

The coordinator nodes load a set of rules from a rule table in the MySQL database. Rules may be specific to a certain data source and/or a default set of rules may be configured. The coordinator node will cycle through all available segments and match each segment with the first rule that applies to it.
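The example rule set described above could look roughly like the following sketch, written as Python data for consistency with the other sketches. The field names (type, period, tier, replicants) are illustrative rather than guaranteed Druid rule syntax; the point is the ordering semantics, where each segment is matched against the first rule that applies to it:

    rules = [
        # keep the most recent month in the "hot" tier, two copies per segment
        {"type": "loadByPeriod", "period": "P1M", "tier": "hot",  "replicants": 2},
        # keep the most recent year in the "cold" tier, one copy per segment
        {"type": "loadByPeriod", "period": "P1Y", "tier": "cold", "replicants": 1},
        # anything older is dropped from the cluster (it remains in deep storage)
        {"type": "dropForever"},
    ]

    def first_matching_rule(segment, rules, applies):
        """The coordinator matches each segment against the first rule that
        applies to it; 'applies' is a predicate comparing the segment's time
        interval with the rule's period (omitted in this sketch)."""
        for rule in rules:
            if applies(rule, segment):
                return rule
        return None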
3.4.2 Load Balancing

In a typical production environment, queries often hit dozens or even hundreds of segments. Since each historical node has limited resources, segments must be distributed among the cluster to ensure that the cluster load is not too imbalanced. Determining optimal load distribution requires some knowledge about query patterns and speeds. Typically, queries cover recent segments spanning contiguous time intervals for a single data source. On average, queries that access smaller segments are faster.

These query patterns suggest replicating recent historical segments at a higher rate, spreading out large segments that are close in time to different historical nodes, and co-locating segments from different data sources. To optimally distribute and balance segments among the cluster, we developed a cost-based optimization procedure that takes into account the segment data source, recency, and size. The exact details of the algorithm are beyond the scope of this paper and may be discussed in future literature.

3.4.3 Replication

Coordinator nodes may tell different historical nodes to load a copy of the same segment. The number of replicates in each tier of the historical compute cluster is fully configurable. Setups that require high levels of fault tolerance can be configured to have a high number of replicas. Replicated segments are treated the same as the originals and follow the same load distribution algorithm. By replicating segments, single historical node failures are transparent in the Druid cluster. We use this property for software upgrades. We can seamlessly take a historical node offline, update it, bring it back up, and repeat the process for every historical node in a cluster. Over the last two years, we have never taken downtime in our Druid cluster for software upgrades.

3.4.4 Availability

Druid coordinator nodes have Zookeeper and MySQL as external dependencies. Coordinator nodes rely on Zookeeper to determine what historical nodes already exist in the cluster. If Zookeeper becomes unavailable, the coordinator will no longer be able to send instructions to assign, balance, and drop segments. However, these operations do not affect data availability at all.

The design principle for responding to MySQL and Zookeeper failures is the same: if an external dependency responsible for coordination fails, the cluster maintains the status quo. Druid uses MySQL to store operational management information and segment metadata about what segments should exist in the cluster. If MySQL goes down, this information becomes unavailable to coordinator nodes. However, this does not mean data itself is unavailable. If coordinator nodes cannot communicate to MySQL, they will cease to assign new segments and drop outdated ones. Broker, historical, and real-time nodes are still queryable during MySQL outages.

4. STORAGE FORMAT

Data tables in Druid (called "data sources") are collections of timestamped events partitioned into a set of segments, where each segment is typically 5–10 million rows. Formally, we define a segment as a collection of rows of data that span some period of time. Segments represent the fundamental storage unit in Druid, and replication and distribution are done at a segment level.
Figure 6: Results are cached per segment. Queries combine cached results with results computed on historical and real-time nodes.

Druid always requires a timestamp column as a method of simplifying data distribution policies, data retention policies, and first-level query pruning. Druid partitions its data sources into well-defined time intervals, typically an hour or a day, and may further partition on values from other columns to achieve the desired segment size. The time granularity to partition segments is a function of data volume and time range. A data set with timestamps spread over a year is better partitioned by day, and a data set with timestamps spread over a day is better partitioned by hour.

Segments are uniquely identified by a data source identifier, the time interval of the data, and a version string that increases whenever a new segment is created. The version string indicates the freshness of segment data; segments with later versions have newer views of data (over some time range) than segments with older versions. This segment metadata is used by the system for concurrency control; read operations always access data in a particular time range from the segments with the latest version identifiers for that time range.

Druid segments are stored in a column orientation. Given that Druid is best used for aggregating event streams (all data going into Druid must have a timestamp), the advantages of storing aggregate information as columns rather than rows are well documented [1]. Column storage allows for more efficient CPU usage as only what is needed is actually loaded and scanned. In a row-oriented data store, all columns associated with a row must be scanned as part of an aggregation. The additional scan time can introduce significant performance degradations [1].

Druid has multiple column types to represent various data formats. Depending on the column type, different compression methods are used to reduce the cost of storing a column in memory and on disk. In the example given in Table 1, the page, user, gender, and city columns only contain strings. Storing strings directly is unnecessarily costly and string columns can be dictionary encoded instead. Dictionary encoding is a common method to compress data and has been used in other data stores such as PowerDrill [17]. In the example in Table 1, we can map each page to a unique integer identifier.

    Justin Bieber -> 0
    Ke$ha -> 1

This mapping allows us to represent the page column as an integer array where the array indices correspond to the rows of the original data set. For the page column, we can represent the unique pages as follows:

    [0, 0, 1, 1]

The resulting integer array lends itself very well to compression methods. Generic compression algorithms on top of encodings are extremely common in column stores. Druid uses the LZF [24] compression algorithm.

Similar compression methods can be applied to numeric columns. For example, the characters added and characters removed columns in Table 1 can also be expressed as individual arrays:

    Characters Added -> [1800, 2912, 1953, 3194]
    Characters Removed -> [25, 42, 17, 170]

In this case, we compress the raw values as opposed to their dictionary representations.
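A minimal sketch (plain Python, not Druid's actual column writer) of the dictionary encoding step applied to the page column from Table 1:

    def dictionary_encode(values):
        """Map each distinct string to a small integer id and return the
        dictionary plus the column rewritten as an integer array."""
        ids = {}
        encoded = []
        for v in values:
            if v not in ids:
                ids[v] = len(ids)     # assign ids in order of first appearance
            encoded.append(ids[v])
        return ids, encoded

    pages = ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]
    ids, encoded = dictionary_encode(pages)
    print(ids)      # {'Justin Bieber': 0, 'Ke$ha': 1}
    print(encoded)  # [0, 0, 1, 1]
    # The integer array (and, separately, raw numeric columns such as
    # characters added) can then be block-compressed, e.g. with LZF.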
Figure 7: Integer array size versus Concise set size.

4.1 Indices for Filtering Data

In many real world OLAP workflows, queries are issued for the aggregated results of some set of metrics where some set of dimension specifications are met. An example query is: "How many Wikipedia edits were done by users in San Francisco who are also male?" This query is filtering the Wikipedia data set in Table 1 based on a Boolean expression of dimension values. In many real world data sets, dimension columns contain strings and metric columns contain numeric values. Druid creates additional lookup indices for string columns such that only those rows that pertain to a particular query filter are ever scanned.

Let us consider the page column in Table 1. For each unique page in Table 1, we can form some representation indicating in which table rows a particular page is seen. We can store this information in a binary array where the array indices represent our rows. If a particular page is seen in a certain row, that array index is marked as 1. For example:

    Justin Bieber -> rows [0, 1] -> [1][1][0][0]
    Ke$ha -> rows [2, 3] -> [0][0][1][1]

Justin Bieber is seen in rows 0 and 1. This mapping of column values to row indices forms an inverted index [39]. To know which rows contain Justin Bieber or Ke$ha, we can OR together the two arrays:

    [1][1][0][0] OR [0][0][1][1] = [1][1][1][1]

This approach of performing Boolean operations on large bitmap sets is commonly used in search engines. Bitmap indices for OLAP workloads are described in detail in [32]. Bitmap compression algorithms are a well-defined area of research [2, 44, 42] and often utilize run-length encoding. Druid opted to use the Concise algorithm [10]. Figure 7 illustrates the number of bytes using Concise compression versus using an integer array. The results were generated on a cc2.8xlarge system with a single thread, 2G heap, 512m young gen, and a forced GC between each run. The data set is a single day's worth of data collected from the Twitter garden hose [41] data stream. The data set contains 2,272,295 rows and 12 dimensions of varying cardinality. As an additional comparison, we also resorted the data set rows to maximize compression.

In the unsorted case, the total Concise size was 53,451,144 bytes and the total integer array size was 127,248,520 bytes. Overall, Concise compressed sets are about 42% smaller than integer arrays. In the sorted case, the total Concise compressed size was 43,832,884 bytes and the total integer array size was 127,248,520 bytes. What is interesting to note is that after sorting, global compression only increased minimally.

4.2 Storage Engine

Druid's persistence components allow for different storage engines to be plugged in, similar to Dynamo [12]. These storage engines may store data in an entirely in-memory structure such as the JVM heap or in memory-mapped structures. The ability to swap storage engines allows for Druid to be configured depending on a particular application's specifications. An in-memory storage engine may be operationally more expensive than a memory-mapped storage engine but could be a better alternative if performance is critical. By default, a memory-mapped storage engine is used.

When using a memory-mapped storage engine, Druid relies on the operating system to page segments in and out of memory. Given that segments can only be scanned if they are loaded in memory, a memory-mapped storage engine allows recent segments to be retained in memory whereas segments that are never queried are paged out. The main drawback with using the memory-mapped storage engine is when a query requires more segments to be paged into memory than a given node has capacity for. In this case, query performance will suffer from the cost of paging segments in and out of memory.
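To tie the dictionary encoding and the inverted indices of Section 4.1 together, here is a small illustrative sketch (uncompressed bitmaps rather than Druid's Concise-based implementation) that builds one bitmap per dimension value and evaluates an OR filter over the Table 1 rows:

    def build_inverted_index(column):
        """Map each distinct value to a bitmap (list of 0/1) marking the rows
        in which that value appears."""
        n = len(column)
        index = {}
        for row, value in enumerate(column):
            bitmap = index.setdefault(value, [0] * n)
            bitmap[row] = 1
        return index

    pages = ["Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha"]
    index = build_inverted_index(pages)
    # index["Justin Bieber"] == [1, 1, 0, 0]; index["Ke$ha"] == [0, 0, 1, 1]

    # Rows matching page = "Justin Bieber" OR page = "Ke$ha":
    matches = [a | b for a, b in zip(index["Justin Bieber"], index["Ke$ha"])]
    print(matches)  # [1, 1, 1, 1] -- only these rows would be scanned
    # In Druid the bitmaps are compressed (Concise, run-length based) rather
    # than stored as raw arrays.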
5. QUERY API

Druid has its own query language and accepts queries as POST requests. Broker, historical, and real-time nodes all share the same query API.

The body of the POST request is a JSON object containing key-value pairs specifying various query parameters. A typical query will contain the data source name, the granularity of the result data, the time range of interest, the type of request, and the metrics to aggregate over. The result will also be a JSON object containing the aggregated metrics over the time period.

Most query types will also support a filter set. A filter set is a Boolean expression of dimension name and value pairs. Any number and combination of dimensions and values may be specified. When a filter set is provided, only the subset of the data that pertains to the filter set will be scanned. The ability to handle complex nested filter sets is what enables Druid to drill into data at any depth.

The exact query syntax depends on the query type and the information requested. A sample count query over a week of data is as follows:

    {
      "queryType"    : "timeseries",
      "dataSource"   : "wikipedia",
      "intervals"    : "2013-01-01/2013-01-08",
      "filter"       : {
        "type"       : "selector",
        "dimension"  : "page",
        "value"      : "Ke$ha"
      },
      "granularity"  : "day",
      "aggregations" : [{"type" : "count", "name" : "rows"}]
    }

The query shown above will return a count of the number of rows in the Wikipedia data source from 2013-01-01 to 2013-01-08, filtered for only those rows where the value of the "page" dimension is equal to "Ke$ha". The results will be bucketed by day and will be a JSON array of the following form:

    [
      {
        "timestamp" : "2013-01-01T00:00:00.000Z",
        "result"    : {"rows" : 393298}
      },
      {
        "timestamp" : "2013-01-02T00:00:00.000Z",
        "result"    : {"rows" : 382932}
      },
      ...
      {
        "timestamp" : "2013-01-07T00:00:00.000Z",
        "result"    : {"rows" : 1337}
      }
    ]

Druid supports many types of aggregations including sums on floating-point and integer types, minimums, maximums, and complex aggregations such as cardinality estimation and approximate quantile estimation. The results of aggregations can be combined in mathematical expressions to form other aggregations. It is beyond the scope of this paper to fully describe the query API but more information can be found online³.

³ http://druid.io/docs/latest/Querying.html

As of this writing, a join query for Druid is not yet implemented. This has been a function of engineering resource allocation and use case decisions more than a decision driven by technical merit. Indeed, Druid's storage format would allow for the implementation of joins (there is no loss of fidelity for columns included as dimensions) and the implementation of them has been a conversation that we have every few months. To date, we have made the choice that the implementation cost is not worth the investment for our organization. The reasons for this decision are generally two-fold.

1. Scaling join queries has been, in our professional experience, a constant bottleneck of working with distributed databases.
2. The incremental gains in functionality are perceived to be of less value than the anticipated problems with managing highly concurrent, join-heavy workloads.

A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary high-level strategies for join queries we are aware of are a hash-based strategy or a sorted-merge strategy. The hash-based strategy requires that all but one data set be available as something that looks like a hash table; a lookup operation is then performed on this hash table for every row in the "primary" stream. The sorted-merge strategy assumes that each stream is sorted by the join key and thus allows for the incremental joining of the streams. Each of these strategies, however, requires the materialization of some number of the streams either in sorted order or in a hash table form.

When all sides of the join are significantly large tables (> 1 billion records), materializing the pre-join streams requires complex distributed memory management. The complexity of the memory management is only amplified by the fact that we are targeting highly concurrent, multitenant workloads. This is, as far as we are aware, an active academic research problem that we would be willing to help resolve in a scalable manner.
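For completeness, a hedged sketch of how a client might issue the sample query above over HTTP from Python. The broker host, port, and endpoint path are illustrative and depend on the deployment; only the POST-a-JSON-body mechanic is the point:

    import json
    import urllib.request

    query = {
        "queryType": "timeseries",
        "dataSource": "wikipedia",
        "intervals": "2013-01-01/2013-01-08",
        "filter": {"type": "selector", "dimension": "page", "value": "Ke$ha"},
        "granularity": "day",
        "aggregations": [{"type": "count", "name": "rows"}],
    }

    # Hypothetical broker address; substitute your broker node's host, port, and path.
    req = urllib.request.Request(
        "http://broker.example.com:8080/druid/v2/",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        results = json.load(resp)   # JSON array of {"timestamp", "result"} buckets
    for bucket in results:
        print(bucket["timestamp"], bucket["result"]["rows"])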
6. PERFORMANCE

Druid runs in production at several organizations, and to demonstrate its performance, we have chosen to share some real world numbers for the main production cluster running at Metamarkets as of early 2014. For comparison with other databases we also include results from synthetic workloads on TPC-H data.

Data Source | Dimensions | Metrics
a           | 25         | 21
b           | 30         | 26
c           | 71         | 35
d           | 60         | 19
e           | 29         | 8
f           | 30         | 16
g           | 26         | 18
h           | 78         | 14

Table 2: Characteristics of production data sources.

6.1 Query Performance in Production

Druid query performance can vary significantly depending on the query being issued. For example, sorting the values of a high cardinality dimension based on a given metric is much more expensive than a simple count over a time range. To showcase the average query latencies in a production Druid cluster, we selected 8 of our most queried data sources, described in Table 2.

Approximately 30% of queries are standard aggregates involving different types of metrics and filters, 60% of queries are ordered group bys over one or more dimensions with aggregates, and 10% of queries are search queries and metadata retrieval queries. The number of columns scanned in aggregate queries roughly follows an exponential distribution. Queries involving a single column are very frequent, and queries involving all columns are very rare.

A few notes about our results:

- The results are from a "hot" tier in our production cluster. There were approximately 50 data sources in the tier and several hundred users issuing queries.
- There was approximately 10.5TB of RAM available in the "hot" tier and approximately 10TB of segments loaded. Collectively, there are about 50 billion Druid rows in this tier. Results for every data source are not shown.
- The hot tier uses Intel® Xeon® E5-2670 processors and consists of 1302 processing threads and 672 total cores (hyperthreaded).
- A memory-mapped storage engine was used (the machine was configured to memory map the data instead of loading it into the Java heap).

Query latencies are shown in Figure 8 and the queries per minute are shown in Figure 9. Across all the various data sources, average query latency is approximately 550 milliseconds, with 90% of queries returning in less than 1 second, 95% in under 2 seconds, and 99% of queries returning in less than 10 seconds. Occasionally we observe spikes in latency, as observed on February 19, where network issues on the Memcached instances were compounded by very high query load on one of our largest data sources.

Figure 8: Query latencies of production data sources.

Figure 9: Queries per minute of production data sources.

6.2 Query Benchmarks on TPC-H Data

We also present Druid benchmarks on TPC-H data. Most TPC-H queries do not directly apply to Druid, so we selected queries more typical of Druid's workload to demonstrate query performance. As a comparison, we also provide the results of the same queries using MySQL with the MyISAM engine (InnoDB was slower in our experiments).

We selected MySQL to benchmark against because of its universal popularity. We chose not to select another open source column store because we were not confident we could correctly tune it for optimal performance.

Our Druid setup used Amazon EC2 m3.2xlarge instance types (Intel® Xeon® E5-2680 v2 @ 2.80GHz) for historical nodes and c3.2xlarge instances (Intel® Xeon® E5-2670 v2 @ 2.50GHz) for broker nodes. Our MySQL setup was an Amazon RDS instance that ran on the same m3.2xlarge instance type.

Figure 10: Druid & MySQL benchmarks – 1GB TPC-H data.
Figure 11: Druid & MySQL benchmarks – 100GB TPC-H data.

The results for the 1GB TPC-H data set are shown in Figure 10 and the results of the 100GB data set are shown in Figure 11. We benchmarked Druid's scan rate at 53,539,211 rows/second/core for a select count(*) equivalent query over a given time interval and 36,246,530 rows/second/core for a select sum(float) type query.

Finally, we present our results of scaling Druid to meet increasing data volumes with the TPC-H 100GB data set. We observe that when we increased the number of cores from 8 to 48, not all types of queries achieve linear scaling, but the simpler aggregation queries do, as shown in Figure 12. The increase in speed of a parallel computing system is often limited by the time needed for the sequential operations of the system. In this case, queries requiring a substantial amount of work at the broker level do not parallelize as well.

Figure 12: Druid scaling benchmarks – 100GB TPC-H data.

6.3 Data Ingestion Performance

To showcase Druid's data ingestion latency, we selected several production data sources of varying dimensions, metrics, and event volumes. Our production ingestion setup consists of 6 nodes, totalling 360GB of RAM and 96 cores (12 x Intel® Xeon® E5-2670).

Data Source | Dimensions | Metrics | Peak events/s
s           | 7          | 2       | 28334.60
t           | 10         | 7       | 68808.70
u           | 5          | 1       | 49933.93
v           | 30         | 10      | 22240.45
w           | 35         | 14      | 135763.17
x           | 28         | 6       | 46525.85
y           | 33         | 24      | 162462.41
z           | 33         | 24      | 95747.74

Table 3: Ingestion characteristics of various data sources.

Note that in this setup, several other data sources were being ingested and many other Druid related ingestion tasks were running concurrently on the machines.

Druid's data ingestion latency is heavily dependent on the complexity of the data set being ingested. The data complexity is determined by the number of dimensions in each event, the number of metrics in each event, and the types of aggregations we want to perform on those metrics. With the most basic data set (one that only has a timestamp column), our setup can ingest data at a rate of 800,000 events/second/core, which is really just a measurement of how fast we can deserialize events. Real world data sets are never this simple. Table 3 shows a selection of data sources and their characteristics.

We can see that, based on the descriptions in Table 3, latencies vary significantly and the ingestion latency is not always a factor of the number of dimensions and metrics. We see some lower latencies on simple data sets because that was the rate that the data producer was delivering data. The results are shown in Figure 13. We define throughput as the number of events a real-time node can ingest and also make queryable. If too many events are sent to the real-time node, those events are blocked until the real-time node has capacity to accept them. The peak ingestion rate we measured in production was 22914.43 events/second/core on a data source with 30 dimensions and 19 metrics, running on an Amazon cc2.8xlarge instance.
Figure 13: Combined cluster ingestion rates.

The latency measurements we presented are sufficient to address the stated problems of interactivity. We would prefer the variability in the latencies to be less. It is still possible to decrease latencies by adding additional hardware, but we have not chosen to do so because infrastructure costs are still a consideration for us.

7. DRUID IN PRODUCTION

Over the last few years, we have gained tremendous knowledge about handling production workloads with Druid and have made a couple of interesting observations.

Query Patterns. Druid is often used to explore data and generate reports on data. In the explore use case, the number of queries issued by a single user is much higher than in the reporting use case. Exploratory queries often involve progressively adding filters for the same time range to narrow down results. Users tend to explore short time intervals of recent data. In the generate report use case, users query for much longer data intervals, but those queries are generally few and pre-determined.

Multitenancy. Expensive concurrent queries can be problematic in a multitenant environment. Queries for large data sources may end up hitting every historical node in a cluster and consume all cluster resources. Smaller, cheaper queries may be blocked from executing in such cases. We introduced query prioritization to address these issues. Each historical node is able to prioritize which segments it needs to scan. Proper query planning is critical for production workloads. Thankfully, queries for a significant amount of data tend to be for reporting use cases and can be deprioritized. Users do not expect the same level of interactivity in this use case as when they are exploring data.

Node failures. Single node failures are common in distributed environments, but many nodes failing at once are not. If historical nodes completely fail and do not recover, their segments need to be reassigned, which means we need excess cluster capacity to load this data. The amount of additional capacity to have at any time contributes to the cost of running a cluster. From our experience, it is extremely rare to see more than 2 nodes completely fail at once and hence, we leave enough capacity in our cluster to completely reassign the data from 2 historical nodes.

Data Center Outages. Complete cluster failures are possible, but extremely rare. If Druid is only deployed in a single data center, it is possible for the entire data center to fail. In such cases, new machines need to be provisioned. As long as deep storage is still available, cluster recovery time is network bound, as historical nodes simply need to redownload every segment from deep storage. We have experienced such failures in the past, and the recovery time was several hours in the Amazon AWS ecosystem for several terabytes of data.

7.1 Operational Monitoring

Proper monitoring is critical to run a large scale distributed cluster. Each Druid node is designed to periodically emit a set of operational metrics. These metrics may include system level data such as CPU usage, available memory, and disk capacity; JVM statistics such as garbage collection time and heap usage; or node specific metrics such as segment scan time, cache hit rates, and data ingestion latencies. Druid also emits per query metrics.

We emit metrics from a production Druid cluster and load them into a dedicated metrics Druid cluster. The metrics Druid cluster is used to explore the performance and stability of the production cluster. This dedicated metrics cluster has allowed us to find numerous production problems, such as gradual query speed degradations, less than optimally tuned hardware, and various other system bottlenecks. We also use a metrics cluster to analyze what queries are made in production and what aspects of the data users are most interested in.

7.2 Pairing Druid with a Stream Processor

Currently, Druid can only understand fully denormalized data streams. In order to provide full business logic in production, Druid can be paired with a stream processor such as Apache Storm [27]. A Storm topology consumes events from a data stream, retains only those that are "on-time", and applies any relevant business logic. This could range from simple transformations, such as id to name lookups, to complex operations such as multi-stream joins. The Storm topology forwards the processed event stream to Druid in real-time. Storm handles the streaming data processing work, and Druid is used for responding to queries for both real-time and historical data.
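The "retain on-time events, apply business logic, forward to Druid" pipeline can be pictured with a short sketch. This is generic Python rather than an actual Storm topology; the lateness threshold, the lookup table, and the forwarding call are all illustrative:

    import time

    WINDOW_SECONDS = 600   # events older than this are considered late

    def process_stream(events, lookups, forward_to_druid):
        """Filter out late events, apply simple business logic (id -> name
        lookups), and forward the denormalized events downstream."""
        for event in events:
            if time.time() - event["timestamp"] > WINDOW_SECONDS:   # epoch seconds assumed
                continue                          # drop events that are not "on-time"
            enriched = dict(event)
            enriched["page"] = lookups.get(event["page_id"], "unknown")
            forward_to_druid(enriched)            # stand-in for the real-time ingestion hop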
7.3 Multiple Data Center Distribution

Large scale production outages may not only affect single nodes, but entire data centers as well. The tier configuration in Druid coordinator nodes allows for segments to be replicated across multiple tiers. Hence, segments can be exactly replicated across historical nodes in multiple data centers. Similarly, query preference can be assigned to different tiers. It is possible to have nodes in one data center act as a primary cluster (and receive all queries) and have a redundant cluster in another data center. Such a setup may be desired if one data center is situated much closer to users.

8. RELATED WORK

Cattell [6] maintains a great summary about existing Scalable SQL and NoSQL data stores. Hu [18] contributed another great summary for streaming databases. Feature-wise, Druid sits somewhere between Google's Dremel [28] and PowerDrill [17]. Druid has most of the features implemented in Dremel (Dremel handles arbitrary nested data structures while Druid only allows for a single level of array-based nesting) and many of the interesting compression algorithms mentioned in PowerDrill.

Although Druid builds on many of the same principles as other distributed columnar data stores [15], many of these data stores are designed to be more generic key-value stores [23] and do not support computation directly in the storage layer. There are also other data stores designed for some of the same data warehousing issues that Druid is meant to solve. These systems include in-memory databases such as SAP's HANA [14] and VoltDB [43]. These data stores lack Druid's low latency ingestion characteristics. Druid also has native analytical features baked in, similar to ParAccel [34]; however, Druid allows system wide rolling software updates with no downtime.

Druid is similar to C-Store [38] and LazyBase [8] in that it has two subsystems, a read-optimized subsystem in the historical nodes and a write-optimized subsystem in real-time nodes. Real-time nodes are designed to ingest a high volume of append heavy data, and do not support data updates. Unlike the two aforementioned systems, Druid is meant for OLAP transactions and not OLTP transactions.

Druid's low latency data ingestion features share some similarities with Trident/Storm [27] and Spark Streaming [45]; however, both systems are focused on stream processing whereas Druid is focused on ingestion and aggregation. Stream processors are great complements to Druid as a means of pre-processing the data before the data enters Druid.

There is a class of systems that specialize in queries on top of cluster computing frameworks. Shark [13] is such a system for queries on top of Spark, and Cloudera's Impala [9] is another system focused on optimizing query performance on top of HDFS. Druid historical nodes download data locally and only work with native Druid indexes. We believe this setup allows for faster query latencies.

Druid leverages a unique combination of algorithms in its architecture. Although we believe no other data store has the same set of functionality as Druid, some of Druid's optimization techniques, such as using inverted indices to perform fast filters, are also used in other data stores [26].

9. CONCLUSIONS

In this paper we presented Druid, a distributed, column-oriented, real-time analytical data store. Druid is designed to power high performance applications and is optimized for low query latencies. Druid supports streaming data ingestion and is fault-tolerant. We discussed Druid benchmarks and summarized key architecture aspects such as the storage format, query language, and general execution.

10. ACKNOWLEDGEMENTS

Druid could not have been built without the help of many great engineers at Metamarkets and in the community. We want to thank everyone that has contributed to the Druid codebase for their invaluable support.
11. REFERENCES

[1] D. J. Abadi, S. R. Madden, and N. Hachem. Column-stores vs. row-stores: How different are they really? In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 967–980. ACM, 2008.
[2] G. Antoshenkov. Byte-aligned bitmap compression. In Data Compression Conference, 1995. DCC '95. Proceedings, page 476. IEEE, 1995.
[3] Apache. Apache Solr. http://lucene.apache.org/solr/, February 2013.
[4] S. Banon. Elasticsearch. http://www.elasticsearch.com/, July 2013.
[5] C. Bear, A. Lamb, and N. Tran. The Vertica database: SQL RDBMS for managing big data. In Proceedings of the 2012 Workshop on Management of Big Data Systems, pages 37–38. ACM, 2012.
[6] R. Cattell. Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4):12–27, 2011.
[7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008.
[8] J. Cipar, G. Ganger, K. Keeton, C. B. Morrey III, C. A. Soules, and A. Veitch. LazyBase: trading freshness for performance in a scalable database. In Proceedings of the 7th ACM European Conference on Computer Systems, pages 169–182. ACM, 2012.
[9] Cloudera Impala. http://blog.cloudera.com/blog, March 2013.
[10] A. Colantonio and R. Di Pietro. Concise: Compressed 'n' composable integer set. Information Processing Letters, 110(16):644–650, 2010.
[11] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[12] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In ACM SIGOPS Operating Systems Review, volume 41, pages 205–220. ACM, 2007.
[13] C. Engle, A. Lupher, R. Xin, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: fast data analysis using coarse-grained distributed memory. In Proceedings of the 2012 International Conference on Management of Data, pages 689–692. ACM, 2012.
[14] F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner. SAP HANA database: data management for modern business applications. ACM SIGMOD Record, 40(4):45–51, 2012.
[15] B. Fink. Distributed computation on Dynamo-style distributed storage: Riak Pipe. In Proceedings of the Eleventh ACM SIGPLAN Workshop on Erlang, pages 43–50. ACM, 2012.
[16] B. Fitzpatrick. Distributed caching with memcached. Linux Journal, (124):72–74, 2004.
[17] A. Hall, O. Bachmann, R. Büssow, S. Gănceanu, and M. Nunkesser. Processing a trillion cells per mouse click. Proceedings of the VLDB Endowment, 5(11):1436–1446, 2012.
[18] B. Hu. Stream database survey. 2011.
[19] P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX ATC, volume 10, 2010.
[20] C. S. Kim. LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies. IEEE Transactions on Computers, 50(12), 2001.
[21] J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. In Proceedings of the 6th International Workshop on Networking Meets Databases (NetDB), Athens, Greece, 2011.
[22] T. Lachev. Applied Microsoft Analysis Services 2005: And Microsoft Business Intelligence Platform. Prologika Press, 2005.
[23] A. Lakshman and P. Malik. Cassandra—a decentralized structured storage system. Operating Systems Review, 44(2):35, 2010.
[24] Liblzf. http://freecode.com/projects/liblzf, March 2013.
[25] LinkedIn. SenseiDB. http://www.senseidb.com/, July 2013.
[26] R. MacNicol and B. French. Sybase IQ Multiplex—designed for analytics. In Proceedings of the Thirtieth International Conference on Very Large Data Bases—Volume 30, pages 1227–1230. VLDB Endowment, 2004.
[27] N. Marz. Storm: Distributed and fault-tolerant realtime computation. http://storm-project.net/, February 2013.
[28] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330–339, 2010.
[29] D. Miner. Unified analytics platform for big data. In Proceedings of the WICSA/ECSA 2012 Companion Volume, pages 176–176. ACM, 2012.
[30] K. Oehler, J. Gruenes, C. Ilacqua, and M. Perez. IBM Cognos TM1: The Official Guide. McGraw-Hill, 2012.
[31] E. J. O'Neil, P. E. O'Neil, and G. Weikum. The LRU-K page replacement algorithm for database disk buffering. In ACM SIGMOD Record, volume 22, pages 297–306. ACM, 1993.
[32] P. O'Neil and D. Quass. Improved query performance with variant indexes. In ACM SIGMOD Record, volume 26, pages 38–49. ACM, 1997.
[33] P. O'Neil, E. Cheng, D. Gawlick, and E. O'Neil. The log-structured merge-tree (LSM-tree). Acta Informatica, 33(4):351–385, 1996.
[34] ParAccel analytic database. http://www.paraccel.com/resources/Datasheets/ParAccel-Core-Analytic-Database.pdf, March 2013.
[35] M. Schrader, D. Vlamis, M. Nader, C. Claterbos, D. Collins, M. Campbell, and F. Conrad. Oracle Essbase & Oracle OLAP. McGraw-Hill, Inc., 2009.
[36] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pages 1–10. IEEE, 2010.
[37] M. Singh and B. Leonhardi. Introduction to the IBM Netezza warehouse appliance. In Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research, pages 385–386. IBM Corp., 2011.
[38] M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, et al. C-Store: a column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases, pages 553–564. VLDB Endowment, 2005.
[39] A. Tomasic and H. Garcia-Molina. Performance of inverted indices in shared-nothing distributed text document information retrieval systems. In Parallel and Distributed Information Systems, 1993. Proceedings of the Second International Conference on, pages 8–17. IEEE, 1993.
[40] E. Tschetter. Introducing Druid: Real-time analytics at a billion rows per second. http://druid.io/blog/2011/04/30/introducing-druid.html, April 2011.
[41] Twitter public streams. https://dev.twitter.com/docs/streaming-apis/streams/public, March 2013.
[42] S. J. van Schaik and O. de Moor. A memory efficient reachability data structure through bit vector compression. In Proceedings of the 2011 International Conference on Management of Data, pages 913–924. ACM, 2011.
[43] L. VoltDB. VoltDB technical overview. https://voltdb.com/, 2010.
[44] K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems (TODS), 31(1):1–38, 2006.
[45] M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, pages 10–10. USENIX Association, 2012.