Compaction Management in Distributed Key-Value Datastores

Muhammad Yousuf Ahmad
McGill University
muhammad.ahmad2@mail.mcgill.ca

Bettina Kemme
McGill University
kemme@cs.mcgill.ca

ABSTRACT

Compactions are a vital maintenance mechanism used by datastores based on the log-structured merge-tree to counter the continuous buildup of data files under update-intensive workloads. While compactions help keep read latencies in check over the long run, this comes at the cost of significantly degraded read performance over the course of the compaction itself. In this paper, we offer an in-depth analysis of compaction-related performance overheads and propose techniques for their mitigation. We offload large, expensive compactions to a dedicated compaction server to allow the datastore server to better utilize its resources towards serving the actual workload. Moreover, since the newly compacted data is already cached in the compaction server's main memory, we fetch this data over the network directly into the datastore server's local cache, thereby avoiding the performance penalty of reading it back from the filesystem. In fact, pre-fetching the compacted data from the remote cache prior to switching the workload over to it can eliminate local cache misses altogether. Therefore, we implement a smarter warmup algorithm that ensures that all incoming read requests are served from the datastore server's local cache even as it is warming up. We have integrated our solution into HBase, and using the YCSB and TPC-C benchmarks, we show that our approach significantly mitigates compaction-related performance problems. We also demonstrate the scalability of our solution by distributing compactions across multiple compaction servers.

1. INTRODUCTION

A number of prominent distributed key-value datastores, including Bigtable [3], Cassandra [12], HBase¹, and Riak², can trace their roots back to the log-structured merge-tree (LSMT) [13], a data structure that supports high update throughputs along with low-latency random reads. Thus,
¹http://hbase.apache.org/
²http://basho.com/riak/

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing info@vldb.org. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st to September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 8. Copyright 2015 VLDB Endowment 2150-8097/15/04.

these datastores are well-suited for online transaction processing (OLTP) applications that have demanding workloads. In order to handle a high rate of incoming updates, the datastore does not perform updates in place but creates new values for the updated keys and initially buffers them in main memory, from where they are regularly flushed, in sorted batches, to read-only files on stable storage. As a result, reading even a single key could potentially require traversing multiple files to find the correct value of a key. Hence, a continuous build-up of these immutable files can cause a gradual degradation in read performance that gets increasingly worse over time. In order to curb this behavior, the datastore runs special maintenance operations, commonly referred to as compactions, on a regular basis. A compaction merge-sorts multiple files together, consolidating their contents into a single file. In the process, individual values of the same key, potentially spread across multiple files, are merged together, and any expired or deleted values are discarded. Thus, over the long run, compactions help maintain read latency at an acceptable level by containing the gradual build-up of immutable files in the system. However, this comes at the cost of significant latency peaks during the execution of compactions, as they compete with the actual workload for CPU, memory, and I/O resources.

Since compactions are an essential part of any LSMT-based datastore, we would like to be able to exercise a greater degree of control over their execution in order to mitigate any undesirable impacts on the performance of the regular workload. Datastore administrators, based on their experience and understanding of application workloads, manage these performance overheads by carefully tuning the size and schedule of compaction events [1]. For example, a straightforward mitigation strategy could be to postpone major compactions to off-peak hours. Recent proposals and prototypes of smarter compaction algorithms in Cassandra and HBase (e.g., leveled, striped) attempt to make the compaction process itself more efficient, generally by avoiding repetitive re-compactions of older data as much as possible. However, there is currently a dearth of literature pertaining to our understanding of how and when exactly compactions impact the performance of the regular workload.

To this end, as our first contribution, this paper presents an in-depth experimental analysis of these overheads. We hope that this helps data platform designers and application developers to better understand and evaluate these overheads with respect to resource provisioning and framing performance-based service-level agreements.

Since our work relates to OLTP applications, a primary concern is transaction response time. Our observations show that large compaction events have an especially negative impact on the response time of reads due to two issues. First, during the compaction, the compaction process itself competes for resources with the actual workload, degrading its performance. A second major problem is the cache misses that are induced upon the compaction's completion. Distributed key-value datastores generally rely heavily on main memory caching to achieve low latencies for reads³. In particular, if the entire working dataset is unable to fit within the provisioned main memory, the read cache may experience a high degree of churn, resulting in very unstable read performance. Since distributed datastores are designed to be elastically scalable, it is generally assumed that a sufficient number of servers can be conveniently provisioned to keep the application's growing dataset in main memory. Even so, under update-intensive workloads, frequent compactions can become another problematic source of cache churn. Since a compaction consolidates the contents of multiple input files into a new output file, all references to the input files are obsoleted in the process, which forces the datastore to invalidate the corresponding entries in its read cache. Our analysis shows that the cache misses caused by these large-scale evictions result in an extended degradation in read latency, since the datastore server then has to read the newly compacted data from the filesystem into its cache all over again.

With this in mind, our second major contribution is to propose a novel approach that attempts to keep the impact of compactions on the performance of the actual workload as small as possible, both during a compaction's execution and after it completes.

As a first step, we offload the compactions themselves to dedicated nodes called compaction servers. Taking advantage of the data replication inherent in distributed datastores, we enable a compaction server to transparently read and write replicas of the data by acting as a specialized peer of the datastore servers.

In the second step, aiming to reduce the overhead of cache misses after the compaction, we use the compaction server as a remote cache for the datastore server. That is, instead of reading the newly compacted files from the filesystem, the datastore server reads them directly from the compaction server's cache, thereby significantly reducing both load time and read latency.

Although this alleviates the performance penalty incurred by local cache misses to a great degree, it does not completely eliminate it. In order to address the remaining overhead of these cache misses, one approach could be to eagerly warm the datastore server's cache with the compacted data immediately upon the compaction's completion. But such an approach is only feasible when we have enough main memory provisioned, such that the datastore server can simultaneously fit both the current data as well as the compacted data in its cache, thus allowing for a seamless switch between the two. Instead, we propose a smart warmup algorithm that fetches the compacted data from the remote cache in sequential chunks, where each chunk replaces the corresponding range of current data in the local cache. During this incremental warmup phase, we guarantee that each read request is served completely either by the
old data files or by the freshly compacted data. This ensures that all incoming read requests can be served from the datastore server's local cache even as it is warming up, thereby completely eliminating the performance penalty associated with switching over to the newly compacted data.

³http://www.slideshare.net/xefyr/hbasecon2014-low-latency

In short, the main contributions of this paper are:

1. An experimental analysis of the performance impacts associated with compactions in HBase and Cassandra.
2. A scalable solution for offloading compactions to one or more dedicated compaction servers.
3. A solution for efficiently streaming the compacted data from the compaction server's cache into the datastore server's local cache over the network.
4. A smart algorithm for incrementally warming up the datastore server's cache with the compacted data.
5. An implementation of the above and its evaluation based on HBase.

Our paper does not follow the typical structure found in research papers, which first present the solution in full followed by the experiments. Instead, we use a step-wise approach, where we first describe a part of our solution, immediately accompanied by an experimental evaluation of this part to better understand its implications. In this spirit, Section 2 provides an overview of the log-structured merge tree, HBase, and Cassandra. Section 3 describes the overall architecture of our approach and a high-level description of the integration of our new components into HBase. Section 4 then shortly describes the experimental setup, before Section 5 digs into the details of our solution and their evaluation. Section 6 discusses scalability, along with some further experimental results, and Section 7 discusses the fault-tolerance aspects of our solution. Section 8 presents a summary of the related work. We conclude in Section 9.

2. BACKGROUND

This section provides an overview of the background relevant to our understanding of compactions in LSMT-based datastores, as well as a short overview of how compactions are performed in HBase and Cassandra.

2.1 LSMT

The log-structured merge-tree (LSMT) [13] is a key-value data structure that aims to provide a data storage and retrieval solution for high-throughput applications. It is a hybrid data structure, with a main memory layer (C0) placed on top of one or more filesystem layers (C1, and so on). Updates are collected in C0 and flushed down to C1 in batches, such that each batch becomes an immutable file, with the key-value pairs written in sorted order.

This approach has several important implications. Firstly, for the client, updates are extremely fast, since they are applied in-memory. Secondly, flushing updates down in batches is more efficient since it significantly reduces disk I/O. Moreover, appending a batch of updates to a single file is much faster than executing multiple random writes on a rotational storage medium (e.g., magnetic disk). This enables the data structure to support high update throughputs. Thirdly, multiple updates on a given key may end up spread across C0 and any number of files in C1 (or below). In other words, we can have multiple values per key. Therefore, a random read on a given key must first search through C0 (a quick, in-memory lookup), then C1 (traversing all the files in that layer), and so on, until it finds the most recent value for that key. Since the contents of a file are already sorted by key, an in-file index can be used to speed up random reads within a file. These read-only files inevitably start building up in the filesystem layer(s), resulting in reads becoming increasingly slow over time. This is remedied by periodically selecting two or more files in a layer and merge-sorting them together into a single file. The merge process overwrites older values with the latest ones and discards deleted values, thereby clearing up any stale data.

Figure 1: HBase Architecture

2.2 HBase

HBase is a modern distributed key-value datastore inspired by Bigtable [3]. HBase offers the abstraction of a table, where each row represents a key-value pair. The key part is the unique identifier of the row, and the value part comprises an arbitrary number of column values. Columns are grouped into column families to partition the table vertically. Each table can also be partitioned horizontally into many regions. A region is a contiguous set of rows sorted by their keys. When a region grows beyond a certain size, it is automatically split into half, forming two new regions. Every region is assigned by a Master server to one of multiple region servers in the HBase cluster (see Figure 1). Through well-balanced region placement, the application workload can be evenly distributed across the cluster. When a region server becomes overloaded
, some of its regions can be reassigned to other underloaded region servers. When the cluster reaches its peak load capacity, new region servers can be provisioned and added to the online cluster, thus allowing for elastic scalability. HBase relies on Zookeeper⁴, a lightweight quorum-based replication system, to reliably manage the meta-information for these tasks. HBase uses HDFS⁵ as its underlying filesystem for the application data, where each column family of each region is physically stored as one or more immutable files called store files (corresponding to LSMT layer C1). HDFS is a highly scalable and reliable distributed filesystem based on GFS [9]. It automatically replicates file blocks across multiple datanodes for reliability and availability. Normally, there is a datanode co-located with each region server to promote data locality. HDFS has a Namenode, similar in spirit to the HBase Master, for meta-management.

⁴http://zookeeper.apache.org/
⁵http://hadoop.apache.org/

Applications interact with HBase through a client library that provides an interface for reading and writing key-value pairs, either individually or in batches, and performing sequential scans that support predicate-based filtering. Each read (i.e., get or scan) or write (i.e., put or delete) request is sent to the region server that serves the region to which the requested key-value pair(s) belongs. A write is served simply by applying the received update to an in-memory data structure called a memstore (corresponding to LSMT layer C0). This allows for having multiple values for each column of a row. Each region maintains one memstore per column family. When the size of a memstore reaches a certain threshold, its contents are
flushed to HDFS, thereby creating a new store file. A read is served by scanning for the requested data through the memstore and through all of the region's store files that might contain the requested data. Each region server maintains a block cache that caches recently accessed store file blocks to improve read performance.

Periodically, or when the number of a region's store files crosses a certain configurable limit, the parent region server will perform a compaction to consolidate the contents of several store files into one. When a compaction is thus triggered, a special algorithm decides which of the region's store files to compact. If it selects all of them in one go, it is called a major compaction, and a minor compaction otherwise. Unlike a minor compaction, a major compaction additionally also removes values that have been flagged for deletion via their latest updates. Therefore, major compactions are more expensive and usually take much longer to complete.

2.2.1 Exploring Compactions

The default compaction algorithm in HBase uses a heuristic that attempts to choose the optimal combination of store files to compact based on certain constraints specified by the datastore administrator. The aim is to give the administrator a greater degree of control over the size of compactions and thus, indirectly, their frequency as well. For example, it is possible to specify minimum and maximum limits on the number of store files that can be processed per compaction. Similarly, the algorithm also allows us to enforce a limit on the total file size of the group, so that minor compactions do not become too large. Finally, a ratio parameter can be specified that ensures that the size of each store file included in the compaction is within a certain factor of the average file size of the group. The algorithm explores all possible permutations that meet all these requirements and picks the best one (or none), optimizing for the ratio parameter. We can configure HBase to use different ratio parameters for peak and off-peak hours.

2.3 Cassandra

Cassandra is another popular distributed key-value datastore. Its design incorporates elements from both Bigtable and Dynamo [7]. As a result, it has a lot in common with HBase, yet also differs from it in several important respects. Unlike HBase, Cassandra has a decentralized architecture, so a client can send a request to any node in the cluster, which then acts as a proxy between the client and the nodes that actually serve the client's request. Cassandra also allows applications to choose from a range of consistency settings per request. The lowest setting allows for inconsistencies such as stale reads and dirty writes (though, eventually, the datastore does reach a consistent state), but offers superior performance. The strictest setting matches the consistency level of HBase, but sacrifices on performance. While HBase maintains its own block cache, Cassandra relies instead on the OS cache for faster access to hot file blocks. At a finer granularity, it also offers the option of using a row-level cache. Finally, Cassandra uses a slightly different compaction algorithm (see tiered compactions in Section 8). Unlike HBase, minor compactions in Cassandra clean up deleted values as well. Cassandra also throttles compactions to limit their overhead.

Despite these differences, Cassandra has two important similarities with HBase: Cassandra also partitions its data and runs compactions on a per-partition basis. Moreover, Cassandra also flushes its in-memory updates in sorted batches into read-only files. These similarities make us believe that many aspects of our approach, although implemented in HBase, are generally applicable to Cassandra and other datastores in the LSMT family as well.

3. ARCHITECTURE

Our solution adds two new components to the datastore architecture: a centralized compaction manager and a set of compaction servers. The integration of these components into the HBase architecture is depicted in Figure 2.

Figure 2: Offloading Compactions

A compaction server performs compactions on behalf of region servers. Therefore, it also hosts a datanode in order to gain access to the HDFS layer. Whenever a region server
flushes region data, it writes a new store file to HDFS, which can then be read by the compaction server. Similarly, upon compacting a region, the compaction server writes the compacted store file back to HDFS as well.

Compaction servers can be added or removed, allowing for scalability. Each compaction server is assigned some subset of the data. The compaction manager manages these assignments, mapping regions to compaction servers akin to how the HBase Master maps regions to region servers.

While our implementation makes substantial additions and changes to HBase, we have attempted to perform them in a modular manner. We used the HBase Master and region server code as a base for implementing the compaction manager and the compaction server, respectively. For example, the compaction server reuses the code for scanning store files from HDFS and performing compactions on them. That is, we take the compaction algorithm as a black box, without modifying it. However, we modified specific subcomponents of the region server code so that it could offload compactions to a compaction server and also receive the compacted data back over the network for more efficient warmup.

4. EXPERIMENTAL SETUP

Since the next section combines the presentation of our proposed solutions along with a detailed performance analysis of each of the steps, we provide a summary of the general experimental setup before proceeding.

4.1 Environment

We ran our experiments on a homogeneous cluster of 20 Linux machines. Each node has a 2.66 GHz dual-core Intel Core 2 processor, 8 GB of RAM, and a 7,200 RPM SATA HDD with 160 GB. The nodes are connected over a Gigabit Ethernet switch. The OS is 64-bit Ubuntu Linux and the Java environment is 64-bit Oracle JDK 7. We used the following software versions: HBase 0.96, HDFS 2.3, Cassandra 2.0, and YCSB 0.1.4.

4.2 Datastores

4.2.1 HBase/HDFS

The HBase Master, the HDFS Namenode, and ZooKeeper services all share one dedicated node⁶. We modified a few key configuration parameters in HBase in order to better study the overheads of compactions. The compaction file selection ratio was changed from 1.2 to 3.0. Region servers were allocated 7 GB of main memory, of which 6 GB went to the block cache. We used Snappy⁷ for compression. In all our experiments, each region server and compaction server hosts their respective datanode, with a minimum of three datanodes in the cluster.
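These settings correspond to a handful of standard properties in hbase-site.xml. The fragment below is an illustrative sketch rather than the authors' actual configuration: hbase.hstore.compaction.ratio and hfile.block.cache.size are real HBase property names, but the specific values shown (e.g., expressing the 6 GB block cache as a fraction of the 7 GB region server heap) are our own approximation.

```xml
<!-- Illustrative hbase-site.xml fragment approximating the setup described above -->
<configuration>
  <!-- File-selection ratio for minor compactions, raised from the 1.2 default to 3.0 -->
  <property>
    <name>hbase.hstore.compaction.ratio</name>
    <value>3.0</value>
  </property>
  <!-- Block cache as a fraction of the region server heap: roughly 6 GB of a 7 GB heap -->
  <property>
    <name>hfile.block.cache.size</name>
    <value>0.85</value>
  </property>
</configuration>
```

Note that Snappy compression is not a site-wide setting; it is enabled per column family, e.g. via the HBase shell with `alter 'usertable', {NAME => 'family', COMPRESSION => 'SNAPPY'}` (table and family names hypothetical).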
4.2.2 Cassandra

Since Cassandra prefers to use the OS cache, we allocated only 4 GB of main memory to its process and kept the row cache disabled. We used the ByteOrderedPartitioner, which allows us to efficiently perform sequential scans by primary key (the default, random partitioner is unsuitable for this purpose). Since the standard YCSB binding for Cassandra is outdated, we implemented a custom binding for Cassandra 2.0 using the latest CQL3 API.

4.3 Benchmarks

We are interested in running OLTP workloads on a cloud datastore. A typical OLTP workload generates a high volume of concurrently executing read-write transactions. Most transactions execute a series of short reads and updates, but a few might also execute larger read operations such as partial or full table scans. In our experiments, we try to emulate these workload characteristics with two benchmarks.

YCSB is a popular microbenchmark for distributed datastores. We used it to stress both HBase and Cassandra with an update-intensive workload. We launch separate client processes for reads and writes. Our write workload consists of 100% updates, while our read workload comprises 90% gets and 10% scans. We used the Zipfian distribution to reflect an OLTP workload more closely.

TPC-C is a well-known OLTP benchmark that is generally used for benchmarking traditional relational database setups. We used an implementation of TPC-C called PyTPCC⁸, which works with various cloud datastores, including HBase. Since there is no support for transactions in HBase, the benchmark simply executes its transactions without ACID guarantees. For convenience, it does not simulate the think time between transactions, thus allowing us to stress the datastore with fewer clients. The workload comprises five transaction types: New-Order (45%), Payment (43%), Order-Status (4%), Delivery (4%), and Stock-Level (4%). We populated 50 warehouses, corresponding to around 14 GB of actual data.

⁶Reliability was not a focus of the evaluation, so we provisioned one ZooKeeper server only, with sufficient capacity.
⁷https://code.google.com/p/snappy/
⁸http://github.com/apavlo/py-tpcc

Figure 3: Motivation: (a) No Compactions (HBase): under an update-intensive workload, read latency in HBase gets increasingly worse over time if the store files that build up are not regularly compacted (the figure is scaled for scans, but gets are affected just the same). (b) Compactions (HBase): although regular compactions help maintain read performance within reasonable limits over the long run, read latency still spikes significantly during the compaction events themselves. (c) Compactions (Cassandra): Cassandra suffers from the same problem; we can see that the larger of the two compactions has a significant negative impact on read performance over a period of around ten minutes; note the same two distinct phases.

5. OFFLOADING COMPACTIONS

A key performance goal of OLTP applications is maintaining low response times under high throughput. In this section, we first show that read performance can suffer significantly during and immediately after a large compaction, in both HBase and Cassandra. We then propose and evaluate a number of strategies for alleviating this problem.

5.1 Motivation

To understand the implications of HBase compactions on read performance, we ran a YCSB workload with 10 read threads against one region server (no compaction server). Our test table held three million rows in a single region, equivalent to around 4 GB of actual, uncompressed data. This ensured that the working dataset fit comfortably within the region server's 6 GB block cache. We recorded the response times of gets and scans over the course of the experiment, at five-second intervals.

The graphs in Figure 3 show the response time of gets and scans over the duration of each experiment. Figure 3(a) shows the observed degradation in read performance over time when compactions are disabled altogether. Figure 3(b) shows that while compactions help maintain read performance within reasonable limits over the long run, each compaction event causes a significant spike in response time. We can also see that a major compaction causes a much larger and longer degradation in read performance relative to minor compactions. Note that both gets and scans are severely affected by the major compaction. A similar experiment on Cassandra shows that it also exhibits severe compaction-related performance degradation (see Figure 3(c)).

Figure 4(a) zooms into the compaction phase. We can see that a major compaction can add a noticeable performance overhead on the region server that executes it, and can typically take on the order of a few minutes to complete. The response times of read operations executing on this region server degrade noticeably during this time. The figure shows two distinct phases of degradation: compaction and warmup. The compaction phase is characterized by higher response times over the duration of the compaction. We observed that this is mainly due to the CPU overhead associated with compacting the store file data. Both gets and scans are affected. The warmup phase starts when the compaction completes. At this time the server switches from the current data to the newly compacted data. The switch triggers the eviction of the obsoleted file blocks en masse, followed by a
flurry of cache misses as the compacted data blocks are then read and cached. This leads to a severe degradation in read response times for an extended period. Figure 3(c) shows that Cassandra similarly exhibits these two phases as well.

5.2 Compaction Phase

We first attempt to deal with the overhead of the compaction phase. Our observations show that the performance degradation in this phase can be exacerbated by the datastore server experiencing high loads. In other words, overloading an already saturated processor can cause response times to spike and the compaction itself to take much longer to complete. One approach to manage this overhead is for the datastore to limit the amount of resources that a compaction consumes. By throttling compactions in this way, the datastore can amortize their cost over a longer duration. In fact, this is the approach taken by Cassandra; it throttles compaction throughput to a configurable limit (16 MB/s by default). However, we believe that this approach does not sufficiently address the problem, mainly for three reasons. Firstly, Figure 3(c) shows that despite the throttling, response times still spiked with the compaction, just as was observed with HBase with no throttling. We could, of course, throttle more aggressively, thereby amortizing the overhead over a much longer period, but this leads to our second concern. The longer a compaction takes, the more obsoleted data (deleted and expired values) the datastore server must maintain over that duration, thus continuing to hurt read performance. Thirdly, even when throttling compactions helps to alleviate their overhead to some extent, it offers no further benefits for managing the overhead of the subsequent warmup phase.

Figure 4: Compaction Phase: (a) Compaction Phases (SS); (b) Compaction Offloading (CO)

Therefore, our approach offloads these expensive compactions to a dedicated compaction server, thus allowing the region server to fully dedicate its resources towards serving the actual application workload. There are two obvious benefits to this approach. First, it eliminates the CPU overhead that the region server would otherwise incur over the duration of the compaction. Second, the compaction can generally be executed faster, since it is running on a dedicated server. Although the compaction server needs to read the compaction's input store files from the filesystem rather than main memory (the region server can read the data from its block cache), we could not observe any negative impact in our experiment as a result of this.

We evaluated the benefit of offloading the compaction using YCSB. Figure 4(b) plots the response times of gets and scans under our approach, where we simply added the compaction manager and one compaction server to the previous experiment. Comparing Figure 4(b) against the standard setup in Figure 4(a), we can see that with a dedicated compaction server, the compaction phase is shorter, with a noticeable improvement in read latency as well. On the other hand, we see no improvement in the long-running warmup phase after the compaction has completed. Therefore, next, we discuss the advantages of having the compacted data in the compaction server's main memory for improving the warmup phase.

5.3 Warmup Phase

As previously discussed, we observe that once the compaction completes, the region server must read the output store file from disk back into its block cache in order to serve reads from the newly compacted data. At this stage, read performance can suffer significantly due to the high rate of cache misses as the block cache gradually warms up again. The experimental results presented so far clearly show that the warmup phase has a significant negative impact on the performance of our workload. In fact, we tend to see an extended phase of up to a few minutes of severely degraded response times for both individual gets as well as scans. Therefore, in the remainder of this section, we
analyze this particular performance issue and attempt to mitigate it.

5.3.1 Write-Through Caching

First, we analyze the warmup phase in the standard setup (i.e., the region server does not offload the compaction). We consider whether caching a compaction's output in a write-through manner, i.e., each block written to HDFS is simultaneously cached in the block cache as well, could present any benefit under the standard setup. Ideally, this would eliminate the need for a warmup phase altogether. However, our observations show that this approach does not in fact yield promising results. In order to test this idea, we modified HBase to allow us to cache compacted blocks in a write-through manner. In Figure 5(b), we can compare the performance of this approach against the standard setup (Figure 5(a)). We see that while the warmup phase improves to an extent, the performance penalty is passed back to the compaction phase instead. Upon further investigation, we witnessed large-scale evictions of hot blocks from the block cache during the compaction, resulting in heavy cache churn, which severely degraded read performance. In other words, we see that during the course of the compaction, the newly compacted data competes for the limited capacity of the region server's block cache even as the current data is still being read, since the switch to the new data is made only once the compaction completes. Therefore, this approach only shifts the problem back to the compaction phase.

Clearly, the larger the main memory of each region server is, compared to the size of the regions it maintains, the more of the current data and compacted data will fit together into main memory, and the less cache churn we will observe. However, that would lead to a significant over-provisioning of memory per region server, since the extra memory would only be used during compactions. For this reason, we believe that having a few compaction servers acting as remote caches that are shared by many region servers can solve this problem with fewer overall resources.

5.3.2 Remote Caching

Our approach of offloading compactions presents us with an interesting opportunity to take advantage of write-through caching on the compaction server instead, thereby combining both approaches. As a dedicated node, it can be asked to play the role of a remote cache during the warmup phase since it already has the compaction output cached in its main memory. With this approach, instead of reading the newly compacted blocks from its local disk, the region server requests them from the compaction server's memory instead. There is an obvious trade-off here between disk and network I/O. Since our main aim is achieving better response times, we deem this trade-off to be worthwhile for setups where network I/O is faster than disk I/O.

Figure 5: Warmup Phase: (a) Standard Setup (SS); (b) Write-Through Caching (WTC); (c) Remote Caching (RC)

We have implemented a remote procedure call that allows the region server to fetch the cached blocks from the compaction server instead of reading them from the local HDFS datanode. To reduce the network transfer overhead, we compress the blocks at the source using Snappy, and subsequently uncompress them upon receiving them at the region server. This comes at the cost of a slight processing overhead, but the savings in the total transfer time and network I/O make this an acceptable trade-off. We evaluate the effectiveness of this approach in Figure 5(c), which shows a significant improvement in response times in the warmup phase as compared to not having a remote cache available. While the eviction of the obsoleted data blocks still causes cache misses, note that the warmup phase completes quicker due to the much faster access of blocks from the compaction server's memory over the network rather than from disk. Of course, the benefit of compaction offloading on the compaction phase is retained as well.

Nevertheless, we still observe a distinct performance boundary between the compaction and warmup phases where the cache misses occur. Hence, while the remote cache offers a significant improvement over reading from the local disk, the performance penalty due to these cache misses remains to be addressed.

5.4 Smart Warmup

To obtain further improvements, we essentially need to avoid cache misses by preemptively fetching and caching the compacted data. We discuss two options for doing this.

5.4.1 Pre-Switch Warmup

In the first option, we warm the local cache up (transferring data from the compaction server to the region server) prior to making the switch to the compacted data. This is similar in principle to the write-through caching approach previously discussed. That is, its effectiveness depends on the availability of additional main memory
, such that the region server can simultaneously fit both the current data as well as the compacted data in its cache, thus allowing for a seamless switch. When compared with write-through caching, in which the warmup happens during the compaction itself, here we perform the warmup after the compaction completes. Therefore, since the compaction is performed remotely and the compacted data fetched over the network, the region server's performance does not suffer during the compaction, and, once the switch is made, the remainder of the warmup is more efficient as well.

Figure 6(a) shows the performance of this approach. The pre-switch warmup comprises two sub-phases, depicted in the figure using gray and pink, respectively. Recall that 6 GB of main memory is available for the block cache. Since the current data takes up around 4 GB, the pre-switch warmup can fill up the remaining 2 GB without severely affecting the performance of the workload (gray). However, as the warmup continues beyond this point (pink), the compacted data competes with the current data in the cache, resulting in severely detrimental cache churn. This also affects post-switch performance (orange), since we must then re-fetch the compacted data that was overwritten by the current data. Therefore, the longer the pre-switch warmup phase takes, the less effective this approach becomes. Nevertheless, its overall performance is still better than the write-through caching approach without the compaction server (Figure 5(b)), since, in the latter case, the old and new data already start to compete during the compaction phase once the block cache fills up; whereas, with the remote cache, the detrimental cache churn occurs only for a much shorter part of the pre-switch warmup phase.

Since OLTP workloads typically generate regions of hot data, we also tried a version of this approach where we warm the cache up with only as much hot data as can fit side-by-side with the current data (gray) so that we do not cause any cache churn (pink). However, this strategy appeared to offer no additional benefit when tested. We realized that this is because the hot data comprises less than 1% of the blocks, which can easily be fetched almost immediately in either case (before or after the switch), meaning that 99% of cache misses are actually associated with cold data.

5.4.2 Incremental Warmup

Our experimental analysis above shows that the less additional memory is provisioned on the region server, the worse the pre-switch warmup will perform. Therefore, we now present an incremental warmup strategy that solves this problem without requiring the provisioning of additional memory. It works on two fronts. The first aspect is that we fetch the compacted data from the remote cache in sequential chunks, where each chunk replaces the corresponding range of current data in the local cache. For this, we exploit the fact that the store files written by LSMT datastores are pre-sorted by key. Hence, we can move sequentially along the compacted partition's key range. That is, we first transfer the compacted data blocks with the smallest keys in the store files. At the same time, we evict the current data blocks that cover the same key range that we just transferred. That is, the newly compacted data blocks replace the data blocks with the same key range. At any given time, we keep track of the incremental warmup threshold T which represents the row key with the following property: all newly compacted blocks holding row keys smaller or equal to T have been fetched and cached, and, correspondingly, all current blocks holding row keys up to T have been evicted from the local cache. This means that all current blocks with row keys larger than T have not been evicted yet and are still in the region server's cache.

Read operations are now executed in the following way on this mixed data. Given a get request for a row with key R, or a scan request that starts at key R, we direct it to read the newly compacted store file if R ≤ T, and the current store files (can be one or more, with overlapping key ranges) otherwise. In this way, we ensure that all incoming requests can be served immediately from the region server's block cache even as it is warming up, thus removing the overhead associated with cache misses. As Figure 6(b) shows, the improvement offered by this approach is significant.

Figure 6: Smart Warmup: (a) Pre-switch Warmup (PSW); (b) Incremental Warmup (IW); (c) Throttled Incremental (TIW)

While a get only reads a single row, a scan spans multiple rows and thus could potentially span multiple blocks of a store file. Therefore, a scan may fall under one of the three following cases. If the scan starts and ends below the incremental threshold, T, it will read only compacted data that is already cached. If the scan star
tsbelowbutendsbeyondT,itwillstillreadthecompacteddata,althoughallofthisdatamightnotyetbecachedwhenthescanstarts.Butasthescanprogresses,sowillT,inparallel,asthecompacteddataisstreamedintotheregionserver,andthus,thisscanwillmostlikelybefullycoveredbythecacheaswell.OnlyinthecasethatthescanovertakesT,accessingkeyswithavaluehigherthanthecurrentT,itwillslowdownduetocachemisses.IfthescanstartsandendsbeyondT,itwillreadthecurrentdatainsteadandwillalso,mostlikely,befullycoveredbythecache.InthecasethatTovertakesitmidway,evictingtheblocksitwasabouttoread,itwillencountercachemisses.However,sincescanningrowsfromlocallycachedblocksisfasterthanfetchingblocksfromtheremotecache,wedonotexpectorobservethistohappenoften.Infact,wesawrelativelyveryfewcachemissesover-allinourexperiment.Notethatinallcases,anygivenreadrequestisservedeitherentirelyfromthecompacteddataorentirelyfromthecurrentdata.Moreover,notethataregionmaycomprisemultiplecol-umnfamilies,andeachfamilyhasitsownstorele(s).Thealgorithmiteratesovertheregion'scolumnfamilies,warm-ingthemuponeatatime.Therefore,whensucharegionreceivesareadrequestcoveringmultiplecolumnfamiliesduringtheincrementalwarmup,weensurethataconsis-tentresultisreturned,sinceeachfamilyisindividuallyreadconsistentlybeforetheresultsarecombined. 
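To make the routing rule concrete, the following is a minimal, self-contained Python sketch of threshold-based read routing during incremental warmup. The class and all names in it are our illustration, not HBase code; the real logic lives inside the region server's read path.

```python
# Sketch of threshold-based read routing during incremental warmup.
# All names are illustrative stand-ins for region-server internals.

class IncrementalWarmupRouter:
    def __init__(self, compacted_store, current_stores):
        self.compacted_store = compacted_store  # newly compacted store file, sorted by key
        self.current_stores = current_stores    # one or more pre-compaction store files
        self.T = None                           # warmup threshold: highest row key already fetched

    def advance(self, new_threshold):
        """Called as each sequential chunk of compacted blocks arrives;
        current blocks with keys <= new_threshold are evicted."""
        self.T = new_threshold

    def route(self, row_key):
        """A get for row_key (or a scan starting at row_key) reads the
        compacted store if row_key <= T, otherwise the current stores."""
        if self.T is not None and row_key <= self.T:
            return [self.compacted_store]
        return list(self.current_stores)

router = IncrementalWarmupRouter("compacted-file", ["cur-file-1", "cur-file-2"])
router.advance("row-0499")
print(router.route("row-0042"))  # below T: served from the compacted data
print(router.route("row-0777"))  # above T: served from the current data
```

Note that each request goes entirely to one side, matching the property above that a read is served either entirely from the compacted data or entirely from the current data.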
Figure 7: Performance Evaluation: YCSB

As a final improvement, we throttle the warmup phase. The result is shown in Figure 6(c). This essentially means that T advances slower than without throttling, and, therefore, the warmup phase lasts longer. However, as a result, the performance overhead of this phase is virtually eliminated. It reduces the CPU costs for the data transfer and reduces the chances of cache misses caused by current data blocks getting overwritten by the new data too quickly. As a result, we see that there is hardly any noticeable impact left from the compaction and warmup phases.

A summary of our YCSB performance evaluation is presented in Figure 7. For each approach, we show the degradation of read latency during the compaction and warmup phases, respectively, as a measure of the percentage difference from the baseline, i.e., the average latency before the compaction started. The important improvements are highlighted in green. We can see that with our best approach, throttled incremental warmup (TIW), the performance degradation of gets is reduced to only 7%/9% (compaction/warmup), while that of scans is reduced to only 20%/4%. The duration of the compaction phase is significantly shortened as well. Although the warmup phase is longer than with simple remote caching (RC), the significantly superior performance of TIW makes up for this.

5.5 TPC-C

We use TPC-C, a standard OLTP benchmark, to evaluate the performance of our proposed approaches. On the back-end, we ran two region servers and one compaction server, while a total of 80 client threads were launched using two front-end nodes. We recorded the average response time of each transaction type, and also measured the tpmC metric (New-Order transactions per minute) averaged over the duration of each experiment.

Figure 8: Performance Evaluation: TPC-C. (a) New-Order; (b) Stock-Level

In order to observe the adverse impacts of compactions on the standard TPC-C workload, we triggered compactions on the two most heavily updated tables, Stock and Order-Line, in two separate sets of experiments, respectively.

In the first set, we observed the performance of New-Order, which is a short, read-write transaction. Since it reads the Stock table, it is impacted by compactions running on this table. Figure 8(a) shows the effects of this impact under the standard setup (SS) and the improvements offered by each of our main approaches. We can see that with throttled incremental warmup (TIW), the degradation in the average response time of New-Order transactions (against the baseline) is significantly reduced in both the compaction and warmup phases. The duration of the compaction phase is also considerably shortened. The warmup phase is shortest when using simple remote caching (RC). Overall, our best approach, TIW, provides an improvement of nearly 11% in terms of the tpmC metric.

In the second set, we observed the longer-running Stock-Level transaction. Since it reads the Order-Line table, it was impacted by compactions running on this table. In Figure 8(b), we see the performance improvement provided by each of our approaches. While the response time is only slightly better in the compaction phase, its duration is cut down considerably by offloading the compaction. Once again, a significant reduction in response time degradation is seen with our incremental warmup (TIW) approach, even though the warmup duration stays nearly the same.

6. SCALABILITY

By using a compaction manager that oversees the execution of compactions on all compaction servers, we can scale our approach in a similar manner as HBase can scale to as many region servers as needed. In fact, since HBase partitions its data into regions, we conveniently use the same partitioning scheme for our purposes. Thus, the distributed design of our solution inherits the elasticity and load distribution qualities of HBase.

6.1 Elasticity

For application workloads that
fluctuate over time, HBase offers the ability to add or remove region servers as the need arises. Along the same lines, our compaction manager is able to handle many compaction servers at the same time. It uses the same ZooKeeper-based mechanism as HBase for managing the meta-information needed for mappings between regions and compaction servers.

6.2 Load Distribution

As the application dataset grows, HBase creates new regions and distributes them across the region servers. Our compaction manager automatically detects these new regions and assigns them to the available compaction servers. We inherit the modular design of HBase, which allows us to plug in custom load balancing algorithms as required. We currently use a simple round-robin strategy for distributing regions across compaction servers. However, we can envision more complex algorithms that balance regions dynamically based on the current CPU and memory loads of compaction servers, metrics that we publish over the same interface that HBase uses for its other components.

6.3 Compaction Scheduling

Scheduling compactions is an interesting problem. Currently, we let the region server schedule its own compactions based on its default exploring algorithm (see Section 2.2.1). However, our design allows for the compaction manager to perform compaction scheduling based on its dynamic, global view of the loads being handled by compaction servers.

An important parameter is how many compactions a compaction server can handle concurrently. As we use its main memory as a remote cache, the sum of the compacted data of all regions it is concurrently compacting should not be larger than the server's memory. A rough estimation of this limit can be calculated as follows.

Given an estimate of the rate c (in bytes/s) at which a compaction server can read and compact data, and an estimate of the rate w (in bytes/s) at which the compacted data is transferred back to the region server over the network (with throttling), we can calculate the duration, D(b) (in seconds), of a compaction as a function of its size, b (in bytes): D(b) = b/c + b/w. Moreover, a compaction server with m bytes of main memory cache at its disposal can handle l compactions of average size b at a time, where l = ⌊m/b⌋. Thus, one compaction server will have the capacity to compact up to h regions of average size b per interval of t seconds, where h = (t/D(b)) · ⌊m/b⌋. Therefore, given an update workload that triggers T compactions per region per interval of t seconds, we can assign up to ⌊h/T⌋ regions per compaction server. And, for a dataset of R regions, we will need to provision at least C = ⌈R/⌊h/T⌋⌉ of these compaction servers for the given application dataset size and workload. For example, consider a setup with the following parameters: c = 20 MB/s, w = 8 MB/s, m = 6 GB, b = 4 GB, T = 1/hour, and R = 10 regions. This gives us C = 2 compaction servers.

Figure 9: Performance Evaluation: Scalability. (a) Standard Setup: 5 RS; (b) Compaction Offloading: 5 RS / 1 CS; (c) Compaction Offloading: 10 RS / 1 CS; (d) Compaction Offloading: 10 RS / 2 CS

6.4 Performance Evaluation

Using YCSB, we demonstrate the scalability of our solution by scaling our setup from five region servers up to ten. The five-node setup served 10 million rows split into five regions supported by one compaction server. We launched two read/write clients (with 40 read threads and two write threads each). The ten-node setup doubled both the dataset size and workload; i.e., 20 million rows split into ten regions, stressed with four read/write clients. At first, we provisioned only one compaction server, overloading it beyond its maximum capacity. Next, we ran the same experiment with two compaction servers to demonstrate the capability of our architecture to effectively distribute the load between the two servers. The experiments run for four hours; multiple major compactions are triggered in this duration.

Figures 9(a) to 9(d) show the results. Figure 9(a) shows the average response time of reads over the four-hour period on the five region servers (no compaction servers). We can see the same latency spikes as in our smaller scale experiments where compactions were not offloaded. Figure 9(b) shows the five-node setup with one compaction server, which can handle the compactions triggered by all five region servers, eliminating the performance overhead seen in the standard setup. In Figure 9(c), ten region servers are served by a single compaction server. In this case, the compaction server becomes overloaded. As our compaction server only has enough main memory cache available (6 GB) to compact a single region's data (4 GB) at a time, we cannot allow several compactions to run concurrently. Thus, compactions are delayed.
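This capacity limit is the provisioning estimate of Section 6.3 in action. As an illustration, the following short script (our own sketch, not part of the system) mechanizes that estimate and reproduces the C = 2 result from the worked example:

```python
import math

def compaction_servers_needed(c, w, m, b, T, R, t=3600.0):
    """Rough provisioning estimate from Section 6.3 (a sketch).
    Rates c, w in bytes/s; sizes m, b in bytes; T in compactions
    per region per interval of t seconds; R regions in the dataset."""
    D = b / c + b / w                   # D(b): duration of one compaction of size b
    l = math.floor(m / b)               # compactions that fit concurrently in the cache
    h = (t / D) * l                     # regions compactable per interval of t seconds
    regions_per_server = math.floor(h / T)
    return math.ceil(R / regions_per_server)

MB, GB = 10**6, 10**9
# The paper's example: c = 20 MB/s, w = 8 MB/s, m = 6 GB, b = 4 GB, T = 1/hour, R = 10
print(compaction_servers_needed(c=20*MB, w=8*MB, m=6*GB, b=4*GB, T=1, R=10))  # -> 2
```

With these parameters each compaction takes D ≈ 700 s, only one region (⌊6/4⌋ = 1) fits in the cache at a time, so one server handles about five regions per hour, and ten regions require two servers.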
Read performance on the region servers gets increasingly worse, as more store files are created that have to be scanned by reads, and the region servers start running out of block cache space as well. Finally, we can observe in Figure 9(d) that with two compaction servers, we can handle the compaction load of ten region servers comfortably, and response times remain smooth over the entire execution.

7. FAULT TOLERANCE

Our approach offers an efficient solution for offloading compactions while ensuring their correct execution even when components fail. This section addresses several important failure cases and discusses the fault tolerance of our solution.

7.1 Compaction Server Failure

When the compaction manager detects that a compaction server has failed, it reassigns its regions to another available compaction server. A compaction server can be in one of three states at the time of failure: idle, compacting some region(s), or transferring compacted data back to the region server(s). If it was performing a compaction, then its failure will cause a remote exception on the region server and the compaction will be aborted. Note that no actual data loss occurs, since the compaction server was writing to a temporary file, and the region server does not switch over to the compacted file until the compaction has completed. The region server can retry the compaction and it will be assigned to another compaction server. If no compaction servers are available, then the region server can simply perform the compaction itself.

If the compaction server was in the process of transferring a compacted file back to the region server when the failure occurs, this will also cause a remote exception on the other end. In the case of incremental warmup, some requests will already have started reading the partially transferred compacted data. Therefore, the region server needs to finalize loading the compacted data, which it can do by simply reading the store files from the file system instead, as the compaction server completed writing the new store files to HDFS before beginning the transfer to the region server. However, since the remaining portion of the compacted data now needs to be fetched from HDFS, read performance might suffer during the remainder of the warmup phase (as under the standard setup).
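The retry-and-fallback behaviour just described (retry on another compaction server, else compact locally) can be sketched as follows. This is a minimal Python sketch; every class and method name here is a hypothetical stand-in for region-server internals, with small stubs added so the sketch runs.

```python
# Sketch of the region server's fallback when offloading fails (Section 7.1).
# All names are hypothetical stand-ins for region-server internals.

class RemoteCompactionError(Exception):
    """Models the remote exception raised when a compaction server fails."""

def offload_compaction(region, compaction_manager, max_retries=2):
    """Try to offload; on remote failure retry with another server,
    and fall back to a local compaction if none are available."""
    for _ in range(max_retries):
        server = compaction_manager.assign_server(region)
        if server is None:
            break  # no compaction servers available
        try:
            # The server writes to a temporary file, and the region server
            # only switches over once the compaction completes, so an
            # aborted attempt loses no data.
            return server.compact(region)
        except RemoteCompactionError:
            compaction_manager.mark_failed(server)
            continue  # retry: the region is reassigned to another server
    return region.compact_locally()

# --- minimal stubs so the sketch is runnable ---
class Region:
    def __init__(self, name): self.name = name
    def compact_locally(self): return f"local:{self.name}"

class Server:
    def __init__(self, healthy): self.healthy = healthy
    def compact(self, region):
        if not self.healthy:
            raise RemoteCompactionError()
        return f"remote:{region.name}"

class Manager:
    def __init__(self, servers): self.servers = list(servers)
    def assign_server(self, region):
        return self.servers[0] if self.servers else None
    def mark_failed(self, server):
        self.servers.remove(server)

region = Region("r1")
mgr = Manager([Server(healthy=False), Server(healthy=True)])
print(offload_compaction(region, mgr))  # first server fails, retry succeeds
```

With an empty server list the same function falls through to `region.compact_locally()`, mirroring the paper's last-resort behaviour.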
7.2 Compaction Manager Failure

In our current implementation, in order to offload a compaction, the region server must go through the compaction manager to be forwarded to the compaction server that will handle the compaction. Thus, the compaction manager becomes a single point of failure in our setup. However, this is only an implementation issue. Since we use ZooKeeper for maintaining the compaction servers' region assignments, our design offers a reliable way for a region server to contact a compaction server directly. As with the HBase Master, if the compaction manager fails, we lose the ability to add or remove compaction servers and assign regions, so it would need to be restarted as soon as possible to resume these functions. However, ongoing compactions are not affected, since the region server and compaction server communicate directly once connected.

7.3 Region Server Failure

If a region server fails while waiting for an offloaded compaction to return, the compaction server detects the disconnection in the communication channel via a timeout, and the compaction is aborted. Once the HBase Master has assigned the affected regions to another region server, they can simply retry the compaction and it will be handled by the compaction server as a new compaction request. If a region server fails during the incremental warmup phase, the new parent region server must ensure that it loads only the newly compacted file(s) from HDFS, and not any of the older files, which should be discarded at this point. Although we currently do not handle this failure case, we intend to implement a simple solution for it by modifying the file names prior to initiating the incremental warmup. In this way, if the region is reopened by another region server, it can detect which files in the region's HDFS directory can be discarded due to being superseded by the newer compacted files.

8. RELATED WORK

The number of scalable key-value stores, as well as more advanced datastores providing more complex data models and transaction consistency, has increased very quickly over the last decade [2, 3, 5, 7, 10, 12]. Many of these datastores rely on creating multiple values/versions of data items rather than applying updates in place, in order to handle high write throughput requirements. However, read performance can be severely affected as trying to find the right data version for a given query takes increasingly longer over time. Therefore, compactions are a fundamental feature of these datastores, helping to regularly clean up expired versions, and thus keep read performance at acceptable levels.

Various types of compaction algorithms exist. In order to make compactions more efficient, these algorithms generally attempt to limit the amount of data processed per compaction by selecting files in a way that avoids the repetitive re-compaction of older data as much as possible. For instance, tiered compactions were first used by Bigtable and also adopted by Cassandra. Rather than selecting a random set of store files to compact, this algorithm selects only a fixed number (usually four) of store files at a time, picking files that are all around the same size. One effect of this mechanism is that larger store files may be compacted less frequently, thereby reducing the total amount of I/O taken up by compactions over time. The leveled compactions algorithm was introduced in LevelDB⁹ and was recently also implemented in Cassandra. The aim of this algorithm is to remove the need for searching through multiple store files to answer a read request. The algorithm achieves this goal simply by preventing updated values of a given row from ending up across multiple store files in the first place. The overall I/O load of leveled compactions is significantly larger than standard compactions; however, the compactions themselves are small and quick, and so tend to be much less disruptive to the datastore's runtime performance over time. On the other hand, if the cluster is already I/O-constrained, or if the workload is very update-intensive (e.g., time series), then leveled compactions become counter-productive. Striped compactions¹⁰, a variation of leveled compactions, have been prototyped for HBase as an improvement over its current algorithm. Yet another variation is implemented in bLSM [15], which presents a solution for fully amortizing the cost of compactions into the workload by dynamically balancing the rate at which the existing data is being compacted with the rate of incoming updates. In our approach, we take the compaction approach itself as a black box. In fact, all but the incremental warmup approach do not care at all about the actual content of the store files. The incremental warmup approach needs rows to be sorted in key order, but is also independent of the compaction algorithm.

Other data structures that perform periodic data maintenance operations in the same vein as the LSMT include R-trees [11] and differential files [16]. As with LSMT datastores, updates are initially written to some short-term storage layer, and subsequently consolidated into the underlying long-term storage layer via periodic merge operations, thus bridging the gap between OLTP and OLAP functionality. SAP HANA [17] is a major in-memory database that falls in this category. A merge in HANA is a resource-intensive operation performed entirely in memory. Thus, the server must have enough memory to simultaneously hold the current and compacted data. In principle, our incremental warmup algorithm offers the same performance benefits as a fully in-memory solution, while requiring half the memory.

Both computation offloading as well as smart cache management are well-known techniques in many distributed systems. But we are not aware of any other approach that considers offloading compactions with the aim of relieving the query processing server of the added CPU and memory load. However, the concept of separating different tasks that need to work on the same data is prevalent in replication-based approaches, which affords an opportunity to run different kinds of workloads simultaneously on different copies of the data. As long as potential data conflicts are efficiently handled, this has the advantage that the different workloads do not interfere with each other. For instance, in approaches that use primary copy replication, update transactions are executed on the primary site only, while the other copies are read-only. In the Ganymed system [14], for instance, the various read-only copies are used for various types of read-only queries, while the primary copy is dedicated to update transactions. In a similar spirit, we separate compactions from standard transaction processing to minimize interference of these two tasks.
⁹http://leveldb.googlecode.com/svn/trunk/doc/impl.html
¹⁰https://issues.apache.org/jira/browse/HBASE-7667

Techniques for the live migration of virtual machines, such as [4, 18], deal with transferring a machine's state and data to another machine and switching over to it without drastically affecting the workload being served. Similarly, techniques for live database migration deal with efficiently transferring the contents of the cache [6] and potentially the disk as well [8]. Thus, similar data transfer considerations arise as for compaction offloading. However, in these migration approaches, one generally does not need to consider the interference between two workloads (i.e., the query processing and the offloaded compaction, in our case).

9. CONCLUSIONS

In this paper, we took a fresh approach to compactions in HBase. Our primary goal was to eliminate the negative performance impacts of compactions under update-intensive OLTP workloads, particularly with regard to read performance. We proposed offloading major compactions from the region server to a dedicated compaction server. This allows us to fully utilize the region server's resources towards serving the actual workload. We also use the compaction server as a remote cache, since it already holds the freshly compacted data in its main memory. The region server fetches these blocks over the network rather than from its local disk. Finally, we proposed an efficient incremental warmup algorithm, which smoothly transitions from the current data in the region server's cache to the compacted data fetched from the remote cache. With YCSB and TPC-C, we showed that this last approach was able to eliminate virtually all compaction-related performance overheads. Finally, we demonstrated that our system can scale by adding more compaction servers as needed.

For future work, we would like to make the compaction manager more aware of the load balancing requirements of regions, region servers, and compaction servers. If one compaction server is assigned more regions than it can handle effectively, the compaction manager should re-balance regions accordingly among the available compaction servers, while taking into consideration their current respective loads.

10. ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for useful feedback to improve this paper. This work was partially funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Ministère de l'Enseignement supérieur, Recherche, Science et Technologie, Québec, Canada (MESRST).

11. REFERENCES

[1] A. S. Aiyer, M. Bautin, G. J. Chen, P. Damania, P. Khemani, K. Muthukkaruppan, K. Ranganathan, N. Spiegelberg, L. Tang, and M. Vaidya. Storage infrastructure behind Facebook Messages: Using HBase at scale. IEEE Data Eng. Bull., 35(2):4-13, 2012.
[2] J. Baker, C. Bond, J. C. Corbett, J. J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In CIDR, pages 223-234, 2011.
[3] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data. In OSDI, pages 205-218, 2006.
[4] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In NSDI, 2005.
[5] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. C. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's globally-distributed database. In OSDI, pages 261-264, 2012.
[6] S. Das, S. Nishimura, D. Agrawal, and A. El Abbadi. Albatross: Lightweight elasticity in shared storage databases for the cloud using live data migration. PVLDB, 4(8):494-505, 2011.
[7] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In SOSP, pages 205-220, 2007.
[8] A. J. Elmore, S. Das, D. Agrawal, and A. El Abbadi. Zephyr: Live migration in shared nothing databases for elastic cloud platforms. In SIGMOD, pages 301-312, 2011.
[9] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, pages 29-43, 2003.
[10] A. Gupta, F. Yang, J. Govig, A. Kirsch, K. Chan, K. Lai, S. Wu, S. G. Dhoot, A. R. Kumar, A. Agiwal, S. Bhansali, M. Hong, J. Cameron, M. Siddiqi, D. Jones, J. Shute, A. Gubarev, S. Venkataraman, and D. Agrawal. Mesa: Geo-replicated, near real-time, scalable data warehousing. PVLDB, 7(12):1259-1270, 2014.
[11] C. Kolovson and M. Stonebraker. Indexing techniques for historical databases. In Data Engineering.
[12] A. Lakshman and P. Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35-40, Apr. 2010.
[13] P. E. O'Neil, E. Cheng, D. Gawlick, and E. J. O'Neil. The log-structured merge-tree (LSM-tree). Acta Inf., 33(4):351-385, 1996.
[14] C. Plattner, G. Alonso, and M. T. Özsu. Extending DBMSs with satellite databases. VLDB J., 17(4):657-682, 2008.
[15] R. Sears and R. Ramakrishnan. bLSM: A general purpose log structured merge tree. In SIGMOD, pages 217-228, 2012.
[16] D. G. Severance and G. M. Lohman. Differential files: Their application to the maintenance of large databases. ACM Trans. Database Syst., 1(3):256-267, Sept. 1976.
[17] V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, and C. Bornhövd. Efficient transaction processing in SAP HANA database: The end of a column store myth. In SIGMOD, pages 731-742, 2012.
[18] T. Wood, P. J. Shenoy, A. Venkataramani, and M. S. Yousif. Black-box and gray-box strategies for virtual machine migration. In NSDI, 2007.