
Compaction management in distributed key-value datastores

Muhammad Yousuf Ahmad, McGill University, muhammad.ahmad2@mail.mcgill.ca
Bettina Kemme, McGill University, kemme@cs.mcgill.ca

ABSTRACT

Compactions are a vital maintenance mechanism used by datastores based on the log-structured merge-tree to counter the continuous buildup of data files under update-intensive workloads. While compactions help keep read latencies in check over the long run, this comes at the cost of significantly degraded read performance over the course of the compaction itself. In this paper, we offer an in-depth analysis of compaction-related performance overheads and propose techniques for their mitigation. We offload large, expensive compactions to a dedicated compaction server to allow the datastore server to better utilize its resources towards serving the actual workload. Moreover, since the newly compacted data is already cached in the compaction server's main memory, we fetch this data over the network directly into the datastore server's local cache, thereby avoiding the performance penalty of reading it back from the file system. In fact, pre-fetching the compacted data from the remote cache prior to switching the workload over to it can eliminate local cache misses altogether. Therefore, we implement a smarter warmup algorithm that ensures that all incoming read requests are served from the datastore server's local cache even as it is warming up. We have integrated our solution into HBase, and using the YCSB and TPC-C benchmarks, we show that our approach significantly mitigates compaction-related performance problems. We also demonstrate the scalability of our solution by distributing compactions across multiple compaction servers.

1. INTRODUCTION

A number of prominent distributed key-value datastores, including Bigtable [3], Cassandra [12], HBase (http://hbase.apache.org/), and Riak (http://basho.com/riak/), can trace their roots back to the log-structured merge-tree (LSMT) [13], a data structure that supports high update throughputs along with low-latency random reads. Thus,
these datastores are well-suited for online transaction processing (OLTP) applications that have demanding workloads. In order to handle a high rate of incoming updates, the datastore does not perform updates in place but creates new values for the updated keys and initially buffers them in main memory, from where they are regularly flushed, in sorted batches, to read-only files on stable storage. As a result, reading even a single key could potentially require traversing multiple files to find the correct value of a key. Hence, a continuous build-up of these immutable files can cause a gradual degradation in read performance that gets increasingly worse over time. In order to curb this behavior, the datastore runs special maintenance operations, commonly referred to as compactions, on a regular basis. A compaction merge-sorts multiple files together, consolidating their contents into a single file. In the process, individual values of the same key, potentially spread across multiple files, are merged together, and any expired or deleted values are discarded. Thus, over the long run, compactions help maintain read latency at an acceptable level by containing the gradual build-up of immutable files in the system. However, this comes at the cost of significant latency peaks during the execution of compactions, as they compete with the actual workload for CPU, memory, and I/O resources.

Since compactions are an essential part of any LSMT-based datastore, we would like to be able to exercise a greater degree of control over their execution in order to mitigate any undesirable impacts on the performance of the regular workload. Datastore administrators, based on their experience and understanding of application workloads, manage these performance overheads by carefully tuning the size and schedule of compaction events [1]. For example, a straightforward mitigation strategy could be to postpone major compactions to off-peak hours. Recent proposals and prototypes of smarter compaction algorithms in Cassandra and HBase (e.g., leveled, striped) attempt to make the compaction process itself more efficient, generally by avoiding repetitive re-compactions of older data as much as possible. However, there is currently a dearth of literature pertaining to our understanding of how and when exactly compactions impact the performance of the regular workload.

To this end, as our first contribution, this paper presents an in-depth experimental analysis of these overheads. We hope that this helps data platform designers and application developers to better understand and evaluate these overheads with respect to resource provisioning and framing performance-based service-level agreements.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain permission prior to any use beyond those covered by the license. Contact copyright holder by emailing info@vldb.org. Articles from this volume were invited to present their results at the 41st International Conference on Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, Hawaii. Proceedings of the VLDB Endowment, Vol. 8, No. 8. Copyright 2015 VLDB Endowment 2150-8097/15/04.
Since our work relates to OLTP applications, a primary concern is transaction response time. Our observations show that large compaction events have an especially negative impact on the response time of reads due to two issues. First, during the compaction, the compaction process itself competes for resources with the actual workload, degrading its performance. A second major problem is the cache misses that are induced upon the compaction's completion. Distributed key-value datastores generally rely heavily on main memory caching to achieve low latencies for reads (see http://www.slideshare.net/xefyr/hbasecon2014-low-latency). In particular, if the entire working data set is unable to fit within the provisioned main memory, the read cache may experience a high degree of churn, resulting in very unstable read performance. Since distributed datastores are designed to be elastically scalable, it is generally assumed that a sufficient number of servers can be conveniently provisioned to keep the application's growing data set in main memory. Even so, under update-intensive workloads, frequent compactions can become another problematic source of cache churn. Since a compaction consolidates the contents of multiple input files into a new output file, all references to the input files are obsoleted in the process, which necessitates the datastore to invalidate the corresponding entries in its read cache. Our analysis shows that the cache misses caused by these large-scale evictions result in an extended degradation in read latency, since the datastore server then has to read the newly compacted data from the file system into its cache all over again.

With this in mind, our second major contribution is to propose a novel approach that attempts to keep the impact of compactions on the performance of the actual workload as small as possible, both during a compaction's execution and after it completes. As a
first step, we offload the compactions themselves to dedicated nodes called compaction servers. Taking advantage of the data replication inherent in distributed datastores, we enable a compaction server to transparently read and write replicas of the data by acting as a specialized peer of the datastore servers. In the second step, aiming to reduce the overhead of cache misses after the compaction, we use the compaction server as a remote cache for the datastore server. That is, instead of reading the newly compacted files from the file system, the datastore server reads them directly from the compaction server's cache, thereby significantly reducing both load time and read latency.

Although this alleviates the performance penalty incurred by local cache misses to a great degree, it does not completely eliminate it. In order to address the remaining overhead of these cache misses, one approach could be to eagerly warm the datastore server's cache with the compacted data immediately upon the compaction's completion. But such an approach is only feasible when we have enough main memory provisioned, such that the datastore server can simultaneously fit both the current data as well as the compacted data in its cache, thus allowing for a seamless switch between the two. Instead, we propose a smart warmup algorithm that fetches the compacted data from the remote cache in sequential chunks, where each chunk replaces the corresponding range of current data in the local cache. During this incremental warmup phase, we guarantee that each read request is served completely either by the old data
files or by the freshly compacted data. This ensures that all incoming read requests can be served from the datastore server's local cache even as it is warming up, thereby completely eliminating the performance penalty associated with switching over to the newly compacted data.

In short, the main contributions of this paper are:

1. An experimental analysis of the performance impacts associated with compactions in HBase and Cassandra.
2. A scalable solution for offloading compactions to one or more dedicated compaction servers.
3. A solution for efficiently streaming the compacted data from the compaction server's cache into the datastore server's local cache over the network.
4. A smart algorithm for incrementally warming up the datastore server's cache with the compacted data.
5. An implementation of the above and its evaluation based on HBase.

Our paper does not follow the typical structure found in research papers, which first present the solution in full followed by the experiments. Instead, we use a step-wise approach, where we first describe a part of our solution, immediately accompanied by an experimental evaluation of this part to better understand its implications. In this spirit, Section 2 provides an overview of the log-structured merge tree, HBase, and Cassandra. Section 3 describes the overall architecture of our approach and a high-level description of the integration of our new components into HBase. Section 4 then shortly describes the experimental setup, before Section 5 digs into the details of our solution and their evaluation. Section 6 discusses scalability, along with some further experimental results, and Section 7 discusses the fault-tolerance aspects of our solution. Section 8 presents a summary of the related work. We conclude in Section 9.

2. BACKGROUND

This section provides an overview of the background relevant to our understanding of compactions in LSMT-based datastores, as well as a short overview of how compactions are performed in HBase and Cassandra.

2.1 LSMT

The log-structured merge-tree (LSMT) [13] is a key-value data structure that aims to provide a data storage and retrieval solution for high-throughput applications. It is a hybrid data structure, with a main memory layer (C0) placed on top of one or more file system layers (C1, and so on). Updates are collected in C0 and
flushed down to C1 in batches, such that each batch becomes an immutable file, with the key-value pairs written in sorted order. This approach has several important implications. Firstly, for the client, updates are extremely fast, since they are applied in-memory. Secondly, flushing updates down in batches is more efficient since it significantly reduces disk I/O. Moreover, appending a batch of updates to a single file is much faster than executing multiple random writes on a rotational storage medium (e.g., magnetic disk). This enables the data structure to support high update throughputs. Thirdly, multiple updates on a given key may end up spread across C0 and any number of files in C1 (or below). In other words, we can have multiple values per key. Therefore, a random read on a given key must first search through C0 (a quick, in-memory lookup), then C1 (traversing all the files in that layer), and so on, until it finds the most recent value for that key. Since the contents of a file are already sorted by key, an in-file index can be used to speed up random reads within a file. These read-only files inevitably start building up in the file system layer(s), resulting in reads becoming increasingly slow over time. This is remedied by periodically selecting two or more files in a layer and merge-sorting them together into a single file. The merge process overwrites older values with the latest ones and discards deleted values, thereby clearing up any stale data.

2.2 HBase

HBase is a modern distributed key-value datastore inspired by Bigtable [3]. HBase offers the abstraction of a table, where each row represents a key-value pair. The key part is the unique identifier of the row, and the value part comprises an arbitrary number of column values. Columns are grouped into column families to partition the table vertically. Each table can also be partitioned horizontally into many regions. A region is a contiguous set of rows sorted by their keys. When a region grows beyond a certain size, it is automatically split into half, forming two new regions. Every region is assigned by a Master server to one of multiple region servers in the HBase cluster (see Figure 1).

Figure 1: HBase Architecture

Through well-balanced region placement, the application workload can be evenly distributed across the cluster. When a region server becomes overloaded, some of its regions can be reassigned to other underloaded region servers. When the cluster reaches its peak load capacity, new region servers can be provisioned and added to the online cluster, thus allowing for elastic scalability. HBase relies on Zookeeper (http://zookeeper.apache.org/), a lightweight quorum-based replication system, to reliably manage the meta-information for these tasks. HBase uses HDFS (http://hadoop.apache.org/) as its underlying file system for the application data, where each column family of each region is physically stored as one or more immutable files called store-files (corresponding to LSMT layer C1). HDFS is a highly scalable and reliable distributed file system based on GFS [9]. It automatically replicates file blocks across multiple datanodes for reliability and availability. Normally, there is a datanode co-located with each region server to promote data locality. HDFS has a Namenode, similar in spirit to the HBase Master, for meta-management.
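The LSMT mechanics that the preceding sections describe (in-memory buffering in C0, sorted immutable runs in C1, the multi-file read path, and the merge) can be condensed into a small sketch. This is illustrative code only, not HBase's implementation; here `None` stands in for a deletion marker (tombstone).

```python
import heapq

class LSMT:
    """Toy log-structured merge-tree: a mutable in-memory layer (C0)
    on top of immutable, sorted runs (C1), searched newest-first."""

    def __init__(self):
        self.c0 = {}    # main-memory layer: key -> latest value
        self.runs = []  # file-system layer: each run is a sorted list of (key, value)

    def put(self, key, value):
        self.c0[key] = value  # updates never touch existing files

    def delete(self, key):
        self.c0[key] = None   # tombstone, dropped later by a major compaction

    def flush(self):
        # Flush C0 as one sorted, immutable run; newer runs are searched first.
        if self.c0:
            self.runs.insert(0, sorted(self.c0.items()))
            self.c0 = {}

    def get(self, key):
        # Read path: C0 first, then every run from newest to oldest.
        if key in self.c0:
            return self.c0[key]
        for run in self.runs:  # a real store binary-searches an in-file index
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge-sort all runs into one, keeping only the newest value per
        # key and dropping tombstones (what HBase calls a major compaction).
        merged, seen = [], set()
        for k, v in heapq.merge(*self.runs, key=lambda kv: kv[0]):
            if k not in seen:  # first occurrence comes from the newest run
                seen.add(k)
                if v is not None:
                    merged.append((k, v))
        self.runs = [merged]
```

After `compact()`, a read touches a single run instead of many, which is precisely the read-latency benefit that compactions buy at the cost of the merge work itself.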
Applications interact with HBase through a client library that provides an interface for reading and writing key-value pairs, either individually or in batches, and performing sequential scans that support predicate-based filtering. Each read (i.e., get or scan) or write (i.e., put or delete) request is sent to the region server that serves the region to which the requested key-value pair(s) belongs. A write is served simply by applying the received update to an in-memory data structure called a memstore (corresponding to LSMT layer C0). This allows for having multiple values for each column of a row. Each region maintains one memstore per column family. When the size of a memstore reaches a certain threshold, its contents are flushed to HDFS, thereby creating a new store-file. A read is served by scanning for the requested data through the memstore and through all of the region's store-files that might contain the requested data. Each region server maintains a block cache that caches recently accessed store-file blocks to improve read performance.

Periodically, or when the number of a region's store-files crosses a certain configurable limit, the parent region server will perform a compaction to consolidate the contents of several store-files into one. When a compaction is thus triggered, a special algorithm decides which of the region's store-files to compact. If it selects all of them in one go, it is called a major compaction, and a minor compaction otherwise. Unlike a minor compaction, a major compaction additionally also removes values that have been flagged for deletion via their latest updates. Therefore, major compactions are more expensive and usually take much longer to complete.

2.2.1 Exploring Compactions

The default compaction algorithm in HBase uses a heuristic that attempts to choose the optimal combination of store-files to compact based on certain constraints specified by the datastore administrator. The aim is to give the administrator a greater degree of control over the size of compactions and thus, indirectly, their frequency as well. For example, it is possible to specify minimum and maximum limits on the number of store-files that can be processed per compaction. Similarly, the algorithm also allows us to enforce a limit on the total
file size of the group, so that minor compactions do not become too large. Finally, a ratio parameter can be specified that ensures that the size of each store-file included in the compaction is within a certain factor of the average file size of the group. The algorithm explores all possible permutations that meet all these requirements and picks the best one (or none), optimizing for the ratio parameter. We can configure HBase to use different ratio parameters for peak and off-peak hours.

2.3 Cassandra

Cassandra is another popular distributed key-value datastore. Its design incorporates elements from both Bigtable and Dynamo [7]. As a result, it has a lot in common with HBase, yet also differs from it in several important respects. Unlike HBase, Cassandra has a decentralized architecture, so a client can send a request to any node in the cluster, which then acts as a proxy between the client and the nodes that actually serve the client's request. Cassandra also allows applications to choose from a range of consistency settings per request. The lowest setting allows for inconsistencies such as stale reads and dirty writes (though, eventually, the datastore does reach a consistent state), but offers superior performance. The strictest setting matches the consistency level of HBase, but sacrifices on performance. While HBase maintains its own block cache, Cassandra relies instead on the OS cache for faster access to hot file blocks. At a finer granularity, it also offers the option of using a row-level cache. Finally, Cassandra uses a slightly different compaction algorithm (see tiered compactions in Section 8). Unlike HBase, minor compactions in Cassandra clean up deleted values as well. Cassandra also throttles compactions to limit their overhead.

Figure 2: Offloading Compactions

Despite these differences, Cassandra has two important similarities with HBase: Cassandra also partitions its data and runs compactions on a per-partition basis. Moreover, Cassandra also flushes its in-memory updates in sorted batches into read-only
files. These similarities make us believe that many aspects of our approach, although implemented in HBase, are generally applicable to Cassandra and other datastores in the LSMT family as well.

3. ARCHITECTURE

Our solution adds two new components to the datastore architecture: a centralized compaction manager and a set of compaction servers. The integration of these components into the HBase architecture is depicted in Figure 2.

A compaction server performs compactions on behalf of region servers. Therefore, it also hosts a datanode in order to gain access to the HDFS layer. Whenever a region server flushes region data, it writes a new store-file to HDFS, which can then be read by the compaction server. Similarly, upon compacting a region, the compaction server writes the compacted store-file back to HDFS as well. Compaction servers can be added or removed, allowing for scalability. Each compaction server is assigned some subset of the data. The compaction manager manages these assignments, mapping regions to compaction servers akin to how the HBase Master maps regions to region servers.

While our implementation makes substantial additions and changes to HBase, we have attempted to perform them in a modular manner. We used the HBase Master and region server code as a base for implementing the compaction manager and the compaction server, respectively. For example, the compaction server reuses the code for scanning store-files from HDFS and performing compactions on them. That is, we take the compaction algorithm as a black box, without modifying it. However, we modified specific subcomponents of the region server code so that it could offload compactions to a compaction server and also receive the compacted data back over the network for more efficient warmup.

4. EXPERIMENTAL SETUP

Since the next section combines the presentation of our proposed solutions along with a detailed performance analysis of each of the steps, we provide a summary of the general experimental setup before proceeding.

4.1 Environment

We ran our experiments on a homogeneous cluster of 20 Linux machines. Each node has a 2.66 GHz dual-core Intel Core 2 processor, 8 GB of RAM, and a 7,200 RPM SATA HDD with 160 GB. The nodes are connected over a Gigabit Ethernet switch. The OS is 64-bit Ubuntu Linux and the Java environment is 64-bit Oracle JDK 7. We used the following software versions: HBase 0.96, HDFS 2.3, Cassandra 2.0, and YCSB 0.1.4.

4.2 Datastores

4.2.1 HBase/HDFS

The HBase Master, the HDFS Namenode, and ZooKeeper services all share one dedicated node (reliability was not a focus of the evaluation, so we provisioned one ZooKeeper server only, with sufficient capacity). We modified a few key configuration parameters in HBase in order to better study the overheads of compactions. The compaction file selection ratio was changed from 1.2 to 3.0. Region servers were allocated 7 GB of main memory, of which 6 GB went to the block cache. We used Snappy (https://code.google.com/p/snappy/) for compression. In all our experiments, each region server and compaction server hosts their respective datanode, with a minimum of three datanodes in the cluster.

4.2.2 Cassandra

Since Cassandra prefers to use the OS cache, we allocated only 4 GB of main memory to its process and kept the row cache disabled. We used the ByteOrderedPartitioner, which allows us to efficiently perform sequential scans by primary key (the default, random partitioner is unsuitable for this purpose). Since the standard YCSB binding for Cassandra is outdated, we implemented a custom binding for Cassandra 2.0 using the latest CQL3 API.

4.3 Benchmarks

We are interested in running OLTP workloads on a cloud datastore. A typical OLTP workload generates a high volume of concurrently executing read-write transactions. Most transactions execute a series of short reads and updates, but a few might also execute larger read operations such as partial or full table scans. In our experiments, we try to emulate these workload characteristics with two benchmarks.

YCSB is a popular microbenchmark for distributed datastores. We used it to stress both HBase and Cassandra with an update-intensive workload. We launch separate client processes for reads and writes. Our write workload consists of 100% updates, while our read workload comprises 90% gets and 10% scans. We used the Zipfian distribution to reflect an OLTP workload more closely.

TPC-C is a well-known OLTP benchmark that is generally used for benchmarking traditional relational database setups. We used an implementation of TPC-C called PyTPCC (http://github.com/apavlo/py-tpcc), which works with various cloud datastores, including HBase. Since there is no support for transactions in HBase, the benchmark simply executes its transactions without ACID guarantees. For convenience, it does not simulate the think time between transactions, thus allowing us to stress the datastore with fewer clients. The workload comprises five transaction types: New-Order (45%), Payment (43%), Order-Status (4%), Delivery (4%), and Stock-Level (4%). We populated 50 warehouses, corresponding to around 14 GB of actual data.

5. OFFLOADING COMPACTIONS

A key performance goal of OLTP applications is maintaining low response times under high throughput. In this section, we first show that read performance can suffer significantly during and immediately after a large compaction, in both HBase and Cassandra. We then propose and evaluate a number of strategies for alleviating this problem.

5.1 Motivation

To understand the implications of HBase compactions on read performance, we ran a YCSB workload with 10 read threads against one region server (no compaction server). Our test table held three million rows in a single region, equivalent to around 4 GB of actual, uncompressed data. This ensured that the working data set fit comfortably within the region server's 6 GB block cache. We recorded the response times of gets and scans over the course of the experiment, at five-second intervals.

Figure 3: Motivation. (a) No Compactions (HBase): under an update-intensive workload, read latency in HBase gets increasingly worse over time if the store-files that build up are not regularly compacted (the figure is scaled for scans, but gets are affected just the same). (b) Compactions (HBase): although regular compactions help maintain read performance within reasonable limits over the long run, read latency still spikes significantly during the compaction events themselves. (c) Compactions (Cassandra): Cassandra suffers from the same problem; we can see that the larger of the two compactions has a significant negative impact on read performance over a period of around ten minutes; note the same two distinct phases.

The graphs in Figure 3 show the response time of gets and scans over the duration of each experiment. Figure 3(a) shows the observed degradation in read performance over time when compactions are disabled altogether. Figure 3(b) shows that while compactions help maintain read performance within reasonable limits over the long run, each compaction event causes a significant spike in response time. We can also see that a major compaction causes a much larger and longer degradation in read performance relative to minor compactions. Note that both gets and scans are severely affected by the major compaction. A similar experiment on Cassandra shows that it also exhibits severe compaction-related performance degradation (see Figure 3(c)).
Figure 4(a) zooms into the compaction phase. We can see that a major compaction can add a noticeable performance overhead on the region server that executes it, and can typically take on the order of a few minutes to complete. The response times of read operations executing on this region server degrade noticeably during this time. The figure shows two distinct phases of degradation: compaction and warmup. The compaction phase is characterized by higher response times over the duration of the compaction. We observed that this is mainly due to the CPU overhead associated with compacting the store-file data. Both gets and scans are affected. The warmup phase starts when the compaction completes. At this time the server switches from the current data to the newly compacted data. The switch triggers the eviction of the obsoleted file blocks en masse, followed by a flurry of cache misses as the compacted data blocks are then read and cached. This leads to a severe degradation in read response times for an extended period. Figure 3(c) shows that Cassandra similarly exhibits these two phases as well.

5.2 Compaction Phase

We first attempt to deal with the overhead of the compaction phase. Our observations show that the performance degradation in this phase can be exacerbated by the datastore server experiencing high loads. In other words, overloading an already saturated processor can cause response times to spike and the compaction itself to take much longer to complete. One approach to manage this overhead is for the datastore to limit the amount of resources that a compaction consumes. By throttling compactions in this way, the datastore can amortize their cost over a longer duration. In fact, this is the approach taken by Cassandra; it throttles compaction throughput to a configurable limit (16 MB/s by default). However, we believe that this approach does not sufficiently address the problem, for three reasons mainly. Firstly, Figure 3(c) shows that despite the throttling, response times still spiked with the compaction, just as was observed with HBase with no throttling. We could, of course, throttle more aggressively, thereby amortizing the overhead over a much longer period, but this leads to our second concern. The longer a compaction takes, the more obsoleted data (deleted and expired values) the datastore server must maintain over that duration, thus continuing to
tofooadingthecompactionusingYCSB.Figure4(b)plotstheresponsetimesofgetsandscansunderourapproach,wherewesimplyaddedthecompactionmanagerandonecompactionservertothepreviousexperiment.ComparingFigure4(b)againstthestandardsetupinFigure4(a),wecanseethatwithadedicatedcompactionserver,thecompactionphaseisshorter,withanoticeableimprovementinreadla-tencyaswell.Ontheotherhand,weseenoimprovementinthelong-runningwarmupphaseafterthecompactionhascompleted.Therefore,next,wediscusstheadvantagesofhavingthecompacteddatainthecompactionserver'smainmemoryforimprovingthewarmupphase.5.3WarmupPhaseAspreviouslydiscussed,weobservethatoncethecom-pactioncompletes,theregionservermustreadtheoutputstore lefromdiskbackintoitsblockcacheinordertoservereadsfromthenewlycompacteddata.Atthisstage,readperformancecansu ersigni cantlyduetothehighrateofcachemissesastheblockcachegraduallywarmsupagain.Theexperimentalresultspresentedsofarclearlyshowthatthewarmupphasehasasigni cantnegativeimpactontheperformanceofourworkload.Infact,wetendtoseeanex-tendedphaseofuptoafewminutesofseverelydegradedresponsetimesforbothindividualgetsaswellasscans.Therefore,intheremainderofthissection,weanalyzethisparticularperformanceissueandattempttomitigateit.5.3.1Write­ThroughCachingFirst,weanalyzethewarmupphaseinthestandardsetup(i.e.,theregionserverdoesnotooadthecompaction).Weconsiderwhethercachingacompaction'soutputinawrite-throughmanner{i.e.,eachblockwrittentoHDFSissimul-taneouslycachedintheblockcacheaswell{couldpresentanybene tunderthestandardsetup.Ideally,thiswouldeliminatetheneedforawarmupphasealtogether.How-ever,ourobservationsshowthatthisapproachdoesnotinfactyieldpromisingresults.Inordertotestthisidea,wemodi 
edHBasetoallowustocachecompactedblocksinawrite-thoughmanner.InFigure5(b),wecancomparetheperformanceofthisapproachagainstthestandardsetup(Figure5(a)).Weseethatwhilethewarmupphaseimprovestoanextent,theperformancepenaltyispassedbacktothecompactionphaseinstead.Uponfurtherinvestigation,wewitnessedlarge-scaleevictionsofhotblocksfromtheblockcacheduringthecompaction,resultinginheavycachechurn,whichseverelydegradedreadperformance.Inotherwords,weseethatduringthecourseofthecompaction,thenewlycompacteddatacompetesforthelimitedcapacityofthere-gionserver'sblockcacheevenasthecurrentdataisstillbeingread,sincetheswitchtothenewdataismadeonlyoncethecompactioncompletes.Therefore,thisapproachonlyshiftstheproblembacktothecompactionphase.Clearly,thelargerthemainmemoryofeachregionserveris,comparedtothesizeoftheregionsitmaintains,themoreofthecurrentdataandcompacteddatawill ttogetherintomainmemory,andthelesscachechurnwewillobserve.However,thatwouldleadtoasigni cantover-provisioningofmemoryperregionserver,sincetheextramemorywouldonlybeusedduringcompactions.Forthisreason,webe-lievethathavingafewcompactionserversactingasremotecachesthataresharedbymanyregionservers,cansolvethisproblemwithlessoverallresources.5.3.2RemoteCachingOurapproachofooadingcompactionspresentsuswithaninterestingopportunitytotakeadvantageofwrite-throughcachingonthecompactionserverinstead,therebycombin-ingbothapproaches.Asadedicatednode,itcanbeaskedtoplaytheroleofaremotecacheduringthewarmupphasesinceitalreadyhasthecompactionoutputcachedinitsmainmemory.Withthisapproach,insteadofreadingthenewlycompactedblocksfromitslocaldisk,theregionserverrequeststhemfromthecompactionserver'smemoryinstead.Thereisanobvioustrade-o herebetweendiskandnetworkI/O.Sinceourmainaimisachievingbetterresponsetimes,wedeemthistrade-o tobeworthwhileforsetupswherenetworkI/OisfasterthandiskI/O. 
(a)StandardSetup(SS) (b)Write-ThroughCaching(WTC) (c)RemoteCaching(RC)Figure5:WarmupPhaseWehaveimplementedaremoteprocedurecallthatal-lowstheregionservertofetchthecachedblocksfromthecompactionserverinsteadofreadingthemfromthelocalHDFSdatanode.Toreducethenetworktransferoverhead,wecompresstheblocksatthesourceusingSnappy,andsubsequentlyuncompressthemuponreceivingthemattheregionserver.Thiscomesatthecostofaslightprocessingoverhead,butthesavingsinthetotaltransfertimeandnet-workI/Omakethisanacceptabletrade-o .Weevaluatethee ectivenessofthisapproachinFigure5(c),whichshowsasigni cantimprovementinresponsetimesinthewarmupphaseascomparedtonothavingaremotecacheavailable.Whiletheevictionoftheobsoleteddatablocksstillcausescachemisses,notethatthewarmupphasecompletesquickerduetothemuchfasteraccessofblocksfromthecompactionserver'smemoryoverthenetworkratherthanfromdisk.Ofcourse,thebene tofcompactionooadingonthecom-pactionphaseisretainedaswell.Nevertheless,westillobserveadistinctperformancebound-arybetweenthecompactionandwarmupphaseswherethecachemissesoccur.Hence,whiletheremotecacheo ersasigni cantimprovementoverreadingfromthelocaldisk,theperformancepenaltyduetothesecachemissesremainstobeaddressed.5.4SmartWarmupToobtainfurtherimprovements,weessentiallyneedtoavoidcachemissesbypreemptivelyfetchingandcachingthecompacteddata.Wediscusstwooptionsfordoingthis.5.4.1Pre­SwitchWarmupInthe rstoption,wewarmthelocalcacheup(transfer-ringdatafromthecompactionservertotheregionserver)priortomakingtheswitchtothecompacteddata.Thisissimilarinprincipletothewrite-throughcachingapproachpreviouslydiscussed.Thatis,itse ectivenessdependsontheavailabilityofadditionalmainmemory,suchthattheregionservercansimultaneously 
tboththecurrentdataaswellasthecompacteddatainitscache,thusallowingforaseamlessswitch.Whencomparedwithwrite-throughcaching,inwhichthewarmuphappensduringthecom-pactionitself,hereweperformthewarmupafterthecom-pactioncompletes.Therefore,sincethecompactionisper-formedremotelyandthecompacteddatafetchedoverthenetwork,theregionserver'sperformancedoesnotsu erdur-ingthecompaction,and,oncetheswitchismade,there-mainderofthewarmupismoreecientaswell.Figure6(a)showstheperformanceofthisapproach.Thepre-switchwarmupcomprisestwosub-phases,depictedinthe gureusinggrayandpink,respectively.Recallthat6GBofmainmemoryisavailablefortheblockcache.Sincethecurrentdatatakesuparound4GB,thepre-switchwarm-upcan lluptheremaining2GBwithoutseverelya ectingtheperformanceoftheworkload(gray).However,asthewarmupcontinuesbeyondthispoint(pink),thecompacteddatacompeteswiththecurrentdatainthecache,resultinginseverelydetrimentalcachechurn.Thisalsoa ectspost-switchperformance(orange),sincewemustthenre-fetchthecompacteddatathatwasoverwrittenbythecurrentdata.Therefore,thelongerthepre-switchwarmupphasetakes,thelesse ectivethisapproachbecomes.Neverthe-less,itsoverallperformanceisstillbetterthanthewrite-throughcachingapproachwithoutthecompactionserver(Figure5(b)),since,inthelattercase,theoldandnewdataalreadystarttocompeteduringthecompactionphaseoncetheblockcache llsup;whereas,withtheremotecache,thedetrimentalcachechurnoccursonlyforamuchshorterpartofthepre-switchwarmupphase.SinceOLTPworkloadstypicallygenerateregionsofhotdata,wealsotriedaversionofthisapproachwherewewarmthecacheupwithonlyasmuchhotdataascan tside-by-sidewiththecurrentdata(gray)sothatwedonotcauseanycachechurn(pink).However,thisstrategyappearedtoo ernoadditionalbene 
twhentested.Werealizedthatthisisbecausethehotdatacompriseslessthan1%oftheblocks,whichcaneasilybefetchedalmostimmediatelyineithercase(beforeoraftertheswitch),meaningthat99%ofcachemissesareactuallyassociatedwithcolddata.5.4.2IncrementalWarmupOurexperimentalanalysisaboveshowsthatthelessad-ditionalmemoryisprovisionedontheregionserver,theworsethepre-switchwarmupwillperform.Therefore,wenowpresentanincrementalwarmupstrategythatsolvesthisproblemwithoutrequiringtheprovisioningofadditionalmemory.Itworksontwofronts.The rstaspectisthatwefetchthecompacteddatafromtheremotecacheinsequen-tialchunks,whereeachchunkreplacesthecorrespondingrangeofcurrentdatainthelocalcache.Forthis,weexploitthefactthatthestore leswrittenbyLSMTdatastoresarepre-sortedbykey.Hence,wecanmovesequentiallyalongthecompactedpartition'skeyrange.Thatis,we rsttrans-ferthecompacteddatablockswiththesmallestkeysinthestore les.Atthesametime,weevictthecurrentdatablocksthatcoverthesamekeyrangethatwejusttransferred.Thatis,thenewlycompacteddatablocksreplacethedatablockswiththesamekeyrange.Atanygiventime,wekeeptrackoftheincrementalwarmupthresholdTwhichrepresentsthe (a)Pre-switchWarmup(PSW) (b)IncrementalWarmup(IW) (c)ThrottledIncremental(TIW)Figure6:SmartWarmuprowkeywiththefollowingproperty:allnewlycompactedblocksholdingrowkeyssmallerorequaltoThavebeenfetchedandcached,and,correspondingly,allcurrentblocksholdingrowskeysuptoThavebeenevictedfromthelo-calcache.ThismeansthatallcurrentblockswithrowkeyslargerthanThavenotbeenevictedyetandarestillintheregionserver'scache.Readoperationsarenowexecutedinthefollowingwayonthismixeddata.GivenagetrequestforarowwithkeyR,orascanrequestthatstartsatkeyR,wedirectittoreadthenewlycompactedstore leifRT,andthecurrentstore les(canbeoneormore,withoverlappingkeyranges)otherwise.Inthisway,weensurethatallincomingrequestscanbeservedimmediatelyfromtheregionserver'sblockcacheevenasitiswarmingup,thusremovingtheoverheadassociatedwithcachemisses.AsFigure6(b)shows,theimprovemento eredbythisapproachissigni 
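The threshold-based read routing described above can be sketched as follows. This is a minimal illustration with hypothetical names (the actual implementation is integrated into HBase's read path); keys are compared as byte strings, mirroring HBase row-key ordering.

```python
# Sketch of incremental-warmup read routing: gets and scan start-keys at or
# below the threshold T read the compacted store file; keys above T still
# read the current store files. Names here are illustrative, not HBase's.

class IncrementalWarmup:
    """Tracks the incremental warmup threshold T and routes reads."""

    def __init__(self):
        self.threshold = None  # no compacted blocks cached yet

    def advance(self, chunk_end_key: bytes):
        # A chunk of compacted blocks covering keys up to chunk_end_key has
        # been fetched; the matching current blocks would be evicted here.
        self.threshold = chunk_end_key

    def route(self, row_key: bytes) -> str:
        # R <= T: serve from the newly compacted store file (already cached).
        # Otherwise: serve from the current store files (still cached).
        if self.threshold is not None and row_key <= self.threshold:
            return "compacted"
        return "current"

w = IncrementalWarmup()
assert w.route(b"row-0100") == "current"    # nothing transferred yet
w.advance(b"row-0500")                      # first chunk cached, T = row-0500
assert w.route(b"row-0100") == "compacted"
assert w.route(b"row-0500") == "compacted"  # boundary key: R <= T
assert w.route(b"row-0900") == "current"
```

Because the compacted chunks arrive in key order, `advance` only ever moves T forward, which is what guarantees that every request falls entirely on one side of the threshold.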
While a get only reads a single row, a scan spans multiple rows and thus could potentially span multiple blocks of a store file. Therefore, a scan may fall under one of the three following cases. If the scan starts and ends below the incremental threshold, T, it will read only compacted data that is already cached. If the scan starts below but ends beyond T, it will still read the compacted data, although all of this data might not yet be cached when the scan starts. But as the scan progresses, so will T, in parallel, as the compacted data is streamed into the region server, and thus, this scan will most likely be fully covered by the cache as well. Only in the case that the scan overtakes T, accessing keys with a value higher than the current T, will it slow down due to cache misses. If the scan starts and ends beyond T, it will read the current data instead and will also, most likely, be fully covered by the cache. In the case that T overtakes it midway, evicting the blocks it was about to read, it will encounter cache misses. However, since scanning rows from locally cached blocks is faster than fetching blocks from the remote cache, we do not expect or observe this to happen often. In fact, we saw relatively very few cache misses overall in our experiment. Note that in all cases, any given read request is served either entirely from the compacted data or entirely from the current data.

Moreover, note that a region may comprise multiple column families, and each family has its own store file(s). The algorithm iterates over the region's column families, warming them up one at a time. Therefore, when such a region receives a read request covering multiple column families during the incremental warmup, we ensure that a consistent result is returned, since each family is individually read consistently before the results are combined.

Figure 7: Performance Evaluation: YCSB

As a final improvement, we throttle the warmup phase. The result is shown in Figure 6(c). This essentially means that T advances slower than without throttling, and, therefore, the warmup phase lasts longer. However, as a result, the performance overhead of this phase is virtually eliminated. It reduces the CPU costs for the data transfer and reduces the chances of cache misses caused by current data blocks getting overwritten by the new data too quickly. As a result, we see that there is hardly any noticeable impact left from the compaction and warmup phases.

A summary of our YCSB performance evaluation is presented in Figure 7. For each approach, we show the degradation of read latency during the compaction and warmup phases, respectively, as a measure of the percentage difference from the baseline, i.e., the average latency before the compaction started. The important improvements are highlighted in green. We can see that with our best approach, throttled incremental warmup (TIW), the performance degradation of gets is reduced to only 7%/9% (compaction/warmup), while that of scans is reduced to only 20%/4%. The duration of the compaction phase is significantly shortened as well. Although the warmup phase is longer than with simple remote caching (RC), the significantly superior performance of TIW makes up for this.

5.5 TPC-C

We use TPC-C, a standard OLTP benchmark, to evaluate the performance of our proposed approaches. On the back-end, we ran two region servers and one compaction server, while a total of 80 client threads were launched using two front-end nodes. We recorded the average response time of each transaction type, and also measured the tpmC metric (New-Order transactions per minute) averaged over the duration of each experiment.

(a) New-Order  (b) Stock-Level
Figure 8: Performance Evaluation: TPC-C

In order to observe the adverse impacts of compactions on the standard TPC-C workload, we triggered compactions on the two most heavily updated tables, Stock and Order-Line, in two separate sets of experiments, respectively.

In the first set, we observed the performance of New-Order, which is a short, read-write transaction. Since it reads the Stock table, it is impacted by compactions running on this table. Figure 8(a) shows the effects of this impact under the standard setup (SS) and the improvements offered by each of our main approaches. We can see that with throttled incremental warmup (TIW), the degradation in the average response time of New-Order transactions (against the baseline) is significantly reduced in both the compaction and warmup phases. The duration of the compaction phase is also considerably shortened. The warmup phase is shortest when using simple remote caching (RC). Overall, our best approach, TIW, provides an improvement of nearly 11% in terms of the tpmC metric.

In the second set, we observed the longer-running Stock-Level transaction. Since it reads the Order-Line table, it was impacted by compactions running on this table. In Figure 8(b), we see the performance improvement provided by each of our approaches. While the response time is only slightly better in the compaction phase, its duration is cut down considerably by offloading the compaction. Once again, a significant reduction in response time degradation is seen with our incremental warmup (TIW) approach, even though the warmup duration stays nearly the same.

6. SCALABILITY

By using a compaction manager that oversees the execution of compactions on all compaction servers, we can scale our approach in a similar manner as HBase can scale to as many region servers as needed. In fact, since HBase partitions its data into regions, we conveniently use the same partitioning scheme for our purposes. Thus, the distributed design of our solution inherits the elasticity and load distribution qualities of HBase.

6.1 Elasticity

For application workloads that fluctuate over time, HBase offers the ability to add or remove region servers as the need arises. Along the same lines, our compaction manager is able to handle many compaction servers at the same time. It uses the same ZooKeeper-based mechanism as HBase for managing the meta-information needed for mappings between regions and compaction servers.

6.2 Load Distribution

As the application dataset grows, HBase creates new regions and distributes them across the region servers. Our compaction manager automatically detects these new regions and assigns them to the available compaction servers. We inherit the modular design of HBase, which allows us to plug in custom load balancing algorithms as required. We currently use a simple round-robin strategy for distributing regions across compaction servers. However, we can envision more complex algorithms that balance regions dynamically based on the current CPU and memory loads of compaction servers; we publish these metrics over the same interface that HBase uses for its other components.

6.3 Compaction Scheduling

Scheduling compactions is an interesting problem. Currently, we let the region server schedule its own compactions based on its default exploring algorithm (see Section 2.2.1). However, our design allows for the compaction manager to perform compaction scheduling based on its dynamic, global view of the loads being handled by compaction servers.

An important parameter is how many compactions a compaction server can handle concurrently. As we use its main memory as a remote cache, the sum of the compacted data of all regions it is concurrently compacting should not be larger than the server's memory. A rough estimate of this limit can be calculated as follows. Given an estimate of the rate c (in bytes/s) at which a compaction server can read and compact data, and an estimate of the rate w (in bytes/s) at which the compacted data is transferred back to the region server over the network (with throttling), we can calculate the duration, D(b) (in seconds), of a compaction as a function of its size, b (in bytes): D(b) = b/c + b/w. Moreover, a compaction server with m bytes of main memory cache at its disposal can handle l compactions of average size b at a time, where l = ⌊m/b⌋. Thus, one compaction server will have the capacity to compact up to h regions of average size b per interval of t seconds, where h = (t/D(b)) · ⌊m/b⌋.
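The sizing arithmetic in this section can be sketched as a short calculation. This is an illustration only; the function name is ours, and the exact byte unit (decimal vs. binary megabytes) is an assumption, as only the ratios matter here.

```python
import math

def compaction_servers_needed(c, w, m, b, t, T, R):
    """Rough provisioning estimate for compaction servers (sketch).

    c: read+compact rate (bytes/s), w: transfer-back rate (bytes/s),
    m: compaction server cache size (bytes), b: average compaction size
    (bytes), t: interval length (s), T: compactions per region per
    interval, R: number of regions in the dataset.
    """
    D = b / c + b / w                 # duration of one compaction: D(b) = b/c + b/w
    l = m // b                        # concurrent compactions fitting in cache
    h = (t / D) * l                   # regions compactable per interval of t seconds
    regions_per_server = math.floor(h / T)
    return math.ceil(R / regions_per_server)  # C = ceil(R / floor(h/T))

# Example parameters from this section: c = 20 MB/s, w = 8 MB/s, m = 6 GB,
# b = 4 GB, T = 1 compaction/region/hour, R = 10 regions.
MB = 10**6
servers = compaction_servers_needed(c=20 * MB, w=8 * MB, m=6000 * MB,
                                    b=4000 * MB, t=3600, T=1, R=10)
print(servers)  # → 2
```

With these numbers, D(b) = 700 s, l = 1, and h ≈ 5.1, so one server covers five regions per hour and two servers suffice for ten regions, matching the example worked out in the text.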
Therefore, given an update workload that triggers T compactions per region per interval of t seconds, we can assign up to ⌊h/T⌋ regions per compaction server. And, for a dataset of R regions, we will need to provision at least C = ⌈R/⌊h/T⌋⌉ of these compaction servers for the given application dataset size and workload. For example, consider a setup with the following parameters: c = 20 MB/s, w = 8 MB/s, m = 6 GB, b = 4 GB, T = 1/hour, and R = 10 regions. This gives us C = 2 compaction servers.

(a) Standard Setup: 5 RS  (b) Compaction Offloading: 5 RS / 1 CS  (c) Compaction Offloading: 10 RS / 1 CS  (d) Compaction Offloading: 10 RS / 2 CS
Figure 9: Performance Evaluation: Scalability

6.4 Performance Evaluation

Using YCSB, we demonstrate the scalability of our solution by scaling our setup from five region servers up to ten. The five-node setup served 10 million rows split into five regions supported by one compaction server. We launched two read/write clients (with 40 read threads and two write threads each). The ten-node setup doubled both the dataset size and workload; i.e., 20 million rows split into ten regions, stressed with four read/write clients. At first, we provisioned only one compaction server, overloading it beyond its maximum capacity. Next, we ran the same experiment with two compaction servers to demonstrate the capability of our architecture to effectively distribute the load between the two servers. The experiments run for four hours; multiple major compactions are triggered in this duration.

Figures 9(a) to 9(d) show the results. Figure 9(a) shows the average response time of reads over the four-hour period on the five region servers (no compaction servers). We can see the same latency spikes as in our smaller scale experiments where compactions were not offloaded. Figure 9(b) shows the five-node setup with one compaction server, which can handle the compactions triggered by all five region servers, eliminating the performance overhead seen in the standard setup. In Figure 9(c), ten region servers are served by a single compaction server. In this case, the compaction server becomes overloaded. As our compaction server only has enough main memory cache available (6 GB) to compact a single region's data (4 GB) at a time, we cannot allow several compactions to run concurrently. Thus, compactions are delayed, and read performance on the region servers gets increasingly worse, as more store files are created that have to be scanned by reads, and the region servers start running out of block cache space as well. Finally, we can observe in Figure 9(d) that with two compaction servers, we can handle the compaction load of ten region servers comfortably, and response times remain smooth over the entire execution.

7. FAULT-TOLERANCE

Our approach offers an efficient solution for offloading compactions while ensuring their correct execution even when components fail. This section addresses several important failure cases and discusses the fault-tolerance of our solution.

7.1 Compaction Server Failure

When the compaction manager detects that a compaction server has failed, it reassigns its regions to another available compaction server. A compaction server can be in one of three states at the time of failure: idle, compacting some region(s), or transferring compacted data back to the region server(s). If it was performing a compaction, then its failure will cause a remote exception on the region server and the compaction will be aborted. Note that no actual data loss occurs, since the compaction server was writing to a temporary file, and the region server does not switch over to the compacted file until the compaction has completed. The region server can retry the compaction and it will be assigned to another compaction server. If no compaction servers are available, then the region server can simply perform the compaction itself.

If the compaction server was in the process of transferring a compacted file back to the region server when the failure occurs, this will also cause a remote exception on the other end. In the case of incremental warmup, some requests will already have started reading the partially transferred compacted data. Therefore, the region server needs to finalize loading the compacted data, which it can do by simply reading the store files from the file system instead, as the compaction server completed writing the new store files to HDFS before beginning the transfer to the region server. However, since the remaining portion of the compacted data now needs to be fetched from HDFS, read performance might suffer during the remainder of the warmup phase (as under the standard setup).

7.2 Compaction Manager Failure

In our current implementation, in order to offload a compaction, the region server must go through the compaction manager to be forwarded to the compaction server that will handle the compaction. Thus, the compaction manager becomes a single point of failure in our setup. However, this is only an implementation issue. Since we use ZooKeeper for maintaining the compaction servers' region assignments, our design offers a reliable way for a region server to contact a compaction server directly. As with the HBase Master, if the compaction manager fails, we lose the ability to add or remove compaction servers and assign regions, so it would need to be restarted as soon as possible to resume these functions. However, ongoing compactions are not affected, since the region server and compaction server communicate directly once connected.

7.3 Region Server Failure

If a region server fails while waiting for an offloaded compaction to return, the compaction server detects the disconnection in the communication channel via a timeout, and the compaction is aborted. Once the HBase Master has assigned the affected regions to another region server, they can simply retry the compaction and it will be handled by the compaction server as a new compaction request. If a region server fails during the incremental warmup phase, the new parent region server must ensure that it loads only the newly compacted file(s) from HDFS, and not any of the older files, which should be discarded at this point. Although we currently do not handle this failure case, we intend to implement a simple solution for it by modifying the file names prior to initiating the incremental warmup. In this way, if the region is reopened by another region server, it can detect which files in the region's HDFS directory can be discarded due to being superseded by the newer compacted files.
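The retry-and-fallback behavior of Section 7.1 can be sketched as follows. This is a hypothetical illustration of the protocol, not the HBase-integrated code: the class and function names are our own, and `ConnectionError` stands in for the remote exception raised when a compaction server fails mid-compaction.

```python
# Sketch of the region server's offloading fallback: on a remote failure,
# retry the compaction on another compaction server; if none are available,
# perform the compaction locally. No data is lost on failure, since the
# remote server only ever writes to a temporary file.

class NoCompactionServer(Exception):
    """Raised by the manager when no compaction server is available."""

def run_compaction(region, manager, local_compact):
    """Offload a compaction, retrying across servers, with local fallback."""
    while True:
        try:
            server = manager.next_server(region)
        except NoCompactionServer:
            return local_compact(region)   # no servers left: compact locally
        try:
            return server.compact(region)  # remote failure aborts cleanly;
        except ConnectionError:            # the temporary file is discarded
            continue                       # retry on another server
```

A small stubbed run shows both paths: a failing server followed by a healthy one yields a remote compaction, while an empty server list falls back to local compaction.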
8. RELATED WORK

The number of scalable key-value stores, as well as more advanced datastores providing more complex data models and transaction consistency, has increased very quickly over the last decade [2, 3, 5, 7, 10, 12]. Many of these datastores rely on creating multiple values/versions of data items rather than applying updates in-place, in order to handle high write throughput requirements. However, read performance can be severely affected as finding the right data version for a given query takes increasingly longer over time. Therefore, compactions are a fundamental feature of these datastores, helping to regularly clean up expired versions, and thus keep read performance at acceptable levels.

Various types of compaction algorithms exist. In order to make compactions more efficient, these algorithms generally attempt to limit the amount of data processed per compaction by selecting files in a way that avoids the repetitive re-compaction of older data as much as possible. For instance, tiered compactions were first used by Bigtable and also adopted by Cassandra. Rather than selecting a random set of store files to compact, this algorithm selects only a fixed number (usually four) of store files at a time, picking files that are all around the same size. One effect of this mechanism is that larger store files may be compacted less frequently, thereby reducing the total amount of I/O taken up by compactions over time. The leveled compactions algorithm was introduced in LevelDB^9 and was recently also implemented in Cassandra. The aim of this algorithm is to remove the need for searching through multiple store files to answer a read request. The algorithm achieves this goal simply by preventing updated values of a given row from ending up across multiple store files in the first place. The overall I/O load of leveled compactions is significantly larger than standard compactions; however, the compactions themselves are small and quick, and so tend to be much less disruptive to the datastore's runtime performance over time. On the other hand, if the cluster is already I/O-constrained, or if the workload is very update-intensive (e.g., time series), then leveled compactions become counter-productive. Striped compactions^10, a variation of leveled compactions, have been prototyped for HBase as an improvement over its current algorithm. Yet another variation is implemented in bLSM [15], which presents a solution for fully amortizing the cost of compactions into the workload by dynamically balancing the rate at which the existing data is being compacted with the rate of incoming updates. In our approach, we take the compaction approach itself as a black box. In fact, all but the incremental warmup approach do not care at all about the actual content of the store files. The incremental warmup approach needs rows to be sorted in key order, but is also independent of the compaction algorithm.

Other data structures that perform periodic data maintenance operations in the same vein as the LSMT include R-trees [11] and differential files [16]. As with LSMT datastores, updates are initially written to some short-term storage layer, and subsequently consolidated into the underlying long-term storage layer via periodic merge operations, thus bridging the gap between OLTP and OLAP functionality. SAP HANA [17] is a major in-memory database that falls in this category. A merge in HANA is a resource-intensive operation performed entirely in-memory. Thus, the server must have enough memory to simultaneously hold the current and compacted data. In principle, our incremental warmup algorithm offers the same performance benefits as a fully in-memory solution, while requiring half the memory.

Both computation offloading as well as smart cache management are well-known techniques in many distributed systems. But we are not aware of any other approach that considers offloading compactions with the aim of relieving the query processing server of the added CPU and memory load. However, the concept of separating different tasks that need to work on the same data is prevalent in replication-based approaches, which affords an opportunity to run different kinds of workloads simultaneously on different copies of the data. As long as potential data conflicts are efficiently handled, this has the advantage that the different workloads do not interfere with each other. For instance, in approaches that use primary copy replication, update transactions are executed on the primary site only, while the other copies are read-only. In the Ganymed system [14], for instance, the various read-only copies are used for various types of read-only queries, while the primary copy is dedicated to update transactions. In a similar spirit, we separate compactions from standard transaction processing to minimize interference between these two tasks.

^9 http://leveldb.googlecode.com/svn/trunk/doc/impl.html
^10 https://issues.apache.org/jira/browse/HBASE-7667

Techniques for the live migration of virtual machines, such as [4, 18], deal with transferring a machine's state and data to another machine and switching over to it without drastically affecting the workload being served. Similarly, techniques for live database migration deal with efficiently transferring the contents of the cache [6] and potentially the disk as well [8]. Thus, similar data transfer considerations arise as for compaction offloading. However, in these migration approaches, one generally does not need to consider the interference between two workloads (i.e., the query processing and the offloaded compaction, in our case).

9. CONCLUSIONS

In this paper, we took a fresh approach to compactions in HBase. Our primary goal was to eliminate the negative performance impacts of compactions under update-intensive OLTP workloads, particularly with regard to read performance. We proposed offloading major compactions from the region server to a dedicated compaction server. This allows us to fully utilize the region server's resources towards serving the actual workload. We also use the compaction server as a remote cache, since it already holds the freshly compacted data in its main memory. The region server fetches these blocks over the network rather than from its local disk. Finally, we proposed an efficient incremental warmup algorithm, which smoothly transitions from the current data in the region server's cache to the compacted data fetched from the remote cache. With YCSB and TPC-C, we showed that this last approach was able to eliminate virtually all compaction-related performance overheads. Finally, we demonstrated that our system can scale by adding more compaction servers as needed.

For future work, we would like to make the compaction manager more aware of the load balancing requirements of regions, region servers, and compaction servers. If one compaction server is assigned more regions than it can handle effectively, the compaction manager should re-balance regions accordingly among the available compaction servers, while taking into consideration their current respective loads.

10. ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for useful feedback to improve this paper. This work was partially funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Ministère de l'Enseignement supérieur, Recherche, Science et Technologie, Québec, Canada (MESRST).

11. REFERENCES

[1] A. S. Aiyer, M. Bautin, G. J. Chen, P. Damania, P. Khemani, K. Muthukkaruppan, K. Ranganathan, N. Spiegelberg, L. Tang, and M. Vaidya. Storage infrastructure behind Facebook Messages: Using HBase at scale. IEEE Data Eng. Bull., 35(2):4–13, 2012.
[2] J. Baker, C. Bond, J. C. Corbett, J. J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing scalable, highly available storage for interactive services. In CIDR, pages 223–234, 2011.
[3]
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data. In OSDI, pages 205–218, 2006.
[4] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live migration of virtual machines. In NSDI, 2005.
[5] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. C. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's globally-distributed database. In OSDI, pages 261–264, 2012.
[6] S. Das, S. Nishimura, D. Agrawal, and A. El Abbadi. Albatross: Lightweight elasticity in shared storage databases for the cloud using live data migration. PVLDB, 4(8):494–505, 2011.
[7] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In SOSP, pages 205–220, 2007.
[8] A. J. Elmore, S. Das, D. Agrawal, and A. El Abbadi. Zephyr: Live migration in shared nothing databases for elastic cloud platforms. In SIGMOD, pages 301–312, 2011.
[9] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. In SOSP, pages 29–43, 2003.
[10] A. Gupta, F. Yang, J. Govig, A. Kirsch, K. Chan, K. Lai, S. Wu, S. G. Dhoot, A. R. Kumar, A. Agiwal, S. Bhansali, M. Hong, J. Cameron, M. Siddiqi, D. Jones, J. Shute, A. Gubarev, S. Venkataraman, and D. Agrawal. Mesa: Geo-replicated, near real-time, scalable data warehousing. PVLDB, 7(12):1259–1270, 2014.
[11] C. Kolovson and M. Stonebraker. Indexing techniques for historical databases. In Data Engineering.
[12] A. Lakshman and P. Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35–40, Apr. 2010.
[13] P. E. O'Neil, E. Cheng, D. Gawlick, and E. J. O'Neil. The log-structured merge-tree (LSM-tree). Acta Inf., 33(4):351–385, 1996.
[14] C. Plattner, G. Alonso, and M. T. Özsu. Extending DBMSs with satellite databases. VLDB J., 17(4):657–682, 2008.
[15] R. Sears and R. Ramakrishnan. bLSM: A general purpose log structured merge tree. In SIGMOD, pages 217–228, 2012.
[16] D. G. Severance and G. M. Lohman. Differential files: Their application to the maintenance of large databases. ACM Trans. Database Syst., 1(3):256–267, Sept. 1976.
[17] V. Sikka, F. Färber, W. Lehner, S. K. Cha, T. Peh, and C. Bornhövd. Efficient transaction processing in SAP HANA database: The end of a column store myth. In SIGMOD, pages 731–742, 2012.
[18] T. Wood, P. J. Shenoy, A. Venkataramani, and M. S. Yousif. Black-box and gray-box strategies for virtual machine migration. In NSDI, 2007.