Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases
Sandeep

Uploaded On 2014-11-29

The theory of distributed systems shunned the notion of time and introduced causality tracking as a clean abstraction to reason about concurrency. The practical systems employed physical time (NTP) information but in a best-effort manner due to the [...]



Presentation Transcript

Figure 1: LC and VC timestamping

[...]tions, however, asymmetric routes and network congestion can occasionally cause errors of 100 ms or more. 2) PT has several kinks such as leap seconds [13, 14] and non-monotonic updates to POSIX time [8] which may cause the timestamps to go backwards.

TrueTime (TT). TrueTime was proposed recently by Google for developing Spanner [2], a multiversion distributed database. TT relies on a well-engineered tight clock synchronization available at all nodes thanks to GPS clocks and atomic clocks made available at each cluster. While TT avoids some of the disadvantages of LC/VC/PT, it introduces new disadvantages: 1) TT requires special hardware and a custom-built tight clock synchronization protocol, which is infeasible for many systems (e.g., using leased nodes from public cloud providers). 2) If TT is used for ordering events that respect causality then it is essential that if $e \text{ hb } f$ then $tt.e < tt.f$. Since TT is purely based on clock synchronization of physical clocks, to satisfy this constraint, Spanner delays event $f$ when necessary. Such delays and reduced concurrency are prohibitive, especially under looser clock synchronization.

Figure 2: Not waiting out the uncertainty regions in TT may result in inconsistent snapshots

Hybrid Time (HT). HT, which combines VC and PT clocks, was proposed for solving the stabilizing causal deterministic merge problem [10]. HT maintains a VC at each node which includes knowledge this node has about the PT clocks of other nodes. HT exploits the clock synchronization assumption of PT clocks to trim entries from VC and reduces the overhead of causality tracking. In practice the size of HT at a node would only depend on the number of nodes that communicated with that node within the last $\epsilon$ time, where $\epsilon$ denotes the clock synchronization uncertainty. Recently, Demirbas and Kulkarni [3] explored how HT can be adopted to solve the consistent snapshot problem in Spanner [2].

1.2 Contributions of this work

In this paper we aim to bridge the gap between the theory (LC) and practice (PT) of timekeeping and timestamping in distributed systems and to provide guarantees that generalize and improve those of TT. We present a logical clock version of HT, which we name Hybrid Logical Clocks (HLC). HLC refines both the physical clock (similar to PT and TT) and the logical clock (similar to LC). HLC
maintains its logical clock to be always close to the NTP clock, and hence, HLC can be used in lieu of the physical/NTP clock in several applications such as snapshot reads in distributed key-value stores and databases. Most importantly, HLC preserves the property of logical clocks ($e \text{ hb } f \Rightarrow hlc.e < hlc.f$) and as such HLC can identify and return consistent global snapshots without needing to wait out clock synchronization uncertainties and without needing prior coordination, in a posteriori fashion.

HLC is backwards compatible with NTP, and fits in the 64-bit NTP timestamp format. Moreover, HLC works as a superposition on the NTP protocol (i.e., HLC only reads the physical clocks and does not update them) so HLC can run alongside applications using NTP without any interference. Furthermore, HLC is general and does not require a server-client architecture. HLC works for a peer-to-peer node setup across a WAN deployment, and allows nodes to use different NTP servers (see footnote 2).

In Section 3, we present the HLC algorithm and prove a tight bound on the space requirements of HLC and show that the bound suffices for HLC to capture the LC property for causal reasoning.

Footnote 2: In fact HLC can work with ad hoc clock synchronization protocols [17] and is not bound to NTP.
Initially $l.j := 0$

Send or local event
    $l.j := \max(l.j + 1,\ pt.j)$
    Timestamp with $l.j$

Receive event of message $m$
    $l.j := \max(l.j + 1,\ l.m + 1,\ pt.j)$
    Timestamp with $l.j$

Figure 3: Naive HLC algorithm for node $j$

[...] to take uncoordinated a-posteriori consistent snapshots of the distributed system state.

3.2 Description of the Naive Algorithm

Given the goal that $l.e$ should be close to $pt.e$, in the naive algorithm we begin with the rule: for any event $e$, $l.e \geq pt.e$. We design our algorithm as shown in Figure 3. This algorithm works similarly to LC. Initially all $l$ values are set to 0. When a send event, say $f$, is created on node $j$, we set $l.f$ to be $\max(l.e + 1,\ pt.j)$, where $e$ is the previous event on node $j$. This ensures $l.e < l.f$. It also ensures that $l.f \geq pt.f$. Likewise, when a receive event $f$ is created on node $j$, $l.f$ is set to $\max(l.e + 1,\ l.m + 1,\ pt.j)$, where $l.e$ is the timestamp of the previous event on $j$, and $l.m$ is the timestamp of the message (and, hence, the send event). This ensures that $l.e < l.f$ and $l.m < l.f$.

It is easy to see that the algorithm in Figure 3 satisfies the first two requirements in the problem statement. However, this naive algorithm violates the fourth requirement, which also leads to a violation of the third requirement for bounded space representation. To show the violation of the fourth requirement, we point to the counterexample in Figure 4, which shows how $|l.e - pt.e|$ grows in an unbounded fashion. The messaging loop among nodes 1, 2, and 3 can be repeated forever, and at each turn of the loop the drift between logical clock and physical clock (the $l - pt$ difference) will keep growing.

The root of the unbounded drift problem is that the naive algorithm uses $l$ to maintain both the maximum of $pt$ values seen so far and the logical clock increments from new events (local, send, receive). This makes the clocks lose information: it becomes unclear if the new $l$ value came from $pt$ (as in the message from node 0 to node 1) or from causality (as is the case for the rest of the messages). As such, there is no suitable place to reset the $l$ value to bound the $l - pt$ difference, because resetting $l$ may lead to losing the hb relation, and, hence, a violation of requirement 1.
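The unbounded drift can be reproduced with a short simulation. The sketch below implements the two naive update rules and then runs a hypothetical message ring among nodes 1, 2 and 3, seeded by one message from node 0. The class name `NaiveClock`, the function `ring_drift`, the starting clock values, and the assumption that each node's physical clock advances by exactly one tick per event are all illustrative choices; the exact schedule of Figure 4 is not reproduced here.

```python
class NaiveClock:
    """Naive rule: a single value l carries both pt maxima and causality."""

    def __init__(self):
        self.l = 0

    def send_or_local(self, pt):
        # l.j := max(l.j + 1, pt.j)
        self.l = max(self.l + 1, pt)
        return self.l

    def receive(self, l_m, pt):
        # l.j := max(l.j + 1, l.m + 1, pt.j)
        self.l = max(self.l + 1, l_m + 1, pt)
        return self.l


def ring_drift(rounds):
    """Return l - pt at node 3 after each turn of the 1 -> 2 -> 3 loop."""
    clocks = [NaiveClock() for _ in range(4)]
    pt = [10, 0, 0, 0]  # assumed: node 0 runs ahead of the others
    msg = clocks[0].send_or_local(pt[0])
    drifts = []
    for _ in range(rounds):
        for j in (1, 2, 3):
            pt[j] += 1  # physical clock ticks once before the receive event
            clocks[j].receive(msg, pt[j])
            pt[j] += 1  # and once more before the send event
            msg = clocks[j].send_or_local(pt[j])
        drifts.append(clocks[3].l - pt[3])
    return drifts


print(ring_drift(3))  # [14, 18, 22]
```

In this setup every hop adds 2 to $l$ while each node's physical clock advances by only 2 per full turn of the loop, so the drift grows by 4 per turn and never stops growing.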
Figure 4: Counterexample

Note that the counterexample holds even with the requirement that the physical clock of a node is incremented by at least one between any two events on that node. Figure 4 satisfies this constraint between $pt$ and $l$, yet $|l - pt|$ still keeps growing unboundedly.

However, there are conditions under which the counterexample does not work, and the naive algorithm suffices for solving the HLC problem. If we assume that the time between a send event and the corresponding receive event is long enough that the physical clock of every node is incremented by at least one, then the counterexample in Figure 4 fails, and the naive algorithm would be able to keep $|l - pt|$ bounded.

Instead of depending on assumptions on physical clock rate and event generation rate across all nodes in the system for proving the correctness and boundedness of HLC, we show how to properly implement HLC next.

3.3 HLC Algorithm

"All problems in computer science can be solved by another level of indirection." (David Wheeler)

We use our observations from the counterexample to develop the correct HLC algorithm. In this algorithm, the $l.j$ in the naive algorithm is expanded to two parts: $l.j$ and $c.j$. The first part, $l.j$, is introduced as a level of indirection to maintain the maximum of $pt$ information learned so far, and $c$ is used for capturing causality updates only when $l$ values are equal. In contrast to the naive algorithm, where there was no suitable place to reset $l$ without violating hb, in the HLC algorithm we can reset $c$ when the information [...]

[...] Using Theorem 3, we can show that $|l - pt|$ is bounded.

Corollary 1. For any event $f$, $|l.f - pt.f| \leq \epsilon$.

Proof. We cannot have two events $e$ and $f$ such that $e \text{ hb } f$ and $pt.e > pt.f + \epsilon$ due to clock synchronization constraints. Hence, from Theorem 3, this corollary follows.
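The two-part clock described in Section 3.3 can be sketched as follows. The figure that states the HLC algorithm is not part of this transcript, so the update rules below follow the form in which the HLC rules are commonly stated; treat them as an assumption rather than a copy of the authors' figure. The class name `HLC` and the method names are hypothetical. Physical time is passed in as an integer so the example stays deterministic.

```python
class HLC:
    """Hybrid logical clock: l tracks the max pt seen, c breaks ties."""

    def __init__(self):
        self.l = 0
        self.c = 0

    def send_or_local(self, pt):
        l_old = self.l
        self.l = max(l_old, pt)
        # c counts events only while l is stuck; it resets when pt moves l.
        self.c = self.c + 1 if self.l == l_old else 0
        return (self.l, self.c)

    def receive(self, l_m, c_m, pt):
        l_old = self.l
        self.l = max(l_old, l_m, pt)
        if self.l == l_old and self.l == l_m:
            self.c = max(self.c, c_m) + 1
        elif self.l == l_old:
            self.c += 1
        elif self.l == l_m:
            self.c = c_m + 1
        else:
            self.c = 0  # l came from the local physical clock
        return (self.l, self.c)


a, b = HLC(), HLC()
m = a.send_or_local(pt=10)  # node a's clock is ahead
r = b.receive(*m, pt=1)     # b adopts l = 10 and bumps c
s = b.send_or_local(pt=2)   # l unchanged, so c grows
t = b.send_or_local(pt=11)  # pt overtakes l, so c resets
print(m, r, s, t)           # (10, 0) (10, 1) (10, 2) (11, 0)
```

Timestamps are compared lexicographically on the pair (l, c), which Python tuples do natively. Note that $l$ never exceeds the largest physical clock value seen so far, which is what keeps $l - pt$ bounded under the clock synchronization assumption.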
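Finally, we prove requirement 3 by showing that the $c$ value of HLC is bounded as well. To this end, we extend Theorem 3 to identify the relation of $c$ and events created at a particular time. As we show in Theorem 4, $c.f$ captures information regarding events created at time $l.f$.

Theorem 4. For any event $f$,

$c.f = k \wedge k > 0 \Rightarrow (\exists\, g_1, g_2, \ldots, g_k : (\forall j : 1 \leq j < k : g_j \text{ hb } g_{j+1}) \wedge (\forall j : 1 \leq j \leq k : l.(g_j) = l.f) \wedge g_k \text{ hb } f)$

Proof. We prove this by induction. This is trivially satisfied in the initial state. Also, if $c.f$ is set to 0 then this statement is trivially satisfied. In the creation of a send event, $c.f$ is set to $c.e + 1$ only if $l.e$ equals $l.f$. By induction, there exists a sequence of length $c.e$ that satisfies the statement of the theorem. Moreover, $e \text{ hb } f$ and $\neg(e \text{ hb } e)$. Hence, there exists a sequence of length $c.e + 1$ ($= c.f$) that satisfies the statement of the theorem. A similar analysis also applies for the receive event when $c.f$ is set to $c.e + 1$ or $c.m + 1$.

From Theorem 4, the following two corollaries follow.

Corollary 2. For any event $f$, $c.f \leq |\{g : g \text{ hb } f \wedge l.g = l.f\}|$.

Corollary 3. For any event $f$, $c.f \leq N \cdot (\epsilon + 1)$.

Proof. From Corollary 2, for any event $f$, $c.f \leq |\{g : g \text{ hb } f \wedge l.g = l.f\}|$. Also, from Theorem 2, $l.g \geq pt.g$. Also, by the clock synchronization assumption, if $g \text{ hb } f$ then $pt.g \leq pt.f + \epsilon$. Hence, the only events that can fall into the set $\{g : g \text{ hb } f \wedge l.g = l.f\}$ are those that were created when the physical time of the node that created them was in $[l.f,\ l.f + \epsilon]$. By our constraint that the physical clock of a node is incremented by at least one between any two events on that node, there are at most $\epsilon + 1$ such events on any one node. Hence, the corollary follows.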
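As a quick numeric illustration of Corollary 3, the helper below evaluates $N \cdot (\epsilon + 1)$. It assumes that $N$ denotes the number of nodes and that $\epsilon$ is expressed in physical clock ticks (the unit implied by "incremented by at least one between any two events"); the function name `c_bound` and the sample values are hypothetical.

```python
def c_bound(num_nodes: int, epsilon_ticks: int) -> int:
    """Worst-case value of c per Corollary 3: N * (epsilon + 1)."""
    if num_nodes < 1 or epsilon_ticks < 0:
        raise ValueError("need at least one node and a non-negative epsilon")
    # At most epsilon + 1 qualifying events per node, summed over N nodes.
    return num_nodes * (epsilon_ticks + 1)


# Hypothetical deployment: 16 nodes, uncertainty of 10 clock ticks.
print(c_bound(16, 10))  # 176
```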
While the above bound is almost tight, adding a small reasonable assumption can substantially reduce the bound on $c$, thereby reducing the space that needs to be allocated for it.

Assumption to reduce the bound on $c$ further: We assume that the time for message transmission is long enough so that the physical clock of every node is incremented by at least $d$, where $d$ is a given parameter.

Now, consider the situation where $c.f = k$, $k > 0$, at node $j$. From the above assumption and from Theorem 4, we have a sequence of $k$ events $g_1, g_2, \ldots, g_k$ that satisfy the conditions in Theorem 4. In other words, $l.(g_1) = l.f$. Let $l$ denote the node where $g_1$ was created. Hence, when $g_1$ was created, $pt.l$ was at least equal to $l.f$. By the assumption about clock synchronization, when $f$ is created $pt.l$ is at least $l.f + (k - 1) \cdot d$. Given clock synchronization constraints, this must be less than $pt.f + \epsilon$. Simplifying this, $k$ is less than $\epsilon/d + 1 + (pt.f - l.f)$. From Theorem 2, we have

Corollary 4. Under the assumption made above, $c.f$ is at most $\epsilon/d + 1$.

Recall that for $d \geq 1$, the counterexample in Figure 4 does not hold, and the naive algorithm would become boundable and also satisfy the HLC requirements. The difference between the HLC algorithm and the naive algorithm is that the HLC algorithm did not need this assumption to show that it is bounded, but only to reduce the size of the bound.

3.4 Properties of HLC

The HLC algorithm is designed for arbitrary distributed architectures and is also readily applicable to other environments such as the client-server model.

We intentionally chose to implement HLC as a superposition on NTP. In other words, HLC only reads the physical clock but does not update it. Hence, if a node receives a message whose timestamp is higher, we maintain this information via $l$ and $c$ instead of changing the physical clock. This is crucial in ensuring that other programs that use NTP alone are not affected. This also avoids the potential problem where clocks of nodes are synchronized with each other even though they drift substantially from the real wall-clock. Furthermore, there are impossibility results showing that accepting even tiny unsynchronization to adjust the clocks can lead to diverging clocks [6]. Finally, while HLC utilizes NTP for synchronization, it does not depend on it. In particular, even when physical clocks utilize any ad hoc clock synchronization algorithm [17], [...]
[...] (a couple hours) to synchronize, we get lower NTP offset values. We used "ntpdc -c loopinfo" and "ntpdc -c kerninfo" calls to obtain the NTP offset information at the nodes.

Using 4 m1.xlarge nodes

    c    offset = 5 ms    offset = 1.5 ms
    0    83.90%           83.66%
    1    12.12%           12.03%
    2     3.37%            4.09%
    3     0.24%            0.21%

The experiments with 4 nodes show that the value of $c$ remains very low, less than 4. This is a much lower bound than the worst-case theoretical bound we proved in Section 3. We also see that the improved NTP synchronization helps move the $c$ distribution toward lower values, but this effect becomes more visible in the 8 and 16 node experiments. With the looser NTP synchronization, with average offset 5 ms, the maximum $l - pt$ difference was observed to be 21.7 ms. The 90th percentile of $l - pt$ values corresponds to 7.8 ms, with their average value computed to be 0.2 ms. With the tighter NTP synchronization, with average offset 1.5 ms, the maximum $l - pt$ difference was observed to be 20.3 ms. The 90th percentile of $l - pt$ values corresponds to 8.1 ms, with their average value computed to be 0.2 ms.

Using 8 m1.xlarge nodes

    c    offset = 9 ms    offset = 3 ms
    0    65.56%           91.18%
    1    15.39%            8.82%
    2     8.14%            0%
    3     5.90%
    4     2.74%
    5     1.39%
    6     0.56%
    7     0.20%
    8     0.08%
    9     0.03%

The experiments with 8 nodes highlight the lowered $c$ values due to improved NTP synchronization. For the experiments with average NTP offset 9 ms, the maximum $l - pt$ difference was observed to be 107.9 ms. The 90th percentile of $l - pt$ values corresponds to 41.4 ms, with their average value computed to be 4.2 ms. For the experiments with average NTP offset 3 ms, the maximum $l - pt$ difference was observed to be 7.4 ms. The 90th percentile of $l - pt$ values corresponds to 0.1 ms, with their average value computed to be 0 ms.

Using 16 m1.xlarge nodes

    c    offset = 16 ms   offset = 6 ms
    0    66.96%           75.43%
    1    19.40%           18.51%
    2     7.50%            3.83%
    3     4.59%            1.84%
    4     1.76%            0.32%
    5     0.61%            0.06%
    6     0.14%            0.01%
    7     0.02%
The 16 node experiments also showed very low $c$ values despite all nodes sending to each other at practically the wire speed. For the experiments with average NTP offset 16 ms, the maximum $l - pt$ difference was observed to be 90.5 ms. The 90th percentile of $l - pt$ values corresponds to 25.2 ms, with their average value computed to be 2.3 ms. For the experiments with average NTP offset 6 ms, the maximum $l - pt$ difference was observed to be 46.8 ms. The 90th percentile of $l - pt$ values corresponds to 8.4 ms, with their average value computed to be 0.3 ms.

WAN deployment results. We deployed our HLC testing experiments on a WAN environment as well. Specifically, we used 4 m1.xlarge instances, each one located at a different AWS region: Ireland, US East, US West and Tokyo. Our results show that with 3 ms NTP offset, the $c = 0$ values constitute about 95% of the cases and $c = 1$ constitute the remaining 5%. These values are much lower than the corresponding values for the single datacenter deployment. The maximum $l - pt$ difference remained extremely low, about 0.02 ms, and the 90th percentile of $l - pt$ values corresponded to 0. These values are again much lower than the corresponding values for the single datacenter deployment.

The reason for seeing very low $l - pt$ and $c$ values in the WAN deployment is that the message communication delays across the WAN are much larger than $\epsilon$, the clock synchronization uncertainty. As a result, when a message is received, its $l$ timestamp is already in the past and is smaller than the $l$ value at the receiver, which is updated by its $pt$. Since the single cluster deployment with short message delays is the most demanding scenario in terms of HLC testing, we focused on those results in our presentation.

5.2 Stress testing and resilience evaluation in simulation

To further analyze the resiliency of HLC, we evaluated it in scenarios where it will be stressed, e.g., where the event rate is too high and where the clock synchronization [...]
6.2 Compact Timestamping using l and c

NTP uses 64-bit timestamps which consist of a 32-bit part for seconds and a 32-bit part for fractional seconds. (This gives a time scale that rolls over every $2^{32}$ seconds, about 136 years, and a theoretical resolution of $2^{-32}$ seconds, about 233 picoseconds.) Using a single 64-bit timestamp to represent HLC is also very desirable for backwards compatibility with NTP clocks. Being backwards compatible with NTP clocks is important because many distributed database systems and distributed key-value stores use NTP clocks to timestamp and compare records.

There are, however, several challenges for representing HLC as a single 64-bit timestamp. Firstly, the HLC algorithm maintains $l$ and $c$ separately, to differentiate between increases due to the physical clock versus send/receive/local events. Secondly, by tracking the $pt$, the size of $l$ is by default 64 bits, like the NTP timestamps.

We propose the following scheme for combining $l$ and $c$ and storing them in a single 64-bit timestamp. This scheme involves restricting $l$ to track only the most significant 48 bits of $pt$ in the HLC algorithm presented in Figure 5. Rounding up $pt$ values to 48-bit $l$ values still gives us microsecond granularity tracking of $pt$. Given NTP synchronization levels, this is sufficient granularity to represent NTP time. The way we round up $pt$ is to always take the ceiling to the 48th bit. In the HLC algorithm in Figure 5, $l$ is updated similarly but is done for 48 bits. When the $l$ values remain unchanged in an event, we capture that by incrementing $c$ following the HLC algorithm in Figure 5. 16 bits remain for $c$, which allows it to take 65536 distinct values; this is more than enough, as we show in our experiments in Section 5.

Using this compact representation, if we need to timestamp (a message or a data item for database storage), we will concatenate $c$ to $l$ to create the HLC timestamp. The distributed consistent snapshot finding algorithm described above is unaffected by this change to the compact representation. The only adjustment to be made is to round up the query time $t$ to 48 bits as well.

6.3 Other related work

Dynamo [23] adopts VC as version vectors for causality tracking of updates to the replicas. Cassandra uses PT and the LWW-rule for updating replicas.

Spanner [2] employs TT to order distributed transactions at global scale, and to facilitate read snapshots across the distributed database. In order to ensure $e \text{ hb } f \Rightarrow tt.e < tt.f$ and provide consistent snapshots, Spanner requires waiting out uncertainty intervals of TT at the transaction commit time, which restricts throughput on writes. However, these "commit-waits" also enable Spanner to provide a stronger property, external consistency (a.k.a. strict serializability): if a transaction $t_1$ commits (in absolute time) before another transaction $t_2$ starts, then $t_1$'s assigned commit timestamp is smaller than $t_2$'s.

HLC does not require waiting out the clock uncertainty, since it is able to record causality relations within this uncertainty interval using the HLC update rules. HLC can also be adopted for providing external consistency while still keeping the throughput on writes unrestricted, by introducing a client-notification-wait after a transaction ends.

An alternate approach for ordering events is to establish an explicit relation between events. This approach is exemplified in the Kronos system [5], where each event of interest is registered with the Kronos service, and the application explicitly identifies events that are of interest from a causality perspective. This allows one to capture causality that is application-dependent at the increased cost of searching the event dependency relation graph. By contrast, LC/VC/PT/HLC assume that if a node performs two consecutive events then the second event causally depends upon the first one. Thus, the ordering is based solely on the timestamps assigned to the events.

7 Conclusion

In this paper, we introduced hybrid logical clocks (HLC), which combine the benefits of logical clocks (LC) and physical time (PT) while overcoming their shortcomings. HLC guarantees that (one-way) causal information is captured, and hence, it can be used in place of LC. Since HLC provides nodes a logical time that is within possible clock drift of PT, HLC is substitutable for PT in any application that requires it. HLC is strictly monotonic and, hence, can be used in place of PT in applications in order to tolerate NTP kinks such as non-monotonic updates.

HLC can be implemented using 64 bits of space, and is backwards compatible with NTP clocks. Moreover, HLC only reads NTP clock values but does not change them. Hence, applications using HLC do not affect other applications that only rely on NTP.

HLC is highly resilient. Since its space requirement is bounded by theoretical analysis and is shown to be even more tightly bounded by our experiments, we use this as a foundation to design stabilizing fault tolerance for HLC.

Since HLC refines LC, HLC can be used to obtain a consistent snapshot for a snapshot read. Moreover, since [...]

[23] W. Vogels. Eventually consistent. Communications of the ACM, 52(1):40–44, 2009.
[24] Z. Wu, M. Butkiewicz, D. Perkins, E. Katz-Bassett, and H. Madhyastha. Spanstore: Cost-effective geo-replicated storage spanning multiple cloud services. In SOSP, pages 292–308, 2013.
[25] Y. Zhang, R. Power, S. Zhou, Y. Sovran, M. Aguilera, and J. Li. Transaction chains: Achieving serializability with low latency in geo-distributed storage systems. In SOSP, pages 276–291, 2013.
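The compact representation of Section 6.2 can be sketched with plain integer arithmetic. The sketch below assumes that "taking the ceiling to the 48th bit" means rounding a 64-bit NTP timestamp up to the next multiple of $2^{16}$ and keeping the top 48 bits, and that $c$ occupies the low 16 bits of the packed value. The names `round_up_pt`, `pack` and `unpack` are hypothetical, and overflow of the 48-bit field near the NTP era boundary is only rejected, not handled.

```python
C_BITS = 16
C_MASK = (1 << C_BITS) - 1  # 0xFFFF, the largest storable c
L_LIMIT = 1 << 48           # l must fit in 48 bits


def round_up_pt(pt64: int) -> int:
    """Ceiling of a 64-bit NTP timestamp to its most significant 48 bits."""
    return (pt64 + C_MASK) >> C_BITS


def pack(l48: int, c: int) -> int:
    """Concatenate c to l, giving one 64-bit HLC timestamp."""
    if not 0 <= l48 < L_LIMIT:
        raise OverflowError("l does not fit in 48 bits")
    if not 0 <= c <= C_MASK:
        raise OverflowError("c does not fit in 16 bits")
    return (l48 << C_BITS) | c


def unpack(ts: int) -> tuple[int, int]:
    """Split a packed timestamp back into (l, c)."""
    return ts >> C_BITS, ts & C_MASK


pt = 0x0000000100000001      # 1 second plus one fractional unit
l = round_up_pt(pt)          # rounds up past the dropped low bits
ts = pack(l, 3)
print(hex(l), hex(ts), unpack(ts))
```

Because $l$ sits in the high bits, comparing two packed integers orders them by $l$ first and by $c$ second, the same order as comparing (l, c) pairs.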